Serverless Toronto - February 2020
Ian Whitestone
>>> df[df.bedrooms == 1].price.median()
$2,200
>>> df[df.bedrooms == 0].price.median()
$1,800
>>> df[df.housing_type == 'basement'].price.median()
$1,500
Inspired by a simple San Francisco apartment posting Slack bot made by Vik Paruchuri...
...and more
"Run code without thinking about servers. Pay only for the compute time you consume."
Free tier: 1 million requests & 400,000 GB-seconds per month
Could run a λ with 250MB of RAM for 18.5 days straight (400,000 GB-seconds ÷ 0.25 GB = 1,600,000 seconds ≈ 18.5 days)
Use Case: Periodically download some data, save to cloud storage (S3)
# Create virtualenv and install packages
$ pipenv install requests
handler.py
import requests
import yaml
import main
def my_handler(event=None, context=None):
    """Kick off the desired function

    Parameters
    ----------
    event : dict, optional
        AWS Lambda uses this parameter to pass in event data to the handler
    context : LambdaContext, optional
        AWS Lambda uses this parameter to provide runtime information
        to your handler
    """
    main.do_stuff()  # and things
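For context, a minimal sketch of what main.py could look like for this use case (the URL, bucket, and key below are hypothetical, not from the talk):

# main.py (hypothetical sketch)
import boto3
import requests

S3 = boto3.client("s3")  # created once, reused across warm invocations

def do_stuff():
    """Download some data and save it to S3."""
    response = requests.get("https://example.com/data.json")  # hypothetical source
    response.raise_for_status()
    S3.put_object(
        Bucket="my-data-bucket",  # hypothetical bucket
        Key="raw/data.json",
        Body=response.content,
    )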
$ tree
├── Pipfile
├── Pipfile.lock
└── app
    ├── main.py
    └── handler.py
$ pipenv run pip show requests
Name: requests
Version: 2.22.0
Summary: Python HTTP for Humans.
Home-page: http://python-requests.org
Author: Kenneth Reitz
Author-email: me@kennethreitz.org
License: Apache 2.0
Location: /Users/ianwhitestone/.../virtualenvs/.../lib/python3.7/site-packages
Requires: idna, urllib3, certifi, chardet
Required-by: zappa
$ PACKAGES_DIR=/Users/ianwhitestone/.../virtualenvs/.../lib/python3.7/site-packages
$ PROJECT_DIR=$(pwd)
$ cd $PACKAGES_DIR
$ zip -r ${PROJECT_DIR}/deployment-package.zip .
...
$ cd ${PROJECT_DIR}/app
$ zip -r ${PROJECT_DIR}/deployment-package.zip .
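The create-role command below references lambda_trust_policy.json; its contents are just the AssumeRolePolicyDocument echoed back in the command's output:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole"
        }
    ]
}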
$ aws iam create-role \
--role-name lambda_basic_role \
--assume-role-policy-document file://lambda_trust_policy.json
{
"Role": {
"Path": "/",
"RoleName": "lambda_basic_role",
"RoleId": "AROA......",
"Arn": "arn:aws:iam::<account_num>:role/lambda_basic_role",
"CreateDate": "2019-09-22T16:48:43Z",
"AssumeRolePolicyDocument": {
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "lambda.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
}
}
# Give it full access to S3
$ aws iam attach-role-policy \
--role-name lambda_basic_role \
--policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
# And cloudwatch (logs)
$ aws iam attach-role-policy \
--role-name lambda_basic_role \
--policy-arn arn:aws:iam::aws:policy/CloudWatchFullAccess
$ aws lambda create-function \
--function-name download_stuff \
--runtime python3.7 \
--role arn:aws:iam::<account_num>:role/lambda_basic_role \
--handler handler.my_handler \
--zip-file fileb://../deployment-package.zip \
--memory-size 128 \
--timeout 900 # max timeout (15 minutes)
# Run it every hour
aws events put-rule \
--name "RunLambdaFunction" \
--schedule-expression "rate(1 hour)" \
--state "ENABLED"
# Add lambda function as target
aws events put-targets \
--rule "RunLambdaFunction" \
--targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:<account_num>:function:download_stuff"
Test it out...
# Lambda needs read access to every file in the deployment package
$ chmod -R 755 $PACKAGES_DIR
$ chmod -R 755 $PROJECT_DIR
...rebuild our deployment package
$ aws lambda update-function-code \
--function-name download_stuff \
--zip-file fileb://../deployment-package.zip
Try again...
Use Case: Periodically download some data, save to a database (instead of S3)
# Create virtualenv and install packages
$ pipenv install requests
$ pipenv install psycopg2 # new dependency!
...rebuild our deployment package (psycopg2 has compiled C extensions, so the package must now be built for Lambda's Amazon Linux environment)
...update our lambda function
"Zappa makes it super easy to build and deploy server-less, event-driven Python applications (including, but not limited to, WSGI web apps) on AWS Lambda + API Gateway"
Use Case: Periodically download some data, save to database
# Create virtualenv and install packages
$ pipenv install requests
$ pipenv install psycopg2
$ pipenv install zappa # new dependency!
zappa_settings.json
{
"dev": {
"apigateway_enabled": false,
"aws_region": "us-east-1",
"profile_name": "default",
"project_name": "download_stuff",
"runtime": "python3.7",
"s3_bucket": "download_stuff",
"keep_warm": false,
"events": [{
"function": "main.do_stuff",
"expression": "rate(1 hour)"
}]
},
"prod": {
// config for production
}
}
(can be created step by step with zappa init)
β zappa deploy dev
Calling deploy for stage dev..
Downloading and installing dependencies..
- psycopg2-binary==2.8.3: Using locally cached manylinux wheel
- sqlite==python3: Using precompiled lambda package 'python3.7'
Packaging project as zip.
Uploading zappa-cron-test-dev-1569183776.zip (9.5MiB)..
100%|█████████████████████████████████████████| 9.97M/9.97M [00:21<00:00, 528KB/s]
Scheduling..
Scheduled zappa-cron-test-dev-test.run with expression rate(1 minute)!
Deployment complete!
For web apps, all of the above, and:
Easily view logs
# Show all logs
$ zappa tail dev
Calling tail for stage dev..
[1569183806942] Instancing..
[1569183806943] [DEBUG] 2019-09-22T20:23:26.942Z 97e8-d0b23aaf17a0 Zappa Event:
{'time': '2019-09-22T20:23:24Z', 'detail-type': 'Scheduled Event', 'source': 'aws.events',
'region': 'us-east-1', 'detail': {}, 'version': '0',
'resources': ['arn:aws:events:us-east-1:<>:rule/zappa-cron-test-dev-test.run'],
'id': '75265076-af20-30ca-fd1e-b3fcbe478843', 'kwargs': {}}
[1569183806988] hello world!!
[1569183865861] [DEBUG] 2019-09-22T20:24:25.861Z 8064-931e09d761e6 Zappa Event:
{'time': '2019-09-22T20:24:24Z', 'detail-type': 'Scheduled Event', 'source': 'aws.events',
'region': 'us-east-1', 'detail': {}, 'version': '0',
'resources': ['arn:aws:events:us-east-1:<>:rule/zappa-cron-test-dev-test.run'],
'id': '823d2b37-6a85-c162-5084-1906492f4b93', 'kwargs': {}}
[1569183865861] hello world!!
Easily view logs
# Show logs from specific timeframe
$ zappa tail dev --since 1m
# Show logs from specific timeframe and filter
$ zappa tail batch_secondary_us_east_1 --since 1d --filter "ERROR"
Invoke raw commands on lambda for testing (avoid re-deploying)
$ zappa invoke dev "import psycopg2; print('hello')" --raw
Calling invoke for stage dev..
[START] RequestId: e35516da-b71d-4452-9896-e622fe263d1f Version: $LATEST
Instancing..
[DEBUG] 2019-09-22T20:20:09.25Z e622fe263d1f Zappa Event:
{'raw_command': "import psycopg2; print('hello')"}
hello
[END] RequestId: e35516da-b71d-4452-9896-e622fe263d1f
[REPORT] RequestId: e35516da-b71d-4452-9896-e622fe263d1f
Duration: 198.44 ms
Billed Duration: 200 ms
Memory Size: 512 MB
Max Memory Used: 84 MB
Init Duration: 525.29 ms
Keep lambda "warm" with scheduled invocations: the {"keep_warm": true} setting
Handle oversized lambda deployment packages: {"slim_handler": true} unpacks the package into the /tmp directory at runtime
Easy rollbacks: zappa rollback prod -n 1
Easy teardown: zappa undeploy prod
Caveat: not undergoing active development??
"app": {
"app_function": "domi.app.app",
"aws_region": "us-east-1",
"slim_handler": false,
"runtime": "python3.7",
"certificate_arn": "arn:aws:acm:us-east-1:XXXXXX:certificate/XXXXXX",
"domain": "domi.cloud",
"keep_warm": true,
"keep_warm_expression": "cron(0/3 12-4 ? * * *)",
"timeout_seconds": 3,
},
"batch_primary_us_east_1": {
"slim_handler": false,
"keep_warm": false,
"aws_region": "us-east-1",
"runtime": "python3.7",
"events": [
{
"function": "domi.apartments.handlers.get_all_listings",
"expression": "cron(0 */2 * * ? *)"
},
{
"function": "domi.apartments.handlers.process_new_listings",
"expression": "cron(15 */2 * * ? *)"
},
{
"function": "domi.apartments.handlers.check_listing_statuses",
"expression": "cron(15 */2 * * ? *)"
},
],
"timeout_seconds": 900,
}
SELECT listings.*
FROM listings, user_regions
WHERE
ST_Contains(user_regions.geom, listings.geom)
AND bedrooms >= 1
AND bathrooms >= 1
AND ...
from geoalchemy2 import Geometry
from sqlalchemy import Column, ForeignKey, Integer
from sqlalchemy.ext.declarative import declarative_base

BASE = declarative_base()

class Listing(BASE):
    __tablename__ = "listings"

    id = Column(Integer, primary_key=True)
    geom = Column(Geometry(geometry_type="POINT", srid=4326))
    bedrooms = Column(Integer)

class UserRegion(BASE):
    __tablename__ = "user_regions"

    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey("users.id"))
    geom = Column(Geometry(geometry_type="POLYGON", srid=4326))
from models import Listing, UserRegion, SESSION
from sqlalchemy import and_, func

listings = (
    SESSION.query(
        Listing.id,
        Listing.source,
        Listing.price,
        ...
    )
    .join(
        UserRegion,
        and_(
            UserRegion.user_id == 123,
            func.ST_Contains(UserRegion.geom, Listing.geom),
        ),
    )
    .all()
)
Goal: Get an expected price distribution based on the type of apartment
Theory: Cluster similar listings and use actual price distribution of cluster
Each variable is treated as having the same impact on price (after scaling)
Theory: Linear regression to predict mean, calculate prediction interval to get range of expected values
Calculating the prediction interval relies on the homoscedasticity assumption, which states that the variance around the regression line is the same for all values of the predictor variable.
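For reference, the textbook prediction interval for a new point $x_0$ in simple linear regression (standard form, not from the talk):

$$\hat{y}_0 \pm t_{\alpha/2,\,n-2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}$$

Treating the residual standard error $s$ as a single constant for all values of $x$ is exactly the homoscedasticity assumption.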
We can quickly see this does not hold true..
Theory: Quantile regression to predict p25 & p75
Price falls between p25 and p75 --> typical
Price falls below p25 --> low
Price falls above p75 --> high
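As a sketch, the resulting labelling rule is just (function name hypothetical):

def classify_price(price, p25, p75):
    """Label a price relative to its predicted p25-p75 range."""
    if price < p25:
        return "low"
    if price > p75:
        return "high"
    return "typical"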
We start with some standard features:
price ~ bedrooms + bathrooms + size + is_furnished + ...
Automatically cluster each point into an "area"
from sklearn.cluster import KMeans
X = df[['lat', 'long']].values
km = KMeans(20, init='k-means++')
km.fit(X)
clusters = km.predict(X) # classify points into 1 of 20 clusters
price ~ bedrooms + bathrooms + size + is_furnished + ... + cluster_0 + cluster_1 + ...
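To get from cluster labels to the formula above, each cluster becomes an indicator column; a minimal sketch with pandas, assuming the clusters array from the previous snippet:

import pandas as pd

# one-hot encode the cluster labels into cluster_0 ... cluster_19 columns
df["cluster"] = clusters
df = pd.concat([df, pd.get_dummies(df["cluster"], prefix="cluster")], axis=1)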
Arbitrary boundaries result in similar points being treated differently
annoy (Approximate Nearest Neighbors Oh Yeah)
(can also use scikit-learn)
>>> from annoy import AnnoyIndex
# build the tree
>>> features = ["lat_scaled", "long_scaled", "bedrooms_scaled"]
>>> tree = AnnoyIndex(len(features), "euclidean")
>>> for index, row in df[features].iterrows():
...     tree.add_item(index, row.values)
>>> tree.build(10)
...
# search da tree
>>> apartment_index = 1 # index of apartment to search
>>> tree.get_nns_by_item(apartment_index, 51) # get 50 closest points
[1, 23412, 424, 794, 12, 939, 58, 3, ...]
price ~ bedrooms + bathrooms + size + is_furnished + ... + nn_50_avg_price + ...
Add exponentially decayed weighting to each point based on distance
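One way to sketch that weighting with annoy's include_distances option (the helper and decay rate are assumptions, not the talk's exact code):

import numpy as np

def weighted_neighbour_price(tree, df, apartment_index, n=50, decay=1.0):
    """Average neighbour prices, down-weighting points that are further away."""
    ids, distances = tree.get_nns_by_item(apartment_index, n + 1, include_distances=True)
    ids, distances = ids[1:], np.array(distances[1:])  # drop the point itself
    weights = np.exp(-decay * distances)  # exponentially decayed weights
    return np.average(df.loc[ids, "price"].values, weights=weights)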
"price_rank_primary": {
"project_name": "domi",
"slim_handler": true,
"memory_size": 3000,
"apigateway_enabled": false,
"keep_warm": false,
"aws_region": "us-east-1",
"runtime": "python3.7",
"events": [
{
"function": "domi.apartments.price_rank.price_rank",
"expression": "cron(0 */2 * * ? *)"
}
],
"timeout_seconds": 900,
},
Users don't want a black box; otherwise they won't trust it. Give them context!
"$3,250 is normal"
versus
"$3,250 is typical for this type of listing. Listings with the same number of bedrooms, bathrooms and similar square footage and location typically have price ranges between $3,175 and $4,200"
Note the rounding..."price ranges between $3,183.23 and $4,177.69" just seems sketchy
Giving users an easy way to visualize where the price falls also provides additional context
Is everything going as expected?
Application health can be diagnosed by checking the data it produces
expect_column_values_to_not_be_null
expect_column_values_to_match_regex
expect_column_values_to_be_unique
expect_column_values_to_match_strftime_format
expect_table_row_count_to_be_between
expect_column_median_to_be_between
expectations.json
{
"data_asset_name": "yesterdays_craigslist_listings",
"expectation_suite_name": "default",
"expectations": [
{
"expectation_type": "expect_table_row_count_to_be_between",
"kwargs": {
"min_value": 300
}
}
]
}
run_data_checks.py
from domi.db import DB_ENGINE
from great_expectations.dataset import SqlAlchemyDataset

sql_query = """
SELECT id
FROM {tablename}
WHERE TRUE
    AND DATE_TRUNC('day', created_at) = CURRENT_DATE - INTERVAL '1' DAY
"""

new_sql_dataset = SqlAlchemyDataset(custom_sql=sql_query, engine=DB_ENGINE)
validation_results = new_sql_dataset.validate(expectation_suite="expectations.json")
if validation_results["success"]:
    ...
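The talk elides what happens next; one hypothetical way to fill in that branch is to proceed on success and fail loudly otherwise:

if validation_results["success"]:
    process_new_listings()  # hypothetical: kick off the downstream job
else:
    # surfacing the failed expectations makes the logs actionable
    failed = [r for r in validation_results["results"] if not r["success"]]
    raise RuntimeError(f"{len(failed)} data check(s) failed")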
from domi.handlers import process_new_listings
from domi.db import SESSION

# everything instantiated above here is shared across future function invocations
def lambda_handler(event, context):
    process_new_listings()
# Automatically ensure all transactions are successfully committed,
# or rolled back if not
def commit_session(_raise=True):
    if not SESSION:
        return
    try:
        SESSION.commit()
    except Exception:
        SESSION.rollback()
        if _raise:
            raise

def session_committer(func):
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        finally:
            commit_session()
    return wrapper

# Use decorator on any function doing database transactions
@session_committer
def process_new_listings():
    ...
{"slim_handler": true}
saves deployment package to S3/tmp
directory at runtimezip
callback to remove large packages from deployment packages"callbacks": { // Call custom functions during the local Zappa deployment/update process
"settings": "my_app.settings_callback", // After loading the settings
"zip": "my_app.zip_callback", // After creating the package
"post": "my_app.post_callback", // After command has executed
}
"app": {
"app_function": "domi.app.app",
"aws_region": "us-east-1",
"runtime": "python3.7",
"certificate_arn": "arn:aws:acm:us-east-1:XXXXXX:certificate/XXXXXX",
"domain": "domi.cloud",
"keep_warm": true,
"keep_warm_expression": "cron(0/3 12-4 ? * * *)",
"timeout_seconds": 3,
// updated settings
"slim_handler": false,
"regex_excludes": [
"pandas", "scipy", "numpy", "PIL", "statsmodels", "matplotlib"
],
"callbacks": {
"zip": "zappa_package_cleaner.main"
},
},
See blog post for more details.
try:
    # when running locally this will import successfully;
    # when running on lambda, it will fail and fall back to the pre-compiled version
    from annoy import AnnoyIndex
except ImportError:
    from lambda_annoy import AnnoyIndex
In the (hopefully not too distant) future, Zappa will support deployment package creation with Docker.
ianwhitestone.work/AWS-Serverless-Deployments-With-Github-Actions
import statsmodels.formula.api as smf

mod = smf.quantreg('foodexp ~ income', data)  # uses patsy model formulas
res = mod.fit(q=.5)  # q=0.5 --> median regression
print(res.summary())
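The same model object gives the p25/p75 range used earlier by fitting at different quantiles:

res_p25 = mod.fit(q=0.25)  # 25th percentile
res_p75 = mod.fit(q=0.75)  # 75th percentile

# predicted p25-p75 range for each observation
p25_pred = res_p25.predict(data)
p75_pred = res_p75.predict(data)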