PyCon Canada - November 2019
Ian Whitestone
>>> df[df.bedrooms == 1].price.median()
$2,200
>>> df[df.bedrooms == 0].price.median()
$1,800
>>> df[df.housing_type == 'basement'].price.median()
$1,500
Inspired by a simple San Francisco apartment posting Slack bot made by Vik Paruchuri...
"Zappa makes it super easy to build and deploy server-less, event-driven Python applications (including, but not limited to, WSGI web apps) on AWS Lambda + API Gateway"
zappa_settings.json
{
    "web_scraper": {
        "project_name": "domi",
        "runtime": "python3.7",
        "s3_bucket": "domi",
        "events": [{
            "function": "main.run_my_program",
            "expression": "rate(1 hour)"
        }]
    }
}
→ zappa deploy web_scraper
Calling deploy for stage web_scraper..
Downloading and installing dependencies..
- psycopg2-binary==2.8.3: Using locally cached manylinux wheel
- sqlite==python3: Using precompiled lambda package 'python3.7'
Packaging project as zip.
Uploading zappa-domi-web-scraper-1569183776.zip (9.5MiB)..
100%|█████████████████████████████████████████| 9.97M/9.97M [00:21<00:00, 528KB/s]
Scheduling..
Scheduled zappa-domi-web-scraper.run with expression rate(1 hour)!
Deployment complete!
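The `events` entry in `zappa_settings.json` points at `main.run_my_program`, which Zappa then invokes every hour. A minimal sketch of what that entry point could look like (the `scrape_listings` helper is hypothetical, not from the talk):

```python
# main.py -- entry point referenced in zappa_settings.json

def scrape_listings():
    # hypothetical helper: fetch and parse new listings
    return []

def run_my_program(event=None, context=None):
    # Zappa calls this on the rate(1 hour) schedule, passing the
    # CloudWatch event dict and the Lambda context object
    listings = scrape_listings()
    return {"scraped": len(listings)}
```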
data = {
    "config": {
        "adData": {
            "price": {"amount": "2000000"},
            "title": "Gorgeous 2 Bed 2 Bath Fully Furnished Executive Condo",
            "description": "Stunning Executive Fully Furnished Lower Penthouse...",
            ...
            "media": [
                {
                    "type": "image",
                    "href": "https://i.ebayimg.com/00/s/NjAwWDgwMA==/z/BTsAAOSwhZhdpb8X/$_59.JPG",
                },
                ...
            ],
            "adLocation": {"latitude": 43.6500917, "longitude": -79.38737379999999},
            "adAttributes": [
                {
                    "machineKey": "numberbedrooms",
                    "machineValue": "2.5",
                    ...
                }
                ...
            ],
            ...
        }
    }
}
price = data["config"]["adData"]["price"]["amount"]
price = int(price) / 1000
latitude = data["config"]["adData"]["adLocation"]["latitude"]
longitude = data["config"]["adData"]["adLocation"]["longitude"]
imgs = [
    media["href"] for media in data["config"]["adData"]["media"]
]
>>> from glom import glom
>>> data = {
    'config': {
        'adData': {
            'price': {'amount': '2000000'}
        },
        ...
    }
}
>>> glom(data, 'config.adData.price.amount')
'2000000'
>>> from glom import glom
>>> data = {
    'config': {
        'adData': {
            'price': {'amount': '2000000'}
        },
        ...
    }
}
>>> glom(data, ('config.adData.price.amount', lambda x: int(x) / 1000))
2000.0
>>> from glom import glom
>>> data = {
    'config': {
        'adData': {
            'price': {'amount': '2000000'},
            'adLocation': {
                'latitude': 43.6500917,
                'longitude': -79.38737379999999
            },
            ...
        }
    }
}
>>> spec = {
    "price": ("config.adData.price.amount", lambda x: int(x)/1000),
    "latitude": "config.adData.adLocation.latitude",
    "longitude": "config.adData.adLocation.longitude"
}
>>> glom(data, spec)
{
    "price": 2000.0,
    "latitude": 43.6500917,
    "longitude": -79.38737379999999
}
SELECT listings.*
FROM listings, user_regions
WHERE
ST_Contains(user_regions.geom, listings.geom)
AND bedrooms >= 1
AND bathrooms >= 1
AND ...
from geoalchemy2 import Geometry
from sqlalchemy import Column, ForeignKey, Integer

class Listing(BASE):
    __tablename__ = "listings"

    id = Column(Integer, primary_key=True)
    geom = Column(Geometry(geometry_type="POINT", srid=4326))
    bedrooms = Column(Integer)

class UserRegion(BASE):
    __tablename__ = "user_regions"

    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey("users.id"))
    geom = Column(Geometry(geometry_type="POLYGON", srid=4326))
from models import Listing, UserRegion, SESSION
from sqlalchemy import and_, func

listings = (
    SESSION.query(
        Listing.id,
        Listing.source,
        Listing.price,
        ...
    )
    .join(
        UserRegion,
        and_(
            UserRegion.user_id == 123,
            func.ST_Contains(UserRegion.geom, Listing.geom),
        ),
    )
    .all()
)
Goal: Get an expected price distribution based on the type of apartment
Theory: Cluster similar listings and use actual price distribution of cluster
Each variable is treated as having the same impact on price (after scaling)
Theory: Linear regression to predict mean, calculate prediction interval to get range of expected values
Calculating the prediction interval relies on the homoscedasticity assumption, which states that the variance around the regression line is the same for all values of the predictor variable.
We can quickly see this does not hold true..
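For reference, the textbook prediction interval for a new point $x_0$ in simple linear regression (standard formula, not from the talk) is:

```latex
\hat{y}_0 \pm t_{\alpha/2,\,n-2}\; s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}
```

The residual standard error $s$ is a single constant for all $x_0$, which is exactly where the homoscedasticity assumption bites: when variance grows with the predictor, one global $s$ under-covers expensive listings and over-covers cheap ones.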
Theory: Quantile regression to predict p25 & p75
Price falls between p25 and p75 --> typical
Price falls below p25 --> low
Price falls above p75 --> high
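The three cases above can be sketched as a tiny classifier (names are illustrative):

```python
def classify_price(price, p25, p75):
    # compare a listing's price against the predicted
    # 25th/75th percentile band for similar listings
    if price < p25:
        return "low"
    if price > p75:
        return "high"
    return "typical"

classify_price(3250, 3175, 4200)  # "typical"
```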
We start with some standard features:
price ~ bedrooms + bathrooms + size + is_furnished + ...
Automatically cluster each point into an "area"
from sklearn.cluster import KMeans
X = df[['lat', 'long']].values
km = KMeans(20, init='k-means++')
km.fit(X)
clusters = km.predict(X) # classify points into 1 of 20 clusters
price ~ bedrooms + bathrooms + size + is_furnished + ... + cluster_0 + cluster_1 + ...
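To get from raw cluster labels to the `cluster_0`, `cluster_1`, ... terms in the formula, the labels need one-hot encoding. A sketch with pandas, where the random lat/long values are toy stand-ins for the real listings dataframe:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# toy stand-in for the listings dataframe
rng = np.random.default_rng(0)
df = pd.DataFrame({"lat": rng.random(100), "long": rng.random(100)})

km = KMeans(20, init="k-means++", n_init=10)
clusters = km.fit_predict(df[["lat", "long"]].values)

# one dummy column per cluster: cluster_0, cluster_1, ...
dummies = pd.get_dummies(clusters, prefix="cluster")
df = pd.concat([df, dummies], axis=1)
```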
Arbitrary boundaries result in similar points being treated differently
annoy (Approximate Nearest Neighbors Oh Yeah)
(can also use scikit-learn)
>>> from annoy import AnnoyIndex
# build the tree
>>> features = ["lat_scaled", "long_scaled", "bedrooms_scaled"]
>>> tree = AnnoyIndex(len(features), "euclidean")
>>> for index, row in df[features].iterrows():
...     tree.add_item(index, row.values)
>>> tree.build(10)
...
# search da tree
>>> apartment_index = 1  # index of apartment to search
>>> tree.get_nns_by_item(apartment_index, 51)  # the point itself + its 50 closest neighbours
[1, 23412, 424, 794, 12, 939, 58, 3, ...]
price ~ bedrooms + bathrooms + size + is_furnished + ... + nn_50_avg_price + ...
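The `nn_50_avg_price` feature is the mean price over each listing's 50 nearest neighbours. A sketch using scikit-learn's `NearestNeighbors` (per the note above that scikit-learn works too), on toy stand-in data:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# toy stand-in for the scaled listing features + price
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "lat_scaled": rng.random(200),
    "long_scaled": rng.random(200),
    "bedrooms_scaled": rng.random(200),
    "price": rng.integers(1000, 4000, 200),
})

features = ["lat_scaled", "long_scaled", "bedrooms_scaled"]

# ask for 51 neighbours: the nearest neighbour of a point is itself
nn = NearestNeighbors(n_neighbors=51).fit(df[features].values)
_, indices = nn.kneighbors(df[features].values)

# drop the self-match in column 0, then average the rest
df["nn_50_avg_price"] = df["price"].values[indices[:, 1:]].mean(axis=1)
```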
Users don't want a black box; otherwise they won't trust it. Give them context!
"$3,250 is normal"
versus
"$3,250 is typical for this type of listing. Listings with the same number of bedrooms, bathrooms and similar square footage and location typically have price ranges between $3,175 and $4,200"
Note the rounding..."price ranges between $3,183.23 and $4,177.69" just seems sketchy
Giving users an easy way to visualize where the price falls also provides additional context
# normal (orange - middle) line
x = (lower, upper)
y = (1, 1)
ax.plot(x, y, linestyle="-", c="#FBBC06", linewidth=5.0, solid_capstyle="round")
# cheap (green - lower) line
x = (min_price, lower - spacing)
y = (1, 1)
ax.plot(x, y, linestyle="-", c="#34A853", linewidth=5.0, solid_capstyle="round")
# expensive (red - upper) line
x = (upper + spacing, max_price)
y = (1, 1)
ax.plot(x, y, linestyle="-", c="#EA4334", linewidth=5.0, solid_capstyle="round")
# upper bound text
ax.text(upper - spacing * 2, 0.9925, f"${upper}", fontsize=10, c="gray")
# lower bound text
ax.text(lower - spacing * 2.75, 0.9925, f"${lower}", fontsize=10, c="gray")
# price marker
ax.plot(
    price,
    1,
    marker="o",
    markersize=12,
    fillstyle="full",
    c="w",
    markeredgewidth=2.5,
    markeredgecolor="#1A73E8",
)
# tooltip triangle marker
ax.plot(
    price,
    1.0038,
    marker="v",
    markersize=7,
    fillstyle="full",
    c="#1A73E8",
    markeredgewidth=0.5,
    markeredgecolor="#1A73E8",
)
# rectangle textbox
rect = patches.FancyBboxPatch(
    xy=(rectangle_start, 1.0045),
    width=rectangle_width,
    height=0.0075,
    edgecolor="#1A73E8",
    facecolor="#1A73E8",
    joinstyle="round",
    capstyle="round",
    boxstyle=patches.BoxStyle("Round", pad=0.000, rounding_size=0),
)
ax.add_patch(rect)
# price rank text
ax.text(
    text_start,
    1.0096,
    f"${display_price} is {price_rank}",
    fontsize=10,
    verticalalignment="top",
    c="w",
    fontweight="bold",
)
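The snippets above all assume a pre-configured `ax` plus a handful of bounds. A minimal setup could look like this (all values are illustrative, not from the talk):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (no display on Lambda)
import matplotlib.pyplot as plt
import matplotlib.patches as patches  # for FancyBboxPatch

lower, upper = 3175, 4200            # predicted p25 / p75
min_price, max_price = 2500, 5000    # illustrative x-axis extent
spacing = (max_price - min_price) * 0.01

fig, ax = plt.subplots(figsize=(6, 1.5))
ax.set_xlim(min_price, max_price)
ax.set_ylim(0.99, 1.02)
ax.axis("off")  # hide spines/ticks for a clean widget look
```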
import statsmodels.formula.api as smf
mod = smf.quantreg('foodexp ~ income', data) # uses patsy model formulas
res = mod.fit(q=.5)
print(res.summary())
1 million requests & 400,000 GB-seconds per month [🙅💸]
Could run a λ with 250MB of RAM for 18.5 days straight..
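The 18.5-day figure checks out (treating 250MB as 0.25GB):

```python
free_tier_gb_seconds = 400_000
ram_gb = 0.25  # the slide's "250MB" Lambda, as GB

seconds = free_tier_gb_seconds / ram_gb  # 1,600,000 seconds of runtime
days = seconds / 86_400                  # ~18.5 days
```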