Making real-estate liquid with iterative model development

OPENDOOR
August 5th 2015
Ian Wong (@ihat) [email protected]

PRICING ALGORITHMS

What is the fair market value for a home?

PHX TRANSACTIONS IN THE PAST YEAR

Iterate with the right building blocks to reduce error

What’s challenging?

Houses are heterogeneous

Capturing what matters

Sparse observations of the market

Real-world data (read: messy)

Field              Data Source 1      Data Source 2
vendor_id          123                456
street1            6428 BLACK HILL    6428 W BLACK HILL
unit_number        NULL               BEAUTIFUL! MUST SEE!!!!
city               PHOENIX            PHOENIX
living_area        0                  1000
(home attributes)  …                  …

Right address: 6428 W Black Hill Rd, Phoenix, AZ (Who says? USPS)
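To make the reconciliation problem concrete, here is a minimal sketch of merging the two records field by field. The validity rules (a living_area of 0 counts as missing, a unit_number must look like a short unit token) and the helper names are assumptions invented for illustration, not rules from the talk. Street-address normalization is left out, since the slide defers that to USPS.

```python
import re

# Hypothetical validity rule: a unit number is a short alphanumeric token.
UNIT_PATTERN = re.compile(r"^[A-Za-z0-9#\- ]{1,8}$")

def clean_value(field, value):
    """Return the value if it looks usable, otherwise None."""
    if value is None:
        return None
    if field == "living_area":
        return value if value > 0 else None  # assumption: 0 means "unknown"
    if field == "unit_number":
        return value if UNIT_PATTERN.match(value) else None
    return value

def merge_records(records, fields):
    """For each field, take the first usable value across sources, in order."""
    merged = {}
    for field in fields:
        candidates = (clean_value(field, r.get(field)) for r in records)
        merged[field] = next((v for v in candidates if v is not None), None)
    return merged

source_1 = {"vendor_id": 123, "street1": "6428 BLACK HILL", "unit_number": None,
            "city": "PHOENIX", "living_area": 0}
source_2 = {"vendor_id": 456, "street1": "6428 W BLACK HILL",
            "unit_number": "BEAUTIFUL! MUST SEE!!!!", "city": "PHOENIX",
            "living_area": 1000}

merged = merge_records([source_1, source_2], ["unit_number", "city", "living_area"])
```

Under these assumed rules the marketing text in unit_number is rejected and the non-zero living_area from Data Source 2 wins.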

CHALLENGES ERROR

ITERATE

[Diagram: historical transactions as rows of X (features), y (labels), t (time). A new home to be priced has x’ and t’; its y’ is unknown (?).]

[Diagram: X, y and t split along time. X_train, y_train span start t_train to end t_train; X_test, y_test span start t_test to end t_test.]
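The time-based split can be sketched as follows. This is a minimal illustration, assuming each transaction is a dict with keys "x", "y" and "t"; the function name, the row layout and the sample prices are made up for the example.

```python
from datetime import date

def split_by_time(rows, train_range, test_range):
    """Split rows into train and test sets by transaction date (inclusive ranges)."""
    train_start, train_end = train_range
    test_start, test_end = test_range
    if train_end >= test_start:
        raise ValueError("training window must end before the test window starts")
    train = [r for r in rows if train_start <= r["t"] <= train_end]
    test = [r for r in rows if test_start <= r["t"] <= test_end]
    return train, test

# Made-up transactions, for illustration only.
rows = [
    {"x": {"living_area": 1500}, "y": 200_000, "t": date(2013, 3, 1)},
    {"x": {"living_area": 2100}, "y": 310_000, "t": date(2014, 5, 31)},
    {"x": {"living_area": 1800}, "y": 260_000, "t": date(2014, 6, 1)},
    {"x": {"living_area": 1200}, "y": 180_000, "t": date(2015, 7, 4)},
]

train, test = split_by_time(
    rows,
    train_range=(date(2005, 6, 1), date(2014, 5, 31)),
    test_range=(date(2014, 6, 1), date(2015, 6, 1)),
)
```

Keeping the test window strictly after the training window mirrors the pricing problem itself: a new home is always priced using only transactions that happened before it.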

In principle: X, y → Model Trainer → f

In practice: Fetch & transform data → X, y → Model Trainer → f → Validation (against y) → e

config_data_fetch
• Features
• Timestamps
• Label
• Transformation
• Filters
• Imputation

config_trainer
• Hyperparameters
• How to treat features?

config_validation
• Validation set filter
• Error metric definition

How do we improve?

Fetch & transform data (config_data_fetch) → X, y → Model Trainer (config_trainer) → f → Validation (y, config_validation) → e

MEASURE ERROR → TRY SOMETHING NEW → MEASURE ERROR

Something new: Features, Models, Sample selection

Iterative model development
1. It’s all about the error
2. Optimize for iteration speed
3. Let modelers do what they do best
4. Prioritize work based on expected ROI

BUILDING BLOCKS

Model Training Experiments Task Abstraction Feature Generation feature_fns

An experiment ≈ a run of model training
1. Fetches signals X, y
2. Trains a model f
3. Creates predictions ŷ and validation errors e
4. Persists X, y, f, ŷ, e
5. Keeps a record of the above
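The five steps above can be sketched as a single function. This is a toy illustration, not the pipeline from the talk: `run_experiment_sketch`, `fit_mean_model` and `fetch_toy` are hypothetical names, the "model" just predicts the mean training label, and only ŷ and e are written to a local JSON file (the real system persists X, y and f as well).

```python
import json
import statistics
import tempfile
from pathlib import Path

def fit_mean_model(X, y):
    """Stand-in trainer: returns f that predicts the mean training label."""
    mean_y = statistics.fmean(y)
    return lambda rows: [mean_y for _ in rows]

def run_experiment_sketch(fetch, trainer, out_dir):
    # 1. Fetch signals X, y (already split into train and validation).
    X_train, y_train, X_val, y_val = fetch()
    # 2. Train a model f.
    f = trainer(X_train, y_train)
    # 3. Create predictions and validation errors.
    y_hat = f(X_val)
    errors = [pred - actual for pred, actual in zip(y_hat, y_val)]
    mae = statistics.fmean(abs(e) for e in errors)
    # 4-5. Persist predictions and errors as the record of this run.
    record = {"y_hat": y_hat, "errors": errors, "mae": mae, "num_records": len(y_val)}
    path = Path(out_dir) / "experiment.json"
    path.write_text(json.dumps(record))
    return record, path

def fetch_toy():
    """Made-up data: two training rows and one validation row."""
    return [[1000], [2000]], [100.0, 300.0], [[1500]], [220.0]

with tempfile.TemporaryDirectory() as tmp:
    record, path = run_experiment_sketch(fetch_toy, fit_mean_model, tmp)
    saved = json.loads(path.read_text())
```

Because the trainer and the fetch step are passed in as plain callables, swapping in a different model or feature set changes one argument, which is the property that makes iteration fast.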

FETCH SIGNALS FILTER TRANSFORM TRAIN VALIDATE PERSIST

    experiment_config = {
        'training_start_date': '2005-06-01',
        'training_end_date': '2014-05-31',
        'model_config': model_config,
        'target': 'l_close_price',
        'train_filter': train_filter,
        'validation_start_date': '2014-06-01',
        'validation_end_date': '2015-06-01',
        'validation_filter': validation_filter,
    }

    > run_experiment(experiment_config)

    model_config = {
        'model_type': 'linear_regression',
        'features': [
            'living_area',
            'num_bedrooms',
            'num_bathrooms',
            'neighborhood'
        ],
        'feature_types': {
            'living_area': 'numerical',
            'num_bedrooms': 'categorical',
            'num_bathrooms': 'categorical',
            'neighborhood': 'categorical'
        },
        'model_params': {
            'l1': 3e-6,
            'l2': 3e-6,
            'loss': 'squared_loss'
        }
    }

    filter_config = {
        'not_null': ['living_area'],
        'value': {
            'max': {
                'num_bathrooms': 5
            }
        }
    }
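A filter config like this one could be interpreted by a small predicate. The sketch below is an assumption about the semantics, not the talk's implementation: `passes_filter` is a hypothetical helper, the max bound is treated as inclusive, and a missing value is not rejected by a max rule.

```python
def passes_filter(record, filter_config):
    """Return True if the record satisfies every rule in filter_config."""
    for field in filter_config.get("not_null", []):
        if record.get(field) is None:
            return False
    max_rules = filter_config.get("value", {}).get("max", {})
    for field, max_value in max_rules.items():
        value = record.get(field)
        # Assumption: only present values are checked, and the bound is inclusive.
        if value is not None and value > max_value:
            return False
    return True

filter_config = {
    "not_null": ["living_area"],
    "value": {"max": {"num_bathrooms": 5}},
}

# Made-up records, for illustration only.
records = [
    {"living_area": 1500, "num_bathrooms": 2},
    {"living_area": None, "num_bathrooms": 2},
    {"living_area": 3200, "num_bathrooms": 7},
]
kept = [r for r in records if passes_filter(r, filter_config)]
```

Expressing filters as data rather than code means the same config can be stored with the experiment and compared across runs.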

Experiments are first-class citizens

ValuationExperiment
  id                   3
  experiment_config    {…}
  valuation_model_id   10
  batch_signal_id      200
  experiment_type      linear_regression

ValuationModel
  id                   50
  model_config         {…}
  model_s3_slug        …
  model_type           linear_regression

BatchSignal
  id                   50
  start_timestamp      2010-06-01
  end_timestamp        2015-06-01
  signals              […]
  s3_slug              …

ValuationError
  id                   50
  errorable_id         {…}
  mae                  …
  (other eval stats)   …
  num_records          …
  time_range           …
  predictions_s3_slug  …

A quick story about

Model Training Experiments Task Abstraction Feature Generation feature_fns

[Diagram: historical transactions as rows of X (features), y (labels), t (time). A new home to be priced has x’ and t’; its y’ is unknown (?).]

Declarative Feature Generation
compute features [f] for entities [e] at timestamps [t]

    def compute_features(features, entity_ids, timestamps):
        # compute relevant independent feature_fns
        # combine results
        # think of it as a "router"
        ...

    @feature_fn(features.LIVING_AREA, entities.ADDRESS, TIME_INDEPENDENT)
    def compute_living_area(features, entity_ids, timestamps):
        # easy: look-up
        ...

    @feature_fn(features :- [feature], entities :- [entity], timestamp :- {TIME_INDEPENDENT, TIME_DEPENDENT})

The feature_fn decorator specifies
1. Supported features
2. Supported entities
3. Whether the features are static or dynamic relative to time
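One way such a decorator and router could fit together is a module-level registry. The sketch below is a guess at the mechanism, not the talk's code: `REGISTRY`, `Registration`, the use of plain strings in place of the `features.*` and `entities.*` constants, and the looked-up values are all assumptions made for illustration.

```python
from collections import namedtuple

TIME_INDEPENDENT, TIME_DEPENDENT = "time_independent", "time_dependent"

Registration = namedtuple("Registration", "fn features entity time_mode")
REGISTRY = []  # hypothetical global registry of feature functions

def feature_fn(features, entity, time_mode):
    """Register the decorated function as a provider of `features` for `entity`."""
    feature_list = [features] if isinstance(features, str) else list(features)

    def decorator(fn):
        REGISTRY.append(Registration(fn, feature_list, entity, time_mode))
        return fn

    return decorator

@feature_fn("living_area", "address", TIME_INDEPENDENT)
def compute_living_area(features, entity_ids, timestamps):
    lookup = {"addr-1": 1000, "addr-2": 1850}  # made-up values
    return {"living_area": [lookup.get(e) for e in entity_ids]}

def compute_features(features, entity_ids, timestamps):
    """Router: call each registered function that supports a requested feature."""
    results = {}
    for reg in REGISTRY:
        wanted = [f for f in features if f in reg.features]
        if wanted:
            results.update(reg.fn(wanted, entity_ids, timestamps))
    return results

out = compute_features(["living_area"], ["addr-1", "addr-2"], [None, None])
```

With this shape, adding a feature means writing one decorated function; the router and the model config do not need to know where the value comes from.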

    @feature_fn(features.LIVING_AREA, entities.ADDRESS, TIME_INDEPENDENT)
    def compute_living_area(features, entity_ids, timestamps):
        # easy: look-up
        ...

    @feature_fn(features.TRAILING_30D_HPI, entities.POSTAL_CODE, TIME_DEPENDENT)
    def compute_trailing_30d_hpi(features, entity_ids, ts):
        # a bit more involved
        ...

A quick story about

In one of our first models, we had a config similar to:

    model_config = {
        'model_type': 'linear_regression',
        'features': [
            'living_area',
            'num_bedrooms',
            'num_bathrooms',
            'neighborhood'
        ],
        'feature_types': {
            'living_area': 'numerical',
            'num_bedrooms': 'categorical',
            'num_bathrooms': 'categorical',
            'neighborhood': 'categorical'
        },
        'model_params': {
            'l1': 3e-6,
            'l2': 3e-6,
            'loss': 'squared_loss'
        }
    }

We looked at the errors (stored via experiments on S3)

X, y, e, f, {config} → Error analysis → Hypotheses

…and plotted the errors on a map

Got our hands on some shapefiles

Added a feature function

    @feature_fn(features.GOLF_LOT, entities.ADDRESS, TIME_INDEPENDENT)
    def is_golf_lot(features, entity_ids, timestamps):
        # check lat lon against buffered shapefile
        ...
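The "buffered shapefile" check amounts to asking whether a point lies inside a polygon or within some distance of its boundary. Below is a minimal standard-library sketch of that test, assuming planar coordinates in a single unit (for example projected meters rather than raw lat/lon) and a polygon given as a list of vertices. The function names and the square "course" are invented for the example; production code would more likely read the shapefile and do this with a GIS library.

```python
import math

def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def distance_to_segment(x, y, x1, y1, x2, y2):
    """Euclidean distance from (x, y) to the segment (x1, y1)-(x2, y2)."""
    dx, dy = x2 - x1, y2 - y1
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(x - x1, y - y1)
    t = max(0.0, min(1.0, ((x - x1) * dx + (y - y1) * dy) / length_sq))
    return math.hypot(x - (x1 + t * dx), y - (y1 + t * dy))

def in_buffered_polygon(x, y, polygon, buffer_dist):
    """True if the point is inside the polygon or within buffer_dist of an edge."""
    if point_in_polygon(x, y, polygon):
        return True
    n = len(polygon)
    return any(
        distance_to_segment(x, y, *polygon[i], *polygon[(i + 1) % n]) <= buffer_dist
        for i in range(n)
    )

# Made-up square "golf course", 100 units on a side.
course = [(0, 0), (100, 0), (100, 100), (0, 100)]
on_course = in_buffered_polygon(50, 50, course, 20)
adjacent_lot = in_buffered_polygon(110, 50, course, 20)
far_lot = in_buffered_polygon(150, 50, course, 20)
```

The buffer is what captures the homes that matter here: a lot bordering the course sits outside the course polygon itself but inside the buffered one.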

Updated the model config

    model_config = {
        'model_type': 'linear_regression',
        'features': [
            'living_area',
            'num_bedrooms',
            'num_bathrooms',
            'neighborhood',
            'is_golf_lot',
        ],
        'feature_types': {
            'living_area': 'numerical',
            'num_bedrooms': 'categorical',
            'num_bathrooms': 'categorical',
            'neighborhood': 'categorical',
            'is_golf_lot': 'categorical',
        },
        'model_params': {
            'l1': 3e-6,
            'l2': 3e-6,
            'loss': 'squared_loss'
        }
    }

Ran an experiment

    > run_experiment(experiment_config)

… and watched the error drop. Rinse and repeat.

Q & A Ian Wong (@ihat) [email protected]
