
Making real-estate liquid with iterative model development

Ian Wong
August 05, 2015


Slides from talks given at the SF Bay Area Machine Learning Meetup and OpenLate @ OpenDNS.

Talk excerpt:

Homes are Americans' most valuable yet least liquid asset. Selling a home on the market takes months of hassle and uncertainty. Founded in 2014, Opendoor removes all the friction from the transaction by providing homeowners with instant offers to buy their homes. Sellers can choose when they want to move out, and close the sale through a streamlined experience online.

At the core of the service is a collection of pricing algorithms that infer the fair market value for houses. In this talk, we’ll explore some of the challenges in dealing with real-estate data, and ways to address them. We’ll dive into aspects of putting together modular and declarative data pipelines for model training and feature generation, as well as recipes for reproducible model research.


Transcript

  1. Data Source 1

        vendor_id: 123
        street1: 6428 BLACK HILL
        unit_number: NULL
        city: PHOENIX
        living_area: 0
        (home attributes): …

    Data Source 2

        vendor_id: 456
        street1: 6428 W BLACK HILL
        unit_number: BEAUTIFUL! MUST SEE!!!!
        city: PHOENIX
        living_area: 1000
        (home attributes): …

    Right address: 6428 W Black Hill Rd, Phoenix, AZ (Who says? USPS)
  2. In practice

    Fetch & transform data → X, y → Model Trainer → f → Validation → ŷ, e

    config (data fetch)
    • Features
    • Timestamps
    • Label
    • Transformation
    • Filters
    • Imputation

    config (trainer)
    • Hyperparameters
    • How to treat features?

    config (validation)
    • Validation set filter
    • Error metric definition
  3. How do we improve?

    Fetch & transform data → X, y → Model Trainer → f → Validation → ŷ, e

    Inputs: config (data fetch), config (trainer), config (validation)
  4. Iterative model development

    1. It’s all about the error
    2. Optimize for iteration speed
    3. Let modelers do what they do best
    4. Prioritize work based on expected ROI
  5. An experiment ≈ a run of model training

    1. Fetches signals X, y
    2. Trains a model f
    3. Creates predictions ŷ and validation errors e
    4. Persists X, y, f, ŷ, e
    5. Keeps a record of the above
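
The five steps on slide 5 can be sketched in a few lines of Python. This is a minimal illustration, not Opendoor's implementation: the `ExperimentRecord` class, the in-memory `EXPERIMENT_LOG`, the row field names, and the toy price-per-square-foot "model" are all assumptions made so the example is self-contained.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ExperimentRecord:
    """Everything one run persists: X, y, f, y_hat, e (hypothetical layout)."""
    X: list
    y: list
    model: float
    predictions: list
    errors: list
    mae: float


# Stand-in for a database table or object store (assumption).
EXPERIMENT_LOG = []


def run_experiment(train_rows, validation_rows):
    # 1. Fetch signals X, y (here the rows are already in memory).
    X = [row["living_area"] for row in train_rows]
    y = [row["close_price"] for row in train_rows]

    # 2. Train a model f: a single price-per-square-foot ratio (toy model).
    f = sum(y) / sum(X)

    # 3. Create predictions y_hat and validation errors e on held-out rows.
    y_hat = [f * row["living_area"] for row in validation_rows]
    e = [abs(pred - row["close_price"])
         for pred, row in zip(y_hat, validation_rows)]

    # 4. Persist X, y, f, y_hat, e, and 5. keep a record of the run.
    record = ExperimentRecord(X, y, f, y_hat, e, mean(e))
    EXPERIMENT_LOG.append(record)
    return record
```

Because every run appends a full record, two experiments can later be compared on their stored errors without retraining either model.
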
  6. experiment_config = {

        'training_start_date': '2005-06-01',
        'training_end_date': '2014-05-31',
        'model_config': model_config,
        'target': 'l_close_price',
        'train_filter': train_filter,
        'validation_start_date': '2014-06-01',
        'validation_end_date': '2015-06-01',
        'validation_filter': validation_filter,
    }

    > run_experiment(experiment_config)

    Pipeline stages: FETCH SIGNALS → FILTER → TRANSFORM → TRAIN → VALIDATE → PERSIST
  7. model_config = {

        'model_type': 'linear_regression',
        'features': [
            'living_area',
            'num_bedrooms',
            'num_bathrooms',
            'neighborhood'
        ],
        'feature_types': {
            'living_area': 'numerical',
            'num_bedrooms': 'categorical',
            'num_bathrooms': 'categorical',
            'neighborhood': 'categorical'
        },
        'model_params': {
            'l1': 3e-6,
            'l2': 3e-6,
            'loss': 'squared_loss'
        }
    }

    Pipeline stages: FETCH SIGNALS → FILTER → TRANSFORM → TRAIN → VALIDATE → PERSIST
  8. filter_config = {

        'not_null': ['living_area'],
        'value': {
            'max': {
                'num_bathrooms': 5
            }
        }
    }

    Pipeline stages: FETCH SIGNALS → FILTER → TRANSFORM → TRAIN → VALIDATE → PERSIST
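
One way a declarative filter like the `filter_config` on slide 8 could be applied to rows is sketched below. The `passes_filter` helper is hypothetical, and its semantics are assumptions: `not_null` drops rows where the field is missing or `None`, and `max` is treated as an inclusive upper bound. The talk does not specify either.

```python
def passes_filter(row, filter_config):
    """Return True if a row (a dict) satisfies the declarative filter.

    Assumed semantics (not confirmed by the slides):
    - 'not_null': the listed fields must be present and not None
    - 'value' -> 'max': the field must be <= the given bound (inclusive)
    """
    for name in filter_config.get('not_null', []):
        if row.get(name) is None:
            return False

    max_bounds = filter_config.get('value', {}).get('max', {})
    for name, upper in max_bounds.items():
        value = row.get(name)
        # A missing value is left for 'not_null' to handle.
        if value is not None and value > upper:
            return False

    return True


filter_config = {
    'not_null': ['living_area'],
    'value': {'max': {'num_bathrooms': 5}},
}

# Made-up rows for illustration only.
rows = [
    {'living_area': 1000, 'num_bathrooms': 2},
    {'living_area': None, 'num_bathrooms': 2},
    {'living_area': 1500, 'num_bathrooms': 7},
    {'living_area': 1200, 'num_bathrooms': 5},
]
kept = [row for row in rows if passes_filter(row, filter_config)]
```

Keeping the filter as data rather than code means the same function can serve both `train_filter` and `validation_filter`, and the exact filter used is stored with the experiment config.
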
  9. ValuationExperiment

        id: 3
        experiment_config: {…}
        valuation_model_id: 10
        batch_signal_id: 200
        experiment_type: linear_regression

    ValuationModel

        id: 50
        model_config: {…}
        model_s3_slug: …
        model_type: linear_regression

    BatchSignal

        id: 50
        start_timestamp: 2010-06-01
        end_timestamp: 2015-06-01
        signals: […]
        s3_slug: …

    ValuationError

        id: 50
        errorable_id: {…}
        mae: …
        (other eval stats): …
        num_records: …
        time_range: …
        predictions_s3_slug: …
  10. def compute_features(features, entity_ids, timestamps):

        # compute relevant independent feature_fns
        # combine results
        # think of it as a "router"
        ...

    @feature_fn(features.LIVING_AREA, entities.ADDRESS, TIME_INDEPENDENT)
    def compute_living_area(features, entity_ids, timestamps):
        # easy: look-up
        ...
  11. @feature_fn(features :- [feature], entities :- [entity], timestamp :- {TIME_INDEPENDENT, TIME_DEPENDENT})

    The feature_fn decorator specifies:
    1. Supported features
    2. Supported entities
    3. Whether the features are static or dynamic relative to time
  12. @feature_fn(features.LIVING_AREA, entities.ADDRESS, TIME_INDEPENDENT)

    def compute_living_area(features, entity_ids, timestamps):
        # easy: look-up
        ...

    @feature_fn(features.TRAILING_30D_HPI, entities.POSTAL_CODE, TIME_DEPENDENT)
    def compute_trailing_30d_hpi(features, entity_ids, ts):
        # a bit more involved
        ...
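
Slides 10 to 12 describe a `feature_fn` decorator and a `compute_features` "router" without showing their bodies. The sketch below is one plausible way to wire them together with a registry; it is an assumption, not the talk's code. The `REGISTRY` dict, the `TimeMode` enum, the string feature and entity names, and the made-up lookup table are all hypothetical.

```python
from enum import Enum


class TimeMode(Enum):
    TIME_INDEPENDENT = "time_independent"
    TIME_DEPENDENT = "time_dependent"


# feature name -> (function, entity, time mode); hypothetical registry.
REGISTRY = {}


def feature_fn(features, entity, time_mode):
    """Register the decorated function as the provider of `features`."""
    names = [features] if isinstance(features, str) else list(features)

    def decorator(fn):
        for name in names:
            REGISTRY[name] = (fn, entity, time_mode)
        return fn

    return decorator


# Made-up lookup table standing in for a real data store.
LIVING_AREA_BY_ADDRESS = {"addr-1": 1000, "addr-2": 1850}


@feature_fn("living_area", "address", TimeMode.TIME_INDEPENDENT)
def compute_living_area(features, entity_ids, timestamps):
    # Easy case: a plain look-up that ignores the timestamps.
    return {"living_area": [LIVING_AREA_BY_ADDRESS.get(e) for e in entity_ids]}


def compute_features(features, entity_ids, timestamps):
    """Route each requested feature to its registered function and merge."""
    grouped = {}
    for name in features:
        fn, _entity, _time_mode = REGISTRY[name]
        grouped.setdefault(fn, []).append(name)

    result = {}
    for fn, names in grouped.items():
        # Each function is called once with all the features it supports.
        result.update(fn(names, entity_ids, timestamps))
    return result
```

With this layout, adding a feature means writing one decorated function; the router and the model configs that list feature names do not need to change.
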
  13. In one of our first models, we had a config similar to:

    model_config = {
        'model_type': 'linear_regression',
        'features': [
            'living_area',
            'num_bedrooms',
            'num_bathrooms',
            'neighborhood'
        ],
        'feature_types': {
            'living_area': 'numerical',
            'num_bedrooms': 'categorical',
            'num_bathrooms': 'categorical',
            'neighborhood': 'categorical'
        },
        'model_params': {
            'l1': 3e-6,
            'l2': 3e-6,
            'loss': 'squared_loss'
        }
    }
  14. We looked at the errors (stored via experiments on S3)

    X, y, e, f, {config} → Error analysis → Hypotheses

    …and plotted the errors on a map
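
The slide plots errors on a map. As a rough, non-graphical stand-in for that step, persisted predictions can be grouped by a segment such as neighborhood to see where the model is systematically off. The `median_error_by_group` helper and the record field names (`predicted`, `actual`) are assumptions for illustration, not part of the talk.

```python
from collections import defaultdict
from statistics import median


def median_error_by_group(records, group_key):
    """Median signed error (predicted - actual) per group, worst first."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[group_key]].append(record["predicted"] - record["actual"])

    summary = {group: median(errors) for group, errors in grouped.items()}
    # Sort by absolute median error so the most biased segments come first.
    return sorted(summary.items(), key=lambda item: abs(item[1]), reverse=True)
```

A segment with a large negative median error is one the model consistently underprices, which is the kind of pattern that can suggest a missing feature.
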
  15. Updated the model config

    model_config = {
        'model_type': 'linear_regression',
        'features': [
            'living_area',
            'num_bedrooms',
            'num_bathrooms',
            'neighborhood',
            'is_golf_lot',
        ],
        'feature_types': {
            'living_area': 'numerical',
            'num_bedrooms': 'categorical',
            'num_bathrooms': 'categorical',
            'neighborhood': 'categorical',
            'is_golf_lot': 'categorical',
        },
        'model_params': {
            'l1': 3e-6,
            'l2': 3e-6,
            'loss': 'squared_loss'
        }
    }