Making real-estate liquid with iterative model development

Ian Wong
August 05, 2015


Slides from talks given at the SF Bay Area Machine Learning Meetup and OpenLate @ OpenDNS.

Talk excerpt:

Homes are Americans' most valuable yet least liquid asset. Selling a home on the market takes months of hassle and uncertainty. Founded in 2014, Opendoor removes all the friction from the transaction by providing homeowners with instant offers to buy their homes. Sellers can choose when they want to move out, and close the sale through a streamlined experience online.

At the core of the service is a collection of pricing algorithms that infer the fair market value for houses. In this talk, we’ll explore some of the challenges in dealing with real-estate data, and ways to address them. We’ll dive into aspects of putting together modular and declarative data pipelines for model training and feature generation, as well as recipes for reproducible model research.


Transcript

  1. OPENDOOR. August 5th, 2015. Ian Wong (@ihat), ian@opendoor.com.
    Making real-estate liquid with iterative model development
  2. None
  3. None
  4. None
  5. None
  6. None
  7. None
  8. None
  9. PRICING ALGORITHMS

  10. What is the fair market value for a home?

  11. PHX TRANSACTIONS IN THE PAST YEAR

  12. Iterate with the right building blocks to reduce error

  13. What’s challenging?

  14. Houses are heterogeneous

  15. None
  16. None
  17. Capturing what matters

  18. None
  19. Sparse observations of the market

  20. None
  21. Real-world data (read: messy)

  22. Data Source 1: vendor_id 123, street1 "6428 BLACK HILL", unit_number NULL, city PHOENIX, living_area 0, (home attributes) …
    Data Source 2: vendor_id 456, street1 "6428 W BLACK HILL", unit_number "BEAUTIFUL! MUST SEE!!!!", city PHOENIX, living_area 1000, (home attributes) …
    Right address: 6428 W Black Hill Rd, Phoenix, AZ (Who says? USPS)
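Before records like the two above can be merged, the vendor fields have to be cleaned up. The helpers below are a hypothetical sketch of that kind of normalization; the field semantics mirror the slide, but the cleaning rules are illustrative, not Opendoor's actual pipeline.

```python
import re

def normalize_street(street1):
    """Uppercase, strip punctuation, collapse whitespace."""
    s = re.sub(r'[^\w\s]', ' ', street1.upper())
    return re.sub(r'\s+', ' ', s).strip()

def clean_unit(unit_number):
    """Drop marketing noise like 'BEAUTIFUL! MUST SEE!!!!' from unit fields."""
    if unit_number is None:
        return None
    u = unit_number.strip()
    # Real unit numbers are short; long shouty strings are listing-remark noise.
    if len(u) > 10:
        return None
    return u
```

With rules like these, the two vendor records collapse onto one canonical street string, which a USPS-backed address service can then resolve.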
  23. CHALLENGES → ERROR

  24. Iterate with the right building blocks to reduce error

  25. ITERATE

  26. X (features), y (labels), t (time): a historical transaction. A new home to be priced: x′, y′ = ?, t′
  27. Split by time: (X_train, y_train) drawn from [t_train_start, t_train_end]; (X_test, y_test) drawn from [t_test_start, t_test_end]
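The time-based split on the slide can be sketched as follows; the tuple layout `(x, y, t)` and the half-open windows are assumptions for illustration.

```python
def time_split(records, train_start, train_end, test_start, test_end):
    """Partition (x, y, t) records into train/test sets by transaction time.

    Training rows come from the earlier window, test rows from the later
    one, so validation mimics pricing homes that sell in the future.
    """
    train = [(x, y) for x, y, t in records if train_start <= t < train_end]
    test = [(x, y) for x, y, t in records if test_start <= t < test_end]
    return train, test
```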
  28. In principle: X, y → Model Trainer → f

  29. In practice: Fetch & transform data → X, y → Model Trainer → f → Validation (y) → e
    config_data_fetch: Features • Timestamps • Label • Transformation • Filters • Imputation
    config_trainer: Hyperparameters • How to treat features?
    config_validation: Validation set filter • Error metric definition

  30. Fetch & transform data (config_data_fetch) → X, y → Model Trainer (config_trainer) → f → Validation (y, config_validation) → e. How do we improve?
  31. None
  32. MEASURE ERROR → TRY SOMETHING NEW (Features, Models, Sample selection) → MEASURE ERROR
  33. Iterative model development:
    1. It’s all about the error
    2. Optimize for iteration speed
    3. Let modelers do what they do best
    4. Prioritize work based on expected ROI
  34. BUILDING BLOCKS

  35. Model Training • Experiments • Task Abstraction • Feature Generation • feature_fns

  36. An experiment ≈ a run of model training:
    1. Fetches signals X, y
    2. Trains a model f
    3. Creates predictions ŷ and validation errors e
    4. Persists X, y, f, ŷ, e
    5. Keeps a record of the above
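The five steps above can be sketched as one driver function. This is a minimal illustration of the structure, not Opendoor's implementation: the `fetch`, `train`, `validate`, and `store` callables are hypothetical stand-ins injected by the caller.

```python
import time

def run_experiment(config, fetch, train, validate, store):
    """Run one experiment: fetch, train, validate, persist, record."""
    X, y = fetch(config)                       # 1. fetch signals X, y
    f = train(X, y, config['model_config'])    # 2. train a model f
    y_hat, e = validate(f, config)             # 3. predictions and errors
    record = {                                 # 5. record of the run
        'config': config,
        'error': e,
        'created_at': time.time(),
    }
    store(record, artifacts=(X, y, f, y_hat))  # 4. persist X, y, f, ŷ, e
    return record
```

Because every run returns (and persists) the same record shape, experiments become comparable rows rather than one-off notebook sessions.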
  37. FETCH SIGNALS → FILTER → TRANSFORM → TRAIN → VALIDATE → PERSIST

  38. experiment_config = {
          'training_start_date': '2005-06-01',
          'training_end_date': '2014-05-31',
          'model_config': model_config,
          'target': 'l_close_price',
          'train_filter': train_filter,
          'validation_start_date': '2014-06-01',
          'validation_end_date': '2015-06-01',
          'validation_filter': validation_filter,
      }
      > run_experiment(experiment_config)
    FETCH SIGNALS → FILTER → TRANSFORM → TRAIN → VALIDATE → PERSIST
  39. model_config = {
          'model_type': 'linear_regression',
          'features': ['living_area', 'num_bedrooms', 'num_bathrooms', 'neighborhood'],
          'feature_types': {
              'living_area': 'numerical',
              'num_bedrooms': 'categorical',
              'num_bathrooms': 'categorical',
              'neighborhood': 'categorical',
          },
          'model_params': {'l1': 3e-6, 'l2': 3e-6, 'loss': 'squared_loss'},
      }
    FETCH SIGNALS → FILTER → TRANSFORM → TRAIN → VALIDATE → PERSIST
  40. filter_config = {
          'not_null': ['living_area'],
          'value': {'max': {'num_bathrooms': 5}},
      }
    FETCH SIGNALS → FILTER → TRANSFORM → TRAIN → VALIDATE → PERSIST
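A declarative filter config like the one above can be interpreted by a small evaluator. This sketch assumes rows are dicts and supports only the two rule types the slide shows (`not_null` and `value`/`max`); the function name is hypothetical.

```python
def apply_filters(rows, filter_config):
    """Keep only rows that satisfy a declarative filter_config."""
    def keep(row):
        # 'not_null': listed columns must be present and non-null.
        for col in filter_config.get('not_null', []):
            if row.get(col) is None:
                return False
        # 'value' -> 'max': listed columns must not exceed the bound.
        for col, bound in filter_config.get('value', {}).get('max', {}).items():
            if row.get(col) is not None and row[col] > bound:
                return False
        return True

    return [r for r in rows if keep(r)]
```

Keeping filters as data rather than code means the exact sample selection is stored with each experiment and can be replayed later.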
  41. Experiments are first-class citizens

  42. ValuationExperiment: id 3, experiment_config {…}, valuation_model_id 10, batch_signal_id 200, experiment_type linear_regression
    ValuationModel: id 50, model_config {…}, model_s3_slug …, model_type linear_regression
    BatchSignal: id 50, start_timestamp 2010-06-01, end_timestamp 2015-06-01, signals […], s3_slug …
    ValuationError: id 50, errorable_id {…}, mae … (other eval stats) …, num_records …, time_range …, predictions_s3_slug …
  43. A quick story about

  44. Model Training • Experiments • Task Abstraction • Feature Generation • feature_fns

  45. X (features), y (labels), t (time): a historical transaction. A new home to be priced: x′, y′ = ?, t′
  46. Declarative Feature Generation: compute features [f] for entities [e] at timestamps [t]
  47. def compute_features(features, entity_ids, timestamps):
          # compute relevant independent feature_fns
          # combine results
          # think of it as a "router"

      @feature_fn(features.LIVING_AREA, entities.ADDRESS, TIME_INDEPENDENT)
      def compute_living_area(features, entity_ids, timestamps):
          # easy: look-up
  48. @feature_fn(features :- [feature], entities :- [entity], timestamp :- {TIME_INDEPENDENT, TIME_DEPENDENT})

    The feature_fn decorator specifies:
    1. Supported features
    2. Supported entities
    3. Whether the features are static or dynamic relative to time
  49. @feature_fn(features.LIVING_AREA, entities.ADDRESS, TIME_INDEPENDENT)
      def compute_living_area(features, entity_ids, timestamps):
          # easy: look-up

      @feature_fn(features.TRAILING_30D_HPI, entities.POSTAL_CODE, TIME_DEPENDENT)
      def compute_trailing_30d_hpi(features, entity_ids, ts):
          # a bit more involved
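One plausible way to wire the decorator to the "router" is a registry that `compute_features` dispatches through. This is a guessed mechanism, not the talk's actual code; the registry, the string keys, and the stub lookup table are all illustrative.

```python
# Registry mapping feature -> (entity kind, time-dependence, function).
FEATURE_REGISTRY = {}

def feature_fn(feature, entity, time_dependence):
    """Decorator: register a function as the computer for one feature."""
    def decorator(fn):
        FEATURE_REGISTRY[feature] = (entity, time_dependence, fn)
        return fn
    return decorator

def compute_features(features, entity_ids, timestamps):
    """Route each requested feature to its registered feature_fn."""
    results = {}
    for feature in features:
        _entity, _time_dep, fn = FEATURE_REGISTRY[feature]
        results[feature] = fn(feature, entity_ids, timestamps)
    return results

@feature_fn('living_area', 'address', 'TIME_INDEPENDENT')
def compute_living_area(feature, entity_ids, timestamps):
    # Stub lookup table standing in for a real data source.
    areas = {'addr-1': 1000, 'addr-2': 1450}
    return [areas.get(e) for e in entity_ids]
```

The payoff is that adding a feature means writing one decorated function; the router and every model config pick it up by name.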
  50. A quick story about

  51. In one of our first models, we had a config similar to:
      model_config = {
          'model_type': 'linear_regression',
          'features': ['living_area', 'num_bedrooms', 'num_bathrooms', 'neighborhood'],
          'feature_types': {
              'living_area': 'numerical',
              'num_bedrooms': 'categorical',
              'num_bathrooms': 'categorical',
              'neighborhood': 'categorical',
          },
          'model_params': {'l1': 3e-6, 'l2': 3e-6, 'loss': 'squared_loss'},
      }
  52. We looked at the errors (stored via experiments on S3): X, y, e, f, {config} → Error analysis → Hypotheses. …and plotted the errors on a map
  53. None
  54. Got our hands on some shapefiles

  55. None
  56. Added a feature function:
      @feature_fn(features.GOLF_LOT, entities.ADDRESS, TIME_INDEPENDENT)
      def is_golf_lot(features, entity_ids, timestamps):
          # check lat lon against buffered shapefile
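The slide checks each home's lat/lon against a buffered golf-course shapefile; in practice that would go through a geometry library and real shapefile data. As a self-contained stand-in, here is a plain ray-casting point-in-polygon test, which is the core geometric check involved.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: is (lon, lat) inside polygon, a list of
    (lon, lat) vertices? Counts crossings of a ray cast to the right."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside
```

A feature function like `is_golf_lot` would run this test (via a proper geometry library, with the polygon buffered outward) for each address's coordinates.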
  57. Updated the model config:
      model_config = {
          'model_type': 'linear_regression',
          'features': ['living_area', 'num_bedrooms', 'num_bathrooms', 'neighborhood', 'is_golf_lot'],
          'feature_types': {
              'living_area': 'numerical',
              'num_bedrooms': 'categorical',
              'num_bathrooms': 'categorical',
              'neighborhood': 'categorical',
              'is_golf_lot': 'categorical',
          },
          'model_params': {'l1': 3e-6, 'l2': 3e-6, 'loss': 'squared_loss'},
      }
  58. Ran an experiment:
      > run_experiment(experiment_config)
    … and watched the error drop. Rinse and repeat.
  59. Iterate with the right building blocks to reduce error

  60. Q & A Ian Wong (@ihat) ian@opendoor.com