Ivory - A Data Store for Data Science

Ambiata
October 20, 2014



Transcript

  1. Feature vectors

    Each row is an instance; each column is a feature.

    instance  gender  balance  purchases  zipcode  prop_online  num_accs
    89340218  M       0.00     3          3001     1.00         1
    48149407  -       634.83   16         4670     0.6875       3
    18452274  F       15.12    2          -        0.50         1
    07499337  M       33.56    2          -        1.00         1
    62948721  F       98.34    12         3303     0.8333       4
    93754723  -       523.81   23         2046     0.4782       1
    00272446  F       1086.05  17         -        1.00         2
    13374497  F       224.81   9          -        0.2222       1
    31989993  M       78.21    2          2134     0.50         1
    46474236  -       126.48   4          -        0.0          1

    © Ambiata 2014
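A feature table like the one on slide 1 can be sketched as one record per instance, with missing values made explicit. This is only an illustration: the `FeatureVector` class, its field types, and the `parse_optional` helper are assumptions rather than Ivory's schema, reading "-" as a missing value is an interpretation, and the pairing of instance IDs with rows is assumed to follow the order in which they appear on the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FeatureVector:
    """One row of the feature table: an instance and its feature values."""
    instance: str            # kept as a string so leading zeros survive
    gender: Optional[str]    # None where the slide shows "-"
    balance: float
    purchases: int
    zipcode: Optional[str]   # None where the slide shows "-"
    prop_online: float
    num_accs: int


def parse_optional(raw: str) -> Optional[str]:
    """Treat the slide's "-" placeholder as a missing value (an assumption)."""
    return None if raw == "-" else raw


# Two rows taken from the slide; the ID-to-row pairing is assumed.
rows = [
    FeatureVector("89340218", parse_optional("M"), 0.00, 3,
                  parse_optional("3001"), 1.00, 1),
    FeatureVector("18452274", parse_optional("F"), 15.12, 2,
                  parse_optional("-"), 0.50, 1),
]

# Instances whose zipcode is missing.
missing_zip = [row.instance for row in rows if row.zipcode is None]
```

Making the records immutable (`frozen=True`) mirrors the idea that a feature vector is a computed snapshot rather than something edited in place.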
  2. A typical workflow

    Diagram labels: Data set B, Data set C, Data set D; Feature Eng; Model train; Score

    Multiple data sources:
    • Transaction logs
    • Database snapshots
    • Segmentation models
    • 3rd-party data

    Feature engineering:
    • Data sources are joined
    • Instances are created
    • Features are engineered

    The cool stuff:
    • Models are built
    • Instances are scored
  3. Multiple data science projects

    Diagram labels: Data set A to Data set E; Feature Eng 1 to 3; Train 1 to 3; Score 1 to 3

    Feature engineering is in a silo - no reuse between model builds
  4. • Continually receiving data

    • Want to leverage a history of all this data
    • Continually training + scoring
    • Data may need to be corrected
    • Need to extend data model on-the-fly
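The requirements on slide 4 (keep the full history, allow corrections, extend the data model on the fly) can be illustrated with an append-only store of facts. This is a minimal sketch under stated assumptions: `Fact`, `FactStore`, `append`, and `snapshot` are hypothetical names, not Ivory's API, and the sample entity and values are made up.

```python
from typing import Any, Dict, List, NamedTuple, Tuple


class Fact(NamedTuple):
    entity: str
    feature: str
    date: str      # ISO date the value applies to, e.g. "2014-10-01"
    value: Any
    version: int   # ingestion order; a later version corrects an earlier one


class FactStore:
    """Append-only: facts are never updated or deleted, only added."""

    def __init__(self) -> None:
        self._facts: List[Fact] = []

    def append(self, entity: str, feature: str, date: str, value: Any) -> None:
        self._facts.append(Fact(entity, feature, date, value, len(self._facts)))

    def snapshot(self, as_of: str) -> Dict[str, Dict[str, Any]]:
        """Latest value of every feature per entity, using facts dated <= as_of."""
        latest: Dict[Tuple[str, str], Fact] = {}
        for fact in self._facts:
            if fact.date > as_of:  # ISO dates compare correctly as strings
                continue
            key = (fact.entity, fact.feature)
            current = latest.get(key)
            # Newer date wins; for the same date, the later ingestion wins.
            if current is None or (fact.date, fact.version) > (current.date, current.version):
                latest[key] = fact
        view: Dict[str, Dict[str, Any]] = {}
        for (entity, feature), fact in latest.items():
            view.setdefault(entity, {})[feature] = fact.value
        return view


store = FactStore()
store.append("customer-1", "balance", "2014-10-01", 10.0)
store.append("customer-1", "balance", "2014-10-01", 12.5)  # correction, same date
store.append("customer-1", "num_accs", "2014-10-15", 1)    # new feature, no schema change

early = store.snapshot("2014-10-10")
late = store.snapshot("2014-10-20")
```

Because nothing is overwritten, the snapshot for any past date can be recomputed later, which is what makes repeated training and scoring over history possible in this sketch.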
  5. Diagram labels: BATCH LAYER; SPEED LAYER; SERVING LAYER

    New data stream; Query; All data; Precomputed views; Stream processing; Incremental views; Real-time store
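The layers named on slide 5 match what is commonly called the Lambda Architecture: a batch layer precomputes views over all data, a speed layer keeps incremental views over recent events, and a query combines the two. The sketch below illustrates that idea with a simple per-key sum; `batch_view`, `SpeedLayer`, `query`, and the event shape are assumptions chosen for illustration, not something the slides specify.

```python
from collections import Counter
from typing import Iterable, Tuple

Event = Tuple[str, int]  # (key, amount); a hypothetical event shape


def batch_view(all_data: Iterable[Event]) -> Counter:
    """Batch layer: recompute the view from scratch over all stored data."""
    view: Counter = Counter()
    for key, amount in all_data:
        view[key] += amount
    return view


class SpeedLayer:
    """Speed layer: update an incremental view as each new event arrives."""

    def __init__(self) -> None:
        self.realtime_view: Counter = Counter()

    def on_event(self, event: Event) -> None:
        key, amount = event
        self.realtime_view[key] += amount


def query(key: str, precomputed: Counter, realtime: Counter) -> int:
    """Serving side: merge the precomputed and real-time views for one key."""
    return precomputed[key] + realtime[key]


# Events already stored when the last batch run started.
precomputed = batch_view([("a", 2), ("b", 1), ("a", 3)])

# Events that arrived after the batch run.
speed = SpeedLayer()
speed.on_event(("a", 4))

total_a = query("a", precomputed, speed.realtime_view)
total_c = query("c", precomputed, speed.realtime_view)
```

A `Counter` returns 0 for keys it has never seen, so a query for an unknown key needs no special case here.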
  6. Diagram labels: New data stream; Query; All data; Feature view

    Stream processing; Incremental views; Real-time store
  7. Diagram labels: New data stream; Query; All data; Feature view

    Stream processing; Incremental views; Real-time store; Model train and score
  8. A shared feature view asset

    Diagram labels: Data set A to Data set E; Ivory; Train 1 to 3; Score 1 to 3
  9. An immutable, batch-oriented data store

    Diagram labels: New data stream; Query; All data; Ivory; Stream processing; Incremental views; Real-time store; Model train and score