Ivory - A Data Store for Data Science

Ambiata
October 20, 2014

Transcript

  1. IVORY - A DATA STORE FOR DATA SCIENCE http://github.com/ambiata/ivory © Ambiata 2014

  2. DATA SCIENCE IN THE REAL WORLD

  3. PROBLEM #1

  4. “DATA WRANGLING”

  5. WHAT WE START WITH

  6. © Ambiata 2014

  7. WHAT WE NEED

  8. Feature vectors (rows are instances, columns are features)

    instance  gender  balance  purchases  zipcode  prop_online  num_accs
    89340218  M       0.00     3          3001     1.00         1
    48149407  -       634.83   16         4670     0.6875       3
    18452274  F       15.12    2          -        0.50         1
    07499337  M       33.56    2          -        1.00         1
    62948721  F       98.34    12         3303     0.8333       4
    93754723  -       523.81   23         2046     0.4782       1
    00272446  F       1086.05  17         -        1.00         2
    13374497  F       224.81   9          -        0.2222       1
    31989993  M       78.21    2          2134     0.50         1
    46474236  -       126.48   4          -        0.0          1

  9. A typical workflow
    Diagram labels: Data set B; Data set C; Data set D; Feature Eng; Model train; Score
    Multiple data sources:
    • Transaction logs
    • Database snapshots
    • Segmentation models
    • 3rd-party data
    Feature engineering:
    • Data sources are joined
    • Instances are created
    • Features are engineered
    The cool stuff:
    • Models are built
    • Instances are scored

  10. Feature preparation: 85%; Modelling: 15%

  11. Multiple data science projects
    Diagram labels: Data set A; Data set B; Data set C; Data set D; Data set E; Feature Eng 1; Feature Eng 2; Feature Eng 3; Train 1; Score 1; Train 2; Score 2; Train 3; Score 3
    Feature engineering is in a silo: no reuse between model builds

  12. PROBLEM #2

  13. “LAB TO FACTORY” AKA DEV OPS

  14. • Continually receiving data
      • Want to leverage a history of all this data
      • Continually training + scoring
      • Data may need to be corrected
      • Need to extend data model on-the-fly

  15. LAMBDA ARCHITECTURE

  16. query = function(all data)

  17. Diagram labels: New data stream; Query; Magical query engine

  18. Diagram labels: SERVING LAYER; BATCH LAYER; SPEED LAYER; New data stream; Query; All data; Precomputed views; Stream processing; Incremental views; Real-time store

  19. Diagram labels: New data stream; Query; All data; Feature view; Stream processing; Incremental views; Real-time store

  20. Diagram labels: New data stream; Query; All data; Feature view; Stream processing; Incremental views; Real-time store; Model train and score

  21. IVORY

  22. A shared feature view asset
    Diagram labels: Data set A; Data set B; Data set C; Data set D; Data set E; Ivory; Train 1; Score 1; Train 2; Score 2; Train 3; Score 3

  23. An immutable, batch-oriented data store
    Diagram labels: New data stream; Query; All data; Ivory; Stream processing; Incremental views; Real-time store; Model train and score

  24. An extensible data model, backed by HDFS/S3
    Diagram labels: Feature vectors; Ivory; HDFS / S3

  25. Apache V2 Licence github.com/ambiata/ivory
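
The slides state "query = function(all data)" (slide 16) and describe Ivory as an immutable, batch-oriented store that produces feature vectors (slides 23 and 24), but they do not show how such a query might look. The following is a minimal Python sketch of that idea, not Ivory's actual API or storage format. It assumes, hypothetically, that data is kept as append-only dated facts of the form (instance, feature, value, date), and that a feature view is the latest value of each feature for each instance as of a chosen date. The names `Fact`, `snapshot` and `to_rows`, and all sample values, are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Fact:
    """One immutable observation (hypothetical shape, not Ivory's format)."""
    instance: str
    feature: str
    value: object
    on: date


def snapshot(all_facts: Iterable[Fact], as_of: date) -> Dict[str, Dict[str, object]]:
    """query = function(all data): latest value per (instance, feature) up to as_of."""
    latest: Dict[Tuple[str, str], Fact] = {}
    for fact in all_facts:
        if fact.on > as_of:
            continue  # ignore facts dated after the requested snapshot date
        key = (fact.instance, fact.feature)
        # A later date wins; on equal dates the fact appended last wins,
        # so a correction is just another appended fact.
        if key not in latest or fact.on >= latest[key].on:
            latest[key] = fact
    vectors: Dict[str, Dict[str, object]] = {}
    for (instance, feature), fact in latest.items():
        vectors.setdefault(instance, {})[feature] = fact.value
    return vectors


def to_rows(vectors: Dict[str, Dict[str, object]], features: List[str]) -> List[List[object]]:
    """Flatten to one row per instance, writing '-' for missing values."""
    return [
        [instance] + [vector.get(feature, "-") for feature in features]
        for instance, vector in sorted(vectors.items(), key=lambda kv: kv[0])
    ]


# Illustrative facts only; nothing is ever updated in place.
facts = [
    Fact("89340218", "gender", "M", date(2014, 1, 5)),
    Fact("89340218", "balance", 120.50, date(2014, 3, 1)),
    Fact("89340218", "balance", 5.00, date(2014, 9, 1)),
    Fact("89340218", "balance", 0.00, date(2014, 9, 1)),   # correction, appended later
    Fact("48149407", "balance", 634.83, date(2014, 8, 15)),
    Fact("48149407", "num_accs", 3, date(2014, 10, 1)),    # new feature, no schema change
]

current = snapshot(facts, date(2014, 10, 20))
rows = to_rows(current, ["gender", "balance", "num_accs"])
```

Under these assumptions, a correction and a newly introduced feature are both handled by appending facts, and calling `snapshot` with a different `as_of` date recomputes the view from the same underlying data, which is one way to read the requirements listed on slide 14.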