
Ivory

Mark Hibberd
September 19, 2014


Ivory is a scalable and extensible data store for storing facts and extracting features. It can be used within a large machine learning pipeline for normalising data and providing feeds to model training and scoring pipelines.

Some interesting properties of Ivory are that it:

- Has no moving parts - just files on disk;
- Is optimised for scans not random access;
- Is extensible along the dimension of features;
- Is scalable by using HDFS or S3 as a backing store;
- Is an immutable data store, allowing version rollbacks.


Transcript

  1. Feature vectors: a table with one row per feature instance (IDs such as 89340218, 48149407, 18452274) and one column per feature (gender, balance, purchases, zipcode, prop_online, num_accs). Cells hold values such as M or F, 634.83, 3, 3001 and 0.6875, with "-" where no value is present. © Ambiata 2014
  2. A typical workflow: Data sets (B, C, D) -> Feature Eng -> Train Model -> Score

    - Multiple data sources: transaction logs, database snapshots, segmentation models, 3rd-party data
    - Feature engineering: data sources are joined, instances are created, features are engineered
    - The cool stuff: models are built, instances are scored
  3. Multiple data science projects: Data sets A to E feed three separate pipelines (Feature Eng 1, 2 and 3), each followed by its own Train and Score steps. Feature engineering is in a silo, with no reuse between model builds.
  4. The value for a shared asset: Data sets A to E each pass through their own feature engineering step (Feature Eng A to E) into a shared Feature Store, which feeds Train Model 1, 2 and 3 and their Score steps. Feature engineering is done once; features are reused across model builds.
  5. IVORY: A scalable and extensible data store for storing facts and extracting features
  6. MOTIVATIONS

    - Features for model training + scoring
    - Scalability - lots of features
    - Extensible feature sets
    - Feature sparseness
    - Feature integrity in a messy “big data” world
    - Recovering from operational errors
  7. KEY IDEAS

    - A sparse fact data model
    - Specific entity views to extract features - snapshots + chords
    - No moving parts - just files
    - Scale achieved by using HDFS or S3 as a backing store
    - Immutability built-in
  8. Fact: Entity - Attribute - Value - Time. The value of a feature (attribute) for a given entity, known to be valid from a certain point in time. A single “fact”: customer-1 balance 634 @ 2014-02-01
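The entity-attribute-value-time shape described on this slide can be sketched as a small immutable record. This is only an illustration of the data model, not Ivory's actual types or storage format; the class name `Fact` and its field names are assumptions.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import date
from typing import Union

# Hypothetical value type: the slides show strings, ints and doubles.
Value = Union[str, int, float]


@dataclass(frozen=True)
class Fact:
    """One fact: the value of an attribute for an entity, valid from `time`."""
    entity: str
    attribute: str
    value: Value
    time: date


# The single fact shown on the slide.
fact = Fact("customer-1", "balance", 634, date(2014, 2, 1))


def is_immutable(f: Fact) -> bool:
    """Return True if assigning to a field is rejected (frozen dataclass)."""
    try:
        f.value = 0  # type: ignore[misc]
    except FrozenInstanceError:
        return True
    return False
```

Because only facts that exist are stored, an entity with no known value for an attribute simply has no record, which is what makes the model sparse.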
  9. Scalable: the balance attribute across entities customer-1 to customer-4. Values shown: 634 @ 2014-02-01 (customer-1), 469 @ 2014-02-01, 276 @ 2014-04-01, 1966 @ 2014-03-01.
  10. Extensible: entities customer-1 to customer-4 against attributes gender, balance, purchases and zipcode. Values shown: 634 @ 2014-02-01, 469 @ 2014-02-01, 276 @ 2014-04-01, 1966 @ 2014-03-01, ‘M’ @ 2012-01-01, 3 @ 2014-03-27, ‘4670’ @ 2009-05-13.
  11. Sparse: a grid of entities (customer-1 to customer-4) against attributes (gender, sav.balance, cc.purchases, postcode), where a cell may hold several timestamped values or none. Values shown: 736 @ 2014-01-01, 3 @ 2014-02-19, 184 @ 2014-02-01, 312 @ 2014-03-01, ‘M’ @ 2012-01-01, 276 @ 2014-04-01, 4 @ 2014-04-04, 2 @ 2014-03-12, 3 @ 2014-03-27, ‘2381’ @ 2004-08-19, ‘4670’ @ 2009-05-13, ‘F’ @ 2007-04-01, ‘3001’ @ 2011-09-14, 1876 @ 2014-02-01, 1966 @ 2014-03-01, 634 @ 2014-02-01, 469 @ 2014-02-01.
  12. SNAPSHOTS

    - Attribute values for entities at a point in time
    - Same time for all entities
    - Select latest attribute values with respect to time
    - Used in preparing instances for scoring
  13. snapshot @ 2014-03-01, drawn over the grid of entities (customer-1 to customer-4) against attributes (gender, sav.balance, cc.purchases, postcode). Values shown: 736 @ 2014-01-01, 6 @ 2014-02-19, 184 @ 2014-02-01, 312 @ 2014-03-01, ‘M’ @ 2012-01-01, 276 @ 2014-04-01, 4 @ 2014-02-04, 2 @ 2014-03-12, 3 @ 2014-03-27, ‘2381’ @ 2004-08-19, ‘4670’ @ 2009-05-13, ‘F’ @ 2007-04-01, ‘3001’ @ 2011-09-14, 1876 @ 2014-02-01, 1966 @ 2014-03-01, 634 @ 2014-02-01, 469 @ 2014-02-01.
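As a rough sketch of the snapshot idea (one date for all entities, latest value per attribute as of that date), the selection could look like the following. The names `Fact`, `snapshot` and `FACTS` are hypothetical, the sample facts are taken from the factset slide later in the deck, and treating the snapshot date as inclusive is an assumption.

```python
from datetime import date
from typing import NamedTuple


class Fact(NamedTuple):
    entity: str
    attribute: str
    value: object
    time: date


def snapshot(facts, at):
    """For every entity, pick the latest value of each attribute with time <= at."""
    latest = {}  # (entity, attribute) -> most recent qualifying Fact
    for f in facts:
        if f.time > at:
            continue  # not yet valid at the snapshot date
        key = (f.entity, f.attribute)
        if key not in latest or f.time > latest[key].time:
            latest[key] = f
    view = {}  # entity -> {attribute: value}
    for (entity, attribute), f in latest.items():
        view.setdefault(entity, {})[attribute] = f.value
    return view


# Sample facts (assumed for illustration).
FACTS = [
    Fact("customer-1", "sav.balance", 634, date(2014, 2, 1)),
    Fact("customer-2", "sav.balance", 184, date(2014, 2, 1)),
    Fact("customer-2", "sav.balance", 312, date(2014, 3, 1)),
    Fact("customer-2", "cc.purchases", 4, date(2014, 2, 4)),
    Fact("customer-3", "gender", "F", date(2007, 4, 1)),
]

mid_february = snapshot(FACTS, date(2014, 2, 15))
start_of_march = snapshot(FACTS, date(2014, 3, 1))
```

Here `mid_february` sees customer-2's balance as 184, while `start_of_march` sees 312, because the later fact has become valid by then.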
  14. CHORDS

    - Attribute values for entities at a point in time
    - Different times for different entities
    - Select latest attribute values with respect to time
    - Used in preparing instances for training
  15. Chord points customer-2 @ 2014-03-01 and customer-4 @ 2014-01-01, drawn over the grid of entities (customer-1 to customer-4) against attributes (gender, sav.balance, cc.purchases, postcode). Values shown: 736 @ 2014-01-01, 6 @ 2014-02-19, 184 @ 2014-02-01, 312 @ 2014-03-01, ‘M’ @ 2012-01-01, 276 @ 2014-04-01, 4 @ 2014-04-04, 2 @ 2014-03-12, 3 @ 2014-03-27, ‘2381’ @ 2004-08-19, ‘4670’ @ 2009-05-13, ‘F’ @ 2007-04-01, ‘3001’ @ 2011-09-14, 1876 @ 2014-02-01, 1966 @ 2014-03-01, 634 @ 2014-02-01, 469 @ 2014-02-01.
  16. Chord result for customer-2 @ 2014-03-01 and customer-4 @ 2014-01-01 over attributes gender, sav.balance, cc.purchases and postcode. Values shown: ‘M’, 312, 6, ‘4670’, ‘3001’, 1876.
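A chord differs from a snapshot only in that each entity carries its own date. A minimal sketch under the same assumptions as before (hypothetical names `Fact`, `chord` and `FACTS`, inclusive dates, sample facts borrowed from the factset slide) might be:

```python
from datetime import date
from typing import NamedTuple


class Fact(NamedTuple):
    entity: str
    attribute: str
    value: object
    time: date


def chord(facts, times):
    """Latest value of each attribute per entity, as of that entity's own date.

    `times` maps entity -> date; entities not listed are left out of the view.
    """
    view = {entity: {} for entity in times}
    seen = {}  # (entity, attribute) -> time of the value currently in the view
    for f in facts:
        at = times.get(f.entity)
        if at is None or f.time > at:
            continue  # entity not requested, or fact not yet valid for it
        key = (f.entity, f.attribute)
        if key not in seen or f.time > seen[key]:
            seen[key] = f.time
            view[f.entity][f.attribute] = f.value
    return view


# Sample facts (assumed for illustration).
FACTS = [
    Fact("customer-1", "sav.balance", 634, date(2014, 2, 1)),
    Fact("customer-2", "sav.balance", 184, date(2014, 2, 1)),
    Fact("customer-2", "sav.balance", 312, date(2014, 3, 1)),
    Fact("customer-2", "cc.purchases", 4, date(2014, 2, 4)),
    Fact("customer-3", "gender", "F", date(2007, 4, 1)),
]

training_view = chord(
    FACTS,
    {"customer-2": date(2014, 3, 1), "customer-1": date(2014, 1, 1)},
)
```

In `training_view`, customer-2 is read as of 2014-03-01, while customer-1 is read as of 2014-01-01 and so has no values yet. Per-entity dates like this let each training instance be built from what was known at its own reference time.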
  17. REPOSITORY

    - A single class of “entity” (e.g. customer)
    - Stores facts
    - Stores dictionary of fact attributes
    - Versioned
  18. FACTSETS

    - A collection of facts
    - Treated as a single atomic unit
    - The unit of inclusion (ingestion)
    - The unit of exclusion
  19. Example factset rows (entity, namespace:attribute, value, date):

    - customer-1, accounts:sav.balance, 634, 2014-02-01
    - customer-2, accounts:sav.balance, 184, 2014-02-01
    - customer-2, accounts:sav.balance, 312, 2014-03-01
    - customer-2, accounts:cc.purchases, 4, 2014-02-04
    - customer-3, demographics:gender, F, 2007-04-01
    - customer-4, demographics:postcode, 3001, 2011-03-14
  20. DICTIONARY

    - Description and metadata of all attributes that can be ascribed to facts in the repository
    - Versioned
  21. Example dictionary entries (namespace, name, encoding, type, description):

    - demographics, gender, string, categorical, Gender
    - demographics, postcode, string, categorical, Post-code, zip-code
    - accounts, sav.balance, double, numerical, Balance of savings account
    - accounts, cc.purchases, int, numerical, Number of credit-card purchases
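To illustrate how a dictionary like this could be used, the sketch below checks a fact's value against the encoding declared for its attribute. It is not Ivory's dictionary format or API; `Definition`, `DICTIONARY`, `ENCODINGS` and `validate` are hypothetical names, and accepting Python ints for the `double` encoding is an assumption.

```python
from typing import NamedTuple


class Definition(NamedTuple):
    encoding: str
    type: str
    description: str


# The four entries from the slide, keyed by (namespace, name).
DICTIONARY = {
    ("demographics", "gender"): Definition("string", "categorical", "Gender"),
    ("demographics", "postcode"): Definition("string", "categorical", "Post-code, zip-code"),
    ("accounts", "sav.balance"): Definition("double", "numerical", "Balance of savings account"),
    ("accounts", "cc.purchases"): Definition("int", "numerical", "Number of credit-card purchases"),
}

# Assumed mapping from encoding names to acceptable Python types.
ENCODINGS = {
    "string": (str,),
    "int": (int,),
    "double": (int, float),
}


def validate(namespace, name, value):
    """Return True if the attribute is in the dictionary and the value fits its encoding."""
    definition = DICTIONARY.get((namespace, name))
    if definition is None:
        return False  # unknown attribute
    if isinstance(value, bool):
        return False  # bool is a subclass of int in Python; reject it explicitly
    return isinstance(value, ENCODINGS[definition.encoding])
```

For example, `validate("accounts", "cc.purchases", 4)` passes, while a string value for `sav.balance` or an attribute missing from the dictionary is rejected.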
  22. Repository versions 0 to 5 on a timeline of operations: create repository, import dictionary, ingest factset, ingest factset, import dictionary, ingest factset. Two snapshots and a chord are extracted along the way.
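One way to picture a versioned, immutable repository is as an append-only list of versions, where each operation adds a new version and older ones stay readable. The sketch below replays the operations named on this slide. The `Repository` and `Version` names are hypothetical, and the assumption that each operation yields exactly one new version number is mine rather than something the slide states.

```python
from typing import NamedTuple, Tuple


class Version(NamedTuple):
    dictionary: Tuple[str, ...]  # attribute names known at this version
    factsets: Tuple[str, ...]    # identifiers of the factsets included


class Repository:
    """Append-only history: existing versions are never modified."""

    def __init__(self):
        self._versions = [Version((), ())]  # version 0: freshly created

    def latest(self):
        return len(self._versions) - 1

    def at(self, version):
        return self._versions[version]

    def import_dictionary(self, attributes):
        current = self._versions[-1]
        self._versions.append(Version(tuple(attributes), current.factsets))
        return self.latest()

    def ingest_factset(self, factset_id):
        current = self._versions[-1]
        self._versions.append(
            Version(current.dictionary, current.factsets + (factset_id,))
        )
        return self.latest()


repo = Repository()                                  # create repository
repo.import_dictionary(["gender", "sav.balance"])    # import dictionary
repo.ingest_factset("factset-1")                     # ingest factset
repo.ingest_factset("factset-2")                     # ingest factset
repo.import_dictionary(["gender", "sav.balance", "cc.purchases"])
repo.ingest_factset("factset-3")                     # ingest factset
```

Reading `repo.at(3)` after the later operations still returns the state with two factsets and the first dictionary, which is the kind of "roll back" an immutable store makes cheap.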
  23. Data sets A to E are loaded directly into a Fact Store, which feeds three Train and Score pipelines. Source data is loaded directly; feature engineering is integrated and lazily generated on extraction.
  24. LAZY FEATURE GENERATION

    - Lazily generate features derived from existing facts on extract (chord/snapshot)
    - Derived “meta” features (i.e. ‘select’)
    - Windowing functions (e.g. “average over last 3 months”)
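A windowing function such as “average over last 3 months” can be computed from stored facts at extraction time rather than stored as a feature itself. The sketch below is one possible reading: the function name `window_mean` is hypothetical, "3 months" is approximated as 90 days, and the window is taken as open at the start and inclusive of the extraction date.

```python
from datetime import date, timedelta


def window_mean(facts, entity, attribute, at, days=90):
    """Mean of an entity's numeric values for `attribute` in the `days` up to `at`.

    `facts` is an iterable of (entity, attribute, value, time) tuples.
    Returns None when no fact falls inside the window.
    """
    start = at - timedelta(days=days)
    values = [
        value
        for (e, a, value, time) in facts
        if e == entity and a == attribute and start < time <= at
    ]
    if not values:
        return None
    return sum(values) / len(values)


# Sample facts (assumed for illustration).
FACTS = [
    ("customer-2", "sav.balance", 184, date(2014, 2, 1)),
    ("customer-2", "sav.balance", 312, date(2014, 3, 1)),
    ("customer-1", "sav.balance", 634, date(2014, 2, 1)),
]

march_average = window_mean(FACTS, "customer-2", "sav.balance", date(2014, 3, 1))
february_average = window_mean(FACTS, "customer-2", "sav.balance", date(2014, 2, 15))
```

Because the derived value is computed per extraction date, the same facts yield different features for a snapshot in February and one in March without anything being re-ingested.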
  25. REPOSITORY FORKING

    - Low-cost cloning/forking of repositories:
        - “master” production repo
        - “experimental” cloned repo
    - Allow a data scientist to join production features with their own without affecting production operations
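The slides do not say how forking is implemented. One way cloning can be cheap when factsets are immutable is for a fork to copy only the list of factset references and share the underlying data. The sketch below shows that idea only; `Repo`, `ingest` and `fork` are hypothetical names, not Ivory commands.

```python
class Repo:
    """A repository as a named, ordered tuple of references to immutable factsets."""

    def __init__(self, name, factsets=()):
        self.name = name
        self.factsets = tuple(factsets)

    def ingest(self, factset):
        # Build a new tuple; tuples already handed to forks are untouched.
        self.factsets = self.factsets + (factset,)

    def fork(self, name):
        # Copies references only; the factsets themselves are shared.
        return Repo(name, self.factsets)


# Factsets modelled as frozensets of (entity, attribute, value, date) tuples.
production_facts = frozenset({("customer-1", "sav.balance", 634, "2014-02-01")})
own_facts = frozenset({("customer-1", "my.score", 0.7, "2014-03-01")})

master = Repo("master")
master.ingest(production_facts)

experimental = master.fork("experimental")
experimental.ingest(own_facts)
```

After this, `experimental` sees both the production factset and the data scientist's own, while `master` still lists only the production factset.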