Ivory - An Introduction

Ambiata
July 11, 2014

An introduction to the Ivory feature store.

Transcript

  1. [Diagram] Data sets A-E; Feature Eng 1-3; Train Model 1-3, each paired with Score.
     • Modelling in silos - models built and deployed in isolation
     • Feature engineering is in a silo - no reuse between model builds
  2. [Diagram] Data sets A-E; Feature Eng A-E; Feature Store (Ivory); Train Model 1-3, each paired with Score.
     • Modelling in silos - models built and deployed in isolation
     • Feature engineering is done once
     • Features are reused across model builds
  3. Motivations
     • Features for model training + scoring
     • Scalability - lots of features
     • Extensible feature sets
     • Feature sparseness
     • Feature integrity in a messy “big data” world
     • Recovering from operational errors
     © Ambiata 2014
  4. Key ideas
     • A sparse fact data model
     • Specific entity views - snapshots + chords
     • No moving parts - just files + tools
     • Immutability built-in
  5. A single “fact”: customer-1 sav.balance 634 @ 2014-02-01
     Fact: Entity - Attribute - Value - Time
     The value of a feature (attribute) for a given entity, known to be valid from a certain point in time.
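
As a minimal sketch of the Entity - Attribute - Value - Time idea, a fact can be modelled as a small immutable record. The `Fact` class and `parse_fact` helper below are hypothetical names chosen for illustration, not Ivory's API, and the parsing rule for values is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Union


@dataclass(frozen=True)
class Fact:
    """One immutable Entity-Attribute-Value-Time record."""
    entity: str
    attribute: str
    value: Union[int, str]
    time: date  # the value is taken as valid from this date onwards


def parse_fact(text: str) -> Fact:
    """Parse the slide notation 'entity attribute value @ yyyy-mm-dd'."""
    left, when = text.split("@")
    entity, attribute, raw = left.split()
    # Assumed rule: all-digit values become ints, anything else stays text.
    value = int(raw) if raw.isdigit() else raw.strip("‘’'")
    return Fact(entity, attribute, value, date.fromisoformat(when.strip()))


fact = parse_fact("customer-1 sav.balance 634 @ 2014-02-01")
print(fact)
```

Freezing the dataclass mirrors the "immutability built-in" idea: a fact is never updated in place, a newer fact with a later time simply supersedes it.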
  6. Scalable
     [Diagram] Entities: customer-1 to customer-4. Attribute: sav.balance. Facts shown: 634 @ 2014-02-01, 469 @ 2014-02-01, 276 @ 2014-04-01, 1966 @ 2014-03-01.
  7. Extensible
     [Diagram] Entities: customer-1 to customer-4. Attributes: gender, sav.balance, cc.purchases, postcode. Facts shown: 634 @ 2014-02-01, 469 @ 2014-02-01, 276 @ 2014-04-01, 1966 @ 2014-03-01, ‘M’ @ 2012-01-01, 3 @ 2014-03-27, ‘4670’ @ 2009-05-13.
  8. Sparse
     [Diagram] Entities: customer-1 to customer-4. Attributes: gender, sav.balance, cc.purchases, postcode. Facts shown: 736 @ 2014-01-01, 3 @ 2014-02-19, 184 @ 2014-02-01, 312 @ 2014-03-01, ‘M’ @ 2012-01-01, 276 @ 2014-04-01, 4 @ 2014-04-04, 2 @ 2014-03-12, 3 @ 2014-03-27, ‘2381’ @ 2004-08-19, ‘4670’ @ 2009-05-13, ‘F’ @ 2007-04-01, ‘3001’ @ 2011-09-14, 1876 @ 2014-02-01, 1966 @ 2014-03-01, 634 @ 2014-02-01, 469 @ 2014-02-01.
  9. Snapshot views
     • Attribute values for entities at a point in time
     • Same time for all entities
     • Select latest attribute values with respect to time
     • Used in preparing instances for scoring
  10. Snapshot @ 2014-03-01
      [Diagram] Entities: customer-1 to customer-4. Attributes: gender, sav.balance, cc.purchases, postcode. Facts shown: 736 @ 2014-01-01, 6 @ 2014-02-19, 184 @ 2014-02-01, 312 @ 2014-03-01, ‘M’ @ 2012-01-01, 276 @ 2014-04-01, 4 @ 2014-02-04, 2 @ 2014-03-12, 3 @ 2014-03-27, ‘2381’ @ 2004-08-19, ‘4670’ @ 2009-05-13, ‘F’ @ 2007-04-01, ‘3001’ @ 2011-09-14, 1876 @ 2014-02-01, 1966 @ 2014-03-01, 634 @ 2014-02-01, 469 @ 2014-02-01.
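
The snapshot rule (one date for every entity, latest value on or before that date) could be sketched as follows. This is an illustration only: `Fact` and `snapshot` are hypothetical names, treating the snapshot date as inclusive is an assumption, and the sample facts are borrowed from the factset example on slide 18.

```python
from datetime import date
from typing import Dict, Iterable, NamedTuple, Tuple


class Fact(NamedTuple):
    entity: str
    attribute: str
    value: object
    time: date


def snapshot(facts: Iterable[Fact], at: date) -> Dict[Tuple[str, str], object]:
    """Return the latest value per (entity, attribute) with time <= at."""
    latest: Dict[Tuple[str, str], Fact] = {}
    for fact in facts:
        if fact.time > at:
            continue  # not yet known at the snapshot date
        key = (fact.entity, fact.attribute)
        if key not in latest or fact.time > latest[key].time:
            latest[key] = fact
    return {key: fact.value for key, fact in latest.items()}


FACTS = [
    Fact("customer-1", "sav.balance", 634, date(2014, 2, 1)),
    Fact("customer-2", "sav.balance", 184, date(2014, 2, 1)),
    Fact("customer-2", "sav.balance", 312, date(2014, 3, 1)),
    Fact("customer-2", "cc.purchases", 4, date(2014, 2, 4)),
]

view = snapshot(FACTS, date(2014, 3, 1))
print(view)
```

Entity/attribute pairs with no fact on or before the snapshot date are simply absent from the result, which keeps the view as sparse as the underlying facts.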
  11. Chord views
      • Attribute values for entities at a point in time
      • Different times for different entities
      • Select latest attribute values with respect to time
      • Used in preparing instances for training
  12. Chord: customer-2 @ 2014-03-01, customer-4 @ 2014-01-01
      [Diagram] Entities: customer-1 to customer-4. Attributes: gender, sav.balance, cc.purchases, postcode. Facts shown: 736 @ 2014-01-01, 6 @ 2014-02-19, 184 @ 2014-02-01, 312 @ 2014-03-01, ‘M’ @ 2012-01-01, 276 @ 2014-04-01, 4 @ 2014-04-04, 2 @ 2014-03-12, 3 @ 2014-03-27, ‘2381’ @ 2004-08-19, ‘4670’ @ 2009-05-13, ‘F’ @ 2007-04-01, ‘3001’ @ 2011-09-14, 1876 @ 2014-02-01, 1966 @ 2014-03-01, 634 @ 2014-02-01, 469 @ 2014-02-01.
  13. Chord result
      [Diagram] Entities: customer-2 @ 2014-03-01, customer-4 @ 2014-01-01. Attributes: gender, sav.balance, cc.purchases, postcode. Values shown: ‘M’, 312, 6, ‘4670’, ‘3001’, 1876.
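
A chord differs from a snapshot only in that each entity carries its own cutoff date. A minimal sketch under the same assumptions as before (hypothetical `Fact` and `chord` names, inclusive cutoff, sample facts from slide 18, and illustrative cutoff dates that are not taken from the deck):

```python
from datetime import date
from typing import Dict, Iterable, NamedTuple, Tuple


class Fact(NamedTuple):
    entity: str
    attribute: str
    value: object
    time: date


def chord(facts: Iterable[Fact],
          times: Dict[str, date]) -> Dict[Tuple[str, str], object]:
    """Latest value per (entity, attribute) as of each entity's own date."""
    latest: Dict[Tuple[str, str], Fact] = {}
    for fact in facts:
        cutoff = times.get(fact.entity)
        if cutoff is None or fact.time > cutoff:
            continue  # entity not requested, or fact is after its cutoff
        key = (fact.entity, fact.attribute)
        if key not in latest or fact.time > latest[key].time:
            latest[key] = fact
    return {key: fact.value for key, fact in latest.items()}


FACTS = [
    Fact("customer-1", "sav.balance", 634, date(2014, 2, 1)),
    Fact("customer-2", "sav.balance", 184, date(2014, 2, 1)),
    Fact("customer-2", "sav.balance", 312, date(2014, 3, 1)),
    Fact("customer-2", "cc.purchases", 4, date(2014, 2, 4)),
]

# Illustrative cutoffs: each entity is viewed as of a different date.
view = chord(FACTS, {"customer-1": date(2014, 3, 1),
                     "customer-2": date(2014, 2, 15)})
print(view)
```

For training, the per-entity date would typically be the moment the label was observed, so that no fact from after that moment leaks into the instance.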
  14. Repository
      • A single class of “entity” (e.g. customer)
      • Single on-disk hierarchy (e.g. HDFS)
      • Stores facts
      • Stores dictionary of fact attributes
      • Tools to get data in and out (e.g. MapReduce)
      • Versioned
  15. Dictionary
      • Description and metadata of all attributes that can be ascribed to facts in the repository
      • Versioned
  16. namespace     name          encoding  type         description
      demographics  gender        string    categorical  Gender
      demographics  postcode      string    categorical  Post-code, zip-code
      accounts      sav.balance   double    numerical    Balance of savings account
      accounts      cc.purchases  int       numerical    Number of credit-card purchases
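
One way to picture the dictionary is as a lookup from (namespace, name) to an attribute definition that tells a reader how to decode raw values. The sketch below is an assumption-laden illustration: `Definition`, `DICTIONARY`, `ENCODINGS` and `decode` are hypothetical names, and mapping the `string`, `int` and `double` encodings onto Python's `str`, `int` and `float` is a choice made here, not something the deck specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Definition:
    encoding: str
    type: str
    description: str


# The four attributes listed in the dictionary table.
DICTIONARY = {
    ("demographics", "gender"): Definition("string", "categorical", "Gender"),
    ("demographics", "postcode"): Definition(
        "string", "categorical", "Post-code, zip-code"),
    ("accounts", "sav.balance"): Definition(
        "double", "numerical", "Balance of savings account"),
    ("accounts", "cc.purchases"): Definition(
        "int", "numerical", "Number of credit-card purchases"),
}

# Assumed mapping from dictionary encodings to Python types.
ENCODINGS = {"string": str, "int": int, "double": float}


def decode(namespace: str, name: str, raw: str):
    """Decode a raw text value using the attribute's declared encoding."""
    definition = DICTIONARY[(namespace, name)]  # KeyError if undeclared
    return ENCODINGS[definition.encoding](raw)


print(decode("accounts", "sav.balance", "634"))
print(decode("demographics", "postcode", "3001"))
```

Keeping postcode as a string, as the table does, avoids treating a categorical code as a number.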
  17. Factsets
      • A collection of facts
      • Treated as a single atomic unit
      • The unit of inclusion (ingestion)
      • The unit of exclusion
  18. customer-1 accounts:sav.balance 634 2014-02-01
      customer-2 accounts:sav.balance 184 2014-02-01
      customer-2 accounts:sav.balance 312 2014-03-01
      customer-2 accounts:cc.purchases 4 2014-02-04
      customer-3 demographics:gender F 2007-04-01
      customer-4 demographics:postcode 3001 2011-03-14
  19. [Diagram] hdfs:///prod/ivory/customers (labels 2, 1, 0): dict[0], dict[1]; factset-a; store[0] { a }. Annotations: “store version”, “store references factset-a”.
  20. [Diagram] hdfs:///prod/ivory/customers (labels 2, 1, 0): dict[0], dict[1]; factset-a; store[0] { a }; a new factset. Annotation: “ingest factset”.
  21. [Diagram] hdfs:///prod/ivory/customers (labels 3, 1, 1): dict[0], dict[1]; factset-a, factset-b; store[0] { a }, store[1] { b, a }.
  22. [Diagram] hdfs:///prod/ivory/customers (labels 3, 1, 1): dict[0], dict[1]; factset-a, factset-b; store[0] { a }, store[1] { b, a }; a new factset. Annotation: “ingest factset”.
  23. [Diagram] hdfs:///prod/ivory/customers (labels 4, 1, 2): dict[0], dict[1]; factset-a, factset-b, factset-c; store[0] { a }, store[1] { b, a }, store[2] { c, b, a }.
  24. [Diagram] hdfs:///prod/ivory/customers (labels 4, 1, 2): dict[0], dict[1]; factset-a, factset-b, factset-c; store[0] { a }, store[1] { b, a }, store[2] { c, b, a }. Annotation: “remove factset-b”.
  25. [Diagram] hdfs:///prod/ivory/customers (labels 5, 1, 3): dict[0], dict[1]; factset-a, factset-b, factset-c; store[0] { a }, store[1] { b, a }, store[2] { c, b, a }, store[3] { c, a }. Annotation: “factset-b removed”.
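
The sequence in slides 19 to 25 can be read as an append-only history of store versions, where each store is just an ordered list of factset references. The sketch below replays that sequence. `Repository`, `ingest` and `remove` are hypothetical names; that removal writes a new store version while leaving the factset's data and all earlier store versions untouched is an interpretation of the diagrams, not a statement about Ivory's implementation.

```python
from typing import List, Tuple

Store = Tuple[str, ...]  # ordered factset ids, newest first


class Repository:
    """Append-only list of store versions; old versions are never changed."""

    def __init__(self) -> None:
        self.stores: List[Store] = []

    def _commit(self, store: Store) -> int:
        self.stores.append(store)
        return len(self.stores) - 1  # index of the new store version

    def ingest(self, factset_id: str) -> int:
        """New store version that also references factset_id."""
        current = self.stores[-1] if self.stores else ()
        return self._commit((factset_id,) + current)

    def remove(self, factset_id: str) -> int:
        """New store version that no longer references factset_id."""
        current = self.stores[-1]
        return self._commit(tuple(f for f in current if f != factset_id))


repo = Repository()
repo.ingest("a")   # store[0] { a }
repo.ingest("b")   # store[1] { b, a }
repo.ingest("c")   # store[2] { c, b, a }
repo.remove("b")   # store[3] { c, a }
for version, store in enumerate(repo.stores):
    print(f"store[{version}]", store)
```

Because earlier store versions survive, recovering from a bad ingest amounts to pointing readers at a store that excludes the offending factset.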
  26. Extensible dictionary
      • Support for rich attribute types (e.g. structs, arrays)
      • Arbitrary attribute metadata
      • Specification of valid attribute values
        • e.g. ‘M’ and ‘F’ only for gender
      • Improved validation
      • Improved on-disk representation
      • Useful for downstream applications, e.g. plots
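
As a rough illustration of "specification of valid attribute values", a dictionary entry could carry an optional set of allowed values that is checked on ingestion. `ALLOWED_VALUES` and `is_valid` are hypothetical names, and treating attributes without a declared set as unrestricted is an assumption.

```python
from typing import Dict, Set

# Allowed values per attribute; only gender is restricted, as in the slide.
ALLOWED_VALUES: Dict[str, Set[str]] = {
    "demographics:gender": {"M", "F"},
}


def is_valid(attribute: str, value: str) -> bool:
    """True if the value is permitted for the attribute."""
    allowed = ALLOWED_VALUES.get(attribute)
    # Attributes with no declared value set are treated as unrestricted.
    return allowed is None or value in allowed


print(is_valid("demographics:gender", "F"))
print(is_valid("demographics:gender", "X"))
```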
  27. Lazy feature generation
      • Lazily generate features derived from existing facts on extract (chord/snapshot)
      • Derived “meta” features (i.e. ‘select’)
      • Windowing functions (e.g. “average over last 3 months”)
      • Row-level features
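
A windowing function such as "average over last 3 months" can be computed at extract time from the stored facts instead of being stored as a fact itself. The sketch below is only an illustration: `window_mean` is a hypothetical name, "3 months" is approximated as 90 days, the window is taken as open at the start and inclusive of the extract date, and the sample facts come from slide 18.

```python
from datetime import date, timedelta
from typing import Iterable, Optional, Tuple

# (entity, attribute, value, time)
FactRow = Tuple[str, str, float, date]


def window_mean(facts: Iterable[FactRow], entity: str, attribute: str,
                at: date, days: int = 90) -> Optional[float]:
    """Mean of an attribute's values in the window (at - days, at]."""
    start = at - timedelta(days=days)
    values = [value for (e, a, value, time) in facts
              if e == entity and a == attribute and start < time <= at]
    # No facts in the window means the derived feature is absent.
    return sum(values) / len(values) if values else None


FACTS = [
    ("customer-2", "sav.balance", 184, date(2014, 2, 1)),
    ("customer-2", "sav.balance", 312, date(2014, 3, 1)),
]

print(window_mean(FACTS, "customer-2", "sav.balance", date(2014, 3, 1)))
```

Nothing is written back to the repository here: the derived value exists only in the extracted view, which is what makes the generation lazy.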
  28. [Diagram] Data sets A-E; Feature Store (Ivory); Feature Eng (Ivory); Train Model 1-3, each paired with Score.
      • Modelling in silos - models built and deployed in isolation
      • Feature engineering is integrated and lazily generated on extraction
      • Source data loaded directly
  29. Repository forking
      • Low-cost cloning/forking of repositories:
        • “master” production repo
        • “experimental” cloned repo
      • Allow a data scientist to join production features with their own without affecting production operations
  30. Other filesystems
      • Support for repository metadata and fact sets to be on different file systems
      • Support HDFS, S3 and POSIX