Ivory - An Introduction

Ambiata
July 11, 2014

An introduction to the Ivory feature store.

Transcript

  1. Ivory http://github.com/ambiata/ivory © Ambiata 2014

  2. Ivory: a data store for features

  3. [Diagram: Data sets A to E feed three separate pipelines, Feature Eng 1, 2 and 3, each followed by Train Model and Score for Models 1, 2 and 3]
    • Modelling in silos - models built and deployed in isolation
    • Feature engineering is in a silo - no reuse between model builds
  4. [Diagram: Data sets A to E each pass through their own feature engineering step (Feature Eng A to E) into the Feature Store (Ivory), which feeds Train Model and Score for Models 1, 2 and 3]
    • Modelling in silos - models built and deployed in isolation
    • Feature engineering is done once
    • Features are reused across model builds
  5. Motivations
    • Features for model training + scoring
    • Scalability - lots of features
    • Extensible feature sets
    • Feature sparseness
    • Feature integrity in a messy “big data” world
    • Recovering from operational errors
  6. Key ideas
    • A sparse fact data model
    • Specific entity views - snapshots + chords
    • No moving parts - just files + tools
    • Immutability built-in
  7. “Fact” data model

  8. A single “fact”: customer-1 sav.balance 634 @ 2014-02-01
    Fact: Entity - Attribute - Value - Time
    The value of a feature (attribute) for a given entity known to be valid from a certain point in time.
  9. Scalable [Diagram: the sav.balance attribute across customer-1 to customer-4. Values shown: 634 @ 2014-02-01, 469 @ 2014-02-01, 276 @ 2014-04-01, 1966 @ 2014-03-01]
  10. Extensible [Diagram: customer-1 to customer-4 against the attributes gender, sav.balance, cc.purchases and postcode. Values shown: 634 @ 2014-02-01, 469 @ 2014-02-01, 276 @ 2014-04-01, 1966 @ 2014-03-01, ‘M’ @ 2012-01-01, 3 @ 2014-03-27, ‘4670’ @ 2009-05-13]
  11. Sparse [Diagram: customer-1 to customer-4 against the attributes gender, sav.balance, cc.purchases and postcode, with several timestamped values per cell and some cells empty. Values shown: 736 @ 2014-01-01, 3 @ 2014-02-19, 184 @ 2014-02-01, 312 @ 2014-03-01, ‘M’ @ 2012-01-01, 276 @ 2014-04-01, 4 @ 2014-04-04, 2 @ 2014-03-12, 3 @ 2014-03-27, ‘2381’ @ 2004-08-19, ‘4670’ @ 2009-05-13, ‘F’ @ 2007-04-01, ‘3001’ @ 2011-09-14, 1876 @ 2014-02-01, 1966 @ 2014-03-01, 634 @ 2014-02-01, 469 @ 2014-02-01]
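
The Entity - Attribute - Value - Time model from the slides above can be sketched as a small immutable record type. This is only an illustration in Python, not Ivory's own representation: the `Fact` class and `attributes_of` helper are hypothetical names, and the sample rows are the six facts listed on the factset slide later in the deck.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Union

Value = Union[str, int, float]


@dataclass(frozen=True)  # frozen: a fact is never modified once recorded
class Fact:
    entity: str
    attribute: str
    value: Value
    time: date  # the point in time from which the value is known to be valid


# Sample facts copied from the factset slide (hypothetical in-memory form).
FACTS: List[Fact] = [
    Fact("customer-1", "sav.balance", 634, date(2014, 2, 1)),
    Fact("customer-2", "sav.balance", 184, date(2014, 2, 1)),
    Fact("customer-2", "sav.balance", 312, date(2014, 3, 1)),
    Fact("customer-2", "cc.purchases", 4, date(2014, 2, 4)),
    Fact("customer-3", "gender", "F", date(2007, 4, 1)),
    Fact("customer-4", "postcode", "3001", date(2011, 3, 14)),
]


def attributes_of(facts: List[Fact], entity: str) -> List[str]:
    """Return the attributes for which an entity has at least one fact."""
    return sorted({f.attribute for f in facts if f.entity == entity})
```

Because only observed values are stored, the model is sparse: `attributes_of(FACTS, "customer-3")` lists a single attribute, and nothing is held for the attributes that customer lacks.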
  12. Entity views

  13. Snapshot views
    • Attribute values for entities at a point in time
    • Same time for all entities
    • Select latest attribute values with respect to time
    • Used in preparing instances for scoring
  14. [Diagram: the sparse fact grid for customer-1 to customer-4 (gender, sav.balance, cc.purchases, postcode) with a snapshot taken @ 2014-03-01. Values shown: 736 @ 2014-01-01, 6 @ 2014-02-19, 184 @ 2014-02-01, 312 @ 2014-03-01, ‘M’ @ 2012-01-01, 276 @ 2014-04-01, 4 @ 2014-02-04, 2 @ 2014-03-12, 3 @ 2014-03-27, ‘2381’ @ 2004-08-19, ‘4670’ @ 2009-05-13, ‘F’ @ 2007-04-01, ‘3001’ @ 2011-09-14, 1876 @ 2014-02-01, 1966 @ 2014-03-01, 634 @ 2014-02-01, 469 @ 2014-02-01]
  15. [Diagram: the resulting snapshot, one row per customer (customer-1 to customer-4) with columns gender, sav.balance, cc.purchases and postcode, and no timestamps. Values shown: ‘M’, 312, 4, ‘4670’, ‘F’, ‘3001’, 1966, 634, 469]
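
The snapshot rule (one date for every entity, latest value at or before that date) can be sketched in a few lines. This is a minimal in-memory illustration, not how Ivory computes snapshots: `snapshot` is a hypothetical function, facts are plain tuples, and the sample data is taken from the factset slide later in the deck.

```python
from datetime import date

# (entity, attribute, value, time) tuples, copied from the factset slide.
FACTS = [
    ("customer-1", "sav.balance", 634, date(2014, 2, 1)),
    ("customer-2", "sav.balance", 184, date(2014, 2, 1)),
    ("customer-2", "sav.balance", 312, date(2014, 3, 1)),
    ("customer-2", "cc.purchases", 4, date(2014, 2, 4)),
    ("customer-3", "gender", "F", date(2007, 4, 1)),
    ("customer-4", "postcode", "3001", date(2011, 3, 14)),
]


def snapshot(facts, at):
    """Latest value per (entity, attribute) with time <= at, same date for all."""
    latest = {}  # (entity, attribute) -> (time, value)
    for entity, attribute, value, time in facts:
        if time > at:
            continue  # not yet known at the snapshot date
        key = (entity, attribute)
        if key not in latest or time > latest[key][0]:
            latest[key] = (time, value)

    view = {}  # entity -> {attribute: value}; absent attributes stay absent
    for (entity, attribute), (_, value) in latest.items():
        view.setdefault(entity, {})[attribute] = value
    return view
```

With this data, `snapshot(FACTS, date(2014, 3, 1))` gives customer-2 a `sav.balance` of 312 rather than 184, because the later fact is already valid on the snapshot date.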
  16. Chord views
    • Attribute values for entities at a point in time
    • Different times for different entities
    • Select latest attribute values with respect to time
    • Used in preparing instances for training
  17. [Diagram: the sparse fact grid for customer-1 to customer-4 (gender, sav.balance, cc.purchases, postcode) with chord times customer-2 @ 2014-03-01 and customer-4 @ 2014-01-01. Values shown: 736 @ 2014-01-01, 6 @ 2014-02-19, 184 @ 2014-02-01, 312 @ 2014-03-01, ‘M’ @ 2012-01-01, 276 @ 2014-04-01, 4 @ 2014-04-04, 2 @ 2014-03-12, 3 @ 2014-03-27, ‘2381’ @ 2004-08-19, ‘4670’ @ 2009-05-13, ‘F’ @ 2007-04-01, ‘3001’ @ 2011-09-14, 1876 @ 2014-02-01, 1966 @ 2014-03-01, 634 @ 2014-02-01, 469 @ 2014-02-01]
  18. [Diagram: the resulting chord, with rows customer-2 @ 2014-03-01 and customer-4 @ 2014-01-01 and columns gender, sav.balance, cc.purchases and postcode. Values shown: ‘M’, 312, 6, ‘4670’, ‘3001’, 1876]
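
A chord differs from a snapshot only in that each entity carries its own date. The sketch below illustrates that rule under the same assumptions as before: `chord` is a hypothetical function, facts are plain tuples taken from the factset slide later in the deck, and entities without a requested date are simply left out.

```python
from datetime import date

# (entity, attribute, value, time) tuples, copied from the factset slide.
FACTS = [
    ("customer-1", "sav.balance", 634, date(2014, 2, 1)),
    ("customer-2", "sav.balance", 184, date(2014, 2, 1)),
    ("customer-2", "sav.balance", 312, date(2014, 3, 1)),
    ("customer-2", "cc.purchases", 4, date(2014, 2, 4)),
    ("customer-3", "gender", "F", date(2007, 4, 1)),
    ("customer-4", "postcode", "3001", date(2011, 3, 14)),
]


def chord(facts, times):
    """Latest value per attribute, using a separate cut-off date per entity.

    `times` maps entity -> date; entities missing from it are skipped.
    """
    latest = {}  # (entity, attribute) -> (time, value)
    for entity, attribute, value, time in facts:
        at = times.get(entity)
        if at is None or time > at:
            continue  # entity not requested, or fact is after its cut-off
        key = (entity, attribute)
        if key not in latest or time > latest[key][0]:
            latest[key] = (time, value)

    view = {}
    for (entity, attribute), (_, value) in latest.items():
        view.setdefault(entity, {})[attribute] = value
    return view


# The two chord times shown on the slide.
TRAINING_TIMES = {"customer-2": date(2014, 3, 1), "customer-4": date(2014, 1, 1)}
```

Calling `chord(FACTS, TRAINING_TIMES)` returns rows for customer-2 and customer-4 only, each cut off at its own date, which is the shape needed when training labels were observed at different times.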
  19. Ivory concepts

  20. Repository
    • A single class of “entity” (e.g. customer)
    • Single on-disk hierarchy (e.g. HDFS)
    • Stores facts
    • Stores dictionary of fact attributes
    • Tools to get data in and out (e.g. MapReduce)
    • Versioned
  21. [Diagram: hdfs:///prod/ivory/customers, a single on-disk hierarchy. Counters: - - -]

  22. Dictionary
    • Description and metadata of all attributes that can be ascribed to facts in the repository
    • Versioned
  23. Example dictionary:

    namespace     name          encoding  type         description
    demographics  gender        string    categorical  Gender
    demographics  postcode      string    categorical  Post-code, zip-code
    accounts      sav.balance   double    numerical    Balance of savings account
    accounts      cc.purchases  int       numerical    Number of credit-card purchases
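
One way to picture the dictionary is as a lookup from (namespace, name) to an attribute definition that incoming values can be checked against. The sketch below is an assumption-laden illustration, not Ivory's dictionary format or API: `Definition`, `DICTIONARY` and `conforms` are hypothetical names, and the mapping from the slide's encodings to Python types is a choice made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Definition:
    encoding: str      # "string", "int" or "double", as on the slide
    type: str          # "categorical" or "numerical"
    description: str


# The four attributes from the example dictionary slide.
DICTIONARY = {
    ("demographics", "gender"): Definition("string", "categorical", "Gender"),
    ("demographics", "postcode"): Definition("string", "categorical", "Post-code, zip-code"),
    ("accounts", "sav.balance"): Definition("double", "numerical", "Balance of savings account"),
    ("accounts", "cc.purchases"): Definition("int", "numerical", "Number of credit-card purchases"),
}

# Assumed mapping of encodings to Python types (ints are accepted as doubles).
ACCEPTED = {"string": (str,), "int": (int,), "double": (int, float)}


def conforms(namespace, name, value):
    """True if the attribute is in the dictionary and the value fits its encoding."""
    definition = DICTIONARY.get((namespace, name))
    if definition is None:
        return False  # unknown attribute: nothing to ascribe the fact to
    if isinstance(value, bool):
        return False  # bool is a subclass of int in Python; reject it explicitly
    return isinstance(value, ACCEPTED[definition.encoding])
```

For example, `conforms("accounts", "cc.purchases", 4)` holds, while a string value for the same attribute or any value for an undeclared attribute is rejected.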
  24. [Diagram: hdfs:///prod/ivory/customers, counters - - -. Action: import dictionary (dict)]

  25. [Diagram: hdfs:///prod/ivory/customers, counters 0 0 -, containing dict[0]. Labels: dictionary version, repository version]

  26. [Diagram: hdfs:///prod/ivory/customers, counters 0 0 -, containing dict[0]. Action: update dictionary (dict)]

  27. [Diagram: hdfs:///prod/ivory/customers, counters 1 1 -, containing dict[1] and dict[0]. Label: latest dictionary version]
  28. Factsets
    • A collection of facts
    • Treated as a single atomic unit
    • The unit of inclusion (ingestion)
    • The unit of exclusion
  29. Example factset:

    customer-1  accounts:sav.balance   634   2014-02-01
    customer-2  accounts:sav.balance   184   2014-02-01
    customer-2  accounts:sav.balance   312   2014-03-01
    customer-2  accounts:cc.purchases  4     2014-02-04
    customer-3  demographics:gender    F     2007-04-01
    customer-4  demographics:postcode  3001  2011-03-14
  30. [Diagram: hdfs:///prod/ivory/customers, counters 1 1 -, containing dict[1] and dict[0]. Action: ingest factset]

  31. [Diagram: hdfs:///prod/ivory/customers, counters 1 1 -, containing dict[1], dict[0] and factset-a]

  32. Stores
    • A prioritised sequence of fact-sets
    • Versioned
  33. [Diagram: hdfs:///prod/ivory/customers, counters 2 1 0, containing dict[1], dict[0], factset-a and store[0] { a }. Labels: store version; store references factset-a]
  34. [Diagram: hdfs:///prod/ivory/customers, counters 2 1 0, containing dict[1], dict[0], factset-a and store[0] { a }. Action: ingest factset]
  35. [Diagram: hdfs:///prod/ivory/customers, counters 3 1 1, containing dict[1], dict[0], factset-a, factset-b, store[0] { a } and store[1] { b, a }]
  36. [Diagram: hdfs:///prod/ivory/customers, counters 3 1 1, containing dict[1], dict[0], factset-a, factset-b, store[0] { a } and store[1] { b, a }. Action: ingest factset]
  37. [Diagram: hdfs:///prod/ivory/customers, counters 4 1 2, containing dict[1], dict[0], factset-a, factset-b, factset-c, store[0] { a }, store[1] { b, a } and store[2] { c, b, a }]
  38. [Diagram: hdfs:///prod/ivory/customers, counters 4 1 2, containing dict[1], dict[0], factset-a, factset-b, factset-c, store[0] { a }, store[1] { b, a } and store[2] { c, b, a }. Action: remove factset-b]
  39. [Diagram: hdfs:///prod/ivory/customers, counters 5 1 3, containing dict[1], dict[0], factset-a, factset-b, factset-c, store[0] { a }, store[1] { b, a }, store[2] { c, b, a } and store[3] { c, a }. Label: factset-b removed]
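
The store sequence on the preceding slides can be read as an append-only history: every ingest or removal writes a new store version and leaves earlier versions, and the factsets themselves, untouched. The sketch below models only that behaviour. `StoreHistory` and its methods are hypothetical names, and creating a store version immediately on ingest is a simplification (on the slides factset-a is ingested before store[0] appears).

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class StoreHistory:
    # versions[i] is store[i]: factset ids in priority order, highest first.
    versions: List[Tuple[str, ...]] = field(default_factory=list)

    def latest(self) -> Tuple[str, ...]:
        return self.versions[-1] if self.versions else ()

    def ingest(self, factset_id: str) -> int:
        """Add a factset at the front of a new store version; return its number."""
        self.versions.append((factset_id,) + self.latest())
        return len(self.versions) - 1

    def remove(self, factset_id: str) -> int:
        """Write a new store version without the factset; old versions remain."""
        self.versions.append(tuple(f for f in self.latest() if f != factset_id))
        return len(self.versions) - 1


# Replay the sequence from the slides: ingest a, b, c, then remove b.
history = StoreHistory()
history.ingest("a")
history.ingest("b")
history.ingest("c")
history.remove("b")
```

After the replay, `history.versions` holds `{ a }`, `{ b, a }`, `{ c, b, a }` and `{ c, a }`, so a removal can be undone by pointing back at an earlier store version. This is in line with the deck's "immutability built-in" and "recovering from operational errors" themes.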
  40. Internals

  41. Internals (continued)
    • On-disk hierarchy
    • On-disk EAVT representation incl. compression
    • Internal snapshots
  42. User experience
    • Consistent command line tooling
    • Version metadata
    • Workflows
    • End-to-end example system
  43. Extensible dictionary
    • Support for rich attribute types (e.g. structs, arrays)
    • Arbitrary attribute metadata
    • Specification of valid attribute values, e.g. ‘M’ and ‘F’ only for gender
    • Improved validation
    • Improved on-disk representation
    • Useful for downstream applications, e.g. plots
  44. Lazy feature generation
    • Lazily generate features derived from existing facts on extract (chord/snapshot)
    • Derived “meta” features (i.e. ‘select’)
    • Windowing functions (e.g. “average over last 3 months”)
    • Row-level features
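
A windowing function such as “average over last 3 months” can be computed from stored facts at extraction time rather than being materialised up front. The sketch below shows one possible reading. `windowed_mean` is a hypothetical helper, the window is given in days instead of calendar months, and the sample data again comes from the factset slide.

```python
from datetime import date, timedelta

# (entity, attribute, value, time) tuples, copied from the factset slide.
FACTS = [
    ("customer-1", "sav.balance", 634, date(2014, 2, 1)),
    ("customer-2", "sav.balance", 184, date(2014, 2, 1)),
    ("customer-2", "sav.balance", 312, date(2014, 3, 1)),
    ("customer-2", "cc.purchases", 4, date(2014, 2, 4)),
]


def windowed_mean(facts, entity, attribute, at, window):
    """Mean of an attribute's values with at - window < time <= at.

    Returns None when the entity has no facts in the window.
    """
    values = [
        value
        for e, a, value, time in facts
        if e == entity and a == attribute and at - window < time <= at
    ]
    return sum(values) / len(values) if values else None


# Roughly "last 3 months", expressed as 90 days for this illustration.
THREE_MONTHS = timedelta(days=90)
```

Evaluated for customer-2 at 2014-03-01, the 90-day window covers both savings balances (184 and 312), so the derived feature is their mean; nothing is written back to the repository.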
  45. [Diagram: Data sets A to E are loaded directly into the Feature Store (Ivory), which also performs feature engineering (Feature Eng, Ivory) and feeds Train Model and Score for Models 1, 2 and 3]
    • Modelling in silos - models built and deployed in isolation
    • Feature engineering is integrated and lazily generated on extraction
    • Source data loaded directly
  46. Repository forking
    • Low-cost cloning/forking of repositories: a “master” production repo and an “experimental” cloned repo
    • Allow a data scientist to join production features with their own without affecting production operations
  47. Other filesystems
    • Support for repository metadata and fact sets to be on different file systems
    • Support HDFS, S3 and POSIX