
Ivory - Concepts

Ambiata
October 20, 2014



Transcript

  1. IVORY
    A scalable and extensible data store for storing facts and extracting features
    © Ambiata 2014
  2. REPOSITORY
    • Storing and extracting data for a single class of entity, e.g.:
      • customer
      • account
      • asset
  3. customer-1 balance 634 @ 2014-02-01 (a single “fact”)
    Fact: Entity - Attribute - Value - Time
    The value of a feature (attribute) for a given entity, known to be valid from a certain point in time.
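The entity-attribute-value-time shape of a fact can be written down as a small record. This is only an illustrative Python sketch; the class and field names are assumptions and are not taken from Ivory itself.

```python
from dataclasses import dataclass
from datetime import date
from typing import Union

Value = Union[str, int, float]


@dataclass(frozen=True)
class Fact:
    """One entity-attribute-value-time record (names are illustrative)."""
    entity: str
    attribute: str
    value: Value
    time: date  # the value is taken to be valid from this date onward


# The single fact shown on the slide.
fact = Fact("customer-1", "balance", 634, date(2014, 2, 1))
print(fact)
```

The record is frozen so that a fact, once created, cannot be modified in place.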
  4. [Figure: balance facts for customer-1, customer-2, customer-3 and customer-4; label: scalable. Values shown: 634 @ 2014-02-01, 469 @ 2014-02-01, 276 @ 2014-04-01, 1966 @ 2014-03-01.]
  5. [Figure: facts for customer-1 to customer-4 across the attributes gender, balance, purchases and zipcode; label: extensible. Values shown: 634 @ 2014-02-01, 469 @ 2014-02-01, 276 @ 2014-04-01, 1966 @ 2014-03-01, ‘M’ @ 2012-01-01, 3 @ 2014-03-27, ‘4670’ @ 2009-05-13.]
  6. [Figure: grid of customer-1 to customer-4 against the attributes gender, balance, purchases and zipcode, with some cells holding several timestamped values and others empty; label: Sparse. Values shown: 736 @ 2014-01-01, 3 @ 2014-02-19, 184 @ 2014-02-01, 312 @ 2014-03-01, ‘M’ @ 2012-01-01, 276 @ 2014-04-01, 4 @ 2014-04-04, 2 @ 2014-03-12, 3 @ 2014-03-27, ‘2381’ @ 2004-08-19, ‘4670’ @ 2009-05-13, ‘F’ @ 2007-04-01, ‘3001’ @ 2011-09-14, 1876 @ 2014-02-01, 1966 @ 2014-03-01, 634 @ 2014-02-01, 469 @ 2014-02-01.]
  7. • Facts are ingested in atomic units called factsets
    • Facts in a factset can span any set of:
      • entities
      • attributes
      • dates/times
  8. customer-1  balance    634   2014-02-01
     customer-3  balance    184   2014-02-01
     customer-4  purchases  4     2014-02-04
     customer-2  balance    312   2014-03-01
     customer-3  gender     F     2007-04-01
     customer-2  zipcode    3001  2011-03-14
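To make the idea of a factset concrete, the rows on the slide can be read into a list of facts. The sketch below assumes a whitespace-separated "entity attribute value date" layout purely for illustration; the slides do not specify Ivory's actual ingestion format, and `parse_factset` is a hypothetical helper.

```python
from datetime import date
from typing import List, NamedTuple


class Fact(NamedTuple):
    entity: str
    attribute: str
    value: str  # kept as text here; the dictionary would give the encoding
    time: date


def parse_factset(text: str) -> List[Fact]:
    """Parse one 'entity attribute value date' row per line (assumed layout)."""
    facts = []
    for line in text.strip().splitlines():
        entity, attribute, value, day = line.split()
        facts.append(Fact(entity, attribute, value, date.fromisoformat(day)))
    return facts


ROWS = """
customer-1 balance 634 2014-02-01
customer-3 balance 184 2014-02-01
customer-4 purchases 4 2014-02-04
customer-2 balance 312 2014-03-01
customer-3 gender F 2007-04-01
customer-2 zipcode 3001 2011-03-14
"""

factset = parse_factset(ROWS)
print(len(factset), "facts,",
      len({f.entity for f in factset}), "entities,",
      len({f.attribute for f in factset}), "attributes")
```

A single factset here mixes several entities, attributes and dates, which is the point the previous slide makes.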
  9. • Any attribute that is ingested must be declared in the repository’s dictionary
    • Dictionary stores metadata for each attribute
    • Updated dictionaries can be imported into a repository at any time
  10. namespace     name       encoding  type         description
      demographics  gender     string    categorical  Gender
      demographics  zipcode    string    categorical  Post-code, zip-code
      accounts      balance    double    numerical    Balance of savings account
      accounts      purchases  int       numerical    Number of credit-card purchases
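One way to picture the dictionary's role is as a lookup that rejects undeclared attributes and decodes raw values according to the declared encoding. The following Python sketch mirrors the columns of the table above; `Definition`, `PARSERS` and `decode` are illustrative names, and the mapping from encodings to Python types is an assumption.

```python
from typing import Dict, NamedTuple, Tuple


class Definition(NamedTuple):
    encoding: str
    type: str
    description: str


# Keyed by (namespace, name), mirroring the columns on the slide.
DICTIONARY: Dict[Tuple[str, str], Definition] = {
    ("demographics", "gender"): Definition("string", "categorical", "Gender"),
    ("demographics", "zipcode"): Definition("string", "categorical", "Post-code, zip-code"),
    ("accounts", "balance"): Definition("double", "numerical", "Balance of savings account"),
    ("accounts", "purchases"): Definition("int", "numerical", "Number of credit-card purchases"),
}

# Assumed mapping from the declared encoding to a Python parser.
PARSERS = {"string": str, "int": int, "double": float}


def decode(namespace: str, name: str, raw: str):
    """Reject undeclared attributes, then decode the raw value by its encoding."""
    key = (namespace, name)
    if key not in DICTIONARY:
        raise KeyError(f"attribute {namespace}/{name} is not declared")
    return PARSERS[DICTIONARY[key].encoding](raw)


print(decode("accounts", "balance", "634"))
print(decode("demographics", "zipcode", "3001"))
```

Keeping zipcode as a string means a value such as 3001 is treated as a category, not as a number.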
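  11. [Figure: feature matrix with one row per instance and one column per feature (gender, balance, purchases, zipcode); “-” marks a missing value. Labels: feature, instance.
    Instance IDs: 89340218, 48149407, 18452274, 07499337, 62948721, 93754723, 00272446, 13374497, 31989993, 46474236
    gender: M, -, F, M, F, -, F, F, M, -
    balance, purchases, zipcode: 0.00, 3, 3001; 634.83, 16, 4670; 15.12, 2, -; 33.56, 2, -; 98.34, 12, 3303; 523.81, 23, 2046; 1086.05, 17, -; 224.81, 9, -; 78.21, 2, 2134; 126.48, 4, -]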
  12. SNAPSHOTS
    • Attribute values for entities at a point in time
    • Same time for all entities
    • Select latest attribute values with respect to that time
    • Typically used in preparing instances for scoring
  13. [Figure: the grid of customer-1 to customer-4 against gender, balance, purchases and zipcode, with a snapshot taken @ 2014-03-01. Values shown: 736 @ 2014-01-01, 6 @ 2014-02-19, 184 @ 2014-02-01, 312 @ 2014-03-01, ‘M’ @ 2012-01-01, 276 @ 2014-04-01, 4 @ 2014-02-04, 2 @ 2014-03-12, 3 @ 2014-03-27, ‘2381’ @ 2004-08-19, ‘4670’ @ 2009-05-13, ‘F’ @ 2007-04-01, ‘3001’ @ 2011-09-14, 1876 @ 2014-02-01, 1966 @ 2014-03-01, 634 @ 2014-02-01, 469 @ 2014-02-01.]
  14. [Figure: the resulting snapshot for customer-1 to customer-4 over gender, balance, purchases and zipcode, with a single value per populated cell. Values shown: ‘M’, 312, 4, ‘4670’, ‘F’, ‘3001’, 1966, 634, 469.]
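The snapshot rule (one date for all entities, latest value per entity and attribute with respect to that date) can be sketched in a few lines of Python. This is a minimal illustration, not Ivory's implementation: the `snapshot` function is hypothetical, facts dated exactly on the snapshot date are assumed to be included, and the sample data uses only the customer-1 and customer-2 balance facts that the slides attribute unambiguously.

```python
from datetime import date
from typing import Dict, List, NamedTuple, Tuple


class Fact(NamedTuple):
    entity: str
    attribute: str
    value: object
    time: date


def snapshot(facts: List[Fact], at: date) -> Dict[Tuple[str, str], Fact]:
    """Latest fact per (entity, attribute) with time <= at."""
    latest: Dict[Tuple[str, str], Fact] = {}
    for fact in facts:
        if fact.time > at:
            continue  # not yet valid at the snapshot date
        key = (fact.entity, fact.attribute)
        if key not in latest or fact.time > latest[key].time:
            latest[key] = fact
    return latest


FACTS = [
    Fact("customer-1", "balance", 634, date(2014, 2, 1)),
    Fact("customer-2", "balance", 184, date(2014, 2, 1)),
    Fact("customer-2", "balance", 312, date(2014, 3, 1)),
    Fact("customer-2", "balance", 276, date(2014, 4, 1)),
]

result = snapshot(FACTS, date(2014, 3, 1))
for (entity, attribute), fact in sorted(result.items()):
    print(entity, attribute, fact.value)
```

For a snapshot @ 2014-03-01 the later customer-2 balance of 276 @ 2014-04-01 is ignored, so the selected balance is 312.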
  15. • It is assumed snapshots run periodically, e.g. daily or weekly
    • Ivory exploits this assumption to improve the runtime of successive snapshots
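The slides do not say how Ivory exploits periodic snapshots. One plausible approach, shown here purely as an assumption, is to start from the previous snapshot and fold in only the facts ingested since then, instead of rescanning every fact. The function name and data layout are hypothetical, and the sketch assumes the new snapshot date is not earlier than the previous one.

```python
from datetime import date
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str]                # (entity, attribute)
Cell = Tuple[date, object]           # (time of the fact, value)
Row = Tuple[str, str, object, date]  # (entity, attribute, value, time)


def update_snapshot(previous: Dict[Key, Cell], new_facts: Iterable[Row],
                    at: date) -> Dict[Key, Cell]:
    """Fold newly ingested facts into a copy of the previous snapshot."""
    latest = dict(previous)
    for entity, attribute, value, time in new_facts:
        if time > at:
            continue  # not yet valid at the new snapshot date
        key = (entity, attribute)
        # Assumption: on equal times the more recently ingested fact wins.
        if key not in latest or time >= latest[key][0]:
            latest[key] = (time, value)
    return latest


march = {
    ("customer-1", "balance"): (date(2014, 2, 1), 634),
    ("customer-2", "balance"): (date(2014, 3, 1), 312),
}
ingested_since = [("customer-2", "balance", 276, date(2014, 4, 1))]

april = update_snapshot(march, ingested_since, date(2014, 4, 1))
print(april[("customer-2", "balance")])
```

The work done is proportional to the newly ingested facts, not to the full history, which is the kind of saving a periodic schedule makes possible.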
  16. CHORDS
    • Attribute values for entities at a point in time
    • Different times for different entities
    • Select latest attribute values with respect to the times
    • Typically used in preparing instances for training
  17. [Figure: the grid of customer-1 to customer-4 against gender, balance, purchases and zipcode, with chord times customer-2 @ 2014-03-01 and customer-4 @ 2014-01-01. Values shown: 736 @ 2014-01-01, 6 @ 2014-02-19, 184 @ 2014-02-01, 312 @ 2014-03-01, ‘M’ @ 2012-01-01, 276 @ 2014-04-01, 4 @ 2014-04-04, 2 @ 2014-03-12, 3 @ 2014-03-27, ‘2381’ @ 2004-08-19, ‘4670’ @ 2009-05-13, ‘F’ @ 2007-04-01, ‘3001’ @ 2011-09-14, 1876 @ 2014-02-01, 1966 @ 2014-03-01, 634 @ 2014-02-01, 469 @ 2014-02-01.]
  18. [Figure: the resulting chord for customer-2 @ 2014-03-01 and customer-4 @ 2014-01-01 over gender, balance, purchases and postcode. Values shown: ‘M’, 312, 6, ‘4670’, ‘3001’, 1876.]
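A chord differs from a snapshot only in that each entity carries its own date. The Python sketch below illustrates that rule under the same assumptions as a snapshot would need (facts dated exactly on the chord date are included); the `chord` function is hypothetical, and the sample facts are limited to ones whose entity is stated explicitly on earlier slides.

```python
from datetime import date
from typing import Dict, List, NamedTuple, Tuple


class Fact(NamedTuple):
    entity: str
    attribute: str
    value: object
    time: date


def chord(facts: List[Fact], times: Dict[str, date]) -> Dict[Tuple[str, str], Fact]:
    """Latest fact per (entity, attribute) at or before that entity's own date."""
    latest: Dict[Tuple[str, str], Fact] = {}
    for fact in facts:
        at = times.get(fact.entity)
        if at is None or fact.time > at:
            continue  # entity not requested, or fact is after its chord date
        key = (fact.entity, fact.attribute)
        if key not in latest or fact.time > latest[key].time:
            latest[key] = fact
    return latest


FACTS = [
    Fact("customer-1", "balance", 634, date(2014, 2, 1)),
    Fact("customer-2", "balance", 184, date(2014, 2, 1)),
    Fact("customer-2", "balance", 312, date(2014, 3, 1)),
    Fact("customer-2", "balance", 276, date(2014, 4, 1)),
    Fact("customer-4", "purchases", 4, date(2014, 2, 4)),
]
TIMES = {"customer-2": date(2014, 3, 1), "customer-4": date(2014, 1, 1)}

result = chord(FACTS, TIMES)
for (entity, attribute), fact in sorted(result.items()):
    print(entity, attribute, fact.value)
```

Entities without a chord date (customer-1 here) are left out, and customer-4's purchases fact is excluded because it is dated after that entity's chord date.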
  19. customer-2 balance: 184 @ 2014-02-01, 312 @ 2014-03-01, 276 @ 2014-04-01
    max.balance.4M: ?
    Maximum balance over the last 4 months can be derived from the set of balance facts
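The derivation can be worked through directly. The slide does not give an evaluation date, so the sketch below assumes 2014-04-01 and a window that excludes its start date and includes its end date; both choices are assumptions made for illustration.

```python
from datetime import date

# customer-2's balance facts from the slide, as (time, value) pairs.
balance = [
    (date(2014, 2, 1), 184),
    (date(2014, 3, 1), 312),
    (date(2014, 4, 1), 276),
]

as_of = date(2014, 4, 1)          # assumed evaluation date
window_start = date(2013, 12, 1)  # 4 months before as_of

in_window = [value for time, value in balance if window_start < time <= as_of]
max_balance_4m = max(in_window)
print(max_balance_4m)
```

All three facts fall inside the window, so under these assumptions the derived value is 312.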
  20. base fact  derived facts
      balance    Maximum balance over the last month
                 Mean balance over the last 2 months
                 Balance gradient over the last 3 months
      purchase   Number of purchases in the last 3 weeks
                 Proportion of supermarket purchases in the last 2 weeks
      zipcode    Number of times the zipcode has changed in the last 5 years
                 Longest period where the zipcode has not changed in the last 5 years
  21. • Ivory represents derived facts as virtual features
    • Virtual features are declared in the dictionary
    • Specify expressions against base facts
    • Are computed lazily when features are extracted
  22. name                source    expression  window
      max.balance.4M      balance   max         4 months
      mean.balance.6M     balance   mean        6 months
      num.purchases.3W    purchase  count       3 weeks
      changes.zipcode.5Y  zipcode   num_flips   5 years
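The declarations in this table can be read as data: a name, a source attribute, an expression and a window. The Python sketch below shows one way such declarations might be evaluated on demand at extraction time. It is not Ivory's implementation: the semantics given to each expression (in particular `num_flips` as the number of changes between consecutive values), the window boundaries and all function names are assumptions.

```python
from datetime import date, timedelta
from statistics import mean
from typing import List, NamedTuple, Tuple


class VirtualFeature(NamedTuple):
    name: str
    source: str
    expression: str
    window: Tuple[int, str]  # (amount, unit)


def num_flips(values: list) -> int:
    """Assumed meaning: how many times consecutive values differ."""
    return sum(1 for a, b in zip(values, values[1:]) if a != b)


EXPRESSIONS = {
    "max": lambda vs: max(vs, default=None),
    "mean": lambda vs: mean(vs) if vs else None,
    "count": len,
    "num_flips": num_flips,
}

DECLARED = [
    VirtualFeature("max.balance.4M", "balance", "max", (4, "months")),
    VirtualFeature("mean.balance.6M", "balance", "mean", (6, "months")),
    VirtualFeature("num.purchases.3W", "purchase", "count", (3, "weeks")),
    VirtualFeature("changes.zipcode.5Y", "zipcode", "num_flips", (5, "years")),
]


def window_start(as_of: date, window: Tuple[int, str]) -> date:
    amount, unit = window
    if unit == "weeks":
        return as_of - timedelta(weeks=amount)
    months = amount * (12 if unit == "years" else 1)
    index = as_of.year * 12 + (as_of.month - 1) - months
    # Day is clamped to 28 to keep the sketch free of month-length handling.
    return date(index // 12, index % 12 + 1, min(as_of.day, 28))


def extract(feature: VirtualFeature, history: List[Tuple[date, object]],
            as_of: date):
    """Evaluate a virtual feature over one entity's (time, value) history
    for feature.source. Nothing is computed until this is called."""
    start = window_start(as_of, feature.window)
    values = [v for t, v in sorted(history) if start < t <= as_of]
    return EXPRESSIONS[feature.expression](values)


history = [(date(2014, 2, 1), 184), (date(2014, 3, 1), 312), (date(2014, 4, 1), 276)]
print(extract(DECLARED[0], history, date(2014, 4, 1)))
```

Only the base facts are stored in this sketch; the derived value exists solely as the return value of `extract`, which is one reading of "computed lazily".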
  23. • A commit is recorded for any repository change:
      • factset ingestions
      • dictionary imports
    • The repository at a given commit is an immutable data store
    • Snapshot and chord can be done at a specific commit
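  24. [Figure: commit timeline with commits numbered 0 to 5 along the sequence of operations: create repository, import dictionary, ingest factset, ingest factset, import dictionary, ingest factset. Two snapshots and a chord are shown attached to specific commits.]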
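The commit model can be pictured as an append-only log in which each commit records the dictionary in effect and the factsets ingested so far. The Python sketch below replays the operations from the timeline; the `Repository` and `Commit` classes, the identifiers, and the choice to number repository creation as commit 0 are all assumptions made for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple


class Commit(NamedTuple):
    id: int
    dictionary: Optional[str]  # identifier of the dictionary in effect
    factsets: Tuple[str, ...]  # identifiers of all factsets ingested so far


class Repository:
    def __init__(self) -> None:
        # Assumption: creating the repository is recorded as commit 0.
        self.commits: List[Commit] = [Commit(0, None, ())]

    def _commit(self, dictionary: Optional[str],
                factsets: Tuple[str, ...]) -> Commit:
        commit = Commit(len(self.commits), dictionary, factsets)
        self.commits.append(commit)  # existing commits are never modified
        return commit

    def import_dictionary(self, dictionary_id: str) -> Commit:
        head = self.commits[-1]
        return self._commit(dictionary_id, head.factsets)

    def ingest_factset(self, factset_id: str) -> Commit:
        head = self.commits[-1]
        return self._commit(head.dictionary, head.factsets + (factset_id,))

    def at(self, commit_id: int) -> Commit:
        """The immutable view of the repository at a given commit."""
        return self.commits[commit_id]


repo = Repository()
repo.import_dictionary("dictionary-1")
repo.ingest_factset("factset-1")
repo.ingest_factset("factset-2")
repo.import_dictionary("dictionary-2")
repo.ingest_factset("factset-3")

print(repo.at(3))
print(repo.at(5))
```

A snapshot or chord taken at commit 3 would see only the first two factsets and the first dictionary, regardless of what is ingested later.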
  25. • Repository
    • Commit
    • Dictionary
    • Factset
    • Base fact
    • Virtual feature
    • Snapshot
    • Chord