Slide 1

Slide 1 text

Ivory http://github.com/ambiata/ivory © Ambiata 2014

Slide 2

Slide 2 text

Ivory: a data store for features

Slide 3

Slide 3 text

[Diagram: Data sets A-E; Feature Eng 1-3; Train / Model / Score for Models 1-3]
Modelling in silos - models built and deployed in isolation
Feature engineering is in a silo - no reuse between model builds

Slide 4

Slide 4 text

[Diagram: Data sets A-E; Feature Eng A-E; Feature Store (Ivory); Train / Model / Score for Models 1-3]
Modelling in silos - models built and deployed in isolation
Feature engineering is done once
Features are reused across model builds

Slide 5

Slide 5 text

Motivations
• Features for model training + scoring
• Scalability - lots of features
• Extensible feature sets
• Feature sparseness
• Feature integrity in a messy “big data” world
• Recovering from operational errors

Slide 6

Slide 6 text

Key ideas
• A sparse fact data model
• Specific entity views - snapshots + chords
• No moving parts - just files + tools
• Immutability built-in

Slide 7

Slide 7 text

“Fact” data model

Slide 8

Slide 8 text

customer-1 sav.balance 634 @ 2014-02-01 (a single “fact”)
Fact: Entity - Attribute - Value - Time
The value of a feature (attribute) for a given entity, known to be valid from a certain point in time.
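As a minimal sketch, the Entity - Attribute - Value - Time shape of a fact could be written as an immutable record. The class name `Fact` and its field names are illustrative assumptions; this is not Ivory's own representation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: a fact cannot be changed once created
class Fact:
    entity: str      # e.g. "customer-1"
    attribute: str   # e.g. "sav.balance"
    value: object    # e.g. 634
    time: date       # the point in time from which the value is valid


# The single fact shown on the slide.
fact = Fact("customer-1", "sav.balance", 634, date(2014, 2, 1))
```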

Slide 9

Slide 9 text

customer-1 sav.balance 634 @ 2014-02-01 scalable customer-2 customer-3 customer-4 469 @ 2014-02-01 276 @ 2014-04-01 1966 @ 2014-03-01 © Ambiata 2014

Slide 10

Slide 10 text

customer-2 customer-3 customer-4 customer-1 gender sav.balance cc.purchases postcode 634 @ 2014-02-01 extensible 469 @ 2014-02-01 276 @ 2014-04-01 1966 @ 2014-03-01 ‘M’ @ 2012-01-01 3 @ 2014-03-27 ‘4670’ @ 2009-05-13 © Ambiata 2014

Slide 11

Slide 11 text

736 @ 2014-01-01 3 @ 2014-02-19 184 @ 2014-02-01 312 @ 2014-03-01 customer-1 customer-2 customer-3 customer-4 gender sav.balance cc.purchases postcode ‘M’ @ 2012-01-01 276 @ 2014-04-01 4 @ 2014-04-04 2 @ 2014-03-12 3 @ 2014-03-27 ‘2381’ @ 2004-08-19 ‘4670’ @ 2009-05-13 ‘F’ @ 2007-04-01 ‘3001’ @ 2011-09-14 1876 @ 2014-02-01 1966 @ 2014-03-01 634 @ 2014-02-01 Sparse 469 @ 2014-02-01 © Ambiata 2014

Slide 12

Slide 12 text

Entity views

Slide 13

Slide 13 text

Snapshot views
• Attribute values for entities at a point in time
• Same time for all entities
• Select latest attribute values with respect to time
• Used in preparing instances for scoring

Slide 14

Slide 14 text

736 @ 2014-01-01 6 @ 2014-02-19 184 @ 2014-02-01 312 @ 2014-03-01 customer-1 customer-2 customer-3 customer-4 gender sav.balance cc.purchases postcode ‘M’ @ 2012-01-01 276 @ 2014-04-01 4 @ 2014-02-04 2 @ 2014-03-12 3 @ 2014-03-27 ‘2381’ @ 2004-08-19 ‘4670’ @ 2009-05-13 ‘F’ @ 2007-04-01 ‘3001’ @ 2011-09-14 1876 @ 2014-02-01 1966 @ 2014-03-01 634 @ 2014-02-01 snapshot @ 2014-03-01 469 @ 2014-02-01 © Ambiata 2014

Slide 15

Slide 15 text

customer-1 customer-2 customer-3 customer-4 gender sav.balance cc.purchases postcode ‘M’ 312 4 ‘4670’ ‘F’ ‘3001’ 1966 634 469 © Ambiata 2014
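The snapshot selection can be sketched as "keep the latest fact per entity and attribute that is not later than the snapshot date". The function name `snapshot`, the tuple layout and the small fact list are assumptions for illustration only. Treating the snapshot date as inclusive is also an assumption, chosen because the slides show 312 @ 2014-03-01 being selected for a snapshot at 2014-03-01.

```python
from datetime import date

# (entity, attribute, value, time) tuples; a small illustrative subset of facts.
FACTS = [
    ("customer-1", "sav.balance", 634, date(2014, 2, 1)),
    ("customer-2", "sav.balance", 184, date(2014, 2, 1)),
    ("customer-2", "sav.balance", 312, date(2014, 3, 1)),
    ("customer-2", "cc.purchases", 4, date(2014, 2, 4)),
    ("customer-3", "gender", "F", date(2007, 4, 1)),
]


def snapshot(facts, at):
    """Latest value per (entity, attribute) among facts with time <= at."""
    latest = {}
    for entity, attribute, value, time in facts:
        if time > at:
            continue  # not yet known at the snapshot date
        key = (entity, attribute)
        if key not in latest or time > latest[key][0]:
            latest[key] = (time, value)
    return {key: value for key, (_, value) in latest.items()}


view = snapshot(FACTS, date(2014, 3, 1))
```

Here `view[("customer-2", "sav.balance")]` is 312, while a snapshot taken at 2014-02-15 would select 184 instead.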

Slide 16

Slide 16 text

Chord views
• Attribute values for entities at a point in time
• Different times for different entities
• Select latest attribute values with respect to time
• Used in preparing instances for training

Slide 17

Slide 17 text

736 @ 2014-01-01 6 @ 2014-02-19 184 @ 2014-02-01 312 @ 2014-03-01 customer-1 customer-2 customer-3 customer-4 gender sav.balance cc.purchases postcode ‘M’ @ 2012-01-01 276 @ 2014-04-01 4 @ 2014-04-04 2 @ 2014-03-12 3 @ 2014-03-27 ‘2381’ @ 2004-08-19 ‘4670’ @ 2009-05-13 ‘F’ @ 2007-04-01 ‘3001’ @ 2011-09-14 1876 @ 2014-02-01 1966 @ 2014-03-01 634 @ 2014-02-01 469 @ 2014-02-01 customer2 @ 2014-03-01! customer4 @ 2014-01-01 © Ambiata 2014

Slide 18

Slide 18 text

customer-2 @ 2014-03-01 customer-4 @ 2014-01-01 gender sav.balance cc.purchases postcode ‘M’ 312 6 ‘4670’ ‘3001’ 1876 © Ambiata 2014
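A chord differs from a snapshot only in that each entity carries its own date. A minimal sketch, assuming a hypothetical `chord` function that takes a mapping from entity to date and skips entities not listed in it; the fact values are illustrative and the inclusive date comparison is an assumption.

```python
from datetime import date

# (entity, attribute, value, time) tuples; illustrative values only.
FACTS = [
    ("customer-1", "sav.balance", 634, date(2014, 2, 1)),
    ("customer-2", "sav.balance", 184, date(2014, 2, 1)),
    ("customer-2", "sav.balance", 312, date(2014, 3, 1)),
    ("customer-4", "postcode", "3001", date(2011, 9, 14)),
    ("customer-4", "sav.balance", 1966, date(2014, 3, 1)),
]


def chord(facts, times):
    """Latest value per (entity, attribute), using each entity's own date."""
    latest = {}
    for entity, attribute, value, time in facts:
        at = times.get(entity)
        if at is None or time > at:
            continue  # entity not in the chord, or fact is too recent
        key = (entity, attribute)
        if key not in latest or time > latest[key][0]:
            latest[key] = (time, value)
    return {key: value for key, (_, value) in latest.items()}


view = chord(FACTS, {
    "customer-2": date(2014, 3, 1),
    "customer-4": date(2014, 1, 1),
})
```

With these inputs the result holds 312 for customer-2's `sav.balance` and "3001" for customer-4's `postcode`. Customer-4's balance of 1966 is excluded because it is dated after that entity's chord date.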

Slide 19

Slide 19 text

Ivory concepts

Slide 20

Slide 20 text

Repository
• A single class of “entity” (e.g. customer)
• Single on-disk hierarchy (e.g. HDFS)
• Stores facts
• Stores dictionary of fact attributes
• Tools to get data in and out (e.g. MapReduce)
• Versioned

Slide 21

Slide 21 text

Repository: hdfs:///prod/ivory/customers (single on-disk hierarchy)
Versions (repository / dictionary / store): - / - / -

Slide 22

Slide 22 text

Dictionary
• Description and metadata of all attributes that can be ascribed to facts in the repository
• Versioned

Slide 23

Slide 23 text

namespace | name | encoding | type | description
demographics | gender | string | categorical | Gender
demographics | postcode | string | categorical | Post-code, zip-code
accounts | sav.balance | double | numerical | Balance of savings account
accounts | cc.purchases | int | numerical | Number of credit-card purchases
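One use of such a dictionary is checking that an incoming value matches the declared encoding of its attribute. The sketch below assumes a hypothetical `is_valid` helper and a simple mapping from encoding names to Python types; the slides do not show how Ivory itself performs validation.

```python
# (namespace, name) -> (encoding, type), copied from the dictionary above.
DICTIONARY = {
    ("demographics", "gender"): ("string", "categorical"),
    ("demographics", "postcode"): ("string", "categorical"),
    ("accounts", "sav.balance"): ("double", "numerical"),
    ("accounts", "cc.purchases"): ("int", "numerical"),
}

# Assumed mapping from dictionary encodings to Python types.
ENCODINGS = {"string": str, "double": float, "int": int}


def is_valid(namespace, name, value):
    """True if the attribute is in the dictionary and the value fits its encoding."""
    entry = DICTIONARY.get((namespace, name))
    if entry is None:
        return False  # attribute is not described by the dictionary
    encoding, _kind = entry
    return isinstance(value, ENCODINGS[encoding])
```

For example, `is_valid("accounts", "cc.purchases", 4)` holds, while a string value for the same attribute or an attribute missing from the dictionary is rejected.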

Slide 24

Slide 24 text

Repository: hdfs:///prod/ivory/customers
Versions (repository / dictionary / store): - / - / -
Action: import dictionary (dict)

Slide 25

Slide 25 text

Repository: hdfs:///prod/ivory/customers
Versions (repository / dictionary / store): 0 / 0 / -
Dictionaries: dict[0]

Slide 26

Slide 26 text

Repository: hdfs:///prod/ivory/customers
Versions (repository / dictionary / store): 0 / 0 / -
Dictionaries: dict[0]
Action: update dictionary (dict)

Slide 27

Slide 27 text

Repository: hdfs:///prod/ivory/customers
Versions (repository / dictionary / store): 1 / 1 / -
Dictionaries: dict[0], dict[1] (latest dictionary version)

Slide 28

Slide 28 text

Factsets
• A collection of facts
• Treated as a single atomic unit
• The unit of inclusion (ingestion)
• The unit of exclusion

Slide 29

Slide 29 text

customer-1 accounts:sav.balance 634 2014-02-01
customer-2 accounts:sav.balance 184 2014-02-01
customer-2 accounts:sav.balance 312 2014-03-01
customer-2 accounts:cc.purchases 4 2014-02-04
customer-3 demographics:gender F 2007-04-01
customer-4 demographics:postcode 3001 2011-03-14
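Lines of this shape can be split into entity, namespace, attribute name, value and date. This is a rough sketch that assumes whitespace-separated fields exactly as they appear on the slide, and values that contain no spaces. The name `parse_fact` is hypothetical, and the real Ivory ingestion format may use a different delimiter.

```python
from datetime import date


def parse_fact(line):
    """Split one fact line into (entity, namespace, name, value, time)."""
    entity, attribute, value, day = line.split()
    namespace, name = attribute.split(":", 1)
    # The value is kept as text; interpreting it needs the dictionary encoding.
    return (entity, namespace, name, value, date.fromisoformat(day))


parsed = parse_fact("customer-1 accounts:sav.balance 634 2014-02-01")
```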

Slide 30

Slide 30 text

Repository: hdfs:///prod/ivory/customers
Versions (repository / dictionary / store): 1 / 1 / -
Dictionaries: dict[0], dict[1]
Action: ingest factset

Slide 31

Slide 31 text

Repository: hdfs:///prod/ivory/customers
Versions (repository / dictionary / store): 1 / 1 / -
Dictionaries: dict[0], dict[1]
Factsets: factset-a

Slide 32

Slide 32 text

Stores
• A prioritised sequence of factsets
• Versioned

Slide 33

Slide 33 text

Repository: hdfs:///prod/ivory/customers
Versions (repository / dictionary / store): 2 / 1 / 0
Dictionaries: dict[0], dict[1]
Factsets: factset-a
Stores: store[0] { a } (store references factset-a)

Slide 34

Slide 34 text

Repository: hdfs:///prod/ivory/customers
Versions (repository / dictionary / store): 2 / 1 / 0
Dictionaries: dict[0], dict[1]
Factsets: factset-a
Stores: store[0] { a }
Action: ingest factset

Slide 35

Slide 35 text

Repository: hdfs:///prod/ivory/customers
Versions (repository / dictionary / store): 3 / 1 / 1
Dictionaries: dict[0], dict[1]
Factsets: factset-a, factset-b
Stores: store[0] { a }, store[1] { b, a }

Slide 36

Slide 36 text

Repository: hdfs:///prod/ivory/customers
Versions (repository / dictionary / store): 3 / 1 / 1
Dictionaries: dict[0], dict[1]
Factsets: factset-a, factset-b
Stores: store[0] { a }, store[1] { b, a }
Action: ingest factset

Slide 37

Slide 37 text

Repository: hdfs:///prod/ivory/customers
Versions (repository / dictionary / store): 4 / 1 / 2
Dictionaries: dict[0], dict[1]
Factsets: factset-a, factset-b, factset-c
Stores: store[0] { a }, store[1] { b, a }, store[2] { c, b, a }

Slide 38

Slide 38 text

Repository: hdfs:///prod/ivory/customers
Versions (repository / dictionary / store): 4 / 1 / 2
Dictionaries: dict[0], dict[1]
Factsets: factset-a, factset-b, factset-c
Stores: store[0] { a }, store[1] { b, a }, store[2] { c, b, a }
Action: remove factset-b

Slide 39

Slide 39 text

Repository: hdfs:///prod/ivory/customers
Versions (repository / dictionary / store): 5 / 1 / 3
Dictionaries: dict[0], dict[1]
Factsets: factset-a, factset-b, factset-c
Stores: store[0] { a }, store[1] { b, a }, store[2] { c, b, a }, store[3] { c, a }
factset-b removed
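The sequence of store versions shown on the preceding slides can be reproduced with a small sketch in which every operation appends a new store version and never edits an earlier one. The functions `ingest` and `remove` are hypothetical names for this illustration. Placing the newest factset first mirrors the ordering { c, b, a } on the slides; the slides do not spell out how that priority is used.

```python
def ingest(stores, factset):
    """Return the store history plus a new version with factset placed first."""
    latest = stores[-1] if stores else ()
    return stores + [(factset,) + latest]


def remove(stores, factset):
    """Return the store history plus a new version that omits factset."""
    latest = stores[-1]
    return stores + [tuple(f for f in latest if f != factset)]


stores = []
for name in ("a", "b", "c"):
    stores = ingest(stores, name)   # store[0], store[1], store[2]
stores = remove(stores, "b")        # store[3]
```

After these steps `stores` holds four versions: ("a",), ("b", "a"), ("c", "b", "a") and ("c", "a"). Earlier versions remain available, so going back to a previous state only means reading an older store.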

Slide 40

Slide 40 text

Internals

Slide 41

Slide 41 text

• On-disk hierarchy
• On-disk EAVT representation, including compression
• Internal snapshots

Slide 42

Slide 42 text

User experience
• Consistent command line tooling
• Version metadata
• Workflows
• End-to-end example system

Slide 43

Slide 43 text

Extensible dictionary
• Support for rich attribute types (e.g. structs, arrays)
• Arbitrary attribute metadata
• Specification of valid attribute values, e.g. ‘M’ and ‘F’ only for gender
• Improved validation
• Improved on-disk representation
• Useful for downstream applications, e.g. plots

Slide 44

Slide 44 text

Lazy feature generation
• Lazily generate features derived from existing facts on extract (chord/snapshot)
• Derived “meta” features (i.e. ‘select’)
• Windowing functions (e.g. “average over last 3 months”)
• Row-level features
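A windowing function such as "average over last 3 months" could be computed from stored facts at extraction time, roughly as sketched below. The function name `window_average`, the approximation of three months as 90 days, and the fact values are all assumptions made for illustration.

```python
from datetime import date, timedelta

# (entity, attribute, value, time) tuples; illustrative values only.
FACTS = [
    ("customer-2", "sav.balance", 100, date(2013, 6, 1)),
    ("customer-2", "sav.balance", 184, date(2014, 2, 1)),
    ("customer-2", "sav.balance", 312, date(2014, 3, 1)),
]


def window_average(facts, entity, attribute, at, days=90):
    """Mean of an attribute's values in the window (at - days, at]."""
    start = at - timedelta(days=days)
    values = [
        value
        for e, a, value, time in facts
        if e == entity and a == attribute and start < time <= at
    ]
    # No facts in the window means the derived feature is absent (sparse).
    return sum(values) / len(values) if values else None


average = window_average(FACTS, "customer-2", "sav.balance", date(2014, 3, 1))
```

Here `average` is 248.0, the mean of 184 and 312; the 2013 fact falls outside the window.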

Slide 45

Slide 45 text

[Diagram: Data sets A-E; Feature Store (Ivory); Feature Eng (Ivory); Train / Model / Score for Models 1-3]
Modelling in silos - models built and deployed in isolation
Source data loaded directly
Feature engineering is integrated and lazily generated on extraction

Slide 46

Slide 46 text

Repository forking
• Low-cost cloning/forking of repositories:
  • “master” production repo
  • “experimental” cloned repo
• Allow a data scientist to join production features with their own without affecting production operations

Slide 47

Slide 47 text

Other filesystems
• Support for repository metadata and fact sets to be on different file systems
• Support HDFS, S3 and POSIX