Slide 1

Slide 1 text

IVORY http://github.com/ambiata/ivory © Ambiata 2014

Slide 2

Slide 2 text

DATA SCIENCE IN THE REAL WORLD

Slide 3

Slide 3 text

THE END GAME

Slide 4

Slide 4 text

Feature vectors (rows are instances, columns are features)

instance  gender  balance  purchases  zipcode  prop_online  num_accs
89340218  M       0.00     3          3001     1.00         1
48149407  -       634.83   16         4670     0.6875       3
18452274  F       15.12    2          -        0.50         1
07499337  M       33.56    2          -        1.00         1
62948721  F       98.34    12         3303     0.8333       4
93754723  -       523.81   23         2046     0.4782       1
00272446  F       1086.05  17         -        1.00         2
13374497  F       224.81   9          -        0.2222       1
31989993  M       78.21    2          2134     0.50         1
46474236  -       126.48   4          -        0.0          1

Slide 5

Slide 5 text

WHAT WE START WITH

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

THE REALITY
Feature preparation: 85%
Modelling: 15%

Slide 8

Slide 8 text

A typical workflow

Diagram: Data set B, Data set C, Data set D -> Feature Eng -> Train Model -> Score

Multiple data sources:
• Transaction logs
• Database snapshots
• Segmentation models
• 3rd-party data

Feature engineering:
• Data sources are joined
• Instances are created
• Features are engineered

The cool stuff:
• Models are built
• Instances are scored

Slide 9

Slide 9 text

Multiple data science projects

Diagram: Data sets A-E feed three separate pipelines (Feature Eng 1, Feature Eng 2, Feature Eng 3), each followed by its own Train and Score steps.

Feature engineering is in a silo - no reuse between model builds

Slide 10

Slide 10 text

The value of a shared asset

Diagram: Data sets A-E each pass through their own feature engineering step (Feature Eng A-E) into a shared Feature Store, which feeds Train Model 1, Train Model 2 and Train Model 3, each followed by Score.

• Feature engineering is done once
• Features are reused across model builds

Slide 11

Slide 11 text

IVORY
A scalable and extensible data store for storing facts and extracting features

Slide 12

Slide 12 text

MOTIVATIONS
• Features for model training + scoring
• Scalability - lots of features
• Extensible feature sets
• Feature sparseness
• Feature integrity in a messy “big data” world
• Recovering from operational errors

Slide 13

Slide 13 text

KEY IDEAS
• A sparse fact data model
• Specific entity views to extract features - snapshots + chords
• No moving parts - just files
• Scale achieved by using HDFS or S3 as a backing store
• Immutability built-in

Slide 14

Slide 14 text

“FACT” DATA MODEL

Slide 15

Slide 15 text

A single “fact”: customer-1 balance 634 @ 2014-02-01

Fact: Entity - Attribute - Value - Time

The value of a feature (attribute) for a given entity, known to be valid from a certain point in time.
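
A minimal sketch of this fact model in Python, assuming a fact can be held as an immutable four-field record. The class name `Fact` and its field names are illustrative assumptions, not Ivory's actual types.

```python
from dataclasses import dataclass
from datetime import date
from typing import Union


@dataclass(frozen=True)  # frozen: a fact is never modified once recorded
class Fact:
    entity: str                    # e.g. "customer-1"
    attribute: str                 # e.g. "balance"
    value: Union[str, int, float]  # value of the attribute
    time: date                     # the value is valid from this date


# The single fact shown on the slide.
fact = Fact("customer-1", "balance", 634, date(2014, 2, 1))
```

Making the record frozen mirrors the "immutability built-in" idea: a change is expressed as a new fact with a later time, not as an update.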

Slide 16

Slide 16 text

Scalable

Entities: customer-1, customer-2, customer-3, customer-4
Attribute: balance
Facts: 634 @ 2014-02-01 (customer-1), 469 @ 2014-02-01, 276 @ 2014-04-01, 1966 @ 2014-03-01

Slide 17

Slide 17 text

Extensible

Entities: customer-1, customer-2, customer-3, customer-4
Attributes: gender, balance, purchases, zipcode
Facts: 634 @ 2014-02-01, 469 @ 2014-02-01, 276 @ 2014-04-01, 1966 @ 2014-03-01, ‘M’ @ 2012-01-01, 3 @ 2014-03-27, ‘4670’ @ 2009-05-13

Slide 18

Slide 18 text

Sparse

Entities: customer-1, customer-2, customer-3, customer-4
Attributes: gender, sav.balance, cc.purchases, postcode
Facts: 736 @ 2014-01-01, 3 @ 2014-02-19, 184 @ 2014-02-01, 312 @ 2014-03-01, ‘M’ @ 2012-01-01, 276 @ 2014-04-01, 4 @ 2014-04-04, 2 @ 2014-03-12, 3 @ 2014-03-27, ‘2381’ @ 2004-08-19, ‘4670’ @ 2009-05-13, ‘F’ @ 2007-04-01, ‘3001’ @ 2011-09-14, 1876 @ 2014-02-01, 1966 @ 2014-03-01, 634 @ 2014-02-01, 469 @ 2014-02-01

Slide 19

Slide 19 text

FEATURE EXTRACTION

Slide 20

Slide 20 text

SNAPSHOTS
• Attribute values for entities at a point in time
• Same time for all entities
• Select latest attribute values with respect to time
• Used in preparing instances for scoring
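
The snapshot rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Ivory's implementation: the `Fact` tuple and the `snapshot` function are hypothetical names, and a fact dated exactly on the snapshot date is assumed to be included. The sample facts are taken from the factset example later in the deck.

```python
from datetime import date
from typing import Dict, Iterable, NamedTuple, Tuple, Union


class Fact(NamedTuple):
    entity: str
    attribute: str
    value: Union[str, int, float]
    time: date


def snapshot(facts: Iterable[Fact], at: date) -> Dict[Tuple[str, str], object]:
    """Latest value per (entity, attribute), using one date for all entities."""
    latest: Dict[Tuple[str, str], Fact] = {}
    for f in facts:
        if f.time > at:
            continue  # the fact is not yet valid at the snapshot date
        key = (f.entity, f.attribute)
        if key not in latest or f.time > latest[key].time:
            latest[key] = f
    return {key: f.value for key, f in latest.items()}


facts = [
    Fact("customer-1", "sav.balance", 634, date(2014, 2, 1)),
    Fact("customer-2", "sav.balance", 184, date(2014, 2, 1)),
    Fact("customer-2", "sav.balance", 312, date(2014, 3, 1)),
    Fact("customer-3", "gender", "F", date(2007, 4, 1)),
]

mid_february = snapshot(facts, date(2014, 2, 15))
start_of_march = snapshot(facts, date(2014, 3, 1))
```

With these facts, the mid-February snapshot sees the 184 balance for customer-2, while the snapshot on 2014-03-01 sees 312. Attributes with no fact on or before the date are simply absent, which keeps the result sparse.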

Slide 21

Slide 21 text

Snapshot @ 2014-03-01

Entities: customer-1, customer-2, customer-3, customer-4
Attributes: gender, sav.balance, cc.purchases, postcode
Facts: 736 @ 2014-01-01, 6 @ 2014-02-19, 184 @ 2014-02-01, 312 @ 2014-03-01, ‘M’ @ 2012-01-01, 276 @ 2014-04-01, 4 @ 2014-02-04, 2 @ 2014-03-12, 3 @ 2014-03-27, ‘2381’ @ 2004-08-19, ‘4670’ @ 2009-05-13, ‘F’ @ 2007-04-01, ‘3001’ @ 2011-09-14, 1876 @ 2014-02-01, 1966 @ 2014-03-01, 634 @ 2014-02-01, 469 @ 2014-02-01

Slide 22

Slide 22 text

Snapshot result

Entities: customer-1, customer-2, customer-3, customer-4
Attributes: gender, sav.balance, cc.purchases, postcode
Values: ‘M’, 312, 4, ‘4670’, ‘F’, ‘3001’, 1966, 634, 469

Slide 23

Slide 23 text

CHORDS
• Attribute values for entities at a point in time
• Different times for different entities
• Select latest attribute values with respect to time
• Used in preparing instances for training
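
A chord differs from a snapshot only in that each entity carries its own date. The sketch below illustrates this in Python under stated assumptions: `Fact` and `chord` are hypothetical names, entities without a requested date are left out, and a fact dated exactly on the requested date is included. The sample facts are taken from the factset example later in the deck.

```python
from datetime import date
from typing import Dict, Iterable, Mapping, NamedTuple, Tuple, Union


class Fact(NamedTuple):
    entity: str
    attribute: str
    value: Union[str, int, float]
    time: date


def chord(facts: Iterable[Fact],
          times: Mapping[str, date]) -> Dict[Tuple[str, str], object]:
    """Latest value per (entity, attribute), using a separate date per entity."""
    latest: Dict[Tuple[str, str], Fact] = {}
    for f in facts:
        at = times.get(f.entity)
        if at is None or f.time > at:
            continue  # entity not requested, or fact not yet valid for it
        key = (f.entity, f.attribute)
        if key not in latest or f.time > latest[key].time:
            latest[key] = f
    return {key: f.value for key, f in latest.items()}


facts = [
    Fact("customer-1", "sav.balance", 634, date(2014, 2, 1)),
    Fact("customer-2", "sav.balance", 184, date(2014, 2, 1)),
    Fact("customer-2", "sav.balance", 312, date(2014, 3, 1)),
    Fact("customer-2", "cc.purchases", 4, date(2014, 2, 4)),
]

view = chord(facts, {
    "customer-2": date(2014, 3, 1),
    "customer-1": date(2014, 1, 1),
})
```

Here customer-2 is viewed as of 2014-03-01 and customer-1 as of 2014-01-01, so customer-1 contributes nothing because its only fact is dated later. Per-entity dates are what make chords suitable for training: each instance is seen as it was at its own reference point.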

Slide 24

Slide 24 text

Chord: customer-2 @ 2014-03-01, customer-4 @ 2014-01-01

Entities: customer-1, customer-2, customer-3, customer-4
Attributes: gender, sav.balance, cc.purchases, postcode
Facts: 736 @ 2014-01-01, 6 @ 2014-02-19, 184 @ 2014-02-01, 312 @ 2014-03-01, ‘M’ @ 2012-01-01, 276 @ 2014-04-01, 4 @ 2014-04-04, 2 @ 2014-03-12, 3 @ 2014-03-27, ‘2381’ @ 2004-08-19, ‘4670’ @ 2009-05-13, ‘F’ @ 2007-04-01, ‘3001’ @ 2011-09-14, 1876 @ 2014-02-01, 1966 @ 2014-03-01, 634 @ 2014-02-01, 469 @ 2014-02-01

Slide 25

Slide 25 text

Chord result

Rows: customer-2 @ 2014-03-01, customer-4 @ 2014-01-01
Attributes: gender, sav.balance, cc.purchases, postcode
Values: ‘M’, 312, 6, ‘4670’, ‘3001’, 1876

Slide 26

Slide 26 text

IVORY CONCEPTS

Slide 27

Slide 27 text

REPOSITORY
• A single class of “entity” (e.g. customer)
• Stores facts
• Stores dictionary of fact attributes
• Versioned

Slide 28

Slide 28 text

FACTSETS
• A collection of facts
• Treated as a single atomic unit
• The unit of inclusion (ingestion)
• The unit of exclusion
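
One way to picture a factset as the unit of inclusion and exclusion is sketched below in Python. The names `Factset` and `visible_facts` and the identifiers `factset-1` and `factset-2` are hypothetical; how Ivory actually stores and excludes factsets is not described on the slide.

```python
from typing import FrozenSet, List, NamedTuple, Sequence, Tuple

# (entity, attribute, value, date) as plain tuples, for brevity.
FactRow = Tuple[str, str, object, str]


class Factset(NamedTuple):
    factset_id: str             # hypothetical identifier
    facts: Tuple[FactRow, ...]  # ingested together, never edited afterwards


def visible_facts(factsets: Sequence[Factset],
                  excluded: FrozenSet[str] = frozenset()) -> List[FactRow]:
    """All facts from the factsets that have not been excluded as a whole."""
    return [row
            for fs in factsets
            if fs.factset_id not in excluded
            for row in fs.facts]


first = Factset("factset-1", (
    ("customer-1", "accounts:sav.balance", 634, "2014-02-01"),
))
second = Factset("factset-2", (
    ("customer-2", "accounts:sav.balance", 184, "2014-02-01"),
    ("customer-2", "accounts:sav.balance", 312, "2014-03-01"),
))

everything = visible_facts([first, second])
without_second = visible_facts([first, second], frozenset({"factset-2"}))
```

Excluding a whole factset, rather than deleting individual facts, is one way a bad ingestion could be rolled back without touching any stored data.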

Slide 29

Slide 29 text

customer-1  accounts:sav.balance    634   2014-02-01
customer-2  accounts:sav.balance    184   2014-02-01
customer-2  accounts:sav.balance    312   2014-03-01
customer-2  accounts:cc.purchases   4     2014-02-04
customer-3  demographics:gender     F     2007-04-01
customer-4  demographics:postcode   3001  2011-03-14

Slide 30

Slide 30 text

DICTIONARY
• Description and metadata of all attributes that can be ascribed to facts in the repository
• Versioned

Slide 31

Slide 31 text

namespace     name          encoding  type         description
demographics  gender        string    categorical  Gender
demographics  postcode      string    categorical  Post-code, zip-code
accounts      sav.balance   double    numerical    Balance of savings account
accounts      cc.purchases  int       numerical    Number of credit-card purchases
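
As an illustration of how a dictionary like this could be used at ingestion time, the Python sketch below looks up an attribute's encoding and parses a raw text value accordingly. The `DICTIONARY` mapping copies the four entries above; the `parse_value` helper and the mapping from encoding names to Python types are assumptions made for this example.

```python
from typing import Callable, Dict, Tuple, Union

# (namespace, name) -> encoding, copied from the dictionary slide.
DICTIONARY: Dict[Tuple[str, str], str] = {
    ("demographics", "gender"): "string",
    ("demographics", "postcode"): "string",
    ("accounts", "sav.balance"): "double",
    ("accounts", "cc.purchases"): "int",
}

# Assumed mapping from encoding names to Python parsers.
PARSERS: Dict[str, Callable[[str], Union[str, int, float]]] = {
    "string": str,
    "int": int,
    "double": float,
}


def parse_value(attribute: str, raw: str) -> Union[str, int, float]:
    """Parse a raw value for a 'namespace:name' attribute.

    Raises KeyError for an attribute missing from the dictionary and
    ValueError for a value that does not match the declared encoding.
    """
    namespace, name = attribute.split(":", 1)
    encoding = DICTIONARY[(namespace, name)]
    return PARSERS[encoding](raw)


balance = parse_value("accounts:sav.balance", "634")
purchases = parse_value("accounts:cc.purchases", "4")
postcode = parse_value("demographics:postcode", "3001")
```

Keeping the postcode as a string follows the dictionary, which declares it categorical, so it is never treated as a number.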

Slide 32

Slide 32 text

Repository versions: 0, 1, 2, 3, 4, 5
Operations, in order: create repository, import dictionary, ingest factset, ingest factset, import dictionary, ingest factset
Extractions: snapshot, snapshot, chord
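
One reading of this timeline is that every operation produces a new, immutable repository version, so an extraction can be run against any earlier version. The Python sketch below illustrates that reading only; `RepoState`, the function names, and the factset identifiers are hypothetical, and the one-version-per-operation rule is an assumption.

```python
from typing import List, NamedTuple, Tuple


class RepoState(NamedTuple):
    dictionary_revision: int   # hypothetical counter of dictionary imports
    factsets: Tuple[str, ...]  # identifiers of the factsets ingested so far


def create_repository() -> List[RepoState]:
    """Version 0: no dictionary imported, no factsets."""
    return [RepoState(0, ())]


def import_dictionary(history: List[RepoState]) -> List[RepoState]:
    latest = history[-1]
    new = RepoState(latest.dictionary_revision + 1, latest.factsets)
    return history + [new]  # earlier versions are left untouched


def ingest_factset(history: List[RepoState], factset_id: str) -> List[RepoState]:
    latest = history[-1]
    new = RepoState(latest.dictionary_revision, latest.factsets + (factset_id,))
    return history + [new]


# The six operations from the timeline, producing versions 0 to 5.
history = create_repository()
history = import_dictionary(history)
history = ingest_factset(history, "factset-1")
history = ingest_factset(history, "factset-2")
history = import_dictionary(history)
history = ingest_factset(history, "factset-3")
```

Because old versions are never modified, an extraction taken at version 2 would give the same answer later, which helps when recovering from operational errors.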

Slide 33

Slide 33 text

VIRTUAL FEATURES

Slide 34

Slide 34 text

Diagram: Data sets A-E are loaded into a Fact Store, which feeds three Train and Score pipelines.

Source data loaded directly
Feature engineering is integrated and lazily generated on extraction

Slide 35

Slide 35 text

LAZY FEATURE GENERATION
• Lazily generate features derived from existing facts on extract (chord/snapshot)
• Derived “meta” features (i.e. ‘select’)
• Windowing functions (e.g. “average over last 3 months”)
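
A windowing function such as "average over last 3 months" could be computed at extraction time instead of being stored. The Python sketch below is illustrative only: `Fact` and `windowed_mean` are hypothetical names, and three months is approximated as a 90-day window. The sample facts come from the factset example earlier in the deck.

```python
from datetime import date, timedelta
from typing import Iterable, NamedTuple, Optional


class Fact(NamedTuple):
    entity: str
    attribute: str
    value: float
    time: date


def windowed_mean(facts: Iterable[Fact], entity: str, attribute: str,
                  at: date, window: timedelta) -> Optional[float]:
    """Mean of the attribute's values with start < time <= at, or None."""
    start = at - window
    values = [f.value for f in facts
              if f.entity == entity
              and f.attribute == attribute
              and start < f.time <= at]
    return sum(values) / len(values) if values else None


facts = [
    Fact("customer-2", "sav.balance", 184, date(2014, 2, 1)),
    Fact("customer-2", "sav.balance", 312, date(2014, 3, 1)),
]

three_months = timedelta(days=90)  # assumed approximation of three months
march_mean = windowed_mean(facts, "customer-2", "sav.balance",
                           date(2014, 3, 1), three_months)
february_mean = windowed_mean(facts, "customer-2", "sav.balance",
                              date(2014, 2, 15), three_months)
```

Computing the value on extraction means the window is always measured back from the snapshot or chord date, so the same stored facts can serve any extraction date.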

Slide 36

Slide 36 text

ROADMAP

Slide 37

Slide 37 text

REPOSITORY FORKING
• Low-cost cloning/forking of repositories:
  • “master” production repo
  • “experimental” cloned repo
• Allow a data scientist to join production features with their own without affecting production operations
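
Since this is a roadmap item, the following Python sketch only illustrates why forking can be cheap when factsets are immutable: a fork can share the existing factsets by reference and add its own. All names here, including the `experiments:risk_score` attribute, are hypothetical.

```python
from typing import NamedTuple, Tuple


class Factset(NamedTuple):
    factset_id: str
    facts: Tuple[tuple, ...]  # (entity, attribute, value, date) rows


class Repository(NamedTuple):
    name: str
    factsets: Tuple[Factset, ...]


def fork(repo: Repository, name: str) -> Repository:
    """Clone a repository by sharing its immutable factsets; nothing is copied."""
    return Repository(name, repo.factsets)


def ingest(repo: Repository, factset: Factset) -> Repository:
    """Return a new repository value that also contains the given factset."""
    return repo._replace(factsets=repo.factsets + (factset,))


production = Factset("prod-1", (
    ("customer-1", "accounts:sav.balance", 634, "2014-02-01"),
))
master = Repository("master", (production,))

# A data scientist forks master and adds an experimental feature.
own_features = Factset("exp-1", (
    ("customer-1", "experiments:risk_score", 0.7, "2014-03-01"),
))
experimental = ingest(fork(master, "experimental"), own_features)
```

The experimental repository sees both the production facts and the new ones, while `master` still contains only its original factset.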