Ivory - An Introduction

Ivory http://github.com/ambiata/ivory © Ambiata 2014

Ivory - A data store for features

[Diagram: Data sets A to E feed three separate pipelines (Feature Eng 1, 2 and 3), each followed by its own Train Model and Score steps (Models 1, 2 and 3).]
Modelling in silos - models built and deployed in isolation
Feature engineering is in a silo - no reuse between model builds

[Diagram: Data sets A to E feed Feature Eng A to E, which load into the Feature Store (Ivory). Train Model and Score steps for Models 1, 2 and 3 draw on the store.]
Modelling in silos - models built and deployed in isolation
Feature engineering is done once
Features are reused across model builds

Motivations
• Features for model training + scoring
• Scalability - lots of features
• Extensible feature sets
• Feature sparseness
• Feature integrity in a messy “big data” world
• Recovering from operational errors

Key ideas
• A sparse fact data model
• Specific entity views - snapshots + chords
• No moving parts - just files + tools
• Immutability built-in

“Fact” data model

customer-1 sav.balance 634 @ 2014-02-01 (a single “fact”)
Fact: Entity - Attribute - Value - Time
The value of a feature (attribute) for a given entity known to be valid from a certain point in time.
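The Entity - Attribute - Value - Time shape can be sketched as a small immutable record. This is only an illustration of the data model, not Ivory's actual representation; the class name `Fact` and its field names are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Union

# Values in the slides are strings or numbers; this alias is an assumption.
Value = Union[str, int, float]


@dataclass(frozen=True)
class Fact:
    """One fact: the value of an attribute for an entity, valid from `time`."""
    entity: str
    attribute: str
    value: Value
    time: date


# The single fact shown on the slide.
fact = Fact("customer-1", "sav.balance", 634, date(2014, 2, 1))
```

Freezing the record mirrors the "immutability built-in" idea: a fact is never edited, only superseded by a later fact.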

Scalable
[Diagram: one column, sav.balance, with rows customer-1 to customer-4. Facts shown: 634 @ 2014-02-01 (customer-1), 469 @ 2014-02-01, 276 @ 2014-04-01, 1966 @ 2014-03-01.]

Extensible
[Diagram: rows customer-1 to customer-4; columns gender, sav.balance, cc.purchases, postcode. Facts shown: 634 @ 2014-02-01, 469 @ 2014-02-01, 276 @ 2014-04-01, 1966 @ 2014-03-01, ‘M’ @ 2012-01-01, 3 @ 2014-03-27, ‘4670’ @ 2009-05-13.]

Sparse
[Diagram: rows customer-1 to customer-4; columns gender, sav.balance, cc.purchases, postcode. Some cells hold several facts, others none. Facts shown:
gender: ‘M’ @ 2012-01-01, ‘F’ @ 2007-04-01
sav.balance: 736 @ 2014-01-01, 184 @ 2014-02-01, 312 @ 2014-03-01, 276 @ 2014-04-01, 1876 @ 2014-02-01, 1966 @ 2014-03-01, 634 @ 2014-02-01, 469 @ 2014-02-01
cc.purchases: 3 @ 2014-02-19, 4 @ 2014-04-04, 2 @ 2014-03-12, 3 @ 2014-03-27
postcode: ‘2381’ @ 2004-08-19, ‘4670’ @ 2009-05-13, ‘3001’ @ 2011-09-14]

Entity views

Snapshot views
• Attribute values for entities at a point in time
• Same time for all entities
• Select latest attribute values with respect to time
• Used in preparing instances for scoring

Snapshot @ 2014-03-01
[Diagram: the fact grid (rows customer-1 to customer-4; columns gender, sav.balance, cc.purchases, postcode) with a snapshot taken at 2014-03-01. Facts shown:
gender: ‘M’ @ 2012-01-01, ‘F’ @ 2007-04-01
sav.balance: 736 @ 2014-01-01, 184 @ 2014-02-01, 312 @ 2014-03-01, 276 @ 2014-04-01, 1876 @ 2014-02-01, 1966 @ 2014-03-01, 634 @ 2014-02-01, 469 @ 2014-02-01
cc.purchases: 6 @ 2014-02-19, 4 @ 2014-02-04, 2 @ 2014-03-12, 3 @ 2014-03-27
postcode: ‘2381’ @ 2004-08-19, ‘4670’ @ 2009-05-13, ‘3001’ @ 2011-09-14]

[Diagram: the resulting snapshot, with rows customer-1 to customer-4 and columns gender, sav.balance, cc.purchases, postcode. Values shown: ‘M’, 312, 4, ‘4670’, ‘F’, ‘3001’, 1966, 634, 469.]
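The snapshot rule (one date for every entity, latest value at or before that date) can be sketched as follows. This is a minimal in-memory illustration, not Ivory's implementation; the function name `snapshot`, the tuple layout and the sample facts are assumptions made for this example.

```python
from datetime import date
from typing import Dict, List, Tuple

# (entity, attribute, value, valid-from date); illustrative data only.
Fact = Tuple[str, str, object, date]

FACTS: List[Fact] = [
    ("customer-1", "sav.balance", 634, date(2014, 2, 1)),
    ("customer-2", "sav.balance", 184, date(2014, 2, 1)),
    ("customer-2", "sav.balance", 312, date(2014, 3, 1)),
    ("customer-2", "cc.purchases", 4, date(2014, 2, 4)),
    ("customer-3", "gender", "F", date(2007, 4, 1)),
    ("customer-4", "postcode", "3001", date(2011, 3, 14)),
]


def snapshot(facts: List[Fact], at: date) -> Dict[Tuple[str, str], object]:
    """Latest value per (entity, attribute), ignoring facts dated after `at`."""
    latest: Dict[Tuple[str, str], Tuple[date, object]] = {}
    for entity, attribute, value, time in facts:
        if time > at:
            continue  # not yet known at the snapshot date
        key = (entity, attribute)
        # Keep the most recent fact; on equal dates the later row wins.
        if key not in latest or time >= latest[key][0]:
            latest[key] = (time, value)
    return {key: value for key, (_, value) in latest.items()}


view = snapshot(FACTS, date(2014, 3, 1))
```

Here `view[("customer-2", "sav.balance")]` is 312, because the 2014-03-01 fact supersedes the 2014-02-01 one; a snapshot at 2014-02-15 would return 184 instead.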

Chord views
• Attribute values for entities at a point in time
• Different times for different entities
• Select latest attribute values with respect to time
• Used in preparing instances for training

Chord: customer-2 @ 2014-03-01, customer-4 @ 2014-01-01
[Diagram: the fact grid (rows customer-1 to customer-4; columns gender, sav.balance, cc.purchases, postcode) with a chord taken at a different date per entity. Facts shown:
gender: ‘M’ @ 2012-01-01, ‘F’ @ 2007-04-01
sav.balance: 736 @ 2014-01-01, 184 @ 2014-02-01, 312 @ 2014-03-01, 276 @ 2014-04-01, 1876 @ 2014-02-01, 1966 @ 2014-03-01, 634 @ 2014-02-01, 469 @ 2014-02-01
cc.purchases: 6 @ 2014-02-19, 4 @ 2014-04-04, 2 @ 2014-03-12, 3 @ 2014-03-27
postcode: ‘2381’ @ 2004-08-19, ‘4670’ @ 2009-05-13, ‘3001’ @ 2011-09-14]

[Diagram: the resulting chord, with rows customer-2 @ 2014-03-01 and customer-4 @ 2014-01-01 and columns gender, sav.balance, cc.purchases, postcode. Values shown: ‘M’, 312, 6, ‘4670’, ‘3001’, 1876.]
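A chord differs from a snapshot only in that each entity carries its own cut-off date. The sketch below illustrates that rule in memory; it is not Ivory's implementation, and the function name `chord`, the tuple layout and the sample facts are assumptions made for this example. Only the two entity dates are taken from the slide.

```python
from datetime import date
from typing import Dict, List, Tuple

# (entity, attribute, value, valid-from date); illustrative data only.
Fact = Tuple[str, str, object, date]

FACTS: List[Fact] = [
    ("customer-1", "sav.balance", 634, date(2014, 2, 1)),
    ("customer-2", "sav.balance", 184, date(2014, 2, 1)),
    ("customer-2", "sav.balance", 312, date(2014, 3, 1)),
    ("customer-2", "cc.purchases", 4, date(2014, 2, 4)),
    ("customer-4", "postcode", "3001", date(2011, 3, 14)),
]


def chord(facts: List[Fact],
          entity_dates: Dict[str, date]) -> Dict[Tuple[str, str], object]:
    """Latest value per (entity, attribute) as of each entity's own date.

    Entities without a date in `entity_dates` are left out of the result.
    """
    latest: Dict[Tuple[str, str], Tuple[date, object]] = {}
    for entity, attribute, value, time in facts:
        cutoff = entity_dates.get(entity)
        if cutoff is None or time > cutoff:
            continue
        key = (entity, attribute)
        if key not in latest or time >= latest[key][0]:
            latest[key] = (time, value)
    return {key: value for key, (_, value) in latest.items()}


view = chord(FACTS, {
    "customer-2": date(2014, 3, 1),
    "customer-4": date(2014, 1, 1),
})
```

Per-entity dates are what make chords suitable for training: each instance sees only the facts known at the moment its outcome was observed.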

Ivory concepts

Repository
• A single class of “entity” (e.g. customer)
• Single on-disk hierarchy (e.g. HDFS)
• Stores facts
• Stores dictionary of fact attributes
• Tools to get data in and out (e.g. MapReduce)
• Versioned

[Diagram: the repository at hdfs:///prod/ivory/customers is a single on-disk hierarchy. Versions: repository -, dictionary -, store -.]

Dictionary
• Description and metadata of all attributes that can be ascribed to facts in the repository
• Versioned

namespace     name          encoding  type         description
demographics  gender        string    categorical  Gender
demographics  postcode      string    categorical  Post-code, zip-code
accounts      sav.balance   double    numerical    Balance of savings account
accounts      cc.purchases  int       numerical    Number of credit-card purchases
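One use of such a dictionary is to decode raw values according to each attribute's declared encoding. The sketch below holds the four rows above in a plain mapping; the `namespace:name` key form, the `Definition` class and `parse_value` are assumptions made for this example, not Ivory's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass(frozen=True)
class Definition:
    encoding: str     # "string", "int" or "double"
    type: str         # "categorical" or "numerical"
    description: str


# The four dictionary entries from the table, keyed as namespace:name.
DICTIONARY: Dict[str, Definition] = {
    "demographics:gender": Definition("string", "categorical", "Gender"),
    "demographics:postcode": Definition("string", "categorical", "Post-code, zip-code"),
    "accounts:sav.balance": Definition("double", "numerical", "Balance of savings account"),
    "accounts:cc.purchases": Definition("int", "numerical", "Number of credit-card purchases"),
}

# How each encoding is decoded from text (an assumption for this sketch).
DECODERS: Dict[str, Callable[[str], Union[str, int, float]]] = {
    "string": str,
    "int": int,
    "double": float,
}


def parse_value(feature_id: str, raw: str) -> Union[str, int, float]:
    """Decode `raw` using the attribute's encoding.

    Raises KeyError for an attribute missing from the dictionary and
    ValueError for a value that does not fit the encoding.
    """
    definition = DICTIONARY[feature_id]
    return DECODERS[definition.encoding](raw)
```

For example, `parse_value("accounts:cc.purchases", "4")` yields the integer 4, while `parse_value("accounts:cc.purchases", "four")` raises `ValueError`.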

[Diagram: hdfs:///prod/ivory/customers. Versions: repository -, dictionary -, store -. Action: import dictionary (dict).]

[Diagram: hdfs:///prod/ivory/customers contains dict[0]. Versions: repository 0, dictionary 0, store -.]

[Diagram: hdfs:///prod/ivory/customers contains dict[0]. Versions: repository 0, dictionary 0, store -. Action: update dictionary (dict).]

[Diagram: hdfs:///prod/ivory/customers contains dict[1] and dict[0]; dict[1] is the latest dictionary version. Versions: repository 1, dictionary 1, store -.]

Factsets
• A collection of facts
• Treated as a single atomic unit
• The unit of inclusion (ingestion)
• The unit of exclusion

customer-1  accounts:sav.balance   634   2014-02-01
customer-2  accounts:sav.balance   184   2014-02-01
customer-2  accounts:sav.balance   312   2014-03-01
customer-2  accounts:cc.purchases  4     2014-02-04
customer-3  demographics:gender    F     2007-04-01
customer-4  demographics:postcode  3001  2011-03-14

[Diagram: hdfs:///prod/ivory/customers contains dict[1] and dict[0]. Versions: repository 1, dictionary 1, store -. Action: ingest factset.]

[Diagram: hdfs:///prod/ivory/customers contains dict[1], dict[0] and factset-a. Versions: repository 1, dictionary 1, store -.]

Stores
• A prioritised sequence of factsets
• Versioned

[Diagram: hdfs:///prod/ivory/customers contains dict[1], dict[0], factset-a and store[0] { a }; the store references factset-a. Versions: repository 2, dictionary 1, store 0.]

[Diagram: hdfs:///prod/ivory/customers contains dict[1], dict[0], factset-a, factset-b, store[0] { a } and store[1] { b, a }. Versions: repository 3, dictionary 1, store 1. Action: ingest factset.]

[Diagram: hdfs:///prod/ivory/customers contains dict[1], dict[0], factset-a, factset-b, factset-c, store[0] { a }, store[1] { b, a } and store[2] { c, b, a }. Versions: repository 4, dictionary 1, store 2.]

[Diagram: hdfs:///prod/ivory/customers contains dict[1], dict[0], factset-a, factset-b, factset-c, store[0] { a }, store[1] { b, a } and store[2] { c, b, a }. Versions: repository 4, dictionary 1, store 2. Action: remove factset-b.]

[Diagram: hdfs:///prod/ivory/customers contains dict[1], dict[0], factset-a, factset-b, factset-c, store[0] { a }, store[1] { b, a }, store[2] { c, b, a } and store[3] { c, a }. Versions: repository 5, dictionary 1, store 3. Annotation: factset-b removed.]
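The store sequence on these slides can be read as an append-only history: every ingest or removal produces a new store version and leaves earlier versions untouched. The sketch below reproduces store[0] to store[3] under that reading; the function names `ingest` and `remove` and the tuple representation are assumptions made for this example, not Ivory's API.

```python
from typing import List, Tuple

# A store is an ordered sequence of factset ids, in the order the slides
# show them (newest first).
Store = Tuple[str, ...]


def ingest(stores: List[Store], factset_id: str) -> List[Store]:
    """Return a new history with a store version that adds `factset_id` in front."""
    current: Store = stores[-1] if stores else ()
    return stores + [(factset_id,) + current]


def remove(stores: List[Store], factset_id: str) -> List[Store]:
    """Return a new history with a store version that omits `factset_id`.

    Earlier store versions are kept as they were.
    """
    current = stores[-1]
    return stores + [tuple(f for f in current if f != factset_id)]


history: List[Store] = []
for factset_id in ["a", "b", "c"]:
    history = ingest(history, factset_id)
history = remove(history, "b")
# history[0] == ("a",), history[1] == ("b", "a"),
# history[2] == ("c", "b", "a"), history[3] == ("c", "a")
```

Because removal only adds a new store version, recovering from a bad ingest never requires rewriting or deleting existing data.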

Internals

• On-disk hierarchy
• On-disk EAVT representation incl. compression
• Internal snapshots

User experience
• Consistent command line tooling
• Version metadata
• Workflows
• End-to-end example system

Extensible dictionary
• Support for rich attribute types (e.g. structs, arrays)
• Arbitrary attribute metadata
• Specification of valid attribute values, e.g. ‘M’ and ‘F’ only for gender
• Improved validation
• Improved on-disk representation
• Useful for downstream applications, e.g. plots
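As a rough idea of what a specification of valid attribute values might look like, the sketch below attaches an optional set of allowed values to each attribute. The structure and the names `VALID_VALUES` and `is_valid` are hypothetical; the slide does not say how Ivory would express this.

```python
from typing import Dict, FrozenSet, Optional

# Hypothetical: allowed values per attribute; None means "any value".
VALID_VALUES: Dict[str, Optional[FrozenSet[str]]] = {
    "demographics:gender": frozenset({"M", "F"}),
    "demographics:postcode": None,
}


def is_valid(feature_id: str, value: str) -> bool:
    """True if the attribute is known and the value is allowed for it."""
    if feature_id not in VALID_VALUES:
        return False
    allowed = VALID_VALUES[feature_id]
    return allowed is None or value in allowed
```

With this table, `is_valid("demographics:gender", "M")` is true and `is_valid("demographics:gender", "X")` is false.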

Lazy feature generation
• Lazily generate features derived from existing facts on extract (chord/snapshot)
• Derived “meta” features (i.e. ‘select’)
• Windowing functions (e.g. “average over last 3 months”)
• Row-level features
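A windowing function such as “average over last 3 months” could be computed at extraction time directly from the stored facts. The sketch below shows one possible reading, in which the window covers facts dated after `at` minus three calendar months and up to `at` inclusive; the function names and that boundary choice are assumptions made for this example.

```python
import calendar
from datetime import date
from typing import List, Optional, Tuple

# (entity, attribute, value, valid-from date); illustrative data only.
Fact = Tuple[str, str, float, date]

FACTS: List[Fact] = [
    ("customer-1", "accounts:sav.balance", 634.0, date(2014, 2, 1)),
    ("customer-2", "accounts:sav.balance", 184.0, date(2014, 2, 1)),
    ("customer-2", "accounts:sav.balance", 312.0, date(2014, 3, 1)),
]


def months_before(d: date, months: int) -> date:
    """The date `months` calendar months before `d`, clamping the day."""
    index = d.year * 12 + (d.month - 1) - months
    year, month0 = divmod(index, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last_day))


def window_average(facts: List[Fact], entity: str, attribute: str,
                   at: date, months: int = 3) -> Optional[float]:
    """Mean of the values dated within the window ending at `at`.

    Returns None when no fact falls inside the window.
    """
    start = months_before(at, months)
    values = [value for e, a, value, time in facts
              if e == entity and a == attribute and start < time <= at]
    return sum(values) / len(values) if values else None


average = window_average(FACTS, "customer-2", "accounts:sav.balance",
                         date(2014, 3, 1))
```

Computing the feature on extract means the stored facts stay raw, and the same facts can serve several window lengths without re-ingestion.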

[Diagram: source data (Data sets A to E) is loaded directly into Ivory, which serves as both Feature Eng and Feature Store. Train Model and Score steps for Models 1, 2 and 3 draw on it.]
Modelling in silos - models built and deployed in isolation
Feature engineering is integrated and lazily generated on extraction
Source data loaded directly

Repository forking
• Low-cost cloning/forking of repositories:
  • “master” production repo
  • “experimental” cloned repo
• Allow a data scientist to join production features with their own without affecting production operations

Other filesystems
• Support for repository metadata and factsets to be on different file systems
• Support HDFS, S3 and POSIX
