Ivory - Concepts
Ambiata
October 20, 2014
Transcript
IVORY CONCEPTS http://github.com/ambiata/ivory © Ambiata 2014
IVORY
A scalable and extensible data store for storing facts and extracting features
[Diagram: the Ivory Repository, with two operations: ingest facts and extract features]
REPOSITORY
• Storing and extracting data for a single class of entity, e.g.:
  • customer
  • account
  • asset
DATA MODEL
customer-1 balance 634 @ 2014-02-01 (a single “fact”)
Fact: Entity - Attribute - Value - Time
The value of a feature (attribute) for a given entity, known to be valid from a certain point in time.
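The entity, attribute, value and time structure of a fact can be sketched as a small immutable record. This is only an illustration of the data model described on the slide: the class name `Fact`, its field names and the `Value` alias are assumptions, not Ivory's actual types.

```python
from dataclasses import dataclass
from datetime import date
from typing import Union

# Hypothetical value type: the slides show string and numeric values.
Value = Union[str, int, float]


@dataclass(frozen=True)
class Fact:
    """One fact: the value of an attribute for an entity, valid from `time`."""
    entity: str
    attribute: str
    value: Value
    time: date  # the value is known to be valid from this date


# The single fact shown on the slide.
fact = Fact("customer-1", "balance", 634, date(2014, 2, 1))
```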
[Figure, labelled "scalable": one attribute (balance) across entities customer-1, customer-2, customer-3, customer-4. Facts shown: 634 @ 2014-02-01 (customer-1), 469 @ 2014-02-01, 276 @ 2014-04-01, 1966 @ 2014-03-01]
[Figure, labelled "extensible": entities customer-1, customer-2, customer-3, customer-4 against attributes gender, balance, purchases, zipcode. Facts shown: 634 @ 2014-02-01, 469 @ 2014-02-01, 276 @ 2014-04-01, 1966 @ 2014-03-01, ‘M’ @ 2012-01-01, 3 @ 2014-03-27, ‘4670’ @ 2009-05-13]
[Figure, labelled "Sparse": entities customer-1, customer-2, customer-3, customer-4 against attributes gender, balance, purchases, zipcode; not every cell holds a fact. Facts shown: 736 @ 2014-01-01, 3 @ 2014-02-19, 184 @ 2014-02-01, 312 @ 2014-03-01, ‘M’ @ 2012-01-01, 276 @ 2014-04-01, 4 @ 2014-04-04, 2 @ 2014-03-12, 3 @ 2014-03-27, ‘2381’ @ 2004-08-19, ‘4670’ @ 2009-05-13, ‘F’ @ 2007-04-01, ‘3001’ @ 2011-09-14, 1876 @ 2014-02-01, 1966 @ 2014-03-01, 634 @ 2014-02-01, 469 @ 2014-02-01]
INGESTING FACTS
• Facts are ingested in atomic units called factsets
• Facts in a factset can span any set of:
  • entities
  • attributes
  • dates/times
entity      attribute  value  date
customer-1  balance    634    2014-02-01
customer-3  balance    184    2014-02-01
customer-4  purchases  4      2014-02-04
customer-2  balance    312    2014-03-01
customer-3  gender     F      2007-04-01
customer-2  zipcode    3001   2011-03-14
ATTRIBUTE DICTIONARY
• Any attribute that is ingested must be declared in the repository’s dictionary
• Dictionary stores metadata for each attribute
• Updated dictionaries can be imported into a repository at any time
namespace     name       encoding  type         description
demographics  gender     string    categorical  Gender
demographics  zipcode    string    categorical  Post-code, zip-code
accounts      balance    double    numerical    Balance of savings account
accounts      purchases  int       numerical    Number of credit-card purchases
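The rule that every ingested attribute must be declared in the dictionary can be sketched as a check run against each fact before ingestion. The names `AttributeMeta`, `DICTIONARY`, `ENCODINGS` and `check_fact` are hypothetical, the dictionary is keyed by attribute name only for brevity, and the encoding check is an assumed use of the metadata rather than Ivory's documented behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeMeta:
    namespace: str
    encoding: str      # "string", "int" or "double", as in the table above
    type: str          # "categorical" or "numerical"
    description: str


# The four dictionary entries from the slide, keyed by attribute name (an assumption).
DICTIONARY = {
    "gender": AttributeMeta("demographics", "string", "categorical", "Gender"),
    "zipcode": AttributeMeta("demographics", "string", "categorical", "Post-code, zip-code"),
    "balance": AttributeMeta("accounts", "double", "numerical", "Balance of savings account"),
    "purchases": AttributeMeta("accounts", "int", "numerical", "Number of credit-card purchases"),
}

# Assumed mapping from dictionary encodings to acceptable Python types.
ENCODINGS = {"string": (str,), "int": (int,), "double": (int, float)}


def check_fact(attribute, value, dictionary=DICTIONARY):
    """Return None if the fact may be ingested, otherwise a reason for rejecting it."""
    meta = dictionary.get(attribute)
    if meta is None:
        return f"attribute '{attribute}' is not declared in the dictionary"
    if not isinstance(value, ENCODINGS[meta.encoding]):
        return f"value {value!r} does not match encoding '{meta.encoding}'"
    return None
```

With this sketch, `check_fact("balance", 634)` passes, while an undeclared attribute such as `"age"` is rejected until an updated dictionary declaring it has been imported.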
EXTRACTING FEATURES
Each row is an instance and each column is a feature; "-" marks a missing value.
instance  gender  balance  purchases  zipcode
89340218  M       0.00     3          3001
48149407  -       634.83   16         4670
18452274  F       15.12    2          -
07499337  M       33.56    2          -
62948721  F       98.34    12         3303
93754723  -       523.81   23         2046
00272446  F       1086.05  17         -
13374497  F       224.81   9          -
31989993  M       78.21    2          2134
46474236  -       126.48   4          -
SNAPSHOTS
• Attribute values for entities at a point in time
• Same time for all entities
• Select latest attribute values with respect to that time
• Typically used in preparing instances for scoring
[Figure: the fact grid for customer-1, customer-2, customer-3, customer-4 over gender, balance, purchases, zipcode, with a snapshot taken @ 2014-03-01. Facts shown: 736 @ 2014-01-01, 6 @ 2014-02-19, 184 @ 2014-02-01, 312 @ 2014-03-01, ‘M’ @ 2012-01-01, 276 @ 2014-04-01, 4 @ 2014-02-04, 2 @ 2014-03-12, 3 @ 2014-03-27, ‘2381’ @ 2004-08-19, ‘4670’ @ 2009-05-13, ‘F’ @ 2007-04-01, ‘3001’ @ 2011-09-14, 1876 @ 2014-02-01, 1966 @ 2014-03-01, 634 @ 2014-02-01, 469 @ 2014-02-01]
[Figure: the resulting snapshot for customer-1, customer-2, customer-3, customer-4 over gender, balance, purchases, zipcode. Values shown: ‘M’, 312, 4, ‘4670’, ‘F’, ‘3001’, 1966, 634, 469]
• Snapshots are assumed to run periodically, e.g. daily or weekly
• Ivory exploits this assumption to improve the runtime of successive snapshots
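The snapshot selection rule (one date for all entities, latest value per attribute as of that date) can be sketched as follows. This is a minimal in-memory illustration, not how Ivory implements it: the function name `snapshot` and the tuple layout are assumptions, facts dated exactly on the snapshot date are assumed to be included, and the facts are taken from the factset and derived-facts examples in this deck.

```python
from datetime import date

# (entity, attribute, value, valid-from date)
FACTS = [
    ("customer-2", "balance", 184, date(2014, 2, 1)),
    ("customer-2", "balance", 312, date(2014, 3, 1)),
    ("customer-2", "balance", 276, date(2014, 4, 1)),
    ("customer-2", "zipcode", "3001", date(2011, 3, 14)),
    ("customer-4", "purchases", 4, date(2014, 2, 4)),
]


def snapshot(facts, at):
    """Latest value of every (entity, attribute) pair with a fact dated on or before `at`."""
    latest = {}
    for entity, attribute, value, time in facts:
        if time > at:
            continue  # not yet valid at the snapshot date
        key = (entity, attribute)
        if key not in latest or time > latest[key][0]:
            latest[key] = (time, value)
    return {key: value for key, (_, value) in latest.items()}


result = snapshot(FACTS, date(2014, 3, 1))
```

Here the balance of 276 is ignored because it only becomes valid on 2014-04-01, after the snapshot date, so customer-2's balance in the snapshot is 312.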
CHORDS
• Attribute values for entities at a point in time
• Different times for different entities
• Select latest attribute values with respect to the times
• Typically used in preparing instances for training
[Figure: the fact grid for customer-1, customer-2, customer-3, customer-4 over gender, balance, purchases, zipcode, with a chord taken at customer-2 @ 2014-03-01 and customer-4 @ 2014-01-01. Facts shown: 736 @ 2014-01-01, 6 @ 2014-02-19, 184 @ 2014-02-01, 312 @ 2014-03-01, ‘M’ @ 2012-01-01, 276 @ 2014-04-01, 4 @ 2014-04-04, 2 @ 2014-03-12, 3 @ 2014-03-27, ‘2381’ @ 2004-08-19, ‘4670’ @ 2009-05-13, ‘F’ @ 2007-04-01, ‘3001’ @ 2011-09-14, 1876 @ 2014-02-01, 1966 @ 2014-03-01, 634 @ 2014-02-01, 469 @ 2014-02-01]
[Figure: the resulting chord for customer-2 @ 2014-03-01 and customer-4 @ 2014-01-01 over gender, balance, purchases, zipcode. Values shown: ‘M’, 312, 6, ‘4670’, ‘3001’, 1876]
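A chord differs from a snapshot only in that each entity carries its own date. The sketch below illustrates that rule under stated assumptions: the function name `chord`, the tuple layout and the choice to skip entities without a chord date are hypothetical, and facts dated exactly on an entity's chord date are assumed to be included. The two chord dates are the ones shown on the slide.

```python
from datetime import date

# (entity, attribute, value, valid-from date)
FACTS = [
    ("customer-2", "balance", 184, date(2014, 2, 1)),
    ("customer-2", "balance", 312, date(2014, 3, 1)),
    ("customer-2", "balance", 276, date(2014, 4, 1)),
    ("customer-2", "zipcode", "3001", date(2011, 3, 14)),
    ("customer-4", "purchases", 4, date(2014, 2, 4)),
]


def chord(facts, times):
    """Latest value per (entity, attribute), judged against each entity's own date.

    `times` maps entity -> date; entities without an entry are left out.
    """
    latest = {}
    for entity, attribute, value, time in facts:
        at = times.get(entity)
        if at is None or time > at:
            continue  # entity not requested, or fact not yet valid for it
        key = (entity, attribute)
        if key not in latest or time > latest[key][0]:
            latest[key] = (time, value)
    return {key: value for key, (_, value) in latest.items()}


result = chord(FACTS, {"customer-2": date(2014, 3, 1), "customer-4": date(2014, 1, 1)})
```

In this example customer-4 contributes nothing, because its only fact is dated after its chord date of 2014-01-01. Using per-entity dates like this keeps training instances from seeing values that were not yet known at the time of interest.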
DERIVED FACTS
customer-2, balance: 184 @ 2014-02-01, 312 @ 2014-03-01, 276 @ 2014-04-01
customer-2, max.balance.4M: ?
The maximum balance over the last 4 months can be derived from the set of balance facts.
Many facts can be derived from a time series of base facts
base fact  derived facts
balance    Maximum balance over the last month
           Mean balance over the last 2 months
           Balance gradient over the last 3 months
purchase   Number of purchases in the last 3 weeks
           Proportion of supermarket purchases in the last 2 weeks
zipcode    Number of times the zipcode has changed in the last 5 years
           Longest period where the zipcode has not changed in the last 5 years
VIRTUAL FEATURES
• Ivory represents derived facts as virtual features
• Virtual features are declared in the dictionary
• Specify expressions against base facts
• Are computed lazily when features are extracted
name               source    expression  window
max.balance.4M     balance   max         4 months
mean.balance.6M    balance   mean        6 months
num.purchases.3W   purchase  count       3 weeks
changes.zipcode.5Y zipcode   num_flips   5 years
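A virtual feature can be thought of as a declaration (source attribute, expression, window) that is evaluated against base facts only when a value is requested. The sketch below illustrates that idea for two of the declarations above. The names `VIRTUAL`, `EXPRESSIONS` and `virtual_feature` are hypothetical, a month is approximated as 30 days, and the window is assumed to end at, and include, the extraction date; none of this is taken from Ivory's implementation.

```python
from datetime import date, timedelta

# Assumed implementations of the expression names used in the dictionary.
EXPRESSIONS = {
    "max": max,
    "mean": lambda values: sum(values) / len(values),
    "count": len,
}

# name -> (source attribute, expression, window); a month is approximated as 30 days.
VIRTUAL = {
    "max.balance.4M": ("balance", "max", timedelta(days=4 * 30)),
    "mean.balance.6M": ("balance", "mean", timedelta(days=6 * 30)),
}

# Base facts for customer-2 from the derived-facts example.
FACTS = [
    ("customer-2", "balance", 184, date(2014, 2, 1)),
    ("customer-2", "balance", 312, date(2014, 3, 1)),
    ("customer-2", "balance", 276, date(2014, 4, 1)),
]


def virtual_feature(name, entity, facts, at):
    """Evaluate a virtual feature on demand from the base facts inside its window."""
    source, expression, window = VIRTUAL[name]
    values = [
        value
        for fact_entity, attribute, value, time in facts
        if fact_entity == entity and attribute == source and at - window < time <= at
    ]
    return EXPRESSIONS[expression](values) if values else None


max_balance = virtual_feature("max.balance.4M", "customer-2", FACTS, date(2014, 4, 1))
```

Nothing is stored for `max.balance.4M`: the value is recomputed from the balance facts each time it is extracted, which is the sense in which the feature is virtual.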
COMMITS
[Diagram: the Ivory Repository, with three operations: ingest facts, extract features and import dictionary]
• A commit is recorded for any repository change:
  • factset ingestions
  • dictionary imports
• The repository at a given commit is an immutable data store
• Snapshots and chords can be run against a specific commit
[Figure: a commit timeline]
0  create repository
1  import dictionary
2  ingest factset
3  ingest factset
4  import dictionary
5  ingest factset
Two snapshots and one chord are shown being taken against commits on the timeline.
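The commit model can be sketched as an append-only list of immutable repository states, where each change produces a new commit and earlier commits never change. The `Commit` and `Repository` classes, their fields and the factset identifiers below are hypothetical and only illustrate the idea; they are not Ivory's storage format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Commit:
    id: int
    change: str
    dictionary: Tuple[str, ...]  # attribute names declared at this commit
    factsets: Tuple[str, ...]    # identifiers of factsets visible at this commit


class Repository:
    def __init__(self):
        self.commits = [Commit(0, "create repository", (), ())]

    def _record(self, change, dictionary, factsets):
        commit = Commit(len(self.commits), change, dictionary, factsets)
        self.commits.append(commit)  # earlier commits are never modified
        return commit

    def import_dictionary(self, attributes):
        head = self.commits[-1]
        return self._record("import dictionary", tuple(attributes), head.factsets)

    def ingest_factset(self, factset_id):
        head = self.commits[-1]
        return self._record("ingest factset", head.dictionary, head.factsets + (factset_id,))

    def at(self, commit_id):
        """The immutable repository state that a snapshot or chord would read."""
        return self.commits[commit_id]


# Replay the sequence of changes from the timeline.
repo = Repository()
repo.import_dictionary(["balance"])
repo.ingest_factset("factset-1")
repo.ingest_factset("factset-2")
repo.import_dictionary(["balance", "purchases"])
repo.ingest_factset("factset-3")
```

An extraction run against `repo.at(3)` sees only the first two factsets and the first dictionary, regardless of what is ingested afterwards, which makes extractions repeatable.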
KEY CONCEPTS
• Repository
• Commit
• Dictionary
• Factset
• Base fact
• Virtual feature
• Snapshot
• Chord