Ivory - A Data Store for Data Science
Ambiata
October 20, 2014
Transcript
IVORY: A DATA STORE FOR DATA SCIENCE
http://github.com/ambiata/ivory
© Ambiata 2014
DATA SCIENCE IN THE REAL WORLD
PROBLEM #1
“DATA WRANGLING”
WHAT WE START WITH
WHAT WE NEED
Feature vectors (one row per instance, one column per feature; "-" marks a missing value)

instance   gender  balance  purchases  zipcode  prop_online  num_accs
89340218   M       0.00     3          3001     1.00         1
48149407   -       634.83   16         4670     0.6875       3
18452274   F       15.12    2          -        0.50         1
07499337   M       33.56    2          -        1.00         1
62948721   F       98.34    12         3303     0.8333       4
93754723   -       523.81   23         2046     0.4782       1
00272446   F       1086.05  17         -        1.00         2
13374497   F       224.81   9          -        0.2222       1
31989993   M       78.21    2          2134     0.50         1
46474236   -       126.48   4          -        0.0          1
A typical workflow
[Diagram labels: Data set B, Data set C, Data set D; Feature Eng; Model train; Score]
Multiple data sources:
• Transaction logs
• Database snapshots
• Segmentation models
• 3rd-party data
Feature engineering:
• Data sources are joined
• Instances are created
• Features are engineered
The cool stuff:
• Models are built
• Instances are scored
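The feature engineering steps above (join the sources, create instances, engineer features) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the record layouts, the customer IDs, and the helper name `build_feature_vectors` do not come from the deck or from Ivory.

```python
from collections import defaultdict

# Hypothetical transaction log rows: (customer_id, amount, channel).
transactions = [
    ("c1", 20.0, "online"),
    ("c1", 5.5, "online"),
    ("c2", 12.0, "store"),
    ("c2", 30.0, "online"),
]

# Hypothetical database snapshot: customer_id -> attributes.
customers = {
    "c1": {"gender": "M", "zipcode": "3001"},
    "c2": {"gender": None, "zipcode": "4670"},
    "c3": {"gender": "F", "zipcode": None},
}


def build_feature_vectors(transactions, customers):
    """Join the two sources and derive one feature row per customer."""
    counts = defaultdict(int)
    online = defaultdict(int)
    for customer_id, _amount, channel in transactions:
        counts[customer_id] += 1
        if channel == "online":
            online[customer_id] += 1

    vectors = {}
    for customer_id, attrs in customers.items():
        n = counts[customer_id]
        vectors[customer_id] = {
            "gender": attrs["gender"],
            "zipcode": attrs["zipcode"],
            "purchases": n,
            # Proportion of purchases made online; None when there are none.
            "prop_online": online[customer_id] / n if n else None,
        }
    return vectors


feature_vectors = build_feature_vectors(transactions, customers)
```

Each model build that repeats this kind of code against the same sources is duplicated effort, which is the cost the deck attributes to feature preparation.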
Feature preparation: 85%
Modelling: 15%
Multiple data science projects
[Diagram labels: Data set A, Data set B, Data set C, Data set D, Data set E; Feature Eng 1, Feature Eng 2, Feature Eng 3; Train 1, Score 1; Train 2, Score 2; Train 3, Score 3]
Feature engineering is in a silo: no reuse between model builds
PROBLEM #2
“LAB TO FACTORY” AKA DEV OPS
• Continually receiving data
• Want to leverage a history of all this data
• Continually training + scoring
• Data may need to be corrected
• Need to extend data model on-the-fly
LAMBDA ARCHITECTURE
query = function(all data)
[Diagram labels: New data stream; Magical query engine; Query]
[Diagram labels: New data stream, All data, Precomputed views, Stream processing, Incremental views, Real-time store, Query; layers: BATCH LAYER, SPEED LAYER, SERVING LAYER]
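A rough sketch of how "query = function(all data)" relates to the layered diagram: a view precomputed over older data and an incremental view over recent data are merged when a query is answered. The event layout and the names `purchase_counts` and `answer_query` are hypothetical, and the two views are plain in-memory counters rather than real batch or stream systems.

```python
from collections import Counter

# Hypothetical events: (customer_id, event_type).
batch_events = [("c1", "purchase"), ("c2", "purchase"), ("c1", "purchase")]
recent_events = [("c2", "purchase"), ("c3", "purchase")]


def purchase_counts(events):
    """The query expressed as a pure function of the data it is given."""
    return Counter(cid for cid, kind in events if kind == "purchase")


# Batch side: a view precomputed over all data seen up to the last batch run.
batch_view = purchase_counts(batch_events)

# Speed side: an incremental view over events that arrived after that run.
incremental_view = purchase_counts(recent_events)


def answer_query(customer_id):
    """Merge the two views at query time (a missing key counts as 0)."""
    return batch_view[customer_id] + incremental_view[customer_id]


# The merged answer matches the function applied to all data at once.
all_data_view = purchase_counts(batch_events + recent_events)
```

The merge is this simple only because counting is additive; a query whose result cannot be combined from partial results would need a different merge step.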
[Diagram labels: New data stream, All data, Feature view, Stream processing, Incremental views, Real-time store, Query]
[Diagram labels: New data stream, All data, Feature view, Stream processing, Incremental views, Real-time store, Model train and score, Query]
IVORY
A shared feature view asset
[Diagram labels: Data set A, Data set B, Data set C, Data set D, Data set E; Ivory; Train 1, Score 1; Train 2, Score 2; Train 3, Score 3]
An immutable, batch-oriented data store
[Diagram labels: New data stream, All data, Ivory, Stream processing, Incremental views, Real-time store, Model train and score, Query]
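To make "immutable, batch-oriented" concrete, here is a minimal sketch of an append-only store of dated facts with a point-in-time snapshot. It is not Ivory's API or storage format: the `Fact` and `FactStore` names, the field layout, and the "latest value wins" rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One immutable observation (hypothetical layout)."""
    entity: str
    attribute: str
    value: object
    date: str  # ISO date, so string order equals time order


class FactStore:
    """Append-only: batches of facts are added, never edited in place."""

    def __init__(self):
        self._facts = []

    def append(self, batch):
        self._facts.extend(batch)

    def snapshot(self, as_of):
        """Latest value of each attribute per entity on or before as_of."""
        latest = {}
        # sorted() is stable, so for equal dates a later-appended fact wins,
        # which lets a correction be expressed as a new fact.
        for fact in sorted(self._facts, key=lambda f: f.date):
            if fact.date <= as_of:
                latest.setdefault(fact.entity, {})[fact.attribute] = fact.value
        return latest


store = FactStore()
store.append([
    Fact("c1", "balance", 10.0, "2014-01-01"),
    Fact("c1", "balance", 25.0, "2014-06-01"),
])
# A later batch introduces a new attribute without any schema change.
store.append([Fact("c1", "zipcode", "3001", "2014-03-01")])

april = store.snapshot("2014-04-01")
december = store.snapshot("2014-12-31")
```

Because nothing is overwritten, the same snapshot date gives the same feature values whenever it is recomputed, which is what allows several model builds to share one feature view.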
An extensible data model, backed by HDFS/S3
[Diagram labels: Ivory, HDFS / S3, Feature vectors]
Apache V2 Licence
github.com/ambiata/ivory