Ivory - A Data Store for Data Science
Ambiata
October 20, 2014
Technology
Transcript
IVORY - A DATA STORE FOR DATA SCIENCE http://github.com/ambiata/ivory © Ambiata 2014
DATA SCIENCE IN THE REAL WORLD
PROBLEM #1
“DATA WRANGLING”
WHAT WE START WITH
WHAT WE NEED
Feature vectors (rows: instance, columns: feature)
instance  gender  balance  purchases  zipcode  prop_online  num_accs
89340218  M       0.00     3          3001     1.00         1
48149407  -       634.83   16         4670     0.6875       3
18452274  F       15.12    2          -        0.50         1
07499337  M       33.56    2          -        1.00         1
62948721  F       98.34    12         3303     0.8333       4
93754723  -       523.81   23         2046     0.4782       1
00272446  F       1086.05  17         -        1.00         2
13374497  F       224.81   9          -        0.2222       1
31989993  M       78.21    2          2134     0.50         1
46474236  -       126.48   4          -        0.0          1
A typical workflow
Data set B, Data set C, Data set D -> Feature Eng -> Model train -> Score
Multiple data sources:
• Transaction logs
• Database snapshots
• Segmentation models
• 3rd-party data
Feature engineering:
• Data sources are joined
• Instances are created
• Features are engineered
The cool stuff:
• Models are built
• Instances are scored
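The three feature engineering steps above (join the sources, create instances, engineer features) can be sketched in a few lines of Python. This is only an illustration: the two input sources, their field names, and the reading of prop_online as "share of purchases made online" are assumptions, not something the slides specify.

```python
from collections import defaultdict

# Hypothetical inputs: a transaction log and a database snapshot, keyed by customer id.
transactions = [
    {"customer": "c1", "amount": 20.0, "online": True},
    {"customer": "c1", "amount": 5.0, "online": False},
    {"customer": "c2", "amount": 12.5, "online": True},
]
accounts = {
    "c1": {"gender": "F", "balance": 98.34},
    "c3": {"gender": "M", "balance": 0.0},
}


def build_feature_vectors(transactions, accounts):
    """Join both sources on customer id and derive one feature vector per instance."""
    purchases = defaultdict(int)
    online = defaultdict(int)
    for t in transactions:
        purchases[t["customer"]] += 1
        online[t["customer"]] += 1 if t["online"] else 0

    vectors = {}
    # An instance exists for every customer seen in either source.
    for customer in sorted(set(purchases) | set(accounts)):
        n = purchases.get(customer, 0)
        account = accounts.get(customer, {})
        vectors[customer] = {
            "gender": account.get("gender"),    # None when the source has no value
            "balance": account.get("balance"),
            "purchases": n,
            "prop_online": online.get(customer, 0) / n if n else None,
        }
    return vectors


vectors = build_feature_vectors(transactions, accounts)
```

Missing values stay explicit (None here, "-" on the feature vector slide) rather than being silently defaulted, which keeps the later modelling step honest about what each source actually contained.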
Feature preparation: 85%
Modelling: 15%
Multiple data science projects
Data set A, Data set B, Data set C, Data set D, Data set E
Feature Eng 1 -> Train 1 -> Score 1
Feature Eng 2 -> Train 2 -> Score 2
Feature Eng 3 -> Train 3 -> Score 3
Feature engineering is in a silo: no reuse between model builds
PROBLEM #2
“LAB TO FACTORY” AKA DEV OPS
• Continually receiving data
• Want to leverage a history of all this data
• Continually training + scoring
• Data may need to be corrected
• Need to extend data model on-the-fly
LAMBDA ARCHITECTURE
query = function(all data)
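Read literally, "query = function(all data)" says that an answer is derived from the complete history of records and from nothing else. A minimal sketch, assuming a hypothetical list of balance-change records and a made-up balance_query function:

```python
def balance_query(all_data, customer):
    """A query expressed as a pure function of the complete record history."""
    return sum(r["amount"] for r in all_data if r["customer"] == customer)


all_data = [
    {"customer": "c1", "amount": 100.0},
    {"customer": "c1", "amount": -30.0},
    {"customer": "c2", "amount": 10.0},
]
before = balance_query(all_data, "c1")

# New data arrives as an extra record; nothing already stored is modified.
all_data = all_data + [{"customer": "c1", "amount": 5.0}]
after = balance_query(all_data, "c1")
```

Because the function holds no state of its own, rerunning it over the grown data set is enough to bring the answer up to date. The cost is that a naive version rescans everything on every call.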
Diagram labels: New data stream, Magical query engine, Query
Layers: BATCH LAYER, SPEED LAYER, SERVING LAYER
Diagram labels: New data stream, All data, Precomputed views, Stream processing, Incremental views, Real-time store, Query
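One common reading of this diagram is that the batch layer periodically recomputes views from all data, the speed layer folds in records that arrived since the last batch run, and a query combines the two. The sketch below follows that reading; the function names and the purchase-count example are illustrative assumptions, not part of the deck.

```python
from collections import Counter


def batch_view(all_data):
    """Batch layer: recompute purchase counts from the full history."""
    return Counter(r["customer"] for r in all_data)


def update_incremental(view, record):
    """Speed layer: fold one newly arrived record into the incremental view."""
    view[record["customer"]] += 1
    return view


def query(precomputed, incremental, customer):
    """Answer from the precomputed view plus whatever arrived since it was built."""
    return precomputed[customer] + incremental[customer]


history = [{"customer": "c1"}, {"customer": "c1"}, {"customer": "c2"}]
precomputed = batch_view(history)

incremental = Counter()
recent = [{"customer": "c1"}, {"customer": "c3"}]  # arrived after the batch run
for record in recent:
    update_incremental(incremental, record)

c1_count = query(precomputed, incremental, "c1")
```

When the next batch run covers the recent records, the incremental view can be discarded and rebuilt from empty, so any error in the speed layer is short-lived.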
Diagram labels: New data stream, All data, Feature view, Stream processing, Incremental views, Real-time store, Query
Diagram labels: New data stream, All data, Feature view, Stream processing, Incremental views, Real-time store, Model train and score, Query
IVORY
A shared feature view asset
Diagram labels: Data set A, Data set B, Data set C, Data set D, Data set E, Ivory, Train 1, Score 1, Train 2, Score 2, Train 3, Score 3
An immutable, batch-oriented data store
Diagram labels: New data stream, All data, Ivory, Stream processing, Incremental views, Real-time store, Model train and score, Query
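To make "immutable, batch-oriented" concrete, here is a small sketch of an append-only store in which data arrives in batches and a correction is simply a newer batch. The Fact and FactStore classes and the snapshot method are hypothetical names chosen for illustration; they are not Ivory's actual API or storage format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fact:
    entity: str
    feature: str
    value: object
    day: date


class FactStore:
    """Append-only store: batches are added, existing facts are never edited."""

    def __init__(self):
        self._batches = []

    def ingest(self, facts):
        self._batches.append(tuple(facts))

    def snapshot(self, as_of):
        """Latest value per (entity, feature) on or before as_of.

        Later batches win ties on the same day, so a correction is just a newer batch.
        """
        latest = {}
        for batch in self._batches:
            for f in batch:
                if f.day > as_of:
                    continue
                key = (f.entity, f.feature)
                if key not in latest or f.day >= latest[key].day:
                    latest[key] = f
        return {key: f.value for key, f in latest.items()}


store = FactStore()
store.ingest([
    Fact("c1", "balance", 10.0, date(2014, 1, 1)),
    Fact("c1", "balance", 20.0, date(2014, 3, 1)),
])
# A correction for 1 January arrives later as its own batch.
store.ingest([Fact("c1", "balance", 12.0, date(2014, 1, 1))])

february = store.snapshot(date(2014, 2, 1))
april = store.snapshot(date(2014, 4, 1))
```

Keeping every batch means a feature view can be rebuilt for any past date, which is what lets training and scoring runs see the history as it stood at that date.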
An extensible data model, backed by HDFS/S3
Diagram labels: Feature vectors, Ivory, HDFS / S3
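One way to picture an extensible data model is to store values sparsely, keyed by (instance, feature), and lay them out as dense feature vectors only when they are read. Adding a feature then touches no existing data. The sketch below assumes that representation; the to_rows helper and the "-" placeholder for missing values (borrowed from the feature vector slide) are illustrative choices, not Ivory's format.

```python
def to_rows(values, features, missing="-"):
    """Lay sparse (entity, feature) -> value pairs out as dense rows."""
    entities = sorted({entity for entity, _ in values})
    return [[e] + [values.get((e, f), missing) for f in features] for e in entities]


values = {
    ("c1", "gender"): "F",
    ("c1", "purchases"): 12,
    ("c2", "purchases"): 2,
}
features = ["gender", "purchases"]
rows = to_rows(values, features)

# Extending the model: declare a new feature and add values for it.
# Existing values are left exactly as they were.
features_v2 = features + ["zipcode"]
values_v2 = {**values, ("c1", "zipcode"): "3303"}
rows_v2 = to_rows(values_v2, features_v2)
```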
Apache V2 Licence github.com/ambiata/ivory