Ivory - A Data Store for Data Science
Ambiata
October 20, 2014
Transcript
IVORY
A DATA STORE FOR DATA SCIENCE
http://github.com/ambiata/ivory
© Ambiata 2014
DATA SCIENCE IN THE REAL WORLD
PROBLEM #1
“DATA WRANGLING”
WHAT WE START WITH
WHAT WE NEED
Feature vectors (rows: instances; columns: features)

instance  gender  balance  purchases  zipcode  prop_online  num_accs
89340218  M          0.00          3  3001     1.00         1
48149407  -        634.83         16  4670     0.6875       3
18452274  F         15.12          2  -        0.50         1
07499337  M         33.56          2  -        1.00         1
62948721  F         98.34         12  3303     0.8333       4
93754723  -        523.81         23  2046     0.4782       1
00272446  F       1086.05         17  -        1.00         2
13374497  F        224.81          9  -        0.2222       1
31989993  M         78.21          2  2134     0.50         1
46474236  -        126.48          4  -        0.0          1
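To make the shape of a feature vector concrete, here is a minimal Python sketch that pivots long-format facts into one vector per instance. The `(instance, feature, value)` fact layout and the function name `to_feature_vectors` are assumptions made for illustration, not something the slides specify; the sample values are copied from the first two instances in the slide's table, and treating "-" as a missing value is likewise an assumption.

```python
from collections import defaultdict

# Column order taken from the slide's table.
FEATURES = ["gender", "balance", "purchases", "zipcode", "prop_online", "num_accs"]

# Hypothetical long-format facts: (instance, feature, value).
facts = [
    ("89340218", "gender", "M"),
    ("89340218", "balance", 0.00),
    ("89340218", "purchases", 3),
    ("89340218", "zipcode", "3001"),
    ("89340218", "prop_online", 1.00),
    ("89340218", "num_accs", 1),
    ("48149407", "balance", 634.83),
    ("48149407", "purchases", 16),
    ("48149407", "zipcode", "4670"),
    ("48149407", "prop_online", 0.6875),
    ("48149407", "num_accs", 3),
]


def to_feature_vectors(facts, features=FEATURES, missing=None):
    """Pivot (instance, feature, value) facts into one fixed-order vector per instance."""
    rows = defaultdict(dict)
    for instance, feature, value in facts:
        rows[instance][feature] = value
    # Features with no fact for an instance are filled with `missing`.
    return {
        instance: [values.get(f, missing) for f in features]
        for instance, values in rows.items()
    }


vectors = to_feature_vectors(facts)
```

Every instance ends up with the same columns in the same order, which is what a model-training step typically expects.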
A typical workflow
Diagram: Data set B, Data set C, Data set D → Feature Eng → Model train → Score
Multiple data sources:
• Transaction logs
• Database snapshots
• Segmentation models
• 3rd-party data
Feature engineering:
• Data sources are joined
• Instances are created
• Features are engineered
The cool stuff:
• Models are built
• Instances are scored
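The feature engineering steps above (join sources, create instances, engineer features) can be sketched in a few lines of Python. Everything here is hypothetical: the two toy sources, the customer ids, and the reading of `prop_online` as "proportion of purchases made online" are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical source 1: a transaction log of (customer, channel) pairs.
transactions = [
    ("c1", "online"),
    ("c1", "store"),
    ("c2", "online"),
    ("c2", "online"),
]

# Hypothetical source 2: a database snapshot of customer -> number of accounts.
accounts = {"c1": 2, "c2": 1, "c3": 1}


def engineer_features(transactions, accounts):
    """Join both sources on customer id and derive one feature dict per instance."""
    total = Counter(customer for customer, _ in transactions)
    online = Counter(customer for customer, channel in transactions if channel == "online")
    # Instances are the union of customers seen in either source.
    instances = sorted(set(total) | set(accounts))
    return {
        c: {
            "purchases": total[c],
            # Undefined (None) when the customer has no purchases at all.
            "prop_online": online[c] / total[c] if total[c] else None,
            "num_accs": accounts.get(c),
        }
        for c in instances
    }


features = engineer_features(transactions, accounts)
```

In a siloed setup, logic like `engineer_features` tends to be rewritten for each model build, which is the duplication the deck goes on to describe.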
Feature preparation: 85%
Modelling: 15%
Multiple data science projects
Diagram: Data set A, Data set B, Data set C, Data set D and Data set E feed three separate pipelines:
• Feature Eng 1 → Train 1 → Score 1
• Feature Eng 2 → Train 2 → Score 2
• Feature Eng 3 → Train 3 → Score 3
Feature engineering is in a silo: no reuse between model builds
PROBLEM #2
“LAB TO FACTORY” AKA DEV OPS
• Continually receiving data
• Want to leverage a history of all this data
• Continually training + scoring
• Data may need to be corrected
• Need to extend data model on-the-fly
LAMBDA ARCHITECTURE
query = function(all data)
Diagram labels: New data stream, Query, Magical query engine
Diagram: a new data stream feeds the batch and speed layers; a query reads from the resulting views.
• BATCH LAYER: All data
• SERVING LAYER: Precomputed views
• SPEED LAYER: Stream processing, Incremental views, Real-time store
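The idea that a query is a function of all data, answered by combining a precomputed batch view with a small incremental view, can be sketched as below. This is a toy illustration of the general Lambda architecture pattern, not Ivory code; the purchase-count view and the function names are assumptions.

```python
from collections import Counter


def batch_view(all_data):
    """Batch layer: recompute the view from the full history of events."""
    return Counter(customer for customer, _ in all_data)


def incremental_view(view, event):
    """Speed layer: fold one newly arrived event into the real-time view."""
    customer, _ = event
    view[customer] += 1
    return view


def query(precomputed, realtime, customer):
    """Serving side: answer a query by merging the batch and real-time views."""
    return precomputed[customer] + realtime[customer]


# Events already covered by the last batch run.
history = [("c1", 9.99), ("c2", 5.00), ("c1", 20.00)]
# Events that arrived after the batch run started.
new_events = [("c1", 3.50), ("c3", 1.25)]

precomputed = batch_view(history)
realtime = Counter()
for event in new_events:
    incremental_view(realtime, event)

c1_purchases = query(precomputed, realtime, "c1")
```

The merged answer matches what a full recomputation over `history + new_events` would return, which is the sense in which the query remains a function of all data.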
Diagram labels: New data stream, Query, All data, Feature view, Stream processing, Incremental views, Real-time store
Diagram labels: New data stream, Query, All data, Feature view, Stream processing, Incremental views, Real-time store, Model train and score
IVORY
A shared feature view asset
Diagram: Data set A, Data set B, Data set C, Data set D, Data set E → Ivory → Train 1 / Score 1, Train 2 / Score 2, Train 3 / Score 3
An immutable, batch-oriented data store
Diagram labels: New data stream, Query, All data, Ivory, Stream processing, Incremental views, Real-time store, Model train and score
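One way to picture an immutable, batch-oriented store is an append-only list of ingested batches, where corrections arrive as new batches instead of overwriting old data. The sketch below is hypothetical and is not Ivory's API: the `FactStore` class, the `(entity, feature, value, date)` fact layout, and the rule that a later batch wins when dates tie are all assumptions.

```python
from datetime import date


class FactStore:
    """Toy append-only store: batches are added, never edited or removed."""

    def __init__(self):
        self._batches = []

    def ingest(self, batch):
        # Freeze the batch as a tuple so it cannot be changed after ingest.
        self._batches.append(tuple(batch))

    def snapshot(self, as_of):
        """Latest value per (entity, feature) at or before `as_of`."""
        latest = {}
        for batch in self._batches:  # ingestion order, so later batches win ties
            for entity, feature, value, when in batch:
                key = (entity, feature)
                if when <= as_of and (key not in latest or when >= latest[key][0]):
                    latest[key] = (when, value)
        return {key: value for key, (_, value) in latest.items()}


store = FactStore()
store.ingest([
    ("c1", "balance", 10.0, date(2014, 1, 1)),
    ("c1", "balance", 20.0, date(2014, 3, 1)),
])
# A correction for 1 January is appended as a new batch; the original stays on record.
store.ingest([("c1", "balance", 15.0, date(2014, 1, 1))])

feb = store.snapshot(date(2014, 2, 1))
apr = store.snapshot(date(2014, 4, 1))
```

Because nothing is overwritten, a snapshot for any past date can be recomputed later, which suits continual training and scoring over a history of data that may need correcting.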
An extensible data model, backed by HDFS/S3
Diagram labels: Feature vectors, Ivory, HDFS / S3
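An extensible data model can be pictured as a dictionary of feature definitions that grows over time without touching facts already stored. The sketch below is an assumption-laden illustration, not Ivory's dictionary format: the mapping of feature name to Python type and the helpers `validate` and `extend` are hypothetical.

```python
# Hypothetical feature dictionary: feature name -> expected value type.
dictionary = {"balance": float, "purchases": int}


def validate(fact, dictionary):
    """Accept a (feature, value) fact only if the feature is declared with a matching type."""
    feature, value = fact
    expected = dictionary.get(feature)
    return expected is not None and isinstance(value, expected)


def extend(dictionary, feature, value_type):
    """Return a new dictionary with one more feature; existing entries are left untouched."""
    if feature in dictionary:
        raise ValueError(f"feature already defined: {feature}")
    return {**dictionary, feature: value_type}


before = validate(("zipcode", "3001"), dictionary)  # zipcode is not declared yet
extended = extend(dictionary, "zipcode", str)
after = validate(("zipcode", "3001"), extended)
```

Adding `zipcode` produces a new dictionary and leaves the old one valid, so facts ingested earlier do not need to be rewritten when the model is extended on the fly.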
Apache V2 Licence
github.com/ambiata/ivory