Designing Data Pipelines for Machine Learning Applications
Alexis Seigneurin
March 14, 2019
Transcript
DESIGNING DATA PIPELINES FOR MACHINE LEARNING APPLICATIONS (WITH KAFKA AND KAFKA STREAMS)
Alexis Seigneurin
Me
• Data Engineer
• Kafka, Spark, AWS…
• Blog: aseigneurin.github.io
• Twitter: @aseigneurin
Data Pipeline?
Data Pipeline
• It is the journey of your data
• Ingest, transform, output… all in streaming
  • Kafka
• Sometimes called a DAG
  • Apache Airflow
Data Pipeline - Example
Data Pipeline - Airflow
Key elements
• Streaming
• Transformations = Independent jobs
  • (= different technologies?)
• Data can be consumed by multiple consumers
Machine Learning Application?
Machine Learning Application
• Use historical data to make predictions on new data
• Train a model on historical data
• Apply the model on new data
Example - Real estate pricing
Source: https://www.kaggle.com/shree1992/housedata#data.csv
[Dataset preview: Features, Label]
A batch-oriented process
• Training on a batch of data
• Can take hours (days?)
• Validate + deploy the new model
Data Pipeline + ML?
Batch + Streaming
• Training in batch to create a model
• Streaming app to use the model
Training
• Done from time to time (e.g. every other week)
• With a fixed data set
• Export a model
• A new model is deployed when it has been validated by a Data Scientist
Streaming app
• Use Kafka as a backbone
• Kafka Streams to implement transformations
• Need a way to use the model on the JVM
Architecture
• Batch: Labeled data → ML Training → ML Model
• Streaming: Unlabeled data (in Kafka) → Apply the model (in Kafka Streams) → Data with predictions
ML frameworks
Python libraries
• scikit-learn, TensorFlow, Keras…
• ⚠ Need to build and expose a REST API
• ⚠ Scaling can be complicated
Spark ML
• Integrated in Spark
• ⚠ Have to use Spark Streaming
Cloud-hosted services
• AWS SageMaker, Google Cloud Machine Learning Engine…
• No code to write to serve the models
• ⚠ Less control over how the model is served
H2O.ai
• Java-based ML platform
• Can download a POJO
Creating a model with H2O
Importing the data
Preparing the model
• Set column types: numeric, enum…
• Split the dataset: 70/20/10 (training/validation/test)
• Algorithm: Gradient Boosting Machine
Model with H2O
Model with H2O
POJO of the Model
Using the model
In a simple app
Data → Model → Prediction
In a simple app
Stream processing
Kafka topic → Kafka Streams application → Kafka topic
Input record:
{ "date": "2014-05-02 00:00:00", "bedrooms": 2, "bathrooms": 2.0, "sqft_living": 2591, "sqft_lot": 5182, "floors": 0.0, "waterfront": "0", "view": "0", "condition": "0", "sqft_above": 2591, "sqft_basement": 0, "yr_built": 1911, "yr_renovated": 1995, "street": "Burke-Gilman Trail", "city": "Seattle", "statezip": "WA 98155", "country": "USA" }
Output record:
{ "date": "2014-05-02 00:00:00", "bedrooms": 2, "bathrooms": 2.0, "sqft_living": 2591, "sqft_lot": 5182, "floors": 0.0, "waterfront": "0", "view": "0", "condition": "0", "sqft_above": 2591, "sqft_basement": 0, "yr_built": 1911, "yr_renovated": 1995, "street": "Burke-Gilman Trail", "city": "Seattle", "statezip": "WA 98155", "country": "USA", "price": 781898.4215855601 }
Kafka Streams
• Client library for Java and Scala
• DSL: stream(), map(), filter(), to()...
• Aggregations, joins, windowing
• KStream / KTable
• Simple deployment model
• Lets you build "microservices"
Kafka Streams app
• Read from the source topic
• Apply the model
• Write to the output topic
• Start consuming
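The four steps above amount to a read/transform/write loop. The sketch below is not Kafka Streams code: it uses a plain Python list in place of each topic and a hypothetical `predict_price` stub in place of the exported H2O model, only to show the shape of the transformation (each input record comes out with an added `price` field).

```python
import json
from typing import Iterable, Iterator


def predict_price(features: dict) -> float:
    """Hypothetical stand-in for the exported model (not the H2O POJO)."""
    # Made-up linear rule, for illustration only.
    return 300.0 * features["sqft_living"] + 10_000.0 * features["bedrooms"]


def apply_model(source_topic: Iterable[str]) -> Iterator[str]:
    """Read JSON records, add a prediction, and emit JSON records."""
    for raw in source_topic:                      # read from the source topic
        record = json.loads(raw)
        record["price"] = predict_price(record)   # apply the model
        yield json.dumps(record)                  # write to the output topic


# A list stands in for the source topic; the record is a trimmed-down example.
source_topic = ['{"bedrooms": 2, "sqft_living": 2591, "city": "Seattle"}']
output_topic = list(apply_model(source_topic))    # start consuming
print(output_topic[0])
```

In the real application the same three stages would be expressed with the DSL calls listed earlier (stream(), map(), to()), and the stub would be replaced by a call into the model.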
Deploying the application
• Kafka Streams = a plain Java app
• Run the app:
  • On a VM (e.g. EC2)
  • In Kubernetes
  • …
Result
TIP #1: Think of how your ML framework can serve your models
Serving the model
• Your choice of ML framework constrains you!
• Embedded model or REST API?
• Are you OK running Python code in production?
  • (Spotify is not - JVM only)
Single app to deploy
Kafka topic → Kafka Streams application (with the ML model) → Kafka topic
2 apps to deploy
Kafka topic → Kafka Streams application → Kafka topic
Kafka Streams application → HTTP request → REST API (with the ML model)
TIP #2: Same feature engineering in streaming and training
Feature Engineering
• Need to apply the same transformations in batch & streaming
• Scaling, date conversions, text extraction…
• Challenging!
Feature Engineering
• Option 1: Use the same UDFs
• Option 2: Featran? (github.com/spotify/featran)
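Option 1 can be pictured as a single function that both code paths import. The sketch below is only an illustration in plain Python: the function names and the three derived features are hypothetical, chosen to mirror the "scaling, date conversions, text extraction" examples, and the record is a trimmed-down copy of the sample shown earlier.

```python
import math
from datetime import datetime


def engineer_features(record: dict) -> dict:
    """Single definition of the transformations, shared by both code paths."""
    sold = datetime.strptime(record["date"], "%Y-%m-%d %H:%M:%S")
    return {
        "log_sqft_living": math.log(record["sqft_living"]),  # scaling
        "sale_month": sold.month,                            # date conversion
        "zip": record["statezip"].split()[-1],               # text extraction
    }


def training_rows(history: list[dict]) -> list[dict]:
    """Batch path: transform the whole historical data set."""
    return [engineer_features(r) for r in history]


def streaming_row(event: dict) -> dict:
    """Streaming path: transform one record at a time."""
    return engineer_features(event)


sample = {"date": "2014-05-02 00:00:00", "sqft_living": 2591, "statezip": "WA 98155"}
print(streaming_row(sample))
```

Because both paths call the same function, a change to a transformation cannot reach training without also reaching the streaming application, provided both are rebuilt from the same code.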
TIP #3: Calculate statistics about your data (and compare)
Know your data
• Calculate statistics from your training data
• E.g. min, max, average, median… of the square footage
• E.g. nulls?
Check your new data
• In streaming:
  • Check: min, max, null…
• In batch:
  • Check: average, median…
‣ Outliers → Update the model?
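A minimal sketch of these checks, assuming the statistics are computed once from the training data and kept as a baseline. The function names and the square-footage values are made up for illustration; `profile` can be run on the training set and again on a later batch to compare averages and medians, while `check_streaming` is the cheap per-record test.

```python
import statistics
from typing import Optional


def profile(values: list[Optional[float]]) -> dict:
    """Summary statistics for one column, counting nulls separately."""
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "nulls": len(values) - len(present),
    }


def check_streaming(value: Optional[float], baseline: dict) -> list[str]:
    """Per-record check: only tests that need no aggregation."""
    if value is None:
        return ["null"]
    issues = []
    if value < baseline["min"]:
        issues.append("below training min")
    if value > baseline["max"]:
        issues.append("above training max")
    return issues


training_sqft = [1340, 1930, 2591, 3650, None]  # made-up training values
baseline = profile(training_sqft)
print(baseline)
print(check_streaming(9000, baseline))
```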
TIP #4: Enrich the data in your pipeline
Enrich your data
• Add features from other datasets
• Use reference datasets:
  • Zip code → details about the location
  • User ID → details about the user
  • IP address → location
  • …
2 apps to deploy
Kafka topic (KStream) → Kafka Streams enrichment (Join) → Kafka topic
Reference data → Reference data as a CDC stream (KTable) → Join
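The join in this diagram can be approximated in plain Python to show its semantics. This is a sketch, not Kafka Streams code: a dictionary plays the role of the KTable, `on_reference_change` and `enrich` are hypothetical names, and the reference row is made up. It follows two behaviours of a KStream-KTable inner join: a change event with a null value removes the key from the table, and a stream record with no matching key produces no output.

```python
from typing import Optional

# Plays the role of the KTable: the latest reference row per zip code.
reference_table: dict[str, dict] = {}


def on_reference_change(zip_code: str, row: Optional[dict]) -> None:
    """Apply one CDC event: upsert the row, or delete it when the row is None."""
    if row is None:
        reference_table.pop(zip_code, None)
    else:
        reference_table[zip_code] = row


def enrich(event: dict) -> Optional[dict]:
    """Inner-join one stream record with the table on its zip code."""
    details = reference_table.get(event["zip"])
    if details is None:
        return None  # inner join: no match, no output record
    return {**event, **details}


on_reference_change("98155", {"county": "King"})  # made-up reference row
print(enrich({"zip": "98155", "bedrooms": 2}))
```

Each stream record is joined against whatever the table holds when the record is processed, so a late reference update does not retroactively change records that were already emitted.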
TIP #5: Design your pipeline for testing
Testing
• Final test in real conditions
• Production data
• No real output
Pre-production testing
Input topic → Production application (ML model) → Output topic (production)
Input topic → Pre-prod application (ML model, different consumer group) → Output topic (pre-prod)
Compare the two output topics
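The "Compare" step could look like the following sketch. It assumes, purely for illustration, that the predictions from each output topic have already been collected into dictionaries keyed by record key; `compare_outputs` and the 5% tolerance are hypothetical, not something taken from the talk.

```python
def compare_outputs(
    prod: dict[str, float],
    preprod: dict[str, float],
    tolerance: float = 0.05,
) -> dict:
    """Compare predictions record by record, using a relative tolerance."""
    shared = prod.keys() & preprod.keys()
    diverging = sorted(
        key for key in shared
        if abs(preprod[key] - prod[key]) > tolerance * abs(prod[key])
    )
    return {
        "compared": len(shared),
        "diverging": diverging,
        "missing_in_preprod": sorted(prod.keys() - preprod.keys()),
    }


# Made-up predictions keyed by record key.
prod = {"a": 100.0, "b": 200.0, "c": 300.0}
preprod = {"a": 101.0, "b": 250.0}
report = compare_outputs(prod, preprod)
print(report)
```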
TIP #6: Design your pipeline for A/B testing
A/B testing
• 2 models running in parallel
• Compare the outputs
• Split:
  • Embedded model → partition-based split
  • REST API → more control
A/B (pre-split)
Input topic → Kafka Streams application A (ML model) → Output topic
Input topic → Kafka Streams application B (ML model) → Output topic
Both applications are in the same consumer group
A/B (post-split)
Input topic → Kafka Streams application A (ML model) → Temp topic A
Input topic → Kafka Streams application B (ML model) → Temp topic B
Temp topic A + Temp topic B → Merge → Output topic
The two applications are in different consumer groups
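For the pre-split layout, the partition-based split means a record's key decides its partition, and the instance that owns that partition decides which model scores it. The sketch below illustrates the idea under stated assumptions: `zlib.crc32` is a stand-in hash (it is not the hash Kafka's default partitioner uses), and the four partitions and the fixed `ASSIGNMENT` table are hypothetical, since in a real consumer group the assignment is decided by the group protocol.

```python
import zlib

NUM_PARTITIONS = 4

# Hypothetical assignment of partitions to the instance running each model.
ASSIGNMENT = {0: "A", 1: "B", 2: "A", 3: "B"}


def partition_for(key: str) -> int:
    """Map a record key to a partition with a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def variant_for(key: str) -> str:
    """Model variant that scores every record carrying this key."""
    return ASSIGNMENT[partition_for(key)]


for key in ("house-1", "house-2", "house-3"):
    print(key, partition_for(key), variant_for(key))
```

The split ratio is therefore tied to how partitions are distributed between the instances, which is why the REST API option is described as giving more control.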
TIP #7: Design your pipeline for model updates
Updates
• Update:
  • Embedded model → deploy a new version of the app
  • REST API → update the API
• Stop/start vs Rolling updates
TIP #8: Design your pipeline to be resilient to failures
High Availability
• Deploy more than one instance of your app
• Same consumer group / Different physical locations
High Availability
• Kafka brokers: one each in Availability Zone 1, Availability Zone 2 and Availability Zone 3
• Kafka Streams app instances: VM1 in Availability Zone 1, VM2 in Availability Zone 2
Summary & Conclusion
Summary
• Training in batch / Predictions in streaming
• Model: Embedded vs REST API
• Kafka + Kafka Streams
• Design for testing, H/A, updates…
• Blog: Realtime Machine Learning predictions with Kafka and H2O.ai
• https://aseigneurin.github.io/2018/09/05/realtime-machine-learning-predictions-wth-kafka-and-h2o.html
Thank you! @aseigneurin