Designing Data Pipelines for Machine Learning Applications
Alexis Seigneurin
March 14, 2019
Transcript
DESIGNING DATA PIPELINES FOR MACHINE LEARNING APPLICATIONS (WITH KAFKA AND
KAFKA STREAMS) Alexis Seigneurin
Me • Data Engineer • Kafka, Spark, AWS… • Blog:
aseigneurin.github.io • Twitter: @aseigneurin
Data Pipeline?
Data Pipeline • It is the journey of your data
• Ingest, transform, output… all in streaming • Kafka • Sometimes called a DAG • Apache Airflow
Data Pipeline - Example
Data Pipeline - Airflow
Key elements • Streaming • Transformations = Independent jobs •
(= different technologies?) • Data can be consumed by multiple consumers
Machine Learning Application?
Machine Learning Application • Use historical data to make predictions
on new data • Train a model on historical data • Apply the model on new data
Example - Real estate pricing
Source: https://www.kaggle.com/shree1992/housedata#data.csv (columns split into Features and a Label)
A batch oriented process • Training on a batch of
data • Can take hours (days?) • Validate + deploy the new model
Data Pipeline + ML?
Batch + Streaming • Training in batch to create a
model • Streaming app to use the model
Training • Done from time to time (e.g. every other
week) • With a fixed data set • Export a model • A new model is deployed when it has been validated by a Data Scientist
Streaming app • Use Kafka as a backbone • Kafka
Streams to implement transformations • Need a way to use the model on the JVM
Architecture
Batch: Labeled data → ML Training → ML Model
Streaming: Unlabeled data (in Kafka) → Apply the model (in Kafka Streams) → Data with predictions
ML frameworks
Python libraries • scikit-learn, TensorFlow, Keras… • ⚠ Need to
build and expose a REST API • ⚠ Scaling can be complicated
Spark ML • Integrated in Spark • ⚠ Have to
use Spark Streaming
Cloud-hosted services • AWS SageMaker, Google Cloud Machine Learning Engine…
• No code to write to serve the models • ⚠ Less control over how the model is served
H2O.ai • Java-based ML platform • Can download a POJO
Creating a model with H2O
Importing the data
Preparing the model • Set column types: numeric, enum… •
Split the dataset: 70/20/10 (training/validation/test) • Algorithm: Gradient Boosting Machine
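The 70/20/10 split above can be sketched framework-agnostically. This is an illustrative stand-in for H2O's built-in dataset splitter, not its actual API:

```python
import random

def split_dataset(rows, ratios=(0.7, 0.2, 0.1), seed=42):
    """Shuffle rows and split them into training/validation/test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n = len(rows)
    n_train = int(n * ratios[0])
    n_valid = int(n * ratios[1])
    train = rows[:n_train]
    valid = rows[n_train:n_train + n_valid]
    test = rows[n_train + n_valid:]
    return train, valid, test

train, valid, test = split_dataset(range(100))
```

Fixing the seed keeps the split reproducible across training runs, which matters when comparing candidate models.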
Model with H2O
Model with H2O
POJO of the Model
Using the model
In a simple app Model Data Prediction
In a simple app
Stream processing
Input record (Kafka topic):
{ "date": "2014-05-02 00:00:00", "bedrooms": 2, "bathrooms": 2.0, "sqft_living": 2591, "sqft_lot": 5182, "floors": 0.0, "waterfront": "0", "view": "0", "condition": "0", "sqft_above": 2591, "sqft_basement": 0, "yr_built": 1911, "yr_renovated": 1995, "street": "Burke-Gilman Trail", "city": "Seattle", "statezip": "WA 98155", "country": "USA" }
→ Kafka Streams application →
Output record (Kafka topic):
{ "date": "2014-05-02 00:00:00", "bedrooms": 2, "bathrooms": 2.0, "sqft_living": 2591, "sqft_lot": 5182, "floors": 0.0, "waterfront": "0", "view": "0", "condition": "0", "sqft_above": 2591, "sqft_basement": 0, "yr_built": 1911, "yr_renovated": 1995, "street": "Burke-Gilman Trail", "city": "Seattle", "statezip": "WA 98155", "country": "USA", "price": 781898.4215855601 }
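The transformation the slide shows — the same record coming out with a "price" field added — boils down to a per-message map step. A minimal sketch, where `predict_price` is a made-up stand-in for the exported H2O model, not its real API:

```python
import json

def predict_price(record):
    # Hypothetical model: in the real pipeline this call goes to the H2O POJO.
    return 50_000 + 300 * record["sqft_living"]

def process_message(raw_message):
    """Deserialize a Kafka message, apply the model, return the enriched record."""
    record = json.loads(raw_message)
    record["price"] = predict_price(record)
    return json.dumps(record)

out = process_message('{"sqft_living": 2591, "bedrooms": 2}')
```

All original fields pass through untouched; the prediction is appended, which keeps downstream consumers that ignore "price" working unchanged.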
Kafka Streams • Client library for Java and Scala •
DSL: stream(), map(), filter(), to()… • Aggregations, joins, windowing • KStream / KTable • Simple deployment model • Lets you build "microservices"
Kafka Streams app: read from the source topic → apply the model → write to the output topic → start consuming
Deploying the application • Kafka Streams = a plain Java
app • Run the app: • On a VM (e.g. EC2) • In Kubernetes • …
Result
Think of how your ML framework can serve your models
TIP #1
Serving the model • Your choice of ML framework constrains
you! • Embedded model or REST API? • Are you ok running Python code in production? • (Spotify is not - JVM only)
Single app to deploy: Kafka topic → Kafka Streams application (embedded ML model) → Kafka topic
2 apps to deploy: Kafka topic → Kafka Streams application → Kafka topic, with the ML model served behind a REST API (HTTP request per record)
Same feature engineering in streaming and training
TIP #2
Feature Engineering • Need to apply the same transformations in
batch & streaming • Scaling, date conversions, text extraction… • Challenging!
Feature Engineering • Option 1: Use the same UDFs •
Option 2: Featran? (github.com/spotify/featran)
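Option 1 above — sharing the same UDFs — can be as simple as keeping every transformation in one module that both the batch training job and the streaming app import. An illustrative sketch (function names and scaling bounds are made up for the example):

```python
from datetime import datetime

# Shared feature-engineering UDFs: the batch training job and the
# streaming app both import these, so features are computed identically.

def scale_sqft(sqft, max_sqft=10_000):
    """Min-max scale square footage into [0, 1]."""
    return min(sqft / max_sqft, 1.0)

def extract_sale_month(date_str):
    """Turn a raw date string into a month-of-year feature."""
    return datetime.strptime(date_str, "%Y-%m-%d %H:%M:%S").month

def featurize(record):
    """Single entry point used by both the batch and the streaming path."""
    return {
        "sqft_scaled": scale_sqft(record["sqft_living"]),
        "sale_month": extract_sale_month(record["date"]),
    }
```

The key property is that there is exactly one `featurize`; any drift between training-time and serving-time features becomes impossible by construction.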
Calculate statistics about your data (and compare)
TIP #3
Know your data • Calculate statistics from your training data
• E.g. min, max, average, median… of the square footage • E.g. nulls?
Check your new data • In streaming: • Check: min,
max, null… • In batch: • Check: average, median… ‣ Outliers → Update the model?
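The two halves of this tip fit together: compute reference statistics once from the training data, then check each streaming value against them. A minimal stdlib sketch (the 20% tolerance is an arbitrary choice for the example):

```python
from statistics import mean, median

def training_stats(values):
    """Reference statistics computed once from the training data."""
    return {"min": min(values), "max": max(values),
            "mean": mean(values), "median": median(values)}

def is_outlier(value, stats, tolerance=0.2):
    """Flag streaming values outside the training range (with some slack)."""
    span = stats["max"] - stats["min"]
    return (value < stats["min"] - tolerance * span or
            value > stats["max"] + tolerance * span)

# e.g. square footage seen during training:
stats = training_stats([900, 1200, 2500, 3100, 4000])
```

A spike in flagged records is the signal the slide describes: the live data has drifted and the model may need retraining.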
Enrich the data in your pipeline
TIP #4
Enrich your data • Add features from other datasets •
Use reference datasets: • Zip code → details about the location • User ID → details about the user • IP address → location • …
2 apps to deploy: Kafka topic → Kafka Streams enrichment (KStream joined with a KTable) → Kafka topic, with the reference data ingested into the KTable as a CDC stream
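The KStream-KTable join keeps the reference data as a continuously updated table and enriches each incoming event against it. A framework-agnostic sketch of the same idea, with the CDC stream materialized as a dict (field names are illustrative, not Kafka Streams API):

```python
# Reference data materialized from a CDC stream (the KTable side):
# each change event upserts the latest value for its key.
zip_table = {}

def apply_cdc_event(event):
    """Upsert one change event into the materialized reference table."""
    zip_table[event["zip"]] = event["details"]

def enrich(record):
    """Join an incoming event (the KStream side) against the reference table."""
    enriched = dict(record)
    enriched["location"] = zip_table.get(record["statezip"])
    return enriched

apply_cdc_event({"zip": "WA 98155",
                 "details": {"city": "Seattle", "median_income": 85_000}})
```

Because the table is fed by a change stream rather than loaded once, updates to the reference data take effect without redeploying the enrichment app.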
Design your pipeline for testing
TIP #5
Testing • Final test in real conditions • Production data
• No real output
Pre-production testing: the production application reads the input topic and writes to the production output topic with the current ML model; a pre-prod application, using a different consumer group, reads the same input topic, applies the candidate ML model, and writes to a pre-prod output topic. Compare the two outputs.
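Because the pre-prod application uses a different consumer group, both apps see every input record; the comparison then reduces to pairing records by key across the two output topics. A sketch of that comparison step (simulated with plain lists, not the Kafka consumer API; the 5% tolerance is an assumption for the example):

```python
def compare_outputs(prod_records, preprod_records, tolerance=0.05):
    """Pair prod and pre-prod predictions by record id and report
    the ids whose predictions diverge by more than the tolerance."""
    preprod_by_id = {r["id"]: r["price"] for r in preprod_records}
    diverging = []
    for r in prod_records:
        new_price = preprod_by_id.get(r["id"])
        if new_price is None:
            continue  # pre-prod hasn't processed this record yet
        if abs(new_price - r["price"]) > tolerance * r["price"]:
            diverging.append(r["id"])
    return diverging
```

The "no real output" constraint from the previous slide is preserved: the pre-prod topic is only ever read by this comparison job, never by downstream consumers.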
Design your pipeline for A/B testing
TIP #6
A/B testing • 2 models running in parallel • Compare
the outputs • Split: • Embedded model → partition-based split • REST API → more control
A/B (pre-split): two Kafka Streams applications in the same consumer group read the input topic, one embedding model A and the other model B, and both write to the output topic.
A/B (post-split): a Kafka Streams application splits the input topic into temp topic A and temp topic B; two applications with different consumer groups apply model A and model B respectively; their results are merged into the output topic.
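In the pre-split variant, which model scores a record is effectively decided by which partition its key hashes to, since the instances embedding model B own a subset of the partitions in the shared consumer group. A sketch of that routing logic (not Kafka's exact default partitioner; partition counts and the B share are assumptions):

```python
import hashlib

def assign_variant(key, num_partitions=12, b_partitions=frozenset({0, 1, 2})):
    """Route a record to model A or B based on the partition its key
    hashes to. Instances serving model B own partitions 0-2 here,
    giving B roughly a 25% share of the traffic."""
    digest = hashlib.md5(key.encode()).digest()
    partition = int.from_bytes(digest[:4], "big") % num_partitions
    return "B" if partition in b_partitions else "A"
```

The split is deterministic per key, so the same house always hits the same model, but the traffic share can only be tuned in whole-partition increments; a REST API gives finer control, as the slide notes.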
Design your pipeline for model updates
TIP #7
Updates • Update: • Embedded model → deploy a new
version of the app • REST API → update the API • Stop/start vs Rolling updates
Design your pipeline to be resilient to failures
TIP #8
High Availability • Deploy more than one instance of your
app • Same consumer group / Different physical locations
High Availability: Kafka brokers spread across Availability Zones 1, 2 and 3; Kafka Streams app instances on VM1 (Availability Zone 1) and VM2 (Availability Zone 2), in the same consumer group.
Summary & Conclusion
Summary • Training in batch / Predictions in streaming •
Model: Embedded vs REST API • Kafka + Kafka Streams • Design for testing, H/A, updates… • Blog: Realtime Machine Learning predictions with Kafka and H2O.ai • https://aseigneurin.github.io/2018/09/05/realtime-machine-learning-predictions-wth-kafka-and-h2o.html
Thank you! @aseigneurin