
Designing Data Pipelines for Machine Learning Applications

Alexis Seigneurin

March 14, 2019

1. Me
   • Data Engineer
   • Kafka, Spark, AWS…
   • Blog: aseigneurin.github.io
   • Twitter: @aseigneurin

2. Data Pipeline
   • It is the journey of your data
   • Ingest, transform, output… all in streaming
   • Kafka
   • Sometimes called a DAG
   • Apache Airflow

3. Key elements
   • Streaming
   • Transformations = independent jobs
   • (= different technologies?)
   • Data can be consumed by multiple consumers

4. Machine Learning Application
   • Use historical data to make predictions on new data
   • Train a model on historical data
   • Apply the model to new data

5. A batch-oriented process
   • Training on a batch of data
   • Can take hours (days?)
   • Validate + deploy the new model

6. Batch + Streaming
   • Training in batch to create a model
   • Streaming app to use the model

7. Training
   • Done from time to time (e.g. every other week)
   • With a fixed data set
   • Export a model
   • A new model is deployed when it has been validated by a Data Scientist

8. Streaming app
   • Use Kafka as a backbone
   • Kafka Streams to implement transformations
   • Need a way to use the model on the JVM

9. Architecture
   [Diagram] Batch side: labeled data → ML training → ML model. Streaming side: unlabeled data (in Kafka) → apply the model (in Kafka Streams) → data with predictions.

10. Python libraries
   • scikit-learn, TensorFlow, Keras…
   • ⚠ Need to build and expose a REST API
   • ⚠ Scaling can be complicated

11. Cloud-hosted services
   • AWS SageMaker, Google Cloud Machine Learning Engine…
   • No code to write to serve the models
   • ⚠ Less control over how the model is served

12. Preparing the model
   • Set column types: numeric, enum…
   • Split the dataset: 70/20/10 (training/validation/test), see the sketch below
   • Algorithm: Gradient Boosting Machine

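A minimal illustration of the 70/20/10 split, independent of any ML framework (the row type and seed are placeholders, not from the deck):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    public class DatasetSplitter {
        // Shuffle, then split 70/20/10 into training/validation/test
        public static <T> List<List<T>> split(List<T> rows, long seed) {
            List<T> shuffled = new ArrayList<>(rows);
            Collections.shuffle(shuffled, new Random(seed));
            int trainEnd = (int) (shuffled.size() * 0.7);
            int validEnd = (int) (shuffled.size() * 0.9);
            return List.of(
                shuffled.subList(0, trainEnd),                 // training (70%)
                shuffled.subList(trainEnd, validEnd),          // validation (20%)
                shuffled.subList(validEnd, shuffled.size()));  // test (10%)
        }
    }
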
13. Kafka Streams application: from one Kafka topic to another.
    Input record:
    {
      "date": "2014-05-02 00:00:00",
      "bedrooms": 2,
      "bathrooms": 2.0,
      "sqft_living": 2591,
      "sqft_lot": 5182,
      "floors": 0.0,
      "waterfront": "0",
      "view": "0",
      "condition": "0",
      "sqft_above": 2591,
      "sqft_basement": 0,
      "yr_built": 1911,
      "yr_renovated": 1995,
      "street": "Burke-Gilman Trail",
      "city": "Seattle",
      "statezip": "WA 98155",
      "country": "USA"
    }
    Output record (after stream processing): the same fields, plus the prediction:
    {
      ...,
      "price": 781898.4215855601
    }

14. Kafka Streams
   • Client library for Java and Scala
   • DSL: stream(), map(), filter(), to()...
   • Aggregations, joins, windowing
   • KStream / KTable
   • Simple deployment model
   • Allows you to create "microservices"

15. Kafka Streams app
   • Read from the source topic
   • Apply the model
   • Write to the output topic
   • Start consuming

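A minimal sketch of that flow in Java, assuming hypothetical topic names ("houses", "house-prices") and a hypothetical addPrediction() helper wrapping the embedded model (e.g. an exported H2O model, per the blog post in the summary):

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import java.util.Properties;

    public class PricePredictionApp {

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "house-price-predictor");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();

            // Read unlabeled JSON records from the source topic
            KStream<String, String> houses = builder.stream("houses");

            // Apply the embedded model to each record
            KStream<String, String> predictions = houses.mapValues(PricePredictionApp::addPrediction);

            // Write the enriched records to the output topic
            predictions.to("house-prices");

            // Start consuming
            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }

        // Hypothetical helper: score the record with the embedded model and
        // append a "price" field (crude string edit, for brevity only)
        private static String addPrediction(String json) {
            double price = 0.0; // model.predict(features) would go here
            return json.replaceFirst("\\}\\s*$", ", \"price\": " + price + " }");
        }
    }
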
16. Deploying the application
   • Kafka Streams = a plain Java app
   • Run the app:
     • On a VM (e.g. EC2)
     • In Kubernetes
     • …

17. Serving the model
   • Your choice of ML framework constrains you!
   • Embedded model or REST API?
   • Are you ok running Python code in production?
   • (Spotify is not - JVM only)

18. Feature Engineering
   • Need to apply the same transformations in batch & streaming
   • Scaling, date conversions, text extraction…
   • Challenging!

19. Feature Engineering
   • Option 1: use the same UDFs (see the sketch below)
   • Option 2: Featran? (github.com/spotify/featran)

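One way to read "use the same UDFs": keep each transformation as a pure function in a shared library that both the batch training job and the Kafka Streams app call. A minimal sketch; the column names and the min-max scaling are assumptions, not from the deck:

    // Shared feature-engineering UDFs, packaged as a library used by both
    // the batch training job and the streaming application
    public final class FeatureUdfs {

        private FeatureUdfs() {}

        // Min-max scaling; min and max come from the training data statistics
        public static double scaleSqftLiving(double sqftLiving, double min, double max) {
            return (sqftLiving - min) / (max - min);
        }

        // Extract the year from a "yyyy-MM-dd HH:mm:ss" date string
        public static int yearOf(String date) {
            return Integer.parseInt(date.substring(0, 4));
        }
    }
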
20. Know your data
   • Calculate statistics from your training data
   • E.g. min, max, average, median… of the square footage
   • E.g. nulls?

21. Check your new data
   • In streaming: check min, max, null…
   • In batch: check average, median…
   ‣ Outliers → update the model?

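A sketch of the streaming-side check, as a fragment of a Kafka Streams topology. The bounds, topic names, and the extractSqftLiving() helper are hypothetical; the idea is to route nulls and out-of-range values to a separate topic instead of silently scoring them:

    // Bounds computed offline from the training data (hypothetical values)
    final double MIN_SQFT = 290.0;
    final double MAX_SQFT = 13540.0;

    KStream<String, String> records = builder.stream("houses");

    // branch 0: records whose sqft_living is within the training range;
    // branch 1: nulls and outliers, sent to a dead-letter topic for review
    KStream<String, String>[] checked = records.branch(
        (key, json) -> {
            Double sqft = extractSqftLiving(json); // hypothetical JSON helper
            return sqft != null && sqft >= MIN_SQFT && sqft <= MAX_SQFT;
        },
        (key, json) -> true);

    checked[0].to("houses-validated");
    checked[1].to("houses-suspicious");
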
22. Enrich your data
   • Add features from other datasets
   • Use reference datasets:
     • Zip code → details about the location
     • User ID → details about the user
     • IP address → location
     • …

23. 2 apps to deploy
   [Diagram] An enrichment Kafka Streams app joins the main stream (KStream, from a Kafka topic) with reference data (KTable, fed by a CDC stream), then writes to an output Kafka topic.

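A fragment sketching that join, assuming the reference data lands in a compacted "zipcodes-cdc" topic; the topic names and the extractZipCode() / mergeJson() helpers are hypothetical:

    // Reference data, ingested as a CDC stream, read as an ever-updating KTable
    KTable<String, String> zipcodes = builder.table("zipcodes-cdc");

    // Main stream, re-keyed by zip code so it can be joined with the table
    KStream<String, String> houses = builder.<String, String>stream("houses")
        .selectKey((key, json) -> extractZipCode(json)); // hypothetical helper

    KStream<String, String> enriched = houses.join(zipcodes,
        (house, zipInfo) -> mergeJson(house, zipInfo)); // hypothetical helper

    enriched.to("houses-enriched");
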
24. Pre-production testing
   [Diagram] The production application and a pre-prod application, each with its own ML model, read the same input topic in different consumer groups; production writes to its output topic, pre-prod writes to a separate output topic (pre-prod), and the two outputs are compared.

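In Kafka Streams the consumer group is derived from application.id, so giving the pre-prod copy its own application.id (and its own output topic) is enough to re-consume the full input topic independently. A sketch, with hypothetical names:

    // Production instance
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "price-predictor-prod");
    // ... topology loads the current model and writes to "house-prices"

    // Pre-prod instance: a different application.id means a different
    // consumer group, so it reads the same input topic independently
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "price-predictor-preprod");
    // ... same topology, loading the candidate model and writing to
    // "house-prices-preprod", so the two outputs can be compared
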
25. A/B testing
   • 2 models running in parallel
   • Compare the outputs
   • Split:
     • Embedded model → partition-based split
     • REST API → more control

26. A/B (pre-split)
   [Diagram] Two Kafka Streams application instances, one with model A and one with model B, share the same consumer group on the input topic, so the partitions are split between them; both write to the same output topic.

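With an embedded model, the pre-split variant needs no routing code: the instance running model A and the instance running model B simply share one application.id, and Kafka's partition assignment does the splitting. A sketch:

    // Set on both instances (one loads model A, the other loads model B).
    // Same application.id = same consumer group: each instance is assigned a
    // subset of the input partitions, giving a partition-based A/B split.
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "price-predictor-ab");
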
27. A/B (post-split)
   [Diagram] The input topic is split into temp topic A and temp topic B; two Kafka Streams applications, in different consumer groups and each with its own ML model, process them; the results are merged into the output topic.

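A sketch of the post-split variant: a router splits the traffic by key hash, apps A and B (different application.ids, hence different consumer groups) score their temp topics, and a merger recombines the results. Topic names are hypothetical:

    // Router app (fragment): split the input roughly 50/50 by key hash
    KStream<String, String> input = builder.stream("houses");
    KStream<String, String>[] split = input.branch(
        (key, value) -> key != null && Math.abs(key.hashCode() % 2) == 0, // model A
        (key, value) -> true);                                            // model B
    split[0].to("houses-temp-a");
    split[1].to("houses-temp-b");

    // Apps A and B each apply their own model to their temp topic, writing
    // to "predictions-a" and "predictions-b" respectively (not shown)

    // Merger app (fragment): combine both prediction streams into one topic
    KStream<String, String> a = builder.stream("predictions-a");
    KStream<String, String> b = builder.stream("predictions-b");
    a.merge(b).to("house-prices");
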
28. Updates
   • Update:
     • Embedded model → deploy a new version of the app
     • REST API → update the API
   • Stop/start vs rolling updates

29. High Availability
   • Deploy more than one instance of your app
   • Same consumer group / different physical locations

30. High Availability
   [Diagram] Kafka brokers in Availability Zones 1, 2, and 3; Kafka Streams app instances on VM1 (Availability Zone 1) and VM2 (Availability Zone 2).

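A minimal sketch of that setup: both instances share one application.id (one consumer group), so Kafka splits the partitions between them and rebalances onto the surviving instance if a zone goes down. The broker addresses are hypothetical; the standby-replicas setting only matters if the topology holds local state:

    Properties props = new Properties();
    // Same application.id on VM1 and VM2 = same consumer group
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "house-price-predictor");
    // Bootstrap against brokers in several availability zones
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG,
        "broker-az1:9092,broker-az2:9092,broker-az3:9092");
    // Keep a warm copy of any local state on another instance
    props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
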
31. Summary
   • Training in batch / predictions in streaming
   • Model: embedded vs REST API
   • Kafka + Kafka Streams
   • Design for testing, H/A, updates…
   • Blog: "Realtime Machine Learning predictions with Kafka and H2O.ai"
     https://aseigneurin.github.io/2018/09/05/realtime-machine-learning-predictions-wth-kafka-and-h2o.html