Designing Data Pipelines for Machine Learning Applications

Alexis Seigneurin

March 14, 2019

Transcript

  1. DESIGNING DATA PIPELINES
    FOR MACHINE LEARNING APPLICATIONS
    (WITH KAFKA AND KAFKA STREAMS)
    Alexis Seigneurin

  2. Me
    • Data Engineer
    • Kafka, Spark, AWS…
    • Blog: aseigneurin.github.io
    • Twitter: @aseigneurin

  3. Data Pipeline?

  4. Data Pipeline
    • It is the journey of your data
    • Ingest, transform, output… all in streaming
    • Kafka
    • Sometimes called a DAG
    • Apache Airflow

  5. Data Pipeline - Example

  6. Data Pipeline - Airflow

  7. Key elements
    • Streaming
    • Transformations = Independent jobs
    • (= different technologies?)
    • Data can be consumed by multiple consumers

  8. Machine Learning Application?

  9. Machine Learning Application
    • Use historical data to make predictions on new data
    • Train a model on historical data
    • Apply the model on new data
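
A minimal sketch of the train-then-apply idea in plain Python, assuming a single feature (square footage) and an ordinary least-squares fit. The data points and function names are hypothetical and far simpler than a real model.

```python
from statistics import mean

def train(history):
    """Fit price = slope * sqft + intercept by ordinary least squares."""
    xs = [sqft for sqft, _ in history]
    ys = [price for _, price in history]
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in history)
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept

def predict(model, sqft):
    """Apply a trained model to a new, unlabeled data point."""
    slope, intercept = model
    return slope * sqft + intercept

# Hypothetical historical data: (square footage, sale price)
history = [(1000, 200_000.0), (1500, 300_000.0), (2000, 400_000.0)]
model = train(history)            # batch step, done on historical data
estimate = predict(model, 2591)   # applied to new data
```

Training and prediction are deliberately separate functions: the first can run offline, the second only needs the fitted parameters.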

  10. Example - Real estate pricing
    Source: https://www.kaggle.com/shree1992/housedata#data.csv
    Features
    Label

  11. A batch oriented process
    • Training on a batch of data
    • Can take hours (days?)
    • Validate + deploy the new model

  12. Data Pipeline + ML?

  13. Batch + Streaming
    • Training in batch to create a model
    • Streaming app to use the model

  14. Training
    • Done from time to time (e.g. every other week)
    • With a fixed data set
    • Export a model
    • A new model is deployed once a Data Scientist has validated it

  15. Streaming app
    • Use Kafka as a backbone
    • Kafka Streams to implement transformations
    • Need a way to use the model on the JVM

  16. Architecture
    • Batch: Labeled data → ML Training → ML Model
    • Streaming: Unlabeled data (in Kafka) → Apply the model (in Kafka Streams) → Data with predictions

  17. ML frameworks

  18. Python libraries
    • scikit-learn, TensorFlow, Keras…
    • ⚠ Need to build and expose a REST API
    • ⚠ Scaling can be complicated

  19. Spark ML
    • Integrated in Spark
    • ⚠ Have to use Spark Streaming

  20. Cloud-hosted services
    • AWS SageMaker, Google Cloud Machine Learning Engine…
    • No code to write to serve the models
    • ⚠ Less control over how the model is served

  21. H2O.ai
    • Java-based ML platform
    • Can download a POJO

  22. Creating a model with H2O

  23. Importing the data

  24. Preparing the model
    • Set column types: numeric, enum…
    • Split the dataset: 70/20/10 (training/validation/test)
    • Algorithm: Gradient Boosting Machine
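
The 70/20/10 split can be sketched in plain Python as follows. This is an illustration only, not H2O's own splitting routine; the function name and the fixed seed are assumptions.

```python
import random

def split_dataset(rows, fractions=(0.7, 0.2, 0.1), seed=42):
    """Shuffle rows, then cut them into training / validation / test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    n_train = round(len(shuffled) * fractions[0])
    n_valid = round(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]      # whatever remains
    return train, valid, test

train, valid, test = split_dataset(range(100))
```

The training set fits the model, the validation set guides tuning, and the test set is held back for a final, unbiased check.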

  25. Model with H2O

  26. Model with H2O

  27. POJO of the Model

  28. Using the model

  29. In a simple app
    Model
    Data
    Prediction

  30. In a simple app

  31. Stream processing
    Kafka topic (input):
    {
      "date": "2014-05-02 00:00:00",
      "bedrooms": 2,
      "bathrooms": 2.0,
      "sqft_living": 2591,
      "sqft_lot": 5182,
      "floors": 0.0,
      "waterfront": "0",
      "view": "0",
      "condition": "0",
      "sqft_above": 2591,
      "sqft_basement": 0,
      "yr_built": 1911,
      "yr_renovated": 1995,
      "street": "Burke-Gilman Trail",
      "city": "Seattle",
      "statezip": "WA 98155",
      "country": "USA"
    }
    → Kafka Streams application →
    Kafka topic (output):
    {
      "date": "2014-05-02 00:00:00",
      "bedrooms": 2,
      "bathrooms": 2.0,
      "sqft_living": 2591,
      "sqft_lot": 5182,
      "floors": 0.0,
      "waterfront": "0",
      "view": "0",
      "condition": "0",
      "sqft_above": 2591,
      "sqft_basement": 0,
      "yr_built": 1911,
      "yr_renovated": 1995,
      "street": "Burke-Gilman Trail",
      "city": "Seattle",
      "statezip": "WA 98155",
      "country": "USA",
      "price": 781898.4215855601
    }
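
The per-message transformation above amounts to "parse, predict, add a field, re-serialize". A plain-Python sketch of that step follows; `predict_price` is a hypothetical stand-in for the exported model (a flat price per square foot), and the real application would do this inside Kafka Streams on the JVM.

```python
import json

def predict_price(features):
    """Hypothetical stand-in for the exported model: flat price per square foot."""
    return 300.0 * features["sqft_living"]

def process(message):
    """Parse one input message, add the predicted price, and re-serialize it."""
    record = json.loads(message)
    record["price"] = predict_price(record)
    return json.dumps(record)

incoming = '{"bedrooms": 2, "sqft_living": 2591, "city": "Seattle"}'
outgoing = process(incoming)
```

All input fields pass through unchanged; the only difference between the two topics is the added "price" field.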

  32. Kafka Streams
    • Client library for Java and Scala
    • DSL: stream(), map(), filter(), to()...
    • Aggregations, joins, windowing
    • KStream / KTable
    • Simple deployment model
    • Lets you build "microservices"

  33. Kafka Streams app
    • Read from the source topic
    • Apply the model
    • Write to the output topic
    • Start consuming

  34. Deploying the application
    • Kafka Streams = a plain Java app
    • Run the app:
    • On a VM (e.g. EC2)
    • In Kubernetes
    • …

  35. Result

  36. TIP #1: Think of how your ML framework can serve your models

  37. Serving the model
    • Your choice of ML framework constrains you!
    • Embedded model or REST API?
    • Are you ok running Python code in production?
    • (Spotify is not - JVM only)

  38. Single app to deploy
    Kafka Streams
    application
    Kafka topic Kafka topic
    ML model

  39. 2 apps to deploy
    Kafka Streams
    application
    Kafka topic Kafka topic
    ML model
    HTTP request
    REST API

  40. TIP #2: Same feature engineering in streaming and training

  41. Feature Engineering
    • Need to apply the same transformations in batch & streaming
    • Scaling, date conversions, text extraction…
    • Challenging!

  42. Feature Engineering
    • Option 1: Use the same UDFs
    • Option 2: Featran? (github.com/spotify/featran)
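
One way to read Option 1, sketched in plain Python: keep every transformation in a single function and call it from both the training job and the streaming job. The feature names and the transformations chosen here are illustrative assumptions, not the talk's actual UDFs.

```python
from datetime import datetime

def engineer_features(record):
    """Single source of truth for feature transformations."""
    sold = datetime.strptime(record["date"], "%Y-%m-%d %H:%M:%S")
    return {
        "sqft_living_k": record["sqft_living"] / 1000.0,  # scaling
        "sale_month": sold.month,                         # date conversion
        "state": record["statezip"].split()[0],           # text extraction
        "house_age": sold.year - record["yr_built"],
    }

def training_features(batch):
    """Batch path: transform the whole historical data set."""
    return [engineer_features(r) for r in batch]

def streaming_features(record):
    """Streaming path: transform one incoming record."""
    return engineer_features(record)

record = {"date": "2014-05-02 00:00:00", "sqft_living": 2591,
          "statezip": "WA 98155", "yr_built": 1911}
```

Because both paths delegate to the same function, a change to a transformation cannot silently apply to training but not to serving.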

  43. TIP #3: Calculate statistics about your data (and compare)

  44. Know your data
    • Calculate statistics from your training data
    • E.g. min, max, average, median… of the square footage
    • E.g. nulls?

  45. Check your new data
    • In streaming:
    • Check: min, max, null…
    • In batch:
    • Check: average, median…
    ‣ Outliers → Update the model?
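
A plain-Python sketch of these checks, assuming one numeric feature (square footage). The sample values and function names are hypothetical.

```python
from statistics import mean, median

def profile(values):
    """Summary statistics computed once from the training data."""
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "median": median(present),
        "nulls": len(values) - len(present),
    }

def check_record(value, baseline):
    """Streaming check: cheap per-record tests against the training range."""
    if value is None:
        return ["null"]
    issues = []
    if value < baseline["min"]:
        issues.append("below training min")
    if value > baseline["max"]:
        issues.append("above training max")
    return issues

def mean_drift(batch_values, baseline):
    """Batch check: relative shift of the mean versus the training data."""
    return abs(profile(batch_values)["mean"] - baseline["mean"]) / baseline["mean"]

training_sqft = [1340, 1930, 2000, 2591, 3650, None]
baseline = profile(training_sqft)
```

Min, max and null checks need only the current record, so they fit a streaming job; averages and medians need a window or a batch of new data.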

  46. TIP #4: Enrich the data in your pipeline

  47. Enrich your data
    • Add features from other datasets
    • Use reference datasets:
    • Zip code → details about the location
    • User ID → details about the user
    • IP address → location
    • …
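
Enrichment of this kind is a lookup join between the event stream and a table of reference data. The sketch below imitates it in plain Python with a dict as the table; the reference values are hypothetical, and in Kafka Streams the stream would first have to be keyed by the join key (here the zip code).

```python
# Reference "table": latest known details per zip code (hypothetical values).
zip_table = {}

def apply_change(event):
    """Apply one change event from the reference stream (upsert or delete)."""
    key, value = event
    if value is None:
        zip_table.pop(key, None)   # a null value acts as a delete
    else:
        zip_table[key] = value

def enrich(record):
    """Left-join a stream record with the current state of the table."""
    zip_code = record["statezip"].split()[1]
    details = zip_table.get(zip_code, {})
    return {**record, **details}

apply_change(("98155", {"median_income": 90000}))
enriched = enrich({"statezip": "WA 98155", "sqft_living": 2591})
unmatched = enrich({"statezip": "WA 98001", "sqft_living": 1500})
```

Records without a match pass through unchanged here; whether to drop them instead (an inner join) is a design decision for the pipeline.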

  48. 2 apps to deploy
    Kafka Streams
    enrichment
    Kafka topic Kafka topic
    Reference data
    as a CDC stream
    Join
    KStream
    KTable
    Reference data

  49. TIP #5: Design your pipeline for testing

  50. Testing
    • Final test in real conditions
    • Production data
    • No real output

  51. Pre-production testing
    Production
    application
    Input topic Output topic
    (production)
    ML model
    Pre-prod
    application
    Output topic
    (pre-prod)
    ML model
    Different
    consumer group
    Compare
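
The "Compare" step could look like the following plain-Python sketch, which assumes both output topics have been read into dicts keyed by record ID and that production prices are non-zero. The keys, prices and 5% tolerance are hypothetical.

```python
def compare_outputs(prod, preprod, tolerance=0.05):
    """Flag records that are missing in pre-prod or deviate beyond the tolerance."""
    flagged = {}
    for key, prod_price in prod.items():
        if key not in preprod:
            flagged[key] = "missing in pre-prod"
            continue
        relative_diff = abs(preprod[key] - prod_price) / abs(prod_price)
        if relative_diff > tolerance:
            flagged[key] = f"differs by {relative_diff:.1%}"
    return flagged

prod_output = {"house-1": 500000.0, "house-2": 300000.0, "house-3": 800000.0}
preprod_output = {"house-1": 510000.0, "house-2": 360000.0}
report = compare_outputs(prod_output, preprod_output)
```

Because the pre-prod application uses a different consumer group, it reads the same production input without disturbing the production application's offsets.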

  52. TIP #6: Design your pipeline for A/B testing

  53. A/B testing
    • 2 models running in parallel
    • Compare the outputs
    • Split:
    • Embedded model → partition-based split
    • REST API → more control
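
A partition-based split can be pictured as follows: each record key maps to a partition, and each partition is owned by exactly one application instance, hence one model. This plain-Python sketch uses CRC32 as a stand-in hash (Kafka's default partitioner uses a different hash function), and the partition count and assignment are hypothetical.

```python
import zlib

NUM_PARTITIONS = 4
# Hypothetical assignment: which model variant owns which partitions.
ASSIGNMENT = {0: "A", 1: "A", 2: "B", 3: "B"}

def partition_for(key):
    """Map a record key to a partition (stand-in for the producer's partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

def variant_for(key):
    """The model variant that will score records with this key."""
    return ASSIGNMENT[partition_for(key)]

variants = {key: variant_for(key) for key in ("house-1", "house-2", "house-3")}
```

The split ratio is only as fine-grained as the partition assignment, which is why a REST API in front of the models gives more control.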

  54. A/B (pre-split)
    Kafka Streams
    application
    Input topic Output topic
    ML model
    Kafka Streams
    application
    ML model
    Same
    consumer group
    A
    B

  55. A/B (post-split)
    Kafka Streams
    application
    Input topic
    Temp topic A
    ML model
    Kafka Streams
    application
    Temp topic B
    ML model
    Merge
    A
    B
    Different
    consumer groups
    Output topic

  56. TIP #7: Design your pipeline for model updates

  57. Updates
    • Update:
    • Embedded model → deploy a new version of the app
    • REST API → update the API
    • Stop/start vs Rolling updates

  58. TIP #8: Design your pipeline to be resilient to failures

  59. High Availability
    • Deploy more than one instance of your app
    • Same consumer group / Different physical locations
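
Why the same consumer group gives high availability: partitions are divided among the live instances of the group, and when one instance disappears its partitions move to the survivors. The plain-Python sketch below uses a simple round-robin assignment as an assumption; Kafka's real assignors are more sophisticated, and the instance names are hypothetical.

```python
def assign(partitions, instances):
    """Spread partitions round-robin over the live instances of one consumer group."""
    live = sorted(instances)
    return {p: live[i % len(live)] for i, p in enumerate(partitions)}

partitions = [0, 1, 2, 3]
before = assign(partitions, {"vm1-az1", "vm2-az2"})
after = assign(partitions, {"vm2-az2"})   # vm1, or its whole zone, is lost
```

Placing the instances in different physical locations means a single zone outage removes only some of them, so processing continues on the rest.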

  60. High Availability
    • Kafka brokers: one in each of Availability Zones 1, 2 and 3
    • Kafka Streams app: VM1 / Availability Zone 1
    • Kafka Streams app: VM2 / Availability Zone 2

  61. Summary & Conclusion

  62. Summary
    • Training in batch / Predictions in streaming
    • Model: Embedded vs REST API
    • Kafka + Kafka Streams
    • Design for testing, H/A, updates…
    • Blog: Realtime Machine Learning predictions with Kafka and H2O.ai
    • https://aseigneurin.github.io/2018/09/05/realtime-machine-learning-predictions-wth-kafka-and-h2o.html

  63. Thank you!
    @aseigneurin
