Designing Data Pipelines for Machine Learning Applications

Alexis Seigneurin

March 14, 2019

Transcript

  1. DESIGNING DATA PIPELINES FOR MACHINE LEARNING APPLICATIONS (WITH KAFKA AND

    KAFKA STREAMS) Alexis Seigneurin
  2. Me • Data Engineer • Kafka, Spark, AWS… • Blog:

    aseigneurin.github.io • Twitter: @aseigneurin
  3. Data Pipeline?

  4. Data Pipeline • It is the journey of your data

    • Ingest, transform, output… all in streaming (e.g. with Kafka) • Sometimes called a DAG (e.g. in Apache Airflow)
  5. Data Pipeline - Example

  6. Data Pipeline - Airflow

  7. Key elements • Streaming • Transformations = Independent jobs •

    (= different technologies?) • Data can be consumed by multiple consumers
  8. Machine Learning Application?

  9. Machine Learning Application • Use historical data to make predictions

    on new data • Train a model on historical data • Apply the model on new data
  10. Example - Real estate pricing (columns = features, price = label) • Source: https://www.kaggle.com/shree1992/housedata#data.csv
  11. A batch oriented process • Training on a batch of

    data • Can take hours (days?) • Validate + deploy the new model
  12. Data Pipeline + ML?

  13. Batch + Streaming • Training in batch to create a

    model • Streaming app to use the model
  14. Training • Done from time to time (e.g. every other

    week) • With a fixed data set • Export a model • A new model is deployed when it has been validated by a Data Scientist
  15. Streaming app • Use Kafka as a backbone • Kafka

    Streams to implement transformations • Need a way to use the model on the JVM
  16. Architecture ML Model ML Training Labeled data Unlabeled data (in

    Kafka) Apply the model (in Kafka Streams) Data with predictions Batch Streaming
  17. ML frameworks

  18. Python libraries • scikit-learn, TensorFlow, Keras… • ⚠ Need to

    build and expose a REST API • ⚠ Scaling can be complicated
  19. Spark ML • Integrated in Spark • ⚠ Have to

    use Spark Streaming
  20. Cloud-hosted services • AWS SageMaker, Google Cloud Machine Learning Engine…

    • No code to write to serve the models • ⚠ Less control over how the model is served
  21. H2O.ai • Java-based ML platform • Can download a POJO

  22. Creating a model with H2O

  23. Importing the data

  24. Preparing the model • Set column types: numeric, enum… •

    Split the dataset: 70/20/10 (training/validation/test) • Algorithm: Gradient Boosting Machine
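The 70/20/10 split above is done inside H2O, but the idea can be sketched in plain Java. The class, method names, and seed are illustrative, not H2O's API:

```java
import java.util.*;

// Sketch of a 70/20/10 training/validation/test split. A fixed seed makes
// the split reproducible, which matters when a Data Scientist has to
// validate the resulting model. Names here are illustrative, not H2O's API.
public class DatasetSplit {
    public static Map<String, List<double[]>> split(List<double[]> rows, long seed) {
        Random rnd = new Random(seed); // fixed seed => reproducible split
        Map<String, List<double[]>> out = new HashMap<>();
        out.put("training", new ArrayList<>());
        out.put("validation", new ArrayList<>());
        out.put("test", new ArrayList<>());
        for (double[] row : rows) {
            double r = rnd.nextDouble();
            if (r < 0.7) out.get("training").add(row);       // ~70%
            else if (r < 0.9) out.get("validation").add(row); // ~20%
            else out.get("test").add(row);                    // ~10%
        }
        return out;
    }
}
```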
  25. Model with H2O

  26. Model with H2O

  27. POJO of the Model

  28. Using the model

  29. In a simple app Model Data Prediction

  30. In a simple app

  31. Kafka topic → Kafka Streams application → Kafka topic

    Input record: { "date": "2014-05-02 00:00:00", "bedrooms": 2, "bathrooms": 2.0, "sqft_living": 2591, "sqft_lot": 5182, "floors": 0.0, "waterfront": "0", "view": "0", "condition": "0", "sqft_above": 2591, "sqft_basement": 0, "yr_built": 1911, "yr_renovated": 1995, "street": "Burke-Gilman Trail", "city": "Seattle", "statezip": "WA 98155", "country": "USA" }

    Output record: { "date": "2014-05-02 00:00:00", "bedrooms": 2, "bathrooms": 2.0, "sqft_living": 2591, "sqft_lot": 5182, "floors": 0.0, "waterfront": "0", "view": "0", "condition": "0", "sqft_above": 2591, "sqft_basement": 0, "yr_built": 1911, "yr_renovated": 1995, "street": "Burke-Gilman Trail", "city": "Seattle", "statezip": "WA 98155", "country": "USA", "price": 781898.4215855601 }
  32. Kafka Streams • Client library for Java and Scala •

    DSL: stream(), map(), filter(), to()... • Aggregations, joins, windowing • KStream / KTable • Simple deployment model • Lets you create "microservices"
  33. Kafka Streams app Read from the source topic Apply the

    model Write to the output topic Start consuming
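The per-record step above (the function you would pass to the DSL's mapValues()) can be sketched as a plain value mapper. Model is a hypothetical stand-in for the H2O-generated POJO, and the Kafka Streams wiring itself is omitted:

```java
import java.util.*;

// Sketch of the per-record transformation inside the Kafka Streams app:
// read a record, apply the model, emit the record plus the prediction.
// "Model" is a hypothetical stand-in for the H2O-generated POJO.
public class PricePredictor {
    public interface Model {
        double predict(Map<String, Object> features);
    }

    private final Model model;

    public PricePredictor(Model model) { this.model = model; }

    // Input: the parsed JSON record as a map. Output: a copy with "price" added.
    public Map<String, Object> apply(Map<String, Object> record) {
        Map<String, Object> out = new LinkedHashMap<>(record);
        out.put("price", model.predict(record));
        return out;
    }
}
```

In the real topology this mapper sits between the source and sink topics, e.g. builder.stream(inputTopic).mapValues(predictor::apply).to(outputTopic).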
  34. Deploying the application • Kafka Streams = a plain Java

    app • Run the app: • On a VM (e.g. EC2) • In Kubernetes • …
  35. Result

  36. TIP #1: Think of how your ML framework can serve your models
  37. Serving the model • Your choice of ML framework constrains

    you! • Embedded model or REST API? • Are you ok running Python code in production? • (Spotify is not - JVM only)
  38. Single app to deploy Kafka Streams application Kafka topic Kafka

    topic ML model
  39. 2 apps to deploy Kafka Streams application Kafka topic Kafka

    topic ML model HTTP request REST API
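With the two-app layout above, each prediction becomes an HTTP request from the Kafka Streams app to the model's REST API. A minimal sketch using the JDK's HttpClient types; the endpoint URL is hypothetical:

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch of the request the Kafka Streams app would send to a model-serving
// REST API. The endpoint URL is hypothetical; the JSON body is the record
// to score. Sending (HttpClient.send) and error handling are omitted.
public class ModelClient {
    public static HttpRequest buildRequest(String jsonRecord) {
        return HttpRequest.newBuilder()
                .uri(URI.create("http://model-service/predict")) // hypothetical endpoint
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonRecord))
                .build();
    }
}
```

Note the trade-off versus the embedded model: one extra network hop per record, but the model can be updated and scaled independently of the streaming app.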
  40. TIP #2: Same feature engineering in streaming and training
  41. Feature Engineering • Need to apply the same transformations in

    batch & streaming • Scaling, date conversions, text extraction… • Challenging!
  42. Feature Engineering • Option 1: Use the same UDFs •

    Option 2: Featran? (github.com/spotify/featran)
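Option 1 amounts to putting each transformation in one class that both the batch training job and the Kafka Streams app depend on. A minimal sketch with min-max scaling; the class name and the idea of scaling square footage are illustrative:

```java
// Sketch of Option 1: a single transformation class shared by the batch
// training job and the streaming app, so both produce identical features.
// The min/max bounds would be computed once from the training data;
// the choice of scaling square footage is illustrative.
public class SqftScaler {
    private final double min, max;

    public SqftScaler(double min, double max) {
        this.min = min;
        this.max = max;
    }

    // Min-max scaling to [0, 1] - the exact same code path in batch and streaming.
    public double scale(double sqft) {
        return (sqft - min) / (max - min);
    }
}
```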
  43. TIP #3: Calculate statistics about your data (and compare)
  44. Know your data • Calculate statistics from your training data

    • E.g. min, max, average, median… of the square footage • E.g. nulls?
  45. Check your new data • In streaming: • Check: min,

    max, null… • In batch: • Check: average, median… ‣ Outliers → Update the model?
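The streaming-side checks above can be sketched as a small monitor that tracks nulls and flags values outside the range seen at training time. The class and the choice of monitoring sqft_living are illustrative:

```java
// Sketch of TIP #3 on the streaming side: count nulls and flag records whose
// value falls outside the [min, max] range computed from the training data.
// A growing outlier count is a hint that the model may need retraining.
public class SqftMonitor {
    private final double trainMin, trainMax; // from training-data statistics
    private long count, nulls, outliers;

    public SqftMonitor(double trainMin, double trainMax) {
        this.trainMin = trainMin;
        this.trainMax = trainMax;
    }

    // Called for every streamed record; returns true if the value is an outlier.
    public boolean check(Double sqftLiving) {
        count++;
        if (sqftLiving == null) { nulls++; return false; }
        boolean outlier = sqftLiving < trainMin || sqftLiving > trainMax;
        if (outlier) outliers++;
        return outlier;
    }

    public long nulls() { return nulls; }
    public long outliers() { return outliers; }
}
```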
  46. TIP #4: Enrich the data in your pipeline
  47. Enrich your data • Add features from other datasets •

    Use reference datasets: • Zip code → details about the location • User ID → details about the user • IP address → location • …
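The record-level effect of such an enrichment (what a KStream-KTable join produces per record) can be sketched as a lookup against an in-memory reference table. The zip-code table and the "neighborhood" field are hypothetical examples:

```java
import java.util.*;

// Sketch of TIP #4: enrich each streamed record from a reference dataset,
// as a KStream-KTable join would. The map plays the role of the KTable;
// the zip-code -> neighborhood mapping is a hypothetical example.
public class ZipEnricher {
    private final Map<String, String> zipToNeighborhood; // reference data

    public ZipEnricher(Map<String, String> zipToNeighborhood) {
        this.zipToNeighborhood = zipToNeighborhood;
    }

    public Map<String, Object> enrich(Map<String, Object> record) {
        Map<String, Object> out = new LinkedHashMap<>(record);
        String zip = (String) record.get("statezip");
        out.put("neighborhood", zipToNeighborhood.getOrDefault(zip, "unknown"));
        return out;
    }
}
```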
  48. 2 apps to deploy Kafka Streams enrichment Kafka topic Kafka

    topic Reference data as a CDC stream Join KStream KTable Reference data
  49. TIP #5: Design your pipeline for testing

  50. Testing • Final test in real conditions • Production data

    • No real output
  51. Pre-production testing Production application Input topic Output topic (production) ML

    model Pre-prod application Output topic (pre-prod) ML model Different consumer group Compare
  52. TIP #6: Design your pipeline for A/B testing
  53. A/B testing • 2 models running in parallel • Compare

    the outputs • Split: • Embedded model → partition-based split • REST API → more control
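The partition-based split for embedded models can be sketched as follows: a key is routed to a partition (mirroring Kafka's default of hashing the key modulo the partition count), and each model's app instances own a fixed subset of partitions. The 50/50 split is illustrative:

```java
// Sketch of a partition-based A/B split with embedded models: the partition
// a key lands on decides which model scores it, because each app instance
// (carrying model A or model B) is assigned a fixed subset of partitions.
// Partitioning here mirrors Kafka's default key-hash routing; the 50/50
// split point is illustrative.
public class AbSplit {
    public static int partitionFor(String key, int numPartitions) {
        return Math.abs(key.hashCode() % numPartitions);
    }

    // First half of the partitions -> model A, second half -> model B.
    public static String modelFor(String key, int numPartitions) {
        return partitionFor(key, numPartitions) < numPartitions / 2 ? "A" : "B";
    }
}
```

This gives a stable, deterministic split with no extra infrastructure, but the ratio is fixed by the partition assignment; the REST API variant gives finer control over the split.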
  54. A/B (pre-split) Kafka Streams application Input topic Output topic ML

    model Kafka Streams application ML model Same consumer group A B
  55. A/B (post-split) Kafka Streams application Input topic Temp topic A

    ML model Kafka Streams application Temp topic B ML model Merge A B Different consumer groups Output topic
  56. TIP #7: Design your pipeline for model updates
  57. Updates • Update: • Embedded model → deploy a new

    version of the app • REST API → update the API • Stop/start vs Rolling updates
  58. TIP #8: Design your pipeline to be resilient to failures
  59. High Availability • Deploy more than one instance of your

    app • Same consumer group / Different physical locations
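Sharing a consumer group across instances comes down to configuration: a sketch of the relevant Kafka Streams settings (the property names are real Kafka Streams configuration keys; the values are illustrative):

```properties
# Sketch of a highly available Kafka Streams configuration. Both instances
# use the same application.id (= consumer group), so Kafka spreads the
# partitions across them and rebalances if one instance dies.
application.id=house-price-predictor
bootstrap.servers=broker1:9092,broker2:9092,broker3:9092
# Keep a hot standby copy of each state store on another instance
num.standby.replicas=1
```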
  60. High Availability Kafka broker Availability Zone 2 Kafka broker Availability

    Zone 1 Kafka broker Availability Zone 3 Kafka Streams app VM1 / Availability Zone 1 Kafka Streams app VM2 / Availability Zone 2
  61. Summary & Conclusion

  62. Summary • Training in batch / Predictions in streaming •

    Model: Embedded vs REST API • Kafka + Kafka Streams • Design for testing, H/A, updates… • Blog: Realtime Machine Learning predictions with Kafka and H2O.ai • https://aseigneurin.github.io/2018/09/05/realtime-machine-learning-predictions-wth-kafka-and-h2o.html
  63. Thank you! @aseigneurin