Large-scale Experimentation with Spark & Productionizing Native Spark ML Models - Masood Krohy

Presented by Masood Krohy at the May 21, 2019 event of montrealml.dev

Title: Large-scale Experimentation with Spark & Productionizing Native Spark ML Models

Summary: Apache Spark is a state-of-the-art distributed processing, analytics and ML engine. We present and demo two ways to use Spark in ML projects. 1) We use Spark to distribute the grid-search optimization of a generic ML model from a regular, single-machine ML library. Spark distributes the independent training tasks over the CPU cores of a cluster, which gives a near-linear speedup and makes it practical to explore a much larger space when searching for the optimal hyperparameters. This use case suits projects that do not involve Big Data, where a Big Data technology (Spark) is used purely to speed up processing. 2) We train an example model with Spark's own ML library and serve it with MLeap, a production-quality, low-latency serving engine. This second workflow suits projects that do involve Big Data.
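
The first use case relies on every hyperparameter combination being an independent task. The sketch below illustrates that pattern only; it is not the code from the talk. A local thread pool stands in for the Spark executors so the example runs without a cluster, and `evaluate`, the grid values, and the toy error function are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def evaluate(params):
    """Hypothetical stand-in for 'train a single-machine model, return its error'."""
    lr, depth = params
    # Toy error surface (an assumption) whose minimum is at lr=0.1, depth=4.
    error = (lr - 0.1) ** 2 + 0.01 * (depth - 4) ** 2
    return {"lr": lr, "depth": depth, "error": error}


# Every combination is an independent task (perfectly parallel).
grid = list(product([0.01, 0.1, 1.0], [2, 4, 8]))

# Local stand-in for the cluster. With PySpark and a SparkContext `sc`,
# the same function could be shipped to the executors as:
#   results = sc.parallelize(grid, len(grid)).map(evaluate).collect()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(evaluate, grid))

# Pick the combination with the lowest error.
best = min(results, key=lambda r: r["error"])
print(best)
```

Because the tasks share no state, the speedup is bounded mainly by the number of available cores and the cost of the slowest single task.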

Bio: Masood Krohy is a Data Science Platform Architect/Advisor and most recently acted as the Chief Architect of UniAnalytica, an advanced data science platform with wide, out-of-the-box support for time-series and geospatial use cases. He has worked with several corporations in different industries in the past few years to design, implement and productionize Deep Learning and Big Data products. He holds a Ph.D. in computer engineering.

PatternedScience

May 21, 2019

Transcript

  1. Copyright © 2019, PatternedScience Inc.
    www.patterned.science
    Spark: Large-scale Experimentation + Productionizing Native Spark ML Models
    Presenter: Masood Krohy, Ph.D.
    May 21, 2019

  2. Presentation Layout
    01 Intros
    ● Presenter bio
    ● Why have several modes for doing ML
    ● Quick intro to Spark and a tour of Web UIs
    02 Large-scale Experimentation with Spark
    ● Distributed ML model optimization with Spark
    ● Parallel coordinates visualization
    03 ML on Big Data with Spark
    ● Training an ML model with Spark’s own ML lib
    ● Productionizing the model with MLeap

  3. Presenter Bio: Masood Krohy, Data Science Platform Architect & Advisor
    ● Ph.D. in Computer Engineering: Analytical modeling of botnets. Validated by data collected in industry. 3 top publications.
    ● Senior Analyst, Rogers: Managing the analytics reporting/statistical analyses of the national benchmarking program.
    ● Data Scientist, Intact: First Data Scientist of the company. Led the Big Data mining project for the UBI program.
    ● Lead Data Scientist, CN: Implemented an object-within-object detection system to detect cracks in railway equipment.
    ● Sr Data Science Advisor, B.Yond: Implemented a pattern detection system for a stream of alarms coming from telecom devices.
    ● Chief Architect, UniAnalytica (advanced data science platform): Platform contains Apache Spark, MLeap, and Anaconda, among many others.
    Timeline (years shown on the slide): 2013, 2014, 2016, 2017, 2018, 2019

  4. UniAnalytica Platform: Machine Learning Stack
    1. Horovod & TensorFlow: Distributed Deep Learning with TensorFlow and Horovod (large datasets, data parallelism)
    2. Spark & TensorFlow/scikit-learn: Distributed grid search with Spark and TensorFlow/scikit-learn (small datasets, perfectly parallel)
    3. Ray Tune & TensorFlow/scikit-learn: Intelligent, distributed hyperparam search with Asynchronous Hyperband, Ray Tune, and TensorFlow/scikit-learn (small datasets, perfectly parallel)
    4. ML on images: Images - TensorFlow Object Detection API (intro)
    5. Interpretable AI: Images - Classification with visual explanation for classifications using Class Activation Maps
    Additional pointers
    ● Standard use of Spark for ML on Big Data is of course supported
    ● Legacy (2016): TensorSpark (contributed to run it in production in yarn-cluster mode)

  5. Graph source: Databricks

  12. Code Walkthrough & Live Demo
    Notebooks/Scripts
    ● Zeppelin note: ML Distributed GridSearch with Spark
    ● Jupyter notebook: Grid search results analysis using multidimensional visualization (Parallel Coordinates plot)
    ● Jupyter notebook: ARIMA model with daily data (retraining the best model on each new bar and making a prediction for the following bar)
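
The retrain-on-each-new-bar loop described for the ARIMA notebook can be sketched as follows. The notebook's actual code is not shown in the deck, so this is only an illustration of the walk-forward pattern: a least-squares AR(1) fit without intercept is swapped in for ARIMA to keep the example dependency-free, and `fit_ar1`, `walk_forward`, and the sample series are hypothetical.

```python
def fit_ar1(history):
    """Least-squares AR(1) coefficient without intercept: x[t] ~ phi * x[t-1]."""
    num = sum(cur * prev for cur, prev in zip(history[1:], history[:-1]))
    den = sum(prev * prev for prev in history[:-1])
    return num / den if den else 0.0


def walk_forward(series, warmup):
    """Refit on all bars seen so far, then forecast the following bar."""
    predictions = []
    for t in range(warmup, len(series)):
        history = series[:t]                   # bars available at time t
        phi = fit_ar1(history)                 # retrain on each new bar
        predictions.append(phi * history[-1])  # one-step-ahead forecast
    return predictions


# Toy daily series (an assumption): each bar doubles the previous one.
series = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
preds = walk_forward(series, warmup=3)
print(preds)
```

Each forecast uses only data available before the bar it predicts, which keeps the evaluation free of look-ahead.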

  13. Parallel Coordinates plot
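
A parallel coordinates plot draws one vertical axis per hyperparameter (plus the score) and one polyline per trial. The sketch below shows only the data transformation behind such a plot, under the assumption that each axis is min-max scaled to [0, 1]; the function name and the trial values are hypothetical and not taken from the demo.

```python
def to_parallel_coordinates(rows, axes):
    """Scale every axis to [0, 1] and return one polyline per row."""
    lo = {a: min(r[a] for r in rows) for a in axes}
    hi = {a: max(r[a] for r in rows) for a in axes}
    lines = []
    for r in rows:
        line = []
        for x, a in enumerate(axes):
            span = hi[a] - lo[a]
            # A constant axis has no range; place its points mid-axis.
            y = (r[a] - lo[a]) / span if span else 0.5
            line.append((x, y))  # x = axis position, y = scaled value
        lines.append(line)
    return lines


# Hypothetical grid-search results.
trials = [
    {"lr": 0.01, "depth": 2, "error": 0.30},
    {"lr": 0.10, "depth": 4, "error": 0.10},
    {"lr": 1.00, "depth": 8, "error": 0.50},
]
lines = to_parallel_coordinates(trials, ["lr", "depth", "error"])
for line in lines:
    print(line)
```

For actual plotting, `pandas.plotting.parallel_coordinates` accepts a DataFrame and a class column and draws the axes and polylines directly.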

  15. Code Walkthrough & Live Demo
    Notebooks/Scripts
    ● Zeppelin note: Spark ML model training on Big Data & exporting the trained model with MLeap
    ● Jupyter notebook: serving the trained model with MLeap and the client code (only shell commands; notebook is used for documentation)

  16. Q&A
