Large-scale Experimentation with Spark & Productionizing Native Spark ML Models - Masood Krohy

Masood Krohy at May 21, 2019 event of montrealml.dev

Title: Large-scale Experimentation with Spark & Productionizing Native Spark ML Models

Summary: Apache Spark is a state-of-the-art distributed processing, analytics and ML engine. We present and demo two ways to use Spark in ML projects: 1) We use Spark to distribute the grid-search optimization of a generic ML model (from a regular, single-machine ML library). We show how Spark can distribute processing tasks over the CPU cores of a cluster, which gives a near-linear speedup and lowers processing times; this makes it practical to explore a much larger space when searching for the optimal hyperparameters of the ML model. This use case suits projects that do not involve Big Data, where a Big Data technology (Spark) is used purely to speed up the processing of tasks. 2) We demonstrate how to train an example model using Spark's own ML library and how to serve the model with MLeap, a production-quality, low-latency serving engine. This second use case/workflow suits projects that do involve Big Data.
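
The first use case can be sketched in a few lines. This is only an illustration under stated assumptions, not the code shown in the talk: a local thread pool stands in for the Spark executors, and `evaluate`, `build_grid`, `grid_search`, the toy loss and the parameter names are all hypothetical. With PySpark available, the same `evaluate` function could be handed to something like `sc.parallelize(grid).map(evaluate).collect()` instead of the pool.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def evaluate(params):
    """Score one hyperparameter combination (toy objective, lower is better)."""
    # Hypothetical stand-in for fitting and validating a single-machine model.
    loss = (params["learning_rate"] - 0.1) ** 2 + (params["depth"] - 4) ** 2
    return params, loss


def build_grid(space):
    """Expand {name: [values]} into a list of parameter dicts."""
    names = sorted(space)
    return [dict(zip(names, combo)) for combo in product(*(space[n] for n in names))]


def grid_search(space, workers=4):
    """Evaluate every grid point in parallel and return (best, all results)."""
    grid = build_grid(space)
    # Each combination is independent of the others, so the map step is
    # perfectly parallel; a thread pool plays the role of the cluster here.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(evaluate, grid))
    best = min(results, key=lambda r: r[1])
    return best, results


# Hypothetical search space: 3 x 3 = 9 combinations.
space = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}
best, results = grid_search(space)
print(best)
```

Because the grid points share no state, adding workers (or cluster cores) shortens the wall-clock time roughly in proportion, which is the near-linear speedup the summary refers to.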

Bio: Masood Krohy is a Data Science Platform Architect/Advisor and most recently acted as the Chief Architect of UniAnalytica, an advanced data science platform with wide, out-of-the-box support for time-series and geospatial use cases. He has worked with several corporations in different industries in the past few years to design, implement and productionize Deep Learning and Big Data products. He holds a Ph.D. in computer engineering.

PatternedScience

May 21, 2019

Transcript

  1. Spark: Large-scale Experimentation + Productionizing Native Spark ML Models. Presenter: Masood Krohy, Ph.D. May 21, 2019. Copyright © 2019, PatternedScience Inc. www.patterned.science
  2. Presentation Layout
    01 Intros
    • Presenter bio
    • Why have several modes for doing ML
    • Quick intro to Spark and a tour of Web UIs
    02 Large-scale Experimentation with Spark
    • Distributed ML model optimization with Spark
    • Parallel coordinates visualization
    03 ML on Big Data with Spark
    • Training an ML model with Spark’s own ML lib
    • Productionizing the model with MLeap
  3. Presenter Bio: Masood Krohy, Data Science Platform Architect & Advisor
    • Ph.D. in Computer Engineering: Analytical modeling of botnets. Validated by data collected in industry. 3 top publications.
    • Senior Analyst, Rogers: Managing the analytics reporting/statistical analyses of the national benchmarking program.
    • Data Scientist, Intact: First Data Scientist of the company. Led the Big Data mining project for the UBI program.
    • Lead Data Scientist, CN: Implemented an object-within-object detection system to detect cracks in railway equipment.
    • Sr Data Science Advisor, B.Yond: Implemented a pattern detection system for the stream of alarms coming from telecom devices.
    • Chief Architect, UniAnalytica (advanced data science platform): Platform contains Apache Spark, MLeap, and Anaconda, among many others.
    Timeline years shown on the slide: 2013, 2014, 2016, 2017, 2018, 2019
  4. Machine Learning Stack, UniAnalytica Platform
    1. Horovod & TensorFlow: Distributed Deep Learning with TensorFlow and Horovod (large datasets, data parallelism)
    2. Spark & TensorFlow/scikit-learn: Distributed grid search with Spark and TensorFlow/scikit-learn (small datasets, perfectly parallel)
    3. Ray Tune & TensorFlow/scikit-learn: Intelligent, distributed hyperparam search with Asynchronous Hyperband, Ray Tune, and TensorFlow/scikit-learn (small datasets, perfectly parallel)
    4. ML on images: Images - TensorFlow Object Detection API (intro)
    5. Interpretable AI: Images - Classification with visual explanation for classifications using Class Activation Maps
    Additional pointers
    • Standard use of Spark for ML on Big Data is of course supported
    • Legacy (2016): TensorSpark (contributed to run it in production in yarn-cluster mode)
  5. Graph source: Databricks

  11. Presentation Layout (same content as slide 2)
  12. Code Walkthrough & Live Demo: Notebooks/Scripts
    • Zeppelin note: ML Distributed GridSearch with Spark
    • Jupyter notebook: Grid search results analysis using multidimensional visualization (Parallel Coordinates plot)
    • Jupyter notebook: ARIMA model with daily data (retraining the best model on each new bar and making a prediction for the following bar)
  13. Parallel Coordinates plot

  14. Presentation Layout (same content as slide 2)
  15. Code Walkthrough & Live Demo: Notebooks/Scripts
    • Zeppelin note: Spark ML model training on Big Data & exporting the trained model with MLeap
    • Jupyter notebook: serving the trained model with MLeap and the client code (only shell commands; notebook is used for documentation)
  16. Q&A
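
Slide 12 mentions an ARIMA notebook that retrains the best model on each new bar and predicts the following bar. That walk-forward loop might look roughly like the sketch below. It is an illustration only, not the notebook's code: to stay dependency-free, a naive moving-average forecaster replaces ARIMA, and `fit`, `predict_next`, `walk_forward`, the `window` parameter and the sample series are all hypothetical.

```python
from statistics import mean


def fit(history, window):
    """Hypothetical stand-in for model fitting: keep the last `window` values."""
    return history[-window:]


def predict_next(model):
    """Forecast the following bar as the mean of the remembered values."""
    return mean(model)


def walk_forward(series, window, start):
    """Refit on each new bar and forecast the bar that follows it."""
    forecasts = []
    for t in range(start, len(series)):
        # Only bars strictly before t are visible when forecasting bar t.
        model = fit(series[:t], window)
        forecasts.append(predict_next(model))
    return forecasts


# Made-up daily values, for illustration only.
series = [10.0, 12.0, 11.0, 13.0, 14.0, 13.0]
forecasts = walk_forward(series, window=2, start=3)
errors = [abs(f - actual) for f, actual in zip(forecasts, series[3:])]
print(forecasts, errors)
```

The point of the loop is that each forecast uses only data available before the bar being predicted, so the resulting errors reflect how the model would have behaved if it had been retrained daily in production.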