
Machine Learning With H2O vs SparkML

Arnab Biswas
June 01, 2018

Two popular tools for doing Machine Learning on top of the JVM ecosystem are H2O and Spark ML. This presentation compares the two as Machine Learning libraries (it does not consider Spark's data munging capabilities). This work was done during June 2018.

Transcript

1. H2O: Open Source, In-Memory, Distributed Machine Learning Tool • Open Source (Apache 2.0) • In-Memory (faster) • Distributed (big data / no sampling) • Third version (stable) • Easy to use • Mission: "How do we get this to work efficiently at big data scale?" http://docs.h2o.ai/
2. Multiple Language Support • R, Python, Scala, Java, JSON, JavaScript, Web Interface (Flow) • The entire library is embedded inside a jar file • Written in Java, so it naturally supports Java & Scala • R, Python, JavaScript, Excel, Tableau, and Flow communicate with H2O clusters using REST API calls • Easy to switch between R/Python/Java/Flow environments (a minimal Python sketch follows)
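Since every client talks to the cluster through the same REST API, a few lines of Python are enough to stand up a local H2O and load data that then becomes visible from R or Flow as well. A minimal sketch, assuming the h2o package is installed and using a hypothetical CSV path:

```python
# Minimal sketch with the h2o Python package. h2o.init() starts (or
# attaches to) a local H2O JVM; every subsequent Python call is a REST
# request against that JVM.
import h2o

h2o.init()  # one-node cluster on localhost

# "iris.csv" is a hypothetical local path; the resulting frame lives in
# the H2O cluster and is visible from R, Flow, or any other client.
frame = h2o.import_file("iris.csv")
frame.describe()
```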
3. H2O: Advantages • Uses in-memory compression (2-4 times smaller than gzip) • Data frames are much smaller in memory and on disk • Handles billions of data rows in-memory, even with a small cluster • Data gets distributed across multiple JVMs • Models on the whole data set (no sampling) • Faster training/prediction time • The larger the data set, the better the performance • Includes a web-based GUI, Flow (easy for non-programmers to use) • However, not very impressive! • Easy to deploy models in production • Checkpoint: continue training an existing model with new data (sketch below) • Iterative Methods (???) https://en.wikipedia.org/wiki/H2O_(software)
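The checkpoint feature mentioned above can be sketched as follows; a minimal example with hypothetical file and column names, growing an existing GBM from 50 to 100 trees instead of retraining from scratch:

```python
# Hedged sketch of H2O checkpointing: resume training an existing GBM.
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()
train = h2o.import_file("train.csv")          # hypothetical path
train["label"] = train["label"].asfactor()    # hypothetical target column

gbm = H2OGradientBoostingEstimator(ntrees=50)
gbm.train(y="label", training_frame=train)

# Continue from the previous model's trees up to ntrees=100.
gbm_more = H2OGradientBoostingEstimator(ntrees=100, checkpoint=gbm.model_id)
gbm_more.train(y="label", training_frame=train)
```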
4. Clustering (1/2) • Can be deployed on a single node / multi-node cluster / Hadoop cluster / Apache Spark cluster • Clustering increases the speed of computation • Hadoop/Spark is NOT mandatory for clustering • Multi-node cluster with a shared memory model • All computation is in-memory • Each node sees only some rows of the data • No limit to cluster size • Distributed Data Frames (collections of vectors) • Columns are distributed across nodes (see the connection sketch below) - https://stackoverflow.com/questions/47894205/which-the-benefits-of-sparking-water-over-h20-machine-learning-library - https://stackoverflow.com/questions/48697292/is-there-any-performance-difference-for-ml-training-between-h2o-multi-node-clust/48697638#48697638
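Attaching a client to a running multi-node cluster looks the same as working locally. A minimal sketch, assuming a cluster is already up; the IP and port are hypothetical:

```python
# Hedged sketch: point the Python client at an existing H2O cluster
# instead of starting a new local JVM. IP and port are hypothetical.
import h2o

h2o.init(ip="10.0.0.5", port=54321)
h2o.cluster().show_status()  # nodes, memory, and health of the cluster
```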
5. Clustering: Limitations (2/2) • For small data, clustering introduces slowness • Find the sweet spot between data size & number of nodes • Each node in the cluster should be the same size (recommended) • New nodes cannot be added once the cluster starts up • If any machine dies, the whole cluster must be rebuilt • If a single node is removed, the whole cluster becomes unusable • Nodes should be physically close, to minimize network latency • Each node must run the same version of h2o.jar
6. Productionizing H2O 1. Build a model using Python/R/Java/Flow 2. Download the model (as a POJO or MOJO) as a zip file 3. Download the resulting h2o-genmodel.jar (a library that supports scoring) 4. Invoke the model from a Java class to generate predictions • Can easily be embedded inside a Java application (a Python-side sketch of steps 1-3 follows) http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html
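Steps 1-3 can be driven from the Python client; a minimal sketch with hypothetical data and paths (step 4, scoring, then happens from Java against the downloaded MOJO and h2o-genmodel.jar):

```python
# Hedged sketch: train a model and export it as a MOJO for production.
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()
train = h2o.import_file("train.csv")        # hypothetical path
train["label"] = train["label"].asfactor()  # hypothetical target column

model = H2OGradientBoostingEstimator()
model.train(y="label", training_frame=train)

# get_genmodel_jar=True also downloads h2o-genmodel.jar next to the MOJO.
mojo_path = model.download_mojo(path="/tmp/models", get_genmodel_jar=True)
print(mojo_path)
```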
7. H2O Flow • Web-based interactive client environment • Similar to a Jupyter Notebook • Usable by non-programmers as well (mouse clicks!) • Combines code execution, text, mathematics, plots & rich media in a single document • Allows: uploading data, viewing data uploaded directly or through other clients, building models, viewing models built directly or through other clients, generating predictions, viewing predictions generated directly or through other clients, and checking cluster/CPU status
8. Algorithms • Supervised: Cox Proportional Hazards, Deep Learning, Distributed Random Forest, Generalized Linear Model, Gradient Boosting Machine, Naïve Bayes Classifier, Stacked Ensembles, XGBoost • Unsupervised: Aggregator, Generalized Low Rank Models (GLRM), K-Means Clustering, Principal Component Analysis (PCA) • Miscellaneous: Word2vec • Common: Quantiles, Early Stopping (a DRF with early stopping is sketched below) https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/flow/images/H2O-Algorithms-Road-Map.pdf
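As one concrete example from the list, a Distributed Random Forest combined with early stopping (listed under "Common") could look like this; a minimal sketch with hypothetical data and column names:

```python
# Hedged sketch: DRF with early stopping on a binary target.
import h2o
from h2o.estimators import H2ORandomForestEstimator

h2o.init()
train = h2o.import_file("train.csv")        # hypothetical path
train["label"] = train["label"].asfactor()  # hypothetical target column

drf = H2ORandomForestEstimator(
    ntrees=500,
    stopping_rounds=3,         # stop after 3 rounds without improvement
    stopping_metric="AUC",
    stopping_tolerance=1e-3,
)
drf.train(y="label", training_frame=train)
print(drf.auc(train=True))
```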
9. H2O Ecosystem • H2O • Steam • Enterprise Steam • Sparkling Water • Driverless AI • H2O4GPU
10. H2O Steam • End-to-end platform that streamlines the entire process of building and deploying applications • Cluster Manager: start/stop clusters, allocate memory, start/pause/stop H2O instances; secure multi-tenant environment • Model Manager: build, store, manage, compare, and promote (historical) models; run A/B tests between models • Scoring Server: deploys a model; scoring through REST API or in-app
11. Sparkling Water (1/3) • Combines the fast, scalable machine learning algorithms of H2O with the capabilities of Spark • Provides a way to launch the H2O service on each Spark executor in the Spark cluster, forming an H2O cluster (sketch below) • "Certified on Spark"
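From PySpark this launch is a one-liner; a minimal sketch, assuming the pysparkling package (Sparkling Water 2.x) and an existing SparkSession named spark:

```python
# Hedged sketch: start H2O nodes inside the Spark executors.
from pysparkling import H2OContext

hc = H2OContext.getOrCreate(spark)  # forms the H2O cluster on the executors
print(hc)                           # prints the H2O Flow / REST endpoints
```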
12. Sparkling Water – Use Cases (2/3) • Use Case 1: The data pipeline consists of multiple data transformations using the Spark API; the final form of the data is converted into an H2O frame and passed to an H2O algorithm (sketch below) • Use Case 2: The data pipeline uses H2O's parallel data load and parse capabilities, while the Spark API serves as another provider of data transformations; H2O can also be used as an in-place data transformer http://docs.h2o.ai/sparkling-water/2.4/latest-stable/doc/index.html
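A minimal sketch of Use Case 1, assuming the spark session and H2OContext hc from the previous sketch, hypothetical data and column names, and the as_h2o_frame conversion method of Sparkling Water 2.x: munge with Spark, then hand the result to an H2O algorithm.

```python
# Hedged sketch: Spark-side transformations, H2O-side training.
from h2o.estimators import H2OGradientBoostingEstimator

df = spark.read.csv("transactions.csv", header=True, inferSchema=True)
df = df.filter(df.amount > 0)          # Spark API data munging (hypothetical columns)

train = hc.as_h2o_frame(df, "train")   # copy the result into the H2O cluster
train["label"] = train["label"].asfactor()

gbm = H2OGradientBoostingEstimator()
gbm.train(y="label", training_frame=train)
```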
13. Sparkling Water – Use Cases (3/3) • Use Case 3: 1. The offline training pipeline, invoked regularly, uses the Spark & H2O APIs and produces an H2O model as output; the model is exported in a form independent of the H2O run-time 2. The streaming data pipeline (using Spark Streaming) uses the model trained in the first pipeline to score incoming data; since the model is exported with no run-time dependency on H2O, the streaming pipeline can be lightweight and independent of the H2O/Sparkling Water infrastructure
14. Spark (MLlib) vs H2O • Spark is better at the data preparation and data munging steps • H2O is faster than the algorithms in Spark MLlib • MLlib underperforms in terms of memory, CPU and time • H2O provides a web interface (Flow) for data visualization • H2O and MLlib have overlapping algorithms • H2O is better for productionization • The POJO/MOJO approach is friendlier to integrate with Java applications • Allows visualizing evaluation metrics and tracking jobs and job statuses • H2O has built-in grid search (Spark ML offers ParamGridBuilder with CrossValidator for tuning; a sketch of H2O's grid search follows) • Spark has better community support • H2O has enterprise support • Check the slide on References
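H2O's grid search is driven from the same client APIs; a minimal sketch with hypothetical data and column names:

```python
# Hedged sketch: grid search over two GBM hyperparameters.
import h2o
from h2o.estimators import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init()
train = h2o.import_file("train.csv")        # hypothetical path
train["label"] = train["label"].asfactor()  # hypothetical target column

grid = H2OGridSearch(
    model=H2OGradientBoostingEstimator,
    hyper_params={"max_depth": [4, 6, 8], "learn_rate": [0.01, 0.1]},
)
grid.train(y="label", training_frame=train)

# Rank the trained models by AUC.
print(grid.get_grid(sort_by="auc", decreasing=True))
```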
15. Case Study I: Migration From Spark MLlib To H2O (1/3) • Needs of the "iyzico" fraud detection product: • Continuous delivery: models need to be continuously deployed to production • Real-time fraud detection: prediction time of at most 100 ms • High availability & scalability • Low learning curve: the stack should be usable by data scientists & software developers • Open source • Fast: fast prototyping & deployment • On premise • Initial choice: prediction.io + Spark ML Source: https://iyzico.engineering/spark-ml-to-h2o-migration-for-machine-learning-in-iyzico-dcba86b8eab2
16. Case Study I (2/3) • Benchmarked TensorFlow, Spark ML, and H2O (the winner) against these criteria: • Simplicity of deploying an existing model (from a local environment) to production • POJO-based models are easy to deploy in a Java environment; release management and the DevOps cycle are easy • Hardware requirements for training: memory needed to train on 1 million transactions & 100 features with RF (64 trees): Spark ML 16 GB RAM, TensorFlow 10 GB, H2O 2 GB • Decision trees and Bayesian models • Python, R, SQL support • Experimentation in a local environment: experiments can be done with Python, R • Prediction time (ms)
17. Case Study I (3/3) • Feature engineering and the data pipeline were in Java 8, so no migration was needed • Migration from Spark ML + prediction.io to H2O • 60 GB RAM saved (Spark ML & prediction.io needed it for model training) • 12 cores saved (Spark ML & prediction.io needed these cores to reduce model training time) • Response time decreased almost 10 times (300 milliseconds to 35 milliseconds)
18. Benchmarking ML Libraries https://github.com/szilard/benchm-ml • Training data • Number of rows varied: 10K, 100K, 1M, 10M • ~1K features • Binary classification problem • Hardware (single instance) • Amazon EC2 c3.8xlarge (32 cores, 60 GB RAM) • If out of memory, an r3.8xlarge instance (32 cores, 250 GB RAM) • Observations • Training time • Maximum memory usage during training • AUC (predictive accuracy)
19. Random Forest • H2O: fast, uses all cores, more accurate • Memory efficient (1M rows: 5 GB, 10M rows: 25 GB) • Spark MLlib: slower • Larger memory footprint • Runs out of memory at n = 1M • With 250 GB, finishes for 1M but crashes for 10M • AUC degraded at 1M • Spark 2.0 is even slower • XGBoost: fast, high accuracy • Memory efficient (1M rows: 2 GB, 10M rows: 9 GB)
20. Gradient Boosting Machines • Two configurations tested: learn_rate=0.01, max_depth=16, n_trees=1000 and learn_rate=0.1, max_depth=6, n_trees=300 • The memory footprint of GBMs is smaller than for RF; the bottleneck is mainly training time • Spark is memory-inefficient (especially for deeper trees) & crashes; works for shallow trees • H2O and XGBoost are the fastest
21. Performance of various GBM implementations • For deployment, H2O has the best ways to deploy as a real-time (fast-scoring) application https://github.com/szilard/GBM-perf
22. Benchmarking For Bigger Data: Do I need Big Data? • Single instance vs cluster: sending data over a network vs using shared memory • Several distributed systems have significant computation & memory overhead • The map-reduce style communication pattern is not the best fit for many ML algorithms
23. Netflix VectorFlow • Minimalist library • Specifically optimized for training on sparse data • Single-machine, multi-core environment
24. Benchmarking For Bigger Data • Not enough clarity about the hardware used • For tree-based ensembles (RF, GBM), H2O and XGBoost can train on 100M records on a single server, though training times run to several hours [Charts: training times on a single node vs multiple nodes]
25. Disadvantages • No high availability (HA) for clusters • Doesn't work well on sparse data • GPU support is in alpha stage • No SVM • Clustering helps with big data; small data instead needs a single fast machine with lots of cores
26. Do I need Spark to run H2O? - https://stackoverflow.com/questions/47894205/which-the-benefits-of-sparking-water-over-h20-machine-learning-library - https://stackoverflow.com/questions/48697292/is-there-any-performance-difference-for-ml-training-between-h2o-multi-node-clust/48697638#48697638
27. H2O: POJO vs MOJO • POJOs are not supported for source files larger than 1 GB • MOJOs are supported for AutoML, Deep Learning, DRF, GBM, GLM, GLRM, K-Means, Stacked Ensembles, SVM, Word2vec, and XGBoost models • POJOs are not supported for XGBoost, GLRM, or Stacked Ensembles models http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html
28. Spark ML vs Spark MLlib • https://spark.apache.org/docs/latest/ml-guide.html