Machine Learning With H2O vs SparkML

Arnab Biswas
June 01, 2018

Two popular tools for doing Machine Learning on top of the JVM ecosystem are H2O and SparkML. This presentation compares the two as Machine Learning libraries (it does not cover Spark's data munging capabilities). This work was done in June 2018.

Transcript

  1. H2O: Open Source, In-Memory, Distributed Machine Learning Tool
     • Open Source (Apache 2.0)
     • In-Memory (Faster)
     • Distributed (Big Data/No Sampling)
     • Third Version (Stable)
     • Easy To Use
     • Mission: "How do we get this to work efficiently at big data scale?"
     http://docs.h2o.ai/
  2. Multiple Language Support
     • R, Python, Scala, Java, JSON, JavaScript, Web Interface (Flow)
     • Entire library is embedded inside a jar file
     • Written in Java, so it naturally supports Java & Scala
     • R, Python, JavaScript, Excel, Tableau and Flow communicate with H2O clusters using REST API calls (see the sketch below)
     • Easy to switch between R/Python/Java/Flow environments
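
     A minimal sketch of the Python client, assuming a local cluster; the file path is illustrative. The client starts (or attaches to) an H2O JVM and talks to it over the REST API mentioned above:

         import h2o

         # Start a local H2O instance, or attach to one already
         # running, over its REST API.
         h2o.init()

         # Once loaded, the same frame is visible to every client
         # (R, Python, Flow) connected to this cluster.
         frame = h2o.import_file("iris.csv")  # illustrative local path
         frame.describe()
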
  3. H2O: Advantages
     • Uses in-memory compression (2-4 times smaller than gzip)
     • Data frames are much smaller in memory and on disk
     • Handles billions of data rows in-memory, even with a small cluster
     • Data gets distributed across multiple JVMs
     • Models are trained on the whole data set (no sampling)
     • Faster training/prediction time; the larger the data set, the better the relative performance
     • Comes with Flow, a web-based GUI that is easy for non-programmers to use (however, not very impressive!)
     • Easy to deploy models in production
     • Checkpointing: continue training an existing model with new data (see the sketch below)
     • Iterative Methods (???)
     https://en.wikipedia.org/wiki/H2O_(software)
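
     A hedged sketch of the checkpoint feature in Python; file paths, the "label" column and tree counts are illustrative. The second estimator resumes from the first model's id and must request more trees than the checkpointed model already has:

         import h2o
         from h2o.estimators import H2OGradientBoostingEstimator

         h2o.init()
         train_v1 = h2o.import_file("data_v1.csv")  # illustrative path

         # Train an initial model under a known id.
         m1 = H2OGradientBoostingEstimator(ntrees=50, model_id="gbm_v1")
         m1.train(y="label", training_frame=train_v1)

         # Later: continue from the checkpointed model on new data,
         # growing the ensemble from 50 to 100 trees.
         train_v2 = h2o.import_file("data_v2.csv")  # illustrative path
         m2 = H2OGradientBoostingEstimator(ntrees=100, checkpoint="gbm_v1")
         m2.train(y="label", training_frame=train_v2)
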
  4. Clustering (1/2)
     • Can be deployed on a single node / multi-node cluster / Hadoop cluster / Apache Spark cluster
     • Clustering enhances speed of computation (see the sketch below)
     • Hadoop/Spark is NOT mandatory for clustering
     • Multi-node cluster with shared memory model
     • All computation in-memory
     • Each node sees only some rows of the data
     • No limit to cluster size
     • Distributed Data Frames (collections of vectors)
     • Columns are distributed (across nodes)
     - https://stackoverflow.com/questions/47894205/which-the-benefits-of-sparking-water-over-h20-machine-learning-library
     - https://stackoverflow.com/questions/48697292/is-there-any-performance-difference-for-ml-training-between-h2o-multi-node-clust/48697638#48697638
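
     A minimal sketch of attaching a client to an already-running multi-node cluster; the IP and port are hypothetical:

         import h2o

         # Point the client at any member node of the existing cluster.
         h2o.init(ip="10.0.0.17", port=54321)  # hypothetical address

         # Show the cloud status, including how many nodes joined.
         h2o.cluster().show_status()
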
  5. Clustering: Limitations (2/2)
     • For small data, clustering introduces slowness
     • Find the sweet spot between data size & number of nodes
     • Each node in the cluster should be of the same size (recommended)
     • New nodes cannot be added once the cluster starts up
     • If any machine dies, the whole cluster must be rebuilt
     • If a single node gets removed, the whole cluster becomes unusable
     • Nodes should be physically close, to minimize network latency
     • Each node must run the same version of h2o.jar
  6. Productionizing H2O
     1. Build a model using Python/R/Java/Flow
     2. Download the model (as a POJO or MOJO) as a zip file
     3. Download the resulting h2o-genmodel.jar (a library that supports scoring)
     4. Invoke the model from a Java class to generate predictions
     • Can be easily embedded inside a Java application (see the sketch below)
     http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html
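
     A sketch of steps 1-3 from the Python side, assuming an illustrative data file and "label" column; get_genmodel_jar=True fetches h2o-genmodel.jar along with the MOJO zip, which a Java class then loads for scoring (step 4):

         import h2o
         from h2o.estimators import H2OGradientBoostingEstimator

         h2o.init()
         train = h2o.import_file("train.csv")  # illustrative path

         # Step 1: build a model.
         model = H2OGradientBoostingEstimator(ntrees=100)
         model.train(y="label", training_frame=train)

         # Steps 2-3: download the MOJO zip plus the h2o-genmodel.jar
         # scoring library.
         mojo_path = model.download_mojo(path="/tmp", get_genmodel_jar=True)
         print(mojo_path)
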
  7. H2O Flow
     • Web-based interactive client environment, similar to Jupyter Notebook
     • Can be used by non-programmers as well (mouse clicks!)
     • Combines code execution, text, mathematics, plots & rich media in a single document
     • Allows:
       • Data upload; view data uploaded directly / through other clients
       • Building models; view models built directly / through other clients
       • Prediction; view predictions generated directly or through other clients
       • Checking cluster/CPU status
  8. Algorithms
     • Supervised: Cox Proportional Hazards, Deep Learning, Distributed Random Forest, Generalized Linear Model, Gradient Boosting Machine, Naïve Bayes Classifier, Stacked Ensembles, XGBoost
     • Unsupervised: Aggregator, Generalized Low Rank Models (GLRM), K-Means Clustering, Principal Component Analysis (PCA)
     • Miscellaneous: Word2vec
     • Common: Quantiles, Early Stopping
     https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/flow/images/H2O-Algorithms-Road-Map.pdf
  9. H2O Ecosystem
     • H2O
     • Steam
     • Enterprise Steam
     • Sparkling Water
     • Driverless AI
     • H2O4GPU
  10. H2O Steam
      • End-to-end platform that streamlines the entire process of building and deploying applications
      • Cluster Manager
        • Start/stop clusters, allocate memory, start/pause/stop H2O instances
        • Secure multi-tenant environment
      • Model Manager
        • Build, store, manage, compare, promote (historical) models
        • Run A/B tests for models
      • Scoring Server
        • Deploys a model
        • Scoring through REST API or in-app
  11. Sparkling Water (1/3)
      • Combines the fast, scalable machine learning algorithms of H2O with the capabilities of Spark
      • Provides a way to launch the H2O service on each Spark executor in the Spark cluster, forming an H2O cluster (see the sketch below)
      • "Certified on Spark"
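
      A minimal PySparkling sketch of that launch; the exact API varies by Sparkling Water version (this form matches the 2.x line whose docs the deck links to):

          from pyspark.sql import SparkSession
          from pysparkling import H2OContext

          spark = SparkSession.builder.appName("sw-demo").getOrCreate()

          # Starts an H2O node inside each Spark executor, forming
          # an H2O cluster alongside the Spark cluster.
          hc = H2OContext.getOrCreate(spark)
          print(hc)  # prints the cluster topology
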
  12. Sparkling Water – Use Cases (2/3)
      • Use Case 1: The data pipeline consists of multiple data transformations done with the Spark API. The final form of the data is converted into an H2O frame and passed to an H2O algorithm (see the sketch below).
      • Use Case 2: The data pipeline uses H2O's parallel data load and parse capabilities, while the Spark API serves as another provider of data transformations. H2O can also be used as an in-place data transformer.
      http://docs.h2o.ai/sparkling-water/2.4/latest-stable/doc/index.html
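
      A hedged sketch of Use Case 1; the file name and transformations are illustrative, and spark/hc come from the previous sketch. PySparkling 2.x exposes the conversion as as_h2o_frame; newer versions call it asH2OFrame:

          # The Spark API does the data munging...
          raw = spark.read.csv("events.csv", header=True, inferSchema=True)
          prepared = raw.dropna()

          # ...and the final form is handed to H2O for modeling.
          h2o_frame = hc.as_h2o_frame(prepared)
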
  13. Sparkling Water – Use Cases (3/3)
      • Use Case 3:
        1. The offline training pipeline, invoked regularly, uses the Spark & H2O APIs and produces an H2O model as output. The model is exported in a form independent of the H2O runtime.
        2. The streaming data pipeline (using Spark Streaming) uses the model trained in the first pipeline to score incoming data. Since the model is exported with no runtime dependency on H2O, the streaming pipeline can be lightweight and independent of the H2O/Sparkling Water infrastructure.
  14. Spark (MLlib) vs H2O
      • Spark is better at the data preparation and data munging steps
      • H2O is faster than the algorithms in Spark MLlib
      • MLlib underperforms in terms of memory, CPU and time
      • H2O provides a web interface (Flow) for data visualization
      • H2O and MLlib have overlapping algorithms
      • H2O is better for productionization
        • The POJO/MOJO approach is more friendly to integrate with Java applications
        • Allows evaluation metrics visualization, tracking jobs and job statuses
      • H2O allows grid search (Spark doesn't?) (see the sketch below)
      • Spark has better community support
      • H2O has enterprise support
      Check the slide on References
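
      A minimal sketch of H2O's built-in grid search; the data path, "label" column and hyper-parameter values are illustrative:

          import h2o
          from h2o.estimators import H2OGradientBoostingEstimator
          from h2o.grid.grid_search import H2OGridSearch

          h2o.init()
          train = h2o.import_file("train.csv")  # illustrative path

          # Cartesian search over a small hyper-parameter grid.
          hyper_params = {"max_depth": [4, 6, 8], "learn_rate": [0.01, 0.1]}
          grid = H2OGridSearch(model=H2OGradientBoostingEstimator(ntrees=100),
                               hyper_params=hyper_params)
          grid.train(y="label", training_frame=train)

          # Rank the six resulting models by AUC.
          print(grid.get_grid(sort_by="auc", decreasing=True))
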
  15. Case Study I: Migration From Spark ML To H2O (1/3)
      • Needs for the "iyzico" fraud detection product:
        • Continuous Delivery: models need to be continuously deployed to production
        • Real-Time Fraud Detection: prediction time of max 100 ms
        • High Availability & Scalability
        • Low Learning Curve: stack should be usable by data scientists & SW developers
        • Open Source
        • Fast: fast prototyping & deploying
        • On Premise
      • Initial choice: prediction.io + Spark ML
      Source: https://iyzico.engineering/spark-ml-to-h2o-migration-for-machine-learning-in-iyzico-dcba86b8eab2
  16. Case Study I (2/3)
      • Candidates benchmarked: TensorFlow, SparkML, H2O (winner)
      • Benchmarking criteria:
        • Simplicity of deploying an existing model (local env) to production
          • POJO-based models: easy to deploy in a Java environment
          • Release management and DevOps cycle are easy
        • Hardware requirements for training
          • Memory needed for training on 1 million transactions & 100 features with RF (64 trees): Spark ML: 16 GB RAM, TensorFlow: 10 GB, H2O: 2 GB
        • Decision Trees and Bayesian Models
        • Python, R, SQL support
        • Experimentation in a local environment (experiments can be done with Python, R)
        • Prediction time (ms)
  17. Case Study I (3/3)
      • Feature engineering and the data pipeline were in Java 8, so no migration was needed
      • Results of migrating from Spark ML + prediction.io to H2O:
        • 60 GB RAM saved (Spark ML & prediction.io needed it for model training)
        • 12 cores saved (Spark ML & prediction.io needed these cores to reduce model training time)
        • Response time decreased almost 10 times (300 milliseconds to 35 milliseconds)
  18. Benchmarking ML Libraries
      https://github.com/szilard/benchm-ml
      • Training data
        • Number of rows varied as 10K, 100K, 1M, 10M
        • ~1K features
        • Binary classification problem
      • Hardware (single instance)
        • Amazon EC2 c3.8xlarge (32 cores, 60 GB RAM)
        • If OOM, r3.8xlarge instance (32 cores, 250 GB RAM)
      • Observations
        • Training time
        • Maximum memory usage during training
        • AUC (predictive accuracy)
  19. Random Forest
      • H2O
        • Fast, uses all cores, more accurate
        • Memory efficient (1M rows: 5 GB, 10M rows: 25 GB)
      • Spark MLlib
        • Slower, with a larger memory footprint
        • Runs OOM at n = 1M; with 250 GB, finishes for 1M but crashes for 10M
        • AUC broke at 1M
        • Spark 2.0 is even slower
      • XGBoost
        • Fast, high accuracy
        • Memory efficient (1M rows: 2 GB, 10M rows: 9 GB)
  20. Gradient Boosting Machines
      • Two configurations benchmarked (see the sketch below): learn_rate=0.01, max_depth=16, n_trees=1000 and learn_rate=0.1, max_depth=6, n_trees=300
      • Memory footprint of GBMs is smaller than for RF
      • Bottleneck is mainly training time
      • Spark is memory-inefficient (especially for deeper trees) & crashes; works for shallow trees
      • H2O and xgboost are the fastest
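
      The two benchmark configurations map directly onto H2O estimator parameters; a sketch (data loading and training omitted):

          from h2o.estimators import H2OGradientBoostingEstimator

          # Deep, slow-learning configuration from the benchmark.
          deep_gbm = H2OGradientBoostingEstimator(learn_rate=0.01,
                                                  max_depth=16, ntrees=1000)

          # Shallow, faster-learning configuration.
          shallow_gbm = H2OGradientBoostingEstimator(learn_rate=0.1,
                                                     max_depth=6, ntrees=300)
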
  21. Performance of various GBM implementations
      • For deployment, H2O has the best ways to deploy as a real-time (fast scoring) application
      https://github.com/szilard/GBM-perf
  22. Do I need Big Data?
      • Single instance vs cluster
        • Sending data over a network vs using shared memory
      • Several distributed systems have significant computation & memory overhead
      • Map-reduce style communication patterns: not the best fit for many ML algorithms
      Benchmarking For Bigger Data
  23. Netflix VectorFlow
      • Minimalist library
      • Specifically optimized for training on sparse data
      • Single-machine, multi-core environment
  24. Benchmarking For Bigger Data
      • Not enough clarity about the hardware used
      • For tree-based ensembles (RF, GBM), H2O and xgboost can train on 100M records on a single server, though training times become several hours
      [Chart: single-node vs multi-node training]
  25. Disadvantages
      • No High Availability (HA) for clusters
      • Doesn't work well on sparse data
      • GPU support is in alpha stage
      • There is no SVM
      • Cluster support helps with Big Data; small data instead needs a single fast machine with lots of cores
  26. Do I need Spark to run H2O?
      - https://stackoverflow.com/questions/47894205/which-the-benefits-of-sparking-water-over-h20-machine-learning-library
      - https://stackoverflow.com/questions/48697292/is-there-any-performance-difference-for-ml-training-between-h2o-multi-node-clust/48697638#48697638
  27. H2O: POJO vs MOJO
      • POJOs are not supported for source files larger than 1G (see the export sketch below)
      • MOJOs are supported for AutoML, Deep Learning, DRF, GBM, GLM, GLRM, K-Means, Stacked Ensembles, SVM, Word2vec, and XGBoost models
      • POJOs are not supported for XGBoost, GLRM, or Stacked Ensembles models
      http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html
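
      A sketch of the two export paths side by side, assuming `model` is any trained H2O estimator (e.g. from the productionizing sketch earlier); the POJO call fails for the unsupported model types listed above:

          # POJO: plain Java source, subject to the 1 GB source-file limit.
          pojo_path = model.download_pojo(path="/tmp")

          # MOJO: zipped model artifact plus optional scoring jar.
          mojo_path = model.download_mojo(path="/tmp", get_genmodel_jar=True)
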
  28. Spark ML vs Spark MLlib
      • https://spark.apache.org/docs/latest/ml-guide.html