Benchmarking Machine Learning Tools - H2O World Conference - Nov 2015

Benchmarking Machine Learning Tools for Scalability, Speed and Accuracy
Szilárd Pafka, PhD
Chief Scientist, Epoch
H2O World Conference, Mountain View, Nov 2015

Disclaimer: I am not representing my employer (Epoch) in this talk. I can neither confirm nor deny whether Epoch is using any of the methods, tools, results etc. mentioned in this talk. The results presented in this talk should not be considered an indication of whether Epoch is using these methods, tools, results etc.

I usually use other people’s code [...] it is usually not “efficient” (from time budget perspective) to write my own algorithm [...] I can find open source code for what I want to do, and my time is much better spent doing research and feature engineering
-- Owen Zhang
http://blog.kaggle.com/2015/06/22/profiling-top-kagglers-owen-zhang-currently-1-in-the-world/

Data Size for Supervised Learning
# records: <10M, 10M-10B, >10B

Data Size for Non-Linear Supervised Learning
# records: <1M, 1M-100M, >100M

binary classification, 10M records
numeric & categorical features, non-sparse

http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf
http://lowrank.net/nikos/pubs/empirical.pdf

- R packages
- Python scikit-learn
- Vowpal Wabbit
- H2O
- xgboost
- Spark MLlib

- R packages 30%
- Python scikit-learn 40%
- Vowpal Wabbit 8%
- H2O 10%
- xgboost 8%
- Spark MLlib 6%
- a few others

Distributed computation generally is hard, because it adds an additional layer of complexity and [network] communication overhead. The ideal case is scaling linearly with the number of nodes; that’s rarely the case. Emerging evidence shows that very often, one big machine, or even a laptop, outperforms a cluster.
http://fastml.com/the-emperors-new-clothes-distributed-machine-learning/
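The "scaling linearly with the number of nodes" ideal in the quote above can be made concrete with the usual speedup and parallel-efficiency ratios. The following is only a minimal sketch: the function names and the timings in the usage note are hypothetical, not measurements from this talk.

```python
def speedup(t_single, t_cluster):
    """How many times faster the cluster run is than the single-machine run."""
    if t_single <= 0 or t_cluster <= 0:
        raise ValueError("timings must be positive")
    return t_single / t_cluster


def parallel_efficiency(t_single, t_cluster, n_nodes):
    """Speedup divided by node count; 1.0 means perfectly linear scaling."""
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    return speedup(t_single, t_cluster) / n_nodes
```

With hypothetical timings of 100 s on one machine and 50 s on 4 nodes, `parallel_efficiency(100.0, 50.0, 4)` gives 0.5, i.e. half of the added hardware is lost to overhead. A speedup below 1 corresponds to the case in the quote where the single machine outperforms the cluster.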

n = 10K, 100K, 1M, 10M, 100M
Training time, RAM usage, AUC, CPU % by core
read data, pre-process, score test data
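A benchmark loop of the kind listed above (train at several values of n, record training time and test AUC) might be sketched as follows. This is an illustration under stated assumptions, not the code behind the talk: the synthetic one-feature data, the placeholder "model" (`train_threshold`) and all function names are hypothetical, and only the standard library is used. RAM usage and per-core CPU % are not measured here.

```python
import random
import time


def auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) identity; tied scores get average ranks."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(1 for k in range(i, j) if pairs[k][1] == 1)
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def make_data(n, seed):
    """Hypothetical synthetic data: one numeric feature, shifted by the label."""
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n)]
    xs = [rng.gauss(y, 1.0) for y in labels]
    return xs, labels


def train_threshold(xs, labels):
    """Placeholder 'model': midpoint between the two class means."""
    pos = [x for x, y in zip(xs, labels) if y == 1]
    neg = [x for x, y in zip(xs, labels) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def run_benchmark(sizes, n_test=2000):
    """Return (n, training seconds, test AUC) for each training size."""
    test_x, test_y = make_data(n_test, seed=1)
    rows = []
    for n in sizes:
        xs, ys = make_data(n, seed=0)
        start = time.perf_counter()
        threshold = train_threshold(xs, ys)
        seconds = time.perf_counter() - start
        rows.append((n, seconds, auc(test_y, [x - threshold for x in test_x])))
    return rows
```

Calling `run_benchmark([10_000, 100_000])` returns one row per size. In a real comparison the placeholder would be replaced by a call into the tool being benchmarked, with the same held-out test set scored for every n.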

[Figure: accuracy vs. data size. Annotations: linear tops off; more data & better algo; random forest on 1% of data beats linear on all data.]

http://datascience.la/benchmarking-random-forest-implementations/#comment-53599

we will continue to run large [...] jobs to scan petabytes of [...] data to extract interesting features, but this paper explores the interesting possibility of switching over to a multi-core, shared-memory system for efficient execution on more refined datasets [...] e.g., machine learning
http://openproceedings.org/2014/conf/edbt/KumarGDL14.pdf

learn_rate = 0.1, max_depth = 6, n_trees = 300
learn_rate = 0.01, max_depth = 16, n_trees = 1000
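The two settings above read like boosted-tree hyperparameters, although the slide does not name the tool. As a hedged sketch, they can be kept as plain dictionaries and translated into the argument names used by xgboost's native training API (`eta`, `max_depth`, number of boosting rounds) and by H2O's Python GBM estimator (`learn_rate`, `max_depth`, `ntrees`). The helper names are hypothetical, and the `binary:logistic` objective is an assumption based on the binary classification task described earlier in the deck.

```python
# The two settings from the slide, kept under the slide's own names.
GBM_SETTINGS = [
    {"learn_rate": 0.1, "max_depth": 6, "n_trees": 300},
    {"learn_rate": 0.01, "max_depth": 16, "n_trees": 1000},
]


def to_xgboost_args(setting):
    """Return (params, num_boost_round) in xgboost's native naming.

    Assumption: binary classification, hence objective='binary:logistic'.
    """
    params = {
        "eta": setting["learn_rate"],
        "max_depth": setting["max_depth"],
        "objective": "binary:logistic",
    }
    return params, setting["n_trees"]


def to_h2o_kwargs(setting):
    """Return keyword arguments in H2O GBM naming (ntrees, max_depth, learn_rate)."""
    return {
        "ntrees": setting["n_trees"],
        "max_depth": setting["max_depth"],
        "learn_rate": setting["learn_rate"],
    }
```

The first return value would typically be passed as `xgboost.train(params, dtrain, num_boost_round=rounds)`, and the second as `H2OGradientBoostingEstimator(**kwargs)`. Neither library is imported here, so the block runs without them.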

Non-Linear Supervised Learning

# records: <1M 1M-100M >100M Non-Linear Supervised Learning

EC2

10x
