Machine Learning on Largish Data - Talk at Morgan Stanley, Budapest - Aug 2015

Machine Learning on Largish Data - A Study of Open Source Tools
Szilárd Pafka, PhD
Chief Scientist, Epoch
Talk at Morgan Stanley, Budapest, Aug 2015

I usually use other people’s code [...] it is usually not “efficient” (from time budget perspective) to write my own algorithm [...] I can find open source code for what I want to do, and my time is much better spent doing research and feature engineering
-- Owen Zhang
http://blog.kaggle.com/2015/06/22/profiling-top-kagglers-owen-zhang-currently-1-in-the-world/

Data Size for Supervised Learning
# records: <10M, 10M-10B, >10B

Data Size for Non-Linear Supervised Learning
# records: <1M, 1M-100M, >100M

binary classification, 10M records, numeric & categorical features, non-sparse

http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf
http://lowrank.net/nikos/pubs/empirical.pdf

- R packages
- Python scikit-learn
- Vowpal Wabbit
- H2O
- xgboost
- Spark MLlib

- R packages 30%
- Python scikit-learn 40%
- Vowpal Wabbit 8%
- H2O 10%
- xgboost 8%
- Spark MLlib 6%
- a few others

Distributed computation generally is hard, because it adds an additional layer of complexity and [network] communication overhead. The ideal case is scaling linearly with the number of nodes; that’s rarely the case. Emerging evidence shows that very often, one big machine, or even a laptop, outperforms a cluster.
http://fastml.com/the-emperors-new-clothes-distributed-machine-learning/
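
The "scaling linearly" ideal in the quote above can be made concrete with two standard ratios: speedup (single-machine time divided by cluster time) and parallel efficiency (speedup divided by node count). This is only a sketch; the function names are hypothetical and the timings in the usage note are made-up illustration values, not results from the talk.

```python
def speedup(t_single, t_cluster):
    """How many times faster the cluster run is than the single-machine run."""
    return t_single / t_cluster


def efficiency(t_single, t_cluster, n_nodes):
    """Fraction of ideal linear scaling achieved; 1.0 means perfectly linear."""
    return speedup(t_single, t_cluster) / n_nodes
```

With hypothetical timings of 100 seconds on one machine and 40 seconds on 5 nodes, `speedup(100.0, 40.0)` is 2.5 and `efficiency(100.0, 40.0, 5)` is 0.5, i.e. half of the linear ideal. An efficiency below `1 / n_nodes` would mean the cluster is slower than the single machine.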

n = 10K, 100K, 1M, 10M, 100M
- Training time
- RAM usage
- AUC
- CPU % by core
- read data, pre-process, score test data
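
Two of the measurements listed above, training time and AUC, can be sketched with the Python standard library alone. This is a minimal illustration and not the benchmark's actual code: `auc` and `time_training` are hypothetical helper names, and AUC is computed here with the rank-sum (Mann-Whitney) formulation rather than any particular library's routine.

```python
import time


def auc(labels, scores):
    """AUC via the rank-sum formulation; tied scores share an average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs both classes present")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def time_training(train, sizes):
    """Call train(n) for each training-set size n and record wall-clock seconds."""
    timings = {}
    for n in sizes:
        start = time.perf_counter()
        train(n)
        timings[n] = time.perf_counter() - start
    return timings
```

Here `train` stands for any callable that fits a model on the first `n` records, so the same harness can wrap each tool being compared. RAM usage and per-core CPU utilization are not covered by this sketch; those are typically observed with operating-system tools while the job runs.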

vs “More data usually beats better algorithms”
http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

[Plot: accuracy vs. data size. Annotations: linear tops off; more data & better algo; random forest on 1% of data beats linear on all data]

http://datascience.la/benchmarking-random-forest-implementations/#comment-53599

I’m of course paranoid that the need for distributed learning is diminishing as individual computing nodes (augmented with GPUs) become increasingly powerful. So I was ready for Jure Leskovec’s workshop talk [at NIPS 2014]. Here is a killer screenshot.
-- Paul Mineiro

we will continue to run large [...] jobs to scan petabytes of [...] data to extract interesting features, but this paper explores the interesting possibility of switching over to a multi-core, shared-memory system for efficient execution on more refined datasets [...] e.g., machine learning
http://openproceedings.org/2014/conf/edbt/KumarGDL14.pdf

learn_rate = 0.1, max_depth = 6, n_trees = 300
learn_rate = 0.01, max_depth = 16, n_trees = 1000
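
The two settings above use one set of parameter names, while each boosting library spells them differently. The sketch below records both settings and translates them to the names commonly used by xgboost and scikit-learn. `SETTINGS`, `NAME_MAP`, and `translate` are hypothetical names introduced for illustration, and the mapping is an assumption to verify against the documentation of the library version in use.

```python
# The two boosting settings from the slide, keyed by the slide's own names.
SETTINGS = [
    {"learn_rate": 0.1, "max_depth": 6, "n_trees": 300},
    {"learn_rate": 0.01, "max_depth": 16, "n_trees": 1000},
]

# Assumed name mapping per library. In xgboost, num_boost_round is an
# argument of the training call rather than an entry of the params dict.
NAME_MAP = {
    "xgboost": {
        "learn_rate": "eta",
        "max_depth": "max_depth",
        "n_trees": "num_boost_round",
    },
    "sklearn": {
        "learn_rate": "learning_rate",
        "max_depth": "max_depth",
        "n_trees": "n_estimators",
    },
}


def translate(setting, library):
    """Rename the keys of one setting to the chosen library's spelling."""
    mapping = NAME_MAP[library]
    return {mapping[name]: value for name, value in setting.items()}
```

For example, `translate(SETTINGS[1], "sklearn")` yields `{"learning_rate": 0.01, "max_depth": 16, "n_estimators": 1000}`. Keeping the settings in one place makes it easier to run the same configuration across tools.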

Non-Linear Supervised Learning
# records: <1M, 1M-100M, >100M

EC2

10x
