Benchmarking Machine Learning Tools for Scalability, Speed and Accuracy - LA ML Meetup @eHarmony - June 2015

Benchmarking Machine Learning Tools for Scalability, Speed and Accuracy
Szilárd Pafka, PhD
Chief Scientist, Epoch
LA Machine Learning Meetup, June 2015

“I usually use other people’s code [...] it is usually not “efficient” (from time budget perspective) to write my own algorithm [...] I can find open source code for what I want to do, and my time is much better spent doing research and feature engineering”
-- Owen Zhang
http://blog.kaggle.com/2015/06/22/profiling-top-kagglers-owen-zhang-currently-1-in-the-world/

Data Science Toolbox Survey
1. data munging
2. visualization
3. machine learning
Results: http://datascience.la/?s=survey
Compare:
- kdnuggets poll
- Rexer data mining survey

Data Size for Supervised Learning # records: <10M 10M-10B >10B

Data Size for Non-Linear Supervised Learning # records: <1M 1M-100M >100M

binary classification, 10M records numeric & categorical features, non-sparse

http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf http://lowrank.net/nikos/pubs/empirical.pdf

“Distributed computation generally is hard, because it adds an additional layer of complexity and [network] communication overhead. The ideal case is scaling linearly with the number of nodes; that’s rarely the case. Emerging evidence shows that very often, one big machine, or even a laptop, outperforms a cluster.”
http://fastml.com/the-emperors-new-clothes-distributed-machine-learning/
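The "scaling linearly with the number of nodes" ideal in the quote above can be made concrete with two standard ratios, speedup and parallel efficiency. The sketch below is only an illustration: the function names and the timings in the example are hypothetical and are not measurements from this talk.

```python
def speedup(t_single: float, t_cluster: float) -> float:
    """How many times faster the cluster run is than the single-machine run."""
    return t_single / t_cluster


def parallel_efficiency(t_single: float, t_cluster: float, n_nodes: int) -> float:
    """Speedup divided by node count; 1.0 means perfectly linear scaling."""
    return speedup(t_single, t_cluster) / n_nodes


if __name__ == "__main__":
    # Hypothetical timings (seconds), chosen only to illustrate the arithmetic.
    t_one_machine = 100.0
    t_four_nodes = 40.0
    print(speedup(t_one_machine, t_four_nodes))                 # 2.5
    print(parallel_efficiency(t_one_machine, t_four_nodes, 4))  # 0.625
```

An efficiency well below 1.0 is the overhead the quote refers to. A speedup below 1.0 would mean the single machine is faster than the cluster.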

n = 10K, 100K, 1M, 10M, 100M
Training time
RAM usage
AUC
CPU % by core
read data, pre-process, score test data
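A minimal sketch of how the measurements listed on this slide (training time, memory, AUC at increasing n) might be collected, using only the Python standard library. Everything in it is an assumption made for illustration: the synthetic data generator, the toy lookup-table "model", and the function names are hypothetical stand-ins, not the tools or data benchmarked in the talk. Memory is measured with `tracemalloc`, which tracks Python-level allocations only, so it is a rough proxy for RAM usage.

```python
import random
import time
import tracemalloc


def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) formula."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Tied scores share the average of the ranks they span.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def make_data(n, seed):
    """Hypothetical synthetic rows: (numeric feature, categorical feature, label)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x, cat = rng.random(), rng.choice("abc")
        p = 0.5 if cat == "c" else 0.2 + 0.6 * x
        rows.append((x, cat, 1 if rng.random() < p else 0))
    return rows


def train(rows, bins=10):
    """Toy model: mean label per (category, binned numeric feature) cell."""
    sums, counts = {}, {}
    for x, cat, y in rows:
        key = (cat, min(int(x * bins), bins - 1))
        sums[key] = sums.get(key, 0) + y
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


def predict(model, rows, bins=10):
    """Score rows with the cell means; unseen cells fall back to 0.5."""
    return [model.get((cat, min(int(x * bins), bins - 1)), 0.5) for x, cat, _ in rows]


def benchmark(sizes, n_test=2000):
    """Record training time, peak traced memory and test AUC for each size."""
    test = make_data(n_test, seed=1)
    y_test = [y for _, _, y in test]
    results = []
    for n in sizes:
        data = make_data(n, seed=0)
        # Timed run without tracing, since tracemalloc slows execution.
        t0 = time.perf_counter()
        model = train(data)
        train_sec = time.perf_counter() - t0
        # Separate traced run to get peak memory allocated during training.
        tracemalloc.start()
        train(data)
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        results.append({"n": n, "train_sec": train_sec, "peak_bytes": peak_bytes,
                        "auc": auc(y_test, predict(model, test))})
    return results


if __name__ == "__main__":
    for row in benchmark([1_000, 10_000, 100_000]):
        print(row)
```

Timing and memory are measured in separate runs so that tracing overhead does not inflate the reported training time. Per-core CPU usage is not covered here, since the standard library has no portable way to sample it.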

vs More data usually beats better algorithms Datawocky

“I’m of course paranoid that the need for distributed learning is diminishing as individual computing nodes (augmented with GPUs) become increasingly powerful. So I was ready for Jure Leskovec’s workshop talk [at NIPS 2014]. Here is a killer screenshot.”
-- Paul Mineiro

“we will continue to run large [...] jobs to scan petabytes of [...] data to extract interesting features, but this paper explores the interesting possibility of switching over to a multi-core, shared-memory system for efficient execution on more refined datasets [...] e.g., machine learning”
http://openproceedings.org/2014/conf/edbt/KumarGDL14.pdf

Non-Linear Supervised Learning

# records: <1M 1M-100M >100M Non-Linear Supervised Learning

EC2
