Productive Data Science & Machine Learning on Largish Data - Budapest Data Science Meetup @Prezi - July 2015

Productive Data Science & Machine Learning on Largish Data
Szilárd Pafka, PhD
Chief Scientist, Epoch
Budapest Data Science Meetup, July 2015

http://datascience.la

Productive Data Science coming... a few
- non-exclusive list
- somewhat high-level
- yet about tools

Use high-level APIs

Use tools that are fast (interactive)

Prefer simple over complex

Use an environment for data analysis

Use tools that facilitate reproducibility

high-level API, fast, environment, reproducibility

Data frames:
- “in-memory table” with (fast) bulk operations (“vectorized”)
- thousands of packages (providing high-level API)
- R, Python (pandas), Spark
- best way to work with structured data
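As a rough illustration of the data frame points above, here is a minimal pandas sketch. The table, its column names, and the values are hypothetical and chosen only for the example; it shows a vectorized column operation and a bulk filter plus group-wise aggregation written through the high-level API, with no explicit loops.

```python
import pandas as pd

# Hypothetical toy table; column names and values are illustrative only.
df = pd.DataFrame({
    "carrier": ["AA", "AA", "UA", "UA", "DL"],
    "distance": [100, 300, 250, 750, 500],
    "delay_min": [5, 20, 0, 45, 10],
})

# Vectorized arithmetic: one expression operates on whole columns at once.
df["delay_per_100mi"] = df["delay_min"] / df["distance"] * 100

# Bulk filter followed by a group-wise aggregation.
summary = (
    df[df["distance"] >= 250]
    .groupby("carrier")["delay_min"]
    .mean()
)
print(summary)
```

The same filter and aggregation could be written with explicit Python loops, but the data frame version is shorter and the bulk operations run in optimized library code.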

I usually use other people’s code [...] it is usually not “efficient” (from time budget perspective) to write my own algorithm [...] I can find open source code for what I want to do, and my time is much better spent doing research and feature engineering -- Owen Zhang
http://blog.kaggle.com/2015/06/22/profiling-top-kagglers-owen-zhang-currently-1-in-the-world/

Data Size for Supervised Learning
# records: <10M, 10M-10B, >10B

Data Size for Non-Linear Supervised Learning
# records: <1M, 1M-100M, >100M

binary classification, 10M records, numeric & categorical features, non-sparse

http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf
http://lowrank.net/nikos/pubs/empirical.pdf

- R packages
- Python scikit-learn
- Vowpal Wabbit
- H2O
- xgboost
- Spark MLlib

- R packages 30%
- Python scikit-learn 40%
- Vowpal Wabbit 8%
- H2O 10%
- xgboost 8%
- Spark MLlib 6%
- a few others

Distributed computation generally is hard, because it adds an additional layer of complexity and [network] communication overhead. The ideal case is scaling linearly with the number of nodes; that’s rarely the case. Emerging evidence shows that very often, one big machine, or even a laptop, outperforms a cluster.
http://fastml.com/the-emperors-new-clothes-distributed-machine-learning/

n = 10K, 100K, 1M, 10M, 100M
- Training time
- RAM usage
- AUC
- CPU % by core
- read data, pre-process, score test data
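Two of the quantities listed above, training time and AUC, can be sketched with the Python standard library alone. This is an illustrative harness, not the benchmark code used in the talk: the names `auc`, `benchmark`, `make_data`, `train_fn`, and `score_fn` are hypothetical, the "model" is a deliberately trivial one-feature scorer, and RAM usage and per-core CPU % are not measured here.

```python
import random
import time


def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formula.

    Tied scores receive their average rank.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def benchmark(train_fn, score_fn, make_data, sizes, n_test=1000):
    """For each training size n, time the training step and compute test AUC."""
    results = []
    for n in sizes:
        X_train, y_train = make_data(n)
        X_test, y_test = make_data(n_test)
        start = time.perf_counter()
        model = train_fn(X_train, y_train)
        train_sec = time.perf_counter() - start
        results.append((n, train_sec, auc(y_test, score_fn(model, X_test))))
    return results


rng = random.Random(0)


def make_data(n):
    # Synthetic data: one numeric feature, noisy binary label.
    X = [rng.gauss(0, 1) for _ in range(n)]
    y = [1 if x + rng.gauss(0, 1) > 0 else 0 for x in X]
    return X, y


def train_fn(X, y):
    # Toy "model": difference between the class means of the single feature.
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    return sum(pos) / len(pos) - sum(neg) / len(neg)


def score_fn(model, X):
    return [model * x for x in X]


results = benchmark(train_fn, score_fn, make_data, [1_000, 10_000])
for n, sec, a in results:
    print(f"n={n:>6}  train={sec:.4f}s  AUC={a:.3f}")
```

Swapping `train_fn` and `score_fn` for wrappers around a real library would keep the measurement loop unchanged, which is the point of separating the harness from the model.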

vs “More data usually beats better algorithms” http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

[Plot: accuracy vs. data size]
- linear tops off
- more data & better algo
- random forest on 1% of data beats linear on all data

http://datascience.la/benchmarking-random-forest-implementations/#comment-53599

I’m of course paranoid that the need for distributed learning is diminishing as individual computing nodes (augmented with GPUs) become increasingly powerful. So I was ready for Jure Leskovec’s workshop talk [at NIPS 2014]. Here is a killer screenshot. -- Paul Mineiro

we will continue to run large [...] jobs to scan petabytes of [...] data to extract interesting features, but this paper explores the interesting possibility of switching over to a multi-core, shared-memory system for efficient execution on more refined datasets [...] e.g., machine learning
http://openproceedings.org/2014/conf/edbt/KumarGDL14.pdf

learn_rate = 0.1, max_depth = 6, n_trees = 300
learn_rate = 0.01, max_depth = 16, n_trees = 1000

Non-Linear Supervised Learning
# records: <1M, 1M-100M, >100M

random forest max_depth = 20, n_trees = 100; train (1M rows) + test data
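For orientation, the random forest settings above might look as follows in scikit-learn, one of the tools listed earlier. This is a sketch under assumptions: the slide does not say which implementation these settings were run with, `n_trees` is mapped to scikit-learn's `n_estimators`, and a small synthetic dataset stands in for the 1M-row training set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Small synthetic stand-in for the 1M-row training set (assumption).
X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

rf = RandomForestClassifier(
    n_estimators=100,  # n_trees = 100
    max_depth=20,      # max_depth = 20
    n_jobs=-1,         # train trees on all available cores
    random_state=0,
)
rf.fit(X_train, y_train)

# AUC on the held-out test data, using the predicted probability of class 1.
test_auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
print(f"test AUC: {test_auc:.3f}")
```

Raising `n_samples` toward the sizes discussed in the deck makes training time and memory use the limiting factors, which is what the comparison across tools is about.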

http://www.msr-waypoint.com/pubs/204499/a20-appuswamy.pdf
