Size of Datasets for Analytics and Implications for R - useR! conference, Stanford University - June 2016

Size of Datasets for Analytics and Implications for R
Szilárd Pafka, PhD
Chief Scientist, Epoch
useR! conference, Stanford University, June 2016

Disclaimer: I am not representing my employer (Epoch) in this talk. I cannot confirm nor deny if Epoch is using any of the methods, tools, results etc. mentioned in this talk.

(largest data analyzed)

https://deads.gitbooks.io/paratext-bench/content/teaser.html June 2016

Aggregation (100M rows, 1M groups) and join (100M rows x 1M rows): time [s]
Speedup on 5 nodes: Hive 1.5x, Spark 2x
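To make the shape of the two benchmark tasks concrete, here is a minimal single-machine sketch in plain Python. It is not the benchmark code from the talk: the function names (`make_data`, `aggregate`, `hash_join`), the synthetic data, and the scaled-down sizes are all assumptions chosen for illustration.

```python
import random
import time
from collections import defaultdict


def make_data(n_rows, n_groups, seed=0):
    """Hypothetical synthetic table: (group id, value) pairs."""
    rng = random.Random(seed)
    return [(rng.randrange(n_groups), rng.random()) for _ in range(n_rows)]


def aggregate(rows):
    """Group-by aggregation: sum of value per group id."""
    totals = defaultdict(float)
    for gid, val in rows:
        totals[gid] += val
    return dict(totals)


def hash_join(rows, lookup):
    """Inner join of (gid, value) rows against a {gid: attribute} table."""
    return [(gid, val, lookup[gid]) for gid, val in rows if gid in lookup]


if __name__ == "__main__":
    # Scaled down from the slide's 100M rows / 1M groups so it runs quickly.
    n_rows, n_groups = 200_000, 2_000
    rows = make_data(n_rows, n_groups)
    lookup = {g: g % 7 for g in range(n_groups)}

    t0 = time.perf_counter()
    totals = aggregate(rows)
    t1 = time.perf_counter()
    joined = hash_join(rows, lookup)
    t2 = time.perf_counter()

    print(f"aggregation: {len(totals)} groups in {t1 - t0:.3f} s")
    print(f"join: {len(joined)} rows in {t2 - t1:.3f} s")
```

Timings from a sketch like this say nothing about the tools compared on the slide; it only shows what "aggregation" and "join" mean as workloads.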

(largest data analyzed)

Gradient Boosting Machines: training time [s] by data size [M] (annotation: 10x)

Accuracy by data size:
- linear tops off
- more data & better algo
- random forest on 1% of data beats linear on all data

Summary / Tips for analyzing “big” data:
- Get lots of RAM (physical/cloud)
- Use R and high performance R packages (e.g. data.table, xgboost)
- Do data reduction in database (analytical db/big data system)
- (Only) distribute embarrassingly parallel tasks (e.g. hyperparameter search for machine learning)
- Let engineers (store and) ETL the data (“scalable”)
- Use statistics/domain knowledge/thinking
- Use “big data tools” only if the above tips are not enough
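The tip about distributing only embarrassingly parallel tasks can be sketched as follows. This is an illustration under stated assumptions, not code from the talk: `evaluate` is a hypothetical stand-in for "train a model with these hyperparameters and return a validation score", and the grid values are made up. Each grid point is scored independently, so the work can be handed to a process pool with no communication between tasks.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product


def evaluate(params):
    """Hypothetical scoring function; a real one would train and validate a model."""
    depth, rate = params
    # Toy score that peaks at depth=6, rate=0.1 (assumed values, for illustration).
    return -((depth - 6) ** 2) - 100 * (rate - 0.1) ** 2


def grid_search(depths, rates, map_fn=map):
    """Score every (depth, rate) pair with map_fn and return the best one."""
    grid = list(product(depths, rates))
    scores = list(map_fn(evaluate, grid))
    best_score, best_params = max(zip(scores, grid))
    return best_params, best_score


if __name__ == "__main__":
    # pool.map has the same shape as the built-in map, so the search code
    # stays the same whether it runs sequentially or across processes.
    with ProcessPoolExecutor() as pool:
        print(grid_search([2, 4, 6, 8], [0.01, 0.1, 0.3], map_fn=pool.map))
```

Passing the mapping function in keeps the parallelism at the outermost, independent level, which is the only part of the workflow this tip suggests distributing.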
