
Size of Datasets for Analytics and Implications for R - useR! conference, Stanford University - June 2016

szilard
June 23, 2016

Transcript

  1. Size of Datasets for Analytics and Implications for R
     Szilárd Pafka, PhD, Chief Scientist, Epoch
     useR! conference, Stanford University, June 2016
  2. Disclaimer: I am not representing my employer (Epoch) in this
     talk. I can neither confirm nor deny whether Epoch is using any
     of the methods, tools, results etc. mentioned in this talk.
  3. [Benchmark charts: aggregation (100M rows, 1M groups) and join
     (100M rows x 1M rows); y-axis: time [s] per tool. Annotation,
     speedup on 5 nodes: Hive 1.5x, Spark 2x.]
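The aggregation and join pattern behind that benchmark is easy to reproduce in R. Below is a minimal sketch with data.table on synthetic data, scaled down from the 100M rows used in the talk; the names and sizes are illustrative assumptions, not the talk's actual benchmark code.

# Aggregation (1M groups) and join (against a 1M-row table) timings,
# scaled down from the talk's 100M-row benchmark.
library(data.table)

n <- 1e7; m <- 1e6                                # rows, distinct keys (scaled down)
d  <- data.table(x = sample(m, n, replace = TRUE), y = runif(n))
dm <- data.table(x = sample(m), z = runif(m))     # small table to join against

system.time(d[, .(ym = mean(y)), by = x])         # aggregation: m groups
system.time(merge(d, dm, by = "x"))               # join: n rows x m rows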
  4. [Chart: accuracy vs. data size. The linear model's accuracy tops
     off as data grows; with more data and a better algorithm, a
     random forest trained on 1% of the data beats a linear model
     trained on all of it.]
  5. [Same chart as the previous slide.]
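A toy illustration of that chart in R (simulated data, illustrative only; not the talk's actual experiment): on a nonlinear target, a random forest fit on a 1% sample can beat a logistic regression fit on all of the training data.

library(randomForest)

set.seed(42)
n <- 1e5
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- factor(ifelse(xor(d$x1 > 0.5, d$x2 > 0.5), "yes", "no"))  # nonlinear target
tr <- sample(n, 0.8 * n); te <- setdiff(seq_len(n), tr)

fit_lin <- glm(y ~ x1 + x2, data = d[tr, ], family = binomial)   # all training rows
fit_rf  <- randomForest(y ~ x1 + x2,
                        data = d[sample(tr, length(tr) %/% 100), ])  # 1% sample

acc <- function(p) mean(p == d$y[te])                            # test-set accuracy
acc(ifelse(predict(fit_lin, d[te, ], type = "response") > 0.5, "yes", "no"))
acc(predict(fit_rf, d[te, ]))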
  6. Summary / Tips for analyzing “big” data:
     - Get lots of RAM (physical/cloud)
     - Use R and high-performance R packages (e.g. data.table, xgboost)
     - Do data reduction in the database (analytical db / big data system)
     - (Only) distribute embarrassingly parallel tasks (e.g. hyperparameter
       search for machine learning; see the sketch after this list)
     - Let engineers (store and) ETL the data (“scalable”)
     - Use statistics / domain knowledge / thinking
     - Use “big data tools” only if the above tips are not enough
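As a sketch of the "distribute embarrassingly parallel tasks" tip, here is a hyperparameter grid search run with parallel::mclapply (forked processes, so Unix-like systems) over xgboost cross-validation on its bundled example data; the grid and metric are illustrative assumptions, not a recommendation from the talk.

library(parallel)
library(xgboost)

data(agaricus.train, package = "xgboost")
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)

grid <- expand.grid(max_depth = c(2, 4, 6), eta = c(0.1, 0.3))

scores <- mclapply(seq_len(nrow(grid)), function(i) {
  cv <- xgb.cv(params = list(objective = "binary:logistic",
                             eval_metric = "logloss",
                             nthread = 1,            # avoid oversubscribing cores
                             max_depth = grid$max_depth[i],
                             eta = grid$eta[i]),
               data = dtrain, nrounds = 50, nfold = 5, verbose = FALSE)
  min(cv$evaluation_log$test_logloss_mean)           # best CV logloss for this combo
}, mc.cores = detectCores())

grid[which.min(unlist(scores)), ]                    # best hyperparameter combination

Each grid point is independent, so the search scales across cores (or machines) with no coordination; that independence is what makes it embarrassingly parallel, while each model fit itself stays on a single node.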