June 23, 2016

Size of Datasets for Analytics and Implications for R - useR! conference, Stanford University - June 2016


Transcript

  1. Size of Datasets for Analytics and Implications for R. Szilárd Pafka, PhD, Chief Scientist, Epoch. useR! conference, Stanford University, June 2016
  2. Disclaimer: I am not representing my employer (Epoch) in this talk. I can neither confirm nor deny whether Epoch is using any of the methods, tools, results, etc. mentioned in this talk.
  3. [Benchmark charts: aggregation of 100M rows into 1M groups, and join of 100M rows x 1M rows; bars show time in seconds. Annotation: speedup vs. 5 nodes, Hive 1.5x, Spark 2x.]
  4. [Chart, accuracy vs. data size: the linear model tops off as data grows, while more data with a better algorithm keeps improving; a random forest on 1% of the data beats a linear model on all of the data.]
  5. [Same chart as slide 4, repeated.]
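The claim on slides 4-5 can be checked on synthetic data. The sketch below is illustrative, not from the talk: it builds a nonlinear classification problem, fits a logistic regression (a linear model) on all the training data, and a random forest on a 1% subsample, then compares held-out accuracy. The `randomForest` package and all data sizes are assumptions for the demo.

```r
# Illustrative sketch (assumes the 'randomForest' package is installed):
# on a nonlinear problem, a random forest fit on a small fraction of the
# data can match or beat a linear model fit on all of it.
library(randomForest)

set.seed(42)
n  <- 20000
x1 <- runif(n); x2 <- runif(n)
y  <- as.factor(x1 * x2 > 0.25)            # nonlinear decision boundary
d  <- data.frame(x1, x2, y)
train <- d[1:10000, ]
test  <- d[10001:n, ]

# Linear model (logistic regression) fit on ALL training rows
lin     <- glm(y ~ x1 + x2, data = train, family = binomial)
acc_lin <- mean((predict(lin, test, type = "response") > 0.5) ==
                (test$y == "TRUE"))

# Random forest fit on 1% of the training rows
sub    <- train[sample(nrow(train), 100), ]
rf     <- randomForest(y ~ x1 + x2, data = sub)
acc_rf <- mean(predict(rf, test) == test$y)

c(linear_all_data = acc_lin, rf_1pct = acc_rf)
```

With this particular boundary the linear model cannot represent the curved decision surface, which is the mechanism behind the slide's "more data & better algo" point.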
  6. Summary / tips for analyzing "big" data:
     - Get lots of RAM (physical / cloud)
     - Use R and high-performance R packages (e.g. data.table, xgboost)
     - Do data reduction in the database (analytical DB / big data system)
     - (Only) distribute embarrassingly parallel tasks (e.g. hyperparameter search for machine learning)
     - Let engineers (store and) ETL the data ("scalable")
     - Use statistics / domain knowledge / thinking
     - Use "big data tools" only if the above tips are not enough
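Two of the tips above can be sketched in a few lines of R: in-memory aggregation with `data.table`, and distributing an embarrassingly parallel hyperparameter search with the base `parallel` package. The grid, the placeholder scoring function, and the scaled-down row count are assumptions for the demo, not from the talk.

```r
# Hedged sketch of two tips: fast in-memory aggregation with data.table,
# and an embarrassingly parallel hyperparameter search via mclapply.
library(data.table)
library(parallel)

n  <- 1e6                                   # scaled down from the talk's 100M rows
dt <- data.table(grp = sample(1e5, n, replace = TRUE), x = runif(n))

# In-memory aggregation: mean of x within each group
agg <- dt[, .(mean_x = mean(x)), by = grp]

# Embarrassingly parallel: each hyperparameter setting is independent,
# so evaluate settings concurrently. The scoring function here is a
# placeholder; a real search would fit e.g. an xgboost model per setting.
# (mclapply forks; on Windows use parLapply or mc.cores = 1 instead.)
grid   <- expand.grid(depth = c(2, 4, 8), eta = c(0.01, 0.1))
scores <- mclapply(seq_len(nrow(grid)), function(i) {
  pars <- grid[i, ]
  sum(unlist(pars))                          # placeholder validation score
}, mc.cores = 2)
```

This is the pattern the slide recommends: keep the heavy data manipulation on one big-RAM machine, and only farm out the independent model fits.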