Size of Datasets for Analytics and Implications for R - useR! conference, Stanford University - June 2016

szilard
June 23, 2016

Transcript

  1. Size of Datasets for Analytics and Implications for R. Szilárd Pafka, PhD, Chief Scientist, Epoch. useR! conference, Stanford University, June 2016
  2. None
  3. Disclaimer: I am not representing my employer (Epoch) in this talk. I can neither confirm nor deny whether Epoch is using any of the methods, tools, results, etc. mentioned in this talk.
  4. None
  5. None
  6. None
  7. None
  8. None
  9. None
  10. None
  11. None
  12. None
  13. None
  14. None
  15. None
  16. (largest data analyzed)

  17. (largest data analyzed)

  18. None
  19. None
  20. None
  21. https://deads.gitbooks.io/paratext-bench/content/teaser.html June 2016

  22. Aggregation: 100M rows, 1M groups. Join: 100M rows x 1M rows. (Axes: time [s].)
  23. Aggregation: 100M rows, 1M groups. Join: 100M rows x 1M rows. (Axes: time [s].) Speedup on 5 nodes: Hive 1.5x, Spark 2x.
  24. None
  25. None
  26. (largest data analyzed)

  27. (largest data analyzed)

  28. None
  29. None
  30. None
  31. Gradient Boosting Machines: training time [s] vs. data size [M]. (Annotation: 10x.)

  32. None
  33. None
  34. Linear tops off. (Axes: data size, accuracy.)
  35. Linear tops off; more data & better algo. (Axes: data size, accuracy.)
  36. Linear tops off; more data & better algo; random forest on 1% of data beats linear on all data. (Axes: data size, accuracy.)
  37. Linear tops off; more data & better algo; random forest on 1% of data beats linear on all data. (Axes: data size, accuracy.)
  38. None
  39. None
  40. None
  41. None
  42. None
  43. None
  44. None
  45. Summary / Tips for analyzing “big” data:

    - Get lots of RAM (physical/cloud)
    - Use R and high performance R packages (e.g. data.table, xgboost)
    - Do data reduction in the database (analytical db/big data system)
    - (Only) distribute embarrassingly parallel tasks (e.g. hyperparameter search for machine learning)
    - Let engineers (store and) ETL the data (“scalable”)
    - Use statistics/domain knowledge/thinking
    - Use “big data tools” only if the above tips are not enough
  46. None
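
The tip on slide 45 about distributing only embarrassingly parallel tasks, with hyperparameter search as the example, can be illustrated with a minimal sketch. It is written in Python with the standard library only, purely as an illustration; it is not code from the talk. The names `evaluate` and `grid_search`, the parameter names, and the quadratic "loss" are all hypothetical stand-ins for training a real model (for instance a gradient boosting model) and scoring it on validation data. A thread pool is used so the sketch runs anywhere; for CPU-bound training one would more likely pass a process pool such as `concurrent.futures.ProcessPoolExecutor`.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def evaluate(params):
    """Hypothetical stand-in for 'train a model and return validation loss'.

    A real version would fit a model with these settings; a simple
    quadratic keeps the sketch self-contained and deterministic.
    """
    depth, rate = params
    return (depth - 6) ** 2 + (rate - 0.1) ** 2


def grid_search(depths, rates, executor_cls=ThreadPoolExecutor, workers=4):
    """Score every (depth, rate) combination independently; return the best."""
    grid = list(product(depths, rates))
    # Each grid point is scored without reference to any other, so the
    # tasks share no state and need no communication between workers.
    # That independence is what makes the search embarrassingly parallel.
    with executor_cls(max_workers=workers) as pool:
        losses = list(pool.map(evaluate, grid))
    # pool.map preserves input order, so losses line up with grid.
    best_loss, best_params = min(zip(losses, grid))
    return best_params, best_loss
```

Calling `grid_search([2, 4, 6, 8], [0.01, 0.1, 0.3])` scores all twelve combinations and returns the pair with the lowest loss. Because the tasks never talk to each other, the same structure carries over to several cores or several machines without changing the logic, which is the contrast the slide draws with workloads that need coordination between nodes.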