Benchmarking Machine Learning Tools - H2O World Conference - Nov 2015

szilard
November 05, 2015

Transcript

  1. Benchmarking Machine Learning Tools for Scalability, Speed and Accuracy. Szilárd Pafka, PhD, Chief Scientist, Epoch. H2O World Conference, Mountain View, Nov 2015
  2. Disclaimer: I am not representing my employer (Epoch) in this talk. I can neither confirm nor deny whether Epoch is using any of the methods, tools, results etc. mentioned in this talk. The results presented in this talk should not be considered an indication of whether Epoch is using these methods, tools, results etc.
  3. I usually use other people’s code [...] it is usually not “efficient” (from time budget perspective) to write my own algorithm [...] I can find open source code for what I want to do, and my time is much better spent doing research and feature engineering -- Owen Zhang http://blog.kaggle.com/2015/06/22/profiling-top-kagglers-owen-zhang-currently-1-in-the-world/
  4. R packages 30%, Python scikit-learn 40%, Vowpal Wabbit 8%, H2O 10%, xgboost 8%, Spark MLlib 6%
  5. R packages 30%, Python scikit-learn 40%, Vowpal Wabbit 8%, H2O 10%, xgboost 8%, Spark MLlib 6%, a few others
  6. R packages 30%, Python scikit-learn 40%, Vowpal Wabbit 8%, H2O 10%, xgboost 8%, Spark MLlib 6%, a few others
  7. EC2

  8. Distributed computation generally is hard, because it adds an additional layer of complexity and [network] communication overhead. The ideal case is scaling linearly with the number of nodes; that’s rarely the case. Emerging evidence shows that very often, one big machine, or even a laptop, outperforms a cluster. http://fastml.com/the-emperors-new-clothes-distributed-machine-learning/
  9. n = 10K, 100K, 1M, 10M, 100M; training time, RAM usage, AUC, CPU % by core; read data, pre-process, score test data
  10. linear tops off; more data & better algo; random forest on 1% of data beats linear on all data [plot: accuracy vs. data size]
  11. 10x

  12. we will continue to run large [...] jobs to scan petabytes of [...] data to extract interesting features, but this paper explores the interesting possibility of switching over to a multi-core, shared-memory system for efficient execution on more refined datasets [...] e.g., machine learning http://openproceedings.org/2014/conf/edbt/KumarGDL14.pdf
  13. learn_rate = 0.1, max_depth = 6, n_trees = 300; learn_rate = 0.01, max_depth = 16, n_trees = 1000
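
Slide 9 lists what is measured at each training size n, including training time and AUC. A minimal sketch of such a measurement loop follows, using only the Python standard library. It is not the code behind the talk: the synthetic data generator, the toy "model" and the names make_data, train_toy_model and benchmark are all hypothetical stand-ins for a real dataset and a real learner, and the sizes are kept small so the sketch runs quickly.

```python
import random
import time


def auc(labels, scores):
    """Rank-based AUC (Mann-Whitney U), using average ranks for tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied scores starting at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def make_data(n, seed):
    """Hypothetical synthetic data: one feature x, label 1 when x plus noise is positive."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1 if x + rng.gauss(0.0, 1.0) > 0 else 0 for x in xs]
    return xs, ys


def train_toy_model(xs, ys):
    """Hypothetical stand-in for a real learner: score = x minus the class-mean midpoint."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    midpoint = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0
    return lambda x: x - midpoint


def benchmark(sizes, n_test=2000):
    """For each training size, record training time and AUC on a fixed test set."""
    test_x, test_y = make_data(n_test, seed=0)
    results = []
    for n in sizes:
        train_x, train_y = make_data(n, seed=n)
        start = time.perf_counter()
        model = train_toy_model(train_x, train_y)
        train_seconds = time.perf_counter() - start
        test_auc = auc(test_y, [model(x) for x in test_x])
        results.append({"n": n, "train_seconds": train_seconds, "auc": test_auc})
    return results


if __name__ == "__main__":
    for row in benchmark([1_000, 10_000, 100_000]):
        print(row)
```

The toy scorer is monotonic in x, so its AUC does not change with n; only the timing does. Swapping train_toy_model for a real learner is what would make the accuracy column informative. RAM usage and per-core CPU % from slide 9 are not covered here, since they need OS-level tooling.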