Benchmarking Machine Learning Tools for Scalability, Speed and Accuracy - LA ML Meetup @eHarmony - June 2015

szilard
June 12, 2015


  1. Benchmarking Machine Learning Tools for Scalability, Speed and Accuracy. Szilárd Pafka, PhD, Chief Scientist, Epoch. LA Machine Learning Meetup, June 2015
  5. I usually use other people’s code [...] it is usually not “efficient” (from time budget perspective) to write my own algorithm [...] I can find open source code for what I want to do, and my time is much better spent doing research and feature engineering -- Owen Zhang http://blog.kaggle.com/2015/06/22/profiling-top-kagglers-owen-zhang-currently-1-in-the-world/
  6. Data Science Toolbox Survey: 1. data munging, 2. visualization, 3. machine learning. Results: http://datascience.la/?s=survey Compare: kdnuggets poll, Rexer data mining survey
  8. Data Size for Supervised Learning # records: <10M 10M-10B >10B

  9. Data Size for Non-Linear Supervised Learning # records: <1M 1M-100M

  10. binary classification, 10M records, numeric & categorical features, non-sparse

  13. http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf http://lowrank.net/nikos/pubs/empirical.pdf

  14. http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf http://lowrank.net/nikos/pubs/empirical.pdf

  20. EC2

  22. Distributed computation generally is hard, because it adds an additional layer of complexity and [network] communication overhead. The ideal case is scaling linearly with the number of nodes; that’s rarely the case. Emerging evidence shows that very often, one big machine, or even a laptop, outperforms a cluster. http://fastml.com/the-emperors-new-clothes-distributed-machine-learning/
  23. n = 10K, 100K, 1M, 10M, 100M; training time, RAM usage, AUC, CPU % by core; read data, pre-process, score test data
  28. vs More data usually beats better algorithms Datawocky

  37. I’m of course paranoid that the need for distributed learning is diminishing as individual computing nodes (augmented with GPUs) become increasingly powerful. So I was ready for Jure Leskovec’s workshop talk [at NIPS 2014]. Here is a killer screenshot. -- Paul Mineiro
  39. we will continue to run large [...] jobs to scan petabytes of [...] data to extract interesting features, but this paper explores the interesting possibility of switching over to a multi-core, shared-memory system for efficient execution on more refined datasets [...] e.g., machine learning http://openproceedings.org/2014/conf/edbt/KumarGDL14.pdf
  48. Non-Linear Supervised Learning

  49. # records: <1M 1M-100M >100M Non-Linear Supervised Learning

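
The measurements listed on slide 23 (training time, RAM usage, and AUC at increasing n) can be sketched with a small standard-library harness. This is an illustrative sketch, not the benchmark code from the talk: `make_data`, `train_logistic`, `auc`, and `benchmark` are hypothetical names, the data is synthetic, and the plain SGD logistic regression merely stands in for whichever tool is being measured.

```python
import math
import random
import time
import tracemalloc


def make_data(n, seed=0):
    """Synthetic binary classification data (an assumption, not the talk's dataset)."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        p = 1.0 / (1.0 + math.exp(-(1.5 * x1 - 1.0 * x2)))
        rows.append((x1, x2))
        labels.append(1 if rng.random() < p else 0)
    return rows, labels


def train_logistic(rows, labels, epochs=3, lr=0.1):
    """Plain SGD logistic regression, a stand-in for any real ML tool."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(rows, labels):
            z = max(-30.0, min(30.0, w[0] * x1 + w[1] * x2 + b))  # avoid overflow
            g = 1.0 / (1.0 + math.exp(-z)) - y  # gradient of log loss w.r.t. z
            w[0] -= lr * g * x1
            w[1] -= lr * g * x2
            b -= lr * g
    return w, b


def auc(scores, labels):
    """AUC via the rank-sum (Mann-Whitney U) formula, averaging ranks over ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def benchmark(sizes=(1_000, 10_000), n_test=2_000):
    """Record training time, peak traced memory, and test AUC for each n."""
    test_rows, test_labels = make_data(n_test, seed=1)
    results = []
    for n in sizes:
        rows, labels = make_data(n, seed=0)
        tracemalloc.start()  # traces only allocations made during training
        t0 = time.perf_counter()
        w, b = train_logistic(rows, labels)
        elapsed = time.perf_counter() - t0
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        scores = [w[0] * x1 + w[1] * x2 + b for x1, x2 in test_rows]
        results.append({"n": n, "train_sec": elapsed,
                        "peak_bytes": peak, "auc": auc(scores, test_labels)})
    return results


if __name__ == "__main__":
    for row in benchmark():
        print(row)
```

Note that `tracemalloc` slows training down and sees only Python-level allocations, so for real measurements timing and memory would be better taken in separate runs, with memory read from the operating system.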