
Machine Learning Software in Practice: Quo Vadis? - Invited Talk, KDD Conference, Applied Data Science Track - August 2017, Halifax, Canada

szilard
August 13, 2017

Transcript

  1. Machine Learning Software in Practice: Quo Vadis? Szilárd Pafka, PhD, Chief Scientist, Epoch. KDD Conference - Applied Data Science Track, Invited Talk, August 2017, Halifax, Canada
  2. Disclaimer: I am not representing my employer (Epoch) in this talk. I cannot confirm nor deny if Epoch is using any of the methods, tools, results etc. mentioned in this talk.
  3. ML Tools Mismatch:
     - What practitioners wish for
     - What they truly need
     - What’s available
     - What’s advertised
     - What developers/researchers focus on
  4. Warning: This talk is a series of rants/observations with the aim to provoke/encourage thinking and constructive discussions about topics of impact on our industry.
  5. Warning: This talk is a series of rants/observations with the aim to provoke/encourage thinking and constructive discussions about topics of impact on our industry. Rantometer:
  6. 10x

  7. “I usually use other people’s code [...] I can find open source code for what I want to do, and my time is much better spent doing research and feature engineering” -- Owen Zhang http://blog.kaggle.com/2015/06/22/profiling-top-kagglers-owen-zhang-currently-1-in-the-world/
  8. - R packages
     - Python scikit-learn
     - Vowpal Wabbit
     - H2O
     - xgboost
     - Spark MLlib
     - a few others
  9. - R packages 30%
     - Python scikit-learn 40%
     - Vowpal Wabbit 8%
     - H2O 10%
     - xgboost 8%
     - Spark MLlib 6%
     - a few others
  10. - R packages 30%
      - Python scikit-learn 40%
      - Vowpal Wabbit 8%
      - H2O 10%
      - xgboost 8%
      - Spark MLlib 6%
      - a few others
  11. EC2

  12. n = 10K, 100K, 1M, 10M, 100M; Training time; RAM usage; AUC; CPU % by core; read data, pre-process, score test data
  13. n = 10K, 100K, 1M, 10M, 100M; Training time; RAM usage; AUC; CPU % by core; read data, pre-process, score test data
  14. 10x

  15. learn_rate = 0.1, max_depth = 6, n_trees = 300
      learn_rate = 0.01, max_depth = 16, n_trees = 1000
  16. ...

  17. Wishlist:
      - more datasets (10-100, structure, size)
      - automation: upgrading tools, re-running ($$)
      - more algos, more tools (OS/commercial?)
      - (even) more tuning of parameters
  18. Wishlist:
      - more datasets (10-100, structure, size)
      - automation: upgrading tools, re-running ($$)
      - more algos, more tools (OS/commercial?)
      - (even) more tuning of parameters
      - BaaS? crowdsourcing (data, tools/tuning)?
      - other ML problems (recsys, NLP…)
  19. “people that know what they’re doing just use open source [...] the same open source tools that the MLaaS services offer” - Bradford Cross
  20. already pre-processed data; less domain knowledge (or deliberately hidden); AUC 0.0001 increases "relevant"; no business metric; no actual deployment; models too complex; no online evaluation; no monitoring; data leakage
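
The measurements named on slides 12-13 (training time, RAM usage and AUC at growing n) and the two parameter settings on slide 15 could be wired together roughly as below. This is only a minimal sketch under stated assumptions, not the benchmark code behind the talk: the synthetic data generator, the placeholder learner `train_threshold_model`, the `benchmark` helper and the labels "shallow" and "deep" are all hypothetical, and `tracemalloc` only sees Python-heap allocations, so it understates memory used by native libraries.

```python
import random
import time
import tracemalloc

# The two settings from slide 15, kept as plain dicts so they can be handed to
# whichever GBM implementation is under test. The keys "shallow" and "deep"
# are labels invented for this sketch.
GBM_SETTINGS = {
    "shallow": {"learn_rate": 0.1, "max_depth": 6, "n_trees": 300},
    "deep": {"learn_rate": 0.01, "max_depth": 16, "n_trees": 1000},
}


def auc(labels, scores):
    """Rank-based AUC (Mann-Whitney U), using average ranks for tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def make_data(n, seed=0):
    """Synthetic binary data: one informative feature plus label noise."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [1 if x + rng.gauss(0, 1) > 0 else 0 for x in xs]
    return xs, ys


def train_threshold_model(xs, ys):
    """Placeholder learner: score = distance from the midpoint of class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    mid = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: x - mid


def benchmark(train_fn, sizes, n_test=1000):
    """Record training time, peak Python-heap memory and test AUC per size."""
    x_test, y_test = make_data(n_test, seed=1)
    results = []
    for n in sizes:
        x_train, y_train = make_data(n, seed=2)
        tracemalloc.start()
        t0 = time.perf_counter()
        predict = train_fn(x_train, y_train)
        elapsed = time.perf_counter() - t0
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        scores = [predict(x) for x in x_test]
        results.append({"n": n, "train_sec": elapsed, "peak_bytes": peak,
                        "auc": auc(y_test, scores)})
    return results


if __name__ == "__main__":
    # Small sizes so the sketch runs quickly; the slides go up to 100M rows.
    for row in benchmark(train_threshold_model, [10_000, 100_000]):
        print(row)
```

To compare real tools, `train_fn` would be swapped for a wrapper around each library's training call, with one of the `GBM_SETTINGS` entries translated into that library's own parameter names.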