A Random Walk in Data Science and Machine Learning in Practice - Use Cases Seminar, MS Biz Analytics, CEU - Budapest, May 2019

szilard
May 08, 2019

Transcript

  1. A Random Walk in Data Science and
    Machine Learning in Practice
    Szilard Pafka, PhD
    Chief Scientist, Epoch (USA)
    CEU, Business Analytics Masters
    Budapest, May 2019

  2. Disclaimer:
    I am not representing my employer (Epoch) in this talk
    I can neither confirm nor deny whether Epoch is using any of the methods, tools,
    results, etc. mentioned in this talk

  3. CRISP-DM, 1999

  4. Best Practices for Using Machine Learning
    in Businesses in 2018
    Szilárd Pafka, PhD
    Chief Scientist, Epoch (USA)
    Budapest BI Forum Conference
    November 2018

  5. Disclaimer:
    I am not representing my employer (Epoch) in this talk
    I can neither confirm nor deny whether Epoch is using any of the methods, tools,
    results, etc. mentioned in this talk

  6. https://twitter.com/baroquepasa/

  7. y = f (x1, x2, ... , xn)
    Source: Hastie et al., ESL 2nd ed.

  8. y = f (x1, x2, ... , xn)

  9. Source: Yann LeCun

  10. #1 Use the Right Algo

  11. Source: Andrew Ng

  12. #2 Use Open Source

  13. in 2006
    - cost was not a factor!
    - data.frame
    - [800] packages

  14. #3 Simple > Complex

  15. #4 Incorporate Domain Knowledge
    Do Feature Engineering (Still)
    Explore Your Data
    Clean Your Data

  16. #5 Do Proper Validation
    Avoid: Overfitting, Data Leakage
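
One common way data leakage creeps in is computing preprocessing statistics on the full dataset before splitting it. The sketch below is a minimal illustration, assuming a single numeric feature and a random holdout; the helper names (`train_test_split`, `fit_scaler`, `apply_scaler`) are hypothetical and not from the talk.

```python
import random
import statistics

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and split off a holdout set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_scaler(values):
    """Learn mean and standard deviation from the training values only."""
    return statistics.mean(values), statistics.pstdev(values) or 1.0

def apply_scaler(values, scaler):
    """Standardize values with previously learned statistics."""
    mean, std = scaler
    return [(v - mean) / std for v in values]

data = [float(i) for i in range(100)]
train, test = train_test_split(data)
scaler = fit_scaler(train)                 # statistics come from train only
train_scaled = apply_scaler(train, scaler)
test_scaled = apply_scaler(test, scaler)   # test reuses the train statistics
```

The same rule would apply to any fitted preprocessing step (imputation, target encoding, feature selection): fit on the training part, then apply unchanged to the holdout.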

  17. #6 Batch or Real-Time Scoring?

  18. https://medium.com/@HarlanH/patterns-for-connecting-predictive-models-to-software-products-f9b6e923f02d

  19. https://medium.com/@dvelsner/deploying-a-simple-machine-learning-model-in-a-modern-web-application-flask-angular-docker-a657db075280
    your app

  20. R/Python:
    - Slow(er)
    - Encoding of categ. variables
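
The categorical-encoding point can be made concrete: at scoring time a record has to be encoded against the category levels seen during training, not against whatever levels happen to appear in the new data. This is a minimal sketch under that assumption; `fit_categories` and `one_hot` are hypothetical helper names, and mapping unseen levels to an all-zero vector is one possible policy among several.

```python
def fit_categories(values):
    """Record the category levels seen at training time, in a fixed order."""
    return sorted(set(values))

def one_hot(value, categories):
    """Encode one value against the training-time levels.

    Levels unseen during training map to an all-zero vector.
    """
    return [1 if value == c else 0 for c in categories]

train_colors = ["red", "green", "blue", "green"]
categories = fit_categories(train_colors)   # fixed column order for scoring

encoded_known = one_hot("green", categories)
encoded_unseen = one_hot("purple", categories)
```

Storing `categories` alongside the model keeps the column order identical between training and real-time scoring.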

  21. #7 Do Online Validation as Well
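
Online validation usually means comparing a new model against the current one on live traffic. As a rough sketch, assuming a simple A/B test on a binary outcome (for example conversion), a two-proportion z-test could look like this; the function name and the traffic numbers are made up for illustration.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

# Hypothetical traffic: arm A is the current model, arm B the candidate.
z, p_value = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
```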

  22. https://www.oreilly.com/ideas/evaluating-machine-learning-models/page/2/orientation

  23. https://www.oreilly.com/ideas/evaluating-machine-learning-models/page/2/orientation

  24. https://www.oreilly.com/ideas/evaluating-machine-learning-models/page/2/orientation
    https://www.slideshare.net/FaisalZakariaSiddiqi/netflix-recommendations-feature-engineering-with-time-travel

  25. #8 Monitor Your Models
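
One simple thing to monitor is whether the distribution of model scores in production drifts away from what was seen at training time. The sketch below uses the population stability index (PSI) as one possible drift measure; this choice, the binning, and the function name `psi` are assumptions for illustration, not something the slides prescribe.

```python
import math

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population stability index between two samples of scores in [0, 1]."""
    def bin_fractions(scores):
        counts = [0] * n_bins
        for s in scores:
            idx = min(int(s * n_bins), n_bins - 1)  # score 1.0 goes in the last bin
            counts[idx] += 1
        # eps avoids log(0) and division by zero for empty bins
        return [max(c / len(scores), eps) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 1000 for i in range(1000)]        # scores at training time
shifted = [min(s + 0.2, 1.0) for s in baseline]   # scores drifting upward

psi_same = psi(baseline, baseline)
psi_shifted = psi(baseline, shifted)
```

Identical distributions give a PSI of zero, and larger values indicate stronger drift; the alerting threshold is a judgment call per use case.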

  26. https://www.retentionscience.com/blog/automating-machine-learning-monitoring-rs-labs/

  27. https://www.retentionscience.com/blog/automating-machine-learning-monitoring-rs-labs/

  28. 20%
    80%
    (my guess)

  29. 20%
    80%
    (my guess)

  30. #9 Business Value
    Seek / Measure / Sell

  31. #10 Make it Reproducible

  32. Cloud (servers)

  33. ML training:
    lots of CPU cores
    lots of RAM
    limited time

  34. ML training:
    lots of CPU cores
    lots of RAM
    limited time
    ML scoring:
    separated servers

  35. ML (cloud) services (MLaaS)

  36. “people that know what they’re doing just
    use open source [...] the same open
    source tools that the MLaaS services offer”
    - Bradford Cross

  37. already pre-processed data
    less domain knowledge
    (or deliberately hidden)
    AUC 0.0001 increases "relevant"
    no business metric
    no actual deployment
    models too complex
    no online evaluation
    no monitoring
    data leakage

  38. Tuning and Auto ML

  39. Ben Recht, Kevin Jamieson: http://www.argmin.net/2016/06/20/hypertuning/

  40. Aggregation: 100M rows, 1M groups (chart; axis: time [s])
    Join: 100M rows x 1M rows (chart; axis: time [s])

  41. Aggregation: 100M rows, 1M groups (chart; axis: time [s])
    Join: 100M rows x 1M rows (chart; axis: time [s])
    “Motherfucka!”

  42. API and GUIs

  43. How to Start?

  44. Better than Deep Learning:
    Gradient Boosting Machines (GBM)
    Szilard Pafka, PhD
    Chief Scientist, Epoch (USA)
    DataWorks Summit, Barcelona, Spain
    March 2019

  45. Disclaimer:
    I am not representing my employer (Epoch) in this talk
    I can neither confirm nor deny whether Epoch is using any of the methods, tools,
    results, etc. mentioned in this talk

  46. Source: Andrew Ng

  47. Source: Andrew Ng

  48. Source: Andrew Ng

  49. Source: https://twitter.com/iamdevloper/

  50. http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf
    http://lowrank.net/nikos/pubs/empirical.pdf

  51. Source: Hastie et al., ESL 2nd ed.

  52. Source: Hastie et al., ESL 2nd ed.

  53. Source: Hastie et al., ESL 2nd ed.

  54. Source: Hastie et al., ESL 2nd ed.

  55. I usually use other people’s code [...] I can find open source code for
    what I want to do, and my time is much better spent doing research
    and feature engineering -- Owen Zhang

  56. http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
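
The paper linked on this slide (Bergstra and Bengio, JMLR 2012) makes the case for random search over grid search for hyperparameter optimization. A minimal sketch of random search follows; `validation_error` is a hypothetical toy objective standing in for "train a model and score it on a validation set", and the search ranges are arbitrary choices for illustration.

```python
import random

def validation_error(learn_rate, max_depth):
    """Hypothetical stand-in for training a model and scoring it on a validation set."""
    return (learn_rate - 0.1) ** 2 + 0.001 * (max_depth - 6) ** 2

def random_search(n_trials, seed=0):
    """Sample hyperparameters independently at random and keep the best trial."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {
            "learn_rate": 10 ** rng.uniform(-3, 0),   # log-uniform in [0.001, 1]
            "max_depth": rng.randint(2, 16),
        }
        err = validation_error(**params)
        if best is None or err < best[0]:
            best = (err, params)
    return best

best_err, best_params = random_search(n_trials=50)
```

Sampling the learning rate on a log scale is a common choice when plausible values span several orders of magnitude.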

  57. http://www.argmin.net/2016/06/20/hypertuning/

  58. no-one is using this crap

  59. Machine Learning Software in Practice:
    Quo Vadis?
    Szilárd Pafka, PhD
    Chief Scientist, Epoch
    KDD Conference - Applied Data Science Track
    Invited Talk
    August 2017, Halifax, Canada

  60. Machine Learning Software in Practice:
    Quo Vadis?
    Szilárd Pafka, PhD
    Chief Scientist, Epoch
    KDD Conference - Applied Data Science Track
    Invited Talk
    August 2017, Halifax, Canada
    SOME OF

  61. ML Tools Mismatch:
    - What practitioners wish for
    - What they truly need

  62. ML Tools Mismatch:
    - What practitioners wish for
    - What they truly need
    - What’s available
    - What’s advertised
    - What developers/researchers focus on

  63. This talk is mostly in the context of
    (binary) classification

  64. Warning:
    This talk is a series of ~~rants~~ observations with the aim
    to ~~provoke~~ encourage thinking and constructive
    discussions about topics of impact on our industry.

  65. Warning:
    This talk is a series of ~~rants~~ observations with the aim
    to ~~provoke~~ encourage thinking and constructive
    discussions about topics of impact on our industry.
    Rantometer:

  66. Our tools are optimized for what use cases?

  67. Is building this the best allocation of
    our developer resources?

  68. Efficiency for users during usage?

  69. Machine Learning Tools
    Speed, Memory, Accuracy

  70. I usually use other people’s code [...] I can find open source code for
    what I want to do, and my time is much better spent doing research
    and feature engineering -- Owen Zhang

  71. binary classification, 10M records
    numeric & categorical features, non-sparse

  72. http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf
    http://lowrank.net/nikos/pubs/empirical.pdf

  73. http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf
    http://lowrank.net/nikos/pubs/empirical.pdf

  74. n = 10K, 100K, 1M, 10M, 100M
    Training time
    RAM usage
    AUC
    CPU % by core
    read data, pre-process, score test data
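
Of the metrics listed, AUC is the one that benefits most from a concrete definition: it equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; it is a quadratic-time illustration for small inputs, not how a benchmark at these data sizes would compute it.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half. This O(n_pos * n_neg) version is for illustration only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
result = auc(labels, scores)   # 3 of the 4 positive/negative pairs are ordered correctly
```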

  75. n = 10K, 100K, 1M, 10M, 100M
    Training time
    RAM usage
    AUC
    CPU % by core
    read data, pre-process, score test data

  76. http://datascience.la/benchmarking-random-forest-implementations/#comment-53599

  77. Best linear: 71.1

  78. learn_rate = 0.1, max_depth = 6, n_trees = 300
    learn_rate = 0.01, max_depth = 16, n_trees = 1000
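
To show what `learn_rate` and `n_trees` control, here is a toy gradient boosting sketch for squared loss on a single feature. It is a simplification under stated assumptions: each tree is a depth-1 stump (so `max_depth` is not modeled), the task is regression rather than the binary classification benchmarked in the talk, and all function names are hypothetical. Production tools use far more efficient implementations.

```python
def fit_stump(x, residuals):
    """Find the threshold split on a single feature that minimizes squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:   # exclude the max so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def predict(model, xi):
    """Base value plus the shrunken sum of all stump outputs."""
    base, learn_rate, stumps = model
    return base + learn_rate * sum(
        left if xi <= threshold else right for threshold, left, right in stumps
    )

def fit_gbm(x, y, learn_rate=0.1, n_trees=300):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(y) / len(y)
    model = (base, learn_rate, [])
    for _ in range(n_trees):
        residuals = [yi - predict(model, xi) for xi, yi in zip(x, y)]
        model[2].append(fit_stump(x, residuals))
    return model

x = [i / 10 for i in range(20)]
y = [xi ** 2 for xi in x]
model = fit_gbm(x, y)
mse = sum((yi - predict(model, xi)) ** 2 for xi, yi in zip(x, y)) / len(x)
baseline_mse = sum((yi - sum(y) / len(y)) ** 2 for yi in y) / len(y)
```

A smaller `learn_rate` shrinks each tree's contribution, which is why the second setting on the slide pairs a lower rate with more trees.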

  79. Deep Learning
    AI
    Oh my... OUT

  80. Distributed ML
    OUT

  81. Multicore ML

  82. 1M: CPU cache effects

  83. (lightgbm 10M)

  84. 16 cores vs 1:
    16 cores:

  85. Aggregation: 100M rows, 1M groups (chart; axis: time [s])
    Join: 100M rows x 1M rows (chart; axis: time [s])

  86. Wishlist:
    - more datasets (10-100, structure, size)
    - automation: upgrading tools, re-running ($$)

  87. Wishlist:
    - more datasets (10-100, structure, size)
    - automation: upgrading tools, re-running ($$)
    - more algos, more tools (OS/commercial?)
    - (even) more tuning of parameters

  88. Wishlist:
    - more datasets (10-100, structure, size)
    - automation: upgrading tools, re-running ($$)
    - more algos, more tools (OS/commercial?)
    - (even) more tuning of parameters
    - BaaS? crowdsourcing (data, tools/tuning)?
    - other ML problems (recsys, NLP…)

  89. so far we discussed
    performance +
    (some) system architecture
    but for training only

  90. APIs (and GUIs)
    OUT

  91. Cloud (MLaaS)
    OUT

  92. Real-Time Scoring

  93. R/Python:
    - Slow(er)
    - Encoding of categ. variables

  94. Tuning & AutoML
    OUT

  95. Model Understanding,
    Accountability

  96. Evaluation Metrics
    OUT

  97. Machine Learning with H2O.ai
    Szilárd Pafka, PhD
    Chief Scientist, Epoch
    LA H2O Meetup @ AT&T
    January 2017

  98. Machine Learning with H2O.ai
    Szilárd Pafka, PhD
    Chief Scientist, Epoch
    LA H2O Meetup @ AT&T
    January 2017
    SOME OF

  99. Supervised Learning
    y = f(x)
    train: “learn” f from data X (n*p), y (n)
    score: f(x’)
    algos: k-NN, LR, NB, RF, GBM, SVM, NN, DL…
    goal: max accuracy measure (on new data)
    $f \in F(\theta)$
    $\min_{\theta} \big( L(y, f(x, \theta)) + R(\theta) \big)$ on train set
    evaluate on separate test set / cross-validation
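
As one concrete instance of minimizing a loss plus a regularization term on the training set, the sketch below fits a one-feature logistic regression with an L2 penalty by plain gradient descent. The choice of model, the penalty R(θ) = λ·θ₁², the step size, and the tiny dataset are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def objective(theta, xs, ys, lam):
    """Mean log loss L(y, f(x, theta)) plus an L2 penalty R(theta) = lam * theta_1^2."""
    loss = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(theta[0] + theta[1] * x)
        loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return loss / len(xs) + lam * theta[1] ** 2

def fit(xs, ys, lam=0.01, step=0.5, n_iter=500):
    """Minimize the objective over theta by plain gradient descent."""
    theta = [0.0, 0.0]
    for _ in range(n_iter):
        grad = [0.0, 0.0]
        for x, y in zip(xs, ys):
            err = sigmoid(theta[0] + theta[1] * x) - y
            grad[0] += err / len(xs)
            grad[1] += err * x / len(xs)
        grad[1] += 2 * lam * theta[1]   # gradient of the L2 penalty (intercept unpenalized)
        theta = [t - step * g for t, g in zip(theta, grad)]
    return theta

# Toy training set: larger x tends to go with label 1, with some overlap.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]
theta = fit(xs, ys)
```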

  100. Structure/Hyperparameters λ
    $\min_{\theta} \big( L(y, f(x, \theta[, \lambda])) + R(\theta, \lambda) \big)$
    often λ ~ capacity/complexity

  101. Model selection:
    Need
    Vary λ and select the model with the best accuracy on the validation set
    Evaluate the final model on the test set
    (or use cross-validation)
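
The model selection procedure can be sketched as follows: fit on the training part for each candidate λ, pick the λ with the lowest validation error, and only then touch the test set once. The example assumes a one-feature ridge regression on synthetic data; the model, the data-generating process, and the candidate grid are illustrative choices only.

```python
import random

def fit_ridge(xs, ys, lam):
    """One-feature ridge regression without intercept: closed-form slope."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(slope, xs, ys):
    """Mean squared error of the fitted line on the given data."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(1)
xs = [rng.uniform(-1, 1) for _ in range(300)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]

# Three disjoint parts: fit on train, choose lambda on validation, report on test.
train, valid, test = slice(0, 200), slice(200, 250), slice(250, 300)

candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
valid_errors = {
    lam: mse(fit_ridge(xs[train], ys[train], lam), xs[valid], ys[valid])
    for lam in candidates
}
best_lam = min(valid_errors, key=valid_errors.get)

# The test set is used once, only for the selected model.
final_slope = fit_ridge(xs[train], ys[train], best_lam)
test_error = mse(final_slope, xs[test], ys[test])
```

Reporting the validation error of the winning model instead of `test_error` would be optimistic, because λ was chosen to minimize exactly that number.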

  102. http://datascience.la/meetup-summary-winning-data-science-competitions/

  103. http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf
    http://lowrank.net/nikos/pubs/empirical.pdf

  104. http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf
    http://lowrank.net/nikos/pubs/empirical.pdf

  105. http://datascience.la/meetup-summary-winning-data-science-competitions/

  106. Gradient Boosting Machines: training time [s] vs. data size [M] (chart, annotated "10x")

  107. Disclaimer:
    I’m not affiliated with H2O.ai.
    It’s just that in my opinion H2O is a machine learning
    tool with several advantages. There are many other
    good tools (and many more awful ones).

  108. - high-performance implementation of best algos (RF,
    GBM, NN etc.)
    - R, Python etc. interfaces, easy to use API

  109. - high-performance implementation of best algos (RF,
    GBM, NN etc.)
    - R, Python etc. interfaces, easy to use API
    - open source
    - advisors: Hastie, Tibshirani

  110. - high-performance implementation of best algos (RF,
    GBM, NN etc.)
    - R, Python etc. interfaces, easy to use API
    - open source
    - advisors: Hastie, Tibshirani
    - Java, but C-style memalloc, by Java gurus
    - distributed, “big data”

  111. - high-performance implementation of best algos (RF,
    GBM, NN etc.)
    - R, Python etc. interfaces, easy to use API
    - open source
    - advisors: Hastie, Tibshirani
    - Java, but C-style memalloc, by Java gurus
    - distributed, “big data”
    - many knobs/tuning, model evaluation, cross validation,
    model selection (hyperparameter search)

  112. - high-performance implementation of best algos (RF,
    GBM, NN etc.)
    - R, Python etc. interfaces, easy to use API
    - open source
    - advisors: Hastie, Tibshirani
    - Java, but C-style memalloc, by Java gurus
    - distributed, “big data”
    - many knobs/tuning, model evaluation, cross validation,
    model selection (hyperparameter search)

  113. install.packages("h2o")
    http://www.h2o.ai/

  114. https://gist.github.com/szilard/b87233bbf41a4b366c26eede7bb1a0f3
    Laptop / 1 server / cluster

  115. No need for manual 1-hot
    encoding of categorical variables

  116. https://gist.github.com/szilard/b87233bbf41a4b366c26eede7bb1a0f3

  117. Some Updates

  118. A Few More Thoughts
