A Random Walk in Data Science and Machine Learning in Practice - Use Cases Seminar, MS Biz Analytics, CEU - Budapest, May 2019

szilard

May 08, 2019
Transcript

  1. A Random Walk in Data Science and Machine Learning in Practice. Szilard Pafka, PhD, Chief Scientist, Epoch (USA). CEU, Business Analytics Masters, Budapest, May 2019
  2. None
  3. Disclaimer: I am not representing my employer (Epoch) in this talk. I can neither confirm nor deny whether Epoch is using any of the methods, tools, results, etc. mentioned in this talk.
  4. None
  5. None
  6. CRISP-DM, 1999

  7.–46. None (image-only slides)
  47. Best Practices for Using Machine Learning in Businesses in 2018. Szilárd Pafka, PhD, Chief Scientist, Epoch (USA). Budapest BI Forum Conference, November 2018
  48. None
  49. Disclaimer: I am not representing my employer (Epoch) in this talk. I can neither confirm nor deny whether Epoch is using any of the methods, tools, results, etc. mentioned in this talk.
  50. https://twitter.com/baroquepasa/

  51.–55. None (image-only slides)
  56. y = f (x1, x2, ... , xn) Source: Hastie et al., ESL 2nd ed.
  57. y = f (x1, x2, ... , xn)

  58. None
  59. None
  60. Source: Yann LeCun

  61. None
  62. None
  63. 2018?

  64. 2018?

  65. #1 Use the Right Algo

  66. Source: Andrew Ng

  67.–83. None (image-only slides)
  84. *

  85. #2 Use Open Source

  86.–90. None (image-only slides)
  91. in 2006: - cost was not a factor! - data.frame - [800] packages
  92.–96. None (image-only slides)
  97. #3 Simple > Complex

  98. None
  99. 10x

  100.–107. None (image-only slides)
  108. #4 Incorporate Domain Knowledge. Do Feature Engineering (Still). Explore Your Data. Clean Your Data.
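
A minimal sketch in base R of the kind of exploration and cleaning this slide calls for; the data set and all column names are hypothetical, invented for illustration:

    # hypothetical transactional data; columns invented for illustration
    d <- read.csv("transactions.csv")
    str(d)                                   # dimensions and column types
    summary(d$amount)                        # spot outliers / impossible values
    table(d$country, useNA = "ifany")        # categorical levels and missing data
    d$amount[d$amount < 0] <- NA             # clean: negative amounts are errors here
    d$country <- as.factor(d$country)
    d$dow <- as.factor(weekdays(as.Date(d$date)))   # domain-driven feature: day of week
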
  109.–119. None (image-only slides)
  120. #5 Do Proper Validation Avoid: Overfitting, Data Leakage
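
A sketch of one common guard against the overfitting and leakage mentioned here: a time-based split instead of a random one (data and column names hypothetical):

    # order by time and hold out the newest records; a random split can leak
    # future information into training when records are time-dependent
    d <- d[order(d$date), ]
    n_train <- floor(0.8 * nrow(d))
    d_train <- d[1:n_train, ]
    d_test  <- d[(n_train + 1):nrow(d), ]
    # compute derived features (means, encodings, scalers) on d_train only,
    # then apply the fitted transformations to d_test
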

  121.–134. None (image-only slides)
  135. #6 Batch or Real-Time Scoring?
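
Slide 138 below links a Python/Flask deployment example; purely as an illustration (not from the talk), a real-time scoring endpoint in R could use the plumber package. The model file and request fields are placeholders:

    # plumber.R: expose a pre-trained model as a REST endpoint
    library(plumber)
    md <- readRDS("model.rds")        # placeholder: any model with a predict() method

    #* @post /predict
    function(req) {
      newdata <- as.data.frame(jsonlite::fromJSON(req$postBody))
      list(score = predict(md, newdata))
    }

    # run with: plumber::pr("plumber.R") |> plumber::pr_run(port = 8000)
    # batch scoring is the simpler alternative: a scheduled job scoring a file/table
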

  136. None
  137. https://medium.com/@HarlanH/patterns-for-connecting-predictive-models-to-software-products-f9b6e923f02d

  138. https://medium.com/@dvelsner/deploying-a-simple-machine-learning-model-in-a-modern-web-application-flask-angular-docker-a657db075280 your app

  139. None
  140. None
  141. R/Python: - Slow(er) - Encoding of categ. variables
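
The categorical-variable issue: the encoding used at scoring time must match training exactly. A base-R illustration (column names hypothetical):

    # factor levels at scoring time must match those seen in training, otherwise
    # predict() can fail or silently produce a wrong encoding
    train_levels <- levels(d_train$country)
    new_obs$country <- factor(new_obs$country, levels = train_levels)
    # unseen levels become NA here; decide explicitly how to handle them
    predict(md, new_obs)
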

  142. #7 Do Online Validation as Well
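
Offline AUC is not the end of the story; online, one compares the business metric between users scored by the model and a control group, e.g. with a simple two-proportion test (the counts below are made up):

    # hypothetical A/B results: conversions out of users exposed
    conversions <- c(control = 532, model = 601)
    exposed     <- c(control = 10000, model = 10000)
    prop.test(conversions, exposed)   # is the observed lift distinguishable from noise?
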

  143. None
  144. https://www.oreilly.com/ideas/evaluating-machine-learning-models/page/2/orientation

  145. https://www.oreilly.com/ideas/evaluating-machine-learning-models/page/2/orientation

  146. https://www.oreilly.com/ideas/evaluating-machine-learning-models/page/2/orientation https://www.slideshare.net/FaisalZakariaSiddiqi/netflix-recommendations-feature-engineering-with-time-travel

  147. #8 Monitor Your Models

  148. None
  149. https://www.retentionscience.com/blog/automating-machine-learning-monitoring-rs-labs/

  150. https://www.retentionscience.com/blog/automating-machine-learning-monitoring-rs-labs/
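
A sketch of what model monitoring can look like: recompute the evaluation metric on fresh labeled data over time (pROC is my choice of package here; the data frame and columns are placeholders):

    library(pROC)
    # scored: one row per prediction, with eventual outcome y and model score
    scored$week <- format(as.Date(scored$date), "%Y-%U")
    weekly_auc <- sapply(split(scored, scored$week),
                         function(s) as.numeric(auc(s$y, s$score)))
    plot(weekly_auc, type = "b")      # a sustained drop flags drift: investigate/retrain
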

  151. None
  152. 20% 80% (my guess)

  153. 20% 80% (my guess)

  154. #9 Business Value Seek / Measure / Sell

  155.–159. None (image-only slides)
  160. #10 Make it Reproducible

  161.–169. None (image-only slides)
  170. Cloud (servers)

  171. ML training: lots of CPU cores, lots of RAM, limited time
  172. ML training: lots of CPU cores, lots of RAM, limited time. ML scoring: separated servers
  173. ML (cloud) services (MLaaS)

  174. None
  175. “people that know what they’re doing just use open source [...] the same open source tools that the MLaaS services offer” - Bradford Cross
  176. Kaggle

  177. None
  178. already pre-processed data; less domain knowledge (or deliberately hidden); AUC 0.0001 increases "relevant"; no business metric; no actual deployment; models too complex; no online evaluation; no monitoring; data leakage
  179. Tuning and Auto ML

  180. Ben Recht, Kevin Jamieson: http://www.argmin.net/2016/06/20/hypertuning/
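
The linked posts argue that plain random search is a hard-to-beat baseline for hyperparameter tuning. A hedged sketch of random search with the h2o package (one option among many; the frame and feature names are placeholders):

    library(h2o)
    h2o.init()
    grid <- h2o.grid("gbm",
        x = predictors, y = "y", training_frame = dx_train,
        hyper_params = list(max_depth  = c(4, 6, 10, 16),
                            learn_rate = c(0.01, 0.03, 0.1),
                            ntrees     = c(100, 300, 1000)),
        search_criteria = list(strategy = "RandomDiscrete", max_models = 20))
    h2o.getGrid(grid@grid_id, sort_by = "auc", decreasing = TRUE)
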

  181. GPUs

  182. Aggregation: 100M rows, 1M groups; Join: 100M rows x 1M rows; time [s] (benchmark charts)
  183. Aggregation: 100M rows, 1M groups; Join: 100M rows x 1M rows; time [s] “Motherfucka!”
  184. None
  185. API and GUIs

  186. None
  187. None
  188. AI?

  189.–191. None (image-only slides)
  192. How to Start?

  193. None
  194. None
  195. Better than Deep Learning: Gradient Boosting Machines (GBM). Szilard Pafka, PhD, Chief Scientist, Epoch (USA). DataWorks Summit, Barcelona, Spain, March 2019
  196. None
  197. Disclaimer: I am not representing my employer (Epoch) in this talk. I can neither confirm nor deny whether Epoch is using any of the methods, tools, results, etc. mentioned in this talk.
  198. Source: Andrew Ng

  199. Source: Andrew Ng

  200. Source: Andrew Ng

  201.–204. None (image-only slides)
  205. Source: https://twitter.com/iamdevloper/

  206. None
  207. None
  208. ...

  209. None
  210. None
  211. http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf http://lowrank.net/nikos/pubs/empirical.pdf

  212.–226. None (image-only slides)
  227. Source: Hastie et al., ESL 2nd ed.

  228. Source: Hastie et al., ESL 2nd ed.

  229. Source: Hastie et al., ESL 2nd ed.

  230. Source: Hastie et al., ESL 2nd ed.

  231. None
  232. I usually use other people’s code [...] I can find open source code for what I want to do, and my time is much better spent doing research and feature engineering -- Owen Zhang
  233.–238. None (image-only slides)
  239. 10x

  240. None
  241. None
  242. 10x

  243.–259. None (image-only slides)
  260. http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf

  261. http://www.argmin.net/2016/06/20/hypertuning/

  262.–267. None (image-only slides)
  268. no-one is using this crap

  269. None
  270. None
  271. More:

  272. None
  273. Machine Learning Software in Practice: Quo Vadis? Szilárd Pafka, PhD, Chief Scientist, Epoch. KDD Conference, Applied Data Science Track, Invited Talk. August 2017, Halifax, Canada
  274. Machine Learning Software in Practice: Quo Vadis? Szilárd Pafka, PhD, Chief Scientist, Epoch. KDD Conference, Applied Data Science Track, Invited Talk. August 2017, Halifax, Canada. SOME OF
  275. None
  276. None
  277. ML Tools Mismatch: - What practitioners wish for - What they truly need
  278. ML Tools Mismatch: - What practitioners wish for - What they truly need - What’s available - What’s advertised - What developers/researchers focus on
  279. This talk is mostly in the context of (binary) classification

  280. Warning: This talk is a series of rants observations with the aim to provoke encourage thinking and constructive discussions about topics of impact on our industry.
  281. Warning: This talk is a series of rants observations with the aim to provoke encourage thinking and constructive discussions about topics of impact on our industry. Rantometer:
  282. Our tools are optimized for what use cases?

  283. Is building this the best allocation of our developer resources?

  284. Efficiency for users during usage?

  285. None
  286. None
  287. Big Data

  288.–305. None (image-only slides)
  306. Machine Learning Tools Speed, Memory, Accuracy

  307. None
  308. I usually use other people’s code [...] I can find open source code for what I want to do, and my time is much better spent doing research and feature engineering -- Owen Zhang
  309. binary classification, 10M records numeric & categorical features, non-sparse

  310. http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf http://lowrank.net/nikos/pubs/empirical.pdf

  311. http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf http://lowrank.net/nikos/pubs/empirical.pdf

  312.–317. None (image-only slides)
  318. EC2

  319. n = 10K, 100K, 1M, 10M, 100M; training time, RAM usage, AUC, CPU % by core; read data, pre-process, score test data
  320. n = 10K, 100K, 1M, 10M, 100M; training time, RAM usage, AUC, CPU % by core; read data, pre-process, score test data
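
The measurement loop behind such benchmarks can be sketched like this, with h2o as an example tool (d, predictors and dx_test are placeholders; RAM and CPU usage were tracked outside R in the actual benchmarks):

    for (n in c(1e4, 1e5, 1e6, 1e7)) {
      dx <- as.h2o(d[1:n, ])
      t  <- system.time(md <- h2o.gbm(x = predictors, y = "y", training_frame = dx))
      auc <- h2o.auc(h2o.performance(md, newdata = dx_test))
      cat(sprintf("n=%g  time=%.1fs  AUC=%.4f\n", n, t[["elapsed"]], auc))
    }
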
  321.–327. None (image-only slides)
  328. 10x

  329.–333. None (image-only slides)
  334. http://datascience.la/benchmarking-random-forest-implementations/#comment-53599

  335. None
  336. None
  337. Best linear: 71.1

  338. None
  339. None
  340. learn_rate = 0.1, max_depth = 6, n_trees = 300; learn_rate = 0.01, max_depth = 16, n_trees = 1000
  341.–343. None (image-only slides)
  344. Deep Learning AI Oh my... OUT

  345. Distributed ML OUT

  346. Multicore ML

  347. None
  348. None
  349. 1M: CPU cache effects

  350. (lightgbm 10M)

  351. 16 cores vs 1: 16 cores:

  352. GPUs

  353. None
  354. Aggregation: 100M rows, 1M groups; Join: 100M rows x 1M rows; time [s] (benchmark charts)
  355. None
  356. Benchmarks

  357. None
  358. None
  359. Wishlist: - more datasets (10-100, structure, size) - automation: upgrading tools, re-running ($$)
  360. Wishlist: - more datasets (10-100, structure, size) - automation: upgrading tools, re-running ($$) - more algos, more tools (OS/commercial?) - (even) more tuning of parameters
  361. Wishlist: - more datasets (10-100, structure, size) - automation: upgrading tools, re-running ($$) - more algos, more tools (OS/commercial?) - (even) more tuning of parameters - BaaS? crowdsourcing (data, tools/tuning)? - other ML problems (recsys, NLP…)
  362. so far we discussed performance + (some) system architecture, but for training only
  363. None
  364. APIs (and GUIs) OUT

  365. Cloud (MLaaS) OUT

  366. Real-Time Scoring

  367. None
  368. R/Python: - Slow(er) - Encoding of categ. variables

  369. Kaggle OUT

  370. Tuning & AutoML OUT

  371. Model Understanding, Accountability

  372. Evaluation Metrics OUT

  373. Machine Learning with H2O.ai. Szilárd Pafka, PhD, Chief Scientist, Epoch. LA H2O Meetup @ AT&T, January 2017
  374. Machine Learning with H2O.ai. Szilárd Pafka, PhD, Chief Scientist, Epoch. LA H2O Meetup @ AT&T, January 2017. SOME OF
  375. None
  376. Supervised Learning: y = f(x). train: “learn” f from data X (n*p), y (n). score: f(x’). algos: k-NN, LR, NB, RF, GBM, SVM, NN, DL… goal: max accuracy measure (on new data). f ∈ F(θ); min_θ ( L(y, f(x,θ)) + R(θ) ) on the train set; evaluate on a separate test set / cross validation
  377. Structure/Hyperparameters λ: min_θ ( L(y, f(x,θ[,λ])) + R(θ,λ) ); often λ ~ capacity/complexity
  378. Model selection: vary λ and keep the model with the best accuracy on a validation set; evaluate the final model on a test set / cross validation
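
Slides 376-378 state the whole procedure compactly; written out in LaTeX with the same notation:

    \hat{\theta}(\lambda) = \arg\min_{\theta} \; L\big(y, f(x; \theta, \lambda)\big) + R(\theta, \lambda)
    % \lambda (structure/hyperparameters) is held fixed while training \theta;
    % model selection: choose \lambda by accuracy on a validation set,
    % then evaluate the chosen model once on a separate test set (or cross-validate)
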
  379. overfitting

  380. http://datascience.la/meetup-summary-winning-data-science-competitions/

  381. http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf http://lowrank.net/nikos/pubs/empirical.pdf

  382. http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf http://lowrank.net/nikos/pubs/empirical.pdf

  383. http://datascience.la/meetup-summary-winning-data-science-competitions/

  384. None
  385. Gradient Boosting Machines (chart: data size [M] vs. training time [s], 10x)

  386. None
  387. Disclaimer: I’m not affiliated with H2O.ai. It’s just that in my opinion H2O is a machine learning tool with several advantages. There are many other good tools (and many more awful ones).
  388. - high-performance implementation of best algos (RF, GBM, NN etc.) - R, Python etc. interfaces, easy to use API
  389. - high-performance implementation of best algos (RF, GBM, NN etc.) - R, Python etc. interfaces, easy to use API - open source - advisors: Hastie, Tibshirani
  390. - high-performance implementation of best algos (RF, GBM, NN etc.) - R, Python etc. interfaces, easy to use API - open source - advisors: Hastie, Tibshirani - Java, but C-style memalloc, by Java gurus - distributed, “big data”
  391. - high-performance implementation of best algos (RF, GBM, NN etc.) - R, Python etc. interfaces, easy to use API - open source - advisors: Hastie, Tibshirani - Java, but C-style memalloc, by Java gurus - distributed, “big data” - many knobs/tuning, model evaluation, cross validation, model selection (hyperparameter search)
  392. - high-performance implementation of best algos (RF, GBM, NN etc.) - R, Python etc. interfaces, easy to use API - open source - advisors: Hastie, Tibshirani - Java, but C-style memalloc, by Java gurus - distributed, “big data” - many knobs/tuning, model evaluation, cross validation, model selection (hyperparameter search)
  393. install.packages("h2o") http://www.h2o.ai/
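
A minimal quickstart along the lines of the linked gist; the file paths and target column name are placeholders, and the hyperparameters are the first setting from slide 340:

    library(h2o)
    h2o.init(max_mem_size = "8g")              # local; the same code runs on a cluster
    dx_train <- h2o.importFile("train.csv")    # placeholder paths
    dx_test  <- h2o.importFile("test.csv")
    md <- h2o.gbm(x = setdiff(names(dx_train), "y"), y = "y",
                  training_frame = dx_train,
                  ntrees = 300, max_depth = 6, learn_rate = 0.1)
    h2o.auc(h2o.performance(md, newdata = dx_test))
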

  394. https://gist.github.com/szilard/b87233bbf41a4b366c26eede7bb1a0f3 Laptop / 1 server / cluster

  395.–397. None (image-only slides)
  398. No need for manual 1-hot encoding of categorical variables
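
What this means in code: H2O consumes factor (enum) columns directly, whereas many R modeling tools require an explicit one-hot step (column names hypothetical):

    # H2O: just mark the column as categorical
    dx_train$country <- as.factor(dx_train$country)
    md <- h2o.gbm(x = c("amount", "country"), y = "y", training_frame = dx_train)
    # compare: elsewhere in R one often needs X <- model.matrix(~ . - 1, data = d)
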

  399.–415. None (image-only slides)
  416. https://gist.github.com/szilard/b87233bbf41a4b366c26eede7bb1a0f3

  417. None
  418. Some Updates

  419.–423. None (image-only slides)
  424. A Few More Thoughts

  425.–437. None (image-only slides)