Best Practices for Using Machine Learning in Businesses in 2018 - Keynote at Budapest BI Forum Conference - Budapest, November 2018

Best Practices for Using Machine Learning in Businesses in 2018
Szilárd Pafka, PhD
Chief Scientist, Epoch (USA)
Budapest BI Forum Conference, November 2018

Disclaimer: I am not representing my employer (Epoch) in this talk.
I can neither confirm nor deny whether Epoch is using any of the methods, tools, results, etc. mentioned in this talk.

https://twitter.com/baroquepasa/

y = f(x1, x2, ..., xn)
Source: Hastie et al., ESL, 2nd ed.

y = f(x1, x2, ..., xn)

Source: Yann LeCun

#1 Use the Right Algo

Source: Andrew Ng

#2 Use Open Source

in 2006:
- cost was not a factor!
- data.frame
- [800] packages

#3 Simple > Complex

#4 Incorporate Domain Knowledge
Do Feature Engineering (Still)
Explore Your Data
Clean Your Data

#5 Do Proper Validation
Avoid: Overfitting, Data Leakage
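One common guard against data leakage with time-stamped data is to split by time instead of at random, and to fit any preprocessing on the training part only. A minimal sketch, assuming hypothetical records with a `ts` field (the function name, field names, and data are illustrative, not from the talk):

```python
from datetime import date

def time_split(rows, cutoff):
    """Return (train, valid): train is strictly before cutoff, valid is on or after it."""
    train = [r for r in rows if r["ts"] < cutoff]
    valid = [r for r in rows if r["ts"] >= cutoff]
    return train, valid

# Hypothetical records; field names are illustrative only.
rows = [
    {"ts": date(2018, 1, 5), "x": 1.0, "y": 0},
    {"ts": date(2018, 3, 9), "x": 2.5, "y": 1},
    {"ts": date(2018, 6, 1), "x": 0.7, "y": 0},
    {"ts": date(2018, 9, 20), "x": 3.1, "y": 1},
]
train, valid = time_split(rows, cutoff=date(2018, 6, 1))

# Fit the preprocessing statistic on the training part only,
# then reuse it unchanged on the validation part.
train_mean_x = sum(r["x"] for r in train) / len(train)
valid_centered = [r["x"] - train_mean_x for r in valid]
```

Computing the mean over all rows would let information from the validation period leak into training.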

#6 Batch or Real-Time Scoring?

https://medium.com/@HarlanH/patterns-for-connecting-predictive-models-to-software-products-f9b6e923f02d

https://medium.com/@dvelsner/deploying-a-simple-machine-learning-model-in-a-modern-web-application-flask-angular-docker-a657db075280

R/Python:
- Slow(er)
- Encoding of categorical variables
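One way to read the point about encoding categorical variables: the mapping learned at training time has to be stored and reused unchanged at scoring time, including for levels never seen in training. A minimal sketch under that assumption (`fit_encoder` and `one_hot` are hypothetical helper names, not from the talk):

```python
import json

def fit_encoder(values):
    """Map each distinct training level to a fixed column index."""
    levels = sorted(set(values))
    return {level: i for i, level in enumerate(levels)}

def one_hot(value, encoder):
    """Encode one value; an unseen level becomes an all-zero vector."""
    vec = [0] * len(encoder)
    idx = encoder.get(value)
    if idx is not None:
        vec[idx] = 1
    return vec

# Training time: learn the mapping and persist it next to the model.
encoder = fit_encoder(["red", "green", "blue", "green"])
saved = json.dumps(encoder)

# Scoring time: load the same mapping instead of re-deriving it from new data.
loaded = json.loads(saved)
seen = one_hot("green", loaded)
unseen = one_hot("purple", loaded)
```

Re-fitting the mapping on scoring data would silently change which column means which level.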

#7 Do Online Validation as Well

https://www.oreilly.com/ideas/evaluating-machine-learning-models/page/2/orientation

https://www.slideshare.net/FaisalZakariaSiddiqi/netflix-recommendations-feature-engineering-with-time-travel
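Online validation is often done as an A/B test: route a share of users to the new model and compare a business metric between the two groups. A minimal sketch, assuming string user IDs and a logged conversion flag (the function names, variant labels, and events are hypothetical):

```python
import hashlib

def assign_variant(user_id, treatment_share=0.5):
    """Deterministically assign a user to a variant by hashing the user ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform-looking value in [0, 1)
    return "new_model" if bucket < treatment_share else "current_model"

def conversion_rates(events):
    """Compute the conversion rate per variant from (variant, converted) pairs."""
    totals, wins = {}, {}
    for variant, converted in events:
        totals[variant] = totals.get(variant, 0) + 1
        wins[variant] = wins.get(variant, 0) + int(converted)
    return {v: wins[v] / totals[v] for v in totals}

# Hypothetical logged outcomes.
events = [
    ("new_model", True),
    ("new_model", False),
    ("current_model", False),
    ("current_model", False),
]
rates = conversion_rates(events)
```

Hashing keeps each user in the same group across sessions. A real comparison would also need enough traffic and a significance test before drawing conclusions.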

#8 Monitor Your Models

https://www.retentionscience.com/blog/automating-machine-learning-monitoring-rs-labs/
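One simple monitoring check is to compare the distribution of model scores in production against the distribution seen at training time. A minimal sketch using the Population Stability Index, which is one common drift measure and an assumption here, not a method named in the talk (scores are assumed to lie in [0, 1]):

```python
import math

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between two samples of scores in [0, 1]."""
    def shares(scores):
        counts = [0] * n_bins
        for s in scores:
            # A score of exactly 1.0 goes into the last bin.
            counts[min(int(s * n_bins), n_bins - 1)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(scores), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical score samples.
train_scores = [0.1, 0.2, 0.3, 0.4, 0.5]
prod_scores = [0.6, 0.7, 0.8, 0.9, 0.95]

no_drift = psi(train_scores, train_scores)
drift = psi(train_scores, prod_scores)
```

Identical distributions give a value of zero, and the value grows as the production scores move away from the training scores. What level should trigger an alert depends on the application.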

20% 80% (my guess)

#9 Business Value
Seek / Measure / Sell

#10 Make it Reproducible
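A small part of reproducibility is fixing random seeds and recording the environment with each result. A minimal sketch, assuming a toy computation in place of real model training (`run_experiment` is a hypothetical name):

```python
import platform
import random

def run_experiment(seed):
    """Run a toy 'experiment' with its own seeded random generator."""
    rng = random.Random(seed)  # local generator, no hidden global state
    sample = [rng.random() for _ in range(5)]
    return {
        "seed": seed,
        "python": platform.python_version(),
        "result": sum(sample) / len(sample),
    }

first = run_experiment(seed=42)
second = run_experiment(seed=42)
```

Both runs return the same result, and the stored record says which seed and Python version produced it. Versioning code, data, and package versions goes beyond this sketch.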

Cloud (servers)

ML training:
- lots of CPU cores
- lots of RAM
- limited time

ML scoring: separate servers

ML (cloud) services (MLaaS)

“people that know what they’re doing just use open source [...] the same open source tools that the MLaaS services offer”
- Bradford Cross

Kaggle

- already pre-processed data
- less domain knowledge (or deliberately hidden)
- AUC 0.0001 increases "relevant"
- no business metric
- no actual deployment
- models too complex
- no online evaluation
- no monitoring
- data leakage

Tuning and Auto ML

Ben Recht, Kevin Jamieson: http://www.argmin.net/2016/06/20/hypertuning/
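Random search is a common baseline for hyperparameter tuning. A minimal sketch, assuming a toy `validation_loss` function standing in for "train a model and evaluate it on held-out data" (the parameter names, ranges, and loss are hypothetical):

```python
import math
import random

def validation_loss(params):
    """Toy stand-in for train-and-evaluate; lowest at learning_rate=0.1, depth=6."""
    lr_term = (math.log10(params["learning_rate"]) + 1) ** 2
    depth_term = (params["depth"] - 6) ** 2 / 100
    return lr_term + depth_term

def random_search(n_trials, seed=0):
    """Sample hyperparameters at random and keep the best (loss, params) pair."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-3, 0),  # log-uniform over [0.001, 1]
            "depth": rng.randint(2, 12),
        }
        loss = validation_loss(params)
        if best is None or loss < best[0]:
            best = (loss, params)
    return best

best_loss, best_params = random_search(n_trials=50)
```

Sampling the learning rate on a log scale spreads trials evenly across orders of magnitude. The fixed seed makes the search repeatable.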

Aggregation: 100M rows, 1M groups (time [s])
Join: 100M rows x 1M rows (time [s])

“Motherfucka!”

API and GUIs

How to Start?
