Make Machine Learning Boring Again: Best Practices for Using Machine Learning in Businesses
Szilard Pafka, PhD
Chief Scientist, Epoch
LA Data Science Meetup
Aug 2019
Slide 3
Slide 3 text
Disclaimer:
I am not representing my employer (Epoch) in this talk.
I can neither confirm nor deny whether Epoch uses any of the methods, tools,
results, etc. mentioned in this talk.
Slide 9
Slide 9 text
y = f(x1, x2, ..., xn)
Source: Hastie et al., ESL, 2nd ed.
Slide 10
Slide 10 text
y = f(x1, x2, ..., xn)
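A minimal sketch of what "learning f from data" can look like, assuming a 1-nearest-neighbor rule as the stand-in learner. The slides do not name a method here; the function name `nearest_neighbor_predict` and the toy data are hypothetical and chosen only for illustration.

```python
import math

def nearest_neighbor_predict(train_X, train_y, x):
    """Approximate y = f(x1, ..., xn) by the label of the closest training point."""
    best_i = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return train_y[best_i]

# Toy training set (illustrative only): two features per row, binary label.
train_X = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
train_y = [0, 0, 1]

# Predict the label of a new point from its nearest training example.
prediction = nearest_neighbor_predict(train_X, train_y, (4.0, 4.5))
```

Any supervised method fills the same role: it takes rows of (x1, ..., xn) with known y and returns a function that maps new rows to predictions.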
Slide 15
Slide 15 text
#1 Use the Right Algo
Slide 16
Slide 16 text
Source: Andrew Ng
Slide 34
Slide 34 text
*
Slide 35
Slide 35 text
#2 Use Open Source
Slide 41
Slide 41 text
in 2006
- cost was not a factor!
- data.frame
- [800] packages
Slide 49
Slide 49 text
#3 Simple > Complex
Slide 51
Slide 51 text
10x
Slide 60
Slide 60 text
#4 Incorporate Domain Knowledge
Do Feature Engineering (Still)
Explore Your Data
Clean Your Data
Slide 72
Slide 72 text
#5 Do Proper Validation
Avoid: Overfitting, Data Leakage
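One common source of data leakage is computing preprocessing statistics on the full dataset before splitting. A minimal sketch of the safer order, assuming a simple holdout split and a standardization step; the helper names (`train_test_split`, `fit_scaler`, `apply_scaler`) and the toy data are hypothetical and not taken from the talk.

```python
import random
import statistics

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle once, then hold out a fraction that training never sees."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

def fit_scaler(values):
    """Learn standardization parameters from the given values only."""
    return statistics.mean(values), statistics.pstdev(values)

def apply_scaler(values, mean, std):
    """Standardize values using previously learned parameters."""
    return [(v - mean) / std for v in values]

data = [float(i) for i in range(100)]  # toy feature column
train, test = train_test_split(data)

# Fit on the training split only; reusing these parameters on the
# held-out split keeps test information out of the training pipeline.
mean, std = fit_scaler(train)
train_scaled = apply_scaler(train, mean, std)
test_scaled = apply_scaler(test, mean, std)
```

The same ordering applies to any fitted step (imputation, encoding, feature selection): fit on training data, then apply unchanged to validation and test data.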
Slide 124
Slide 124 text
ML training:
lots of CPU cores
lots of RAM
limited time
ML scoring:
separate servers
Slide 125
Slide 125 text
ML (cloud) services (MLaaS)
Slide 127
Slide 127 text
“people that know what they’re doing just
use open source [...] the same open
source tools that the MLaaS services offer”
- Bradford Cross
Slide 128
Slide 128 text
Kaggle
Slide 130
Slide 130 text
already pre-processed data
less domain knowledge
(or deliberately hidden)
AUC increases of 0.0001 count as "relevant"
no business metric
no actual deployment
models too complex
no online evaluation
no monitoring
data leakage
Slide 131
Slide 131 text
Tuning and Auto ML
Slide 132
Slide 132 text
Ben Recht, Kevin Jamieson: http://www.argmin.net/2016/06/20/hypertuning/
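A minimal sketch of random search, one simple baseline for hyperparameter tuning. Everything here is an assumption for illustration: `validation_loss` is a hypothetical stand-in for "train a model and score it on a validation set", and the parameter names and ranges are invented.

```python
import random

def validation_loss(learning_rate, depth):
    """Hypothetical objective; a real one would train and validate a model."""
    return (learning_rate - 0.1) ** 2 + 0.01 * (depth - 6) ** 2

def random_search(n_trials, seed=0):
    """Sample configurations at random and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-3, 0),  # log-uniform in [0.001, 1]
            "depth": rng.randint(2, 12),                # inclusive integer range
        }
        loss = validation_loss(**params)
        if best is None or loss < best[0]:
            best = (loss, params)
    return best

best_loss, best_params = random_search(50)
```

The loss used for selection should come from validation data, with a separate test set kept aside for the final estimate.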
Slide 133
Slide 133 text
GPUs
Slide 134
Slide 134 text
[Figure: two benchmark panels, "Aggregation: 100M rows, 1M groups" and "Join: 100M rows x 1M rows"; each panel's axis is labeled time [s]]
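For readers unfamiliar with the two operations named on this slide, a minimal pure-Python sketch of a group-by aggregation and a key join, with wall-clock timing. The sizes are far smaller than the 100M rows on the slide, and the helper names (`aggregate_sum`, `hash_join`) are hypothetical; this is not the benchmark code behind the charts.

```python
import random
import time
from collections import defaultdict

def aggregate_sum(keys, values):
    """Group-by aggregation: sum of values per key."""
    totals = defaultdict(float)
    for k, v in zip(keys, values):
        totals[k] += v
    return dict(totals)

def hash_join(left_keys, right_table):
    """Inner join of left rows against a keyed right table via hash lookup."""
    return [(k, right_table[k]) for k in left_keys if k in right_table]

rng = random.Random(0)
n_rows, n_groups = 100_000, 1_000  # toy sizes, much smaller than the slide's
keys = [rng.randrange(n_groups) for _ in range(n_rows)]
values = [1.0] * n_rows

start = time.perf_counter()
totals = aggregate_sum(keys, values)
agg_seconds = time.perf_counter() - start

right = {k: f"group-{k}" for k in range(n_groups)}
start = time.perf_counter()
joined = hash_join(keys, right)
join_seconds = time.perf_counter() - start
```

Timings from a sketch like this depend heavily on hardware and data layout, so they say nothing about the tools compared in the talk.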
Slide 135
Slide 135 text
[Figure: two benchmark panels, "Aggregation: 100M rows, 1M groups" and "Join: 100M rows x 1M rows"; each panel's axis is labeled time [s]]
“Motherfucka!”