Data Preparation and the Importance of How Machines Learn
Rebecca Vickery
February 05, 2020
Transcript
Data Preparation and the Importance of How Machines Learn
Rebecca Vickery, Data Scientist, Holiday Extras
Machine learning
Source: Google images
Simple ML workflow: Get data >> baseline model >> model selection >> model tuning >> predict
Simple ML workflow: Get data >> Features/Inputs and what we want to predict
Simple ML workflow: Baseline model >> Accuracy score = 0.44 (perfect = 1.0)
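A baseline score of this kind can come from a deliberately trivial model. The following is a minimal sketch, assuming scikit-learn's `DummyClassifier` and a made-up toy data set; the transcript does not show which data or baseline model the talk actually used.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical toy data: 10 rows, 2 numeric features, binary target.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# A baseline that ignores the features and always predicts
# the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

# Accuracy is the share of correct predictions; 1.0 would be perfect.
baseline_accuracy = accuracy_score(y, baseline.predict(X))
print(baseline_accuracy)
```

Any later model is only worth keeping if it beats this number.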
Simple ML workflow: Model selection >> Best model = Random Forest
Simple ML workflow: Hyperparameter optimisation >> Best score = 1.0
Best Params = {'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 500}
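The parameter names on this slide match scikit-learn's `RandomForestClassifier`, so a search like this could be sketched with `GridSearchCV`. The synthetic data, the small grid, and the 3-fold cross-validation below are assumptions for illustration, not the settings used in the talk.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: the talk's data set is not in the transcript).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# A deliberately small grid so the sketch runs quickly.
param_grid = {
    "max_depth": [3, 5],
    "min_samples_split": [2, 10],
    "n_estimators": [20, 50],
}

# Try every combination and score each with 3-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

best_score = search.best_score_
best_params = search.best_params_
print(best_score, best_params)
```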
Source: Google images
What happens when we have this data set?
Source: thedailybeast.com
Actual ML workflow: Get data >> data preparation >> feature engineering >> baseline model >> model selection >> model tuning >> predict
Label encoding
Problem: "4 is bigger than 1, so there must be a relationship between these rows"
1 = neutered male, 2 = spayed female, 3 = intact male
Source: flaticon.com
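The problem can be reproduced in a few lines of plain Python. This is a minimal sketch: the category list is hypothetical, and the fourth category ("intact female") is an assumption, since the slide's legend only lists codes 1 to 3.

```python
# Hypothetical column values, modelled on the slide's legend.
values = ["neutered male", "spayed female", "intact male",
          "intact female", "spayed female"]

# Label encoding: give each distinct category an integer code,
# numbered in order of first appearance.
codes = {category: i for i, category in enumerate(dict.fromkeys(values), start=1)}
encoded = [codes[v] for v in values]
print(codes)
print(encoded)

# The codes are arbitrary, yet numerically they can be compared and ordered.
# A model that treats the column as a number sees this ordering as signal.
looks_ordered = codes["intact female"] > codes["neutered male"]
print(looks_ordered)
```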
Solution: One-hot encoding
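One-hot encoding replaces the single coded column with one 0/1 column per category, so no category is numerically "bigger" than another. Here is a minimal sketch, assuming pandas and a hypothetical column named `sex`:

```python
import pandas as pd

# Hypothetical data using the categories from the previous slide.
df = pd.DataFrame({"sex": ["neutered male", "spayed female", "intact male"]})

# One new 0/1 column per distinct category; the original column is dropped.
one_hot = pd.get_dummies(df, columns=["sex"], dtype=int)
print(one_hot)
```

Each row has exactly one 1, in the column for its own category.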
Ordinal data
Problem: Won't work for all variables. 366 different unique values = 366 new features
Solution: Feature engineering? Single colour vs. multi colour
Problem: We will lose a lot of information
Source: thetelegraph.com
Solution: Weight of evidence
For each colour (e.g. Tan): WOE = ln( (p_i / p) / (n_i / n) )
p_i = number of times Tan appears in the positive class (1)
p = total number of rows in the positive class (1)
n_i = number of times Tan appears in the negative class (0)
n = total number of rows in the negative class (0)
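The formula can be worked through by hand in plain Python. This is a minimal sketch with made-up colours and targets; the function name `weight_of_evidence` is hypothetical. It raises an error when a category is missing from either class, because the logarithm is undefined there (in practice a smoothing term is often added).

```python
import math
from collections import Counter

# Hypothetical data: one colour and one binary target per row.
colours = ["Tan", "Tan", "Tan", "Black", "Black", "White", "White", "White"]
target = [1, 1, 0, 1, 0, 0, 0, 1]

def weight_of_evidence(categories, labels):
    """Return {category: ln((p_i / p) / (n_i / n))} for a binary target."""
    pos_counts = Counter(c for c, t in zip(categories, labels) if t == 1)
    neg_counts = Counter(c for c, t in zip(categories, labels) if t == 0)
    p = sum(pos_counts.values())  # total rows in the positive class
    n = sum(neg_counts.values())  # total rows in the negative class
    result = {}
    for category in set(categories):
        p_i, n_i = pos_counts[category], neg_counts[category]
        if p_i == 0 or n_i == 0:
            raise ValueError(f"WOE undefined for {category!r}: empty class count")
        result[category] = math.log((p_i / p) / (n_i / n))
    return result

woe = weight_of_evidence(colours, target)
print(woe)
```

In this toy data Tan appears twice among the four positives and once among the four negatives, so its WOE is ln(2), a positive number. White is the reverse and gets -ln(2), and Black is balanced and gets 0.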
Solution: Weight of evidence. Output is a positive or negative number
Solution(s): WOE is one of many solutions for this
Problem(s)
Source: Photo by Louis Reed on Unsplash
Solution: Scikit-learn pipelines
Solution: category_encoders
pip install category_encoders
Pipeline example
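The pipeline shown on the slide is not preserved in the transcript. As a stand-in, the following is a minimal sketch of a scikit-learn `Pipeline` that bundles encoding and a model into one estimator. The data frame, its column names, and the choice of `OneHotEncoder` are assumptions. Encoders from category_encoders follow the scikit-learn transformer interface, so one of them could presumably take the encoder's place in the same slot.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one categorical feature, one numeric feature, binary target.
df = pd.DataFrame({
    "sex": ["neutered male", "spayed female", "intact male",
            "spayed female", "neutered male", "intact male"],
    "age_days": [365, 730, 90, 1460, 400, 60],
    "adopted": [1, 1, 0, 1, 1, 0],
})
X = df[["sex", "age_days"]]
y = df["adopted"]

# Encode the categorical column; pass the numeric column through unchanged.
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["sex"])],
    remainder="passthrough",
)

# Preprocessing and model are fitted and applied together as one object.
model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
])
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Because the encoder is fitted inside the pipeline, the same transformation is applied automatically at prediction time.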
Less time. But still some work to do
“There are only two Machine Learning approaches that win competitions: Handcrafted & Neural Networks.”
Anthony Goldbloom, CEO & Founder, Kaggle
Thanks for listening. Find me at...