Rebecca Vickery
February 05, 2020
Data Preparation and the Importance of How Machines Learn
Transcript
Data Preparation and the Importance of How Machines Learn
Rebecca Vickery, Data Scientist, Holiday Extras
Machine learning
Source: Google images
Simple ML workflow: Get data >> baseline model >> model selection >> model tuning >> predict
Simple ML workflow: Get data >> features/inputs, and what we want to predict
Simple ML workflow: Baseline model >> accuracy score = 0.44 (perfect = 1.0)
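A baseline step like this can be sketched as follows. This is a minimal illustration, not the talk's code: it assumes scikit-learn is available, uses the bundled iris data as a stand-in for the talk's data set, and picks `DummyClassifier` as one common choice of baseline.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Toy data standing in for the talk's data set (an assumption).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# A baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

# Accuracy is the fraction of correct predictions; 1.0 would be perfect.
baseline_accuracy = baseline.score(X_test, y_test)
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

Any later model should beat this number to justify its extra complexity.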
Simple ML workflow: Model selection >> best model = Random Forest
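One way to run a model selection step is to compare cross-validated accuracy across a few candidates. The sketch below assumes scikit-learn; the candidate list and the iris data are illustrative choices, not the models or data used in the talk.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the talk's data set (an assumption).
X, y = load_iris(return_X_y=True)

# Illustrative candidates; the talk does not list which models were compared.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

best_name = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("Best model:", best_name)
```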
Simple ML workflow: Hyperparameter optimisation
Best score = 1.0
Best params = {'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 500}
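A grid search over the four random forest hyperparameters named on the slide might look like the sketch below. It assumes scikit-learn's `GridSearchCV`; the candidate values and the iris data are assumptions chosen to keep the run short, and the talk may have used a different search method.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data standing in for the talk's data set (an assumption).
X, y = load_iris(return_X_y=True)

# The four hyperparameters from the slide; the candidate values are
# illustrative and deliberately small so the search finishes quickly.
param_grid = {
    "max_depth": [3, 5],
    "min_samples_leaf": [1, 2],
    "min_samples_split": [2, 10],
    "n_estimators": [25, 50],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print("Best score:", round(search.best_score_, 3))
print("Best params:", search.best_params_)
```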
Source: Google images
What happens when we have this data set?
Source: thedailybeast.com
Actual ML workflow: Get data >> data preparation >> feature engineering >> baseline model >> model selection >> model tuning >> predict
Label encoding
Problem: To the model, 4 is bigger than 1, so there must be a relationship between these rows.
1 = neutered male, 2 = spayed female, 3 = intact male
Source: flaticon.com
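The label encoding problem can be reproduced in a few lines. This sketch assumes scikit-learn's `LabelEncoder`; the column values are hypothetical, modelled on the categories named on the slide, and the resulting integers start at 0 in alphabetical order rather than matching the slide's numbering.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical column values, modelled on the categories on the slide.
sex_upon_outcome = [
    "Neutered Male",
    "Spayed Female",
    "Intact Male",
    "Intact Female",
    "Neutered Male",
]

encoder = LabelEncoder()
codes = encoder.fit_transform(sex_upon_outcome)

# LabelEncoder numbers the sorted class names from 0. The integers only
# identify a category, yet a model may read 3 > 0 as a real ordering.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```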
Solution: One-hot encoding
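One-hot encoding replaces the single integer column with one binary column per category, so no ordering is implied. A minimal sketch, assuming pandas and a hypothetical `sex` column:

```python
import pandas as pd

# Hypothetical column values, modelled on the categories on the earlier slide.
df = pd.DataFrame(
    {"sex": ["Neutered Male", "Spayed Female", "Intact Male", "Neutered Male"]}
)

# One new 0/1 column per distinct category.
one_hot = pd.get_dummies(df["sex"], prefix="sex", dtype=int)
print(one_hot)
```

Each row has exactly one 1, in the column for its category.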
Ordinal data
Problem: Won’t work for all variables. 366 different unique values = 366 new features.
Solution: Feature engineering? Single Colour vs. Multi Colour
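Collapsing the colour column into two groups could be sketched like this. It assumes pandas, and it assumes that multi-colour values are written with a "/" separator (for example "Black/White"); the talk does not show the raw format, so both the values and the separator are hypothetical.

```python
import pandas as pd

# Hypothetical colour values; the "/" separator is an assumption.
colours = pd.Series(["Tan", "Black/White", "Brown Tabby", "Blue/Tan", "White"])

# Flag rows whose colour string lists more than one colour.
is_multi = colours.str.contains("/", regex=False)
colour_type = is_multi.map({True: "Multi Colour", False: "Single Colour"})
print(colour_type.tolist())
```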
Problem: We will lose a lot of information.
Source: thetelegraph.com
Solution: Weight of evidence
For each colour (e.g. Tan):
$$\mathrm{WOE} = \ln\left(\frac{p_i / p}{n_i / n}\right)$$
$p_i$ = number of times Tan appears in the positive class (1)
$p$ = total number of rows in the positive class (1)
$n_i$ = number of times Tan appears in the negative class (0)
$n$ = total number of rows in the negative class (0)
Solution: Weight of evidence
Output is a positive or negative number.
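The weight of evidence calculation can be worked through in plain Python. The function name `weight_of_evidence` and the toy data are hypothetical; returning NaN when a category is missing from one class is an assumption of this sketch, since the formula is undefined in that case.

```python
import math
from collections import Counter


def weight_of_evidence(categories, targets):
    """Return {category: WOE} using WOE = ln((p_i / p) / (n_i / n))."""
    pos_counts = Counter(c for c, t in zip(categories, targets) if t == 1)
    neg_counts = Counter(c for c, t in zip(categories, targets) if t == 0)
    p = sum(pos_counts.values())  # total rows in the positive class
    n = sum(neg_counts.values())  # total rows in the negative class
    woe = {}
    for category in set(categories):
        p_i, n_i = pos_counts[category], neg_counts[category]
        if p_i == 0 or n_i == 0:
            # Undefined under the plain formula (ln of 0 or division by 0).
            woe[category] = float("nan")
        else:
            woe[category] = math.log((p_i / p) / (n_i / n))
    return woe


# Hypothetical data: Tan appears mostly in class 1, Black mostly in class 0.
colours = ["Tan", "Tan", "Tan", "Tan", "Black", "Black", "Black", "Black"]
outcome = [1, 1, 1, 0, 1, 0, 0, 0]

woe = weight_of_evidence(colours, outcome)
print(woe)
```

Here Tan gets ln(3), a positive value, because it is over-represented in the positive class, and Black gets ln(1/3), a negative value.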
Solution(s): WOE is one of many solutions for this.
Problem(s)
Source: Photo by Louis Reed on Unsplash
Solution: Scikit-learn pipelines
Solution: category_encoders
pip install category_encoders
Pipeline example
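The slide's own code is not in the transcript, so the following is only a sketch of what a preprocessing-plus-model pipeline can look like. It assumes scikit-learn and pandas, uses scikit-learn's built-in `OneHotEncoder` rather than category_encoders, and the column names and data are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with one categorical and one numeric feature.
data = pd.DataFrame({
    "sex": ["Neutered Male", "Spayed Female", "Intact Male",
            "Spayed Female", "Neutered Male", "Intact Male"],
    "age_days": [365.0, 730.0, None, 120.0, 1500.0, 60.0],
    "adopted": [1, 1, 0, 1, 1, 0],
})
X = data[["sex", "age_days"]]
y = data["adopted"]

# Encode the categorical column and fill the missing numeric value.
preprocess = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
    ("numeric", SimpleImputer(strategy="median"), ["age_days"]),
])

# Preparation and model are fitted together as one estimator.
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
])

model.fit(X, y)
predictions = model.predict(X)
print(predictions.tolist())
```

Bundling the steps means the same preparation is applied at training and prediction time, and the whole object can be passed to cross-validation or a grid search.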
Less time, but still some work to do
“There are only two Machine Learning approaches that win competitions: Handcrafted & Neural Networks.”
Anthony Goldbloom, CEO & Founder, Kaggle
Thanks for listening. Find me at...