Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Data Preparation and the Importance of How Mach...
Search
Sponsored
·
Ship Features Fearlessly
Turn features on and off without deploys. Used by thousands of Ruby developers.
→
Rebecca Vickery
February 05, 2020
Technology
0
160
Data Preparation and the Importance of How Machines Learn
Rebecca Vickery
February 05, 2020
Tweet
Share
More Decks by Rebecca Vickery
See All by Rebecca Vickery
Pair Programming with AI
rebeccavickery
1
97
Machine Learning for Everyone
rebeccavickery
0
26
Scaling Machine Learning at Holiday Extras (Big Data LDN 2019))
rebeccavickery
0
130
Scaling_Machine_Learning_at_Holiday_Extras_-_MUC.pdf
rebeccavickery
0
1.2k
Gender Bias, Why we Need More Women in Tech
rebeccavickery
0
1.2k
The Fastest Way to Learn Data Science
rebeccavickery
0
55
Employing Google Cloud Machine Learning Engine to Develop Models in Production
rebeccavickery
0
1.3k
Other Decks in Technology
See All in Technology
OCI技術資料 : コンピュート・サービス 概要
ocise
4
54k
プロジェクトマネジメントをチームに宿す -ゼロからはじめるチームプロジェクトマネジメントは活動1年未満のチームの教科書です- / 20260304 Shigeki Morizane
shift_evolve
PRO
1
250
OpenClawで回す組織運営
jacopen
3
690
今のWordPress の制作手法ってなにがあんねん?(改) / What’s the Deal with WordPress Development These Days?
tbshiki
0
270
DevOpsエージェントで実現する!! AWS Well-Architected(W-A) を実現するシステム設計 / 20260307 Masaki Okuda
shift_evolve
PRO
3
610
Claude Codeの進化と各機能の活かし方
oikon48
22
12k
マルチアカウント環境でSecurity Hubの運用!導入の苦労とポイント / JAWS DAYS 2026
genda
0
510
AI は "道具" から "同僚" へ 自律型 AI エージェントの最前線と、AI 時代の人材の在り方 / Colleague in the AI Era - Autonomous AI Seminar 2026 at Niigata
gawa
0
140
[JAWSDAYS2026]Who is responsible for IAM
mizukibbb
0
460
Security Diaries of an Open Source IAM
ahus1
0
210
スクリプトの先へ!AIエージェントと組み合わせる モバイルE2Eテスト
error96num
0
160
When an innocent-looking ListOffsets Call Took Down Our Kafka Cluster
lycorptech_jp
PRO
0
120
Featured
See All Featured
Gemini Prompt Engineering: Practical Techniques for Tangible AI Outcomes
mfonobong
2
310
Redefining SEO in the New Era of Traffic Generation
szymonslowik
1
240
The Impact of AI in SEO - AI Overviews June 2024 Edition
aleyda
5
760
Mind Mapping
helmedeiros
PRO
1
120
How to Think Like a Performance Engineer
csswizardry
28
2.5k
Evolving SEO for Evolving Search Engines
ryanjones
0
150
The Limits of Empathy - UXLibs8
cassininazir
1
250
Building Applications with DynamoDB
mza
96
7k
Design in an AI World
tapps
0
170
Visual Storytelling: How to be a Superhuman Communicator
reverentgeek
2
470
Mozcon NYC 2025: Stop Losing SEO Traffic
samtorres
0
170
10 Git Anti Patterns You Should be Aware of
lemiorhan
PRO
659
61k
Transcript
None
Data Preparation and the Importance of how Machines Learn Rebecca
Vickery, Data Scientist, Holiday Extras
Machine learning
Source: Google images
Simple ML workflow Get data >> baseline model >> model
selection >> model tuning >> predict
Simple ML workflow Get data >> Features/Inputs What we want
to predict
Simple ML workflow Baseline model >> Accuracy score Perfect =
1.0 0.44
Simple ML workflow Model selection >> Best model = Random
Forest
Simple ML workflow Hyperparameter optimisation >> Best score = 1.0
Best Params = {'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 500}
Source: Google images
What happens when we have this data set?
What happens when we have this data set?
None
Source: thedailybeast.com
Actual ML workflow Get data >> data preparation >> feature
engineering >> baseline model >> model selection >> model tuning >> predict
Label encoding
Problem Source: flaticon.com 4 is bigger than 1 so there
must be a relationship between these rows Source: flaticon.com 1 = neutered male 2 = spayed female 3 = intact male
Solution: One hot encoding
Ordinal data
Problem: Won’t work for all variables 366 different unique values
= 366 new features
Solution: Feature engineering? Single Colour Multi Colour
Problem: We will lose a lot of information Source: thetelegraph.com
Solution: Weight of evidence For each colour (e.g. Tan): WOE
= ln ( ( pi /p) / ( ni / n) ) pi = number of times Tan appears in positive class (1) p = total number of positive classes (1) ni = number of times Tan appears in negative class (0) n = total number of negative classes (0)
Solution: Weight of evidence Output is a positive or negative
number
Solution(s) WOE is one of many solutions for this
Problem(s) Source: Photo by Louis Reed on Unsplash
Solution: Scikit-learn pipelines
Solution: category_encoders pip install category_encoders
Pipeline example
Less time But still some work to do
“There are only two Machine Learning approaches that win competitions:
Handcrafted & Neural Networks.” Anthony Goldbloom, CEO & Founder, Kaggle
Thanks for listening Find me at...