Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Data Preparation and the Importance of How Mach...
Search
Rebecca Vickery
February 05, 2020
Technology
0
150
Data Preparation and the Importance of How Machines Learn
Rebecca Vickery
February 05, 2020
Tweet
Share
More Decks by Rebecca Vickery
See All by Rebecca Vickery
Pair Programming with AI
rebeccavickery
1
89
Machine Learning for Everyone
rebeccavickery
0
24
Scaling Machine Learning at Holiday Extras (Big Data LDN 2019))
rebeccavickery
0
130
Scaling_Machine_Learning_at_Holiday_Extras_-_MUC.pdf
rebeccavickery
0
1.2k
Gender Bias, Why we Need More Women in Tech
rebeccavickery
0
1.2k
The Fastest Way to Learn Data Science
rebeccavickery
0
54
Employing Google Cloud Machine Learning Engine to Develop Models in Production
rebeccavickery
0
1.3k
Other Decks in Technology
See All in Technology
AWSでAgentic AIを開発するための前提知識の整理
nasuvitz
2
190
Data Hubグループ 紹介資料
sansan33
PRO
0
2.2k
Introduction to Sansan for Engineers / エンジニア向け会社紹介
sansan33
PRO
5
43k
いまからでも遅くない!SSL/TLS証明書超入門(It's not too late to start! SSL/TLS Certificates: The Absolute Beginner's Guide)
norimuraz
0
270
なぜAWSを活かしきれないのか?技術と組織への処方箋
nrinetcom
PRO
5
960
Dylib Hijacking on macOS: Dead or Alive?
patrickwardle
0
230
BI ツールはもういらない?Amazon RedShift & MCP Server で試みる新しいデータ分析アプローチ
cdataj
0
180
Bill One 開発エンジニア 紹介資料
sansan33
PRO
4
14k
大規模サーバーレスAPIの堅牢性・信頼性設計 〜AWSのベストプラクティスから始まる現実的制約との向き合い方〜
maimyyym
10
4.9k
エンタメとAIのための3Dパラレルワールド構築(GPU UNITE 2025 特別講演)
pfn
PRO
0
480
このままAIが発展するだけでAGI達成可能な理由
frievea
0
120
All About Sansan – for New Global Engineers
sansan33
PRO
1
1.2k
Featured
See All Featured
Code Reviewing Like a Champion
maltzj
526
40k
Fantastic passwords and where to find them - at NoRuKo
philnash
52
3.4k
Statistics for Hackers
jakevdp
799
220k
A Tale of Four Properties
chriscoyier
161
23k
Connecting the Dots Between Site Speed, User Experience & Your Business [WebExpo 2025]
tammyeverts
10
600
Speed Design
sergeychernyshev
32
1.2k
GraphQLの誤解/rethinking-graphql
sonatard
73
11k
Distributed Sagas: A Protocol for Coordinating Microservices
caitiem20
333
22k
Performance Is Good for Brains [We Love Speed 2024]
tammyeverts
12
1.2k
個人開発の失敗を避けるイケてる考え方 / tips for indie hackers
panda_program
115
20k
Practical Tips for Bootstrapping Information Extraction Pipelines
honnibal
PRO
23
1.5k
The Cost Of JavaScript in 2023
addyosmani
55
9k
Transcript
None
Data Preparation and the Importance of how Machines Learn Rebecca
Vickery, Data Scientist, Holiday Extras
Machine learning
Source: Google images
Simple ML workflow Get data >> baseline model >> model
selection >> model tuning >> predict
Simple ML workflow Get data >> Features/Inputs What we want
to predict
Simple ML workflow Baseline model >> Accuracy score Perfect =
1.0 0.44
Simple ML workflow Model selection >> Best model = Random
Forest
Simple ML workflow Hyperparameter optimisation >> Best score = 1.0
Best Params = {'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 500}
Source: Google images
What happens when we have this data set?
What happens when we have this data set?
None
Source: thedailybeast.com
Actual ML workflow Get data >> data preparation >> feature
engineering >> baseline model >> model selection >> model tuning >> predict
Label encoding
Problem Source: flaticon.com 4 is bigger than 1 so there
must be a relationship between these rows Source: flaticon.com 1 = neutered male 2 = spayed female 3 = intact male
Solution: One hot encoding
Ordinal data
Problem: Won’t work for all variables 366 different unique values
= 366 new features
Solution: Feature engineering? Single Colour Multi Colour
Problem: We will lose a lot of information Source: thetelegraph.com
Solution: Weight of evidence For each colour (e.g. Tan): WOE
= ln ( ( pi /p) / ( ni / n) ) pi = number of times Tan appears in positive class (1) p = total number of positive classes (1) ni = number of times Tan appears in negative class (0) n = total number of negative classes (0)
Solution: Weight of evidence Output is a positive or negative
number
Solution(s) WOE is one of many solutions for this
Problem(s) Source: Photo by Louis Reed on Unsplash
Solution: Scikit-learn pipelines
Solution: category_encoders pip install category_encoders
Pipeline example
Less time But still some work to do
“There are only two Machine Learning approaches that win competitions:
Handcrafted & Neural Networks.” Anthony Goldbloom, CEO & Founder, Kaggle
Thanks for listening Find me at...