Data Preparation and the Importance of How Machines Learn
Rebecca Vickery
February 05, 2020
Transcript
Data Preparation and the Importance of How Machines Learn
Rebecca Vickery, Data Scientist, Holiday Extras
Machine learning
Source: Google images
Simple ML workflow: get data >> baseline model >> model selection >> model tuning >> predict
Simple ML workflow, get data: the features (inputs) and what we want to predict
Simple ML workflow, baseline model: accuracy score = 0.44 (perfect = 1.0)
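The slide reports a baseline score but does not show how it was produced. As a minimal sketch, assuming scikit-learn's `DummyClassifier` as the baseline and a synthetic data set from `make_classification` (neither is taken from the talk), a baseline accuracy could be computed like this:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: the talk's data set is not reproduced here).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Baseline: always predict the most frequent class seen during training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

Any later model should beat this number; if it does not, the features are probably not carrying useful signal yet.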
Simple ML workflow, model selection: best model = Random Forest
Simple ML workflow, hyperparameter optimisation:
Best score = 1.0
Best params = {'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 500}
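The slide lists the tuned hyperparameters but not the search method. The sketch below assumes an exhaustive grid search with scikit-learn's `GridSearchCV` over the same four parameter names; the data set and the candidate values are illustrative and deliberately small so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the data used in the talk).
X, y = make_classification(n_samples=200, n_features=8, n_informative=4, random_state=0)

# Small illustrative grid over the four hyperparameters named on the slide.
param_grid = {
    "max_depth": [3, 5],
    "min_samples_leaf": [1, 2],
    "min_samples_split": [2, 10],
    "n_estimators": [20, 50],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation per candidate
    scoring="accuracy",
)
search.fit(X, y)

print("Best score =", search.best_score_)
print("Best params =", search.best_params_)
```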
Source: Google images
What happens when we have this data set?
Source: thedailybeast.com
Actual ML workflow: get data >> data preparation >> feature engineering >> baseline model >> model selection >> model tuning >> predict
Label encoding
Problem: 4 is bigger than 1, so the model assumes there must be a relationship between these rows.
Labels: 1 = neutered male, 2 = spayed female, 3 = intact male
Image source: flaticon.com
Solution: One hot encoding
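To make the difference concrete, here is a minimal sketch using pandas. The category names come from the slide; the rows and the column name `sex` are made up, and the integer codes pandas assigns will not necessarily match the numbering shown on the slide.

```python
import pandas as pd

# Hypothetical rows; only the category names are taken from the slide.
df = pd.DataFrame(
    {"sex": ["neutered male", "spayed female", "intact male", "spayed female"]}
)

# Label encoding: each category becomes an arbitrary integer code,
# which a model may wrongly read as an ordering.
df["sex_label"] = df["sex"].astype("category").cat.codes

# One hot encoding: one 0/1 column per category, with no implied order.
one_hot = pd.get_dummies(df["sex"], prefix="sex", dtype=int)

print(df)
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, so no category is "bigger" than another.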
Ordinal data
Problem: Won’t work for all variables. 366 different unique values = 366 new features.
Solution: Feature engineering? Group values into "Single Colour" and "Multi Colour".
Problem: We will lose a lot of information.
Image source: thetelegraph.com
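The single-colour versus multi-colour feature can be sketched as a small function. This assumes multi-colour values are written with a `/` separator (for example `Black/White`); the slide does not show the raw format, so both the separator and the sample values are hypothetical.

```python
def colour_type(colour: str) -> str:
    """Collapse a raw colour string into 'single' or 'multi'.

    Assumption: multi-colour values use '/' as a separator.
    """
    parts = [part.strip() for part in colour.split("/") if part.strip()]
    return "multi" if len(parts) > 1 else "single"


# Hypothetical raw values.
colours = ["Tan", "Black/White", "Brown/Tan/White", "Blue"]
colour_types = [colour_type(c) for c in colours]
print(colour_types)
```

The many distinct colours collapse into two values, which is exactly where the information loss comes from: `Tan` and `Blue` become indistinguishable.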
Solution: Weight of evidence
For each colour (e.g. Tan):

$$\mathrm{WOE} = \ln\left(\frac{p_i / p}{n_i / n}\right)$$

p_i = number of times Tan appears in the positive class (1)
p = total number of positive-class rows (1)
n_i = number of times Tan appears in the negative class (0)
n = total number of negative-class rows (0)
Solution: Weight of evidence. The output is a positive or negative number.
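The weight of evidence formula can be worked through in a few lines of plain Python. This is a sketch under stated assumptions: the function name `weight_of_evidence` and the sample data are invented, and categories that never appear in one of the classes are skipped because the logarithm is undefined there (the slides do not say how that case is handled).

```python
import math
from collections import Counter


def weight_of_evidence(categories, targets):
    """Return {category: WOE} with WOE = ln((p_i / p) / (n_i / n))."""
    pos_counts = Counter(c for c, t in zip(categories, targets) if t == 1)
    neg_counts = Counter(c for c, t in zip(categories, targets) if t == 0)
    p = sum(pos_counts.values())  # total rows in the positive class
    n = sum(neg_counts.values())  # total rows in the negative class

    woe = {}
    for category in set(categories):
        p_i = pos_counts[category]
        n_i = neg_counts[category]
        if p_i == 0 or n_i == 0:
            # ln is undefined here; skipping is an assumption of this sketch.
            continue
        woe[category] = math.log((p_i / p) / (n_i / n))
    return woe


# Made-up data: Tan is mostly positive, Black is mostly negative.
colours = ["Tan", "Tan", "Tan", "Tan", "Black", "Black", "Black", "Black"]
outcomes = [1, 1, 1, 0, 1, 0, 0, 0]

woe = weight_of_evidence(colours, outcomes)
print(woe)
```

In this data Tan accounts for 3 of the 4 positives and 1 of the 4 negatives, so its WOE is ln(3), a positive number. Black is the mirror image, so its WOE is ln(1/3), a negative number.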
Solution(s): WOE is one of many solutions for this.
Problem(s)
Image source: photo by Louis Reed on Unsplash
Solution: Scikit-learn pipelines
Solution: category_encoders
pip install category_encoders
Pipeline example
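The pipeline shown on the slide is not preserved in the transcript, so the following is only an illustrative sketch of the idea: chain the encoding step and the model in a scikit-learn `Pipeline` so that preparation and training happen together. The column names, rows, and the choice of `OneHotEncoder` with a random forest are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows; the column names are hypothetical.
df = pd.DataFrame(
    {
        "sex": ["neutered male", "spayed female", "intact male", "spayed female"] * 5,
        "age_days": [30, 400, 90, 1200] * 5,
        "outcome": [1, 1, 0, 1] * 5,
    }
)
X, y = df[["sex", "age_days"]], df["outcome"]

# Encode the categorical column and pass the numeric column through unchanged.
preprocess = ColumnTransformer(
    transformers=[("one_hot", OneHotEncoder(handle_unknown="ignore"), ["sex"])],
    remainder="passthrough",
)

model = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
    ]
)

model.fit(X, y)
predictions = model.predict(X)
print(predictions[:4])
```

Because the encoder lives inside the pipeline, the same transformation is applied automatically at prediction time, and the whole object can be passed to cross-validation or a hyperparameter search as a single estimator.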
Less time, but still some work to do
“There are only two Machine Learning approaches that win competitions: Handcrafted & Neural Networks.”
Anthony Goldbloom, CEO & Founder, Kaggle
Thanks for listening. Find me at...