Rebecca Vickery
February 05, 2020
Data Preparation and the Importance of How Machines Learn
Transcript
Data Preparation and the Importance of How Machines Learn
Rebecca Vickery, Data Scientist, Holiday Extras
Machine learning
Source: Google images
Simple ML workflow: Get data >> baseline model >> model selection >> model tuning >> predict
Simple ML workflow: Get data >> features/inputs, and the target (what we want to predict)
Simple ML workflow: Baseline model >> Accuracy score = 0.44 (perfect = 1.0)
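A baseline accuracy like the one on this slide can be sketched as follows. The slides do not say which baseline was used, so this assumes the simplest one: always predict the most common training label. The function name and the outcome labels are invented for illustration.

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    correct = sum(1 for label in y_test if label == majority_label)
    return majority_label, correct / len(y_test)


# Hypothetical outcome labels, not taken from the talk.
y_train = ["adopted", "transferred", "adopted", "returned", "adopted", "transferred"]
y_test = ["adopted", "returned", "transferred", "adopted"]

label, accuracy = majority_baseline_accuracy(y_train, y_test)
print(label, accuracy)
```

Any model chosen later in the workflow should beat this number, otherwise it has learned nothing beyond the class frequencies.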
Simple ML workflow: Model selection >> Best model = Random Forest
Simple ML workflow: Hyperparameter optimisation >> Best score = 1.0
Best Params = {'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 500}
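One way to obtain a "best score" and "best params" pair like this is a grid search with scikit-learn. The sketch below is an assumption about the approach, not the talk's actual code: the data is synthetic and the grid is deliberately small so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the talk's data set is not reproduced here.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# A small, illustrative grid over some of the parameters named on the slide.
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [3, 5],
    "min_samples_split": [2, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,  # 3-fold cross-validation for every parameter combination
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

`best_score_` is the mean cross-validated accuracy of the winning combination, so it is measured on held-out folds rather than on the training rows.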
Source: Google images
What happens when we have this data set?
Source: thedailybeast.com
Actual ML workflow: Get data >> data preparation >> feature engineering >> baseline model >> model selection >> model tuning >> predict
Label encoding
Problem: the model reasons that "4 is bigger than 1, so there must be a relationship between these rows". Example encoding: 1 = neutered male, 2 = spayed female, 3 = intact male. Source: flaticon.com
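The label encoding problem can be seen in a few lines of plain Python. This is a minimal sketch: `label_encode` is a hypothetical helper, and the fourth category ("intact female") is an assumption added so that a code of 4 appears, as on the slide.

```python
def label_encode(values):
    """Map each distinct category to an integer, numbered from 1 in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return [mapping[value] for value in values], mapping


# "intact female" is an assumed fourth category, added for illustration only.
sex = ["neutered male", "spayed female", "intact male", "intact female", "neutered male"]

codes, mapping = label_encode(sex)
print(mapping)
print(codes)

# The integers are arbitrary identifiers, yet numerically 4 > 1, so a model
# that treats the column as a quantity sees an ordering that does not exist.
```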
Solution: One hot encoding
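A minimal sketch of one hot encoding in plain Python follows. `one_hot_encode` is a hypothetical helper written for illustration; in practice a library encoder would normally be used.

```python
def one_hot_encode(values):
    """Return the sorted category names and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories] for value in values]
    return categories, rows


sex = ["neutered male", "spayed female", "intact male"]

categories, rows = one_hot_encode(sex)
print(categories)
for row in rows:
    print(row)
```

Each category gets its own column, so no category is numerically "bigger" than another. The cost is one new column per distinct value.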
Ordinal data
Problem: Won’t work for all variables. 366 different unique values = 366 new features
Solution: Feature engineering? Single colour vs. multi colour
Problem: We will lose a lot of information. Source: thetelegraph.com
Solution: Weight of evidence. For each colour (e.g. Tan):
$$\mathrm{WOE} = \ln\left(\frac{p_i / p}{n_i / n}\right)$$
$p_i$ = number of times Tan appears in the positive class (1)
$p$ = total number of rows in the positive class (1)
$n_i$ = number of times Tan appears in the negative class (0)
$n$ = total number of rows in the negative class (0)
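Solution: Weight of evidence. Output is a positive or negative number: positive when the colour is relatively more common in the positive class, negative when it is relatively more common in the negative class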
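The weight of evidence calculation can be sketched directly from its definition, WOE = ln((pi / p) / (ni / n)). The helper name and the toy data are assumptions made for illustration. The sketch raises an error when a category is missing from either class, because the plain formula is undefined there; library implementations typically add smoothing instead.

```python
import math
from collections import Counter


def weight_of_evidence(categories, targets):
    """Compute WOE = ln((p_i / p) / (n_i / n)) for each category of a binary target."""
    pos_total = sum(1 for t in targets if t == 1)  # p
    neg_total = sum(1 for t in targets if t == 0)  # n
    pos_counts = Counter(c for c, t in zip(categories, targets) if t == 1)
    neg_counts = Counter(c for c, t in zip(categories, targets) if t == 0)

    woe = {}
    for category in set(categories):
        p_i, n_i = pos_counts[category], neg_counts[category]
        if p_i == 0 or n_i == 0:
            # ln(0) and division by zero are undefined in the unsmoothed formula.
            raise ValueError(f"{category!r} is missing from one class")
        woe[category] = math.log((p_i / pos_total) / (n_i / neg_total))
    return woe


# Hypothetical toy data: colour of each animal and a binary outcome.
colours = ["Tan", "Tan", "Tan", "Tan", "Black", "Black", "Black", "Black"]
targets = [1, 1, 1, 0, 1, 0, 0, 0]

woe = weight_of_evidence(colours, targets)
print(woe)
```

With this data Tan has three of the four positives and one of the four negatives, so its value is ln(3), while Black gets ln(1/3). The whole colour column becomes a single numeric feature, however many distinct colours there are.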
Solution(s): WOE is one of many ways to encode a categorical variable with many unique values
Problem(s)
Source: Photo by Louis Reed on Unsplash
Solution: Scikit-learn pipelines
Solution: category_encoders. Install with: pip install category_encoders
Pipeline example
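The pipeline shown on the slide is not in the transcript, so the following is only a sketch of what a scikit-learn pipeline for this kind of data might look like. The column names, the values and the choice of `OneHotEncoder` are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one categorical column, one numeric column, binary target.
X = pd.DataFrame({
    "sex": ["neutered male", "spayed female", "intact male",
            "spayed female", "neutered male", "intact male"],
    "age_years": [2.0, 1.0, 0.5, 4.0, 3.0, 1.5],
})
y = [1, 1, 0, 1, 1, 0]

# Encode the categorical column and pass the numeric column through unchanged.
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["sex"])],
    remainder="passthrough",
)

# Chaining preparation and model means both are fitted together and the
# same transformations are reapplied automatically at prediction time.
model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
])

model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Because the encoder lives inside the pipeline, it is fitted only on the data passed to `fit`, which keeps information from validation folds out of the preparation step when the pipeline is cross-validated.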
Less time, but still some work to do
“There are only two Machine Learning approaches that win competitions: Handcrafted & Neural Networks.”
Anthony Goldbloom, CEO & Founder, Kaggle
Thanks for listening. Find me at...