Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Data Preparation and the Importance of How Mach...
Search
Sponsored
·
SiteGround - Reliable hosting with speed, security, and support you can count on.
→
Rebecca Vickery
February 05, 2020
Technology
160
0
Share
Data Preparation and the Importance of How Machines Learn
Rebecca Vickery
February 05, 2020
More Decks by Rebecca Vickery
See All by Rebecca Vickery
Pair Programming with AI
rebeccavickery
1
100
Machine Learning for Everyone
rebeccavickery
0
27
Scaling Machine Learning at Holiday Extras (Big Data LDN 2019))
rebeccavickery
0
130
Scaling_Machine_Learning_at_Holiday_Extras_-_MUC.pdf
rebeccavickery
0
1.2k
Gender Bias, Why we Need More Women in Tech
rebeccavickery
0
1.2k
The Fastest Way to Learn Data Science
rebeccavickery
0
57
Employing Google Cloud Machine Learning Engine to Develop Models in Production
rebeccavickery
0
1.3k
Other Decks in Technology
See All in Technology
FASTでAIエージェントを作りまくろう!
yukiogawa
4
190
Oracle AI Database@AWS:サービス概要のご紹介
oracle4engineer
PRO
3
2.1k
Network Firewall Proxyで 自前プロキシを消し去ることができるのか
gusandayo
0
170
Databricks Appsで実現する社内向けAIアプリ開発の効率化
r_miura
0
230
「活動」は激変する。「ベース」は変わらない ~ 4つの軸で捉える_AI時代ソフトウェア開発マネジメント
sentokun
0
140
Datadog で実現するセキュリティ対策 ~オブザーバビリティとセキュリティを 一緒にやると何がいいのか~
a2ush
0
190
Cortex Codeでデータの仕事を全部Agenticにやりきろう!
gappy50
0
240
Move Fast and Break Things: 10 in 20
ramimac
0
120
最大のアウトプット術は問題を作ること
ryoaccount
0
280
契約書からの情報抽出を行うLLMのスループットを、バッチ処理を用いて最大40%改善した話
sansantech
PRO
3
350
遊びで始めたNew Relic MCP、気づいたらChatOpsなオブザーバビリティボットができてました/From New Relic MCP to a ChatOps Observability Bot
aeonpeople
1
150
Oracle AI Database@Google Cloud:サービス概要のご紹介
oracle4engineer
PRO
5
1.3k
Featured
See All Featured
Let's Do A Bunch of Simple Stuff to Make Websites Faster
chriscoyier
508
140k
Become a Pro
speakerdeck
PRO
31
5.9k
Java REST API Framework Comparison - PWX 2021
mraible
34
9.2k
職位にかかわらず全員がリーダーシップを発揮するチーム作り / Building a team where everyone can demonstrate leadership regardless of position
madoxten
62
53k
Have SEOs Ruined the Internet? - User Awareness of SEO in 2025
akashhashmi
0
300
The innovator’s Mindset - Leading Through an Era of Exponential Change - McGill University 2025
jdejongh
PRO
1
140
The Illustrated Guide to Node.js - THAT Conference 2024
reverentgeek
1
320
Large-scale JavaScript Application Architecture
addyosmani
515
110k
WENDY [Excerpt]
tessaabrams
9
37k
Hiding What from Whom? A Critical Review of the History of Programming languages for Music
tomoyanonymous
2
640
My Coaching Mixtape
mlcsv
0
92
The Anti-SEO Checklist Checklist. Pubcon Cyber Week
ryanjones
0
110
Transcript
None
Data Preparation and the Importance of how Machines Learn Rebecca
Vickery, Data Scientist, Holiday Extras
Machine learning
Source: Google images
Simple ML workflow Get data >> baseline model >> model
selection >> model tuning >> predict
Simple ML workflow Get data >> Features/Inputs What we want
to predict
Simple ML workflow Baseline model >> Accuracy score Perfect =
1.0 0.44
Simple ML workflow Model selection >> Best model = Random
Forest
Simple ML workflow Hyperparameter optimisation >> Best score = 1.0
Best Params = {'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 500}
Source: Google images
What happens when we have this data set?
What happens when we have this data set?
None
Source: thedailybeast.com
Actual ML workflow Get data >> data preparation >> feature
engineering >> baseline model >> model selection >> model tuning >> predict
Label encoding
Problem Source: flaticon.com 4 is bigger than 1 so there
must be a relationship between these rows Source: flaticon.com 1 = neutered male 2 = spayed female 3 = intact male
Solution: One hot encoding
Ordinal data
Problem: Won’t work for all variables 366 different unique values
= 366 new features
Solution: Feature engineering? Single Colour Multi Colour
Problem: We will lose a lot of information Source: thetelegraph.com
Solution: Weight of evidence For each colour (e.g. Tan): WOE
= ln ( ( pi /p) / ( ni / n) ) pi = number of times Tan appears in positive class (1) p = total number of positive classes (1) ni = number of times Tan appears in negative class (0) n = total number of negative classes (0)
Solution: Weight of evidence Output is a positive or negative
number
Solution(s) WOE is one of many solutions for this
Problem(s) Source: Photo by Louis Reed on Unsplash
Solution: Scikit-learn pipelines
Solution: category_encoders pip install category_encoders
Pipeline example
Less time But still some work to do
“There are only two Machine Learning approaches that win competitions:
Handcrafted & Neural Networks.” Anthony Goldbloom, CEO & Founder, Kaggle
Thanks for listening Find me at...