Rebecca Vickery
February 05, 2020
Data Preparation and the Importance of How Machines Learn
Transcript
Data Preparation and the Importance of How Machines Learn
Rebecca Vickery, Data Scientist, Holiday Extras
Machine learning
Source: Google images
Simple ML workflow: Get data >> baseline model >> model selection >> model tuning >> predict
Simple ML workflow: Get data >> features/inputs and the target (what we want to predict)
Simple ML workflow: Baseline model >> accuracy score = 0.44 (perfect = 1.0)
Simple ML workflow: Model selection >> best model = Random Forest
Simple ML workflow: Hyperparameter optimisation >>
Best score = 1.0
Best params = {'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 500}
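The baseline and tuning steps of the simple workflow could look roughly like the sketch below. It assumes scikit-learn and uses synthetic data from `make_classification` as a stand-in, since the talk's data set is not reproduced in the transcript. The parameter grid is a small illustrative one, not the grid used in the talk.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: not the data set from the talk).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Baseline model: always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)

# Model tuning: exhaustive search over a small, illustrative grid.
param_grid = {
    "max_depth": [3, 5],
    "min_samples_split": [2, 10],
    "n_estimators": [50],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("Baseline accuracy:", baseline_accuracy)
print("Best score:", search.best_score_)
print("Best params:", search.best_params_)
```

On clean numeric data like this the workflow runs end to end with no preparation at all, which is the situation the simple workflow slides describe.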
Source: Google images
What happens when we have this data set?
Source: thedailybeast.com
Actual ML workflow: Get data >> data preparation >> feature engineering >> baseline model >> model selection >> model tuning >> predict
Label encoding
Problem: 4 is bigger than 1, so the model will assume there must be a relationship between these rows
1 = neutered male, 2 = spayed female, 3 = intact male
Source: flaticon.com
Solution: One-hot encoding
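A minimal sketch of the difference between the two encodings, assuming pandas. The column name `sex` is hypothetical; the three categories and their integer codes are the ones listed on the slide.

```python
import pandas as pd

# Hypothetical column name; categories and codes are taken from the slide.
df = pd.DataFrame(
    {"sex": ["neutered male", "spayed female", "intact male", "spayed female"]}
)

# Label encoding: each category becomes an integer, which implies an order
# (3 > 1) that does not exist in the data.
mapping = {"neutered male": 1, "spayed female": 2, "intact male": 3}
df["sex_label"] = df["sex"].map(mapping)

# One-hot encoding: one 0/1 column per category, so no order is implied.
one_hot = pd.get_dummies(df["sex"], prefix="sex", dtype=int)

print(df)
print(one_hot)
```

Each row of `one_hot` has exactly one column set to 1, so the model sees category membership rather than a magnitude.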
Ordinal data
Problem: Won’t work for all variables
366 different unique values = 366 new features
Solution: Feature engineering? Single colour vs. multi colour
Problem: We will lose a lot of information
Source: thetelegraph.com
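The single colour versus multi colour idea can be sketched as follows, assuming pandas. The colour values are made up, and the "/" separator for multi-colour entries is an assumption about how the data is stored, not something stated in the talk.

```python
import pandas as pd

# Hypothetical colour values; the "/" separator is an assumption.
colours = pd.Series(["Tan", "Black/White", "Brown", "Tan/White"], name="colour")

# Collapse the many unique colours into a single binary feature:
# 0 = single colour, 1 = multi colour.
is_multi_colour = colours.str.contains("/", regex=False).astype(int)

print(colours.nunique(), "unique colours ->", is_multi_colour.nunique(), "unique values")
```

The feature is compact, but "Tan" and "Brown" become indistinguishable, which is the loss of information the slide points out.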
Solution: Weight of evidence
For each colour (e.g. Tan):
$$ \mathrm{WOE} = \ln\left( \frac{p_i / p}{n_i / n} \right) $$
p_i = number of times Tan appears in the positive class (1)
p = total number of rows in the positive class (1)
n_i = number of times Tan appears in the negative class (0)
n = total number of rows in the negative class (0)
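The formula above can be sketched in plain Python. The function name `weight_of_evidence` and the toy data are hypothetical. Categories that never appear in one of the two classes are skipped here, because the logarithm is undefined for them; production encoders typically apply some form of smoothing instead.

```python
import math
from collections import Counter


def weight_of_evidence(categories, targets):
    """Return {category: WOE}, where WOE = ln((p_i / p) / (n_i / n))."""
    pos = Counter(c for c, t in zip(categories, targets) if t == 1)
    neg = Counter(c for c, t in zip(categories, targets) if t == 0)
    p = sum(pos.values())  # total rows in the positive class
    n = sum(neg.values())  # total rows in the negative class
    woe = {}
    for category in set(categories):
        p_i, n_i = pos[category], neg[category]
        if p_i == 0 or n_i == 0:
            # Undefined without smoothing; this sketch simply skips the category.
            continue
        woe[category] = math.log((p_i / p) / (n_i / n))
    return woe


# Toy data: Tan appears 3 times in class 1 and once in class 0; Black the reverse.
colours = ["Tan", "Tan", "Tan", "Black", "Black", "Black", "Black", "Tan"]
outcome = [1, 1, 0, 0, 0, 1, 0, 1]
woe = weight_of_evidence(colours, outcome)
print(woe)
```

With this toy data Tan gets ln(3) and Black gets ln(1/3), so a single numeric column replaces the colour category.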
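Solution: Weight of evidence
Output is a positive or negative number (positive when the colour is relatively more common in the positive class, negative when it is more common in the negative class)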
Solution(s): WOE is one of many solutions for this
Problem(s)
Source: Photo by Louis Reed on Unsplash
Solution: Scikit-learn pipelines
Solution: category_encoders (pip install category_encoders)
Pipeline example
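The code from the pipeline slide is not captured in the transcript, so the following is only a sketch of what such a pipeline might look like. It uses scikit-learn alone, with `OneHotEncoder` as the encoding step. An encoder from category_encoders could presumably be dropped into the same slot, as those encoders are designed to follow the scikit-learn transformer interface. The data set and column names are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny hypothetical data set; column names and values are illustrative only.
X = pd.DataFrame(
    {
        "sex": ["neutered male", "spayed female", "intact male", "spayed female"] * 5,
        "colour": ["Tan", "Black", "Tan", "Brown"] * 5,
        "age_years": [1.0, 3.0, 0.5, 7.0] * 5,
    }
)
y = [1, 1, 0, 0] * 5

# Data preparation: encode the categorical columns, pass numeric ones through.
preprocess = ColumnTransformer(
    transformers=[
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["sex", "colour"]),
    ],
    remainder="passthrough",
)

# Preparation and model are chained, so one fit/predict call runs both.
model = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
    ]
)

model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Because the encoder is fitted inside the pipeline, the same preparation is applied automatically at prediction time.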
Less time, but still some work to do
“There are only two Machine Learning approaches that win competitions: Handcrafted & Neural Networks.”
Anthony Goldbloom, CEO & Founder, Kaggle
Thanks for listening
Find me at...