Data Preparation and the Importance of How Machines Learn
Rebecca Vickery
February 05, 2020
Transcript
Data Preparation and the Importance of How Machines Learn
Rebecca Vickery, Data Scientist, Holiday Extras
Machine learning
Source: Google images
Simple ML workflow: Get data >> baseline model >> model selection >> model tuning >> predict
Simple ML workflow: Get data >> Features/Inputs and what we want to predict
Simple ML workflow: Baseline model >> Accuracy score = 0.44 (perfect = 1.0)
Simple ML workflow: Model selection >> Best model = Random Forest
Simple ML workflow: Hyperparameter optimisation >> Best score = 1.0
Best params = {'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 500}
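The simple workflow above can be sketched with scikit-learn. This is a minimal illustration, not the code from the talk: the synthetic data, the `DummyClassifier` baseline, the two candidate models and the small parameter grid are all assumptions chosen so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Get data: synthetic stand-in (assumption; the talk's data set is not available here).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Baseline model: always predicts the most frequent class (assumed baseline).
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_score = baseline.score(X_test, y_test)

# Model selection: compare candidates by mean cross-validated accuracy.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=3).mean()
    for name, model in candidates.items()
}
best_name = max(cv_scores, key=cv_scores.get)

# Model tuning: grid search over a deliberately small grid for a random forest.
param_grid = {"max_depth": [3, 5], "min_samples_split": [2, 10]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0), param_grid, cv=3
)
search.fit(X_train, y_train)

# Predict with the tuned model.
predictions = search.predict(X_test)
print(baseline_score, best_name, search.best_params_)
```

Every step here takes purely numeric input, which is why this version of the workflow only works on data that is already clean.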
Source: Google images
What happens when we have this data set?
Source: thedailybeast.com
Actual ML workflow: Get data >> data preparation >> feature engineering >> baseline model >> model selection >> model tuning >> predict
Label encoding
Problem: 4 is bigger than 1, so there must be a relationship between these rows
1 = neutered male, 2 = spayed female, 3 = intact male (Source: flaticon.com)
Solution: One hot encoding
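To make the contrast concrete, here is a small pure-Python sketch of both encodings. The helper names `label_encode` and `one_hot_encode` are hypothetical, and the four example rows are made up; only the three category names come from the slide.

```python
def label_encode(values):
    """Map each category to an integer in order of first appearance (1, 2, 3, ...)."""
    categories = list(dict.fromkeys(values))
    mapping = {category: index + 1 for index, category in enumerate(categories)}
    return [mapping[value] for value in values], mapping


def one_hot_encode(values):
    """Return one 0/1 column per category, so no ordering is implied."""
    categories = list(dict.fromkeys(values))
    rows = [[1 if value == category else 0 for category in categories] for value in values]
    return rows, categories


# Hypothetical example rows using the categories from the slide.
sex = ["neutered male", "spayed female", "intact male", "neutered male"]

labels, mapping = label_encode(sex)
rows, columns = one_hot_encode(sex)

print(labels)   # integers that a model may read as ordered
print(columns)  # one column per category
print(rows)     # exactly one 1 per row
```

With label encoding a model can treat 3 as "more than" 1. With one hot encoding each category gets its own column, so that false ordering disappears.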
Ordinal data
Problem: Won’t work for all variables
366 different unique values = 366 new features
Solution: Feature engineering? Single Colour / Multi Colour
Problem: We will lose a lot of information (Source: thetelegraph.com)
Solution: Weight of evidence
For each colour (e.g. Tan):
$$\mathrm{WOE} = \ln\left(\frac{p_i / p}{n_i / n}\right)$$
$p_i$ = number of times Tan appears in the positive class (1)
$p$ = total number of rows in the positive class (1)
$n_i$ = number of times Tan appears in the negative class (0)
$n$ = total number of rows in the negative class (0)
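The formula above can be computed directly. The following is a minimal pure-Python sketch: the function name `weight_of_evidence` and the six example rows are hypothetical, and skipping categories that never appear in one of the classes is an assumption made here (the formula is undefined for them, and smoothing is a common alternative).

```python
import math
from collections import Counter


def weight_of_evidence(categories, targets):
    """Return {category: ln((p_i / p) / (n_i / n))} for a binary target (1 or 0)."""
    positives = Counter(c for c, t in zip(categories, targets) if t == 1)
    negatives = Counter(c for c, t in zip(categories, targets) if t == 0)
    p = sum(positives.values())  # total rows in the positive class
    n = sum(negatives.values())  # total rows in the negative class
    if p == 0 or n == 0:
        raise ValueError("both classes must be present")

    woe = {}
    for category in set(categories):
        p_i, n_i = positives[category], negatives[category]
        if p_i == 0 or n_i == 0:
            # Undefined (log of zero or division by zero); skipped by assumption.
            continue
        woe[category] = math.log((p_i / p) / (n_i / n))
    return woe


# Hypothetical data: Tan is mostly positive, Black is mostly negative.
colours = ["Tan", "Tan", "Tan", "Black", "Black", "Black"]
outcome = [1, 1, 0, 1, 0, 0]

woe = weight_of_evidence(colours, outcome)
print(woe)
```

In this example Tan makes up two thirds of the positive class and one third of the negative class, so its WOE is ln(2), a positive number. Black is the mirror image and gets ln(0.5), a negative number.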
Solution: Weight of evidence
Output is a positive or negative number
Solution(s): WOE is one of many solutions for this
Problem(s) (Source: Photo by Louis Reed on Unsplash)
Solution: Scikit-learn pipelines
Solution: category_encoders
pip install category_encoders
Pipeline example
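The slide's own pipeline code is not in the transcript, so what follows is only a sketch of the idea: chain the encoding step and the model in one scikit-learn `Pipeline`, so the same preparation is applied at fit and predict time. It uses scikit-learn's `OneHotEncoder` rather than category_encoders to stay within one library, and the rows and labels are invented for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training rows: [colour, sex], with a made-up binary target.
X_train = [
    ["Tan", "neutered male"],
    ["Black", "spayed female"],
    ["Tan", "intact male"],
    ["White", "neutered male"],
    ["Black", "intact male"],
    ["White", "spayed female"],
]
y_train = [1, 0, 1, 1, 0, 0]

pipeline = Pipeline(
    steps=[
        # Categories unseen during fit are encoded as all zeros instead of raising.
        ("encode", OneHotEncoder(handle_unknown="ignore")),
        ("model", RandomForestClassifier(n_estimators=25, random_state=0)),
    ]
)
pipeline.fit(X_train, y_train)

# New rows go through the same encoding before reaching the model.
X_new = [["Tan", "spayed female"], ["Purple", "intact male"]]
predictions = pipeline.predict(X_new)
print(predictions)
```

Because the encoder is fitted inside the pipeline, it can also be cross-validated and tuned as a single estimator, which avoids leaking information from validation folds into the encoding.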
Less time, but still some work to do
“There are only two Machine Learning approaches that win competitions: Handcrafted & Neural Networks.”
Anthony Goldbloom, CEO & Founder, Kaggle
Thanks for listening. Find me at...