Data Preparation and the Importance of How Machines Learn
Rebecca Vickery
February 05, 2020
Transcript
Data Preparation and the Importance of How Machines Learn
Rebecca Vickery, Data Scientist, Holiday Extras
Machine learning
Source: Google images
Simple ML workflow: Get data >> baseline model >> model selection >> model tuning >> predict
Simple ML workflow: Get data >> features/inputs and what we want to predict
Simple ML workflow: Baseline model >> Accuracy score = 0.44 (perfect = 1.0)
Simple ML workflow: Model selection >> Best model = Random Forest
Simple ML workflow: Hyperparameter optimisation >> Best score = 1.0
Best params = {'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 500}
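The simple workflow above can be sketched roughly as follows with scikit-learn. This is only an illustration under assumptions: the data is a synthetic stand-in from make_classification (not the data set shown on the slides), the model selection step is skipped and a random forest is assumed, and the grid is a small hypothetical one that reuses some of the parameter names from the slide.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Get data: a synthetic stand-in, NOT the data set from the talk.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline model: always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_score = baseline.score(X_test, y_test)

# Model tuning: a small, hypothetical grid over a random forest.
param_grid = {
    "max_depth": [3, 5],
    "min_samples_split": [2, 10],
    "n_estimators": [50],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Predict with the best estimator found by the search.
predictions = search.predict(X_test)
tuned_score = search.score(X_test, y_test)

print("baseline accuracy:", baseline_score)
print("tuned accuracy:", tuned_score)
print("best params:", search.best_params_)
```

Every input here is already numeric, which is why no preparation step is needed in this version of the workflow.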
Source: Google images
What happens when we have this data set?
Source: thedailybeast.com
Actual ML workflow: Get data >> data preparation >> feature engineering >> baseline model >> model selection >> model tuning >> predict
Label encoding
Problem: To a model, 4 is bigger than 1, so there must be a relationship between these rows
1 = neutered male, 2 = spayed female, 3 = intact male
Source: flaticon.com
Solution: One hot encoding
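A library-free sketch of the two encodings, to make the difference concrete. The rows are hypothetical, and "intact female" as the fourth category is an assumption (the slide only lists codes 1 to 3). In practice scikit-learn or category_encoders would do this work.

```python
# Hypothetical rows; "intact female" as a fourth category is an assumption.
rows = [
    "neutered male",
    "spayed female",
    "intact male",
    "intact female",
    "neutered male",
]

# Label encoding: each distinct category gets an integer,
# numbered from 1 in order of first appearance.
label_map = {}
for value in rows:
    if value not in label_map:
        label_map[value] = len(label_map) + 1
label_encoded = [label_map[value] for value in rows]

# One hot encoding: one 0/1 column per category, so no category
# is numerically "bigger" than another.
columns = list(label_map)
one_hot = [[1 if value == column else 0 for column in columns] for value in rows]

print(label_encoded)
for row in one_hot:
    print(row)
```

With label encoding the model sees the codes 1 to 4 as magnitudes. With one hot encoding each row has a single 1 in the column for its category.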
Ordinal data
Problem: Won’t work for all variables. 366 different unique values = 366 new features
Solution: Feature engineering? Single Colour / Multi Colour
Problem: We will lose a lot of information
Source: thetelegraph.com
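A minimal sketch of the single/multi colour idea and of the information it throws away. The colour strings are hypothetical, and the assumption that multi-colour values contain a "/" separator is mine, not something stated on the slides.

```python
def colour_type(colour):
    """Collapse a colour string into 'Single Colour' or 'Multi Colour'.

    Assumes (hypothetically) that multi-colour values are written with
    a '/' separator, e.g. 'Black/White'.
    """
    return "Multi Colour" if "/" in colour else "Single Colour"


# Hypothetical colour values.
colours = ["Tan", "Black/White", "Brown", "Tan/White"]
engineered = [colour_type(colour) for colour in colours]

distinct_before = len(set(colours))
distinct_after = len(set(engineered))

print(engineered)
print(distinct_before, "distinct values collapse to", distinct_after)
```

After the transformation "Tan" and "Brown" are indistinguishable, which is the loss of information the slide refers to.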
Solution: Weight of evidence
For each colour (e.g. Tan): $\mathrm{WOE} = \ln\left(\frac{p_i / p}{n_i / n}\right)$
p_i = number of times Tan appears in the positive class (1)
p = total number of rows in the positive class (1)
n_i = number of times Tan appears in the negative class (0)
n = total number of rows in the negative class (0)
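The formula can be worked through by hand on a tiny example. This is a sketch with made-up rows, not the encoder used in the talk. Categories that never appear in one of the classes are skipped here because the logarithm is undefined for them; library encoders typically handle that case with some form of smoothing.

```python
import math
from collections import Counter


def weight_of_evidence(values, targets):
    """Return {category: WOE} using WOE = ln((p_i / p) / (n_i / n))."""
    p = sum(1 for t in targets if t == 1)  # rows in the positive class
    n = sum(1 for t in targets if t == 0)  # rows in the negative class
    positives = Counter(v for v, t in zip(values, targets) if t == 1)
    negatives = Counter(v for v, t in zip(values, targets) if t == 0)

    woe = {}
    for category in set(values):
        p_i = positives[category]
        n_i = negatives[category]
        if p_i == 0 or n_i == 0:
            continue  # WOE is undefined without smoothing; skipped in this sketch
        woe[category] = math.log((p_i / p) / (n_i / n))
    return woe


# Made-up rows: Tan is mostly positive, Black is mostly negative.
colours = ["Tan", "Tan", "Tan", "Black", "Black", "Black", "Black", "Tan"]
outcomes = [1, 1, 1, 1, 0, 0, 0, 0]

woe = weight_of_evidence(colours, outcomes)
print(woe)
```

Here Tan appears in 3 of the 4 positive rows and 1 of the 4 negative rows, so its WOE is ln(0.75 / 0.25) = ln 3, a positive number. Black is the mirror image and gets ln(1/3), a negative number.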
Solution: Weight of evidence. Output is a positive or negative number
Solution(s): WOE is one of many solutions for this
Problem(s)
Source: Photo by Louis Reed on Unsplash
Solution: Scikit-learn pipelines
Solution: category_encoders
pip install category_encoders
Pipeline example
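The pipeline shown on the slide is not part of the transcript, so the following is only a sketch of what a scikit-learn pipeline of this kind might look like. The column names, the rows and the choice of OneHotEncoder are assumptions for illustration; an encoder from category_encoders could presumably take the place of the one hot step.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one categorical column, one numeric column.
X = pd.DataFrame(
    {
        "sex": [
            "neutered male",
            "spayed female",
            "intact male",
            "intact female",
            "neutered male",
            "spayed female",
        ],
        "age_days": [365, 730, 90, 120, 1460, 400],
    }
)
y = [1, 1, 0, 0, 1, 1]

# Data preparation: one hot encode the categorical column,
# pass the numeric column through unchanged.
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["sex"])],
    remainder="passthrough",
)

# Preparation and model are chained, so fit and predict apply
# the same transformations every time.
model = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
    ]
)

model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Because the encoder lives inside the pipeline, the same object can be handed to cross-validation or a grid search without preparing the data separately for each split.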
Less time, but still some work to do
“There are only two Machine Learning approaches that win competitions: Handcrafted & Neural Networks.”
Anthony Goldbloom, CEO & Founder, Kaggle
Thanks for listening Find me at...