Machine Learning Pipeline
[Pipeline diagram: data from several data sources is combined into one dataset; after feature engineering and feature selection, the data feeds two models, gradient boosted trees and neural networks, whose predictions are averaged to produce the output: a probability or a continuous value.]
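A minimal sketch of the averaging step shown in the diagram, assuming a scikit-learn workflow; the synthetic dataset, model settings and variable names are illustrative assumptions, not part of the original slides.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy dataset standing in for the combined data sources
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbt = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
nn = MLPClassifier(max_iter=500, random_state=0).fit(X_train, y_train)

# Final prediction: average of the two models' predicted probabilities
avg_proba = (gbt.predict_proba(X_test)[:, 1] + nn.predict_proba(X_test)[:, 1]) / 2
```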
Slide 7
Data Pre-processing Journey
• Common issues found in variables
• Feature / Variable engineering: solutions to the data issues
• Feature selection: do we need to select features?
• Feature / Variable selection methods
• Overview and knowledge sources
Slide 9
Problems in Variables
• Missing data: missing values within a variable
• Labels: strings in categorical variables
• Distribution: normal vs skewed
• Outliers: unusual or unexpected values
Slide 10
Missing Data
• Missing values for certain observations
• Affects all machine learning models
• Scikit-learn: most estimators do not accept missing values, so they must be handled beforehand
• Data can be missing at random or systematically
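A quick way to quantify missing data per variable, sketched with pandas; the toy DataFrame and column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25.0, np.nan, 40.0, np.nan],
                   "city": ["York", None, "Leeds", "Bath"]})

print(df.isnull().mean())   # fraction of missing values per variable
```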
Slide 11
Labels in Categorical Variables
• Categories: the values are strings
• Cardinality: high number of labels
• Rare labels: infrequent categories
• High cardinality and rare labels can cause overfitting in tree-based algorithms
• Scikit-learn: categories must be encoded as numbers before training
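A short pandas sketch for inspecting cardinality and label frequency; the toy data and the column name "color" are assumptions.

```python
import pandas as pd

df = pd.DataFrame({"color": ["blue", "blue", "red", "green", "red", "blue"]})

print(df["color"].nunique())                     # cardinality: number of distinct labels
print(df["color"].value_counts(normalize=True))  # label frequencies, to spot rare labels
```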
Slide 12
Distributions
• Linear model assumption: variables follow a Gaussian distribution
• Other models: no assumption about the distribution
• A better spread of values may benefit performance
• Gaussian vs skewed
Slide 13
Outliers
• Unusual values affect linear models
• In AdaBoost, outliers receive tremendous weights, which leads to bad generalisation
Slide 14
Feature Magnitude - Scale
Machine learning models sensitive to feature scale:
• Linear and Logistic Regression
• Neural Networks
• Support Vector Machines
• KNN
• K-means clustering
• Linear Discriminant Analysis (LDA)
• Principal Component Analysis (PCA)
Tree-based ML models insensitive to feature scale:
• Classification and Regression Trees
• Random Forests
• Gradient Boosted Trees
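A minimal scaling sketch with scikit-learn for the scale-sensitive models listed above; the toy array is illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

X_standard = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature
X_minmax = MinMaxScaler().fit_transform(X)       # rescaled to the [0, 1] range
```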
Slide 15
Data Pre-processing Journey
• Common issues found in variables
• Feature / Variable engineering: solutions to the data issues
• Feature selection: do we need to select features?
• Feature / Variable selection methods
• Overview and knowledge sources
Slide 16
Missing Data Imputation
• Complete case analysis: may remove a big chunk of the dataset
• Mean / median imputation: alters the distribution
• Random sample imputation: introduces an element of randomness
• Arbitrary number imputation: alters the distribution
• End of distribution imputation: alters the distribution
• Binary NA indicator: still need to fill in the NA
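A minimal sketch of some of these imputation methods with pandas and scikit-learn; the toy series and the arbitrary value 999 are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

s = pd.Series([2.0, np.nan, 4.0, 6.0, np.nan])

median_imputed = s.fillna(s.median())              # mean / median imputation
arbitrary_imputed = s.fillna(999)                  # arbitrary number imputation
end_of_dist = s.fillna(s.mean() + 3 * s.std())     # end of distribution imputation
na_indicator = s.isnull().astype(int)              # binary NA indicator (kept as an extra feature)

# The same median imputation with scikit-learn, for use inside a pipeline
imputed = SimpleImputer(strategy="median").fit_transform(s.to_frame())
```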
Slide 17
More on Missing Data Imputation
• AI-derived NA imputation: use neighbouring variables to predict the missing value
• Examples: KNN, regression
• More complex than the simpler methods
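A minimal KNN imputation sketch with scikit-learn's KNNImputer; the toy array and n_neighbors=2 are assumptions.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [7.0, 8.0]])

# Each missing entry is filled with the average of its nearest neighbours,
# where neighbours are found using the other (complete) variables
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
```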
Slide 18
Label Encoding
• One hot encoding
• Count / frequency encoding
• Mean encoding
• Ordinal encoding
• Weight of evidence

Example: the same five observations of a "Color" variable encoded three ways

Target   Count / frequency   Mean encoding   Ordinal encoding
0        2                   0.5             2
1        2                   0.5             2
1        2                   1.0             1
0        1                   0.0             3
1        2                   1.0             1
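A pandas sketch that reproduces the three numeric columns in the table above (count / frequency, mean and ordinal encoding); the category strings "blue", "red" and "green" are assumptions, since the original labels are not shown on the slide.

```python
import pandas as pd

df = pd.DataFrame({"color": ["blue", "blue", "red", "green", "red"],
                   "target": [0, 1, 1, 0, 1]})

# Count / frequency encoding: replace each label with how often it appears
df["color_count"] = df["color"].map(df["color"].value_counts())

# Mean (target) encoding: replace each label with the mean of the target for that label
df["color_mean"] = df["color"].map(df.groupby("color")["target"].mean())

# Ordinal encoding: replace each label with an (arbitrary) integer
ordinal_map = {label: i for i, label in enumerate(df["color"].unique(), start=1)}
df["color_ordinal"] = df["color"].map(ordinal_map)
```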
Slide 19
Label Encoding: caveats of each method
• One hot encoding: expands the feature space
• Count / frequency encoding
• Mean encoding: prone to overfitting
• Ordinal encoding: no monotonic relationship with the target
• Weight of evidence: must account for zero values, as it uses a logarithm
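A one hot encoding sketch with pandas, showing how the feature space expands to one binary column per category; the toy data is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"color": ["blue", "blue", "red", "green", "red"]})

# One binary column per category: the feature space grows with cardinality
one_hot = pd.get_dummies(df["color"], prefix="color")   # color_blue, color_green, color_red
```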
Slide 20
Rare Labels
• Group infrequent labels into a single "Rare" category
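A minimal sketch of grouping infrequent labels into a "Rare" category with pandas; the 5% frequency threshold and the toy data are assumptions.

```python
import pandas as pd

s = pd.Series(["blue"] * 50 + ["red"] * 45 + ["mauve"] * 3 + ["teal"] * 2)

freq = s.value_counts(normalize=True)
rare_labels = freq[freq < 0.05].index             # labels below the frequency threshold
s_grouped = s.where(~s.isin(rare_labels), "Rare") # replace rare labels with "Rare"
```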
Slide 21
Distribution: Gaussian Transformation
Skewed → Gaussian
Variable transformations:
• Logarithmic: ln(x)
• Exponential: x^n (any power)
• Reciprocal: 1 / x
• Box-Cox: (x^λ − 1) / λ
• λ varies from −5 to 5
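A sketch of these transformations on a right-skewed toy variable, assuming NumPy and scikit-learn; note that the logarithm, reciprocal and Box-Cox require strictly positive values.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Right-skewed toy variable; the small shift guarantees strictly positive values
x = np.random.exponential(scale=2.0, size=1000) + 0.01

x_log = np.log(x)        # logarithmic
x_power = x ** 0.5       # exponential transformation (any power)
x_recip = 1 / x          # reciprocal

# Box-Cox: scikit-learn estimates the lambda that makes the result most Gaussian
x_boxcox = PowerTransformer(method="box-cox").fit_transform(x.reshape(-1, 1))
```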
Slide 22
Distribution: Discretisation
Skewed → improved value spread
• Equal width bins
  • Bin width = (max − min) / number of bins
  • Generally does not improve the spread
• Equal frequency bins
  • Bins determined by quantiles
  • Equal number of observations per bin
  • Generally improves the spread
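A discretisation sketch with scikit-learn's KBinsDiscretizer, covering both equal width and equal frequency bins; the choice of 5 bins and the toy data are assumptions.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

x = np.random.exponential(scale=2.0, size=1000).reshape(-1, 1)

equal_width = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
equal_freq = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")

x_width = equal_width.fit_transform(x)   # bin width = (max - min) / 5
x_freq = equal_freq.fit_transform(x)     # roughly equal numbers of observations per bin
```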
Slide 23
Outliers
• Trimming: remove the observations from the dataset
• Top / bottom coding: cap the top and bottom values
• Discretisation: equal width / equal frequency bins
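A sketch of trimming and top / bottom coding using quantile-based boundaries; the 5th and 95th percentile cut-offs are assumptions.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.normal(size=1000))
lower, upper = s.quantile(0.05), s.quantile(0.95)

trimmed = s[(s >= lower) & (s <= upper)]   # trimming: drop the outlying observations
capped = s.clip(lower, upper)              # top / bottom coding: cap the extreme values
```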
Slide 24
Data Pre-processing Journey
• Common issues found in variables
• Feature / Variable engineering: solutions to the data issues
• Feature selection: do we need to select features?
• Feature / Variable selection methods
• Overview and knowledge sources
Slide 25
Why Do We Select Features?
• Simple models are easier to interpret
• Shorter training times
• Enhanced generalisation by reducing overfitting
• Easier for software developers to implement when putting the model into production
• Reduced risk of data errors during model use
• Data redundancy
Slide 26
Variable Redundancy
• Constant variables: only 1 value per variable
• Quasi-constant variables: > 99% of observations show the same value
• Duplication: the same variable appears multiple times in the dataset
• Correlation: correlated variables provide the same information
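A sketch of detecting these kinds of redundancy with pandas and scikit-learn; the toy DataFrame and the 0.01 variance threshold are assumptions.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({"a": [1, 1, 1, 1, 1],      # constant
                   "b": [1, 2, 3, 4, 5],
                   "c": [1, 2, 3, 4, 5],      # duplicate of "b"
                   "d": [2, 4, 6, 9, 11]})    # highly correlated with "b"

# Constant / quasi-constant features: variance below a small threshold
vt = VarianceThreshold(threshold=0.01)
vt.fit(df)
kept_columns = df.columns[vt.get_support()]

# Duplicated features: identical columns
duplicated_columns = df.columns[df.T.duplicated()]

# Correlated features: inspect the absolute correlation matrix for pairs above a cut-off
correlation_matrix = df.corr().abs()
```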
Slide 27
Data Pre-processing Journey
• Common issues found in variables
• Feature / Variable engineering: solutions to the data issues
• Feature selection: do we need to select features?
• Feature / Variable selection methods
• Overview and knowledge sources
Filter methods
Independent of the ML algorithm; based only on variable characteristics.
Pros:
• Fast computation
• Model agnostic
• Quick feature removal
Cons:
• Poor model performance
• Does not capture feature interaction
• Does not capture redundancy
Slide 30
Filter methods
• Rank features following a criterion
• Select the highest ranking features
Ranking criteria:
• Chi-square | Fisher score
• Univariate parametric tests (ANOVA)
• Mutual information
• Variance
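A filter-method sketch with scikit-learn's SelectKBest: rank features with a univariate test and keep the top k; the ANOVA F-test, k=10 and the synthetic data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=30, n_informative=5, random_state=0)

# Rank features with the ANOVA F-test and keep the 10 highest ranking ones
X_selected = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)

# Mutual information is a non-parametric alternative criterion
X_selected_mi = SelectKBest(score_func=mutual_info_classif, k=10).fit_transform(X, y)
```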
Slide 31
Wrapper methods
Consider the ML algorithm; evaluate subsets of features.
Pros:
• Best feature subset for a given algorithm
• Best performance
• Considers feature interaction
Cons:
• Computationally expensive
• Often impracticable
• Not model agnostic
Slide 32
Wrapper methods
Procedure: search for a subset of features → build an ML model with the selected subset → evaluate model performance → repeat.
• Forward feature selection: adds 1 feature at a time
• Backward feature elimination: removes 1 feature at a time
• Exhaustive feature search: searches across all possible feature combinations
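A wrapper-method sketch using scikit-learn's SequentialFeatureSelector for forward selection; the logistic regression estimator, n_features_to_select=5 and the synthetic data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=15, random_state=0)

# Forward selection: start with no features, add the one that improves
# cross-validated performance the most, and repeat until 5 features are chosen
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=5, direction="forward")
X_selected = sfs.fit_transform(X, y)
```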
Slide 33
Embedded methods
Perform feature selection during the training of the ML algorithm.
Slide 34
Embedded methods
Procedure: train the ML model → derive feature importance → remove non-important features.
Examples:
• LASSO
• Tree-derived feature importance
• Regression coefficients
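An embedded-method sketch: LASSO drives some coefficients to zero during training, and SelectFromModel keeps only the features with non-zero coefficients; alpha=0.01 and the synthetic data are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=300, n_features=20, n_informative=5, noise=10.0,
                       random_state=0)

# LASSO shrinks the coefficients of non-informative features to exactly zero while
# it trains; SelectFromModel keeps only the features with non-zero coefficients
selector = SelectFromModel(Lasso(alpha=0.01))
X_selected = selector.fit_transform(X, y)
```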
Slide 35
Data Pre-processing Journey
• Common issues found in variables
• Feature / Variable engineering: solutions to the data issues
• Feature selection: do we need to select features?
• Feature / Variable selection methods
• Overview and knowledge sources
Slide 36
Knowledge Resources
• Feature Engineering + Selection: course on Udemy.com, includes code; a summary of learnings from the winners
• Feature Engine: Python package for feature engineering (work in progress)