Slide 1

Engineering and selecting features for machine learning
Soledad Galli, PhD
DSF meetup with Busuu, London, 16th October 2018

Slide 2

Machine Learning Pipeline (diagram): Data Sources → DATA → Gradient Boosted Trees or Neural Networks → Average Probability or Continuous Output

Slide 3

Machine Learning in Finance and Insurance: Claims, Fraud, Marketing, Pricing, Credit Risk

Slide 4

Machine Learning Pipeline (diagram): Data Sources → DATA → Gradient Boosted Trees or Neural Networks → Average Probability or Continuous Output

Slide 5

Machine Learning Pipeline (diagram): Data Sources → DATA → Gradient Boosted Trees or Neural Networks → Average Probability or Continuous Output

Slide 6

Machine Learning Pipeline (diagram): Data Sources → DATA → Feature Engineering → Feature Selection → Gradient Boosted Trees or Neural Networks → Average Probability or Continuous Output

Slide 7

Data Pre-processing Journey
• Common issues found in variables
• Feature / Variable engineering: solutions to the data issues
• Feature selection: do we need to select features?
• Feature / Variable selection methods
• Overview and knowledge sources

Slide 8

Data Pre-processing Journey
• Common issues found in variables
• Feature / Variable engineering: solutions to the data issues
• Feature selection: do we need to select features?
• Feature / Variable selection methods
• Overview and knowledge sources

Slide 9

Problems in Variables
• Missing data: missing values within a variable
• Labels: strings in categorical variables
• Distribution: normal vs skewed
• Outliers: unusual or unexpected values

Slide 10

Missing Data
• Missing values for certain observations
• Affects all machine learning models (Scikit-learn estimators do not accept missing values)
• Missing data can be random or systematic
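
Not from the slides, but as a concrete illustration: a minimal pandas sketch of quantifying missing data per variable; the toy DataFrame and the column names (age, city) are invented for this example.

```python
import numpy as np
import pandas as pd

# toy data with missing values (illustrative only)
df = pd.DataFrame({
    "age": [25, np.nan, 40, 33, np.nan],
    "city": ["London", "Paris", None, "London", "Madrid"],
})

# fraction of missing values per variable
print(df.isnull().mean())
```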

Slide 11

Labels in categorical variables
• Categories: labels are strings, which Scikit-learn models cannot use directly
• Cardinality: high number of labels
• Rare Labels: infrequent categories
• High cardinality and rare labels can lead to overfitting in tree-based algorithms
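
A minimal sketch (not part of the deck) of inspecting cardinality and label frequency with pandas; the colour labels and the 20% frequency threshold are arbitrary choices for illustration.

```python
import pandas as pd

# illustrative categorical variable
s = pd.Series(["blue", "blue", "red", "green", "blue", "red", "violet"])

print("cardinality:", s.nunique())       # number of distinct labels

# frequency of each label; labels below a threshold could be treated as rare
freq = s.value_counts(normalize=True)
print(freq[freq < 0.2])                   # infrequent (rare) labels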

Slide 12

Distributions: Gaussian vs Skewed
• Linear model assumption: variables follow a Gaussian distribution
• Other models make no assumption about the distribution
• A better spread of values may benefit performance
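
As an illustrative aside, a quick way to compare a skewed and a roughly Gaussian variable is the sample skewness; the simulated data below is invented for this sketch.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
skewed = pd.Series(rng.exponential(scale=2.0, size=1000))   # skewed variable
gaussian = pd.Series(rng.normal(size=1000))                 # roughly Gaussian variable

# skewness close to 0 suggests a roughly symmetric (Gaussian-like) spread
print("skew (exponential):", skewed.skew())
print("skew (normal):", gaussian.skew())
```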

Slide 13

Outliers
• Affect linear models
• AdaBoost can assign tremendous weights to outliers, leading to bad generalisation
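
A minimal sketch, not from the slides, of flagging unusual values with the common 1.5 x IQR rule; the simulated series and the injected extreme values are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(np.append(rng.normal(50, 5, 200), [120, 150]))  # two extreme values added

q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(x[(x < lower) | (x > upper)])  # flagged outliers
```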

Slide 14

Feature Magnitude - Scale
Machine learning models sensitive to feature scale:
• Linear and Logistic Regression
• Neural Networks
• Support Vector Machines
• KNN
• K-means clustering
• Linear Discriminant Analysis (LDA)
• Principal Component Analysis (PCA)
Tree-based ML models insensitive to feature scale:
• Classification and Regression Trees
• Random Forests
• Gradient Boosted Trees
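
For illustration, a minimal sketch of rescaling features with scikit-learn's StandardScaler and MinMaxScaler; the tiny matrix is invented for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])  # features on very different scales

print(StandardScaler().fit_transform(X))  # zero mean, unit variance per column
print(MinMaxScaler().fit_transform(X))    # values rescaled to the [0, 1] range
```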

Slide 15

Data Pre-processing Journey
• Common issues found in variables
• Feature / Variable engineering: solutions to the data issues
• Feature selection: do we need to select features?
• Feature / Variable selection methods
• Overview and knowledge sources

Slide 16

Missing Data Imputation
• Complete case analysis: may remove a big chunk of the dataset
• Mean / Median imputation: alters the distribution
• Random sample imputation: introduces an element of randomness
• Arbitrary number / End of distribution imputation: alters the distribution
• Binary NA indicator: still need to fill in the NA
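
A minimal pandas sketch, not from the slides, of several of these imputation techniques on an invented age column; the arbitrary value -1 and the mean + 3 standard deviations "end of distribution" rule are example choices.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 33, np.nan]})

df["age_median"] = df["age"].fillna(df["age"].median())        # mean / median imputation
df["age_arbitrary"] = df["age"].fillna(-1)                     # arbitrary number
end_of_dist = df["age"].mean() + 3 * df["age"].std()
df["age_end_dist"] = df["age"].fillna(end_of_dist)             # end of distribution
df["age_was_na"] = df["age"].isnull().astype(int)              # binary NA indicator
print(df)
```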

Slide 17

More on Missing Data Imputation
AI-derived NA imputation: use neighbouring variables to predict the missing value
• KNN
• Regression
These methods are more complex.
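
As a sketch of these ideas: recent scikit-learn versions (0.22 and later, released after this talk) ship KNNImputer and the experimental IterativeImputer, which fits a regression model per feature; the toy matrix below is invented for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])

print(KNNImputer(n_neighbors=2).fit_transform(X))         # fill NA from nearest neighbours
print(IterativeImputer(random_state=0).fit_transform(X))  # regression-based imputation
```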

Slide 18

Label Encoding
• One hot encoding
• Count / frequency encoding
• Mean encoding
• Ordinal encoding
• Weight of evidence
(Example tables in the slide show the Color variable encoded by count, by target mean and by ordinal number, next to the Target column.)
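
A minimal pandas sketch, not from the slides, of one hot, count and mean (target) encoding; the color/target toy data is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "color":  ["blue", "blue", "red", "green", "red"],
    "target": [0, 1, 1, 0, 1],
})

one_hot = pd.get_dummies(df["color"])                             # one hot encoding
counts = df["color"].map(df["color"].value_counts())              # count / frequency encoding
means = df["color"].map(df.groupby("color")["target"].mean())     # mean (target) encoding
print(one_hot, counts, means, sep="\n\n")
```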

Slide 19

Label Encoding
• One hot encoding: expands the feature space
• Count / frequency encoding
• Mean encoding: prone to overfitting
• Ordinal encoding: no monotonic relationship with the target
• Weight of evidence: must account for zero values as it uses the logarithm

Slide 20

Rare Labels
• Infrequent labels can be grouped together into a single 'Rare' category
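
A minimal sketch, assuming "rare" means labels present in less than 10% of observations (an arbitrary threshold), of grouping infrequent labels into a single 'Rare' category with pandas.

```python
import pandas as pd

s = pd.Series(["blue"] * 10 + ["red"] * 8 + ["violet", "teal"])  # 'violet' and 'teal' are rare

freq = s.value_counts(normalize=True)
rare = freq[freq < 0.10].index                  # labels present in < 10% of observations
grouped = s.where(~s.isin(rare), "Rare")        # replace them with a single 'Rare' label
print(grouped.value_counts())
```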

Slide 21

Distribution: Gaussian Transformation (skewed → Gaussian)
Variable transformations:
• Logarithmic: ln(x)
• Exponential: x^n (any power)
• Reciprocal: 1 / x
• Box-Cox: (x^λ - 1) / λ, with λ varying from -5 to 5
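
For illustration, a minimal sketch of the logarithmic, reciprocal and Box-Cox transformations using numpy and scipy on simulated skewed data (the data is invented for this example).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1000) + 1   # strictly positive, skewed variable

x_log = np.log(x)                  # logarithmic transformation
x_rec = 1.0 / x                    # reciprocal transformation
x_bc, lam = stats.boxcox(x)        # Box-Cox; lambda estimated from the data

print("skew before:", stats.skew(x), "after Box-Cox:", stats.skew(x_bc), "lambda:", lam)
```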

Slide 22

Distribution: Discretisation (skewed → improved value spread)
• Equal width bins: bin width = (max - min) / number of bins; generally does not improve the spread
• Equal frequency bins: bins determined by quantiles, equal number of observations per bin; generally improves the spread
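
A minimal sketch, not from the slides, of equal width and equal frequency discretisation with pandas cut and qcut on simulated skewed data; the number of bins (5) is an arbitrary choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.exponential(scale=2.0, size=1000))  # skewed variable

equal_width = pd.cut(x, bins=5)   # bins of equal width: (max - min) / 5
equal_freq = pd.qcut(x, q=5)      # bins by quantiles: equal number of observations per bin

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```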

Slide 23

Outliers
• Trimming: remove the observations from the dataset
• Top / bottom coding: cap top and bottom values
• Discretisation: equal frequency / equal width bins
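
A minimal sketch of trimming versus top / bottom coding, assuming the 5th and 95th percentiles as caps (an arbitrary choice); the simulated series is invented for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(np.append(rng.normal(50, 5, 200), [150, -40]))  # extreme values added

lower, upper = x.quantile(0.05), x.quantile(0.95)
capped = x.clip(lower, upper)                  # top / bottom coding: cap the values
trimmed = x[(x >= lower) & (x <= upper)]       # trimming: drop the observations

print(len(x), len(trimmed), capped.min(), capped.max())
```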

Slide 24

Data Pre-processing Journey
• Common issues found in variables
• Feature / Variable engineering: solutions to the data issues
• Feature selection: do we need to select features?
• Feature / Variable selection methods
• Overview and knowledge sources

Slide 25

Why Do We Select Features?
• Simple models are easier to interpret
• Shorter training times
• Enhanced generalisation by reducing overfitting
• Easier to implement by software developers (model production)
• Reduced risk of data errors during model use
• Data redundancy

Slide 26

Variable Redundancy
• Constant variables: only 1 value per variable
• Quasi-constant variables: > 99% of observations show the same value
• Duplication: the same variable appears multiple times in the dataset
• Correlation: correlated variables provide the same information
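
As an illustrative sketch, a quick pandas check for constant, duplicated and correlated variables on an invented four-column DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1, 2, 3, 4],      # duplicate of 'a'
    "c": [7, 7, 7, 7],      # constant
    "d": [2, 4, 6, 8],      # perfectly correlated with 'a'
})

constant = [c for c in df if df[c].nunique() == 1]          # constant variables
duplicated = df.T[df.T.duplicated()].index.tolist()          # duplicated columns
corr = df.corr().abs()                                       # pairwise correlations
print("constant:", constant, "| duplicated:", duplicated)
print(corr)   # inspect pairs with correlation above a chosen threshold, e.g. 0.9
```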

Slide 27

Data Pre-processing Journey
• Common issues found in variables
• Feature / Variable engineering: solutions to the data issues
• Feature selection: do we need to select features?
• Feature / Variable selection methods
• Overview and knowledge sources

Slide 28

Feature Selection Methods
• Filter methods
• Wrapper methods
• Embedded methods

Slide 29

Filter methods
Independent of the ML algorithm; based only on variable characteristics
Pros: fast computation, model agnostic, quick feature removal
Cons: poor model performance, does not capture feature interaction, does not capture redundancy

Slide 30

Filter methods
Rank features according to a criterion and select the highest ranking features.
Criteria: Chi-square | Fisher score, univariate parametric tests (ANOVA), mutual information, variance
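
A minimal scikit-learn sketch of univariate filter selection with SelectKBest (ANOVA F-test) and mutual information scores; the synthetic dataset is generated only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=0)

selector = SelectKBest(score_func=f_classif, k=3)    # univariate ANOVA F-test ranking
X_new = selector.fit_transform(X, y)
print("selected feature indices:", selector.get_support(indices=True))

mi = mutual_info_classif(X, y, random_state=0)       # mutual information per feature
print("mutual information:", mi.round(3))
```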

Slide 31

Wrapper methods
Consider the ML algorithm; evaluate subsets of features
Pros: best feature subset for a given algorithm, best performance, considers feature interaction
Cons: computationally expensive, often impracticable, not model agnostic

Slide 32

Wrapper methods
Procedure: search for a subset of features → build ML model with the selected subset → evaluate model performance → repeat
• Forward feature selection: adds 1 feature at a time
• Backward feature elimination: removes 1 feature at a time
• Exhaustive feature search: searches across all possible feature combinations
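
As a sketch of forward selection: recent scikit-learn versions (0.24 and later, released after this talk) provide SequentialFeatureSelector; the logistic regression estimator, the cross-validation setting and the target of 3 features are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=0)

sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="forward",       # or "backward" for backward feature elimination
    cv=3,
)
sfs.fit(X, y)
print("selected feature indices:", sfs.get_support(indices=True))
```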

Slide 33

Embedded methods
Feature selection occurs during training of the ML algorithm

Slide 34

Embedded methods
Procedure: train ML model → derive feature importance → remove non-important features
• LASSO
• Tree-derived feature importance
• Regression coefficients
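
A minimal sketch, not from the slides, of two embedded approaches in scikit-learn: L1 (Lasso-style) regularisation via SelectFromModel and tree-derived feature importances; the synthetic data and the hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=0)

# L1 (Lasso-style) regularisation shrinks unimportant coefficients to zero
lasso = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.5))
print("kept by L1:", lasso.fit(X, y).get_support(indices=True))

# tree-derived feature importance
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("importances:", rf.feature_importances_.round(3))
```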

Slide 35

Data Pre-processing Journey
• Common issues found in variables
• Feature / Variable engineering: solutions to the data issues
• Feature selection: do we need to select features?
• Feature / Variable selection methods
• Overview and knowledge sources

Slide 36

Knowledge Resources
• Feature Engineering + Selection: courses on Udemy.com, include code
• Summary of learnings from the winners
• Feature Engine: Python package for feature engineering (work in progress)