Slide 1

Soledad Galli, PhD. PyData London, 42nd Meetup, 2018

Slide 2

What is feature engineering?

Slide 3

Feature engineering

The process of using domain knowledge of the data to create features or variables for use in machine learning.

Slide 4

Feature engineering
• Fundamental: to make machine learning algorithms work
• Time consuming: a big effort goes into data cleaning and preparation
• Key: pre-processing variables is key for good machine learning models

Slide 5

Feature engineering
• Problems found in data
• Impact on machine learning models
• How to address these problems

Slide 7

1st: Problems in variables
• Missing data: missing values within a variable
• Labels: strings in categorical variables
• Distribution: normal vs skewed
• Outliers: unusual or unexpected values

Slide 8

Missing data
• Missing values for certain observations
• Affects all machine learning models
• Scikit-learn models do not accept missing values
• Mechanisms: MCAR (missing completely at random), MAR (missing at random), MNAR (missing not at random)

Slide 9

Labels
• Cardinality: high number of labels (high cardinality tends to make tree-based methods overfit)
• Rare labels: infrequent categories
• Categories: strings
• Scikit-learn requires categories to be encoded as numbers

Slide 10

Distributions: Gaussian vs skewed
• Linear model assumption: variables follow a Gaussian distribution
• Other models: no assumption
• A better spread of values may benefit performance

Slide 11

Outliers
• Linear models: bad generalisation
• AdaBoost: assigns tremendous weights to outliers

Slide 12

Feature magnitude / scale

Machine learning models affected by the magnitude of the feature:
• Linear and logistic regression
• Neural networks
• Support vector machines
• KNN
• K-means clustering
• Linear discriminant analysis (LDA)
• Principal component analysis (PCA)

Machine learning models insensitive to feature magnitude are the tree-based ones:
• Classification and regression trees
• Random forests
• Gradient boosted trees
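A minimal sketch of the two most common scaling approaches, using scikit-learn's StandardScaler and MinMaxScaler on a hypothetical two-column DataFrame:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data: two numeric features on very different scales
df = pd.DataFrame({"age": [22, 35, 58, 41],
                   "fare": [7.25, 71.28, 512.33, 8.05]})

# Standardisation (zero mean, unit variance): common for linear models, SVMs, PCA
standardised = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

# Min-max scaling (values squeezed into [0, 1]): common for neural networks and KNN
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

print(standardised.round(2))
print(scaled.round(2))
```

Tree-based models can be trained on the raw, unscaled features.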

Slide 13

Feature engineering
• Problems found in data
• Impact on machine learning models
• How to address these problems

Slide 14

Missing data
• Complete case analysis: may remove a big chunk of the dataset
• Mean / median imputation: alters the distribution
• Random sample imputation: introduces an element of randomness
• Arbitrary number / end of distribution imputation: alters the distribution
• NA indicator: still need to fill in the NA
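As a minimal sketch, three of these options on a hypothetical "age" column, using pandas and scikit-learn's SimpleImputer:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 31, np.nan, 58]})

# Median imputation: fill NA with the median (alters the distribution)
df["age_median"] = SimpleImputer(strategy="median").fit_transform(df[["age"]]).ravel()

# NA indicator: flag which rows were missing (the NA itself still needs filling)
df["age_was_na"] = df["age"].isna().astype(int)

# End-of-distribution imputation: fill with mean + 3 standard deviations
df["age_eod"] = df["age"].fillna(df["age"].mean() + 3 * df["age"].std())

print(df)
```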

Slide 15

Labels
• One hot encoding: expands the feature space
• Count / frequency encoding
• Mean encoding: prone to overfitting
• Ordinal encoding: no monotonic relationship
• Weight of evidence: must account for zero values, as it uses the logarithm
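A minimal sketch of three of these encodings with plain pandas; the "city" column and binary target are made up, and in practice the mappings should be learnt on the training set only:

```python
import pandas as pd

df = pd.DataFrame({"city": ["London", "Paris", "London", "Rome", "Paris", "London"],
                   "target": [1, 0, 1, 0, 1, 1]})

# One hot encoding: one binary column per label (expands the feature space)
one_hot = pd.get_dummies(df["city"], prefix="city")

# Count / frequency encoding: replace each label by how often it occurs
df["city_count"] = df["city"].map(df["city"].value_counts())

# Mean (target) encoding: replace each label by the mean of the target
# for that label (prone to overfitting if computed on the full dataset)
df["city_mean"] = df["city"].map(df.groupby("city")["target"].mean())

print(df.join(one_hot))
```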

Slide 16

Rare labels
• Infrequent labels can be grouped into a single "Rare" category
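A minimal sketch of the regrouping with pandas, assuming an illustrative 5% frequency threshold:

```python
import pandas as pd

df = pd.DataFrame({"city": ["London"] * 60 + ["Paris"] * 35 + ["Oslo"] * 3 + ["Rome"] * 2})

# Labels appearing in fewer than 5% of observations are treated as rare
freq = df["city"].value_counts(normalize=True)
rare_labels = freq[freq < 0.05].index

# Replace the rare labels with a single "Rare" category
df["city_grouped"] = df["city"].where(~df["city"].isin(rare_labels), "Rare")

print(df["city_grouped"].value_counts())
```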

Slide 17

Distribution
• Transformation: logarithm, exponentiation, reciprocal, Box-Cox
• Discretisation: equal width, equal frequency, decision trees
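A minimal sketch of these transformations and discretisations on a skewed toy variable, using numpy, scipy and pandas:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
fare = pd.Series(rng.exponential(scale=30, size=1000) + 1)  # skewed, strictly positive

# Transformations
log_fare = np.log(fare)                # logarithm: needs positive values
recip_fare = 1 / fare                  # reciprocal: needs non-zero values
boxcox_fare, lam = stats.boxcox(fare)  # Box-Cox: finds the best power transform

# Discretisation
equal_width = pd.cut(fare, bins=10)    # equal width: intervals of the same size
equal_freq = pd.qcut(fare, q=10)       # equal frequency: same count per bin

print(equal_freq.value_counts().sort_index())
```

Tree-induced discretisation follows the same idea: fit a shallow decision tree on the variable and use its split points as bin boundaries.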

Slide 18

[Figure: distribution of Fare (US$) before and after Box-Cox transformation and discretisation]

Slide 19

Outliers
• Trimming: remove the observations from the dataset
• Top | bottom coding: censor the top and bottom values
• Discretisation: equal bins / equal width / tree-induced
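A minimal sketch of trimming and top | bottom coding using quantile boundaries; the 5th/95th percentile cut-offs are an illustrative choice:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
s = pd.Series(np.append(rng.normal(50, 10, 500), [250.0, 300.0, -80.0]))  # with outliers

lower, upper = s.quantile(0.05), s.quantile(0.95)

# Trimming: drop the observations outside the boundaries
trimmed = s[s.between(lower, upper)]

# Top | bottom coding: censor values beyond the boundaries instead of dropping them
capped = s.clip(lower=lower, upper=upper)

print(len(s), len(trimmed), round(capped.min(), 1), round(capped.max(), 1))
```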

Slide 20

Feature Engineering for Machine Learning
https://www.udemy.com/feature-engineering-for-machine-learning/

Slide 21

Feature Engineering for Machine Learning
https://www.udemy.com/feature-engineering-for-machine-learning/

The course gathers multiple techniques used worldwide for feature transformation, learnt from Kaggle and KDD competition websites, white papers, blogs and forums, and from my experience as a data scientist. It aims to provide a source of reference for data scientists, where they can learn and revisit the techniques and code needed to modify variables prior to their use in machine learning algorithms.

DSCOACH2018 (discount voucher)