I describe the problems commonly found in the variables of a dataset, how they affect different machine learning models, and the techniques we can use to overcome them.
Feature engineering
• Fundamental: needed to make machine learning algorithms work
• Time consuming: data cleaning and preparation demand a big effort
• Key: pre-processing variables well is key for good machine learning models
Problems in variables
• Missing data: missing values within a variable
• Labels: strings in categorical variables
• Distribution: normal vs skewed
• Outliers: unusual or unexpected values
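These four problems can be spotted with a quick scan of the dataset. A minimal pandas sketch, using a made-up toy DataFrame (column names and values are illustrative, not from the course):

```python
import pandas as pd
import numpy as np

# Toy dataset (made up for illustration) exhibiting the four problem types
df = pd.DataFrame({
    "age": [25, 30, np.nan, 28, 95],                          # missing value + unusual value
    "city": ["London", "Paris", "London", "Lima", "Paris"],   # string labels
    "income": [30_000, 32_000, 31_000, 29_000, 500_000],      # skewed, with an outlier
})

missing_per_column = df.isnull().mean()                  # fraction of NA per variable
string_columns = df.select_dtypes("object").columns.tolist()  # categorical candidates
income_skew = df["income"].skew()                        # large positive value => right skew
```

Each of these diagnostics maps to one of the problems above; the later slides cover how to fix them.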
Labels
• Cardinality: high number of labels (can affect tree based methods)
• Rare labels: infrequent categories
• Categories: strings (Scikit-learn models require numeric inputs)
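Cardinality and rare labels are easy to quantify with value counts. A minimal sketch on a hypothetical categorical variable (labels and the 5% rarity threshold are illustrative assumptions, not prescribed by the course):

```python
import pandas as pd

# Hypothetical categorical variable (labels are made up)
s = pd.Series(["blue"] * 50 + ["red"] * 45 + ["magenta"] * 3 + ["teal"] * 2)

cardinality = s.nunique()                        # number of distinct labels
freq = s.value_counts(normalize=True)            # relative frequency of each label
rare_labels = freq[freq < 0.05].index.tolist()   # labels appearing in under 5% of rows
```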
Distributions: Gaussian vs skewed
• Linear model assumptions: variables follow a Gaussian distribution
• Other models: no distributional assumption
• A better spread of values may benefit performance
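A common way to improve the spread of a right-skewed variable is a log transformation. A sketch on simulated data (the log-normal sample is made up for illustration; the log transform requires strictly positive values):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Simulated right-skewed variable: a log-normal sample, purely illustrative
x = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5_000))

skew_before = x.skew()           # strongly positive: long right tail
skew_after = np.log(x).skew()    # close to zero: roughly Gaussian after the transform
```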
Feature magnitude - scale
Machine learning models affected by the magnitude of the features:
• Linear and Logistic Regression
• Neural Networks
• Support Vector Machines
• KNN
• K-means clustering
• Linear Discriminant Analysis (LDA)
• Principal Component Analysis (PCA)
Machine learning models insensitive to feature magnitude are the ones based on trees:
• Classification and Regression Trees
• Random Forests
• Gradient Boosted Trees
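For the scale-sensitive models above, standardisation is a common remedy. A minimal sketch using Scikit-learn's `StandardScaler` on made-up values (the two columns deliberately differ in scale by four orders of magnitude):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (illustrative values)
X = np.array([[1.0, 10_000.0],
              [2.0, 20_000.0],
              [3.0, 30_000.0],
              [4.0, 40_000.0]])

# Standardisation: subtract the mean and divide by the standard deviation,
# so both columns end up on a comparable scale for KNN, SVMs, PCA, etc.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```

Tree-based models split one feature at a time, which is why they are unaffected by this step.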
Missing data
• Complete case analysis: may remove a big chunk of the dataset
• Mean / median imputation: alters the distribution
• Random sample imputation: introduces an element of randomness
• Arbitrary number and end of distribution imputation: alter the distribution
• NA indicator: still need to fill in the NA
Labels
• One hot encoding: expands the feature space
• Count / frequency encoding
• Ordinal encoding: no monotonic relationship with the target
• Mean encoding: prone to overfitting
• Weight of evidence: must account for zero values, as it uses a logarithm
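The first two encodings can be done with plain pandas. A minimal sketch on a made-up `city` column (one hot encoding via `pd.get_dummies`, count encoding via `value_counts` plus `map`):

```python
import pandas as pd

df = pd.DataFrame({"city": ["London", "Paris", "London", "Lima"]})

# One hot encoding: one binary column per label (the feature space expands)
one_hot = pd.get_dummies(df["city"], prefix="city")

# Count encoding: replace each label by how often it appears in the data
counts = df["city"].value_counts()
df["city_count"] = df["city"].map(counts)
```

Note how three labels already produce three new columns under one hot encoding, which is why it scales poorly with high cardinality.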
Outliers
• Trimming: remove the observations from the dataset
• Top / bottom coding: censor the top and bottom values
• Discretisation: equal frequency / equal width / tree induced bins
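Top / bottom coding can be sketched with quantile fences and `clip`. The data and the 1.5 × IQR rule below are illustrative assumptions (the IQR proximity rule is one common choice of fence, not the only one):

```python
import pandas as pd

s = pd.Series([12, 14, 15, 13, 14, 15, 200])   # 200 is an unusual value

# Fences at 1.5 * IQR beyond the quartiles (the IQR proximity rule)
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Top / bottom coding: censor values beyond the fences instead of dropping rows
capped = s.clip(lower=lower, upper=upper)
```

Unlike trimming, this keeps every row, which matters when observations are scarce.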
Feature Engineering for Machine Learning
https://www.udemy.com/feature-engineering-for-machine-learning/
I gathered multiple techniques used worldwide for feature transformation, learnt from Kaggle and KDD competition websites, white papers, blogs and forums, and from my experience as a Data Scientist. The aim is to provide a source of reference for data scientists, where they can learn and revisit the techniques and code needed to modify variables prior to their use in machine learning algorithms.
DSCOACH2018 (discount voucher)