I describe the problems commonly found in the variables of a dataset, how they affect different machine learning models, and the techniques we can use to overcome them.
Feature engineering
• Fundamental: needed to make machine learning algorithms work
• Time consuming: data cleaning and preparation demand a big effort
• Key: pre-processing variables well is key for good machine learning models
Problems in variables
• Missing data: missing values within a variable
• Labels: strings in categorical variables
• Distribution: normal vs skewed
• Outliers: unusual or unexpected values
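These four problems can be spotted with a quick scan of the dataset. A minimal pandas sketch, using a made-up toy DataFrame (column names and values are illustrative, not from the course):

```python
import pandas as pd
import numpy as np

# Toy dataset (made up for illustration) exhibiting the four problem types
df = pd.DataFrame({
    "age": [25, 30, np.nan, 28, 95],                          # missing value + unusual value
    "city": ["London", "Paris", "London", "Lima", "Paris"],   # string labels
    "income": [30_000, 32_000, 31_000, 29_000, 500_000],      # skewed, with an outlier
})

missing_per_column = df.isnull().mean()                  # fraction of NA per variable
string_columns = df.select_dtypes("object").columns.tolist()  # categorical candidates
income_skew = df["income"].skew()                        # large positive value => right skew
```

Each of these diagnostics maps to one of the problems above; the later slides cover how to fix them.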
Labels
• Cardinality: high number of labels (can affect tree based methods)
• Rare labels: infrequent categories
• Categories: strings (Scikit-learn models require numeric inputs)
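Cardinality and rare labels are easy to quantify with value counts. A minimal sketch on a hypothetical categorical variable (labels and the 5% rarity threshold are illustrative assumptions, not prescribed by the course):

```python
import pandas as pd

# Hypothetical categorical variable (labels are made up)
s = pd.Series(["blue"] * 50 + ["red"] * 45 + ["magenta"] * 3 + ["teal"] * 2)

cardinality = s.nunique()                        # number of distinct labels
freq = s.value_counts(normalize=True)            # relative frequency of each label
rare_labels = freq[freq < 0.05].index.tolist()   # labels appearing in under 5% of rows
```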
Distributions: Gaussian vs skewed
• Linear model assumptions: variables follow a Gaussian distribution
• Other models: no distributional assumption
• A better spread of values may benefit performance
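A common way to improve the spread of a right-skewed variable is a log transformation. A sketch on simulated data (the log-normal sample is made up for illustration; the log transform requires strictly positive values):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Simulated right-skewed variable: a log-normal sample, purely illustrative
x = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5_000))

skew_before = x.skew()           # strongly positive: long right tail
skew_after = np.log(x).skew()    # close to zero: roughly Gaussian after the transform
```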
Feature magnitude - scale
Machine learning models affected by the magnitude of the features:
• Linear and Logistic Regression
• Neural Networks
• Support Vector Machines
• KNN
• K-means clustering
• Linear Discriminant Analysis (LDA)
• Principal Component Analysis (PCA)
Machine learning models insensitive to feature magnitude are the ones based on trees:
• Classification and Regression Trees
• Random Forests
• Gradient Boosted Trees
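For the scale-sensitive models above, standardisation is a common remedy. A minimal sketch using Scikit-learn's `StandardScaler` on made-up values (the two columns deliberately differ in scale by four orders of magnitude):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (illustrative values)
X = np.array([[1.0, 10_000.0],
              [2.0, 20_000.0],
              [3.0, 30_000.0],
              [4.0, 40_000.0]])

# Standardisation: subtract the mean and divide by the standard deviation,
# so both columns end up on a comparable scale for KNN, SVMs, PCA, etc.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```

Tree-based models split one feature at a time, which is why they are unaffected by this step.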
Missing data
• Complete case analysis: may remove a big chunk of the dataset
• Mean / median imputation: alters the distribution
• Random sample imputation: introduces an element of randomness
• Arbitrary number and end of distribution imputation: alter the distribution
• NA indicator: still need to fill in the NA
Labels
• One hot encoding: expands the feature space
• Count / frequency encoding
• Ordinal encoding: no monotonic relationship with the target
• Mean encoding: prone to overfitting
• Weight of evidence: must account for zero values, as it uses a logarithm
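The first two encodings can be done with plain pandas. A minimal sketch on a made-up `city` column (one hot encoding via `pd.get_dummies`, count encoding via `value_counts` plus `map`):

```python
import pandas as pd

df = pd.DataFrame({"city": ["London", "Paris", "London", "Lima"]})

# One hot encoding: one binary column per label (the feature space expands)
one_hot = pd.get_dummies(df["city"], prefix="city")

# Count encoding: replace each label by how often it appears in the data
counts = df["city"].value_counts()
df["city_count"] = df["city"].map(counts)
```

Note how three labels already produce three new columns under one hot encoding, which is why it scales poorly with high cardinality.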
Outliers
• Trimming: remove the observations from the dataset
• Top / bottom coding: censor the top and bottom values
• Discretisation: equal frequency / equal width / tree induced bins
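Top / bottom coding can be sketched with quantile fences and `clip`. The data and the 1.5 × IQR rule below are illustrative assumptions (the IQR proximity rule is one common choice of fence, not the only one):

```python
import pandas as pd

s = pd.Series([12, 14, 15, 13, 14, 15, 200])   # 200 is an unusual value

# Fences at 1.5 * IQR beyond the quartiles (the IQR proximity rule)
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Top / bottom coding: censor values beyond the fences instead of dropping rows
capped = s.clip(lower=lower, upper=upper)
```

Unlike trimming, this keeps every row, which matters when observations are scarce.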
Feature Engineering for Machine Learning
https://www.udemy.com/feature-engineering-for-machine-learning/
I gathered multiple techniques used worldwide for feature transformation, learnt from Kaggle and KDD competition websites, white papers, blogs and forums, and from my experience as a Data Scientist. The aim is to provide a source of reference for data scientists, where they can learn and revisit the techniques and code needed to modify variables prior to their use in machine learning algorithms.
DSCOACH2018 (discount voucher)