Feature Engineering for Machine Learning

Sole
February 07, 2018


I describe the problems commonly found in the variables of a dataset, how they affect different machine learning models, and the techniques we can use to overcome them.


Transcript

  1. Soledad Galli, PhD. PyData London, 42nd Meetup, 2018

  2. What is feature engineering?

  3. Feature engineering: the process of using domain knowledge of the data to create features or variables to use in machine learning.
  4. Feature engineering:
     • Fundamental: to make machine learning algorithms work
     • Time consuming: big effort in data cleaning and preparation
     • Key: pre-processing variables is key for good machine learning models
  5. Feature engineering: • Problems found in data • Impact on machine learning models • How to address these problems

  6. Feature engineering: • Problems found in data • Impact on machine learning models • How to address these problems
  7. 1st: Problems in variables
     • Missing data: missing values within a variable
     • Labels: strings in categorical variables
     • Distribution: normal vs skewed
     • Outliers: unusual or unexpected values
  8. Missing data:
     • Missing values for certain observations
     • Affects all machine learning models (Scikit-learn estimators do not accept missing values)
     • Mechanisms: MCAR, MAR, MNAR
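Before choosing how to handle missing data, the usual first step is to quantify it per variable. A minimal sketch with pandas, using an invented toy dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with missing values (invented for illustration)
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0, 28.0],
    "cabin": ["C1", None, None, "B2", "A3"],
})

# Fraction of missing values per variable
missing_fraction = df.isnull().mean()
print(missing_fraction)  # age: 0.2, cabin: 0.4
```

Whether those fractions come from MCAR, MAR, or MNAR mechanisms cannot be read off the numbers alone; that requires domain knowledge of how the data was collected.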
  9. Labels:
     • Categories: values are strings
     • Cardinality: high number of labels
     • Rare labels: infrequent categories
     • Affects tree based methods; Scikit-learn requires numerical inputs
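Cardinality and rare labels can both be inspected with pandas value counts. A sketch on an invented colour variable, using an arbitrary 20% frequency threshold to flag rare labels:

```python
import pandas as pd

# Invented categorical variable for illustration
colour = pd.Series(["blue", "blue", "red", "green", "blue", "red", "yellow"])

cardinality = colour.nunique()                 # number of distinct labels
freq = colour.value_counts(normalize=True)     # relative frequency of each label
rare_labels = freq[freq < 0.2].index.tolist()  # labels below the 20% threshold
```

Here "green" and "yellow" each appear once (about 14% of observations), so they fall under the threshold and would be candidates for grouping into a "Rare" category.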
  10. Distributions: Gaussian vs skewed
      • Linear model assumptions: variables follow a Gaussian distribution
      • Other models: no assumption, but a better spread of values may benefit performance
  11. Outliers: affect linear models and AdaBoost, which assigns tremendous weights to outliers, leading to bad generalisation.

  12. Feature magnitude (scale). Machine learning models affected by the magnitude of the feature:
      • Linear and Logistic Regression
      • Neural Networks
      • Support Vector Machines
      • KNN
      • K-means clustering
      • Linear Discriminant Analysis (LDA)
      • Principal Component Analysis (PCA)
      Machine learning models insensitive to feature magnitude are the ones based on trees:
      • Classification and Regression Trees
      • Random Forests
      • Gradient Boosted Trees
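For the magnitude-sensitive models above, a common remedy is standardisation: rescaling each feature to zero mean and unit variance, as scikit-learn's StandardScaler does. A numpy-only sketch on two invented features with very different scales:

```python
import numpy as np

# Two features on very different scales (invented values)
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# Standardisation: subtract the column mean, divide by the column std
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
```

After scaling, both columns contribute comparably to distance-based computations (KNN, K-means, SVMs) and to gradient-based optimisation.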
  13. Feature engineering: • Problems found in data • Impact on machine learning models • How to address these problems
  14. Missing data:
      • Complete case analysis: may remove a big chunk of the dataset
      • Mean / median imputation: alters the distribution
      • Random sample imputation: element of randomness
      • Arbitrary number / end of distribution imputation: alters the distribution
      • NA indicator: still need to fill in the NA
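Median imputation is often combined with an NA indicator, so the model can still see which values were originally missing. A sketch on an invented age variable:

```python
import numpy as np
import pandas as pd

# Invented variable with one missing value
df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 31.0]})

# Flag the missing observations before overwriting them
df["age_was_missing"] = df["age"].isnull().astype(int)

# Median imputation: note that it alters the original distribution
df["age"] = df["age"].fillna(df["age"].median())
```

In practice the median should be learned from the training set only and then applied to the test set, to avoid leaking information across the split.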
  15. Labels:
      • One hot encoding: expands the feature space
      • Count / frequency encoding
      • Ordinal encoding: creates no monotonic relationship with the target
      • Mean encoding: prone to overfitting
      • Weight of evidence: must account for zero values, as it uses the logarithm
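Two of the simpler encodings can be sketched directly in pandas; the colour variable below is invented for illustration:

```python
import pandas as pd

# Invented categorical variable
df = pd.DataFrame({"colour": ["blue", "red", "blue", "green"]})

# One hot encoding: one new binary column per label (expands the feature space)
one_hot = pd.get_dummies(df["colour"], prefix="colour")

# Count encoding: replace each label by how often it appears in the data
counts = df["colour"].value_counts()
df["colour_count"] = df["colour"].map(counts)
```

Note the trade-off visible even in this toy case: one hot encoding turns one column into three, while count encoding keeps a single column but maps different labels with the same frequency ("red" and "green") to the same value.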
  16. Rare labels: infrequent labels can be grouped together into a single "Rare" category.

  17. Distribution:
      • Transformation: logarithm, exponentiation, reciprocal, Box-Cox
      • Discretisation: equal width, equal frequency, decision trees
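A log transformation and an equal-width discretisation can be sketched as follows, on an invented right-skewed variable:

```python
import numpy as np
import pandas as pd

# Invented right-skewed toy variable
x = pd.Series([1.0, 2.0, 5.0, 10.0, 100.0])

# Log transformation: reduces right skew (requires strictly positive values)
x_log = np.log(x)

# Equal-width discretisation into 3 bins spanning equal ranges of x
bins = pd.cut(x, bins=3, labels=False)
```

Equal-width bins are sensitive to outliers: here the single large value stretches the range, so four of the five observations land in the first bin. Equal-frequency binning (e.g. `pd.qcut`) avoids that at the cost of uneven bin widths.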
  18. Distribution: [plot of the "Fare" (US$) variable, before and after Box-Cox transformation and discretisation]

  19. Outliers:
      • Trimming: remove the observations from the dataset
      • Top / bottom coding: censor the top and bottom values
      • Discretisation: equal frequency / equal width / tree induced bins
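Top / bottom coding (censoring at chosen quantiles, sometimes called winsorisation) can be sketched as follows; the 5th/95th percentile cut-offs are an arbitrary choice for illustration:

```python
import pandas as pd

# Invented variable where 100 is an obvious outlier
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

# Censor values outside the 5th and 95th percentiles
lower, upper = s.quantile(0.05), s.quantile(0.95)
s_capped = s.clip(lower=lower, upper=upper)
```

Unlike trimming, capping keeps all observations in the dataset; only their extreme values are pulled in to the chosen cut-offs.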
  20. Feature Engineering for Machine Learning https://www.udemy.com/feature-engineering-for-machine-learning/

  21. Feature Engineering for Machine Learning https://www.udemy.com/feature-engineering-for-machine-learning/ I gathered multiple techniques used worldwide for feature transformation, learnt from the Kaggle and KDD competition websites, white papers, blogs and forums, and from my experience as a Data Scientist. The aim is to provide a source of reference for data scientists, where they can learn and re-visit the techniques and code needed to modify variables prior to their use in machine learning algorithms. DSCOACH2018 (discount voucher)