
Feature Engineering for Machine Learning

Sole
February 07, 2018


I describe the different problems that we can find in variables in a dataset, how they affect the different machine learning models, and which techniques we can use to overcome them.


Transcript

  1. Feature engineering: the process of using domain knowledge of the data to create features or variables to use in machine learning.
  2. Feature engineering
     • Fundamental: needed to make machine learning algorithms work
     • Time consuming: a big effort goes into data cleaning and preparation
     • Key: pre-processing variables is key to good machine learning models
  3. Feature engineering
     • Problems found in data
     • Impact on machine learning models
     • How to address these problems
  4. Feature engineering
     • Problems found in data
     • Impact on machine learning models
     • How to address these problems
  5. 1st: Problems in variables
     • Missing data: missing values within a variable
     • Labels: strings in categorical variables
     • Distribution: normal vs skewed
     • Outliers: unusual or unexpected values
  6. Missing data
     • Missing values for certain observations
     • Affects all machine learning models; most scikit-learn estimators do not accept missing values
     • Mechanisms: MCAR (missing completely at random), MAR (missing at random), MNAR (missing not at random)
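Before choosing how to handle missing data, it helps to quantify it. A minimal sketch with pandas, using a hypothetical toy dataset (the column names and values are made up for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical toy dataset with missing values
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31, np.nan],
    "income": [50000, 62000, np.nan, 48000, 55000],
})

# Fraction of missing observations per variable
missing_fraction = df.isnull().mean()
print(missing_fraction)  # age: 0.4, income: 0.2
```

Knowing the fraction (and, where possible, the mechanism) of missingness guides the choice between dropping observations and imputing.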
  7. Labels
     • Cardinality: high number of labels
     • Rare labels: infrequent categories
     • Categories: strings
     • Affects tree-based methods; scikit-learn requires categories to be encoded as numbers
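Cardinality and rare labels can be measured directly with pandas. A minimal sketch on a hypothetical categorical variable (labels and the 15% rarity threshold are illustrative choices, not from the deck):

```python
import pandas as pd

# Hypothetical categorical variable
s = pd.Series(["blue", "red", "blue", "green", "blue", "red", "purple"])

# Cardinality: number of distinct labels
cardinality = s.nunique()

# Rare labels: categories below a chosen frequency threshold (here 15%)
freq = s.value_counts(normalize=True)
rare_labels = freq[freq < 0.15].index.tolist()
print(cardinality, rare_labels)
```

Rare labels found this way are often grouped into a single "Rare" category before encoding, so the model does not overfit to categories seen only a handful of times.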
  8. Distributions: Gaussian vs skewed
     • Linear model assumption: variables follow a Gaussian distribution
     • Other models make no distribution assumption
     • A better spread of values may benefit performance
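A common way to reduce right skew is a log transform. A minimal sketch with NumPy, using synthetic lognormal data (the distribution parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed variable; many real-world magnitudes look like this
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# A log transform often brings skewed data closer to Gaussian
x_log = np.log(x)

def skew(a):
    # Simple (biased) sample skewness
    a = np.asarray(a)
    return np.mean((a - a.mean()) ** 3) / a.std() ** 3

print(skew(x), skew(x_log))
```

For lognormal data the log transform recovers a Gaussian exactly; on real data it merely reduces skew, and alternatives such as square-root or Box-Cox transforms may fit better.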
  9. Feature magnitude (scale)
     Machine learning models affected by the magnitude of the features:
     • Linear and logistic regression
     • Neural networks
     • Support vector machines
     • KNN
     • K-means clustering
     • Linear discriminant analysis (LDA)
     • Principal component analysis (PCA)
     Models insensitive to feature magnitude are the tree-based ones:
     • Classification and regression trees
     • Random forests
     • Gradient boosted trees
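For the magnitude-sensitive models above, features are typically standardised to zero mean and unit variance. A minimal sketch with NumPy on a hypothetical two-feature matrix (this is the same operation scikit-learn's StandardScaler performs):

```python
import numpy as np

# Hypothetical features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Standardise each feature: zero mean, unit variance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

After scaling, no single feature dominates distance computations (KNN, K-means, SVM) or gradient updates (linear models, neural networks) purely because of its units.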
  10. Feature engineering
     • Problems found in data
     • Impact on machine learning models
     • How to address these problems
  11. Missing data: techniques and caveats
     • Complete case analysis: may remove a big chunk of the dataset
     • Mean / median imputation: alters the distribution
     • Random sample imputation: introduces an element of randomness
     • Arbitrary number / end-of-distribution imputation: alters the distribution
     • NA indicator: still need to fill in the NA
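Two of the techniques above, median imputation combined with an NA indicator, can be sketched in pandas on a hypothetical variable (column names and values are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 31.0]})

# NA indicator: keep a flag so the model can learn from missingness itself
df["age_was_missing"] = df["age"].isnull().astype(int)

# Median imputation: robust to outliers, but alters the distribution
df["age"] = df["age"].fillna(df["age"].median())

# End-of-distribution imputation would instead fill with e.g.:
# df["age"].mean() + 3 * df["age"].std()
print(df)
```

Combining an imputation method with an indicator column preserves the information that a value was originally missing, which can itself be predictive.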
  12. Labels: encoding techniques and caveats
     • One-hot encoding: expands the feature space
     • Count / frequency encoding
     • Ordinal encoding: no monotonic relationship with the target
     • Mean (target) encoding: prone to overfitting
     • Weight of evidence: must account for zero values as it uses a logarithm
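The first two encodings above can be sketched in pandas on a hypothetical categorical variable (labels are illustrative):

```python
import pandas as pd

s = pd.Series(["red", "blue", "red", "green"], name="colour")

# One-hot encoding: one binary column per label (expands the feature space)
one_hot = pd.get_dummies(s)

# Count encoding: replace each label with how often it occurs
counts = s.value_counts()
count_encoded = s.map(counts)
print(one_hot)
print(count_encoded.tolist())  # [2, 1, 2, 1]
```

Count encoding keeps the feature space compact even for high-cardinality variables, at the cost of mapping distinct labels with equal counts to the same number.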
  13. Outliers
     • Trimming: remove the observations from the dataset
     • Top / bottom coding: censor the top and bottom values
     • Discretisation: equal-frequency / equal-width / tree-induced bins
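Top/bottom coding and equal-width discretisation can be sketched with pandas on a hypothetical variable containing one outlier (the 5th/95th percentile caps and the three bins are illustrative choices):

```python
import pandas as pd

x = pd.Series([2.0, 3.0, 3.5, 4.0, 100.0])  # 100 is an outlier

# Top/bottom coding: censor values beyond chosen percentiles
lower, upper = x.quantile(0.05), x.quantile(0.95)
x_capped = x.clip(lower, upper)

# Equal-width discretisation: split the range into fixed-width bins
bins = pd.cut(x_capped, bins=3, labels=False)
print(x_capped.tolist(), bins.tolist())
```

Capping limits the outlier's influence without discarding the observation, while discretisation removes magnitude information entirely and keeps only the bin.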
  14. Feature Engineering for Machine Learning
     https://www.udemy.com/feature-engineering-for-machine-learning/
     The course gathers multiple techniques used worldwide for feature transformation, learnt from Kaggle and KDD competition websites, white papers, blogs and forums, and from my experience as a data scientist. It aims to provide a source of reference for data scientists, where they can learn and revisit the techniques and code needed to modify variables prior to their use in machine learning algorithms. DSCOACH2018 (discount voucher)