Feature Engineering for Machine Learning

Sole
February 07, 2018


I describe the problems we commonly find in the variables of a dataset, how they affect different machine learning models, and which techniques we can use to overcome them.


Transcript

  1. Soledad Galli, PhD
    Pydata, 42nd Meetup
    London 2018

  2. What is feature
    engineering?

  3. Feature engineering
    Process of using domain knowledge of the data to create features
    or variables to use in machine learning.

  4. Feature engineering
    Fundamental
    • To make machine learning algorithms work
    Time consuming
    • Big effort in data cleaning and preparation
    Key
    • Pre-processing variables is key for good machine learning models

  5. Feature engineering
    • Problems found in data
    • Impact on machine learning models
    • How to address these problems

  6. Feature engineering
    • Problems found in data
    • Impact on machine learning models
    • How to address these problems

  7. Problems in variables (1st)
    • Missing data: missing values within a variable
    • Labels: strings in categorical variables
    • Distribution: normal vs skewed
    • Outliers: unusual or unexpected values

  8. Missing data
    • Missing values for certain observations
    • Affects all machine learning models
    • Scikit-learn estimators do not accept missing values
    • Mechanisms: MCAR, MAR, MNAR (missing completely at random, at random, not at random)
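The Scikit-learn bullet refers to the fact that its estimators reject NaN, so missing values must be imputed first. A minimal sketch with `SimpleImputer` (the data here is invented for illustration):

```python
# Sketch: fill missing values with the column mean before modelling.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

# Replace NaN with the column mean (mean of 1.0 and 7.0 is 4.0)
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled[1, 0])  # 4.0
```

Note that mean imputation is only one of the options discussed later in the deck; which strategy is appropriate depends on the missingness mechanism (MCAR, MAR, MNAR).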

  9. Labels
    • Categories: strings
    • Cardinality: high number of labels
    • Rare labels: infrequent categories
    • Affect tree-based methods
    • Scikit-learn does not accept strings

  10. Distributions
    Gaussian vs Skewed
    • Linear model assumption: variables follow a Gaussian distribution
    • Other models: no assumption
    • A better spread of values may benefit performance
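One way to check the Gaussian assumption is to measure skewness before and after a transformation. A sketch using `scipy.stats.skew` on synthetic right-skewed data (the log-normal sample is invented for illustration):

```python
# Sketch: a log transform makes a right-skewed variable more Gaussian-like.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # strongly right-skewed

print(skew(x))          # strongly positive
print(skew(np.log(x)))  # near zero: roughly Gaussian after the log
```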

  11. Outliers
    • Affect linear models and Adaboost
    • Tremendous weights assigned to outliers
    • Bad generalisation

  12. Feature magnitude - scale
    Machine learning models affected by feature magnitude:
    • Linear and Logistic Regression
    • Neural Networks
    • Support Vector Machines
    • KNN
    • K-means clustering
    • Linear Discriminant Analysis (LDA)
    • Principal Component Analysis (PCA)
    Tree-based models are insensitive to feature magnitude:
    • Classification and Regression Trees
    • Random Forests
    • Gradient Boosted Trees
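For the magnitude-sensitive models above, features are commonly standardised. A minimal sketch with `StandardScaler` (the two-feature example is invented for illustration):

```python
# Sketch: put features on comparable scales for linear models, SVMs,
# KNN, PCA, etc.
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales, e.g. age and income
X = np.array([[25.0, 20_000.0],
              [35.0, 55_000.0],
              [45.0, 120_000.0]])

X_scaled = StandardScaler().fit_transform(X)
# Each column now has mean 0 and unit variance
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```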

  13. Feature engineering
    • Problems found in data
    • Impact on machine learning models
    • How to address these problems

  14. Missing data
    • Complete case analysis: may remove a big chunk of the dataset
    • Mean / median imputation: alters the distribution
    • Random sample imputation: element of randomness
    • Arbitrary number imputation
    • End of distribution imputation: alters the distribution
    • NA indicator: still need to fill in the NA
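Two of the techniques above can be combined: fill with the median while keeping an NA indicator so the model can still see where values were missing. A pandas sketch (column names invented for illustration):

```python
# Sketch: median imputation plus an NA indicator column.
import pandas as pd

df = pd.DataFrame({"age": [25.0, None, 40.0, None, 35.0]})

# Record where values were missing before filling them
df["age_was_na"] = df["age"].isna()
df["age"] = df["age"].fillna(df["age"].median())

print(df["age"].tolist())  # median of [25, 40, 35] is 35.0
```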

  15. Labels
    • One hot encoding: expands the feature space
    • Count / frequency encoding: no monotonic relationship with the target
    • Mean encoding: prone to overfitting
    • Ordinal encoding
    • Weight of evidence: must account for zero values as it uses the logarithm
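A sketch contrasting the first two encodings: one hot expands the feature space (one column per category), while count encoding keeps a single column (the example series is invented for illustration):

```python
# Sketch: one hot encoding vs count/frequency encoding of a
# categorical variable.
import pandas as pd

s = pd.Series(["red", "blue", "red", "green", "red"], name="colour")

one_hot = pd.get_dummies(s)       # expands to 3 columns
counts = s.map(s.value_counts())  # red -> 3, blue -> 1, green -> 1

print(one_hot.shape)    # (5, 3)
print(counts.tolist())  # [3, 1, 3, 1, 3]
```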

  16. Rare labels
    • Group infrequent labels under a single "Rare" category
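A minimal sketch of this grouping, using an invented 20% frequency threshold and invented data:

```python
# Sketch: replace categories below a frequency threshold with "Rare".
import pandas as pd

s = pd.Series(["a", "a", "a", "b", "b", "c", "d"])

freq = s.value_counts(normalize=True)
rare = freq[freq < 0.2].index              # categories under 20%
s_grouped = s.where(~s.isin(rare), "Rare")

print(s_grouped.tolist())  # ['a', 'a', 'a', 'b', 'b', 'Rare', 'Rare']
```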

  17. Distribution
    • Transformation: logarithm, exponentiation, reciprocal, Box-Cox
    • Discretisation: equal width, equal frequency, decision trees
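A sketch of one option from each family: a logarithm transform and equal-frequency discretisation (the fare values are invented for illustration):

```python
# Sketch: log transform and equal-frequency binning of a skewed variable.
import numpy as np
import pandas as pd

fare = pd.Series([7.25, 8.05, 13.0, 26.0, 71.3, 263.0, 512.3])

log_fare = np.log(fare)                  # logarithm transform
bins = pd.qcut(fare, q=4, labels=False)  # 4 equal-frequency bins

print(bins.tolist())  # [0, 0, 1, 1, 2, 3, 3]
```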

  18. Distribution
    [Plots: distribution of Fare (US$) after Box-Cox transformation and after discretisation]

  19. Outliers
    • Trimming: remove the observations from the dataset
    • Top / bottom coding: censor top and bottom values
    • Discretisation: equal frequency / equal width / tree induced
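A sketch of top/bottom coding, capping at the 5th and 95th percentiles (the thresholds and data are invented for illustration):

```python
# Sketch: censor extreme values at chosen percentiles instead of
# removing the observations.
import numpy as np

x = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 100.0])  # 100 is an outlier

lower, upper = np.percentile(x, [5, 95])
x_capped = np.clip(x, lower, upper)

print(x_capped.max())  # the outlier is censored at the 95th percentile
```

Unlike trimming, capping keeps the row, so no information from the other variables of that observation is lost.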

  20. Feature Engineering for Machine Learning
    https://www.udemy.com/feature-engineering-for-machine-learning/

  21. Feature Engineering for Machine Learning
    https://www.udemy.com/feature-engineering-for-machine-learning/
    I gathered multiple techniques used worldwide for feature
    transformation, learnt from the Kaggle and KDD competition
    websites, white papers, different blogs and forums, and from my
    experience as a Data Scientist.
    The aim is to provide a source of reference for data scientists,
    where they can learn and revisit the techniques and code needed
    to transform variables prior to their use in machine learning
    algorithms.
    DSCOACH2018 (discount voucher)
