
Engineering and selecting features for machine learning

Sole
October 17, 2018


Overview of the machine learning pipeline and the different techniques available to engineer and select features prior to model building and model deployment into production.

The final slide contains links to the online resources recommended in the deck. Download it and have a look! Enjoy!

Transcript

1. Soledad Galli, PhD. DSF meetup with Busuu, London, 16th October 2018. Engineering and selecting features for machine learning.
2. Machine Learning Pipeline (diagram): Data Sources → DATA → Gradient Boosted Trees or Neural Networks → Average Probability or Continuous Output.
3. Machine Learning Pipeline (diagram repeated).
4. Machine Learning Pipeline (diagram repeated).
5. Machine Learning Pipeline (diagram): as before, with Feature Engineering and Feature Selection added to the pipeline.
6. Data Pre-processing Journey: • Common issues found in variables • Feature / Variable engineering: solutions to the data issues • Feature selection: do we need to select features? • Feature / Variable selection methods • Overview and knowledge sources
7. Data Pre-processing Journey (agenda revisited; same items as slide 6).
8. Problems in Variables: • Missing data: missing values within a variable • Labels: strings in categorical variables • Distribution: normal vs skewed • Outliers: unusual or unexpected values
9. Missing Data: • Missing values for certain observations • Affects all machine learning models • Scikit-learn (estimators do not accept missing values) • Missingness can be random or systematic
10. Labels in categorical variables: • Categories: strings • Rare labels: infrequent categories • Cardinality: high number of labels • Can lead to overfitting in tree-based algorithms • Scikit-learn (estimators require numeric inputs)
11. Distributions (Gaussian vs skewed): • Linear model assumption: variables follow a Gaussian distribution • Other models: no assumption • A better spread of values may benefit performance
12. Feature Magnitude / Scale. Machine learning models sensitive to feature scale: Linear and Logistic Regression, Neural Networks, Support Vector Machines, KNN, K-means clustering, Linear Discriminant Analysis (LDA), Principal Component Analysis (PCA). Tree-based models insensitive to feature scale: Classification and Regression Trees, Random Forests, Gradient Boosted Trees.
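
Not part of the deck: a minimal sketch of scaling features for the scale-sensitive models listed above, assuming scikit-learn's StandardScaler and a small, made-up pandas DataFrame.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# toy numeric data (hypothetical values, for illustration only)
X = pd.DataFrame({"age": [25, 40, 31, 58],
                  "income": [30_000, 85_000, 52_000, 120_000]})

# StandardScaler removes the mean and scales to unit variance,
# which helps scale-sensitive models (linear models, SVMs, KNN, PCA, ...)
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
print(X_scaled)
```
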
13. Data Pre-processing Journey (agenda revisited).
14. Missing Data Imputation: Complete case analysis, Mean / median imputation, Random sample imputation, Arbitrary number imputation, End-of-distribution imputation, Binary NA indicator. Caveats: • May remove a big chunk of the dataset • Alters the distribution • Element of randomness • Still need to fill in the NA • Alters the distribution
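
A minimal sketch (not from the slides) of median imputation combined with a binary NA indicator, two of the techniques above, using pandas on a made-up variable.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 58, np.nan]})

# binary NA indicator, captured before filling the gaps
df["age_na"] = df["age"].isna().astype(int)

# median imputation: fill missing values with the median of the observed values
df["age"] = df["age"].fillna(df["age"].median())
print(df)
```
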
15. More on Missing Data Imputation. AI-derived NA imputation: use neighbouring variables to predict the missing value (KNN, regression). More complex.
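
As an illustration of neighbour-based imputation, a hedged sketch using scikit-learn's KNNImputer (added to scikit-learn in version 0.22, after this talk); the data are invented.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

X = pd.DataFrame({
    "height": [1.62, 1.75, np.nan, 1.80],
    "weight": [60.0, 78.0, 70.0, np.nan],
})

# each missing value is replaced by the mean of that feature over the
# k nearest rows, with distances computed on the observed features
imputer = KNNImputer(n_neighbors=2)
X_imputed = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)
print(X_imputed)
```
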
16. Label Encoding: One hot encoding, Count / frequency encoding, Mean encoding, Ordinal encoding, Weight of evidence. The slide shows a small worked example: a Color column mapped to counts (2, 2, 2, 1, 2), to target means (0.5, 0.5, 1, 0, 1) and to ordinal integers (2, 2, 1, 3, 1), alongside the Target column (0, 1, 1, 0, 1).
17. Label Encoding caveats: • One hot encoding: expands the feature space • Weight of evidence: must account for zero values as it uses the logarithm • Mean encoding: prone to overfitting • Ordinal encoding: no monotonic relationship with the target • Count / frequency encoding
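
A small sketch, not from the deck, reproducing count, mean and one hot encoding with pandas on a toy Color/Target table similar to the one on slide 16.

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["blue", "blue", "red", "green", "red"],
    "target": [0, 1, 1, 0, 1],
})

# count encoding: replace each category with how often it appears
counts = df["color"].value_counts()
df["color_count"] = df["color"].map(counts)

# mean (target) encoding: replace each category with the mean of the target;
# prone to overfitting, so in practice compute the means on the training set only
means = df.groupby("color")["target"].mean()
df["color_mean"] = df["color"].map(means)

# one hot encoding: one binary column per category (expands the feature space)
df = pd.concat([df, pd.get_dummies(df["color"], prefix="color")], axis=1)
print(df)
```
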
18. Distribution: Gaussian Transformation (skewed → Gaussian). Variable transformations: • Logarithmic: ln(x) • Exponential: x^k, for any power k • Reciprocal: 1 / x • Box-Cox: (x^λ − 1) / λ, with λ varying from -5 to 5
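
A brief sketch of the transformations above, assuming NumPy and SciPy (scipy.stats.boxcox estimates λ by maximum likelihood); the skewed variable is simulated.

```python
import numpy as np
from scipy import stats

# a skewed, strictly positive variable (synthetic example)
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1_000)

x_log = np.log(x)                  # logarithmic transformation, ln(x)
x_recip = 1.0 / x                  # reciprocal transformation
x_boxcox, lam = stats.boxcox(x)    # Box-Cox: (x**lam - 1) / lam

print(f"estimated Box-Cox lambda: {lam:.2f}")
```
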
19. Distribution: Discretisation (skewed → improved value spread): • Equal-width bins: bin width = (max − min) / number of bins; generally does not improve the spread • Equal-frequency bins: bin boundaries determined by quantiles, equal number of observations per bin; generally improves the spread
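
A minimal sketch of both binning strategies with pandas (pd.cut for equal width, pd.qcut for equal frequency) on a simulated skewed variable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = pd.Series(rng.exponential(scale=30_000, size=1_000))

# equal-width bins: width = (max - min) / n_bins; bins can end up very unevenly populated
equal_width = pd.cut(income, bins=5)

# equal-frequency bins: cut points are quantiles, so each bin holds
# roughly the same number of observations
equal_freq = pd.qcut(income, q=5)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```
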
20. Outliers: • Trimming: remove the observations from the dataset • Top / bottom coding: cap the top and bottom values • Discretisation: equal-frequency / equal-width bins
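
A short sketch, not from the slides, of top / bottom coding (and trimming) at the 5th and 95th percentiles using pandas; the percentile cut-offs are arbitrary choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.normal(50, 10, size=1_000))

# top / bottom coding: cap values beyond chosen percentiles (here the 5th and 95th)
lower, upper = x.quantile(0.05), x.quantile(0.95)
x_capped = x.clip(lower=lower, upper=upper)

# trimming would instead drop the observations outside the caps
x_trimmed = x[(x >= lower) & (x <= upper)]
```
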
21. Data Pre-processing Journey (agenda revisited).
22. Why Do We Select Features? • Simpler models are easier to interpret • Shorter training times • Enhanced generalisation by reducing overfitting • Easier to implement by software developers → model production • Reduced risk of data errors during model use • Data redundancy
23. Variable Redundancy: • Constant variables: only 1 value per variable • Quasi-constant variables: more than 99% of observations show the same value • Duplication: the same variable appears multiple times in the dataset • Correlation: correlated variables provide the same information
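
A hand-rolled sketch, not from the deck, of removing the four kinds of redundancy listed above with pandas, assuming an all-numeric DataFrame; the thresholds are illustrative.

```python
import pandas as pd

def drop_redundant(features: pd.DataFrame,
                   quasi_constant_threshold: float = 0.99,
                   correlation_threshold: float = 0.9) -> pd.DataFrame:
    """Drop constant / quasi-constant, duplicated and highly correlated columns."""
    # constant and quasi-constant: the most frequent value covers > threshold of the rows
    quasi = [col for col in features.columns
             if features[col].value_counts(normalize=True).iloc[0] > quasi_constant_threshold]
    features = features.drop(columns=quasi)

    # duplicated variables: the same column present more than once under different names
    features = features.T.drop_duplicates().T

    # correlated variables: for each highly correlated pair, drop the second one
    corr = features.corr().abs()
    to_drop = set()
    for i, col_i in enumerate(corr.columns):
        if col_i in to_drop:
            continue
        for col_j in corr.columns[i + 1:]:
            if corr.loc[col_i, col_j] > correlation_threshold:
                to_drop.add(col_j)
    return features.drop(columns=list(to_drop))
```
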
24. Data Pre-processing Journey (agenda revisited).
25. Filter methods: independent of the ML algorithm, based only on variable characteristics. Pros: fast computation, model agnostic, quick feature removal. Cons: poorer model performance, does not capture feature interaction, does not capture redundancy.
26. Filter methods: rank features according to certain criteria and select the highest-ranking features. Criteria: Chi-square / Fisher score, univariate parametric tests (ANOVA), mutual information, variance.
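
A minimal sketch of a filter method with scikit-learn: rank features by mutual information with the target and keep the top 10; the dataset and the value of k are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# rank features by mutual information with the target and keep the 10 highest ranking;
# the score looks only at each variable, not at the model or at feature interactions
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)  # (569, 10)
```
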
27. Wrapper methods: evaluate subsets of features for a given ML algorithm. Pros: best feature subset for a given algorithm, best performance, considers feature interaction, considers the ML algorithm. Cons: often impracticable, computationally expensive, not model agnostic.
28. Wrapper methods: search for a subset of features, build an ML model with the selected subset, evaluate model performance, repeat. • Forward feature selection: adds 1 feature at a time • Backward feature elimination: removes 1 feature at a time • Exhaustive feature search: searches across all possible feature combinations
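
A sketch of forward feature selection, assuming scikit-learn's SequentialFeatureSelector (available from scikit-learn 0.24 onwards); the estimator and the number of features to keep are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# forward selection: start from no features and repeatedly add the one feature
# whose addition gives the best cross-validated score for this particular model
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
sfs = SequentialFeatureSelector(model, n_features_to_select=5,
                                direction="forward", cv=3)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the 5 selected features
```
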
29. Embedded methods: train an ML model, derive the feature importance, remove non-important features. Examples: LASSO, tree-derived feature importance, regression coefficients.
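
A short sketch of two of the embedded approaches named above, LASSO-style L1 selection and tree-derived importance, via scikit-learn's SelectFromModel; the hyperparameters are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# LASSO-style selection: an L1 penalty drives the coefficients of
# non-important features to exactly zero; SelectFromModel keeps the rest
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
lasso_selector = SelectFromModel(l1_model).fit(StandardScaler().fit_transform(X), y)
print("L1 selection keeps", lasso_selector.get_support().sum(), "features")

# tree-derived importance: keep features whose importance is above the mean importance
rf = RandomForestClassifier(n_estimators=100, random_state=0)
tree_selector = SelectFromModel(rf).fit(X, y)
print("Tree importance keeps", tree_selector.get_support().sum(), "features")
```
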
30. Data Pre-processing Journey (agenda revisited).
31. Knowledge Resources: • Feature Engineering + Selection: Udemy.com, includes code • Summary of learnings from the winners • Feature Engine: Python package for feature engineering (work in progress)