Feature Selection & Extraction

Ali Akbar S.
September 23, 2019

Talking about multicollinearity, information gain, principal component analysis (PCA), and deep learning methods for feature extraction

Transcript

  1. Outline: 1. Feature selection (a. Multicollinearity, b. Information gain); 2. Feature extraction (a. Principal component analysis (PCA), b. Deep learning)
  2. Curse of Dimensionality ▫ Datasets are typically high-dimensional ▫ ML methods are statistical by nature ▫ As dimensionality grows, there are fewer observations per region, while statistics needs repetition
  3. Find a linear relationship between the dependent and independent variables: y = w_0 + w_1 x
  4. y = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + ...
  5. Example: mpg = w_0 + w_1 horsepower + w_2 model_year. Fitted: mpg = -28.34 - 0.17 horsepower + 0.86 model_year. This will result in MAE = 3.14!
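A model like this is typically fit by ordinary least squares. Below is a minimal sketch with NumPy on synthetic data; the coefficients and noise level are invented for illustration and are not the slide's actual mpg results:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for two predictors (e.g. horsepower, model_year).
X = rng.normal(size=(200, 2))
y = -28.0 - 0.2 * X[:, 0] + 0.9 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Add an intercept column and solve for the weights w = (w_0, w_1, w_2).
A = np.column_stack([np.ones(len(X)), X])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

mae = np.abs(A @ w - y).mean()  # mean absolute error on the training data
```

The recovered weights land close to the values used to generate the data, which is exactly what the fitted mpg equation on the slide represents.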
  6. The previous model yields MAE = 3.38. While it still performs well, the coefficients are somewhat meaningless now.
  8. Multicollinearity: height = 44.57 - 19.27 leg_left + 20.88 leg_right. So, the longer your left leg is, the shorter you are(?)
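This instability is easy to reproduce with two nearly identical features. In the sketch below (all numbers invented for illustration), regressing a synthetic height on two almost perfectly correlated leg lengths gives unreliable individual coefficients, while their sum stays close to the true combined effect:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
leg = rng.normal(90, 5, size=n)                     # true leg length (cm)
leg_left = leg + rng.normal(scale=0.1, size=n)      # two nearly identical,
leg_right = leg + rng.normal(scale=0.1, size=n)     # i.e. collinear, measurements
height = 100 + 0.9 * leg + rng.normal(scale=2, size=n)

A = np.column_stack([np.ones(n), leg_left, leg_right])
w, *_ = np.linalg.lstsq(A, height, rcond=None)
# w[1] and w[2] individually are unstable (and can have opposite signs),
# but w[1] + w[2] stays near the true combined effect of 0.9.
```

Tiny changes in the data can swing the two leg coefficients wildly in opposite directions, which is why the slide's fitted model assigns a large negative weight to one leg and a large positive weight to the other.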
  9. Entropy: Measuring Impurity. Entropy(S) = -p(+) log2 p(+) - p(-) log2 p(-), where S is a subset of the data and p(+) and p(-) are the probabilities of positive and negative cases in subset S. Interpretation: if X ∈ S, how many bits are needed to determine whether X is positive or negative?
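The definition translates directly into code; a minimal sketch for the binary case:

```python
import numpy as np

def entropy(p_pos):
    """Entropy of a binary subset S, given p(+), the fraction of positives."""
    probs = np.array([p_pos, 1.0 - p_pos])
    probs = probs[probs > 0]              # treat 0 * log2(0) as 0
    return float(-(probs * np.log2(probs)).sum())

print(entropy(0.5))   # maximally impure: 1.0 bit
print(entropy(1.0))   # pure subset: 0.0 bits
```

A 50/50 subset needs a full bit to resolve an instance's class; a pure subset needs none.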
  10. Information Gain ▫ We want more instances in pure sets ▫ Gain is the difference in entropy before and after splitting on attribute A: Gain(S, A) = Entropy(S) - Σ_V (|S_V| / |S|) Entropy(S_V), where V ranges over the possible values of A and S_V is the subset where X_A = V
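A sketch of that formula in code, with toy labels and attribute values of my own invention:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, attribute):
    """Gain(S, A) = Entropy(S) - sum over values V of |S_V|/|S| * Entropy(S_V)."""
    labels, attribute = np.asarray(labels), np.asarray(attribute)
    gain = entropy(labels)
    for v in np.unique(attribute):
        mask = attribute == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain

y = [1, 1, 0, 0]
print(information_gain(y, ["a", "a", "b", "b"]))  # perfectly predictive: 1.0
print(information_gain(y, ["a", "b", "a", "b"]))  # uninformative: 0.0
```

An attribute that splits the data into pure subsets recovers all of the labels' entropy; an attribute whose subsets are as impure as the whole set gains nothing, which is the basis for the feature ranking on the next slide.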
  11. Reducing features from 30 to 14 to predict the survival of Titanic passengers (Besbes, 2016)
  12. References: 1. VanderPlas, J. (2016). Python Data Science Handbook. (In Depth: Principal Component Analysis) 2. Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of Massive Datasets. Cambridge University Press. (Sections 11.1-11.2)
  13. Curse of Dimensionality ▫ High-dimensional datasets, e.g. images, text, speech ▫ Only a few “useful” dimensions ▫ Attributes with low variance
  14. Feature Extraction ▫ Speech → MFCC ▫ Text → bag of words, TF-IDF ▫ Image → Scale-Invariant Feature Transform (SIFT)
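As an example of text feature extraction, TF-IDF fits in a few lines. This is a toy sketch with a plain log IDF; production implementations such as scikit-learn's TfidfVectorizer differ in smoothing and normalisation details:

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog sat", "the dog barked"]      # toy corpus
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

n = len(docs)
df = {w: sum(w in doc for doc in tokenized) for w in vocab}  # document frequency

def tfidf(doc):
    counts = Counter(doc)
    # term frequency * inverse document frequency for every vocabulary word
    return [counts[w] / len(doc) * math.log(n / df[w]) for w in vocab]

vectors = [tfidf(doc) for doc in tokenized]
# "the" occurs in every document, so its idf (and hence its weight) is 0.
```

Ubiquitous words are down-weighted to zero while distinctive words keep positive weight, which is exactly the property that makes TF-IDF vectors useful features.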
  15. Principal Component Analysis 1. Defines a set of principal components: a. 1st: the direction of the greatest variability in the data; b. 2nd: perpendicular to the 1st, the greatest variability of what’s left; c. ... and so on up to d (the original dimensionality) 2. The first m << d components become the m new dimensions
  16. Finding Principal Components: 1. Center the data at zero: x_{i,a} ← x_{i,a} - µ_a 2. Compute the covariance matrix Σ 3. Find the eigenvectors e of that matrix
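These three steps translate almost directly into NumPy. A sketch on synthetic correlated data (the mixing matrix is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data with correlated features.
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.8], [0.0, 0.5]])

Xc = X - X.mean(axis=0)                 # 1. center the data at zero
cov = np.cov(Xc, rowvar=False)          # 2. covariance matrix Σ
eigvals, eigvecs = np.linalg.eigh(cov)  # 3. eigenvectors of Σ (symmetric)

order = np.argsort(eigvals)[::-1]       # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# eigvecs[:, 0] is now the first principal component.
```

`eigh` is used rather than `eig` because a covariance matrix is symmetric, which guarantees real eigenvalues and orthogonal eigenvectors.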
  17. Finding Principal Components (Illustration): let’s try using the covariance matrix to transform a random vector!
  24. Principal Components ▫ We want vectors e which aren’t turned: Σe = λe ▫ Solve det(Σ - λI) = 0 ▫ Find the ith eigenvector by solving Σe_i = λ_i e_i ▫ Principal components are the eigenvectors with the largest eigenvalues
  25. Example: Given Σ = [2.0 0.8; 0.8 0.6], the eigenvalues are given by det(Σ - λI) = (2.0 - λ)(0.6 - λ) - 0.8² = λ² - 2.6λ + 0.56 = 0. The eigenvalues are then λ_1 ≈ 2.36 and λ_2 ≈ 0.24.
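The slide's example can be checked numerically; a quick sketch with NumPy:

```python
import numpy as np

cov = np.array([[2.0, 0.8],
                [0.8, 0.6]])

# Roots of det(Σ - λI) = 0, returned in ascending order.
eigvals = np.linalg.eigvalsh(cov)
print(eigvals.round(2))  # [0.24 2.36]
```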
  26. Projecting to New Dimensions ▫ e_1 ... e_m are the new dimension vectors ▫ For every data point x_i: center to the mean, i.e. x_i - µ, then project to the new dimensions, i.e. (x_i - µ)^T e_j for j = 1...m
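In code, the projection step is a single matrix product once the eigenvectors are in hand (a sketch on toy data; eigenvector computation as on the earlier slide):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))            # toy data, d = 3
mu = X.mean(axis=0)

vals, vecs = np.linalg.eigh(np.cov(X - mu, rowvar=False))
E = vecs[:, np.argsort(vals)[::-1][:2]]  # e_1, e_2 as columns (m = 2)

# Project every centered point: row i of Z holds (x_i - µ)^T e_j for j = 1..m.
Z = (X - mu) @ E
```

Each of the 100 points in 3 dimensions becomes a point in the 2 new dimensions, and the projected data are centered at zero by construction.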
  27. How many dimensions? Pick the e_i that “explain” the most variance ▫ Sort the eigenvectors such that λ_1 ≥ λ_2 ≥ ... ≥ λ_d ▫ Pick the first m eigenvectors that explain 90% or 95% of the total variance
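Given a sorted spectrum of eigenvalues (the values below are hypothetical), the cutoff m follows from the cumulative explained variance:

```python
import numpy as np

# Hypothetical eigenvalues, already sorted so that λ_1 ≥ λ_2 ≥ ... ≥ λ_d.
eigvals = np.array([4.2, 2.1, 0.9, 0.5, 0.2, 0.1])

explained = np.cumsum(eigvals) / eigvals.sum()   # fraction of total variance
m = int(np.searchsorted(explained, 0.95)) + 1    # smallest m reaching 95%
print(m)  # 4
```

Here the first 4 of 6 components already carry over 95% of the variance, so the last 2 dimensions can be dropped.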
  28. Latent Semantic Analysis (LSA) ▫ Basically, PCA on text data ▫ Truncated Singular Value Decomposition (SVD) ▫ Captures co-occurring words
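A minimal sketch of truncated SVD on a toy term-document count matrix (pure NumPy; the matrix is invented for illustration, and scikit-learn's TruncatedSVD does the same thing at scale):

```python
import numpy as np

# Toy count matrix: rows are documents, columns are terms.
# Docs 0-1 share one topic's words; docs 2-3 share the other's.
X = np.array([[2., 1., 0., 0.],
              [1., 2., 0., 0.],
              [0., 0., 1., 2.],
              [0., 0., 2., 1.]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                  # keep the k largest singular values
docs_k = U[:, :k] * s[:k]              # documents in the k-dim latent space
# Documents that use co-occurring words end up close together in this space.
```

After truncation, documents 0 and 1 land near each other and far from documents 2 and 3, which is the "capturing co-occurring words" effect the slide mentions.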
  29. UKARA 1.0 Challenge Track 1 ▫ Automatic short-answer scoring ▫ Precision, recall, and F1 scores on training set A (5-fold CV): without SVD (578 unique tokens): 84.52% | 92.15% | 88.06%; with SVD (100 features): 79.38% | 98.43% | 87.86% ▫ 3rd place on the test set
  30. Some Issues ▫ Sensitive to outliers when computing the covariance matrix → normalise the data ▫ Linearity assumption → transformation
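Normalising here usually means standardising each feature before computing the covariance matrix, so that no single large-scale feature dominates the components. A sketch (the feature scales are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Features on wildly different scales (e.g. metres vs. millimetres).
X = rng.normal(size=(200, 3)) * np.array([1.0, 1000.0, 0.01])

# Standardise to zero mean and unit variance per feature before PCA.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
```

Without this step the second feature's huge variance would dominate the covariance matrix and the first principal component would simply point along it.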
  31. References ▫ Chollet, F. (May 2016). “Building Autoencoders in Keras”. The Keras Blog. ▫ Wibisono, O. (October 2017). “Autoencoder: Alternatif PCA yang Lebih Mumpuni” (Autoencoder: A More Capable Alternative to PCA). Tentang Data.