
# Feature Selection & Extraction

Talking about multicollinearity, information gain, principal component analysis (PCA), and deep learning methods for feature extraction

## Ali Akbar S.

September 23, 2019

## Transcript

2. ### Outline 1. Feature selection a. Multicollinearity b. Information gain 2. Feature extraction a. Principal component analysis (PCA) b. Deep learning

4. ### Curse of Dimensionality ▫ Datasets are typically high-dimensional ▫ ML methods are statistical by nature ▫ As dimensionality grows, each region of the space holds fewer observations, while statistics needs repetition

7. ### y = w₀ + w₁x Find a linear relationship between the dependent and independent variables
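As a minimal sketch of this model (assuming scikit-learn and synthetic data, not the deck's dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))            # independent variable
y = 3.0 + 2.0 * x[:, 0] + rng.normal(0, 1, 100)  # dependent variable plus noise

model = LinearRegression().fit(x, y)
print(model.intercept_, model.coef_)             # recovers w0 ≈ 3, w1 ≈ 2
```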

13. ### Since mpg decreases more slowly than horsepower grows, we might want to take the log of horsepower
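A sketch of that transformation; the column names and synthetic values below are assumptions standing in for the Auto MPG data:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the Auto MPG columns (names are assumptions)
rng = np.random.default_rng(0)
hp = rng.uniform(50, 230, 200)
df = pd.DataFrame({"horsepower": hp,
                   "mpg": 120 - 22 * np.log(hp) + rng.normal(0, 2, 200)})

# mpg decays more slowly than horsepower grows, so fit on log(horsepower)
model = LinearRegression().fit(np.log(df[["horsepower"]]), df["mpg"])
print(model.intercept_, model.coef_)
```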

17. ### y = w₀ + w₁x₁ + w₂x₂ + w₃x₃ + ...
18. ### Example mpg = w₀ + w₁ horsepower + w₂ model_year Fitted: mpg = -28.34 - 0.17 horsepower + 0.86 model_year This will result in MAE = 3.14!
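One way to reproduce a number like this (a sketch; the CSV path is a hypothetical local copy of the Auto MPG dataset):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

df = pd.read_csv("auto-mpg.csv").dropna()        # hypothetical local copy of the dataset
X, y = df[["horsepower", "model_year"]], df["mpg"]

model = LinearRegression().fit(X, y)
print(mean_absolute_error(y, model.predict(X)))  # the slide reports MAE = 3.14
```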

20. ### Let’s try to predict mpg from horsepower and weight.

21. ### Multicollinearity [scatter plot of horsepower against weight]
22. ### The previous model yields MAE = 3.38. While it still performs well, the coefficients are somewhat meaningless now.

23. ### Another example: predicting height from the length of legs.

25. ### Multicollinearity height = 44.71 + 1.62 left_leg Great! Should we add another variable?

26. ### Multicollinearity height = 44.57 - 19.27 leg_left + 20.88 leg_right So, the longer your left leg is, the shorter you are(?)
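A sketch of why this happens, with simulated leg lengths (all numbers here are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
leg_left = rng.normal(80, 5, 500)
leg_right = leg_left + rng.normal(0, 0.1, 500)  # almost perfectly collinear
height = 44.7 + 1.6 * leg_left + rng.normal(0, 2, 500)

X = np.column_stack([leg_left, leg_right])
model = LinearRegression().fit(X, height)
# Predictions stay fine, but the two weights are unstable: large,
# opposite-signed values whose sum is still ≈ 1.6.
print(model.coef_)
```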

28. ### Non-monotonicity Some variables are non-monotonic, e.g. using price to predict revenue → need transformation
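For instance, revenue may rise and then fall as price increases. A sketch with a synthetic inverted-U relationship, handled by adding a quadratic term:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
price = rng.uniform(1, 20, size=(300, 1))
revenue = -(price[:, 0] - 10) ** 2 + 100 + rng.normal(0, 5, 300)  # rises, then falls

X_poly = PolynomialFeatures(degree=2).fit_transform(price)  # columns [1, p, p²]
model = LinearRegression(fit_intercept=False).fit(X_poly, revenue)
print(model.coef_)  # ≈ [0, 20, -1], recovering the inverted U
```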

30. ### Entropy - Measuring Impurity Entropy(S) = -p(+) log₂ p(+) - p(-) log₂ p(-), where S is the subset of the data and p(+) and p(-) are the probabilities of positive and negative cases in subset S. Interpretation: if X ∈ S, how many bits are needed to determine whether X is positive or negative?
31. ### Information Gain ▫ We want more instances in pure sets ▫ The difference between before and after splitting: Gain(S, A) = Entropy(S) - Σ_V (|S_V| / |S|) Entropy(S_V), where V is a possible value of A and S_V is the subset where X_A = V
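These two definitions translate directly to code; a small sketch for binary labels:

```python
import numpy as np

def entropy(y):
    """H(S) = -p(+) log2 p(+) - p(-) log2 p(-) for binary labels y."""
    p = np.mean(y)
    if p in (0.0, 1.0):  # a pure set needs zero bits
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def information_gain(y, a):
    """Gain(S, A) = H(S) - sum over V of |S_V|/|S| * H(S_V)."""
    total = entropy(y)
    for v in np.unique(a):
        mask = a == v
        total -= mask.mean() * entropy(y[mask])
    return total

y = np.array([1, 1, 0, 0, 1, 0])            # binary class labels
a = np.array(["x", "x", "x", "y", "y", "y"])  # a candidate attribute
print(entropy(y), information_gain(y, a))
```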
32. ### Reducing features from 30 to 14 to predict the survival of Titanic passengers (Besbes, 2016)

36. ### References 1. VanderPlas, J. (2016). Python Data Science Handbook. (In Depth: Principal Component Analysis) 2. Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of Massive Datasets. Cambridge University Press. (Sections 11.1-11.2)
37. ### Curse of Dimensionality ▫ High-dimensional datasets, e.g. images, text, speech ▫ Only a few “useful” dimensions ▫ Attributes with low variance

41. ### Feature Extraction ▫ Speech → MFCC ▫ Text → bag of words, TF-IDF ▫ Image → Scale-Invariant Feature Transform (SIFT)
42. ### Principal Component Analysis 1. Defines a set of principal components a. 1st: direction of the greatest variability in the data b. 2nd: perpendicular to the 1st, greatest variability of what’s left c. ... and so on until d (the original dimensionality) 2. The first m << d components become the m new dimensions

48. ### Finding Principal Components 1. Center the data at zero: x_{i,a} ← x_{i,a} - µ_a 2. Compute the covariance matrix Σ 3. Find the eigenvectors e of that matrix
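A sketch of these three steps in NumPy, on synthetic data drawn with the covariance matrix from the worked example below:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([3, 5], [[2.0, 0.8], [0.8, 0.6]], size=500)

X_centered = X - X.mean(axis=0)           # 1. center the data at zero
cov = np.cov(X_centered, rowvar=False)    # 2. compute the covariance matrix Σ
eigvals, eigvecs = np.linalg.eigh(cov)    # 3. eigenvectors of Σ
# eigh returns eigenvalues in ascending order; principal components come last.
print(eigvals[::-1], eigvecs[:, ::-1])
```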
49. ### Finding Principal Components - Illustration Let’s try to use the covariance matrix to transform a random vector!

51. ### Principal Components ▫ We want vectors e which aren’t turned: Σe = λe ▫ Solve det(Σ - λI) = 0 ▫ Find the i-th eigenvector by solving Σeᵢ = λᵢeᵢ ▫ Principal components are the eigenvectors with the largest eigenvalues
52. ### Example Given Σ = [2.0 0.8; 0.8 0.6], the eigenvalues are given by det(Σ - λI) = (2.0 - λ)(0.6 - λ) - 0.8² = λ² - 2.6λ + 0.56 = 0. The eigenvalues are then λ₁ ≈ 2.36 and λ₂ ≈ 0.24.
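That arithmetic can be checked numerically:

```python
import numpy as np

cov = np.array([[2.0, 0.8], [0.8, 0.6]])
print(np.linalg.eigvals(cov))  # eigenvalues ≈ 2.36 and 0.24
```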
54. ### Projecting to New Dimensions ▫ e₁...eₘ are the new dimension vectors ▫ For every data point xᵢ: - Center to the mean, i.e. xᵢ - µ - Project to the new dimensions, i.e. (xᵢ - µ)ᵀeⱼ for j = 1...m
55. ### How many dimensions? Pick the eᵢ that “explain” the most variance ▫ Sort the eigenvectors such that λ₁ ≥ λ₂ ≥ ... ≥ λ_d ▫ Pick the first m eigenvectors which explain 90% or 95% of the total variance
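Putting the last two slides together: a sketch that sorts the eigenvectors, picks m with the 95% threshold the slide suggests, and projects (data again synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([3, 5], [[2.0, 0.8], [0.8, 0.6]], size=500)

Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]              # λ1 ≥ λ2 ≥ ...
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = np.cumsum(eigvals) / eigvals.sum()
m = int(np.searchsorted(explained, 0.95) + 1)  # smallest m reaching 95%
Z = Xc @ eigvecs[:, :m]                        # (xᵢ - µ)ᵀeⱼ projections
print(m, Z.shape)
```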


59. ### Latent Semantic Analysis (LSA) ▫ Basically, PCA on text data ▫ Truncated Singular Value Decomposition (SVD) ▫ Captures co-occurring words
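A sketch of LSA with scikit-learn's TruncatedSVD over TF-IDF features (the toy documents are made up):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "a cat and a dog",
        "dogs chase cats",
        "stock markets fell today"]

tfidf = TfidfVectorizer().fit_transform(docs)        # document-term matrix
lsa = TruncatedSVD(n_components=2, random_state=0)   # latent "topics" of co-occurring words
topics = lsa.fit_transform(tfidf)
print(topics.shape)  # (4 documents, 2 latent dimensions)
```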
60. ### UKARA 1.0 Challenge Track 1 ▫ Automatic short-answer scoring ▫ Precision, recall, and F1 scores on training set A (5-fold CV): - Without SVD (578 unique tokens): 84.52% | 92.15% | 88.06% - With SVD (100 features): 79.38% | 98.43% | 87.86% ▫ 3rd place on the test set

62. ### Some Issues ▫ Sensitive to outliers when computing the covariance matrix → normalise the data ▫ Linearity assumption → transformation
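A common remedy is to standardise the features before PCA; a sketch on a built-in scikit-learn dataset:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = load_wine().data
# Standardise each feature so no single scale dominates the covariance matrix.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
print(pipeline.fit_transform(X).shape)  # (178, 2)
```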

64. ### References ▫ Chollet, F. (May 2016). “Building Autoencoders in Keras”. The Keras Blog. ▫ Wibisono, O. (October 2017). “Autoencoder: Alternatif PCA yang Lebih Mumpuni” [Autoencoder: A More Capable Alternative to PCA]. Tentang Data.
65. ### Autoencoder: a neural network trained to map an input to itself (Chollet, 2016)
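A minimal sketch in the spirit of Chollet's post: one 32-unit bottleneck on flattened MNIST digits. The layer sizes and training settings are illustrative choices, not necessarily the deck's exact setup.

```python
from tensorflow import keras
from tensorflow.keras import layers

# 784-pixel MNIST input compressed to a 32-dimensional code, then rebuilt
inputs = keras.Input(shape=(784,))
encoded = layers.Dense(32, activation="relu")(inputs)       # bottleneck code
decoded = layers.Dense(784, activation="sigmoid")(encoded)  # reconstruction

autoencoder = keras.Model(inputs, decoded)  # trained to map the input to itself
encoder = keras.Model(inputs, encoded)      # reusable low-dimensional feature extractor
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

(x_train, _), _ = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
autoencoder.fit(x_train, x_train, epochs=5, batch_size=256)
```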

67. ### Dimensionality reduction of MNIST using AE (above) and PCA (below) (Wibisono, 2017)

70. ### Thank you! Ali Akbar Septiandri Twitter: @__aliakbars__ Home: http://uai.aliakbars.com Airy: https://medium.com/airy-science