Feature Selection & Extraction

Ali Akbar S.
September 23, 2019


Talking about multicollinearity, information gain, principal component analysis (PCA), and deep learning methods for feature extraction


Transcript

  1. Feature Selection & Extraction Ali Akbar Septiandri

  2. Outline 1. Feature selection a. Multicollinearity b. Information gain 2. Feature extraction a. Principal component analysis (PCA) b. Deep learning
  3. First thing first, why?

  4. Curse of Dimensionality ▫ Datasets are typically high-dimensional ▫ ML methods are statistical by nature ▫ As dimensionality grows, there are fewer observations per region, while statistics needs repetition
  5. FEATURE SELECTION

  6. Multicollinearity

  7. y = w_0 + w_1 x: find a linear relationship between the dependent and independent variables
  8. weight = 809.98 + 21.20 × horsepower

  9. mpg = 39.15 - 0.16 × horsepower

  10. We want a line that minimizes the residuals

  11. Residual plot to measure goodness of fit

  12. Mean Absolute Error = 3.89. Can we do better?
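
    A minimal sketch of the fit above (not part of the deck), assuming the classic Auto MPG data in a pandas DataFrame; the file name and column names are assumptions:

      import pandas as pd
      from sklearn.linear_model import LinearRegression
      from sklearn.metrics import mean_absolute_error

      # Assumed Auto MPG-style table with 'horsepower' and 'mpg' columns.
      df = pd.read_csv("auto-mpg.csv").dropna(subset=["horsepower", "mpg"])

      model = LinearRegression().fit(df[["horsepower"]], df["mpg"])
      print(model.intercept_, model.coef_)      # roughly the w_0 and w_1 shown above
      print(mean_absolute_error(df["mpg"], model.predict(df[["horsepower"]])))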

  13. Since mpg decreases more slowly than horsepower increases, we might want to take the log of horsepower
  14. y = w_0 + w_1 log(x_1)
  15. MAE = 3.56
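
    The log-transformed variant from slides 13-14, sketched under the same assumptions about the data file and column names:

      import numpy as np
      import pandas as pd
      from sklearn.linear_model import LinearRegression
      from sklearn.metrics import mean_absolute_error

      df = pd.read_csv("auto-mpg.csv").dropna(subset=["horsepower", "mpg"])
      X_log = np.log(df[["horsepower"]])         # y = w_0 + w_1 log(x_1)
      model = LinearRegression().fit(X_log, df["mpg"])
      print(mean_absolute_error(df["mpg"], model.predict(X_log)))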

  16. Can we do even better?

  17. y = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + ...
  18. Example: mpg = w_0 + w_1 horsepower + w_2 model_year. The fitted model mpg = -28.34 - 0.17 horsepower + 0.86 model_year results in MAE = 3.14!
  19. Potential Problems: multicollinearity, non-monotonicity

  20. Multicollinearity: Let’s try to predict mpg from horsepower and weight.
  21. Multicollinearity: mpg = 44.03 - 0.01 horsepower - 0.03 weight
  22. The previous model yields MAE = 3.38. While it is still performing well, the coefficients are somewhat meaningless now.
  23. Multicollinearity. Another example: let’s try to predict height from the length of your legs.
  24. Demo

  25. Multicollinearity: height = 44.71 + 1.62 left_leg. Great! Should we add another variable?
  26. Multicollinearity: height = 44.57 - 19.27 leg_left + 20.88 leg_right. So, the longer your left leg is, the shorter you are(?)
  27. Correlation heatmap
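
    A sketch of how such a correlation heatmap (plus a variance inflation factor check, not shown in the deck) could be produced; the data file and column names are assumptions:

      import pandas as pd
      import seaborn as sns
      import matplotlib.pyplot as plt
      from statsmodels.stats.outliers_influence import variance_inflation_factor

      df = pd.read_csv("auto-mpg.csv").dropna()
      features = df[["horsepower", "weight", "model_year"]]

      # Correlation heatmap: strongly correlated predictors hint at multicollinearity.
      sns.heatmap(features.corr(), annot=True, cmap="coolwarm")
      plt.show()

      # Variance inflation factor: values well above ~5-10 are a warning sign.
      for i, col in enumerate(features.columns):
          print(col, variance_inflation_factor(features.values, i))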

  28. Non-monotonicity: Some variables are non-monotonic, e.g. using price to predict revenue → need transformation
  29. Information Gain

  30. Entropy - Measuring Impurity: Entropy(S) = -p_(+) log2 p_(+) - p_(-) log2 p_(-), where S is a subset of the data and p_(+) and p_(-) are the probabilities of positive and negative cases in subset S. Interpretation: if X ∈ S, how many bits are needed to determine whether X is positive or negative?
  31. Information Gain ▫ We want more instances in pure sets ▫ The difference in entropy before and after splitting on attribute A: Gain(S, A) = Entropy(S) - Σ_V (|S_V| / |S|) Entropy(S_V), where V ranges over the possible values of A and S_V is the subset where X_A = V
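
    A small sketch of the two formulas above for binary labels (illustrative code, not from the deck):

      import numpy as np

      def entropy(labels):
          """Impurity of a set of binary labels, in bits."""
          p = np.bincount(labels, minlength=2) / len(labels)
          p = p[p > 0]                              # treat 0 log 0 as 0
          return -np.sum(p * np.log2(p))

      def information_gain(labels, attribute_values):
          """Entropy(S) minus the weighted entropy of the subsets S_V."""
          gain = entropy(labels)
          for v in np.unique(attribute_values):
              subset = labels[attribute_values == v]
              gain -= len(subset) / len(labels) * entropy(subset)
          return gain

      # A perfectly informative split recovers the full parent entropy.
      y = np.array([1, 1, 0, 0])
      a = np.array([0, 0, 1, 1])
      print(entropy(y), information_gain(y, a))     # 1.0, 1.0
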
  32. Reducing features from 30 to 14 to predict the survival

    of Titanic passengers (Besbes, 2016)
  33. Top 4% solution on Kaggle Titanic challenge

  34. FEATURE EXTRACTION

  35. Principal Component Analysis

  36. References 1. VanderPlas, J. (2016). Python Data Science Handbook. (In Depth: Principal Component Analysis) 2. Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of Massive Datasets. Cambridge University Press. (Sections 11.1-11.2)
  37. Curse of Dimensionality ▫ High-dimensional datasets, e.g. images, text, speech

    ▫ Only a few “useful” dimensions ▫ Attributes with low variance
  38. In MNIST, only a few pixels matter out of R^784

  39. In social networks, not all people are connected

  40. What else?

  41. Feature Extraction ▫ Speech → MFCC ▫ Text → bag of words, TF-IDF ▫ Image → Scale-Invariant Feature Transform (SIFT)
  42. Principal Component Analysis 1. Defines a set of principal components

    a. 1st: direction of the greatest variability in the data b. 2nd: perpendicular to 1st, greatest variability of what’s left c. … and so on until d (original dimensionality) 2. First m << d components become m new dimensions
  43. Example data in 2D (VanderPlas, 2016)

  44. Two principal components (VanderPlas, 2016)

  45. Data projection using principal components (VanderPlas, 2016)

  46. PCA on Iris dataset

  47. PCA on Iris dataset
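
    A sketch of the Iris projection with scikit-learn (two components, as in the plots; the scaling step is an assumption):

      from sklearn.datasets import load_iris
      from sklearn.decomposition import PCA
      from sklearn.preprocessing import StandardScaler

      X = StandardScaler().fit_transform(load_iris().data)   # zero mean, unit variance
      pca = PCA(n_components=2)
      X_2d = pca.fit_transform(X)
      print(X_2d.shape)                      # (150, 2)
      print(pca.explained_variance_ratio_)   # variance captured by each component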

  48. Finding Principal Components 1. Center the data at zero: x_{i,a} ← x_{i,a} - µ_a 2. Compute the covariance matrix Σ 3. Find the eigenvectors e of that matrix
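
    The same three steps sketched directly in NumPy, on a random data matrix:

      import numpy as np

      X = np.random.randn(200, 5)                 # any (n_samples, d) data matrix

      X_centered = X - X.mean(axis=0)             # 1. center at zero
      cov = np.cov(X_centered, rowvar=False)      # 2. covariance matrix Sigma
      eigvals, eigvecs = np.linalg.eigh(cov)      # 3. eigenvectors of a symmetric matrix

      # Sort by decreasing eigenvalue; columns of eigvecs are the principal components.
      order = np.argsort(eigvals)[::-1]
      eigvals, eigvecs = eigvals[order], eigvecs[:, order]
      X_projected = X_centered @ eigvecs[:, :2]   # project onto the first two components
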
  49. Finding Principal Components - Illustration: Let’s try to use the covariance matrix to transform a random vector!
  50. [Slides 50-55: figures showing the random vector being multiplied by Σ over successive iterations]
  56. The vector’s direction stops changing after several iterations!
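
    A sketch of that iteration (power iteration), using the example covariance matrix from the next slides:

      import numpy as np

      sigma = np.array([[2.0, 0.8],
                        [0.8, 0.6]])      # the example Σ from slide 58

      v = np.random.randn(2)
      for _ in range(10):
          v = sigma @ v
          v = v / np.linalg.norm(v)       # keep the length fixed, watch the direction
          print(v)
      # The direction converges to the first eigenvector of sigma.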

  57. Principal Components ▫ We want vectors e which aren’t turned: Σe = λe ▫ Solve det(Σ - λI) = 0 ▫ Find the i-th eigenvector by solving Σe_i = λ_i e_i ▫ Principal components are the eigenvectors with the largest eigenvalues
  58. Example: Given Σ = [2.0 0.8; 0.8 0.6], the eigenvalues are given by det(Σ - λI) = (2.0 - λ)(0.6 - λ) - 0.8^2 = λ^2 - 2.6λ + 0.56 = 0. The eigenvalues are then λ_1 ≈ 2.36 and λ_2 ≈ 0.24.
  59. Example (cont.): The 1st eigenvector can then be calculated by solving (Σ - λ_1 I) e_1 = 0, i.e. -0.36 e_{1,1} + 0.8 e_{1,2} = 0, which gives e_1 ≈ (0.91, 0.41) after normalisation.
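
    The worked example can be checked with NumPy:

      import numpy as np

      sigma = np.array([[2.0, 0.8],
                        [0.8, 0.6]])
      eigvals, eigvecs = np.linalg.eigh(sigma)
      print(eigvals)         # approx. [0.24, 2.36]
      print(eigvecs[:, -1])  # first principal component, approx. [0.91, 0.41] up to sign
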
  60. Projecting to New Dimensions ▫ e_1 ... e_m are the new dimension vectors ▫ For every data point x_i: - Center to the mean, i.e. x_i - µ - Project to the new dimensions, i.e. (x_i - µ)^T e_j for j = 1...m
  61. How many dimensions? Pick the e_i that “explain” the most variance ▫ Sort the eigenvectors such that λ_1 ≥ λ_2 ≥ … ≥ λ_d ▫ Pick the first m eigenvectors which explain 90% or 95% of the total variance
  62. Alternative: Scree Plot
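
    A sketch of both criteria with scikit-learn, using the digits dataset as a stand-in for any high-dimensional data:

      import matplotlib.pyplot as plt
      import numpy as np
      from sklearn.datasets import load_digits
      from sklearn.decomposition import PCA

      X = load_digits().data                              # 64-dimensional example data
      pca = PCA().fit(X)
      cumulative = np.cumsum(pca.explained_variance_ratio_)
      m = int(np.argmax(cumulative >= 0.95)) + 1          # smallest m explaining 95%
      print(m)

      # Scree plot: look for the "elbow" where extra components stop paying off.
      plt.plot(range(1, len(cumulative) + 1), pca.explained_variance_ratio_, "o-")
      plt.xlabel("component")
      plt.ylabel("explained variance ratio")
      plt.show()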

  63. PCA on customer vs property will cluster properties by their city!
  64. When is it the most helpful?

  65. Latent Semantic Analysis (LSA) ▫ Basically, PCA on text data ▫ Truncated Singular Value Decomposition (SVD) ▫ Capturing co-occurring words
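
    A sketch of LSA as described here, feeding a TF-IDF matrix into truncated SVD; the tiny corpus is a placeholder:

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.decomposition import TruncatedSVD

      docs = ["the cat sat on the mat",
              "a dog sat on the rug",
              "stock markets fell sharply today"]          # placeholder corpus

      tfidf = TfidfVectorizer().fit_transform(docs)        # documents x terms
      lsa = TruncatedSVD(n_components=2).fit_transform(tfidf)
      print(lsa.shape)                                     # (n_docs, 2) dense features
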
  66. UKARA 1.0 Challenge Track 1 ▫ Automatic short-answer scoring ▫ Precision, recall, F1 scores on training set A (5-fold CV): - Without SVD (578 unique tokens): 84.52% | 92.15% | 88.06% - With SVD (100 features): 79.38% | 98.43% | 87.86% ▫ 3rd place on the test set
  67. Faster training with PCA

  68. Some Issues ▫ Sensitive to outliers when computing the covariance matrix → normalise the data ▫ Linearity assumption → transformation
  69. Deep Learning

  70. References ▫ Chollet, F. (May 2016). “Building Autoencoders in Keras”. The Keras Blog. ▫ Wibisono, O. (October 2017). “Autoencoder: Alternatif PCA yang Lebih Mumpuni” [Autoencoder: A More Capable Alternative to PCA]. Tentang Data.
  71. Autoencoder: A neural network to map an input to itself (Chollet, 2016)
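
    A minimal sketch in the spirit of the Chollet (2016) post: a single-hidden-layer autoencoder on flattened MNIST; the layer sizes and optimizer are illustrative choices:

      from tensorflow.keras import layers, models
      from tensorflow.keras.datasets import mnist

      (x_train, _), (x_test, _) = mnist.load_data()
      x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
      x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

      encoding_dim = 32                                  # compressed representation
      inputs = layers.Input(shape=(784,))
      encoded = layers.Dense(encoding_dim, activation="relu")(inputs)
      decoded = layers.Dense(784, activation="sigmoid")(encoded)

      autoencoder = models.Model(inputs, decoded)
      encoder = models.Model(inputs, encoded)            # reused for feature extraction

      autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
      autoencoder.fit(x_train, x_train, epochs=10, batch_size=256,
                      validation_data=(x_test, x_test))
      features = encoder.predict(x_test)                 # 32-dimensional features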
  72. Learning curve of an autoencoder on MNIST (Wibisono, 2017)

  73. Dimensionality reduction of MNIST using AE (above) and PCA (below)

    (Wibisono, 2017)
  74. Pre-trained models for feature extraction (Wibisono, 2016)

  75. Classifying amenities with ResNet50 and t-SNE
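
    A sketch of that kind of pipeline: a pre-trained ResNet50 as a fixed feature extractor, followed by t-SNE for a 2-D view; the image paths are placeholders:

      import numpy as np
      from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
      from tensorflow.keras.preprocessing import image
      from sklearn.manifold import TSNE

      model = ResNet50(weights="imagenet", include_top=False, pooling="avg")

      def extract(path):
          img = image.load_img(path, target_size=(224, 224))
          x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
          return model.predict(x)[0]                     # 2048-dim feature vector

      paths = [f"amenity_{i}.jpg" for i in range(100)]   # placeholder image paths
      features = np.stack([extract(p) for p in paths])
      embedded = TSNE(n_components=2).fit_transform(features)  # 2-D map for plotting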

  76. Thank you! Ali Akbar Septiandri Twitter: @__aliakbars__ Home: http://uai.aliakbars.com Airy: https://medium.com/airy-science