Slide 1

Slide 1 text

Feature Selection & Extraction Ali Akbar Septiandri

Slide 2

Slide 2 text

Outline 1. Feature selection a. Multicollinearity b. Information gain 2. Feature extraction a. Principal component analysis (PCA) b. Deep learning

Slide 3

Slide 3 text

First things first: why?

Slide 4

Slide 4 text

Curse of Dimensionality ▫ Datasets are typically high-dimensional ▫ ML methods are statistical by nature ▫ As dimensionality grows, there are fewer observations per region, while statistics needs repetition

Slide 5

Slide 5 text

FEATURE SELECTION

Slide 6

Slide 6 text

Multicollinearity

Slide 7

Slide 7 text

Find a linear relationship between the dependent and independent variables: y = w_0 + w_1 x

Slide 8

Slide 8 text

weight = 809.98 + 21.20 × horsepower
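Not from the slides: a minimal scikit-learn sketch reproducing the simple regressions on the next couple of slides, assuming seaborn's built-in "mpg" dataset (the classic Auto MPG data) stands in for the slides' data; exact coefficients and errors may differ slightly depending on preprocessing.

```python
# Sketch: fit weight ~ horsepower and mpg ~ horsepower on the Auto MPG data.
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

df = sns.load_dataset("mpg").dropna(subset=["mpg", "horsepower", "weight"])

# weight = w_0 + w_1 * horsepower
lr_weight = LinearRegression().fit(df[["horsepower"]], df["weight"])
print(lr_weight.intercept_, lr_weight.coef_)   # compare with the slide: 809.98 and 21.20

# mpg = w_0 + w_1 * horsepower
lr_mpg = LinearRegression().fit(df[["horsepower"]], df["mpg"])
print(lr_mpg.intercept_, lr_mpg.coef_)         # compare with the slide: 39.15 and -0.16
print(mean_absolute_error(df["mpg"], lr_mpg.predict(df[["horsepower"]])))  # roughly the MAE quoted later
```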

Slide 9

Slide 9 text

mpg = 39.15 - 0.16 × horsepower

Slide 10

Slide 10 text

We want a line that minimizes the residuals

Slide 11

Slide 11 text

Residual plot to measure goodness of fit

Slide 12

Slide 12 text

Mean Absolute Error = 3.89. Can we do better?

Slide 13

Slide 13 text

Since mpg decreases more slowly than horsepower increases, we might want to take the log of horsepower

Slide 14

Slide 14 text

y = w_0 + w_1 log(x_1)
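A minimal sketch of this log-transformed model (not from the slides), again assuming seaborn's built-in "mpg" dataset as a stand-in for the slides' data.

```python
# Sketch: regress mpg on log(horsepower) and compare the MAE with the raw feature.
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

df = sns.load_dataset("mpg").dropna(subset=["mpg", "horsepower"])
X_log = np.log(df[["horsepower"]])                 # y = w_0 + w_1 * log(horsepower)
model = LinearRegression().fit(X_log, df["mpg"])
print(mean_absolute_error(df["mpg"], model.predict(X_log)))  # should drop below the raw-feature MAE
```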

Slide 15

Slide 15 text

MAE = 3.56

Slide 16

Slide 16 text

Can we do even better?

Slide 17

Slide 17 text

y = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + ...

Slide 18

Slide 18 text

Example mpg = w_0 + w_1 horsepower + w_2 model_year mpg = -28.34 - 0.17 × horsepower + 0.86 × model_year This results in MAE = 3.14!
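A sketch of this two-feature model (not from the slides), assuming the same seaborn "mpg" dataset; the fitted coefficients and MAE should be close to, but not necessarily identical to, the numbers above.

```python
# Sketch: multiple regression of mpg on horsepower and model_year.
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

df = sns.load_dataset("mpg").dropna(subset=["mpg", "horsepower", "model_year"])
X = df[["horsepower", "model_year"]]
model = LinearRegression().fit(X, df["mpg"])
print(model.intercept_, model.coef_)                     # compare with the slide's coefficients
print(mean_absolute_error(df["mpg"], model.predict(X)))  # around 3.1 on the full data
```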

Slide 19

Slide 19 text

Potential Problems: multicollinearity, non-monotonicity

Slide 20

Slide 20 text

Multicollinearity Let's try to predict mpg from horsepower and weight.

Slide 21

Slide 21 text

Multicollinearity mpg = 44.03 - 0.01 × horsepower - 0.03 × weight

Slide 22

Slide 22 text

The previous model yields MAE = 3.38. While it still performs well, the coefficients are now somewhat meaningless.

Slide 23

Slide 23 text

Multicollinearity Another example: let's try to predict height from leg length.

Slide 24

Slide 24 text

Demo

Slide 25

Slide 25 text

Multicollinearity height = 44.71 + 1.62 × leg_left Great! Should we add another variable?

Slide 26

Slide 26 text

Multicollinearity height = 44.57 - 19.27 × leg_left + 20.88 × leg_right So, the longer your left leg is, the shorter you are(?)
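The effect is easy to reproduce on synthetic data. This sketch (not from the slides) simulates two almost perfectly collinear predictors and shows how the coefficients become unstable; all numbers below are made up for illustration.

```python
# Sketch: nearly identical left/right leg lengths make the coefficients unstable.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 200
leg_left = rng.normal(90, 5, n)                    # cm
leg_right = leg_left + rng.normal(0, 0.2, n)       # almost the same as the left leg
height = 45 + 1.6 * leg_left + rng.normal(0, 3, n)

one_leg = LinearRegression().fit(leg_left.reshape(-1, 1), height)
both_legs = LinearRegression().fit(np.column_stack([leg_left, leg_right]), height)
print(one_leg.intercept_, one_leg.coef_)      # close to the true 45 and 1.6
print(both_legs.intercept_, both_legs.coef_)  # large coefficients with opposite signs are common
```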

Slide 27

Slide 27 text

Correlation heatmap
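A short sketch of how such a heatmap can be drawn (not from the slides), assuming seaborn's built-in "mpg" dataset; highly correlated pairs such as horsepower and weight stand out immediately.

```python
# Sketch: correlation heatmap of the numeric Auto MPG columns.
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("mpg")
corr = df.select_dtypes("number").corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.tight_layout()
plt.show()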

Slide 28

Slide 28 text

Non-monotonicity Some variables have a non-monotonic relationship with the target, e.g. using price to predict revenue → they need a transformation

Slide 29

Slide 29 text

Information Gain

Slide 30

Slide 30 text

Entropy - Measuring Impurity H(S) = -p_(+) log_2 p_(+) - p_(-) log_2 p_(-), where S is a subset of the data and p_(+) and p_(-) are the probabilities of positive and negative cases in S. Interpretation: if X ∈ S, how many bits are needed to determine whether X is positive or negative?
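A small NumPy sketch of this definition (not from the slides), generalised to any number of classes.

```python
# Sketch: entropy of a label array, H(S) = -sum_c p_c * log2(p_c).
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy(["+", "+", "-", "-"]))  # 1.0 bit: a maximally impure set
print(entropy(["+", "+", "+", "+"]))  # 0.0 bits: a pure set
```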

Slide 31

Slide 31 text

Information Gain ▫ We want more instances in pure sets ▫ The difference in entropy before and after splitting on attribute A: IG(S, A) = H(S) - Σ_V (|S_V| / |S|) H(S_V), where V ranges over the possible values of A and S_V is the subset where X_A = V
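A self-contained sketch of this formula (not from the slides); the entropy helper is repeated here, and the toy labels are invented for illustration only.

```python
# Sketch: information gain IG(S, A) = H(S) - sum_V |S_V|/|S| * H(S_V)
# for a discrete attribute column x and labels y.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(x, y):
    x, y = np.asarray(x), np.asarray(y)
    gain = entropy(y)
    for v in np.unique(x):
        subset = y[x == v]
        gain -= len(subset) / len(y) * entropy(subset)
    return gain

# Toy Titanic-style example: how informative is sex about survival?
sex      = ["F", "F", "F", "M", "M", "M", "M", "M"]
survived = [1, 1, 0, 0, 0, 0, 1, 0]
print(information_gain(sex, survived))
```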

Slide 32

Slide 32 text

Reducing features from 30 to 14 to predict the survival of Titanic passengers (Besbes, 2016)

Slide 33

Slide 33 text

Top 4% solution on Kaggle Titanic challenge

Slide 34

Slide 34 text

FEATURE EXTRACTION

Slide 35

Slide 35 text

Principal Component Analysis

Slide 36

Slide 36 text

References 1. VanderPlas, J. (2016). Python Data Science Handbook. O'Reilly Media. (In Depth: Principal Component Analysis) 2. Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of Massive Datasets. Cambridge University Press. (Sections 11.1-11.2)

Slide 37

Slide 37 text

Curse of Dimensionality ▫ High-dimensional datasets, e.g. images, text, speech ▫ Only a few “useful” dimensions ▫ Attributes with low variance

Slide 38

Slide 38 text

In MNIST, only a few of the pixels in ℝ^784 matter

Slide 39

Slide 39 text

In social networks, not all people are connected

Slide 40

Slide 40 text

What else?

Slide 41

Slide 41 text

Feature Extraction ▫ Speech → MFCC ▫ Text → bag of words, TF-IDF ▫ Image → Scale-Invariant Feature Transform (SIFT)

Slide 42

Slide 42 text

Principal Component Analysis 1. Defines a set of principal components a. 1st: direction of the greatest variability in the data b. 2nd: perpendicular to 1st, greatest variability of what’s left c. … and so on until d (original dimensionality) 2. First m << d components become m new dimensions

Slide 43

Slide 43 text

Example data in 2D (VanderPlas, 2016)

Slide 44

Slide 44 text

Two principal components (VanderPlas, 2016)

Slide 45

Slide 45 text

Data projection using principal components (VanderPlas, 2016)

Slide 46

Slide 46 text

PCA on Iris dataset

Slide 47

Slide 47 text

PCA on Iris dataset
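A minimal scikit-learn sketch of what these Iris plots show (not the slides' own code): project the four-dimensional Iris data onto its first two principal components.

```python
# Sketch: 2-component PCA of the Iris data.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scales
X_2d = PCA(n_components=2).fit_transform(X_scaled)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```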

Slide 48

Slide 48 text

Finding Principal Components 1. Center the data at zero: x_{i,a} ← x_{i,a} - µ_a 2. Compute the covariance matrix Σ 3. Find the eigenvectors e of Σ
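The three steps translate almost directly into NumPy. A sketch (not from the slides), using Iris as example data:

```python
# Sketch: center, covariance matrix, eigenvectors.
import numpy as np
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)
X_centered = X - X.mean(axis=0)            # step 1: center at zero
cov = np.cov(X_centered, rowvar=False)     # step 2: covariance matrix Σ
eigvals, eigvecs = np.linalg.eigh(cov)     # step 3: eigenvectors of Σ (as columns)

# Sort by decreasing eigenvalue; the leading columns are the principal components.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
print(eigvals)        # variance captured by each component
print(eigvecs[:, 0])  # first principal component
```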

Slide 49

Slide 49 text

Finding Principal Components - Illustration Let's try to use the covariance matrix to transform a random vector!

Slide 50

Slide 50 text

No content

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

After several iterations, the vector's direction no longer changes!
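A short sketch of this illustration (not from the slides): repeatedly multiplying a random vector by a covariance matrix turns it towards the leading eigenvector, i.e. power iteration. The matrix below is the example Σ used a couple of slides later.

```python
# Sketch: power iteration with the example covariance matrix.
import numpy as np

sigma = np.array([[2.0, 0.8],
                  [0.8, 0.6]])

rng = np.random.default_rng(0)
v = rng.normal(size=2)
for i in range(10):
    v = sigma @ v
    v /= np.linalg.norm(v)   # re-normalise so only the direction matters
    print(i, v)
# After a few iterations v stops changing: its direction is the first principal component.
```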

Slide 57

Slide 57 text

Principal Components ▫ We want vectors e that are not turned: Σe = λe ▫ Solve det(Σ - λI) = 0 to find the eigenvalues λ ▫ Find the i-th eigenvector by solving Σe_i = λ_i e_i ▫ The principal components are the eigenvectors with the largest eigenvalues

Slide 58

Slide 58 text

Example Given Σ = [2.0 0.8; 0.8 0.6], the eigenvalues are given by det(Σ - λI) = (2.0 - λ)(0.6 - λ) - 0.8² = λ² - 2.6λ + 0.56 = 0. The eigenvalues are then λ_1 ≈ 2.36 and λ_2 ≈ 0.24.

Slide 59

Slide 59 text

Example (cont.) The 1st eigenvector can then be calculated by solving (Σ - λ_1 I) e_1 = 0: from (2.0 - 2.36) e_{1,1} + 0.8 e_{1,2} = 0 we get e_{1,2} ≈ 0.45 e_{1,1}, so after normalisation e_1 ≈ (0.91, 0.41).
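A quick NumPy check of this worked example (not on the slides):

```python
# Sketch: verify the eigenvalues and the first eigenvector of the example Σ.
import numpy as np

sigma = np.array([[2.0, 0.8],
                  [0.8, 0.6]])
eigvals, eigvecs = np.linalg.eigh(sigma)  # eigenvalues in ascending order
print(eigvals)                            # ≈ [0.24, 2.36]
print(eigvecs[:, -1])                     # first principal component, ≈ [0.91, 0.41] up to sign
```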

Slide 60

Slide 60 text

Projecting to New Dimensions ▫ e_1, ..., e_m are the new dimension vectors ▫ For every data point x_i: - Center to the mean, i.e. x_i - µ - Project onto the new dimensions, i.e. (x_i - µ)^T e_j for j = 1...m

Slide 61

Slide 61 text

How many dimensions? Pick the e_i that “explain” the most variance ▫ Sort the eigenvectors such that λ_1 ≥ λ_2 ≥ … ≥ λ_d ▫ Pick the first m eigenvectors that explain 90% or 95% of the total variance

Slide 62

Slide 62 text

Alternative: Scree Plot
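A sketch of both options (not from the slides), using scikit-learn's digits data as a stand-in: the cumulative explained-variance ratio gives the 95% cutoff, and plotting the per-component ratios gives a scree plot.

```python
# Sketch: explained-variance cutoff and scree plot.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
pca = PCA().fit(X)

cum = np.cumsum(pca.explained_variance_ratio_)
print(np.argmax(cum >= 0.95) + 1, "components explain 95% of the variance")

plt.plot(range(1, len(cum) + 1), pca.explained_variance_ratio_, marker="o")
plt.xlabel("Component")
plt.ylabel("Explained variance ratio")
plt.title("Scree plot")
plt.show()
```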

Slide 63

Slide 63 text

PCA on customer vs property will cluster properties by their city!

Slide 64

Slide 64 text

When is it most helpful?

Slide 65

Slide 65 text

Latent Semantic Analysis (LSA) ▫ Basically, PCA on text data ▫ Truncated Singular Value Decomposition (SVD) ▫ Capturing co-occurring words
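A minimal sketch of LSA with scikit-learn (not from the slides): TF-IDF followed by truncated SVD. The toy corpus and the choice of 2 components are illustrative; a real setup would use something like the 100 components mentioned on the next slide.

```python
# Sketch: LSA = TF-IDF + TruncatedSVD on a toy corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

corpus = [
    "the cat sat on the mat",
    "a cat and a kitten on the mat",
    "stock prices fell sharply today",
    "markets and stock indexes dropped",
]

lsa = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
)
X_lsa = lsa.fit_transform(corpus)
print(X_lsa)  # pet-related and finance-related sentences separate along the components
```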

Slide 66

Slide 66 text

▫ Automatic short-answer scoring ▫ Precision, recall, F1 scores on training set A (5-fold CV): - Without SVD (578 unique tokens): 84.52% | 92.15% | 88.06% - With SVD (100 features): 79.38% | 98.43% | 87.86% ▫ 3rd place on test set UKARA 1.0 Challenge Track 1

Slide 67

Slide 67 text

Faster training with PCA

Slide 68

Slide 68 text

Some Issues ▫ Sensitive to outliers when computing the covariance matrix → normalise the data ▫ Linearity assumption → transformation

Slide 69

Slide 69 text

Deep Learning

Slide 70

Slide 70 text

References ▫ Chollet, F. (May 2016). “Building Autoencoders in Keras”. The Keras Blog. ▫ Wibisono, O. (October 2017). “Autoencoder: Alternatif PCA yang Lebih Mumpuni”. Tentang Data.

Slide 71

Slide 71 text

Autoencoder: A neural network to map an input to itself (Chollet, 2016)
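A minimal Keras sketch of such an autoencoder on MNIST, in the spirit of Chollet (2016); the layer sizes, optimizer, and epoch count here are illustrative choices, not necessarily the exact code from the blog post or the slides.

```python
# Sketch: a single-hidden-layer autoencoder trained to reconstruct MNIST digits.
from tensorflow import keras
from tensorflow.keras import layers

(x_train, _), (x_test, _) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

inputs = keras.Input(shape=(784,))
encoded = layers.Dense(32, activation="relu")(inputs)       # 32-dim bottleneck code
decoded = layers.Dense(784, activation="sigmoid")(encoded)  # reconstruct the input

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
autoencoder.fit(x_train, x_train,                 # note: the input is also the target
                epochs=10, batch_size=256,
                validation_data=(x_test, x_test))

encoder = keras.Model(inputs, encoded)            # this part is the feature extractor
codes = encoder.predict(x_test)
print(codes.shape)                                # (10000, 32)
```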

Slide 72

Slide 72 text

Learning curve of an autoencoder on MNIST (Wibisono, 2017)

Slide 73

Slide 73 text

Dimensionality reduction of MNIST using AE (above) and PCA (below) (Wibisono, 2017)

Slide 74

Slide 74 text

Pre-trained models for feature extraction (Wibisono, 2017)

Slide 75

Slide 75 text

Classifying amenities with ResNet50 and t-SNE
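A hedged sketch of this kind of pipeline (not the slides' code): a pre-trained ResNet50 with ImageNet weights is used as a fixed feature extractor, and the resulting vectors are embedded with t-SNE. The `images` array below is a random placeholder standing in for amenity photos resized to 224×224×3.

```python
# Sketch: ResNet50 features + t-SNE embedding.
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from sklearn.manifold import TSNE

images = np.random.randint(0, 255, size=(64, 224, 224, 3)).astype("float32")  # placeholder data

model = ResNet50(weights="imagenet", include_top=False, pooling="avg")  # 2048-dim features
features = model.predict(preprocess_input(images))

embedding = TSNE(n_components=2, random_state=0).fit_transform(features)
print(features.shape, embedding.shape)  # (64, 2048) (64, 2)
```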

Slide 76

Slide 76 text

Thank you! Ali Akbar Septiandri Twitter: @__aliakbars__ Home: http://uai.aliakbars.com Airy: https://medium.com/airy-science