
Dimensionality reduction - squeezing out the good stuff with PCA by Aabir Abubaker Kar

Pycon ZA
October 12, 2018


Data is often high-dimensional - millions of pixels, frequencies, categories. A lot of this detail is unnecessary for data analysis - but how much exactly? This talk will discuss the ideas and techniques of dimensionality reduction, provide useful mathematical intuition about how it's done, and show you how Netflix uses it to lead you from binge to binge.

In this session, we'll start by remembering what data really is and what it stands for. Data is a structured set of numbers, and these numbers typically (hopefully!) hold some information. This will lead us naturally to the concept of a high-dimensional dataspace, the mystical realm in which data lives. It turns out that data in this space displays an extremely useful 'selection bias' - a datapoint can be known by the company it keeps. This is one of the basic ideas behind k-means clustering, which we will briefly discuss.

We'll then talk about the informativeness of certain dimensions of the data space over others. This lays the mathematical foundation for the technique of Principal Component Analysis (PCA), which we will run on the Netflix movie dataset using scikit-learn.

We will also touch upon tSNE, another popular dimensionality-reduction algorithm.

I will be using scikit-learn for processing and matplotlib for visualization. The purpose of this session is to introduce dimensionality-reduction to those who do not know it, and to provide useful guiding intuitions to those who do. We'll also discuss some seminal use-cases, with tips and warnings for your own applications.

Transcript

  1. CONCEPTUAL ITINERARY
     • Data and what it's made of
     • Why you should care about understanding data
     • The basic principles of data analysis
     • Introduction to the data space
     • K-means clustering
     • Principal Component Analysis (PCA)
     • A brief mention of tSNE
     • 10 steps to being a data scientist (patent pending)
  2. WHAT IS DATA MADE OF?
     Data is a structured collection of numbers:
     • Images
     • Audio
     • Databases (the structure can get arbitrarily complicated)
  3. WHY YOU SHOULD CARE ABOUT UNDERSTANDING DATA
     Data vs. understanding:
     • Data is structured too rigidly to be useful - Andrew Ng calls it the new crude; understanding is a complex, rich resource that can be probed, moulded, and harnessed
     • Data is focussed on facts, immutability and answers; understanding is based on hypotheses, uncertainty and questions
     • Data is fixed-scale - it has stagnant structure; understanding is multiscale - insights at different scales can contribute to each other
     • Data does not, in its crude form, contribute to science or learning; understanding is the foundation of new science and learning
  4. THE BASIC PRINCIPLES OF DATA ANALYSIS
     • A datapoint can be known by the company it keeps
     • Most data will display patterns that are indicative of the meta-information we are searching for
  5. SOME DATA - THE DATA SPACE IN 2 DIMENSIONS
     Sample ID | Property 1 | Property 2
     A         | 12         |  1
     B         |  1         | 12
     C         |  4         | 11
     D         |  2         |  9
     E         |  2         | 11
     F         |  3         | 10
     G         | 12         |  2
     H         |  5         | 11
     I         |  6         |  5
     J         |  6         |  4
     K         |  2         | 12
     L         | 10         |  3
     M         |  9         |  2
     N         | 10         |  2
     • Each row is a sample
     • Each sample has 2 properties
     • Each sample can be plotted as a point in the data space
     • Important: by default, we assume that each property is equally important
     • This means we should make sure to scale them (see the scaling sketch below)
     [Scatter plot of the samples: Property 1 vs Property 2, both axes from 0 to 14]
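
     Since every property is treated as equally important, the scaling step matters. Here is a minimal sketch of one way to do it, using the sample table above and scikit-learn's StandardScaler; the choice of scaler is an assumption, the slide only says the properties should be scaled.

     import numpy as np
     from sklearn.preprocessing import StandardScaler

     # Property 1 and Property 2 for samples A-N, taken from the table above
     data = np.array([[12, 1], [1, 12], [4, 11], [2, 9], [2, 11], [3, 10], [12, 2],
                      [5, 11], [6, 5], [6, 4], [2, 12], [10, 3], [9, 2], [10, 2]], dtype=float)

     # Rescale each property (column) to mean 0 and variance 1 so neither dominates distances
     scaled = StandardScaler().fit_transform(data)
     print(scaled.mean(axis=0), scaled.std(axis=0))  # approximately [0, 0] and [1, 1]
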
  6. THE DATA SPACE IN 2 DIMENSIONS
     [Scatter plot - Samples: birds; axes: Height vs Weight, both 0 to 14]
     • Same bullet points as slide 5: each row is a sample with 2 properties, plotted as a point in the data space; each property is assumed equally important, so scale them
  7. THE DATA SPACE IN 2 DIMENSIONS
     [Scatter plot - Samples: bank-account holders; axes: Debt (x $10K) vs Assets (x $10K), both 0 to 14]
     • Same bullet points as slide 5
  8. THE DATA SPACE IN 2 DIMENSIONS
     [Scatter plot - Samples: product orders; axes: Shipping time (days) vs Order value (x $1K), both 0 to 14]
     • Same bullet points as slide 5
  9. THE DATA SPACE IN 3 DIMENSIONS - SOME DATA
     Sample ID | Property 1 | Property 2 | Property 3
     A         | 12         |  1         | 50
     B         |  1         | 12         | 25
     C         |  4         | 11         | 30
     D         |  2         |  9         | 34
     E         |  2         | 11         | 35
     F         |  3         | 10         | 28
     G         | 12         |  2         | 48
     H         |  5         | 11         | 36
     I         |  6         |  5         | 41
     J         |  6         |  4         | 43
     K         |  2         | 12         | 26
     L         | 10         |  3         | 40
     M         |  9         |  2         | 37
     N         | 10         |  2         | 37
  10. THE HIGH-DIMENSIONAL DATA SPACE (IT'S THE SAME THING) - LOTS OF DATA
     [Same table as slide 9, extended with Property 4, Property 5, … columns]
     • This concept scales up arbitrarily well
     • Every row is a point in a high-dimensional space
     • Remember: by default, we assume that each property is equally important
  11. THE HIGH-DIMENSIONAL DATA SPACE (IT'S THE SAME THING)
     So you can't picture the 100th dimension?
  12. CLUSTERING - REVISITING 2-D
     [Scatter plot - Samples: birds (Male/Female); axes: Height vs Weight, both 0 to 14]
     • You can see 2 clusters - why are shorter birds heavier and taller birds lighter?
     • Answer: either these birds have a serious obesity/malnutrition problem…
     • Or this data is incorrect. Fix it.
  13. CLUSTERING - GENERALIZING TO N DIMENSIONS
     [Scatter plot - Samples: birds; axes: Height vs Weight, both 0 to 14]
     • We can't objectively identify clusters visually
     • We can't see beyond 3 dimensions
     • Required: algorithms for clustering of high-dimensional data
     • K-means clustering - simple, yet powerful
  14. K-MEANS CLUSTERING
     • Input:
       • N datapoints, each in the d-dimensional space
       • k, the number of clusters to be found
     • Output:
       • a list containing the cluster index for each datapoint
     • How it works:
       • Iteratively finds the k clusters such that each point is closest to the centroid (high-dimensional average) of its cluster (see the sketch after this slide)
     Credits: Andrey A. Shabalin, shabal.in
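
     The centroid iteration described above can be written out in a few lines of NumPy. This is a rough illustration of the idea, not the talk's code and not scikit-learn's implementation.

     import numpy as np

     def kmeans_sketch(points, k, n_iter=100, seed=0):
         # Start from k randomly chosen datapoints as the initial centroids
         rng = np.random.default_rng(seed)
         centroids = points[rng.choice(len(points), size=k, replace=False)]
         for _ in range(n_iter):
             # Assign each point to its nearest centroid
             distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
             labels = distances.argmin(axis=1)
             # Move each centroid to the mean (high-dimensional average) of its cluster
             new_centroids = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                                       else centroids[j] for j in range(k)])
             if np.allclose(new_centroids, centroids):  # stop once the centroids settle
                 break
             centroids = new_centroids
         return labels, centroids
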
  15. K-MEANS CLUSTERING
     • Limitations:
       • Not reproducible - the algorithm might not converge and is sensitive to initialization
       • True clusters may be of unequal size or density, and centroids don't represent these differences well
       • True clusters may be non-spherical (say, boomerang-shaped), and since each centroid is just the average of the points assigned to it, k-means won't recover such shapes
       • There may be outliers in the data, so centroids get skewed (the average of 2, 3, 3, 100 is 27)
     • Applications:
       • Automatic loan approval
       • Customer behavior classification
       • Cancer diagnosis
       • Image classification (NNs do it better)
       • Student performance analysis
  16. K-MEANS CLUSTERING IMPLEMENTATION
     import numpy as np
     from sklearn.cluster import KMeans

     data = np.load('kmeans_data.npy')
     print("Dimensions of the data: ", data.shape)

     kmeans_1 = KMeans(n_clusters=12)
     cluster_indices = kmeans_1.fit_predict(data)
     print(cluster_indices)
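
     One judgment call the snippet above leaves open is the choice of n_clusters. A common heuristic, not mentioned in the slides, is to run KMeans for several values of k and look for an 'elbow' in the inertia (the summed squared distance of points to their centroids):

     from sklearn.cluster import KMeans

     # Assumes `data` has already been loaded as in the snippet above
     inertias = {k: KMeans(n_clusters=k, n_init=10).fit(data).inertia_ for k in range(2, 16)}
     for k, inertia in inertias.items():
         print(k, inertia)
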
  17. WHAT 'EXPLAINS' DATA? VARIANCE IN THE DATA SPACE
     • The distance between datapoints might not always make sense as a metric of similarity.
     • Another useful way to think about a dataset is its 'informativeness'.
     • Can there be some 'direction' or 'orientation' in the dataspace that contains more information than others?
     [Scatter plot, both axes 0 to 14]
  18. WHAT 'EXPLAINS' DATA? VARIANCE IN THE DATA SPACE
     • The informativeness of a direction can be easily quantified using the variance.
     • The most informative direction is the direction in which data is most spread out. This is the direction of maximum variance.
     • Alternatively, if the variance along some direction is low, that direction is more 'predictable' and therefore contains less information about the data.
     [Scatter plot, both axes 0 to 14]
  19. VARIANCE IN 2 DIMENSIONS - REVISITING 2-D
     [Scatter plot - Samples: birds; axes: Height vs Weight, both 0 to 14]
     • Look at the projections of these points on the x- and y-axes
     • Properties (height and weight) are directions along which the data varies
     • This data separates into clusters along both 'property-directions' - this indicates high variance along both directions
  20. VARIANCE IN 2 DIMENSIONS - REVISITING 2-D
     [Scatter plot - Samples: birds; axes: Height vs Weight, both 0 to 14]
     • Formally, we can find the variance for the properties height and weight:
       $\sigma_x^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2$
       $\sigma^2_{\text{height}} = 15.2, \quad \sigma^2_{\text{weight}} = 19.36$
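
     The formula above is the plain 1/N population variance; as a quick sanity check, it matches NumPy's default np.var. The height values below are made up and will not reproduce the 15.2 figure, which comes from the talk's own bird data.

     import numpy as np

     heights = np.array([1.0, 2.0, 2.0, 3.0, 5.0, 10.0, 12.0])   # hypothetical heights
     var_by_hand = ((heights - heights.mean()) ** 2).mean()      # (1/N) * sum((x_i - x_bar)^2)
     print(var_by_hand, np.var(heights))                         # identical: np.var also divides by N
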
  21. VARIANCE IN 2 DIMENSIONS - REVISITING 2-D
     [Scatter plot - Samples: birds; axes: Height vs Weight, both 0 to 14]
     • In other words, we need both directions (properties) to capture the variability of our data
     • In n dimensions, this means we'd need n directions (all n properties)
     • But what if we didn't?
  22. VARIANCE IN 2 DIMENSIONS - REVISITING 2-D
     [Scatter plot - Samples: birds; axes: Height vs Weight, both 0 to 14, with the hybrid (2, 1) direction drawn through the data and annotated '11.6 units']
     • Consider a hybrid 'direction' made of 2 units of height and 1 unit of weight
     • Project the points down onto this direction
     • Each point has a coordinate on this new direction
     • The variance of these projected values is:
       $\sigma^2_{2 \times \text{height} + 1 \times \text{weight}} = 27.04$
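
     The projection step can be written out directly: take a unit vector along the (2, 1) hybrid direction, dot each point with it, and compute the variance of the resulting coordinates. The bird data is not included with the deck, so the points below are stand-ins and won't reproduce the 27.04 figure.

     import numpy as np

     points = np.array([[1.0, 12.0], [2.0, 11.0], [3.0, 10.0],
                        [9.0, 2.0], [10.0, 3.0], [12.0, 1.0]])   # columns: [height, weight]
     direction = np.array([2.0, 1.0])
     direction /= np.linalg.norm(direction)   # unit length, so the dot product is a plain coordinate

     projected = points @ direction           # one coordinate per point along the hybrid direction
     print(np.var(projected))                 # variance 'explained' by the 2*height + 1*weight direction
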
  23. VARIANCE IN 2 DIMENSIONS - REVISITING 2-D
     [Scatter plot - Samples: birds; axes: Height vs Weight, both 0 to 14, with the (2, 1) direction drawn]
     • The new direction 'explains' more of the data than either of the original directions (height and weight)
     • One way to see this is to cluster the newly projected data using only this direction
     • Just one coordinate now explains the same two clusters
       $\sigma^2_{2 \times \text{height} + 1 \times \text{weight}} = 27.04$ vs. $\sigma^2_{\text{height}} = 15.2, \ \sigma^2_{\text{weight}} = 19.36$
  24. VARIANCE IN 2 DIMENSIONS - REVISITING 2-D
     [Scatter plot - Samples: birds; axes: Normalized height vs Normalized weight, both -4 to 10, with Direction 1 and Direction 2 drawn]
     • Centering the data by subtracting the mean helps find directions passing through the origin
     • We can find the directions of maximum and minimum variance:
       $\sigma^2_1 = 30.25, \quad \sigma^2_2 = 1.44$
  25. PRINCIPAL COMPONENT ANALYSIS - GENERALIZING TO N DIMENSIONS
     • Linear algebra provides an efficient, fast way to calculate the directions explaining maximum variance, even for ridiculously high-dimensional data
     • It projects data onto these directions, calculates the explained variances and much more
     • Welcome to Principal Component Analysis
     [Scatter plot of samples; axes: Normalized property #1 vs Normalized property #2, both -4 to 10]
  26. PRINCIPAL COMPONENT ANALYSIS (PCA)
     • Input:
       • N datapoints, each in a d-dimensional space
       • Optionally, d2, the number of new directions to be found
     • Output:
       • d2 new 'directions', in decreasing order of explained variance
       • the projections of the N datapoints along these new directions
     • How it works:
       • (jargon alert) it finds the eigenvectors of the covariance matrix (sketched after this slide)
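
     To unpack the jargon, here is a bare-bones sketch of that eigenvector computation in NumPy: centre the data, build the covariance matrix, and sort its eigenvectors by eigenvalue. It is only meant to show the idea; scikit-learn's PCA (next slide) does this more carefully, usually via the SVD.

     import numpy as np

     def pca_sketch(data, n_components):
         # Centre the data so the directions pass through the origin (slide 24)
         centered = data - data.mean(axis=0)
         # d x d covariance matrix of the properties
         cov = np.cov(centered, rowvar=False)
         # Eigen-decompose it; eigh is appropriate because a covariance matrix is symmetric
         eigvals, eigvecs = np.linalg.eigh(cov)
         order = np.argsort(eigvals)[::-1]                  # largest explained variance first
         components = eigvecs[:, order[:n_components]]      # the new 'directions', one per column
         projections = centered @ components                # coordinates of each point along them
         return components, projections, eigvals[order[:n_components]]
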
  27. PRINCIPAL COMPONENT ANALYSIS (PCA)
     • Limitations:
       • No guarantee that PCA is prioritizing the 'true' directions
       • Depends on the topology of the data
       • Requires smart 'normalisation' (each of the original d dimensions should have similar magnitudes)
       • To some extent, is a black box
     • Applications:
       • Feature extraction
       • Behaviour classification
       • Data exploration
       • Preliminary processing for many applications
  28. PCA IMPLEMENTATION
     import numpy as np
     from sklearn.decomposition import PCA

     data = np.load('pca_data.npy')
     print("Dimensions of the data: ", data.shape)

     pca_1 = PCA(n_components=10)
     transformed_data = pca_1.fit_transform(data)
     components = pca_1.components_
     exp_variance_ratio = pca_1.explained_variance_ratio_
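
     A common follow-up, not shown in the deck, is to plot the cumulative explained-variance ratio with matplotlib to decide how many components are actually worth keeping:

     import numpy as np
     import matplotlib.pyplot as plt

     # exp_variance_ratio comes from the snippet above
     cumulative = np.cumsum(exp_variance_ratio)
     plt.plot(range(1, len(cumulative) + 1), cumulative, marker='o')
     plt.xlabel('Number of components')
     plt.ylabel('Cumulative explained variance ratio')
     plt.show()
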
  29. PCA DEMO - THE NETFLIX PROBLEM
     • Netflix has information on millions of movies and tens of thousands of customers
     • They'd like to recommend to you movies that are 'closest' to movies you've already watched
     • A simple way to do this would be to find other customers who have watched similar movies to you - then recommend to you the movies they've watched that you haven't
     • Datapoints that are closer to each other along the 'most informative' directions are more similar - this is a kind of clustering! (see the sketch below)
     • Question: why is k-means clustering unlikely to be helpful here?
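
     As a hedged sketch of the idea on this slide (not Netflix's actual pipeline): build a customer-by-movie watch matrix, reduce it with PCA, and look up each customer's nearest neighbours in the reduced space. The names, shapes and the random watch matrix below are assumptions for illustration only.

     import numpy as np
     from sklearn.decomposition import PCA
     from sklearn.neighbors import NearestNeighbors

     # Hypothetical 0/1 matrix: rows = customers, columns = movies watched
     rng = np.random.default_rng(0)
     watch_matrix = rng.integers(0, 2, size=(1000, 500)).astype(float)

     # Keep only the most informative directions, then compare customers there
     reduced = PCA(n_components=20).fit_transform(watch_matrix)
     nn = NearestNeighbors(n_neighbors=6).fit(reduced)
     _, neighbour_idx = nn.kneighbors(reduced[:1])   # customer 0 plus their 5 most similar customers
     print(neighbour_idx)
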
  30. TSNE
     • tSNE (based on SNE) is similar to PCA - it is also often a first step in data analysis
     • It measures informativeness in a different way - using the concept of entropy from information theory (particularly KL divergence)
     • It transforms the dataspace (as in PCA), but here, it places datapoints onto a new space where proximity indicates high 'mutual information'
     • Datapoints that are closer together in the new space contain more information about each other
     • It has similar uses to PCA, with some strengths (stronger justification in information theory) and weaknesses (uses random distributions for transformation, so not reproducible)
     • Note: tSNE struggles with finding a large number of components (>3)
  31. HOW TSNE WORKS (QUALITATIVELY)
     • tSNE looks at the distribution of values that each direction takes
     • It then identifies points whose values best represent each other relative to the distribution
     • It places these points close to each other in the transformed space
     • In high-dimensional space, this is where information theory comes in
     [Two scatter plots: the original data space (axes -4 to 10) and the transformed space (axes 0 to 3)]
  32. TSNE IMPLEMENTATION
     import numpy as np
     from sklearn.manifold import TSNE

     data = np.load('tsne_data.npy')
     print("Dimensions of the data: ", data.shape)

     tsne_1 = TSNE(n_components=2, perplexity=30)
     transformed_data = tsne_1.fit_transform(data)
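
     Since tSNE output is usually inspected by eye, a natural next step (assumed here, not part of the slide) is a 2-D scatter of the embedding with matplotlib:

     import matplotlib.pyplot as plt

     # transformed_data comes from the snippet above (one 2-D point per row of tsne_data.npy)
     plt.scatter(transformed_data[:, 0], transformed_data[:, 1], s=5)
     plt.xlabel('tSNE component 1')
     plt.ylabel('tSNE component 2')
     plt.show()
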
  33. 10 STEPS TO BEING A DATA SCIENTIST (PATENT PENDING)
     1. from sklearn import *
     2. Read the docs
     3. Read the docs
     4. Read the docs
     5. Read the docs
     6. Read the docs
     7. Read the docs
     8. Read the docs
     9. Read the docs
     10. Read the docs