A Guide to Dimension Reduction

Leland McInnes
October 18, 2018

Talk given at PyData NYC 2018 on Dimension Reduction: a quick tour of a broad swathe of the field with a focus on core ideas and intuitions rather than technical details.

Transcript

  1. A Bluffer’s Guide to Dimension Reduction Leland McInnes

  2. Bluffer’s Guides are lighthearted and humorous surveys providing a condensed

    overview of a potentially complicated subject.
  3. Focus on the intuition and core ideas

  4. * = I’m lying, but in a good way

  5. There are only two dimension reduction techniques*

  6. Matrix Factorization Neighbour Graphs

  7. Matrix Factorization: Principal Component Analysis, Non-negative Matrix Factorization, Latent Dirichlet Allocation, Word2Vec, GloVe, Generalised Low Rank Models, Linear Autoencoder, Probabilistic PCA, Sparse PCA

  8. Neighbour Graphs: Locally Linear Embedding, Laplacian Eigenmaps, Hessian Eigenmaps, Local Tangent Space Alignment, t-SNE, UMAP, Isomap, JSE, Spectral Embedding, LargeVis, NeRV

  9. Autoencoders?

  10. Matrix Factorization

  11. (image-only slide)
  12. (image-only slide)
  13. (image-only slide)
  14. (image-only slide)

  15. X ≈ UV, where X is an N × D matrix, U is an N × d matrix, and V is a d × D matrix.

  16. Minimize ∑_{i=1}^{N} ∑_{j=1}^{D} Loss(X_ij, (UV)_ij), subject to constraints…

  17. Generalized Low Rank Models Udell, Horn, Zadeh, Boyd 2016
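
To make the framing concrete, here is a minimal sketch of generic loss-based matrix factorization by gradient descent. It is not the GLRM implementation; the squared-error loss, learning rate, and step count are illustrative assumptions.

```python
import numpy as np

def factorize(X, d, lr=1e-3, n_steps=2000, seed=0):
    """Fit X ~= U @ V by gradient descent on sum_ij (X_ij - (UV)_ij)^2."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    U = rng.normal(scale=0.1, size=(N, d))
    V = rng.normal(scale=0.1, size=(d, D))
    for _ in range(n_steps):
        R = U @ V - X                      # residuals
        gU, gV = R @ V.T, U.T @ R          # gradients (up to a factor of 2)
        U, V = U - lr * gU, V - lr * gV
    return U, V

U, V = factorize(np.random.default_rng(1).normal(size=(100, 20)), d=3)
print(U.shape, V.shape)                    # (100, 3) (3, 20)
```

Swapping in a different elementwise loss or adding constraints on U and V is what distinguishes the methods on the following slides.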

  18. Principal Component Analysis

  19. We can do an awful lot with mean squared error

  20. Classic PCA: Minimize ∑_{i=1}^{N} ∑_{j=1}^{D} (X_ij − (UV)_ij)², with no constraints.
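
Classic PCA needs no iterative optimization: the unconstrained squared-error minimizer comes straight from a truncated SVD of the centered data. A small sketch (scikit-learn's PCA gives the same result up to sign conventions):

```python
import numpy as np

def pca(X, d):
    """Rank-d PCA: centered X ~= U @ V, with U the scores and V the principal directions."""
    Xc = X - X.mean(axis=0)                      # centre each feature
    P, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    U = P[:, :d] * S[:d]                         # N x d embedding (scores)
    V = Vt[:d]                                   # d x D principal directions
    return U, V

U, V = pca(np.random.default_rng(0).normal(size=(200, 10)), d=2)
print(U.shape, V.shape)                          # (200, 2) (2, 10)
```
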
  21. We can make PCA more interpretable by constraining how many

    archetypes can be combined
  22. Sparse PCA: Minimize ∑_{i=1}^{N} ∑_{j=1}^{D} (X_ij − (UV)_ij)², subject to ∥U∥₂ = 1 and ∥U∥₀ ≤ k.

  23. What if we turn the dial to 11?

  24. K-Means*: Minimize ∑_{i=1}^{N} ∑_{j=1}^{D} (X_ij − (UV)_ij)², subject to ∥U∥₂ = 1 and ∥U∥₀ = 1.
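
One way to read the k-means connection: each row of U becomes a one-hot cluster indicator and the rows of V are centroids, so UV reconstructs every point by its assigned centroid. The Lloyd-style loop below is an illustrative sketch of that view, not the talk's derivation.

```python
import numpy as np

def kmeans_factorization(X, k, n_iter=20, seed=0):
    """K-means as X ~= U @ V, with one-hot rows of U and centroid rows in V."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    V = X[rng.choice(N, size=k, replace=False)]            # k x D initial centroids
    for _ in range(n_iter):
        # assignment step: nearest centroid per point -> one-hot rows of U
        dists = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
        U = np.eye(k)[dists.argmin(axis=1)]                # N x k, one non-zero per row
        # update step: each centroid is the mean of its assigned points
        counts = U.sum(axis=0, keepdims=True).T            # k x 1
        V = np.where(counts > 0, U.T @ X / np.maximum(counts, 1), V)
    return U, V

X = np.random.default_rng(1).normal(size=(150, 2))
U, V = kmeans_factorization(X, k=3)
print(np.abs(X - U @ V).mean())   # reconstruction error = distance to assigned centroid
```
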
  25. Non-Negative Matrix Factorization

  26. Only allowing additive combinations of archetypes might be more interpretable…

  27. NMF: Minimize ∑_{i=1}^{N} ∑_{j=1}^{D} (X_ij − (UV)_ij)², subject to U_ij ≥ 0 and V_ij ≥ 0.

  28. NMF: Minimize ∑_{i=1}^{N} ∑_{j=1}^{D} (UV)_ij − X_ij · log((UV)_ij), subject to U_ij ≥ 0 and V_ij ≥ 0.
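
Both NMF objectives above are available off the shelf in scikit-learn; a brief sketch (the component count and other parameters are illustrative):

```python
import numpy as np
from sklearn.decomposition import NMF

X = np.abs(np.random.default_rng(0).normal(size=(100, 30)))  # NMF needs non-negative data

# squared-error objective (slide 27)
frob = NMF(n_components=5, init="nndsvda", max_iter=500)
U_frob = frob.fit_transform(X)       # N x d, entries >= 0
V_frob = frob.components_            # d x D, entries >= 0

# KL-divergence-style objective (slide 28) needs the multiplicative-update solver
kl = NMF(n_components=5, beta_loss="kullback-leibler", solver="mu",
         init="nndsvda", max_iter=500)
U_kl = kl.fit_transform(X)
```
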
  29. Exponential Family PCA

  30. Suppose X ∼ Pr( · ∣ Θ), where Θ = UV.

  31. Let the loss be the negative log likelihood of observing X given Θ.

  32. How to parameterize Pr( · ∣ Θ)? Use the exponential family of distributions!

  33. In general, for an exponential family distribution: −log(P(X_i ∣ Θ_i)) ∝ G(Θ_i) − X_i · Θ_i.

  34. Normal Matrix Factorization: Minimize ∑_{i=1}^{N} ∑_{j=1}^{D} ½((UV)_ij)² − X_ij · (UV)_ij, with no constraints.

  35. Normal Matrix Factorization: Minimize ∑_{i=1}^{N} ∑_{j=1}^{D} ½((UV)_ij)² − X_ij · (UV)_ij + ½(X_ij)², with no constraints.

  36. Normal Matrix Factorization: Minimize ∑_{i=1}^{N} ∑_{j=1}^{D} ½(X_ij − (UV)_ij)², with no constraints.
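
    (Aside: the ½(X_ij)² term added on slide 35 does not depend on U or V, so it does not change the minimizer; completing the square, ½((UV)_ij)² − X_ij · (UV)_ij + ½(X_ij)² = ½(X_ij − (UV)_ij)², which is exactly the squared-error objective on slide 36.)
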
  37. Poisson Matrix Factorization: Minimize ∑_{i=1}^{N} ∑_{j=1}^{D} exp((UV)_ij) − X_ij · (UV)_ij, with no constraints.
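
A small sketch of the Poisson objective above, assuming plain gradient descent (real implementations use better-tuned optimizers; the learning rate, rank, and toy data are assumptions):

```python
import numpy as np

def poisson_mf(X, d, lr=1e-3, n_steps=3000, seed=0):
    """Gradient descent on the slide-37 objective: sum_ij exp((UV)_ij) - X_ij * (UV)_ij."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    U = rng.normal(scale=0.01, size=(N, d))
    V = rng.normal(scale=0.01, size=(d, D))
    for _ in range(n_steps):
        Theta = U @ V
        G = np.exp(Theta) - X                            # d(objective)/d(Theta), elementwise
        U, V = U - lr * (G @ V.T), V - lr * (U.T @ G)    # chain rule through Theta = U @ V
    return U, V

X = np.random.default_rng(1).poisson(lam=3.0, size=(80, 20)).astype(float)
U, V = poisson_mf(X, d=4)
print(U.shape, V.shape)   # exp(U @ V) plays the role of the fitted Poisson mean for X
```
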
  38. Binomial Matrix Factorization, Bernoulli Matrix Factorization, Gamma Matrix Factorization, Beta Matrix Factorization, Exponential Matrix Factorization, …

  39. Latent Dirichlet Allocation

  40. What if Θ_i were the parameters of a multinomial distribution?

  41. Multinomial Matrix Factorization: Minimize ∑_{i=1}^{N} ∑_{j=1}^{D} −X_ij · log((UV)_ij), subject to (UV)1 = 1 and (UV)_ij ≥ 0.

  42. We can add a latent variable k

  43. Let U_ik = P(i|k) and V_kj = P(k|j). Then Θ_ij = ∑_k U_ik · V_kj = ∑_k P(i|k) · P(k|j) = P(i|j).

  44. Probabilistic Latent Semantic Indexing: Minimize ∑_{i=1}^{N} ∑_{j=1}^{D} −X_ij · log((UV)_ij), subject to U1 = 1, V1 = 1 and U_ij ≥ 0, V_ij ≥ 0.

  45. Let’s be Bayesian!

  46. We can apply a Dirichlet prior over the multinomial distributions for U and V.

  47. And that’s LDA* (modulo all the technical details involved in

    the Bayesian inference used for optimization)
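
In practice this chain of ideas (multinomial factorization, Dirichlet priors, approximate Bayesian inference) is what scikit-learn's LatentDirichletAllocation fits on a document-term count matrix. A brief usage sketch with illustrative toy data:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# toy document-term count matrix: N documents x D vocabulary terms (assumed data)
X = np.random.default_rng(0).poisson(lam=1.0, size=(200, 50))

lda = LatentDirichletAllocation(n_components=10, random_state=0)
U = lda.fit_transform(X)                  # N x d: per-document topic proportions
V = lda.components_                       # d x D: per-topic word weights (unnormalized)
V = V / V.sum(axis=1, keepdims=True)      # rows now give P(word | topic)
print(U.shape, V.shape)                   # (200, 10) (10, 50)
```
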
  48. Neighbour Graphs

  49. *

  50. *

  51. How is the graph constructed?

  52. How is the graph laid out in a low dimensional

    space?
  53. Isomap

  54. Graph Construction K-Nearest Neighbours weighted by ambient distance

  55. Complete graph weighted by shortest path length

  56. (image-only slide)
  57. Consider the weighted adjacency matrix: A_ij = w(i, j) if (i, j) ∈ E, and 0 otherwise.

  58. Factor the matrix! (largest eigenvectors)
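
A compact sketch of that recipe: build the k-NN graph with ambient distances, take all-pairs shortest paths, then factor the (double-centred, squared) geodesic distance matrix via its largest eigenvectors, i.e. classical MDS. This is didactic rather than the reference implementation; sklearn.manifold.Isomap does the same job more carefully.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap_sketch(X, n_neighbors=10, d=2):
    """k-NN graph -> geodesic (shortest path) distances -> classical MDS (top eigenvectors)."""
    A = kneighbors_graph(X, n_neighbors, mode="distance")   # sparse, ambient-distance weights
    G = shortest_path(A, directed=False)                    # N x N geodesic distance matrix
    N = G.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N                     # double-centring matrix
    B = -0.5 * J @ (G ** 2) @ J
    vals, vecs = np.linalg.eigh(B)                          # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:d]                        # keep the d largest
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0))

X = np.random.default_rng(0).normal(size=(300, 5))
print(isomap_sketch(X).shape)   # (300, 2)
```
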

  59. Spectral Embedding

  60. Graph Construction Kernel weighted edges*

  61. Compute the graph Laplacian*: L_ij = −w(i, j)/√(d_i · d_j) if i ≠ j, and 1 − w(i, i)/d_i if i = j, where d_i is the total weight of row i.

  62. We have a matrix again…

  63. Factor the matrix! (smallest eigenvectors*)
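
A dense, didactic sketch of those two steps. The full Gaussian kernel and fixed bandwidth σ are simplifying assumptions (hence the asterisks); sklearn.manifold.SpectralEmbedding uses sparse nearest-neighbour affinities instead.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.linalg import eigh

def spectral_embedding_sketch(X, d=2, sigma=1.0):
    """Kernel-weighted graph -> normalized Laplacian -> smallest non-trivial eigenvectors."""
    W = np.exp(-squareform(pdist(X)) ** 2 / (2 * sigma ** 2))   # Gaussian kernel weights
    np.fill_diagonal(W, 0.0)
    deg = W.sum(axis=1)                                         # d_i: total weight of row i
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt            # normalized graph Laplacian
    vals, vecs = eigh(L)                                        # ascending eigenvalues
    return vecs[:, 1:d + 1]                                     # skip the trivial 0-eigenvector

X = np.random.default_rng(0).normal(size=(200, 3))
print(spectral_embedding_sketch(X).shape)   # (200, 2)
```
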

  64. t-SNE (t-distributed Stochastic Neighbour Embedding)

  65. Graph Construction K-Nearest Neighbours* weighted by a kernel with bandwidth

    adapted to the K neighbours
  66. Graph Construction Normalize outgoing edge weights to sum to one

  67. Graph Construction Symmetrize by averaging edge weights between each pair

    of vertices
  68. Graph Construction Renormalize so the total edge weight is one

  69. Use a force directed graph layout!*
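
A sketch of the graph-construction half (slides 65–68), with one loudly flagged assumption: real t-SNE calibrates each point's bandwidth to a target perplexity, while this sketch simply uses the distance to the k-th neighbour. The layout half (slide 69) is the gradient-based optimization that sklearn.manifold.TSNE performs on these affinities.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def tsne_style_affinities(X, k=30):
    """kNN kernel weights with a per-point bandwidth, row-normalize, symmetrize
    by averaging, then renormalize so the total edge weight is one (slides 65-68)."""
    N = X.shape[0]
    dist, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]              # drop self-neighbours
    sigma = dist[:, -1:]                             # per-point bandwidth (assumption)
    w = np.exp(-dist ** 2 / (2 * sigma ** 2))        # kernel-weighted kNN edges
    P = np.zeros((N, N))
    P[np.arange(N)[:, None], idx] = w / w.sum(axis=1, keepdims=True)  # rows sum to one
    P = (P + P.T) / 2                                # symmetrize by averaging
    return P / P.sum()                               # total edge weight = 1

P = tsne_style_affinities(np.random.default_rng(0).normal(size=(100, 10)))
print(P.sum())   # 1.0
```
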

  70. (image-only slide)
  71. (image-only slide)

  72. UMAP (Uniform Manifold Approximation and Projection)

  73. Graph Construction K-Nearest Neighbours weighted according to fancy math* I

    have fun mathematics to explain this which this margin is too small to contain
  74. Use a force directed graph layout!*
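
And in code, the whole UMAP pipeline is a couple of lines with the umap-learn package (the parameter values shown are just the library defaults):

```python
import numpy as np
import umap  # provided by the umap-learn package

X = np.random.default_rng(0).normal(size=(500, 20))

# n_neighbors controls the k-nearest-neighbour graph; min_dist shapes the layout
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2).fit_transform(X)
print(embedding.shape)   # (500, 2)
```
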

  75. Summary

  76. Dimension reduction is built on only a couple of primitives

  77. Framing the problem as a matrix factorization or neighbour graph

    algorithm captures most of the core intuitions
  78. This provides a general framework for understanding almost all dimension

    reduction techniques
  79. Questions? leland.mcinnes@gmail.com @leland_mcinnes