A Guide to Dimension Reduction

Leland McInnes
October 18, 2018

Talk given at PyData NYC 2018 on Dimension Reduction: a quick tour of a broad swathe of the field with a focus on core ideas and intuitions rather than technical details.

Transcript

  1. A Bluffer’s Guide to Dimension Reduction Leland McInnes

  2. Bluffer’s Guides are lighthearted and humorous surveys providing a condensed

    overview of a potentially complicated subject.
  3. Focus on the intuition and core ideas

  4. * = I’m lying, but in a good way

  5. There are only two dimension reduction techniques*

  6. Matrix Factorization Neighbour Graphs

  7. Matrix Factorization: Principal Component Analysis, Non-negative Matrix Factorization, Latent Dirichlet Allocation, Word2Vec, GloVe, Generalised Low Rank Models, Linear Autoencoder, Probabilistic PCA, Sparse PCA

  8. Neighbour Graphs: Locally Linear Embedding, Laplacian Eigenmaps, Hessian Eigenmaps, Local Tangent Space Alignment, t-SNE, UMAP, Isomap, JSE, Spectral Embedding, LargeVis, NeRV

  9. Autoencoders?

  10. Matrix Factorization

  11. (image-only slide)
  12. (image-only slide)
  13. (image-only slide)
  14. (image-only slide)

  15. X ≈ UV, where X is an N × D matrix, U is an N × d matrix, and V is a d × D matrix.

  16. Minimize ∑_{i=1}^{N} ∑_{j=1}^{D} Loss(X_ij, (UV)_ij), subject to constraints…

  17. Generalized Low Rank Models Udell, Horn, Zadeh, Boyd 2016
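
To make the framing concrete, here is a minimal sketch of generic loss-based matrix factorization by gradient descent. It is not the GLRM implementation; the squared-error loss, learning rate, and step count are illustrative assumptions.

```python
import numpy as np

def factorize(X, d, lr=1e-3, n_steps=2000, seed=0):
    """Fit X ~= U @ V by gradient descent on sum_ij (X_ij - (UV)_ij)^2."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    U = rng.normal(scale=0.1, size=(N, d))
    V = rng.normal(scale=0.1, size=(d, D))
    for _ in range(n_steps):
        R = U @ V - X                      # residuals
        gU, gV = R @ V.T, U.T @ R          # gradients (up to a factor of 2)
        U, V = U - lr * gU, V - lr * gV
    return U, V

U, V = factorize(np.random.default_rng(1).normal(size=(100, 20)), d=3)
print(U.shape, V.shape)                    # (100, 3) (3, 20)
```

Swapping in a different elementwise loss or adding constraints on U and V is what distinguishes the methods on the following slides.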

  18. Principal Component Analysis

  19. We can do an awful lot with mean squared error

  20. Classic PCA: Minimize ∑_{i=1}^{N} ∑_{j=1}^{D} (X_ij − (UV)_ij)², with no constraints.
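
Classic PCA needs no iterative optimization: the unconstrained squared-error minimizer comes straight from a truncated SVD of the centered data. A small sketch (scikit-learn's PCA gives the same result up to sign conventions):

```python
import numpy as np

def pca(X, d):
    """Rank-d PCA: centered X ~= U @ V, with U the scores and V the principal directions."""
    Xc = X - X.mean(axis=0)                      # centre each feature
    P, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    U = P[:, :d] * S[:d]                         # N x d embedding (scores)
    V = Vt[:d]                                   # d x D principal directions
    return U, V

U, V = pca(np.random.default_rng(0).normal(size=(200, 10)), d=2)
print(U.shape, V.shape)                          # (200, 2) (2, 10)
```
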
  21. We can make PCA more interpretable by constraining how many

    archetypes can be combined
  22. Sparse PCA: Minimize ∑_{i=1}^{N} ∑_{j=1}^{D} (X_ij − (UV)_ij)², subject to ∥U∥₂ = 1 and ∥U∥₀ ≤ k.

  23. What if we turn the dial to 11?

  24. K-Means*: Minimize ∑_{i=1}^{N} ∑_{j=1}^{D} (X_ij − (UV)_ij)², subject to ∥U∥₂ = 1 and ∥U∥₀ = 1.
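
One way to read the k-means connection: each row of U becomes a one-hot cluster indicator and the rows of V are centroids, so UV reconstructs every point by its assigned centroid. The Lloyd-style loop below is an illustrative sketch of that view, not the talk's derivation.

```python
import numpy as np

def kmeans_factorization(X, k, n_iter=20, seed=0):
    """K-means as X ~= U @ V, with one-hot rows of U and centroid rows in V."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    V = X[rng.choice(N, size=k, replace=False)]            # k x D initial centroids
    for _ in range(n_iter):
        # assignment step: nearest centroid per point -> one-hot rows of U
        dists = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
        U = np.eye(k)[dists.argmin(axis=1)]                # N x k, one non-zero per row
        # update step: each centroid is the mean of its assigned points
        counts = U.sum(axis=0, keepdims=True).T            # k x 1
        V = np.where(counts > 0, U.T @ X / np.maximum(counts, 1), V)
    return U, V

X = np.random.default_rng(1).normal(size=(150, 2))
U, V = kmeans_factorization(X, k=3)
print(np.abs(X - U @ V).mean())   # reconstruction error = distance to assigned centroid
```
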
  25. Non-Negative Matrix Factorization

  26. Only allowing additive combinations of archetypes might be more interpretable…

  27. NMF: Minimize ∑_{i=1}^{N} ∑_{j=1}^{D} (X_ij − (UV)_ij)², subject to U_ij ≥ 0 and V_ij ≥ 0.

  28. NMF: Minimize ∑_{i=1}^{N} ∑_{j=1}^{D} (UV)_ij − X_ij · log((UV)_ij), subject to U_ij ≥ 0 and V_ij ≥ 0.
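
Both NMF objectives above are available off the shelf in scikit-learn; a brief sketch (the component count and other parameters are illustrative):

```python
import numpy as np
from sklearn.decomposition import NMF

X = np.abs(np.random.default_rng(0).normal(size=(100, 30)))  # NMF needs non-negative data

# squared-error objective (slide 27)
frob = NMF(n_components=5, init="nndsvda", max_iter=500)
U_frob = frob.fit_transform(X)       # N x d, entries >= 0
V_frob = frob.components_            # d x D, entries >= 0

# KL-divergence-style objective (slide 28) needs the multiplicative-update solver
kl = NMF(n_components=5, beta_loss="kullback-leibler", solver="mu",
         init="nndsvda", max_iter=500)
U_kl = kl.fit_transform(X)
```
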
  29. Exponential Family PCA

  30. Suppose X ∼ Pr( · ∣ Θ), where Θ = UV.

  31. Let the loss be the negative log likelihood of observing X given Θ.

  32. How to parameterize Pr( · ∣ Θ)? Use the exponential family of distributions!

  33. In general, for an exponential family distribution: −log(P(X_i ∣ Θ_i)) ∝ G(Θ_i) − X_i · Θ_i.

  34. Normal Matrix Factorization: Minimize ∑_{i=1}^{N} ∑_{j=1}^{D} ½((UV)_ij)² − X_ij · (UV)_ij, with no constraints.

  35. Normal Matrix Factorization: Minimize ∑_{i=1}^{N} ∑_{j=1}^{D} ½((UV)_ij)² − X_ij · (UV)_ij + ½(X_ij)², with no constraints.

  36. Normal Matrix Factorization: Minimize ∑_{i=1}^{N} ∑_{j=1}^{D} ½(X_ij − (UV)_ij)², with no constraints.
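
    (Aside: the ½(X_ij)² term added on slide 35 does not depend on U or V, so it does not change the minimizer; completing the square, ½((UV)_ij)² − X_ij · (UV)_ij + ½(X_ij)² = ½(X_ij − (UV)_ij)², which is exactly the squared-error objective on slide 36.)
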
  37. Poisson Matrix Factorization: Minimize ∑_{i=1}^{N} ∑_{j=1}^{D} exp((UV)_ij) − X_ij · (UV)_ij, with no constraints.
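
A small sketch of the Poisson objective above, assuming plain gradient descent (real implementations use better-tuned optimizers; the learning rate, rank, and toy data are assumptions):

```python
import numpy as np

def poisson_mf(X, d, lr=1e-3, n_steps=3000, seed=0):
    """Gradient descent on the slide-37 objective: sum_ij exp((UV)_ij) - X_ij * (UV)_ij."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    U = rng.normal(scale=0.01, size=(N, d))
    V = rng.normal(scale=0.01, size=(d, D))
    for _ in range(n_steps):
        Theta = U @ V
        G = np.exp(Theta) - X                            # d(objective)/d(Theta), elementwise
        U, V = U - lr * (G @ V.T), V - lr * (U.T @ G)    # chain rule through Theta = U @ V
    return U, V

X = np.random.default_rng(1).poisson(lam=3.0, size=(80, 20)).astype(float)
U, V = poisson_mf(X, d=4)
print(U.shape, V.shape)   # exp(U @ V) plays the role of the fitted Poisson mean for X
```
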
  38. Binomial Matrix Factorization, Bernoulli Matrix Factorization, Gamma Matrix Factorization, Beta Matrix Factorization, Exponential Matrix Factorization, …

  39. Latent Dirichlet Allocation

  40. What if Θ_i were the parameters of a multinomial distribution?

  41. Multinomial Matrix Factorization: Minimize ∑_{i=1}^{N} ∑_{j=1}^{D} −X_ij · log((UV)_ij), subject to (UV)1 = 1 and (UV)_ij ≥ 0.

  42. We can add a latent variable k

  43. Let U_ik = P(i|k) and V_kj = P(k|j). Then Θ_ij = ∑_k U_ik · V_kj = ∑_k P(i|k) · P(k|j) = P(i|j).

  44. Probabilistic Latent Semantic Indexing: Minimize ∑_{i=1}^{N} ∑_{j=1}^{D} −X_ij · log((UV)_ij), subject to U1 = 1, V1 = 1 and U_ij ≥ 0, V_ij ≥ 0.

  45. Let’s be Bayesian!

  46. We can apply a Dirichlet prior over the multinomial distributions for U and V.

  47. And that’s LDA* (modulo all the technical details involved in

    the Bayesian inference used for optimization)
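
In practice this chain of ideas (multinomial factorization, Dirichlet priors, approximate Bayesian inference) is what scikit-learn's LatentDirichletAllocation fits on a document-term count matrix. A brief usage sketch with illustrative toy data:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# toy document-term count matrix: N documents x D vocabulary terms (assumed data)
X = np.random.default_rng(0).poisson(lam=1.0, size=(200, 50))

lda = LatentDirichletAllocation(n_components=10, random_state=0)
U = lda.fit_transform(X)                  # N x d: per-document topic proportions
V = lda.components_                       # d x D: per-topic word weights (unnormalized)
V = V / V.sum(axis=1, keepdims=True)      # rows now give P(word | topic)
print(U.shape, V.shape)                   # (200, 10) (10, 50)
```
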
  48. Neighbour Graphs

  49. *

  50. *

  51. How is the graph constructed?

  52. How is the graph laid out in a low dimensional

    space?
  53. Isomap

  54. Graph Construction K-Nearest Neighbours weighted by ambient distance

  55. Complete graph weighted by shortest path length

  56. (image-only slide)
  57. Consider the weighted adjacency matrix: A_ij = w(i, j) if (i, j) ∈ E, and 0 otherwise.

  58. Factor the matrix! (largest eigenvectors)
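
A compact sketch of that recipe: build the k-NN graph with ambient distances, take all-pairs shortest paths, then factor the (double-centred, squared) geodesic distance matrix via its largest eigenvectors, i.e. classical MDS. This is didactic rather than the reference implementation; sklearn.manifold.Isomap does the same job more carefully.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap_sketch(X, n_neighbors=10, d=2):
    """k-NN graph -> geodesic (shortest path) distances -> classical MDS (top eigenvectors)."""
    A = kneighbors_graph(X, n_neighbors, mode="distance")   # sparse, ambient-distance weights
    G = shortest_path(A, directed=False)                    # N x N geodesic distance matrix
    N = G.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N                     # double-centring matrix
    B = -0.5 * J @ (G ** 2) @ J
    vals, vecs = np.linalg.eigh(B)                          # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:d]                        # keep the d largest
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0))

X = np.random.default_rng(0).normal(size=(300, 5))
print(isomap_sketch(X).shape)   # (300, 2)
```
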

  59. Spectral Embedding

  60. Graph Construction Kernel weighted edges*

  61. Compute the graph Laplacian*: L_ij = −w(i, j)/√(d_i · d_j) if i ≠ j, and 1 − w(i, i)/d_i if i = j, where d_i is the total weight of row i.

  62. We have a matrix again…

  63. Factor the matrix! (smallest eigenvectors*)
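
A dense, didactic sketch of those two steps. The full Gaussian kernel and fixed bandwidth σ are simplifying assumptions (hence the asterisks); sklearn.manifold.SpectralEmbedding uses sparse nearest-neighbour affinities instead.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.linalg import eigh

def spectral_embedding_sketch(X, d=2, sigma=1.0):
    """Kernel-weighted graph -> normalized Laplacian -> smallest non-trivial eigenvectors."""
    W = np.exp(-squareform(pdist(X)) ** 2 / (2 * sigma ** 2))   # Gaussian kernel weights
    np.fill_diagonal(W, 0.0)
    deg = W.sum(axis=1)                                         # d_i: total weight of row i
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt            # normalized graph Laplacian
    vals, vecs = eigh(L)                                        # ascending eigenvalues
    return vecs[:, 1:d + 1]                                     # skip the trivial 0-eigenvector

X = np.random.default_rng(0).normal(size=(200, 3))
print(spectral_embedding_sketch(X).shape)   # (200, 2)
```
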

  64. t-SNE (t-distributed Stochastic Neighbour Embedding)

  65. Graph Construction K-Nearest Neighbours* weighted by a kernel with bandwidth

    adapted to the K neighbours
  66. Graph Construction Normalize outgoing edge weights to sum to one

  67. Graph Construction Symmetrize by averaging edge weights between each pair

    of vertices
  68. Graph Construction Renormalize so the total edge weight is one

  69. Use a force directed graph layout!*
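
A sketch of the graph-construction half (slides 65–68), with one loudly flagged assumption: real t-SNE calibrates each point's bandwidth to a target perplexity, while this sketch simply uses the distance to the k-th neighbour. The layout half (slide 69) is the gradient-based optimization that sklearn.manifold.TSNE performs on these affinities.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def tsne_style_affinities(X, k=30):
    """kNN kernel weights with a per-point bandwidth, row-normalize, symmetrize
    by averaging, then renormalize so the total edge weight is one (slides 65-68)."""
    N = X.shape[0]
    dist, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]              # drop self-neighbours
    sigma = dist[:, -1:]                             # per-point bandwidth (assumption)
    w = np.exp(-dist ** 2 / (2 * sigma ** 2))        # kernel-weighted kNN edges
    P = np.zeros((N, N))
    P[np.arange(N)[:, None], idx] = w / w.sum(axis=1, keepdims=True)  # rows sum to one
    P = (P + P.T) / 2                                # symmetrize by averaging
    return P / P.sum()                               # total edge weight = 1

P = tsne_style_affinities(np.random.default_rng(0).normal(size=(100, 10)))
print(P.sum())   # 1.0
```
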

  70. (image-only slide)
  71. (image-only slide)

  72. UMAP (Uniform Manifold Approximation and Projection)

  73. Graph Construction K-Nearest Neighbours weighted according to fancy math* I

    have fun mathematics to explain this which this margin is too small to contain
  74. Use a force directed graph layout!*
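
And in code, the whole UMAP pipeline is a couple of lines with the umap-learn package (the parameter values shown are just the library defaults):

```python
import numpy as np
import umap  # provided by the umap-learn package

X = np.random.default_rng(0).normal(size=(500, 20))

# n_neighbors controls the k-nearest-neighbour graph; min_dist shapes the layout
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2).fit_transform(X)
print(embedding.shape)   # (500, 2)
```
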

  75. Summary

  76. Dimension reduction is built on only a couple of primitives

  77. Framing the problem as a matrix factorization or neighbour graph

    algorithm captures most of the core intuitions
  78. This provides a general framework for understanding almost all dimension

    reduction techniques
  79. Questions? leland.mcinnes@gmail.com @leland_mcinnes