Slide 1

Probabilistic and Bayesian Matrix Factorizations for Text Clustering
(a.k.a. A Series of Unfortunate Events)
George Ho

Slide 2

tldr of Reddit project (Skipping a lot of stuff)

Slide 3

What is Reddit?
● Comprised of many communities, called subreddits.
  ○ Each has its own rules and moderators.
● 5th most popular website in the U.S.
● Free speech!

Slide 4

/r/theredpill

Slide 5

/r/The_Donald

Slide 6

We have a way to take subreddits and tell a story about them

Slide 7

Non-negative matrix factorization
● Unsupervised learning.
● Strong notions of additivity.
  ○ Part-based decomposition!
● Gives us a latent space.
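To make the workflow concrete, here is a minimal sketch of NMF-based text clustering with scikit-learn. The toy corpus, the number of topics, and the hyperparameters are all placeholders for illustration, not the actual Reddit pipeline.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for a corpus of Reddit comments.
comments = [
    "the economy and immigration policy debate",
    "college students and student loan debt",
    "government policy on the economy and workers",
    "student unions and college tuition",
]

# Non-negative document-term matrix (documents x vocabulary).
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(comments)

# Factorize X ≈ W @ H with W, H >= 0; n_components = number of topics/clusters.
nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)   # documents x topics: the latent space
H = nmf.components_        # topics x vocabulary: additive "parts"

# Hard clustering: assign each comment to its dominant topic.
labels = W.argmax(axis=1)

# Top words per topic are the "story" we tell about each cluster.
vocab = np.array(vectorizer.get_feature_names_out())
for k, topic in enumerate(H):
    print(k, vocab[topic.argsort()[::-1][:5]])
```

Taking the argmax over W is just one simple way to turn the soft topic weights into hard cluster labels.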

Slide 8

Shortcomings
1. NMF always returns clusters, even if they are bad.
2. Short comments get clustered basically randomly.

Slide 9

Bayesian Machine Learning in 3 Slides
“Yeah, that should be enough.”

Slide 10

tldr: Bayesianism and Bayes’ Theorem
● If something is unknown, it is a random variable, and therefore has a probability distribution.

Slide 11

tldr: MCMC vs. VI

Markov chain Monte Carlo
● Obtains samples from the posterior
● Exact (at least asymptotically)
● Slow

Variational inference
● Approximates the posterior with a simpler distribution
● Approximate
● Fast

(Disclaimer: I am not qualified to even pretend I know this stuff)
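As a rough sketch of what the two options look like in practice, here is a toy PyMC3-style model fit both ways. The model and data are placeholders, not anything from the Reddit project.

```python
import numpy as np
import pymc3 as pm

data = np.random.randn(100) + 2.0  # toy observations

with pm.Model():
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)
    pm.Normal("obs", mu=mu, sigma=1.0, observed=data)

    # MCMC: draw (asymptotically exact) samples from the posterior. Slower.
    trace = pm.sample(1000, tune=1000)

    # VI: fit a simple approximating distribution (mean-field ADVI). Faster, but approximate.
    approx = pm.fit(method="advi", n=10000)
    vi_samples = approx.sample(1000)
```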

Slide 12

tldr: Why use Bayesian ML?
● Allows for expressive and informative priors
● Returns principled uncertainty estimates
● Can be conceptually easier than frequentist methods

http://jakevdp.github.io/blog/2014/03/11/frequentism-and-bayesianism-a-practical-intro/

Slide 13

Shortcomings resolved
1. NMF always returns clusters, even if they are bad
   a. Returns principled uncertainty estimates
2. Short comments get clustered basically randomly
   a. Allows for expressive and informative priors

Slide 14

Probabilistic Matrix Factorization (PMF)

Slide 15

tldr: PMF
● Gaussian prior on the rows (columns) of W (H).
  ○ Zero mean
  ○ Variance is a hyperparameter (controlling regularization)
● Likelihood is assumed to be Gaussian
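A minimal PMF sketch in PyMC3-style code, under the assumption of a documents × vocabulary matrix X, a fixed latent dimension K, and fixed prior/noise scales (all placeholder values). The original paper fits W and H by maximizing the posterior, i.e. MAP estimation.

```python
import numpy as np
import pymc3 as pm

X = np.random.poisson(1.0, size=(200, 50)).astype("float64")  # stand-in doc-term matrix
n_docs, n_words = X.shape
K = 10                     # latent dimensionality
sigma_w = sigma_h = 1.0    # prior std devs: the regularization hyperparameters
sigma = 1.0                # likelihood noise std dev

with pm.Model() as pmf_model:
    # Zero-mean Gaussian priors on the rows of W and the columns of H.
    W = pm.Normal("W", mu=0.0, sigma=sigma_w, shape=(n_docs, K))
    H = pm.Normal("H", mu=0.0, sigma=sigma_h, shape=(K, n_words))

    # Gaussian likelihood around the low-rank reconstruction W @ H.
    pm.Normal("X_obs", mu=pm.math.dot(W, H), sigma=sigma, observed=X)

    # MAP estimate, which is what (non-Bayesian) PMF effectively optimizes.
    map_estimate = pm.find_MAP()

W_map, H_map = map_estimate["W"], map_estimate["H"]
```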

Slide 16

NeurIPS 2007 https://papers.nips.cc/paper/3208-probabilistic-matrix-factorization.pdf

Slide 17

PMF doesn’t cluster very well…

[Slide shows example top cluster words: steel, government, immigration, rule, difference, economy, different, party, case, work, country, student, unite, factual, worker, college, order, argument, agreement, produce, political]

https://gist.github.com/eigenfoo/5ea37677119c28cdefdff49526322ceb

Slide 18

Why not? In high dimensions, Gaussians are very weird.
● Gaussians are practically indistinguishable from uniform distributions on the unit (hyper)sphere.
● Random Gaussian vectors are approximately orthogonal.

https://www.inference.vc/high-dimensional-gaussian-distributions-are-soap-bubble/
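A quick numerical illustration of both points (the dimension and sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 500
x = rng.standard_normal((n, d))  # n draws from a standard d-dimensional Gaussian

# Norms concentrate tightly around sqrt(d): the mass lives in a thin shell ("soap bubble").
norms = np.linalg.norm(x, axis=1)
print(norms.mean(), norms.std(), np.sqrt(d))  # mean ≈ 31.6, std ≈ 0.7

# Random Gaussian vectors are nearly orthogonal: pairwise cosine similarities ≈ 0.
unit = x / norms[:, None]
cosines = unit @ unit.T
off_diag = cosines[~np.eye(n, dtype=bool)]
print(np.abs(off_diag).mean())  # ≈ 0.03
```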

Slide 19

Try it with MCMC, instead of MAP!
● The authors use a Gibbs sampler. The general wisdom is to use a more robust sampler, like the No-U-Turn Sampler (NUTS).
● NUTS returns the worst error possible: sampler diverges!
  ○ Posterior is very difficult to sample from
    ■ High dimensional
    ■ Extremely multimodal
    ■ Possibly correlated…
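Continuing the earlier PMF sketch, sampling with NUTS (PyMC's default for continuous models) and checking for divergences might look like this; `pmf_model` is the hypothetical model object from that sketch.

```python
with pmf_model:
    # pm.sample() uses NUTS by default for continuous models.
    trace = pm.sample(1000, tune=1000)

# Any divergent transitions mean the samples cannot be trusted.
n_divergent = int(trace.get_sampler_stats("diverging").sum())
print(f"{n_divergent} divergent transitions")
```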

Slide 20

Other criticisms of PMF
● Does the prior make sense?
  ○ Do we really expect word counts to be distributed from -∞ to ∞?
● Hyperparameter tuning sucks
  ○ And is fundamentally un-Bayesian!
  ○ The hyperparameters are unknown. Therefore, they should have priors!

Slide 21

Bayesian Probabilistic Matrix Factorization (BPMF)

Slide 22

tldr: BPMF
● Exactly the same as PMF except we place a (hyper)prior on the parameters of the priors
  ○ The hyperprior is nontrivial…
    ■ Wishart prior on the covariance (I experimented with LKJ priors)
    ■ Gaussian hyperprior on the mean
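A heavily simplified BPMF-style sketch in PyMC3-style code: instead of the full Wishart/LKJ prior on a dense K × K covariance, it places hyperpriors only on per-dimension means and standard deviations. All variable names and hyperparameter values are assumptions for illustration, not the exact model from the talk.

```python
import numpy as np
import pymc3 as pm

X = np.random.poisson(1.0, size=(200, 50)).astype("float64")  # stand-in doc-term matrix
n_docs, n_words = X.shape
K = 10

with pm.Model() as bpmf_model:
    # Hyperpriors on the parameters of the factor priors (instead of hand-tuned constants).
    mu_w = pm.Normal("mu_w", mu=0.0, sigma=1.0, shape=K)
    sigma_w = pm.HalfNormal("sigma_w", sigma=1.0, shape=K)
    mu_h = pm.Normal("mu_h", mu=0.0, sigma=1.0, shape=K)
    sigma_h = pm.HalfNormal("sigma_h", sigma=1.0, shape=K)

    # Factor matrices, sharing the learned hyperparameters across rows/columns.
    W = pm.Normal("W", mu=mu_w, sigma=sigma_w, shape=(n_docs, K))
    H = pm.Normal("H", mu=mu_h[:, None], sigma=sigma_h[:, None], shape=(K, n_words))

    # Gaussian likelihood, exactly as in PMF.
    pm.Normal("X_obs", mu=pm.math.dot(W, H), sigma=1.0, observed=X)
```

The full BPMF model instead uses a Wishart (or LKJ) prior over a full covariance matrix, plus a Gaussian hyperprior on the mean vector.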

Slide 23

ICML 2008 https://www.cs.toronto.edu/~amnih/papers/bpmf.pdf

Slide 24

Tried running BPMF with MCMC
● Takes ~5 minutes to factorize a 200×10 matrix. Not encouraging!
● Even worse: the sampler diverges again!
● Didn’t even bother trying it with VI
  ○ If we can’t even trust the MCMC samples, why should we expect a variational approximation of the posterior to be any good?
  ○ A diverging sampler is a sign that the samples are flat-out wrong.

Slide 25

We can’t get PMF/BPMF to work, and when we can, the clusters suck.

Slide 26

Do these methods even give good clusters?
● Two metrics for clustering: Calinski-Harabasz and Davies-Bouldin
  ○ Measure the well-separatedness of clusters… not whether the clusters have any semantic meaning!
● NMF always produces better scores than PMF
● So PMF and BPMF produce better matrix reconstructions… but fail to produce well-separated clusterings?
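Both metrics are available in scikit-learn. A minimal sketch of scoring a clustering, with toy data standing in for the document latent vectors and cluster labels:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

# Toy features and labels standing in for W and the NMF/PMF cluster assignments.
X, _ = make_blobs(n_samples=500, centers=5, random_state=0)
labels = KMeans(n_clusters=5, random_state=0).fit_predict(X)

# Higher Calinski-Harabasz is better; lower Davies-Bouldin is better.
# Note: both only measure geometric separation, not semantic quality.
print(calinski_harabasz_score(X, labels))
print(davies_bouldin_score(X, labels))
```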

Slide 27

Lessons Learned

Slide 28

1. Fully Bayesian methods are not well-suited to big-data applications.

Slide 29

● Posterior is probably very difficult to sample from:
  ○ Large dimensionality (posterior over matrices!)
  ○ Extremely multimodal
  ○ Possibly correlated
● There are some things to try…
  ○ Reparameterizing the model (e.g. noncentered parameterization)
  ○ Initializing the sampler better
● … but this definitely isn’t Bayesian home turf!
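For example, a non-centered parameterization of a hierarchical Gaussian prior, the standard reparameterization trick; shapes and names here are placeholders.

```python
import pymc3 as pm

n_docs, K = 200, 10

with pm.Model():
    mu_w = pm.Normal("mu_w", mu=0.0, sigma=1.0)
    sigma_w = pm.HalfNormal("sigma_w", sigma=1.0)

    # Centered version (often the source of divergences in hierarchical models):
    #   W = pm.Normal("W", mu=mu_w, sigma=sigma_w, shape=(n_docs, K))

    # Non-centered version: sample standardized offsets, then scale and shift.
    W_raw = pm.Normal("W_raw", mu=0.0, sigma=1.0, shape=(n_docs, K))
    W = pm.Deterministic("W", mu_w + sigma_w * W_raw)
```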

Slide 30

2. The literature focuses on dimensionality reduction, not clustering. Isn’t there a difference?

Slide 31

Dimensionality reduction vs. text clustering
NMF doesn’t just give us a latent space… it also gives us an easy way to reconstruct the original space. So it both reduces the dimensionality and clusters!
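Concretely, given factor matrices like the W and H from the earlier scikit-learn sketch (random placeholders here), both views come for free:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((100, 10))   # documents x topics, e.g. nmf.fit_transform(X)
H = rng.random((10, 500))   # topics x vocabulary, e.g. nmf.components_

# Dimensionality reduction: W is the low-dimensional representation of the documents.
docs_latent = W

# Reconstruction of the original document-term space: X ≈ W @ H.
X_reconstructed = W @ H

# Clustering: each document's dominant latent dimension is its cluster.
labels = W.argmax(axis=1)
```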

Slide 32

● Short comments are essentially just assigned the Gaussian prior: there is too little data to move their latent vectors away from it.
  ○ This is probably not amenable to a good clustering!
● PMF/BPMF were born out of collaborative filtering, but we are trying to do clustering. These tasks are not obviously the same… are they?

Slide 33

Thank You! Questions?

eigenfoo.xyz
@_eigenfoo
eigenfoo

Blog post and slide deck: eigenfoo.xyz/matrix-factorizations