George Ho
October 03, 2018

# Probabilistic and Bayesian Matrix Factorizations for Text Clustering

Most work on matrix factorization techniques focuses on dimensionality reduction: that is, the problem of finding two factor matrices that faithfully reconstruct the original matrix when multiplied together. However, I was interested in applying the same techniques to a different task: text clustering.

A natural question is: why is matrix factorization a good technique to use for text clustering? Because it is simultaneously a clustering and a feature engineering technique: not only does it offer us a latent representation of the original data, but it also gives us a way to easily reconstruct the original data from the latent variables! This is something that latent Dirichlet allocation, for instance, cannot do.

I experimented with using these techniques to cluster subreddits. In a nutshell, nothing seemed to work out very well, and I opine on why I think that’s the case in this slide deck. This talk was delivered to a graduate-level course in frequentist machine learning.


## Transcript

1. Probabilistic and Bayesian
Matrix Factorizations
for Text Clustering
(a.k.a. A Series of Unfortunate Events)
George Ho

2. tldr of Reddit project
(Skipping a lot of stuff)

3. What is Reddit?
● Comprised of many communities,
called subreddits.
○ Each has its own rules and
moderators.
● 5th most popular website in the U.S.
● Free speech!

4. /r/theredpill

5. /r/The_Donald

6. We have a way
to take subreddits and

7. Non-negative matrix factorization
● Unsupervised learning.
○ Part-based decomposition!
● Gives us a latent space.
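
As a rough illustration of how NMF yields both a latent space and cluster assignments, here is a minimal sketch using the classic multiplicative-update rules. The toy matrix `V`, the helper name `nmf`, and the "largest latent weight" assignment rule are assumptions made for this example, not the setup used in the Reddit project.

```python
import numpy as np


def nmf(V, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate non-negative V (docs x terms) as W @ H via multiplicative updates."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps  # document-to-topic weights (the latent space)
    H = rng.random((k, m)) + eps  # topic-to-term weights (the "parts")
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Hypothetical toy document-term counts with two groups of documents.
V = np.array([
    [3, 2, 0, 0],
    [6, 4, 0, 0],
    [0, 0, 5, 1],
    [0, 0, 10, 2],
], dtype=float)

W, H = nmf(V, k=2)
labels = W.argmax(axis=1)  # hard cluster = latent dimension with the largest weight
reconstruction = W @ H     # the same factors also map back to the original space
```

Note that `labels` is always defined, whatever the quality of the factorization, which is exactly the first shortcoming listed on the next slide.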

8. Shortcomings
1. NMF always returns clusters, even if they are bad.
2. Short comments get clustered basically randomly.

9. Bayesian Machine Learning
in 3 Slides
“Yeah, that should be enough.”

10. tldr: Bayesianism and Bayes’ Theorem
● If something is unknown, it is a random variable,
and therefore has a probability distribution.
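
For reference, Bayes' theorem in the form this slide relies on, writing θ for the unknowns and D for the observed data (both symbols are my notation, not the slide's):

```latex
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)},
\qquad
p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta
```

The prior p(θ) is the distribution assigned to the unknown before seeing data, and the posterior p(θ | D) is what MCMC and VI try to recover.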

11. tldr: MCMC vs. VI
Markov-chain Monte Carlo
● Obtains samples from the posterior
● Exact (at least asymptotically)
● Slow
Variational inference
● Approximates the posterior simply
● Approximate
● Fast
(Disclaimer: I am not qualified to even pretend I know this stuff)
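
To make "obtains samples from the posterior" concrete, here is a small sketch of a random-walk Metropolis sampler on a toy problem where the exact posterior is known. This is a deliberately simple stand-in, not NUTS or the Gibbs sampler discussed later, and the model (a Gaussian mean with known noise) and all names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, prior_sd = 1.0, 10.0           # known noise scale, prior scale (assumed values)
y = rng.normal(1.5, sigma, size=20)   # synthetic observations


def log_post(theta):
    """Unnormalized log posterior: Gaussian prior on theta plus Gaussian likelihood."""
    log_prior = -0.5 * (theta / prior_sd) ** 2
    log_lik = -0.5 * np.sum((y - theta) ** 2) / sigma ** 2
    return log_prior + log_lik


def metropolis(n_samples, step=0.5, start=0.0):
    """Random-walk Metropolis: propose a jump, accept with prob min(1, ratio)."""
    samples = np.empty(n_samples)
    theta, lp = start, log_post(start)
    for i in range(n_samples):
        proposal = theta + step * rng.standard_normal()
        lp_proposal = log_post(proposal)
        if np.log(rng.random()) < lp_proposal - lp:
            theta, lp = proposal, lp_proposal
        samples[i] = theta
    return samples


draws = metropolis(20_000)[2_000:]  # discard the first 2,000 draws as burn-in

# Exact conjugate posterior, for comparison with the samples.
post_prec = len(y) / sigma ** 2 + 1 / prior_sd ** 2
post_mean = (y.sum() / sigma ** 2) / post_prec
post_sd = post_prec ** -0.5
```

Here `draws.mean()` and `draws.std()` should land close to `post_mean` and `post_sd`. In one dimension this is cheap; the later slides are about what happens when the posterior is over entire matrices.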

12. tldr: Why use Bayesian ML?
● Allows for expressive and informative priors
● Returns principled uncertainty estimates
● Can be conceptually easier than frequentist methods
http://jakevdp.github.io/blog/2014/03/11/frequentism-and-bayesianism-a-practical-intro/

13. Shortcomings resolved
1. NMF always returns clusters, even if they are bad
a. Returns principled uncertainty estimates
2. Short comments get clustered basically randomly
a. Allows for expressive and informative priors

14. Probabilistic Matrix Factorization
(PMF)

15. tldr: PMF
● Gaussian prior on the
rows (columns) of W (H).
○ Zero mean
○ Variance is a hyperparameter
(controlling regularization)
● Likelihood is assumed to be
Gaussian
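
Written out, the model on this slide and the MAP objective it implies look roughly as follows. V is an assumed name for the documents-by-terms data matrix (the slides name only W and H), W_i is the i-th row of W, and H_j is the j-th column of H.

```latex
V_{ij} \mid W, H \sim \mathcal{N}\!\left(W_i^\top H_j,\ \sigma^2\right), \qquad
W_i \sim \mathcal{N}\!\left(0,\ \sigma_W^2 I\right), \qquad
H_j \sim \mathcal{N}\!\left(0,\ \sigma_H^2 I\right)

-\log p(W, H \mid V)
  = \frac{1}{2\sigma^2} \sum_{i,j} \left(V_{ij} - W_i^\top H_j\right)^2
  + \frac{1}{2\sigma_W^2} \sum_i \lVert W_i \rVert^2
  + \frac{1}{2\sigma_H^2} \sum_j \lVert H_j \rVert^2
  + \text{const}
```

Minimizing the second expression is squared-error factorization with L2 penalties whose strengths are σ²/σ_W² and σ²/σ_H², which is the sense in which the prior variance controls regularization.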

16. NeurIPS 2007
https://papers.nips.cc/paper/3208-probabilistic-matrix-factorization.pdf

17. PMF doesn’t cluster very well…
Words shown on the slide (original column layout lost): steel, government, immigration, rule, difference, economy, different, party, case, work, country, student, unite, factual, worker, college, order, argument, agreement, produce, political
https://gist.github.com/eigenfoo/5ea37677119c28cdefdff49526322ceb
18. Why not?
In high dimensions, Gaussians are very weird.
● Gaussians are practically indistinguishable from uniform distributions
on the unit (hyper)sphere.
● Random Gaussian vectors are approximately orthogonal.
https://www.inference.vc/high-dimensional-gaussian-distributions-are-soap-bubble/
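
Both bullets can be checked numerically. The sketch below draws standard Gaussian vectors in a (hypothetically chosen) 1000-dimensional space and looks at their norms and pairwise cosines.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 2000  # dimension and number of vectors (arbitrary choices)
X = rng.standard_normal((n, d))

# Norms concentrate in a thin shell around sqrt(d): the "soap bubble".
norms = np.linalg.norm(X, axis=1)
relative_spread = norms.std() / norms.mean()

# Cosine similarity between independent pairs of vectors is close to zero.
U = X / norms[:, None]
cosines = np.sum(U[: n // 2] * U[n // 2:], axis=1)
```

`norms.mean()` should be close to `np.sqrt(d)` with a small `relative_spread`, and `cosines` should cluster tightly around zero. Under such a prior, latent vectors have nearly the same length and are nearly orthogonal to each other, which gives a clustering little structure to work with.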

19. Try it with MCMC, instead of MAP!
● The authors use a Gibbs sampler. The general wisdom is to use a more
robust sampler, like the No-U-Turn Sampler (NUTS).
● NUTS returns the worst error possible: sampler diverges!
○ Posterior is very difficult to sample from
■ High dimensional
■ Extremely multimodal
■ Possibly correlated…

20. Other criticisms of PMF
● Does the prior make sense?
○ Do we really expect word counts to be distributed from -∞ to ∞?
● Hyperparameter tuning sucks
○ And is fundamentally un-Bayesian!
○ The hyperparameters are unknown. Therefore, they should have
priors!

21. Bayesian Probabilistic
Matrix Factorization
(BPMF)

22. tldr: BPMF
● Exactly the same as PMF except we place a (hyper)prior on the
parameters of the priors
○ The hyperprior is nontrivial…
■ Wishart prior on the covariance
(I experimented with LKJ priors)
■ Gaussian hyperprior on the mean
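
As I read the BPMF paper, the hierarchy for the rows of W looks roughly like this, with an identical structure for the columns of H. The hyperparameter symbols μ_0, β_0, ν_0 and S_0 are my notation, and note that the paper places the Wishart on the precision matrix Λ_W (the inverse covariance).

```latex
W_i \mid \mu_W, \Lambda_W \sim \mathcal{N}\!\left(\mu_W,\ \Lambda_W^{-1}\right), \qquad
\Lambda_W \sim \mathcal{W}\!\left(S_0,\ \nu_0\right), \qquad
\mu_W \mid \Lambda_W \sim \mathcal{N}\!\left(\mu_0,\ (\beta_0 \Lambda_W)^{-1}\right)
```

The likelihood is unchanged from PMF; only the fixed zero-mean, fixed-variance prior is replaced by one whose mean and precision are themselves inferred.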

23. ICML 2008
https://www.cs.toronto.edu/~amnih/papers/bpmf.pdf

24. Tried running BPMF with MCMC
● Takes ~5 minutes to factorize a 200×10 matrix. Not encouraging!
● Even worse: sampler diverges again!
● Didn’t even bother trying it with VI
○ If you can’t even trust MCMC samples, why should we be able to
approximate the posterior?
○ A diverging sampler is a sign that the samples are flat-out wrong.

25. We can’t get PMF/BPMF to work,
and when we can,
the clusters suck.

26. Do these methods even give good clusters?
● Two metrics for clustering: Calinski-Harabasz and Davies-Bouldin
○ Measure the well-separatedness of clusters…
not whether the clusters have any semantic meaning!
● NMF always produces better scores than PMF
● So PMF and BPMF produce better matrix reconstructions…
but fail to produce well-separated clusterings?
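
To show what "well-separatedness" means here, below is a from-scratch sketch of the Calinski-Harabasz index (ratio of between-cluster to within-cluster dispersion, each scaled by its degrees of freedom). The synthetic blobs and the function name are assumptions for the example; scikit-learn ships both metrics as `sklearn.metrics.calinski_harabasz_score` and `sklearn.metrics.davies_bouldin_score`.

```python
import numpy as np


def calinski_harabasz(X, labels):
    """Between-cluster dispersion over within-cluster dispersion; higher is better."""
    n = X.shape[0]
    clusters = np.unique(labels)
    k = len(clusters)
    overall_mean = X.mean(axis=0)
    between = 0.0
    within = 0.0
    for c in clusters:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        within += np.sum((members - centroid) ** 2)
    return (between / (k - 1)) / (within / (n - k))


rng = np.random.default_rng(0)
# Two synthetic, well-separated blobs in 2D.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 2)),
    rng.normal(5.0, 1.0, size=(50, 2)),
])
true_labels = np.repeat([0, 1], 50)
shuffled_labels = rng.permutation(true_labels)  # same cluster sizes, random assignment

score_true = calinski_harabasz(X, true_labels)
score_shuffled = calinski_harabasz(X, shuffled_labels)
```

The score rewards geometric separation only: `score_true` should far exceed `score_shuffled`, but nothing in the computation looks at whether the clusters mean anything.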

27. Lessons Learned

28. 1. Fully Bayesian methods
are not well-suited
to big-data applications.

29. ● Posterior is probably very difficult to sample from:
○ Large dimensionality (posterior over matrices!)
○ Extremely multimodal
○ Possibly correlated
● There are some things to try…
○ Reparameterizing the model (e.g. noncentered parameterization)
○ Initializing the sampler better
● … but this definitely isn’t Bayesian home turf!
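
For readers unfamiliar with the noncentered parameterization mentioned above, here is a minimal numerical sketch of the idea, outside any probabilistic programming library. The values of `mu` and `sigma` are arbitrary assumptions; in a real hierarchical model they would themselves be sampled.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5  # hypothetical values of the prior's mean and scale
n = 100_000

# Centered: the latent variable is drawn directly from N(mu, sigma^2).
w_centered = rng.normal(mu, sigma, size=n)

# Noncentered: draw a standard normal, then shift and scale deterministically.
z = rng.standard_normal(n)
w_noncentered = mu + sigma * z
```

Both versions define the same distribution over the latent variable, but in the noncentered form the sampler explores `z`, whose scale does not depend on `sigma`. That often removes the funnel-shaped geometry that triggers divergences in hierarchical models.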

30. 2. The literature focuses on
dimensionality reduction,
not clustering.
Isn’t there a difference?

31. Dimensionality reduction vs text clustering
NMF doesn’t just give us a latent
space…
It also gives us an easy way to
reconstruct the original space.
So it both reduces the
dimensionality and clusters!

32. ● Short comments carry little data, so their latent vectors are determined almost entirely by the Gaussian prior.
○ This is probably not amenable to a good clustering!
● PMF/BPMF were born out of collaborative filtering, but we are trying
to do clustering. These tasks are not obviously the same… are they?

33. Thank You!
Questions?
eigenfoo.xyz
@_eigenfoo
eigenfoo
Blog post and slide deck:
eigenfoo.xyz/matrix-factorizations