Probabilistic and Bayesian Matrix Factorizations for Text Clustering

George Ho
October 03, 2018

Most work on matrix factorization techniques focuses on dimensionality reduction: that is, the problem of finding two factor matrices that faithfully reconstruct the original matrix when multiplied together. However, I was interested in applying the exact same techniques to a different task: text clustering.

A natural question is: why is matrix factorization a good technique to use for text clustering? Because it is simultaneously a clustering and a feature engineering technique: not only does it offer us a latent representation of the original data, but it also gives us a way to easily reconstruct the original data from the latent variables! This is something that latent Dirichlet allocation, for instance, cannot do.
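
As a rough sketch of what that means in code (a minimal example using scikit-learn, with a made-up toy corpus and parameter choices), NMF both clusters documents and lets us reconstruct the original term-document matrix:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF

    # Toy corpus; in the project this would be a corpus of Reddit comments.
    documents = [
        "the cat sat on the mat",
        "dogs and cats make great pets",
        "the economy grew again last year",
    ]

    # Bag-of-words (TF-IDF) matrix: one row per document.
    X = TfidfVectorizer().fit_transform(documents)

    # Factorize X ≈ W @ H, with W (documents x topics) and H (topics x words).
    nmf = NMF(n_components=2, random_state=0)
    W = nmf.fit_transform(X)   # latent representation of each document
    H = nmf.components_        # latent representation of each topic

    # Clustering: assign each document to its dominant latent component.
    clusters = W.argmax(axis=1)

    # Feature engineering: reconstruct the original data from the latent factors.
    X_reconstructed = W @ H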

I experimented with using these techniques to cluster subreddits. In a nutshell, nothing seemed to work out very well, and in this slide deck I offer some thoughts on why that is. This talk was delivered to a graduate-level course in frequentist machine learning.

Transcript

  1. Probabilistic and Bayesian
    Matrix Factorizations
    for Text Clustering
    (a.k.a. A Series of Unfortunate Events)
    George Ho

  2. tldr of Reddit project
    (Skipping a lot of stuff)

  3. What is Reddit?
    ● Comprised of many communities,
    called subreddits.
    ○ Each has its own rules and
    moderators.
    ● 5th most popular website in the U.S.
    ● Free speech!

  4. /r/theredpill

  5. /r/The_Donald

  6. We have a way
    to take subreddits and
    tell a story about them

  7. Non-negative matrix factorization
    ● Unsupervised learning.
    ● Strong notions of additivity.
    ○ Part-based decomposition!
    ● Gives us a latent space.

  8. Shortcomings
    1. NMF always returns clusters, even if they are bad.
    2. Short comments get clustered basically randomly.

  9. Bayesian Machine Learning
    in 3 Slides
    “Yeah, that should be enough.”

  10. tldr: Bayesianism and Bayes’ Theorem
    ● If something is unknown, it is a random variable,
    and therefore has a probability distribution.
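
    A toy numerical illustration (not from the slides): treat the unknown bias θ of a coin as a random variable, give it a prior, and Bayes’ theorem, p(θ | data) ∝ p(data | θ) p(θ), turns that prior into a posterior.

        import numpy as np

        # Unknown coin bias theta: a random variable, evaluated on a grid.
        theta = np.linspace(0, 1, 101)
        prior = np.ones_like(theta)              # uniform prior over theta
        likelihood = theta**7 * (1 - theta)**3   # observed 7 heads, 3 tails
        posterior = prior * likelihood           # Bayes' theorem, up to a constant
        posterior /= posterior.sum()             # normalize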

  11. tldr: MCMC vs. VI
    Markov-chain Monte Carlo
    ● Obtains samples from the posterior
    ● Exact (at least asymptotically)
    ● Slow
    Variational inference
    ● Approximates the posterior with a simpler distribution
    ● Approximate
    ● Fast
    (Disclaimer: I am not qualified to even pretend I know this stuff)
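
    A minimal sketch of the two inference options in PyMC3 (the model and all settings here are made up, not the actual Reddit model):

        import numpy as np
        import pymc3 as pm

        data = np.random.randn(100)

        with pm.Model():
            mu = pm.Normal("mu", mu=0.0, sd=10.0)
            pm.Normal("obs", mu=mu, sd=1.0, observed=data)

            # MCMC (NUTS): asymptotically exact posterior samples, but slow.
            trace = pm.sample(1000, tune=1000)

            # VI (ADVI): fast, but only an approximation of the posterior.
            approx = pm.fit(method="advi", n=10000)
            vi_samples = approx.sample(1000)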

  12. tldr: Why use Bayesian ML?
    ● Allows for expressive and informative priors
    ● Returns principled uncertainty estimates
    ● Can be conceptually easier than frequentist methods
    http://jakevdp.github.io/blog/2014/03/11/frequentism-and-bayesianism-a-practical-intro/

  13. Shortcomings resolved
    1. NMF always returns clusters, even if they are bad
    a. Returns principled uncertainty estimates
    2. Short comments get clustered basically randomly
    a. Allows for expressive and informative priors

  14. Probabilistic Matrix Factorization
    (PMF)

  15. tldr: PMF
    ● Gaussian prior on the
    rows (columns) of W (H).
    ○ Zero mean
    ○ Variance is a hyperparameter
    (controlling regularization)
    ● Likelihood is assumed to be
    Gaussian
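
    A minimal PyMC3 sketch of this model (all shapes and hyperparameter values are made up; in the project, V would be a document-term matrix):

        import numpy as np
        import pymc3 as pm

        V = np.random.rand(200, 10)                # observed matrix to factorize
        n_factors = 5
        sigma_w, sigma_h, sigma_v = 1.0, 1.0, 0.5  # variance hyperparameters (regularization)

        with pm.Model():
            # Zero-mean Gaussian priors on the factor matrices.
            W = pm.Normal("W", mu=0.0, sd=sigma_w, shape=(V.shape[0], n_factors))
            H = pm.Normal("H", mu=0.0, sd=sigma_h, shape=(n_factors, V.shape[1]))

            # Gaussian likelihood around the low-rank reconstruction W @ H.
            pm.Normal("V_obs", mu=pm.math.dot(W, H), sd=sigma_v, observed=V)

            # MAP estimate, matching the original paper's point-estimate formulation.
            map_estimate = pm.find_MAP()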

  16. NeurIPS 2007
    https://papers.nips.cc/paper/3208-probabilistic-matrix-factorization.pdf

  17. PMF doesn’t cluster very well…
    Top cluster words shown on the slide: steel, government, immigration,
    rule, difference, economy, different, party, case, work, country,
    student, unite, factual, worker, college, order, argument, agreement,
    produce, political
    https://gist.github.com/eigenfoo/5ea37677119c28cdefdff49526322ceb

  18. Why not?
    In high dimensions, Gaussians are very weird.
    ● Gaussians are practically indistinguishable from uniform distributions
    on the unit (hyper)sphere.
    ● Random Gaussian vectors are approximately orthogonal.
    https://www.inference.vc/high-dimensional-gaussian-distributions-are-soap-bubble/
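
    A quick numerical check of both claims (dimensions chosen arbitrarily):

        import numpy as np

        d = 1000
        x = np.random.randn(10000, d)  # 10,000 standard Gaussian vectors in d dimensions

        # Norms concentrate tightly around sqrt(d): a thin shell, not a solid blob.
        print(np.linalg.norm(x, axis=1).mean(), np.sqrt(d))

        # A random pair of Gaussian vectors is nearly orthogonal (cosine similarity ≈ 0).
        print(x[0] @ x[1] / (np.linalg.norm(x[0]) * np.linalg.norm(x[1])))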

  19. Try it with MCMC, instead of MAP!
    ● The authors use a Gibbs sampler. The general wisdom is to use a more
    robust sampler, like the No-U-Turn Sampler (NUTS).
    ● NUTS returns the worst error possible: sampler diverges!
    ○ Posterior is very difficult to sample from
    ■ High dimensional
    ■ Extremely multimodal
    ■ Possibly correlated…
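
    For the record, this is how divergences show up in PyMC3 (a sketch, using a toy funnel-shaped model rather than the actual PMF posterior):

        import pymc3 as pm

        # Neal's funnel: a toy posterior whose geometry typically makes NUTS
        # diverge, much like the PMF/BPMF posteriors did.
        with pm.Model():
            log_sigma = pm.Normal("log_sigma", mu=0.0, sd=3.0)
            pm.Normal("x", mu=0.0, sd=pm.math.exp(log_sigma), shape=9)
            trace = pm.sample(1000, tune=1000)

        # Any divergent transitions are a sign the samples cannot be trusted.
        print(trace.get_sampler_stats("diverging").sum())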

  20. Other criticisms of PMF
    ● Does the prior make sense?
    ○ Do we really expect word counts to be distributed from -∞ to ∞?
    ● Hyperparameter tuning sucks
    ○ And is fundamentally un-Bayesian!
    ○ The hyperparameters are unknown. Therefore, they should have
    priors!

  21. Bayesian Probabilistic
    Matrix Factorization
    (BPMF)

  22. tldr: BPMF
    ● Exactly the same as PMF except we place a (hyper)prior on the
    parameters of the priors
    ○ The hyperprior is nontrivial…
    ■ Wishart prior on the covariance
    (I experimented with LKJ priors)
    ■ Gaussian hyperprior on the mean
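
    A sketch of the LKJ variant of that hyperprior in PyMC3 (shapes and hyperparameter values are made up):

        import pymc3 as pm

        n_rows, n_factors = 200, 5

        with pm.Model():
            # Gaussian hyperprior on the mean of the rows of W.
            mu_w = pm.Normal("mu_w", mu=0.0, sd=1.0, shape=n_factors)

            # LKJ prior on the covariance of the rows of W
            # (used here in place of the paper's Wishart prior).
            packed_chol = pm.LKJCholeskyCov(
                "chol_w", n=n_factors, eta=2.0, sd_dist=pm.HalfCauchy.dist(2.5)
            )
            chol = pm.expand_packed_triangular(n_factors, packed_chol)

            # The rows of W now share a learned mean and covariance.
            W = pm.MvNormal("W", mu=mu_w, chol=chol, shape=(n_rows, n_factors))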

  23. ICML 2008
    https://www.cs.toronto.edu/~amnih/papers/bpmf.pdf

  24. Tried running BPMF with MCMC
    ● Takes ~5 minutes to factorize a 200×10 matrix. Not encouraging!
    ● Even worse: sampler diverges again!
    ● Didn’t even bother trying it with VI
    ○ If we can’t even trust the MCMC samples, why should we expect a
    variational approximation of the posterior to be any better?
    ○ A diverging sampler is a sign that the samples are flat-out wrong.

  25. We can’t get PMF/BPMF to work,
    and when we can,
    the clusters suck.

  26. Do these methods even give good clusters?
    ● Two metrics for clustering: Calinski-Harabasz and Davies-Bouldin
    ○ Both measure how well-separated the clusters are…
    not whether the clusters have any semantic meaning!
    ● NMF always produces better scores than PMF
    ● So PMF and BPMF produce better matrix reconstructions…
    but fail to produce well-separated clusterings?
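
    Both metrics are available in scikit-learn (a sketch with made-up data; they only score geometric separation):

        import numpy as np
        from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

        W = np.random.rand(100, 5)   # latent document representations (e.g. from NMF or PMF)
        labels = W.argmax(axis=1)    # cluster assignment: dominant latent component

        print(calinski_harabasz_score(W, labels))  # higher is better
        print(davies_bouldin_score(W, labels))     # lower is better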

  27. Lessons Learned

  28. 1. Fully Bayesian methods
    are not well-suited
    to big-data applications.

  29. ● Posterior is probably very difficult to sample from:
    ○ Large dimensionality (posterior over matrices!)
    ○ Extremely multimodal
    ○ Possibly correlated
    ● There are some things to try…
    ○ Reparameterizing the model (e.g. noncentered parameterization)
    ○ Initializing the sampler better
    ● … but this definitely isn’t Bayesian home turf!
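
    For reference, a noncentered parameterization expresses a hierarchical Gaussian in terms of standard-normal draws that are scaled afterwards (a generic PyMC3 sketch, not the PMF model itself):

        import pymc3 as pm

        with pm.Model():
            sigma = pm.HalfNormal("sigma", sd=1.0)

            # Centered form (often hard for NUTS):   x ~ Normal(0, sigma)
            # Noncentered form (often easier):       x = sigma * x_raw, x_raw ~ Normal(0, 1)
            x_raw = pm.Normal("x_raw", mu=0.0, sd=1.0, shape=10)
            x = pm.Deterministic("x", sigma * x_raw)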

  30. 2. The literature focuses on
    dimensionality reduction,
    not clustering.
    Isn’t there a difference?

  31. Dimensionality reduction vs text clustering
    NMF doesn’t just give us a latent
    space…
    It also gives us an easy way to
    reconstruct the original space.
    So it both reduces the
    dimensionality and clusters!

    ● Short comments carry little information, so they are essentially
    assigned the Gaussian prior.
    ○ This is probably not amenable to a good clustering!
    ● PMF/BPMF were born out of collaborative filtering, but we are trying
    to do clustering. These tasks are not obviously the same… are they?

  33. Thank You!
    Questions?
    eigenfoo.xyz
    @_eigenfoo
    eigenfoo
    Blog post and slide deck:
    eigenfoo.xyz/matrix-factorizations
