Visualizing Topic Models

Talk I gave at the Data Science Summit/Dato conference 2015.

Video recording of talk: https://www.youtube.com/watch?v=tGxW2BzC_DU

The majority of the talk was walking through the visualization found in this notebook:

http://nbviewer.ipython.org/github/bmabey/hacker_news_topic_modelling/blob/master/HN%20Topic%20Model%20Talk.ipynb

Abstract from conference:

Topic models are a versatile tool for data scientists. They can be used, as originally intended, to cluster and explore a corpus of documents, or they can be seen as a dimension-reduction technique that provides probabilistic and interpretable results. In this short talk I’ll give an introduction to topic models, show how to use them in GraphLab Create, and explain how to interpret the latent structure they reveal. I will demonstrate pyLDAvis, a visualization tool for topic models, and show how the visualization can be used to help develop a topic model and share the results with others.
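
As a rough illustration of the "dimension-reduction" view, a fitted topic model can be thought of as two tables of probabilities: one row of topic weights per document, and one row of term weights per topic. The sketch below uses invented numbers and hypothetical names (`topic_term`, `doc_topic`, `top_terms`, `dominant_topic`, `term_probability`); it is not taken from the talk's model and does not use GraphLab Create or pyLDAvis.

```python
# Toy illustration only: every number here is invented for the example.

# Term-topic distributions: P(term | topic) over a tiny four-word vocabulary.
topic_term = {
    "topic_0": {"app": 0.5, "game": 0.1, "code": 0.1, "developer": 0.3},
    "topic_1": {"app": 0.1, "game": 0.6, "code": 0.1, "developer": 0.2},
    "topic_2": {"app": 0.1, "game": 0.1, "code": 0.6, "developer": 0.2},
}

# Document-topic distributions: P(topic | document).
# Each document is reduced from a bag of words to three numbers that sum to 1.
doc_topic = {
    "doc_a": {"topic_0": 0.7, "topic_1": 0.2, "topic_2": 0.1},
    "doc_b": {"topic_0": 0.1, "topic_1": 0.1, "topic_2": 0.8},
}


def top_terms(topic, n=2):
    """Return the n most probable terms for a topic (used to interpret it)."""
    dist = topic_term[topic]
    return sorted(dist, key=dist.get, reverse=True)[:n]


def dominant_topic(doc):
    """Return the topic with the highest weight for a document."""
    dist = doc_topic[doc]
    return max(dist, key=dist.get)


def term_probability(doc, term):
    """P(term | doc) = sum over topics of P(term | topic) * P(topic | doc)."""
    return sum(topic_term[t][term] * p for t, p in doc_topic[doc].items())


for doc in doc_topic:
    topic = dominant_topic(doc)
    print(doc, topic, top_terms(topic))
```

Reading the top terms of a document's dominant topic is the same move the talk makes when it labels topics by hand.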

Ben Mabey

July 21, 2015

Transcript

  1. Visualizing Topic Models Ben Mabey @bmabey

  3. Latent Dirichlet Allocation (LDA)

  5. Latent Dirichlet Allocation (LDA)

    Document-Topic Distributions

             0      1      …   k
      doc a  0.25   0.14   …   0.02
      doc b  0.01   0.30   …   0.09
      …      …      …      …   0.31
      doc D  0.13   0.07   …   0.01

    Term-Topic Distributions

              0       1       …   k
      bird    0.002   0.01    …   0.004
      coffee  0.001   0.003   …   0.009
      …       …       …       …   0.031
      work    0.002   0.006   …   0.021
  11. 250k+ stories

    July 2007 - May 2014
    POS tagging w/spaCy
    Phrase detection w/Gensim
    Stopword removal & only kept nouns or phrases with nouns
    Fit LDA models varying the number of topics
  14. Game written by 14 year old passes Angry Birds as the top free iphone app

    Document-Topic Distribution

      Topic   P(T|D)
      58      0.19
      38      0.14
      16      0.06
      …       …

    Sorted Topic-Term Distributions

      58          38           16
      app         game         language
      developer   player       code
      mobile      video game   programming
      user        gaming       java
      app store   developer    programmer
  17. Game written by 14 year old passes Angry Birds as the top free iphone app

    Document-Topic Distribution

      Topic         P(T|D)
      mobile apps   0.19
      video games   0.14
      programming   0.06
      …             …

    Sorted Topic-Term Distributions

      mobile apps   video games   programming
      app           game          language
      developer     player        code
      mobile        video game    programming
      user          gaming        java
      app store     developer     programmer

    Table 2

      58 mobile apps   38 video games   16 programming
      app              game             language
      developer        player           code
      application      video game       programming
      user             gaming           java
      app store        developer        programmer
      mobile           play             programming language
  21. Interpreting Topic Models

    What is the meaning of each topic?
    How prevalent is each topic?
    How do the topics relate to each other?
    How do the documents relate to each other?
  22. Visualizing Topic Models https://de.dariah.eu/tatom/topic_model_visualization.html

  25. Visualizing Topic Models https://dhs.stanford.edu/algorithmic-literacy/using-word-clouds-for-topic-modeling-results/ Please don’t…

  26. LDAvis https://github.com/cpsievert/LDAvis

  27. pyLDAvis https://github.com/bmabey/pyLDAvis

  29. Demo Time!

  30. Distinctiveness & Saliency

    Termite: Visualization Techniques for Assessing Textual Topic Models. Jason Chuang, Christopher D. Manning and Jeffrey Heer. 2012

    measure how much information a term conveys about topics
  31. Distinctiveness & Saliency

                      coding   tech news   video games   distinctiveness   P(w)   saliency
    game              10       10          50            0.03              0.28   0.01
    apple             20       40          20            -0.16             0.32   -0.05
    angry birds       1        1           30            0.25              0.13   0.03
    python            50       5           10            0.17              0.26   0.05
    TOTAL             81       56          110
    P(T|game)         0.14     0.14        0.71
    P(T|apple)        0.25     0.50        0.25
    P(T|angry birds)  0.03     0.03        0.94
    P(T|python)       0.77     0.08        0.15
    P(T)              0.33     0.23        0.45

  32. Distinctiveness & Saliency

    distinctiveness: computes the KL divergence between the distribution of topics given a term and the marginal distribution of topics

  35. Distinctiveness & Saliency

                   distinctiveness   P(w)   saliency
    game           0.03              0.28   0.01
    apple          -0.16             0.32   -0.05
    angry birds    0.56              0.13   0.07
    python         0.17              0.26   0.05

  36. Distinctiveness & Saliency

                   distinctiveness   P(w)   saliency
    game           0.15              0.28   0.04
    apple          0.18              0.32   0.06
    angry birds    0.56              0.13   0.07
    python         0.41              0.26   0.11

  37. Distinctiveness & Saliency

    saliency: distinctiveness weighted by the term's overall frequency
  41. Distinctiveness & Saliency

    measure how much information a term conveys about topics… globally
  43. Thank you! Learn more at http://github.com/bmabey/pyLDAvis Ben Mabey @bmabey http://nbviewer.ipython.org/github/bmabey/hacker_news_topic_modelling/
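
The distinctiveness and saliency figures on the Distinctiveness & Saliency slides can be reproduced from the term counts alone. The sketch below is a minimal reconstruction, not code from the talk or from pyLDAvis: the names (`counts`, `p_topic`, `distinctiveness`, `term_probability`, `saliency`) are hypothetical, and the use of the natural logarithm is an assumption, chosen because it matches the final slide values after rounding to two decimals.

```python
import math

# Term counts per topic, copied from the slides.
topics = ["coding", "tech news", "video games"]
counts = {
    "game":        [10, 10, 50],
    "apple":       [20, 40, 20],
    "angry birds": [1, 1, 30],
    "python":      [50, 5, 10],
}

# Marginal topic distribution P(T): column totals divided by the grand total.
topic_totals = [sum(row[k] for row in counts.values()) for k in range(len(topics))]
grand_total = sum(topic_totals)
p_topic = [total / grand_total for total in topic_totals]


def distinctiveness(term):
    """KL divergence between P(T | term) and P(T), using the natural log (assumed)."""
    term_total = sum(counts[term])
    kl = 0.0
    for count, p_t in zip(counts[term], p_topic):
        p_t_given_term = count / term_total
        if p_t_given_term > 0:  # a zero-probability topic contributes nothing
            kl += p_t_given_term * math.log(p_t_given_term / p_t)
    return kl


def term_probability(term):
    """Overall term frequency P(w)."""
    return sum(counts[term]) / grand_total


def saliency(term):
    """Distinctiveness weighted by the term's overall frequency."""
    return term_probability(term) * distinctiveness(term)


for term in counts:
    print(f"{term:12s} {distinctiveness(term):.2f} "
          f"{term_probability(term):.2f} {saliency(term):.2f}")
```

With these counts, "angry birds" is the most distinctive term because almost all of its occurrences fall in one topic, while "python" ends up the most salient because it is both fairly distinctive and much more frequent.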