Visualizing Topic Models

Talk I gave at the Data Science Summit/Dato conference in 2015.

Video recording of the talk: https://www.youtube.com/watch?v=tGxW2BzC_DU

Most of the talk walked through the visualization in this notebook:

http://nbviewer.ipython.org/github/bmabey/hacker_news_topic_modelling/blob/master/HN%20Topic%20Model%20Talk.ipynb

Abstract from conference:
Visualizing Topic Models

Topic models are a versatile tool for data scientists. They can be used, as originally intended, to cluster and explore a corpus of documents, or they can be seen as a dimension-reduction technique that provides probabilistic and interpretable results. In this short talk I’ll give an introduction to topic models, show how to use them in GraphLab Create, and explain how to interpret the latent structure they reveal. I will demonstrate pyLDAvis, a visualization tool for topic models, and show how the visualization can be used to help develop a topic model and to share the results with others.

Ben Mabey

July 21, 2015

Transcript

  1. Visualizing Topic Models
    Ben Mabey
    @bmabey

  5. Latent Dirichlet Allocation (LDA)
    Document-Topic Distributions
             0     1     …   k
    doc a    0.25  0.14  …   0.02
    doc b    0.01  0.30  …   0.09
    …        …     …     …   0.31
    doc D    0.13  0.07  …   0.01

    Term-Topic Distributions
             0      1      …   k
    bird     0.002  0.01   …   0.004
    coffee   0.001  0.003  …   0.009
    …        …      …      …   0.031
    work     0.002  0.006  …   0.021
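
As a rough illustration of how the two LDA tables fit together, the sketch below multiplies a document's topic distribution by each topic's term distribution to get the probability of a term in that document. All numbers and names are made up for the example (three topics, three terms) and are not taken from the talk's model.

```python
# Toy LDA output; every number here is illustrative, not from the talk.
# doc_topic[d][t] = P(topic t | document d); each row sums to 1.
doc_topic = {
    "doc a": [0.7, 0.2, 0.1],
    "doc b": [0.1, 0.3, 0.6],
}

# topic_term[t][w] = P(term w | topic t); each topic sums to 1 over the
# vocabulary. (The slide prints this table transposed: terms as rows.)
topic_term = [
    {"bird": 0.6, "coffee": 0.3, "work": 0.1},
    {"bird": 0.1, "coffee": 0.7, "work": 0.2},
    {"bird": 0.2, "coffee": 0.2, "work": 0.6},
]


def term_probability(doc, term):
    """P(term | doc) = sum over topics of P(topic | doc) * P(term | topic)."""
    return sum(p_t * topic_term[t][term] for t, p_t in enumerate(doc_topic[doc]))


for doc in doc_topic:
    probs = {w: round(term_probability(doc, w), 3) for w in topic_term[0]}
    print(doc, probs)
```

Because both tables hold proper probability distributions, the term probabilities for each document also sum to 1.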

  11. 250k+ stories
    July 2007 - May 2014
    POS tagging w/spaCy
    Phrase detection w/Gensim
    Stopword removal & only kept nouns or phrases with nouns
    Fit LDA models varying the number of topics
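
The filtering step can be sketched without the tagging itself. The example assumes tokens have already been POS-tagged (the talk used spaCy) and that detected phrases (the talk used Gensim) arrive as lists of tagged words. The input format, the tiny stopword list, the choice to count proper nouns as nouns, and the function names are all hypothetical, chosen only for illustration.

```python
# Hypothetical input: each item is a list of (word, coarse POS tag) pairs.
# A single token is a one-element list; a detected phrase has several elements.
STOPWORDS = {"the", "a", "as", "by", "of"}  # illustrative, not the talk's list
NOUN_TAGS = {"NOUN", "PROPN"}  # assumption: proper nouns count as nouns


def keep(item):
    """Keep a token or phrase if it contains a noun and is not a stopword."""
    words = [word.lower() for word, _ in item]
    if len(words) == 1 and words[0] in STOPWORDS:
        return False
    return any(tag in NOUN_TAGS for _, tag in item)


def filter_document(items):
    """Return the surviving tokens/phrases, with phrase words joined by '_'."""
    return ["_".join(word.lower() for word, _ in item) for item in items if keep(item)]


tagged = [
    [("Game", "NOUN")],
    [("written", "VERB")],
    [("by", "ADP")],
    [("Angry", "ADJ"), ("Birds", "PROPN")],
    [("top", "ADJ")],
    [("iphone", "PROPN"), ("app", "NOUN")],
]
print(filter_document(tagged))
```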

  14. Game written by 14 year old passes Angry Birds as the top free iphone app
    Document-Topic Distribution
    Topic  P(T|D)
    58     0.19
    38     0.14
    16     0.06
    …      …

    Sorted Topic-Term Distributions
    58         38          16
    app        game        language
    developer  player      code
    mobile     video game  programming
    user       gaming      java
    app store  developer   programmer
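
Reading a document through its topics amounts to two sorts: rank the topics by P(T|D) for the document, then list each topic's highest-probability terms. In the sketch below the P(T|D) values are the ones shown for this headline, while the per-term probabilities are invented so that the resulting order matches the slide; the helper names are hypothetical.

```python
# P(T|D) for the headline, as shown on the slide (remaining topics omitted).
doc_topics = {58: 0.19, 38: 0.14, 16: 0.06}

# Invented P(term | topic) values; only their order mirrors the slide.
topic_terms = {
    58: {"app": 0.05, "developer": 0.04, "mobile": 0.03, "user": 0.02, "app store": 0.01},
    38: {"game": 0.06, "player": 0.04, "video game": 0.03, "gaming": 0.02, "developer": 0.01},
    16: {"language": 0.05, "code": 0.04, "programming": 0.03, "java": 0.02, "programmer": 0.01},
}


def top_topics(dist, n=3):
    """Topic ids sorted by descending probability."""
    return sorted(dist, key=dist.get, reverse=True)[:n]


def top_terms(term_probs, n=5):
    """Terms of one topic sorted by descending probability."""
    return sorted(term_probs, key=term_probs.get, reverse=True)[:n]


for topic in top_topics(doc_topics):
    print(topic, doc_topics[topic], top_terms(topic_terms[topic], n=3))
```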

  17. Game written by 14 year old passes Angry Birds as the top free iphone app
    Document-Topic Distribution
    Topic        P(T|D)
    mobile apps  0.19
    video games  0.14
    programming  0.06
    …            …

    Sorted Topic-Term Distributions
    mobile apps  video games  programming
    app          game         language
    developer    player       code
    mobile       video game   programming
    user         gaming       java
    app store    developer    programmer

    Table 2
    58 mobile apps  38 video games  16 programming
    app             game            language
    developer       player          code
    application     video game      programming
    user            gaming          java
    app store       developer       programmer
    mobile          play            programming language

  21. Interpreting Topic Models
    What is the meaning of each topic?
    How prevalent is each topic?
    How do the topics relate to each other?
    How do the documents relate to each other?

  22. Visualizing Topic Models
    https://de.dariah.eu/tatom/topic_model_visualization.html

  25. Visualizing Topic Models
    https://dhs.stanford.edu/algorithmic-literacy/using-word-clouds-for-topic-modeling-results/
    Please don’t…

  26. LDAvis
    https://github.com/cpsievert/LDAvis

  27. pyLDAvis
    https://github.com/bmabey/pyLDAvis
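
A minimal sketch of the inputs pyLDAvis works from, assuming the five-argument form `pyLDAvis.prepare(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency)`. The toy model is made up, the helper names `toy_inputs` and `show` are hypothetical, and `show` is defined but not called here because it needs pyLDAvis installed.

```python
def toy_inputs():
    """Build a tiny, made-up model in the shape pyLDAvis.prepare expects."""
    vocab = ["app", "game", "code", "java"]
    # One row per topic, one column per vocab term; each row sums to 1.
    topic_term_dists = [
        [0.70, 0.10, 0.10, 0.10],
        [0.10, 0.70, 0.10, 0.10],
        [0.05, 0.05, 0.50, 0.40],
    ]
    # One row per document, one column per topic; each row sums to 1.
    doc_topic_dists = [
        [0.8, 0.1, 0.1],
        [0.2, 0.7, 0.1],
        [0.1, 0.1, 0.8],
    ]
    doc_lengths = [10, 12, 8]       # token count of each document
    term_frequency = [9, 9, 7, 5]   # corpus-wide count of each vocab term
    return topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency


def show(inputs, path="lda.html"):
    """Render the visualization to an HTML file (requires pyLDAvis)."""
    import pyLDAvis  # imported lazily so the rest of the sketch runs without it

    vis = pyLDAvis.prepare(*inputs)
    pyLDAvis.save_html(vis, path)
    return path


topic_term, doc_topic, doc_lengths, vocab, term_frequency = toy_inputs()
print(len(topic_term), "topics,", len(vocab), "terms,", len(doc_topic), "documents")
```

With pyLDAvis installed, `show(toy_inputs())` should write a standalone HTML page; inside a notebook, `pyLDAvis.display(vis)` is the usual alternative.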

  29. Demo Time!

  30. Distinctiveness & Saliency
    Termite: Visualization Techniques for Assessing Textual Topic Models
    Jason Chuang, Christopher D. Manning and Jeffrey Heer. 2012
    measure how much information a term conveys about topics

  32. Distinctiveness & Saliency
                 coding  tech news  video games  distinctiveness   P(w)  saliency
    game             10         10           50             0.03   0.28      0.01
    apple            20         40           20            -0.16   0.32     -0.05
    angry birds       1          1           30             0.25   0.13      0.03
    python           50          5           10             0.17   0.26      0.05
    TOTAL            81         56          110

                      coding  tech news  video games
    P(T|game)           0.14       0.14         0.71
    P(T|apple)          0.25       0.50         0.25
    P(T|angry birds)    0.03       0.03         0.94
    P(T|python)         0.77       0.08         0.15
    P(T)                0.33       0.23         0.45

    Distinctiveness computes the KL divergence between the distribution of topics given a term and the marginal distribution of topics
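
Written out, following the definition in the Termite paper cited earlier in the deck, the KL divergence between the distribution of topics given a term, P(T|w), and the marginal distribution of topics, P(T), is:

```latex
\mathrm{distinctiveness}(w) = \sum_{T} P(T \mid w)\,\log\frac{P(T \mid w)}{P(T)}
```

The sum runs over all topics T. A term whose topic distribution matches the marginal scores zero, and the score grows as the term concentrates in fewer topics.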

  35. Distinctiveness & Saliency
                 coding  tech news  video games  distinctiveness   P(w)  saliency
    game             10         10           50             0.03   0.28      0.01
    apple            20         40           20            -0.16   0.32     -0.05
    angry birds       1          1           30             0.56   0.13      0.07
    python           50          5           10             0.17   0.26      0.05
    TOTAL            81         56          110

                      coding  tech news  video games
    P(T|game)           0.14       0.14         0.71
    P(T|apple)          0.25       0.50         0.25
    P(T|angry birds)    0.03       0.03         0.94
    P(T|python)         0.77       0.08         0.15
    P(T)                0.33       0.23         0.45

    Distinctiveness computes the KL divergence between the distribution of topics given a term and the marginal distribution of topics

  37. Distinctiveness & Saliency
                 coding  tech news  video games  distinctiveness   P(w)  saliency
    game             10         10           50             0.15   0.28      0.04
    apple            20         40           20             0.18   0.32      0.06
    angry birds       1          1           30             0.56   0.13      0.07
    python           50          5           10             0.41   0.26      0.11
    TOTAL            81         56          110

                      coding  tech news  video games
    P(T|game)           0.14       0.14         0.71
    P(T|apple)          0.25       0.50         0.25
    P(T|angry birds)    0.03       0.03         0.94
    P(T|python)         0.77       0.08         0.15
    P(T)                0.33       0.23         0.45

    Distinctiveness computes the KL divergence between the distribution of topics given a term and the marginal distribution of topics
    Saliency is distinctiveness weighted by the term's overall frequency
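
The distinctiveness, P(w) and saliency columns can be recomputed from the term counts in the table. This sketch assumes the natural logarithm, which is the base that reproduces the final figures shown in the deck (0.15, 0.18, 0.56, 0.41 for distinctiveness); the variable and function names are illustrative.

```python
import math

# Term counts per topic, copied from the slide's table.
topics = ["coding", "tech news", "video games"]
counts = {
    "game": [10, 10, 50],
    "apple": [20, 40, 20],
    "angry birds": [1, 1, 30],
    "python": [50, 5, 10],
}

total = sum(sum(row) for row in counts.values())
# Marginal topic distribution P(T): column totals over the grand total.
p_topic = [sum(row[i] for row in counts.values()) / total for i in range(len(topics))]


def distinctiveness(term):
    """KL divergence between P(T | term) and P(T), using the natural log."""
    row = counts[term]
    n = sum(row)
    return sum((c / n) * math.log((c / n) / p) for c, p in zip(row, p_topic) if c > 0)


def saliency(term):
    """Distinctiveness weighted by the term's overall frequency P(w)."""
    return sum(counts[term]) / total * distinctiveness(term)


for term in counts:
    print(f"{term:12s} {distinctiveness(term):.2f} {saliency(term):.2f}")
```

"angry birds" is the most distinctive term because nearly all of its occurrences fall in one topic, but "python" ends up more salient because it is twice as frequent.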

  41. Distinctiveness & Saliency
    measure how much information a term conveys about topics… globally

  43. Thank you!
    Learn more at http://github.com/bmabey/pyLDAvis
    Ben Mabey
    @bmabey
    http://nbviewer.ipython.org/github/bmabey/hacker_news_topic_modelling/
