Visualizing Topic Models

Talk I gave at the Data Science Summit/Dato conference 2015.

Video recording of talk: https://www.youtube.com/watch?v=tGxW2BzC_DU

Most of the talk walked through the visualization in this notebook:

http://nbviewer.ipython.org/github/bmabey/hacker_news_topic_modelling/blob/master/HN%20Topic%20Model%20Talk.ipynb

Abstract from conference:
Visualizing Topic Models

Topic models are a versatile tool for data scientists. They can be used, as originally intended, to cluster and explore a corpus of documents, or they can be seen as a dimension-reduction technique that provides probabilistic and interpretable results. In this short talk I’ll give an introduction to topic models, how to use them in GraphLab Create, and how to interpret the latent structure they reveal. I will demonstrate pyLDAvis, a visualization tool for topic models, and show how the visualization can be used to help develop a topic model and to share the results with others.

Ben Mabey

July 21, 2015

Transcript

  1. Visualizing Topic Models
    Ben Mabey
    @bmabey

  4. Latent Dirichlet Allocation (LDA)

    Document-Topic Distributions
             0      1      …   k
    doc a    0.25   0.14   …   0.02
    doc b    0.01   0.30   …   0.09
    …        …      …      …   0.31
    doc D    0.13   0.07   …   0.01

    Term-Topic Distributions
             0       1       …   k
    bird     0.002   0.01    …   0.004
    coffee   0.001   0.003   …   0.009
    …        …       …       …   0.031
    work     0.002   0.006   …   0.021
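
The two tables on this slide can be imitated with a small sampling sketch. It assumes the usual LDA setup in which each document's topic mixture and each topic's term distribution are drawn from Dirichlet priors. The function names, the prior values (0.5) and the three-word vocabulary are illustrative assumptions, not taken from the talk, and nothing here fits a model to data.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Draw one point from a symmetric Dirichlet(alpha) of dimension `size`
    by normalizing independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def sample_lda_tables(n_docs, n_topics, vocab, alpha=0.5, beta=0.5, seed=0):
    """Sample made-up document-topic and topic-term tables (assumed priors)."""
    rng = random.Random(seed)
    # One row per document: P(topic | document); each row sums to 1.
    doc_topic = [sample_dirichlet(alpha, n_topics, rng) for _ in range(n_docs)]
    # One row per topic: P(term | topic); each row sums to 1.
    topic_term = [sample_dirichlet(beta, len(vocab), rng) for _ in range(n_topics)]
    return doc_topic, topic_term

vocab = ["bird", "coffee", "work"]  # terms borrowed from the slide
doc_topic, topic_term = sample_lda_tables(n_docs=4, n_topics=3, vocab=vocab)
```

Note that `topic_term` stores one row per topic, so it is the transpose of the slide's term-topic table, where each topic is a column.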

  9. 250k+ Hacker News stories
    July 2007 - May 2014
    POS tagging w/spaCy
    Phrase detection w/Gensim
    Stopword removal & only kept nouns or phrases with nouns
    Fit LDA models varying the number of topics
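
The filtering step on this slide (drop stopwords, keep only nouns or phrases containing a noun) might look roughly like the sketch below. It does not call spaCy or Gensim. It assumes the tokens have already been POS-tagged with Universal POS tags (the kind spaCy exposes as `token.pos_`) and already grouped into phrases, and it joins phrase words with an underscore as Gensim's phrase detection does by default. The function name, the sample input and the stopword set are hypothetical.

```python
NOUN_TAGS = {"NOUN", "PROPN"}  # Universal POS tags for common and proper nouns

def keep_noun_units(units, stopwords):
    """Keep single nouns and multi-word phrases that contain a noun.

    `units` is a list of units; each unit is a list of (word, pos) pairs,
    with one pair for a single token and several for a detected phrase.
    """
    kept = []
    for unit in units:
        words = [word.lower() for word, _ in unit]
        if len(words) == 1 and words[0] in stopwords:
            continue  # drop single-token stopwords
        if any(pos in NOUN_TAGS for _, pos in unit):
            kept.append("_".join(words))  # a phrase becomes one term
    return kept

# Hypothetical, hand-tagged input for illustration only.
units = [
    [("The", "DET")],
    [("game", "NOUN")],
    [("passes", "VERB")],
    [("Angry", "PROPN"), ("Birds", "PROPN")],
    [("top", "ADJ")],
    [("app", "NOUN"), ("store", "NOUN")],
]
terms = keep_noun_units(units, stopwords={"the", "as", "a"})
```

The resulting `terms` list is what would be counted per document before fitting LDA models with different numbers of topics.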

  12. Game written by 14 year old passes Angry Birds as the top free iphone app

    Document-Topic Distribution
    Topic   P(T|D)
    58      0.19
    38      0.14
    16      0.06
    …       …

    Sorted Topic-Term Distributions
    58          38           16
    app         game         language
    developer   player       code
    mobile      video game   programming
    user        gaming       java
    app store   developer    programmer
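
Both tables on this slide are the result of sorting a probability distribution and keeping the top entries. A minimal sketch of that lookup follows. The `top_n` helper is a hypothetical name, the document-topic values are the three shown on the slide, and the term probabilities for topic 58 are invented for illustration (the slide gives only the term order).

```python
def top_n(distribution, n):
    """Return the n (key, probability) pairs with the highest probability."""
    return sorted(distribution.items(), key=lambda kv: kv[1], reverse=True)[:n]

# P(T|D) for the example story: only the three topics shown on the slide.
doc_topics = {16: 0.06, 58: 0.19, 38: 0.14}

# Hypothetical P(term|topic) values for topic 58; only the term order
# (app, developer, mobile, user, app store) comes from the slide.
topic_58_terms = {
    "user": 0.02, "app": 0.05, "app store": 0.01,
    "mobile": 0.03, "developer": 0.04,
}

top_topics = top_n(doc_topics, 3)
top_terms = [term for term, _ in top_n(topic_58_terms, 5)]
```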

  15. Game written by 14 year old passes Angry Birds as the top free iphone app

    Document-Topic Distribution
    Topic         P(T|D)
    mobile apps   0.19
    video games   0.14
    programming   0.06
    …             …

    Sorted Topic-Term Distributions
    mobile apps   video games   programming
    app           game          language
    developer     player        code
    mobile        video game    programming
    user          gaming        java
    app store     developer     programmer

    Table 2
    58 mobile apps   38 video games   16 programming
    app              game             language
    developer        player           code
    application      video game       programming
    user             gaming           java
    app store        developer        programmer
    mobile           play             programming language

  19. Interpreting Topic Models
    What is the meaning of each topic?
    How prevalent is each topic?
    How do the topics relate to each other?
    How do the documents relate to each other?

  20. Visualizing Topic Models
    https://de.dariah.eu/tatom/topic_model_visualization.html

  23. Visualizing Topic Models
    https://dhs.stanford.edu/algorithmic-literacy/using-word-clouds-for-topic-modeling-results/
    Please don’t…

  24. LDAvis
    https://github.com/cpsievert/LDAvis

  25. pyLDAvis
    https://github.com/bmabey/pyLDAvis
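
As a rough guide to what pyLDAvis needs from a fitted model, the sketch below assembles and sanity-checks the five inputs that `pyLDAvis.prepare` takes (`topic_term_dists`, `doc_topic_dists`, `doc_lengths`, `vocab`, `term_frequency`). The helper names and the toy numbers are made up for illustration. The call into pyLDAvis is wrapped in a function that is not run here, and the toy data is only meant to exercise the checks.

```python
def check_prepare_inputs(data, tol=1e-6):
    """Sanity-check the five inputs before handing them to pyLDAvis.prepare."""
    n_topics = len(data["topic_term_dists"])
    n_terms = len(data["vocab"])
    n_docs = len(data["doc_lengths"])
    assert len(data["term_frequency"]) == n_terms
    assert len(data["doc_topic_dists"]) == n_docs
    for row in data["topic_term_dists"]:   # one row per topic, P(term | topic)
        assert len(row) == n_terms and abs(sum(row) - 1.0) < tol
    for row in data["doc_topic_dists"]:    # one row per document, P(topic | doc)
        assert len(row) == n_topics and abs(sum(row) - 1.0) < tol
    return True

def prepare_visualization(data):
    """Build the visualization data; requires the third-party pyLDAvis package."""
    import pyLDAvis  # imported lazily so the checks above run without it
    return pyLDAvis.prepare(**data)

# Toy, made-up model with 2 topics, 3 terms and 2 documents.
toy = {
    "topic_term_dists": [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],
    "doc_topic_dists": [[0.9, 0.1], [0.2, 0.8]],
    "doc_lengths": [12, 8],
    "vocab": ["app", "game", "code"],
    "term_frequency": [9, 5, 6],
}
inputs_ok = check_prepare_inputs(toy)
```

In a notebook, the object returned by `prepare_visualization` for a real model can be passed to `pyLDAvis.display` to render the interactive view.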

  27. Demo Time!

  28. Distinctiveness & Saliency
    Termite: Visualization Techniques for Assessing Textual Topic Models
    Jason Chuang, Christopher D. Manning and Jeffrey Heer. 2012
    measure how much information a term conveys about topics
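
For reference, the two measures as defined in the Termite paper can be written as follows, where w is a term, T ranges over topics, P(T|w) is the distribution of topics given the term, and P(T) is the marginal distribution of topics. The notation is chosen here to match the P(T|w), P(T) and P(w) labels used in the slides.

```latex
\[
\mathrm{distinctiveness}(w) = \sum_{T} P(T \mid w)\,\log\frac{P(T \mid w)}{P(T)}
\qquad
\mathrm{saliency}(w) = P(w)\cdot\mathrm{distinctiveness}(w)
\]
```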

  30. Distinctiveness & Saliency

                  coding   tech news   video games   distinctiveness   P(w)   saliency
    game          10       10          50            0.03              0.28   0.01
    apple         20       40          20            -0.16             0.32   -0.05
    angry birds   1        1           30            0.25              0.13   0.03
    python        50       5           10            0.17              0.26   0.05
    TOTAL         81       56          110

                       coding   tech news   video games
    P(T|game)          0.14     0.14        0.71
    P(T|apple)         0.25     0.50        0.25
    P(T|angry birds)   0.03     0.03        0.94
    P(T|python)        0.77     0.08        0.15
    P(T)               0.33     0.23        0.45

    Distinctiveness computes the KL divergence between the distribution of topics given a term and the marginal distribution of topics.

  33. Distinctiveness & Saliency

                  coding   tech news   video games   distinctiveness   P(w)   saliency
    game          10       10          50            0.03              0.28   0.01
    apple         20       40          20            -0.16             0.32   -0.05
    angry birds   1        1           30            0.56              0.13   0.07
    python        50       5           10            0.17              0.26   0.05
    TOTAL         81       56          110

                       coding   tech news   video games
    P(T|game)          0.14     0.14        0.71
    P(T|apple)         0.25     0.50        0.25
    P(T|angry birds)   0.03     0.03        0.94
    P(T|python)        0.77     0.08        0.15
    P(T)               0.33     0.23        0.45

    Distinctiveness computes the KL divergence between the distribution of topics given a term and the marginal distribution of topics.

  35. Distinctiveness & Saliency

                  coding   tech news   video games   distinctiveness   P(w)   saliency
    game          10       10          50            0.15              0.28   0.04
    apple         20       40          20            0.18              0.32   0.06
    angry birds   1        1           30            0.56              0.13   0.07
    python        50       5           10            0.41              0.26   0.11
    TOTAL         81       56          110

                       coding   tech news   video games
    P(T|game)          0.14     0.14        0.71
    P(T|apple)         0.25     0.50        0.25
    P(T|angry birds)   0.03     0.03        0.94
    P(T|python)        0.77     0.08        0.15
    P(T)               0.33     0.23        0.45

    Distinctiveness computes the KL divergence between the distribution of topics given a term and the marginal distribution of topics.
    Saliency is distinctiveness weighted by the term's overall frequency.
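
The final numbers in this table can be recomputed from the term counts alone. The sketch below treats the four listed terms as the whole vocabulary, as the slide's TOTAL row does, and uses the natural logarithm, which is an assumption that happens to reproduce the slide's rounded values. The function name is illustrative.

```python
import math

# Term counts per topic (coding, tech news, video games), copied from the slide.
counts = {
    "game": [10, 10, 50],
    "apple": [20, 40, 20],
    "angry birds": [1, 1, 30],
    "python": [50, 5, 10],
}

def distinctiveness_and_saliency(counts):
    """Return {term: (distinctiveness, P(w), saliency)} from term-topic counts."""
    n_topics = len(next(iter(counts.values())))
    total = sum(sum(row) for row in counts.values())
    # Marginal topic distribution P(T): column totals over the grand total.
    p_topic = [sum(row[t] for row in counts.values()) / total
               for t in range(n_topics)]
    result = {}
    for term, row in counts.items():
        term_total = sum(row)
        p_term = term_total / total                         # P(w)
        p_topic_given_term = [c / term_total for c in row]  # P(T|w)
        # KL divergence of P(T|w) from P(T); zero-probability topics add nothing.
        kl = sum(p * math.log(p / q)
                 for p, q in zip(p_topic_given_term, p_topic) if p > 0)
        result[term] = (kl, p_term, p_term * kl)
    return result

scores = distinctiveness_and_saliency(counts)
for term, (dist, p_w, sal) in scores.items():
    print(f"{term:12s} distinctiveness={dist:.2f}  P(w)={p_w:.2f}  saliency={sal:.2f}")
```

"angry birds" is the most distinctive term because almost all of its occurrences fall in one topic, while "python" ends up most salient because it is both fairly distinctive and twice as frequent.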

  39. Distinctiveness & Saliency
    measure how much information a term conveys about topics… globally

  40. Thank you!
    Learn more at http://github.com/bmabey/pyLDAvis
    Ben Mabey
    @bmabey
    http://nbviewer.ipython.org/github/bmabey/hacker_news_topic_modelling/