
Topic Models - AS^2 LT

At BrainPad, Inc.

Sorami Hisamoto

April 17, 2015


Transcript

  1. Topic Models
    Sorami Hisamoto
    AS^2 LT
    April 18, 2015


  2. What is topic modeling?
    • Modeling the latent “topics” of data.
    • Originally a method for text, but not limited to text.
    Figure from [Blei+ 2003]: data (e.g. a document) is associated with topics (e.g. word distributions).
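    For example, in [Blei 2012] a “genetics” topic gives high probability to words like “gene”, “dna”, and “genome”, and a news article about genome sequencing is modeled as a mixture of genetics-like and data-analysis-like topics.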


  3. http://developer.smartnews.com/blog/2013/08/19/lda-based-channel-categorization-in-smartnews/


  4. http://aial.shiroyagi.co.jp/2014/12/トピックモデルに基づく多様性の定量化
    (“Quantifying diversity with topic models”)

  6. http://smrmkt.hatenablog.jp/entry/2014/12/25/205630
    a-mp.jp/article/568

  12. http://mrorii.github.io/blog/2013/12/27/analyzing-dazai-osamu-literature-using-topic-models/

  15. History of topic models
    • Matrix decompositions: LSI, SVD, …
    • 1999: pLSI
    • 2003: LDA
      The same method was independently discovered in population genetics [Pritchard+ 2000].
    • 2003-: extensions of LDA
    • 2007-: scalable algorithms


  16. Latent Dirichlet allocation (LDA) [Blei+ 2003]
    • A “document” is a set of “words”.
    • A “document” consists of multiple “topics”.
    • A “topic” is a distribution over the vocabulary (all possible words).
    • “Words” are generated by “topics”.

    Two-stage generative process for each document (a minimal code sketch follows below):
    1. Randomly choose a distribution over topics.
    2. For each word in the document:
      a) Randomly choose a topic from the distribution over topics chosen in step 1.
      b) Randomly choose a word from the corresponding topic.
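    To make the generative story concrete, here is a minimal sketch of the two-stage process in Python with NumPy. The vocabulary, hyperparameters (alpha, eta), and document length are toy values invented for this illustration; they are not from the slides.

      import numpy as np

      rng = np.random.default_rng(0)

      vocab = ["gene", "dna", "brain", "neuron", "data", "model"]  # toy vocabulary
      K, V, n_words = 2, len(vocab), 10        # number of topics, vocab size, doc length
      alpha, eta = 0.5, 0.5                    # symmetric Dirichlet hyperparameters

      # Each topic is a distribution over the vocabulary.
      topics = rng.dirichlet(np.full(V, eta), size=K)

      # Step 1: choose this document's distribution over topics.
      theta = rng.dirichlet(np.full(K, alpha))

      document = []
      for _ in range(n_words):
          z = rng.choice(K, p=theta)           # Step 2a: choose a topic
          w = rng.choice(V, p=topics[z])       # Step 2b: choose a word from that topic
          document.append(vocab[w])

      print(theta, document)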

  21. Figure from [Blei 2011]
    Topic: a distribution over the vocabulary
    Step 1: choose a distribution over topics
    Step 2a: choose a topic from that distribution
    Step 2b: choose a word from the chosen topic

  26. Graphical model representation
    Figures from [Blei 2011]
    Nodes in the figure: topic proportion, topic assignment, observed word, and topic.
    The model defines the joint probability of the hidden and observed variables.
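    Written out, that joint probability is the following (in the notation of [Blei 2011]: topics β_k, per-document topic proportions θ_d, topic assignments z_{d,n}, and observed words w_{d,n}):

      p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})
        = \prod_{k=1}^{K} p(\beta_k)
          \prod_{d=1}^{D} \Big( p(\theta_d)
          \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \Big)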

  34. Geometric interpretation
    Figure from [Blei+ 2003]
    Topic: a point in the word simplex
    Step 1: choose a distribution over topics (θ)
    Step 2a: choose a topic (z) from that distribution
    Step 2b: choose a word (w) from the chosen topic
    LDA: finding the optimal sub-simplex to represent documents.
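    In symbols (a short restatement, assuming the standard [Blei+ 2003] notation θ, z, w behind the figure's garbled labels): each topic is a point in the word simplex, and a document's word distribution is a convex combination of the topics, so it lies in the sub-simplex they span:

      \beta_k \in \Delta^{V-1} \ (k = 1, \dots, K), \qquad
      \theta_d \in \Delta^{K-1}, \qquad
      p(w \mid \theta_d) = \sum_{k=1}^{K} \theta_{d,k}\, \beta_{k,w}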

  41. “Reversing” the generative process
    • We are interested in the posterior distribution:
      the latent topic structure, given the observed documents.
    • But exact inference is intractable → approximate it:
    • 1. Sampling-based methods (e.g. Gibbs sampling; see the sketch below)
    • 2. Variational methods (e.g. variational Bayes)
    • etc.
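    As an illustration of the sampling-based route, here is a compact collapsed Gibbs sampler for LDA. This is a sketch under assumed symmetric priors alpha and beta, with documents given as lists of integer word ids; it is not the implementation behind the slides.

      import numpy as np

      def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
          """Collapsed Gibbs sampling for LDA. docs: lists of word ids in [0, V)."""
          rng = np.random.default_rng(seed)
          ndk = np.zeros((len(docs), K))   # document-topic counts
          nkw = np.zeros((K, V))           # topic-word counts
          nk = np.zeros(K)                 # total words assigned to each topic
          z = [rng.integers(K, size=len(doc)) for doc in docs]  # random init
          for d, doc in enumerate(docs):
              for n, w in enumerate(doc):
                  k = z[d][n]
                  ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
          for _ in range(iters):
              for d, doc in enumerate(docs):
                  for n, w in enumerate(doc):
                      k = z[d][n]
                      # Remove the current assignment from the counts.
                      ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                      # Full conditional p(z_dn = k | rest), up to a constant factor.
                      p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                      k = rng.choice(K, p=p / p.sum())  # resample the topic
                      z[d][n] = k
                      ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
          phi = (nkw + beta) / (nk[:, None] + V * beta)        # topic-word estimates
          theta = (ndk + alpha) / (ndk.sum(1, keepdims=True) + K * alpha)
          return phi, theta

      # Toy usage: 4 word types, 2 topics.
      phi, theta = gibbs_lda([[0, 1, 1, 2], [2, 3, 3, 0]], V=4, K=2, iters=50)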

  45. Extensions of LDA
    • Hierarchical Dirichlet Processes [Teh+ 2005]
    • Correlated Topic Models [Blei+ 2006]
    • Supervised Topic Models [Blei+ 2007]
    • Topic models with power-law behavior using the Pitman-Yor process [Sato+ 2010]
    • Time series:
      • Dynamic Topic Models [Blei+ 2006]
      • Continuous Time Dynamic Topic Models [Wang+ 2008]
      • Online Multiscale Dynamic Topic Models [Iwata+ 2010]
    • Various learning methods
    • Various scaling algorithms
    • Various applications
    • …


  46. Applications
    • Text analysis: papers, blogs, classical texts, …
    • Video analysis
    • Audio analysis
    • Bioinformatics
    • Network analysis
    • …


  47. Tools
    • Gensim: Python-based, by Radim Řehůřek (see the example below)
    • MALLET: Java-based, from UMass Amherst
    • Stanford Topic Modeling Toolbox: Java-based, from Stanford

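    Of the tools above, Gensim is the quickest to show end to end. A minimal, illustrative run of its LdaModel on a toy corpus (the documents and parameter values below are made up for this example, not from the talk):

      from gensim import corpora, models

      # Toy corpus: each document is a list of (already preprocessed) tokens.
      docs = [
          ["gene", "dna", "genome", "dna"],
          ["neuron", "brain", "synapse"],
          ["dna", "sequencing", "genome"],
          ["brain", "neuron", "cortex"],
      ]

      dictionary = corpora.Dictionary(docs)             # map tokens to integer ids
      corpus = [dictionary.doc2bow(d) for d in docs]    # bag-of-words vectors

      # Train LDA with 2 topics; `passes` sets how many sweeps over the corpus.
      lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)

      for topic_id, words in lda.print_topics():
          print(topic_id, words)                        # top words per topic

      print(lda[corpus[0]])                             # topic mixture of document 0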

  48. References (1): books
    • “トピックモデルによる統計的潜在意味解析” (Statistical Latent Semantic Analysis with Topic Models), Issei Sato, 2015
    • “トピックモデル (機械学習プロフェッショナルシリーズ)” (Topic Models, Machine Learning Professional Series), Tomoharu Iwata, 2015


  49. References (2): papers, videos, and articles
    • [Blei & Lafferty 2009] Topic Models
      http://www.cs.princeton.edu/~blei/papers/BleiLafferty2009.pdf
    • [Blei 2011] Introduction to Probabilistic Topic Models
      https://www.cs.princeton.edu/~blei/papers/Blei2011.pdf
    • [Blei 2012] Probabilistic Topic Models (review article), Communications of the ACM
      http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf
    • [Blei 2012] Probabilistic Topic Models, Machine Learning Summer School
      http://www.cs.princeton.edu/~blei/blei-mlss-2012.pdf
    • Topic Models by David Blei (video)
      https://www.youtube.com/watch?v=DDq3OVp9dNA
    • What is a good explanation of Latent Dirichlet Allocation? - Quora
      http://www.quora.com/What-is-a-good-explanation-of-Latent-Dirichlet-Allocation
    • The LDA Buffet is Now Open by Matthew L. Jockers
      http://www.matthewjockers.net/2011/09/29/
    • [Sato 2012] “My Bookmark”: Latent Topic Models
      http://www.ai-gakkai.or.jp/my-bookmark_vol27-no3/
    • [Mochihashi & Ishiguro 2013] Probabilistic Topic Models, Institute of Statistical Mathematics open lecture (FY2012)
      http://www.ism.ac.jp/~daichi/lectures/ISM-2012-TopicModels-daichi.pdf
    • Links to the Papers Related to Topic Models by Tomonori Masada
      http://tmasada.wikispaces.com/Links+to+the+Papers+Related+to+Topic+Models
