
[Tang+ 2014] "Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis"

Jian Tang, Zhaoshi Meng, XuanLong Nguyen, Qiaozhu Mei and Ming Zhang.
"Understanding the Limiting Factors of Topic Modelling via Posterior Contraction Analysis"
31st International Conference on Machine Learning (ICML), Beijing, June 2014.
http://jmlr.org/proceedings/papers/v32/tang14.pdf

Sorami Hisamoto

August 20, 2014


Transcript

  1. Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis
    Jian Tang, Zhaoshi Meng, XuanLong Nguyen, Qiaozhu Mei and Ming Zhang.
    31st International Conference on Machine Learning (ICML), Beijing, June 2014.
    Sorami Hisamoto
    August 20, 2014.


  2. Summary
    1. Theoretical results to explain the convergence behavior of LDA.
    ๏ “How does posterior converge as data increases?”
    ๏ Limiting factors: number of documents, length of docs, number of topics, …
    2. Empirical study to support the theory.
    ๏ Synthetic data: various settings e.g. number of docs / topics, length of docs, …
    ๏ Real data sets: Wikipedia, the New York Times, and Twitter.
    3. Guidelines for the practical use of LDA.
    ๏ Number of docs, length of docs, number of topics
    ๏ Topic / document separation, Dirichlet parameters, …

  6. Posterior Contraction Analysis
     Topic Modeling
     Empirical Study


  8. What is topic modeling?
     ๏ Modeling the latent “topics” of data.
     ๏ Many applications, not limited to text.
     ๏ LDA: the basic topic model (next slide)
     (Figure from [Blei+ 2003]: data, e.g. a document, is mapped to topics, e.g. word distributions.)


  10. Latent Dirichlet allocation (LDA) [Blei+ 2003]
      ๏ It assumes that each document consists of multiple topics.
      ๏ A “topic” is defined as a distribution over a fixed vocabulary.
      Two-stage generation process for each document (a code sketch follows below):
      1. Randomly choose a distribution over topics.
      2. For each word in the document:
      a) Randomly choose a topic from the distribution over topics in step 1.
      b) Randomly choose a word from the corresponding topic.
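      As a concrete illustration of this two-stage process, here is a minimal numpy sketch of the full LDA generative model (my own illustrative code, not from the paper; the topics themselves are also drawn from a symmetric Dirichlet prior, matching the graphical model shown later):

          import numpy as np

          def generate_lda_corpus(D, N, K, V, alpha=1.0, beta=0.01, seed=0):
              """Generate a synthetic corpus with LDA's two-stage process (sketch).

              Returns (true_topics, docs): a K x V array of topic-word distributions
              and a list of D documents, each a list of N word ids.
              """
              rng = np.random.default_rng(seed)
              true_topics = rng.dirichlet(np.full(V, beta), size=K)  # each topic: a distribution over V words
              docs = []
              for _ in range(D):
                  theta = rng.dirichlet(np.full(K, alpha))           # step 1: distribution over topics
                  z = rng.choice(K, size=N, p=theta)                 # step 2a: a topic per word position
                  words = [int(rng.choice(V, p=true_topics[k])) for k in z]  # step 2b: a word per topic
                  docs.append(words)
              return true_topics, docs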



  17. (Figure from [Blei 2011]: a topic is a distribution over the vocabulary; Step 1 chooses a distribution over topics; Step 2a chooses a topic from that distribution; Step 2b chooses a word from the chosen topic.)


  25. Graphical model representation (figures from [Blei 2011]): the topics, the per-document topic proportions, the per-word topic assignments, and the observed words, together with the joint distribution of hidden and observed variables (written out below).
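      For reference, the joint distribution mentioned on this slide, written in the standard notation of [Blei 2011] / [Blei 2012] (β are the topics, θ_d the per-document topic proportions, z_{d,n} the per-word topic assignments, w_{d,n} the observed words):

          p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})
            = \prod_{k=1}^{K} p(\beta_k)
              \prod_{d=1}^{D} p(\theta_d)
              \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n})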


  32. Geometric interpretation (figure from [Blei+ 2003])
      Topics are points in the word simplex.
      Step 1: choose a distribution over topics. Step 2a: choose a topic from that distribution. Step 2b: choose a word from the topic.
      LDA: finding the optimal sub-simplex to represent documents.


  33. “Reversing” the generation process
      ๏ We are interested in the posterior distribution:
      ๏ the latent topic structure, given the observed documents.
      ๏ But exact inference is intractable → approximate it:
      ๏ 1. Sampling-based methods (e.g. Gibbs sampling; a minimal sampler sketch follows below)
      ๏ 2. Variational methods (e.g. variational Bayes)
      ๏ etc.
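      The paper's experiments use collapsed Gibbs sampling for inference. Purely as an illustration of that idea, here is a bare-bones collapsed Gibbs sampler in numpy (my own sketch, not the authors' implementation):

          import numpy as np

          def collapsed_gibbs_lda(docs, K, V, alpha=1.0, beta=0.01, n_iters=200, seed=0):
              """Minimal collapsed Gibbs sampler for LDA (illustrative sketch only).

              docs: list of documents, each a list of word ids in [0, V).
              Returns (topics, theta): K x V topic-word distributions and
              D x K document-topic proportions, estimated from the final sample.
              """
              rng = np.random.default_rng(seed)
              D = len(docs)
              ndk = np.zeros((D, K))          # document-topic counts
              nkw = np.zeros((K, V))          # topic-word counts
              nk = np.zeros(K)                # tokens per topic
              z = [rng.integers(K, size=len(doc)) for doc in docs]   # random initial assignments

              for d, doc in enumerate(docs):
                  for w, k in zip(doc, z[d]):
                      ndk[d, k] += 1
                      nkw[k, w] += 1
                      nk[k] += 1

              for _ in range(n_iters):
                  for d, doc in enumerate(docs):
                      for i, w in enumerate(doc):
                          k = z[d][i]
                          ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1   # remove this token
                          # full conditional p(z = k | all other assignments)
                          p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                          k = rng.choice(K, p=p / p.sum())
                          z[d][i] = k
                          ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1   # add it back

              topics = (nkw + beta) / (nk[:, None] + V * beta)
              theta = (ndk + alpha) / (ndk.sum(axis=1, keepdims=True) + K * alpha)
              return topics, theta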


  37. FAQs on LDA
      ๏ Is my data topic-model “friendly”?
      ๏ Why did LDA fail on my data?
      ๏ How many documents do I need to learn 100 topics?
      ๏ Machine learning folklore …

  39. Posterior Contraction Analysis
      Topic Modeling
      Empirical Study


  41. Convergence behavior of the posterior
      ๏ How does the posterior's convergence behavior change as the data increases?
      ๏ → The paper introduces a metric describing a contracting neighborhood, centred at the true topic values, on which the posterior distribution will be shown to place most of its probability mass.
      ๏ The faster the contraction, the more efficient the statistical inference.


  42. … but it is difficult to analyze individual topics
      ๏ Issue of identifiability
      ๏ “Label-switching” issue: one can only identify the topic collection up to a permutation.
      ๏ Any vector that can be expressed as a convex combination of the topic parameters would be hard to identify and analyze.


  46. Latent topic polytope in LDA
      Topic polytope: the convex hull of the topics (the topics are its vertices).
      Distance between two polytopes: “minimum-matching” Euclidean distance (a code sketch follows below).
      Figures from [Tang+ 2014]
      * Intuitively, this metric is a stable measure of the dissimilarity between two topic polytopes.
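      A numpy sketch of the minimum-matching Euclidean distance as I read it from [Tang+ 2014] / [Nguyen 2012]: match each vertex (topic) of one polytope to its nearest vertex in the other, take the worst such match, and symmetrize (illustrative code, not the authors'):

          import numpy as np

          def min_matching_distance(topics_a, topics_b):
              """Minimum-matching Euclidean distance between two topic polytopes (sketch).

              topics_a, topics_b: arrays of shape (K_a, V) and (K_b, V) whose rows
              are the vertices (topics) of each polytope.
              """
              diff = topics_a[:, None, :] - topics_b[None, :, :]
              dist = np.sqrt((diff ** 2).sum(-1))      # pairwise distances, shape (K_a, K_b)
              a_to_b = dist.min(axis=1).max()          # worst vertex of A vs its nearest vertex in B
              b_to_a = dist.min(axis=0).max()          # worst vertex of B vs its nearest vertex in A
              return max(a_to_b, b_to_a)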


  48. Geometric interpretation (figure from [Blei+ 2003]): the topic polytope is the sub-simplex spanned by the topics inside the word simplex.


  49. Upper bound for the learning rate
      (Figures from [Tang+ 2014]; the bound itself is Theorem 1 in the paper.)
      G*: true topic polytope
      K*: true number of topics
      D: number of documents
      N: length of documents


  53. Observations from Theorem 1
      ๏ From (3), we should have log D < N: the length of documents should be at least on the order of log D, up to a constant factor (a small numeric illustration follows below).
      ๏ In the empirical study, the last term of (5) does not appear to play a noticeable role → the 3rd term may be an artefact of the proof technique.
      ๏ In practice the actual rate could be faster than the given upper bound. However, this looseness of the upper bound only occurs in the exponent; the dependence on 1/D and 1/N should remain, due to a lower bound → Sec. 3.1.4 & [Nguyen 2012].
      ๏ Condition A2: well-separated topics → small β.
      ๏ The convergence rate does not depend on the number of topics K → once K is known, or topics are well-separated, LDA inference is statistically efficient.
      ๏ In practice we do not know K*: under-fitting results in a persistent error even with an infinite amount of data, so we are most likely to prefer the over-fitted setting (K >> K*).
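      A quick numeric reading of the first point (my own illustration, not from the paper): log D grows very slowly, so even for large corpora a modest document length already satisfies N > log D.

          import math

          for D in (1_000, 100_000, 10_000_000):
              print(f"D = {D:,}  ->  log D = {math.log(D):.1f}")
          # prints log D of roughly 6.9, 11.5 and 16.1: a few dozen words per document already exceeds log D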



  61. Theorem for general situations
      When neither condition A1 nor A2 in Theorem 1 holds, the upper bound deteriorates with K.
      cf. [Nguyen 2012] for more detail.


  62. Posterior Contraction Analysis
      Topic Modeling
      Empirical Study



  66. Empirical study: metrics
      Distance between two polytopes: “minimum-matching” Euclidean distance (as sketched above).
      When the number of vertices of a polytope in general position is smaller than the number of dimensions, all such vertices are also the extreme points of their convex hull.


  67. Experiments on synthetic data
      ๏ Create synthetic data sets by the LDA generative process.
      ๏ Default settings:
      ๏ true number of topics K*: 3
      ๏ vocabulary size |V|: 5,000
      ๏ symmetric Dirichlet prior for topic proportions (α): 1
      ๏ symmetric Dirichlet prior for word distributions (β): 0.01
      ๏ Model inference: collapsed Gibbs sampling.
      ๏ Learning error: posterior mean of the metric.
      ๏ Reported results: averaged over 30 simulations (a sketch of one such run follows below).
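      A rough sketch of how one setting of this synthetic study could be reproduced, reusing the generate_lda_corpus, collapsed_gibbs_lda and min_matching_distance sketches from earlier slides (all of those are my own illustrative helpers, not the authors' code; the error below is a point estimate rather than the posterior mean used in the paper):

          import numpy as np

          def scenario_error(D, N, K_fit, K_true=3, V=5000, alpha=1.0, beta=0.01, n_sims=30):
              """Average learning error for one (D, N, K_fit) setting (illustrative sketch)."""
              errors = []
              for s in range(n_sims):
                  true_topics, docs = generate_lda_corpus(D, N, K_true, V, alpha, beta, seed=s)
                  est_topics, _ = collapsed_gibbs_lda(docs, K_fit, V, alpha, beta)
                  errors.append(min_matching_distance(true_topics, est_topics))
              return float(np.mean(errors))

          # e.g. Scenario I: fixed N = 500, increasing D, exact-fitted vs over-fitted
          # for D in (10, 100, 1000, 7000):
          #     print(D, scenario_error(D, N=500, K_fit=3), scenario_error(D, N=500, K_fit=10))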



  71. Scenario I: fixed N and increasing D
      ๏ N = 500
      ๏ D = 10 to 7,000 (main varying term, compared in the graphs)
      ๏ K
      ๏ = 3 = K*: exact-fitted
      ๏ = 10: over-fitted
      ๏ β
      ๏ = 0.01: well-separated topics
      ๏ = 1: more word-diffuse, less distinguishable topics
      (Graph panels compare β = 0.01 vs β = 1, and K = K* vs K > K*.)
      1. Same β but different K: when LDA is over-fitted (i.e. K > K*), the performance degenerates significantly.
      2. Same K but different β: with a larger β, the error curves decay faster while little data is available; as more data becomes available the decay slows, then flattens out. By contrast, a small β results in a more efficient learning rate.
      3. K = K*: the error rate seems to match (log D / D)^0.5 quite well. In the over-fitted case, the rate is slower.


  73. Scenario II: fixed D and increasing N
      ๏ N = 10 to 1,400 (main varying term, compared in the graphs)
      ๏ D = 1,000
      ๏ K
      ๏ = 3 = K*: exact-fitted
      ๏ = 5: over-fitted
      ๏ β
      ๏ = 0.01: well-separated topics
      ๏ = 1: more word-diffuse, less distinguishable topics
      Behavior similar to Scenario I. In over-fitted cases (K > K*), the error fails to vanish even as N becomes large, possibly due to the log D / D term in the upper bound.


  77. Scenario III: N = D, both increasing
      ๏ N = D: 10 to 1,300
      ๏ K = {3, 5}
      ๏ β = {0.01, 1}
      As in the previous scenarios, LDA is most effective in the exact-fitted setting (K = K*) with sparse topics (small β). When both conditions fail, the error fails to converge to zero even as the data size D = N increases.
      The empirical error decays faster than indicated by the upper bound (log D / D)^0.5 from Thm. 1. A rough estimate could be Ω(1/D), which matches the theoretical lower bound of the error contraction rate (cf. Thm. 3 in [Nguyen 2012]). This suggests that the upper bound in Thm. 1 could be quite conservative in certain configurations and scenarios.


  81. Exponential exponents of the error rate
      ๏ 2 scenarios:
      ๏ Fixed N = 5 and increasing D.
      ๏ D = N, both increasing.
      Exact-fitted (K = K*): the slope of the log error seems close to 1 → matches the lower bound Ω(1/D).
      Over-fitted (K > K*): the slopes tend toward the range bounded by 1/(2K) = 0.1 and 2/K = 0.4 → approximations of the exponents of the lower/upper bounds given by the theory.


  82. Experiments on real data sets
      ๏ Wikipedia, New York Times articles, and Twitter.
      ๏ To test the effects of the four limiting factors: N, D, α, β.
      ๏ Ground-truth topics unknown → use PMI (defined below) or perplexity.
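      The coherence metric is not spelled out on the slide; for reference, the standard pointwise mutual information between two words, which topic-coherence scores are commonly built from by averaging over each topic's top word pairs (this is the common definition, not necessarily the exact variant used in [Tang+ 2014]):

          \mathrm{PMI}(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_i)\, p(w_j)}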


  84. Results on real data sets: the New York Times, Wikipedia, and Twitter, under four settings: fixed D with increasing N; fixed N with increasing D; fixed N & D with varying α; fixed N & D with varying β.
      Results are consistent with the theory and with the empirical analysis on synthetic data.
      With extreme data (e.g. very short or very few documents), or when hyperparameters are not appropriately set, performance suffers.
      Results suggest favorable ranges of the parameters: small β; small α (Wikipedia) or large α (NYT, Twitter).


  85. Implications and guidelines: 1 & 2
      1. Number of documents: D
      ๏ Impossible to guarantee identification of topics from a small D, no matter how long the documents are.
      ๏ Once D is sufficiently large, further increases may not significantly improve the result, unless N is also suitably increased.
      ๏ In practice, LDA achieves comparable results even if thousands of documents are sampled from a much larger collection.
      2. Length of documents: N
      ๏ Poor results are expected when N is small, even if D is large.
      ๏ Ideally, N needs to be sufficiently long, but it need not be too long.
      ๏ In practice, for very long documents, one can sample a fraction of each document and LDA still yields comparable topics.


  86. Implications and guidelines: 3, 4, & 5
      3. Number of topics: K
      ๏ If K > K*, inference may become inefficient.
      ๏ In theory, the convergence rate deteriorates quickly to a nonparametric rate, depending on the number of topics used to fit LDA → be careful not to use too large a K.
      4. Topic / document separation: LDA performs well when …
      ๏ Topics are well-separated.
      ๏ Individual documents are associated mostly with a small subset of topics.
      5. Hyperparameters
      ๏ If you think each document is associated with few topics, set α small (e.g. 0.1).
      ๏ If the topics are known to be word-sparse, set β small (e.g. 0.01) → more efficient learning (see the example below).
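      As a concrete illustration of guideline 5, this is how such priors could be set with gensim's LdaModel (my own example with toy data; gensim's alpha is the document-topic Dirichlet α and eta is the topic-word Dirichlet β):

          from gensim import corpora, models

          # toy corpus: a list of tokenized documents (placeholder data)
          texts = [["topic", "model", "lda"], ["word", "distribution", "topic"]]
          dictionary = corpora.Dictionary(texts)
          corpus = [dictionary.doc2bow(t) for t in texts]

          # few topics per document -> small alpha; word-sparse topics -> small beta (eta)
          lda = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
                                alpha=0.1, eta=0.01, passes=10)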


  87. Limitations of existing results
      1. Geometrically intuitive assumptions
      ๏ e.g. in reality we do not know how well-separated the topics are, nor whether their convex hull is geometrically degenerate.
      ๏ → it may be beneficial to impose additional geometric constraints on the prior.
      2. True / approximated posterior
      ๏ Here the true posterior distribution is considered.
      ๏ In practice, the posterior is obtained by approximation techniques → additional error.


  88. To summarize …
    1. Theoretical results to explain the convergence behavior of LDA.
    ๏ “How does posterior converge as data increases?”
    ๏ Limiting factors: number of documents, length of docs, number of topics, …
    2. Empirical study to support the theory.
    ๏ Synthetic data: various settings e.g. number of docs / topics, length of docs, …
    ๏ Real data sets: Wikipedia, the New York Times, and Twitter.
    3. Guidelines for the practical use of LDA.
    ๏ Number of docs, length of docs, number of topics
    ๏ Topic / document separation, Dirichlet parameters, …

  89. Some references (1)
      ๏ [Blei & Lafferty 2009] Topic Models. http://www.cs.princeton.edu/~blei/papers/BleiLafferty2009.pdf
      ๏ [Blei 2011] Introduction to Probabilistic Topic Models. https://www.cs.princeton.edu/~blei/papers/Blei2011.pdf
      ๏ [Blei 2012] Review Articles: Probabilistic Topic Models. Communications of the ACM. http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf
      ๏ [Blei 2012] Probabilistic Topic Models. Machine Learning Summer School. http://www.cs.princeton.edu/~blei/blei-mlss-2012.pdf
      ๏ Topic Models by David Blei (video). https://www.youtube.com/watch?v=DDq3OVp9dNA
      ๏ What is a good explanation of Latent Dirichlet Allocation? - Quora. http://www.quora.com/What-is-a-good-explanation-of-Latent-Dirichlet-Allocation
      ๏ The LDA Buffet is Now Open; or, Latent Dirichlet Allocation for English Majors, by Matthew L. Jockers. http://www.matthewjockers.net/2011/09/29/the-lda-buffet-is-now-open-or-latent-dirichlet-allocation-for-english-majors/
      ๏ [Mochihashi & Ishiguro 2013] Probabilistic Topic Models. The Institute of Statistical Mathematics, FY2012 (H24) public lecture. http://www.ism.ac.jp/~daichi/lectures/ISM-2012-TopicModels-daichi.pdf


  90. Some references (2)
      ๏ [Blei+ 2003] Latent Dirichlet Allocation. Journal of Machine Learning Research. http://machinelearning.wustl.edu/mlpapers/paper_files/BleiNJ03.pdf
      ๏ [Nguyen 2012] Posterior contraction of the population polytope in finite admixture models. arXiv preprint arXiv:1206.0068. http://arxiv.org/abs/1206.0068
      ๏ [Tang+ 2014] Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis. Proceedings of the 31st International Conference on Machine Learning (ICML). http://jmlr.org/proceedings/papers/v32/tang14.pdf
