Slide 1

Slide 1 text

Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis
Jian Tang, Zhaoshi Meng, XuanLong Nguyen, Qiaozhu Mei and Ming Zhang. 31st International Conference on Machine Learning (ICML), Beijing, June 2014. Slides by Sorami Hisamoto, August 20, 2014.

Slide 2

Slide 2 text

Summary
1. Theoretical results to explain the convergence behavior of LDA.
๏ “How does posterior converge as data increases?”
๏ Limiting factors: number of documents, length of docs, number of topics, …
2. Empirical study to support the theory.
๏ Synthetic data: various settings e.g. number of docs / topics, length of docs, …
๏ Real data sets: Wikipedia, the New York Times, and Twitter.
3. Guidelines for the practical use of LDA.
๏ Number of docs, length of docs, number of topics
๏ Topic / document separation, Dirichlet parameters, …

Slide 6

Slide 6 text

Posterior Contraction Analysis / Topic Modeling / Empirical Study

Slide 8

Slide 8 text

What is topic modeling?
๏ Modeling the latent “topics” of each piece of data.
๏ Many applications; not limited to text.
๏ LDA: the basic topic model (next slide).
Figure from [Blei+ 2003]: data (e.g. a document) and topics (e.g. word distributions).

Slide 10

Slide 10 text

latent Dirichlet allocation (LDA) [Blei+ 2003]
๏ It assumes that “each document consists of multiple topics”.
๏ A “topic” is defined as a distribution over a fixed vocabulary.
Two-stage generative process for each document:
1. Randomly choose a distribution over topics.
2. For each word in the document:
   a) Randomly choose a topic from the distribution over topics chosen in step 1.
   b) Randomly choose a word from the corresponding topic.
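A minimal Python/NumPy sketch of this two-stage generative process, added for illustration; the sizes and Dirichlet hyperparameters below are arbitrary choices, not values from the slides.

```python
# Illustrative sketch of the LDA generative process (not from the paper).
import numpy as np

rng = np.random.default_rng(0)
K, V, doc_len = 3, 1000, 100           # topics, vocabulary size, words per document
alpha = np.full(K, 1.0)                # Dirichlet prior over topic proportions
eta = np.full(V, 0.01)                 # Dirichlet prior over word distributions

topics = rng.dirichlet(eta, size=K)    # each topic is a distribution over the vocabulary

def generate_document():
    theta = rng.dirichlet(alpha)              # step 1: topic proportions for this document
    words = []
    for _ in range(doc_len):
        z = rng.choice(K, p=theta)            # step 2a: choose a topic
        w = rng.choice(V, p=topics[z])        # step 2b: choose a word from that topic
        words.append(w)
    return words

docs = [generate_document() for _ in range(10)]  # a tiny synthetic corpus
```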

Slide 17

Slide 17 text

Figure from [Blei 2011], annotated as follows. Topic: a distribution over the vocabulary. Step 1: choose a distribution over topics. Step 2a: choose a topic from that distribution. Step 2b: choose a word from the chosen topic.

Slide 23

Slide 23 text

Graphical model representation. Figures from [Blei 2011]. Nodes: topic, topic proportion, topic assignment, observed word. The model specifies the joint distribution of hidden and observed variables.
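Written out in the standard notation of [Blei 2011] (the textbook form, added here for completeness rather than copied from the slide), the joint distribution of the hidden and observed variables is

p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{k=1}^{K} p(\beta_k) \prod_{d=1}^{D} p(\theta_d) \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}),

where \beta_k are the topics, \theta_d the per-document topic proportions, z_{d,n} the per-word topic assignments, and w_{d,n} the observed words.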

Slide 32

Slide 32 text

Geometric interpretation. Figure from [Blei+ 2003]. Topics live in the word simplex. Step 1: choose a distribution over topics (θ). Step 2a: choose a topic from that distribution (Z). Step 2b: choose a word from the topic (W). LDA: finding the optimal sub-simplex (spanned by the topics) to represent documents.
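In symbols (standard LDA notation, added for clarity rather than taken from the slide): each document's word distribution is a convex combination of the topics,

p(w \mid \theta, \beta_{1:K}) = \sum_{k=1}^{K} \theta_k \, \beta_{k,w}, \qquad \theta \in \Delta^{K-1},

so documents live in the sub-simplex (the convex hull of \beta_1, \ldots, \beta_K) inside the word simplex.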

Slide 33

Slide 33 text

“Reverse” the generative process
๏ We are interested in the posterior distribution: the latent topic structure, given the observed documents.
๏ But exact computation is intractable → approximate inference:
๏ 1. Sampling-based methods (e.g. Gibbs sampling)
๏ 2. Variational methods (e.g. variational Bayes)
๏ etc.
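To make the sampling-based route concrete, here is a minimal collapsed Gibbs sampler for LDA as an illustrative Python/NumPy sketch; the function name, defaults and simplifications are mine, not the implementation used in the paper. `docs` is a list of lists of word ids.

```python
# Minimal collapsed Gibbs sampler for LDA (illustrative sketch only).
import numpy as np

def collapsed_gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    D = len(docs)
    n_dk = np.zeros((D, K))                                 # document-topic counts
    n_kw = np.zeros((K, V))                                 # topic-word counts
    n_k = np.zeros(K)                                       # total words per topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]    # random initial assignments

    for d, doc in enumerate(docs):                          # initialize the count tables
        for n, w in enumerate(doc):
            k = z[d][n]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]                                 # remove the current assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # full conditional p(z_{d,n} = k | everything else)
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())            # resample the topic
                z[d][n] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    # posterior-mean estimate of the topic-word distributions
    return (n_kw + beta) / (n_kw + beta).sum(axis=1, keepdims=True)
```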

Slide 37

Slide 37 text

FAQs on LDA
๏ Is my data topic-model “friendly”?
๏ Why did LDA fail on my data?
๏ How many documents do I need to learn 100 topics?
๏ Machine learning folklore …

Slide 39

Slide 39 text

Posterior Contraction Analysis / Topic Modeling / Empirical Study

Slide 41

Slide 41 text

Convergence behavior of the posterior
๏ How does the posterior's convergence behavior change as data increases?
๏ → The paper introduces a metric describing a contracting neighborhood, centred at the true topic values, on which the posterior distribution will be shown to place most of its probability mass.
๏ The faster the contraction, the more efficient the statistical inference.

Slide 42

Slide 42 text

… but it's difficult to see individual topics
๏ Issue of identifiability.
๏ “Label-switching” issue: one can only identify the topic collection up to a permutation.
๏ Any vector that can be expressed as a convex combination of the topic parameters would be hard to identify and analyze.
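Formally (a standard observation, stated here for completeness rather than quoted from the slide): the LDA likelihood is invariant to relabeling the topics, so for any permutation \sigma of \{1, \ldots, K\},

p(w \mid \theta, \beta_{1:K}) = \sum_{k=1}^{K} \theta_k \beta_{k,w} = \sum_{k=1}^{K} \theta_{\sigma(k)} \beta_{\sigma(k), w},

i.e. only the set of topics, not their labels, can be identified from the data.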

Slide 46

Slide 46 text

Latent topic polytope in LDA
๏ Topic polytope: the convex hull of the topics.
๏ Distance between two polytopes: “minimum-matching” Euclidean distance.
๏ Intuitively, this metric is a stable measure of the dissimilarity between two topic polytopes.
Figures from [Tang+ 2014].
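A small Python/NumPy sketch of how such a minimum-matching Euclidean distance can be computed between two topic sets (this follows my reading of the metric in [Nguyen 2012]: match each topic to its nearest counterpart and take the worst match, in both directions; the function name is mine).

```python
# Sketch of a minimum-matching Euclidean distance between two topic sets.
import numpy as np

def min_matching_distance(A, B):
    """A, B: arrays of shape (K_A, V) and (K_B, V); rows are topics."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)   # pairwise distances
    return max(d.min(axis=1).max(),   # worst nearest-match from A's topics to B
               d.min(axis=0).max())   # worst nearest-match from B's topics to A
```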

Slide 48

Slide 48 text

Geometric interpretation. Figure from [Blei+ 2003], annotated with the topic polytope.

Slide 49

Slide 49 text

Upper bound for the learning rate. Figures from [Tang+ 2014]. Notation:
G*: true topic polytope
K*: true number of topics
D: number of documents
N: length of documents

Slide 53

Slide 53 text

Observations from Theorem 1
๏ From (3), we should have log D < N (the length of documents should be at least on the order of log D, up to a constant factor).
๏ In the empirical study, the last term of (5) does not appear to play a noticeable role → the 3rd term may be an artefact of the proof technique.
๏ In practice the actual rate could be faster than the given upper bound. However, this looseness of the upper bound only occurs in the exponent; the dependence on 1/D and 1/N should remain, due to a lower bound → Sec. 3.1.4 & [Nguyen 2012].
๏ Condition A2: well-separated topics → small β.
๏ The convergence rate does not depend on the number of topics K → once K is known, or the topics are well-separated, LDA inference is statistically efficient.
๏ In practice we do not know K*: since under-fitting results in a persistent error even with an infinite amount of data, we are most likely to prefer the over-fitted setting (K >> K*).
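Collecting the rates these slides refer to later (a rough summary only, not the exact statement of Theorem 1; see [Tang+ 2014] for the precise bound and conditions A1/A2): with d the minimum-matching distance to the true polytope G*,

\text{exact-fitted } (K = K^*): \quad d \lesssim \Big(\tfrac{\log D}{D}\Big)^{1/2} \ \text{(upper bound)}, \qquad d \gtrsim \tfrac{1}{D} \ \text{(lower bound, [Nguyen 2012])};

\text{over-fitted } (K > K^*): \quad \text{observed exponents roughly between } \tfrac{1}{2K} \text{ and } \tfrac{2}{K};

\text{with the requirement } \log D \lesssim N \text{ from (3).}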

Slide 61

Slide 61 text

Theorem for general situations
When neither condition A1 nor A2 in Theorem 1 holds, the upper bound deteriorates with K. cf. [Nguyen 2012] for more detail.

Slide 62

Slide 62 text

Posterior Contraction Analysis / Topic Modeling / Empirical Study

Slide 65

Slide 65 text

Empirical study: metrics
๏ Distance between two polytopes: “minimum-matching” Euclidean distance.
๏ When the number of vertices of a polytope in general position is smaller than the number of dimensions, all such vertices are also extreme points of their convex hull.

Slide 67

Slide 67 text

Experiments on synthetic data
๏ Create synthetic data sets using the LDA generative process.
๏ Default settings:
   ๏ true number of topics K*: 3
   ๏ vocabulary size |V|: 5,000
   ๏ symmetric Dirichlet prior for topic proportions: 1
   ๏ symmetric Dirichlet prior for word distributions: 0.01
๏ Model inference: collapsed Gibbs sampling.
๏ Learning error: posterior mean of the metric.
๏ Reported results: averaged over 30 simulations.
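These defaults, made concrete as a self-contained Python/NumPy sketch (D and N are placeholders that the scenarios on the following slides vary; the variable names and the unoptimized loop are mine):

```python
# Synthetic-data generation with the default settings listed above.
import numpy as np

rng = np.random.default_rng(0)
K_true, V = 3, 5000                 # true number of topics, vocabulary size
alpha, beta = 1.0, 0.01             # symmetric Dirichlet hyperparameters
D, N = 1000, 500                    # placeholders: varied in Scenarios I-III

true_topics = rng.dirichlet(np.full(V, beta), size=K_true)
docs = []
for _ in range(D):
    theta = rng.dirichlet(np.full(K_true, alpha))
    zs = rng.choice(K_true, size=N, p=theta)
    docs.append([rng.choice(V, p=true_topics[z]) for z in zs])
# `docs` would then be fitted with collapsed Gibbs sampling and the learned
# topics compared to `true_topics` via the minimum-matching distance,
# averaged over 30 simulations.
```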

Slide 71

Slide 71 text

Scenario I: fixed N and increasing D
๏ N = 500
๏ D = 10 to 7,000 (the main varying term, compared in the graphs)
๏ K = 3 = K*: exact-fitted; K = 10: over-fitted
๏ β = 0.01: well-separated topics; β = 1: more word-diffuse, less distinguishable topics
Observations:
1. Same β, different K: when LDA is over-fitted (i.e. K > K*), the performance degenerates significantly.
2. Same K, different β: with larger β, the error curves decay faster while less data is available, then slow down and flatten out as more data becomes available. By contrast, small β results in a more efficient learning rate.
3. Exact-fitted (K = K*): the error rate seems to match (log D / D)^0.5 quite well; in the over-fitted case, the rate is slower.

Slide 73

Slide 73 text

Scenario II: fixed D and increasing N
๏ N = 10 to 1,400 (the main varying term, compared in the graphs)
๏ D = 1,000
๏ K = 3 = K*: exact-fitted; K = 5: over-fitted
๏ β = 0.01: well-separated topics; β = 1: more word-diffuse, less distinguishable topics
Behavior is similar to Scenario I. In over-fitted cases (K > K*), the error fails to vanish even as N becomes large, possibly due to the log D / D term in the upper bound.

Slide 77

Slide 77 text

Scenario III: N = D, both increasing
๏ N = D: 10 to 1,300
๏ K = {3, 5}
๏ β = {0.01, 1}
As in the previous scenarios, LDA is most effective when the setting is exact-fitted (K = K*) and the topics are sparse (β small). When both conditions fail, the error fails to converge to zero even as the data size D = N increases.
The empirical error decays at a faster rate than indicated by the upper bound (log D / D)^0.5 from Thm. 1. A rough estimate is Ω(1/D), which matches the theoretical lower bound of the error contraction rate (cf. Thm. 3 in [Nguyen 2012]). This suggests that the upper bound given in Thm. 1 can be quite conservative in certain configurations and scenarios.

Slide 81

Slide 81 text

Exponential exponents of the error rate
๏ Two scenarios: fixed N = 5 with increasing D; and D = N with both increasing.
๏ Exact-fitted (K = K*): the slope of the log error seems close to 1 → matches the lower bound Ω(1/D).
๏ Over-fitted (K > K*): the slopes tend toward the range bounded by 1/(2K) = 0.1 and 2/K = 0.4 → approximations of the exponents of the lower/upper bounds from the theory.

Slide 82

Slide 82 text

Experiments on real data sets
๏ Wikipedia, the New York Times articles, and Twitter.
๏ To test the effects of the four limiting factors: N, D, α, β.
๏ Ground-truth topics unknown → use PMI or perplexity.
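For reference, the standard definitions of these two measures (textbook forms, added here for completeness; the paper's exact evaluation protocol may differ): PMI-based topic coherence averages, over pairs (w_i, w_j) of a topic's top words,

\mathrm{PMI}(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_i)\, p(w_j)},

with the word probabilities estimated from a reference corpus, while perplexity on a held-out set with N_d words per document is

\mathrm{perplexity}(\mathcal{D}_{\text{test}}) = \exp\!\Big( -\frac{\sum_d \log p(\mathbf{w}_d)}{\sum_d N_d} \Big).

Lower perplexity and higher PMI are better.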

Slide 84

Slide 84 text

Results grid over {New York Times, Wikipedia, Twitter} × {fixed D with increasing N; fixed N with increasing D; fixed N & D with varying α; fixed N & D with varying β}.
๏ Results are consistent with the theory and with the empirical analysis on synthetic data.
๏ With extreme data (e.g. very short or very few documents), or when hyperparameters are not set appropriately, performance suffers.
๏ The results suggest favorable parameter ranges: small β, and small α (Wikipedia) or large α (NYT, Twitter).

Slide 85

Slide 85 text

Implications and guidelines: 1 & 2
1. Number of documents, D
๏ It is impossible to guarantee identification of the topics from a small D, no matter how long the documents are.
๏ Once D is sufficiently large, further increases may not significantly improve the result unless N is also suitably increased.
๏ In practice, LDA achieves comparable results even if only thousands of documents are sampled from a much larger collection.
2. Length of documents, N
๏ Poor results are expected when N is small, even if D is large.
๏ Ideally, N needs to be sufficiently long, but it need not be too long.
๏ In practice, for very long documents, one can sample a fraction of each document and LDA still yields comparable topics.

Slide 86

Slide 86 text

Implications and guidelines: 3, 4, & 5
3. Number of topics, K
๏ If K > K*, inference may become inefficient.
๏ In theory, the convergence rate deteriorates quickly to a nonparametric rate depending on the number of topics used to fit LDA → be careful not to use too large a K.
4. Topic / document separation: LDA performs well when …
๏ topics are well-separated, and
๏ individual documents are associated mostly with a small subset of topics.
5. Hyperparameters
๏ If you think each document is associated with only a few topics, set α small (e.g. 0.1).
๏ If the topics are known to be word-sparse, set β small (e.g. 0.01) → more efficient learning.
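As an illustration of guideline 5 in an off-the-shelf library (gensim is my choice of tool here, not something the slides prescribe; gensim calls the topic-word prior `eta` rather than β):

```python
# Hedged usage sketch: small alpha and eta (beta) as suggested in guideline 5.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["topic", "model", "inference"], ["word", "distribution", "topic"]]  # toy corpus
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=3,   # keep K modest to avoid heavy over-fitting (guideline 3)
    alpha=0.1,      # small alpha: each document uses few topics (guideline 5)
    eta=0.01,       # small eta: word-sparse topics (guideline 5)
)
```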

Slide 87

Slide 87 text

Limitations of existing results
1. Geometrically intuitive assumptions
๏ e.g. in reality we do not know how well-separated the topics are, nor whether their convex hull is geometrically degenerate.
๏ → it may be beneficial to impose additional geometric constraints on the prior.
2. True vs. approximated posterior
๏ The analysis considers the true posterior distribution.
๏ In practice, the posterior is obtained by approximation techniques → additional error.

Slide 88

Slide 88 text

To summarize …
1. Theoretical results to explain the convergence behavior of LDA.
๏ “How does posterior converge as data increases?”
๏ Limiting factors: number of documents, length of docs, number of topics, …
2. Empirical study to support the theory.
๏ Synthetic data: various settings e.g. number of docs / topics, length of docs, …
๏ Real data sets: Wikipedia, the New York Times, and Twitter.
3. Guidelines for the practical use of LDA.
๏ Number of docs, length of docs, number of topics
๏ Topic / document separation, Dirichlet parameters, …

Slide 89

Slide 89 text

Some references (1)
๏ [Blei & Lafferty 2009] Topic Models. http://www.cs.princeton.edu/~blei/papers/BleiLafferty2009.pdf
๏ [Blei 2011] Introduction to Probabilistic Topic Models. https://www.cs.princeton.edu/~blei/papers/Blei2011.pdf
๏ [Blei 2012] Probabilistic Topic Models (review article). Communications of the ACM. http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf
๏ [Blei 2012] Probabilistic Topic Models. Machine Learning Summer School. http://www.cs.princeton.edu/~blei/blei-mlss-2012.pdf
๏ Topic Models by David Blei (video). https://www.youtube.com/watch?v=DDq3OVp9dNA
๏ What is a good explanation of Latent Dirichlet Allocation? (Quora). http://www.quora.com/What-is-a-good-explanation-of-Latent-Dirichlet-Allocation
๏ The LDA Buffet is Now Open; or, Latent Dirichlet Allocation for English Majors, by Matthew L. Jockers. http://www.matthewjockers.net/2011/09/29/the-lda-buffet-is-now-open-or-latent-dirichlet-allocation-for-english-majors/
๏ [Mochihashi & Ishiguro 2013] Probabilistic Topic Models (in Japanese). Institute of Statistical Mathematics open lecture, FY2012. http://www.ism.ac.jp/~daichi/lectures/ISM-2012-TopicModels-daichi.pdf

Slide 90

Slide 90 text

Some references (2)
๏ [Blei+ 2003] Latent Dirichlet Allocation. Journal of Machine Learning Research. http://machinelearning.wustl.edu/mlpapers/paper_files/BleiNJ03.pdf
๏ [Nguyen 2012] Posterior contraction of the population polytope in finite admixture models. arXiv preprint arXiv:1206.0068. http://arxiv.org/abs/1206.0068
๏ [Tang+ 2014] Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis. Proceedings of the 31st International Conference on Machine Learning (ICML). http://jmlr.org/proceedings/papers/v32/tang14.pdf