[Tang+ 2014] "Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis"

Jian Tang, Zhaoshi Meng, XuanLong Nguyen, Qiaozhu Mei and Ming Zhang.
"Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis"
31st International Conference on Machine Learning (ICML), Beijing, June 2014.
http://jmlr.org/proceedings/papers/v32/tang14.pdf

Sorami Hisamoto

August 20, 2014

Transcript

  1. Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis

    Jian Tang, Zhaoshi Meng, XuanLong Nguyen, Qiaozhu Mei and Ming Zhang. 31st International Conference on Machine Learning (ICML), Beijing, June 2014. Sorami Hisamoto, August 20, 2014.
  2. Summary

    1. Theoretical results to explain the convergence behavior of LDA.
      ๏ “How does posterior converge as data increases?”
      ๏ Limiting factors: number of documents, length of docs, number of topics, …
    2. Empirical study to support the theory.
      ๏ Synthetic data: various settings e.g. number of docs / topics, length of docs, …
      ๏ Real data sets: Wikipedia, the New York Times, and Twitter.
    3. Guidelines for the practical use of LDA.
      ๏ Number of docs, length of docs, number of topics
      ๏ Topic / document separation, Dirichlet parameters, …
  6. Posterior Contraction Analysis Topic Modeling Empirical Study

  8. What is topic modeling?

    ๏ Modeling the latent “topics” of each data item.
    ๏ Many applications, not limited to text.
    ๏ LDA: the basic topic model (next slide).
    [Figure from [Blei+ 2003]: data (e.g. a document) and its topics (e.g. word distributions)]
  10. latent Dirichlet allocation (LDA) [Blei+ 2003]

    ๏ It assumes that “each document consists of multiple topics”.
    ๏ A “topic” is defined as a distribution over a fixed vocabulary.
    Two-stage generation process for each document:
    1. Randomly choose a distribution over topics.
    2. For each word in the document:
      a) Randomly choose a topic from the distribution over topics chosen in step 1.
      b) Randomly choose a word from the corresponding topic.
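The two-stage process above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the talk or the paper: the function name `generate_corpus`, the fixed document length, and the toy parameter values are assumptions made for the example.

```python
import numpy as np

def generate_corpus(num_docs, doc_len, num_topics, vocab_size, alpha, beta, seed=0):
    """Sample a toy corpus from the LDA generative process (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Each topic is a distribution over the fixed vocabulary.
    topics = rng.dirichlet(np.full(vocab_size, beta), size=num_topics)
    docs = []
    for _ in range(num_docs):
        # Step 1: choose a distribution over topics for this document.
        theta = rng.dirichlet(np.full(num_topics, alpha))
        # Step 2a: choose a topic for every word position.
        z = rng.choice(num_topics, size=doc_len, p=theta)
        # Step 2b: choose each word from its assigned topic.
        words = np.array([rng.choice(vocab_size, p=topics[k]) for k in z])
        docs.append(words)
    return topics, docs

# Toy values chosen only for the example.
topics, docs = generate_corpus(num_docs=5, doc_len=20, num_topics=3,
                               vocab_size=50, alpha=1.0, beta=0.1)
print(topics.shape, len(docs), docs[0][:10])
```

Each document is returned as an array of word ids; word order carries no information in this model.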
  17. [Figure from [Blei 2011]]

    Annotations on the figure:
    ๏ Topic: distribution over vocabulary
    ๏ Step 1: Choose a distribution over topics
    ๏ Step 2a: Choose a topic from distribution
    ๏ Step 2b: Choose a word from topic
  23. Graphical model representation

    [Figures from [Blei 2011]: the plate diagram and the joint distribution of hidden and observed variables. Node labels: topic, topic proportion, topic assignment, observed word.]
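The joint distribution shown in the figure did not survive in the transcript. For reference, the standard LDA joint over hidden and observed variables can be written as below. The notation is an assumption made here: φ_k is topic k, θ_d the topic proportions of document d, z_{d,n} the topic assignment and w_{d,n} the observed word. Blei's figure writes the topics as β; φ is used here to keep β free for the Dirichlet hyperparameter that appears later in the deck.

```latex
p(\phi_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})
  = \prod_{k=1}^{K} p(\phi_k)
    \prod_{d=1}^{D} p(\theta_d)
    \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\,
           p(w_{d,n} \mid \phi_{1:K}, z_{d,n}) \right)
```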
  32. Geometric interpretation

    [Figure from [Blei+ 2003], annotated:]
    ๏ Topic: a point in the word simplex
    ๏ Step 1: Choose a distribution over topics
    ๏ Step 2a: Choose a topic from distribution
    ๏ Step 2b: Choose a word from topic
    LDA: finding the optimal sub-simplex to represent documents.
  33. “reverse” the generation process

    ๏ We are interested in the posterior distribution.
      ๏ latent topic structure, given the observed documents.
    ๏ But it is difficult … → approximate:
      ๏ 1. Sampling-based methods (e.g. Gibbs sampling)
      ๏ 2. Variational methods (e.g. variational Bayes)
      ๏ etc…
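To make “difficult” concrete, the posterior can be written with Bayes' rule. The notation is an assumption made here (φ for topics, θ for topic proportions, z for topic assignments, w for observed words); the slide itself gives no formula.

```latex
p(\phi_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{1:D})
  = \frac{p(\phi_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})}{p(w_{1:D})}
```

The numerator is easy to evaluate for any setting of the hidden variables. The denominator, the marginal probability of the observed corpus, requires summing over every possible assignment of topics to words, which is why sampling or variational approximations are used.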
  37. FAQs on LDA

    ๏ Is my data topic-model “friendly”?
    ๏ Why did LDA fail on my data?
    ๏ How many documents do I need to learn 100 topics?
    ๏ Machine learning folklore …
  39. Posterior Contraction Analysis Topic Modeling Empirical Study

  41. Convergence behavior of the posterior

    ๏ How does the convergence behavior of the posterior change as data increases?
    ๏ → Introduce a metric that describes the contracting neighborhood centred at the true topic values, on which the posterior distribution is shown to place most of its probability mass.
    ๏ The faster the contraction, the more efficient the statistical inference.
  42. … but it’s difficult to see individual topics

    ๏ Issue of identifiability
      ๏ “label-switching” issue: one can only identify the topic collection up to a permutation.
      ๏ Any vector that can be expressed as a convex combination of the topic parameters would be hard to identify and analyze.
  46. Latent topic polytope in the LDA

    ๏ Topic polytope: the convex hull of the topics.
    ๏ Distance between two polytopes: “minimum-matching” Euclidean distance.
      * Intuitively, this metric is a stable measure of the dissimilarity between two topic polytopes.
    [Figures from [Tang+ 2014]; label: topics]
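The slide names the metric but its formula was in a figure. One common way to write a minimum-matching Euclidean distance between two polytopes G and G′ is given below; this form is stated as an assumption and should be checked against [Tang+ 2014] and [Nguyen 2012]. Here extr G denotes the set of extreme points (vertices) of G.

```latex
d(G, G') = \max_{\phi \in \operatorname{extr} G}\;
           \min_{\phi' \in \operatorname{extr} G'} \lVert \phi - \phi' \rVert_2,
\qquad
d_{\mathcal{M}}(G, G') = \max\bigl\{ d(G, G'),\; d(G', G) \bigr\}
```

Because each vertex is matched to its nearest counterpart, the value does not change when the topics are relabeled, which sidesteps the label-switching issue.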
  48. Geometric interpretation

    [Figure from [Blei+ 2003], with the topic polytope marked]
  49. Upper bound for the learning rate

    [Figures from [Tang+ 2014]]
    ๏ G*: true topic polytope
    ๏ K*: true number of topics
    ๏ D: number of documents
    ๏ N: length of documents
  53. Observations from Theorem 1

    ๏ From (3), we should have log D < N (the length of documents should be at least on the order of log D, up to a constant factor).
    ๏ In the empirical study, the last term of (5) does not appear to play a noticeable role → the 3rd term may be an artefact of the proof technique?
    ๏ In practice the actual rate could be faster than the given upper bound. However, this looseness of the upper bound occurs only in the exponent; the dependence on 1/D and 1/N should remain because of a lower bound → Sec. 3.1.4 & [Nguyen 2012].
    ๏ Condition A2: well-separated topics → small β.
    ๏ The convergence rate does not depend on the number of topics K → once K is known, or the topics are well-separated, LDA inference is statistically efficient.
    ๏ In practice we do not know K*: since under-fitting results in a persistent error even with an infinite amount of data, we are most likely to prefer the over-fitted setting (K >> K*).
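A quick numerical look at how such a bound behaves in D and N. The formula used here is an assumption: the transcript shows the theorem only as a figure, so the sketch takes the rate to be (log D / D + log N / N + log N / D)^0.5, which agrees with the “three terms” and the “(log D / D)^0.5” mentioned elsewhere in the deck but should be checked against the paper. Constants are ignored.

```python
import math

def rate_terms(D, N):
    """The three terms inside the square root of the ASSUMED rate."""
    return (math.log(D) / D, math.log(N) / N, math.log(N) / D)

def contraction_rate(D, N):
    """Assumed upper-bound rate, up to constants: sqrt of the sum of the terms."""
    return math.sqrt(sum(rate_terms(D, N)))

for D, N in [(100, 500), (1000, 500), (7000, 500), (7000, 5000)]:
    t1, t2, t3 = rate_terms(D, N)
    print(f"D={D:5d} N={N:5d}  logD/D={t1:.5f}  logN/N={t2:.5f}  "
          f"logN/D={t3:.5f}  rate={contraction_rate(D, N):.4f}")
```

With N fixed, the log N / N term puts a floor under the rate, so adding documents alone stops helping at some point; this matches the guideline later in the deck that D and N need to grow together.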
  61. Theorem for general situations

    When neither condition A1 nor A2 in Theorem 1 holds: the upper bound deteriorates with K. Cf. [Nguyen 2012] for more detail.
  62. Posterior Contraction Analysis Topic Modeling Empirical Study

  66. Empirical study: metrics

    ๏ Distance between two polytopes: “minimum-matching” Euclidean distance.
    ๏ When the number of vertices of a polytope in general position is smaller than the number of dimensions, all such vertices are also the extreme points of their convex hull.
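A minimal sketch of how such a distance could be computed from two sets of topic vectors. It assumes the metric is the larger of the two directed “nearest-vertex” distances, and it treats every given topic vector as an extreme point, which the remark above justifies when the number of topics is smaller than the vocabulary size. The function name is made up for the example.

```python
import numpy as np

def min_matching_distance(A, B):
    """Minimum-matching Euclidean distance between two vertex sets (assumed form).

    A has shape (K, V) and B has shape (K', V); rows are topic vectors.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # pair[i, j] = Euclidean distance between A[i] and B[j].
    pair = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    d_ab = pair.min(axis=1).max()  # worst nearest-neighbour distance from A to B
    d_ba = pair.min(axis=0).max()  # worst nearest-neighbour distance from B to A
    return max(d_ab, d_ba)

true_topics = np.eye(3)
relabeled = true_topics[[2, 0, 1]]            # same topics, different order
perturbed = np.array([[0.9, 0.1, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
print(min_matching_distance(true_topics, relabeled))
print(min_matching_distance(true_topics, perturbed))
```

The relabeled set gives distance zero, so the measure is unaffected by label switching. B may also have more rows than A, as in the over-fitted setting.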
  67. Experiments on synthetic data

    ๏ Create a synthetic data set by the LDA generative process.
    ๏ Default settings:
      ๏ true number of topics K*: 3
      ๏ vocabulary size |V|: 5,000
      ๏ symmetric Dirichlet prior for topic proportions: 1
      ๏ symmetric Dirichlet prior for word distributions: 0.01
    ๏ Model inference: collapsed Gibbs sampling.
    ๏ Learning error: posterior mean of the metric.
    ๏ Reported results: averaged over 30 simulations.
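For readers unfamiliar with the inference step, here is a compact sketch of collapsed Gibbs sampling for LDA. It is a textbook-style version written for illustration, not the authors' implementation; the function name, the toy corpus and the iteration count are assumptions. It returns a point estimate of the topics from the final sampler state only, whereas the experiments report the posterior mean of the metric.

```python
import numpy as np

def collapsed_gibbs_lda(docs, num_topics, vocab_size, alpha, beta, num_iters=200, seed=0):
    """Collapsed Gibbs sampler for LDA (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), num_topics))   # tokens in doc d assigned to topic k
    n_kw = np.zeros((num_topics, vocab_size))  # times word w is assigned to topic k
    n_k = np.zeros(num_topics)                 # total tokens assigned to topic k
    z = []
    for d, doc in enumerate(docs):             # random initial assignments
        z_d = rng.integers(num_topics, size=len(doc))
        z.append(z_d)
        for w, k in zip(doc, z_d):
            n_dk[d, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1
    for _ in range(num_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current token from the counts.
                n_dk[d, k] -= 1
                n_kw[k, w] -= 1
                n_k[k] -= 1
                # Conditional distribution of this token's topic given all others.
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
                k = rng.choice(num_topics, p=p / p.sum())
                z[d][i] = k
                n_dk[d, k] += 1
                n_kw[k, w] += 1
                n_k[k] += 1
    # Smoothed topic-word estimate from the final state.
    return (n_kw + beta) / (n_k[:, None] + vocab_size * beta)

# Toy corpus of word ids, made up for the example.
toy_docs = [[0, 1, 0, 1, 0, 1], [2, 3, 2, 3, 2, 3], [0, 1, 1, 0, 2, 3]]
est_topics = collapsed_gibbs_lda(toy_docs, num_topics=2, vocab_size=4,
                                 alpha=1.0, beta=0.01, num_iters=50)
print(est_topics.round(2))
```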
  71. Scenario I: fixed N and increasing D

    ๏ N = 500
    ๏ D = 10 to 7,000
    ๏ K = 3 = K* (exact-fitted) or K = 10 (over-fitted)
    ๏ β = 0.01 (well-separated topics) or β = 1 (more word-diffuse, less distinguishable topics)
    (Main varying term: compared in the graphs.)
    1. Same β but different K: when LDA is over-fitted (i.e. K > K*), the performance degenerates significantly.
    2. Same K but different β: with larger β, the error curves decay faster while little data is available; as more data becomes available the decay slows and then flattens out. By contrast, small β results in a more efficient learning rate.
    3. K = K*: the error rate seems to match (log D / D)^0.5 quite well. In the over-fitted case, the rate is slower.
  73. Scenario II: fixed D and increasing N

    ๏ N = 10 to 1,400
    ๏ D = 1,000
    ๏ K = 3 = K* (exact-fitted) or K = 5 (over-fitted)
    ๏ β = 0.01 (well-separated topics) or β = 1 (more word-diffuse, less distinguishable topics)
    (Main varying term: compared in the graphs.)
    Behavior is similar to Scenario I. In the over-fitted cases (K > K*), the error fails to vanish even as N becomes large, possibly due to the log D / D term in the upper bound.
  77. Scenario III: N=D, both increasing

    ๏ N = D: 10 to 1,300
    ๏ K = {3, 5}
    ๏ β = {0.01, 1}
    As in the previous scenarios, LDA is most effective in the exact-fitted setting (K = K*) and when the topics are sparse (β small). When both conditions fail, the error fails to converge to zero, even as the data size D = N increases.
    The empirical error decays at a faster rate than indicated by the upper bound (log D / D)^0.5 from Thm. 1. A rough estimate could be Ω(1/D), which actually matches the theoretical lower bound of the error contraction rate (cf. Thm. 3 in [Nguyen 2012]).
    This suggests that the upper bound given in Thm. 1 could be quite conservative in certain configurations and scenarios.
  81. Exponential exponents of the error rate

    ๏ 2 scenarios:
      ๏ Fixed N=5 and increasing D.
      ๏ D=N and both increasing.
    ๏ Exact-fitted (K = K*): the slope of the log error seems close to 1 → matches the lower bound Ω(1/D).
    ๏ Over-fitted (K > K*): the slope tends toward the range bounded by 1/(2K) = 0.1 and 2/K = 0.4 → approximations of the exponents of the lower/upper bounds given by the theory.
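The “slope of the log error” can be read as the slope of a straight-line fit on log-log axes. The sketch below shows one way to estimate it; the function name and the error curves are made up for illustration and are not results from the paper.

```python
import numpy as np

def log_log_slope(sizes, errors):
    """Least-squares slope of log(error) against log(size)."""
    slope, _intercept = np.polyfit(np.log(sizes), np.log(errors), 1)
    return slope

D = np.array([10.0, 100.0, 1000.0, 10000.0])
fast_errors = 2.0 / D       # illustrative curve decaying like 1/D
slow_errors = D ** -0.5     # illustrative curve decaying like D^(-1/2)
print(log_log_slope(D, fast_errors))
print(log_log_slope(D, slow_errors))
```

A slope near -1 corresponds to an error of order 1/D, and a slope near -0.5 to order D^(-1/2). For K = 5, the range quoted above (0.1 to 0.4) would show up as slopes between roughly -0.1 and -0.4.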
  82. Experiments on real data sets

    ๏ Wikipedia, the New York Times articles, and Twitter.
    ๏ To test the effects of the four limiting factors: N, D, α, β.
    ๏ Ground-truth topics unknown → use PMI or perplexity.
  84. [Figure: results on real data sets]

    [Panels: fixed D, increasing N; fixed N, increasing D; fixed N & D, varying α; fixed N & D, varying β. Data sets: New York Times, Wikipedia, Twitter.]
    ๏ Results are consistent with the theory and with the empirical analysis on synthetic data.
    ๏ With extreme data (e.g. very short or very few documents), or when the hyperparameters are not set appropriately, performance suffers.
    ๏ The results suggest favorable ranges of parameters: small β, and small α (Wikipedia) or large α (NYT, Twitter).
  85. Implications and guidelines: 1 & 2

    1. Number of documents: D
      ๏ It is impossible to guarantee identification of topics from a small D, no matter how long the documents are.
      ๏ Once D is sufficiently large, a further increase may not significantly improve the result unless N is also suitably increased.
      ๏ In practice, LDA achieves comparable results even if only thousands of documents are sampled from a much larger collection.
    2. Length of document: N
      ๏ A poor result is expected when N is small, even if D is large.
      ๏ Ideally, N needs to be sufficiently long, but need not be too long.
      ๏ In practice, for very long documents, one can sample a fraction of each document and LDA still yields comparable topics.
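The two sampling suggestions (a subset of documents, and a fraction of each long document) could be sketched as follows. The function name and parameters are hypothetical; the deck does not prescribe a particular sampling scheme, and uniform sampling without replacement is just one simple choice.

```python
import random

def subsample_corpus(docs, max_docs, token_fraction, seed=0):
    """Keep at most max_docs documents and a fraction of the tokens of each."""
    rng = random.Random(seed)
    chosen = rng.sample(docs, min(max_docs, len(docs)))
    sampled = []
    for doc in chosen:
        # Keep at least one token, and never more than the document has.
        keep = min(len(doc), max(1, int(len(doc) * token_fraction)))
        sampled.append(rng.sample(doc, keep))
    return sampled

# Made-up corpus: 10 documents of 100 token ids each.
corpus = [list(range(100)) for _ in range(10)]
small = subsample_corpus(corpus, max_docs=3, token_fraction=0.2)
print(len(small), [len(doc) for doc in small])
```

Token order is not preserved, which is harmless for a bag-of-words model such as LDA.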
  86. Implications and guidelines: 3, 4, & 5

    3. Number of topics: K
      ๏ If K > K*, inference may become inefficient.
      ๏ In theory, the convergence rate deteriorates quickly to a nonparametric rate, depending on the number of topics used to fit LDA → be careful not to use too large a K.
    4. Topic / document separation: LDA performs well when …
      ๏ Topics are well-separated.
      ๏ Individual documents are associated mostly with a small subset of topics.
    5. Hyperparameters
      ๏ If you think each document is associated with few topics, set α small (e.g. 0.1).
      ๏ If the topics are known to be word-sparse, set β small (e.g. 0.01) → more efficient learning.
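To see why a small α corresponds to “few topics per document”, one can draw topic proportions from a symmetric Dirichlet and look at how much mass the largest topic takes on average. This is an illustrative sketch; the function name, the number of topics and the sample size are arbitrary choices.

```python
import numpy as np

def mean_largest_share(alpha, num_topics=10, num_docs=2000, seed=0):
    """Average share of a document's most prominent topic under Dirichlet(alpha)."""
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(np.full(num_topics, alpha), size=num_docs)
    return theta.max(axis=1).mean()

for alpha in (0.1, 1.0, 10.0):
    print(alpha, round(float(mean_largest_share(alpha)), 3))
```

Smaller α concentrates each document on a few topics, while larger α spreads the mass more evenly. The same reasoning applies to β and the word distribution of each topic.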
  87. Limitations of existing results

    1. Geometrically intuitive assumptions
      ๏ e.g. in reality we don’t know how well separated the topics are, or whether their convex hull is geometrically degenerate.
      ๏ → it may be beneficial to impose additional geometric constraints on the prior.
    2. True / approximated posterior
      ๏ Here we considered the true posterior distribution.
      ๏ In practice, the posterior is obtained by approximation techniques → error.
  88. To summarize …

    1. Theoretical results to explain the convergence behavior of LDA.
      ๏ “How does posterior converge as data increases?”
      ๏ Limiting factors: number of documents, length of docs, number of topics, …
    2. Empirical study to support the theory.
      ๏ Synthetic data: various settings e.g. number of docs / topics, length of docs, …
      ๏ Real data sets: Wikipedia, the New York Times, and Twitter.
    3. Guidelines for the practical use of LDA.
      ๏ Number of docs, length of docs, number of topics
      ๏ Topic / document separation, Dirichlet parameters, …
  89. Some references (1)

    ๏ [Blei&Lafferty 2009] Topic Models. http://www.cs.princeton.edu/~blei/papers/BleiLafferty2009.pdf
    ๏ [Blei 2011] Introduction to Probabilistic Topic Models. https://www.cs.princeton.edu/~blei/papers/Blei2011.pdf
    ๏ [Blei 2012] Review Articles: Probabilistic Topic Models. Communications of the ACM. http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf
    ๏ [Blei 2012] Probabilistic Topic Models. Machine Learning Summer School. http://www.cs.princeton.edu/~blei/blei-mlss-2012.pdf
    ๏ Topic Models by David Blei (video). https://www.youtube.com/watch?v=DDq3OVp9dNA
    ๏ What is a good explanation of Latent Dirichlet Allocation? - Quora. http://www.quora.com/What-is-a-good-explanation-of-Latent-Dirichlet-Allocation
    ๏ The LDA Buffet is Now Open; or, Latent Dirichlet Allocation for English Majors, by Matthew L. Jockers. http://www.matthewjockers.net/2011/09/29/the-lda-buffet-is-now-open-or-latent-dirichlet-allocation-for-english-majors/
    ๏ [Mochihashi & Ishiguro 2013] Probabilistic Topic Models (in Japanese). The Institute of Statistical Mathematics, FY2012 open lecture. http://www.ism.ac.jp/~daichi/lectures/ISM-2012-TopicModels-daichi.pdf
  90. Some references (2)

    ๏ [Blei+ 2003] Latent Dirichlet Allocation. Journal of Machine Learning Research. http://machinelearning.wustl.edu/mlpapers/paper_files/BleiNJ03.pdf
    ๏ [Nguyen 2012] Posterior contraction of the population polytope in finite admixture models. arXiv preprint arXiv:1206.0068. http://arxiv.org/abs/1206.0068
    ๏ [Tang+ 2014] Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis. Proceedings of the 31st International Conference on Machine Learning (ICML). http://jmlr.org/proceedings/papers/v32/tang14.pdf