[Tang+ 2014] "Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis"

Jian Tang, Zhaoshi Meng, XuanLong Nguyen, Qiaozhu Mei and Ming Zhang.
"Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis"
31st International Conference on Machine Learning (ICML), Beijing, June 2014.
http://jmlr.org/proceedings/papers/v32/tang14.pdf

Sorami Hisamoto

August 20, 2014

Transcript

  1. 1.

    Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis

    Jian Tang, Zhaoshi Meng, XuanLong Nguyen, Qiaozhu Mei and Ming Zhang. 31st International Conference on Machine Learning (ICML), Beijing, June 2014. Sorami Hisamoto, August 20, 2014.
  2. 2.

    Summary

    1. Theoretical results to explain the convergence behavior of LDA.
       ๏ “How does posterior converge as data increases?”
       ๏ Limiting factors: number of documents, length of docs, number of topics, …
    2. Empirical study to support the theory.
       ๏ Synthetic data: various settings e.g. number of docs / topics, length of docs, …
       ๏ Real data sets: Wikipedia, the New York Times, and Twitter.
    3. Guidelines for the practical use of LDA.
       ๏ Number of docs, length of docs, number of topics
       ๏ Topic / document separation, Dirichlet parameters, …
  6. 8.

    What is topic modeling?

    ๏ Modeling the latent “topics” of each data item.
    ๏ Many applications, not limited to text.
    ๏ LDA: the basic topic model (next slide).

    [Figure from [Blei+ 2003]: data (e.g. a document) and topics (e.g. word distributions)]
  8. 10.

    Latent Dirichlet allocation (LDA) [Blei+ 2003]

    ๏ It assumes that “each document consists of multiple topics”.
    ๏ A “topic” is defined as a distribution over a fixed vocabulary.

    Two-stage generation process for each document:
    1. Randomly choose a distribution over topics.
    2. For each word in the document:
       a) Randomly choose a topic from the distribution over topics in step #1.
       b) Randomly choose a word from the corresponding topic.
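The two-stage process above can be sketched in a few lines of NumPy. This is only an illustration, not code from the paper: the function name `generate_corpus`, the fixed document length `N`, and the small parameter values in the usage line are assumptions chosen for readability.

```python
import numpy as np

def generate_corpus(D, N, K, V, alpha, beta, seed=0):
    """Sample D documents of N words each from the LDA generative process."""
    rng = np.random.default_rng(seed)
    # Each topic is a distribution over the fixed vocabulary of size V.
    topics = rng.dirichlet(np.full(V, beta), size=K)  # shape (K, V)
    docs = []
    for _ in range(D):
        # Step 1: choose a distribution over topics for this document.
        theta = rng.dirichlet(np.full(K, alpha))
        # Step 2a: choose a topic for every word position.
        z = rng.choice(K, size=N, p=theta)
        # Step 2b: choose each word from its assigned topic.
        words = np.array([rng.choice(V, p=topics[k]) for k in z])
        docs.append(words)
    return topics, docs

# Illustrative (hypothetical) sizes: 5 documents of 20 words, 3 topics, 50 word types.
topics, docs = generate_corpus(D=5, N=20, K=3, V=50, alpha=1.0, beta=0.1)
```

Each document is returned as an array of integer word ids; real documents would of course vary in length.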
  12. 17.

    [Figure from [Blei 2011]: the generative process. Topic: distribution over vocabulary. Step 1: choose a distribution over topics. Step 2a: choose a topic from the distribution. Step 2b: choose a word from the topic.]
  14. 23.

    Graphical model representation

    [Figures from [Blei 2011]: joint distribution of hidden and observed variables; graphical model with nodes labeled observed word, topic assignment, topic proportion, and topic]
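The formula itself did not survive in the transcript. As a reminder, the standard LDA joint distribution can be written as below. The notation is an assumption made here for clarity: φ_k denotes topic k (so that it does not clash with the hyperparameter β used on later slides), θ_d the topic proportions of document d, z_{d,n} the topic assignment and w_{d,n} the observed word at position n of document d, and N_d the length of document d.

```latex
p(\phi_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})
  = \prod_{k=1}^{K} p(\phi_k \mid \beta)
    \prod_{d=1}^{D} \Bigl( p(\theta_d \mid \alpha)
    \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\,
                      p(w_{d,n} \mid \phi_{1:K}, z_{d,n}) \Bigr),
\qquad \phi_k \sim \mathrm{Dir}(\beta), \quad \theta_d \sim \mathrm{Dir}(\alpha).
```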
  21. 32.

    Geometric interpretation

    ๏ Topic: a point in the word simplex.
    ๏ Step 1: Choose a distribution over topics.
    ๏ Step 2a: Choose a topic from the distribution.
    ๏ Step 2b: Choose a word from the topic.

    LDA: finding the optimal sub-simplex to represent documents.

    [Figure from [Blei+ 2003]: the topic sub-simplex inside the word simplex]
  22. 33.

    “Reverse” the generation process

    ๏ We are interested in the posterior distribution: the latent topic structure, given the observed documents.
    ๏ But it is difficult … → approximate:
       1. Sampling-based methods (e.g. Gibbs sampling)
       2. Variational methods (e.g. variational Bayes)
       etc.
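As an illustration of the sampling-based route, here is a minimal sketch of a collapsed Gibbs sampler for LDA. It is not the authors' implementation: the function name `collapsed_gibbs`, the count-array names, the toy corpus, and the use of the final sample as a point estimate are all assumptions made for this example.

```python
import numpy as np

def collapsed_gibbs(docs, K, V, alpha, beta, n_iter=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of integer word-id arrays."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    n_dk = np.zeros((D, K))  # words in document d assigned to topic k
    n_kw = np.zeros((K, V))  # times word w is assigned to topic k
    n_k = np.zeros(K)        # total words assigned to topic k
    # Random initial topic assignment for every word position.
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for w, k in zip(doc, z[d]):
            n_dk[d, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current assignment from the counts.
                n_dk[d, k] -= 1
                n_kw[k, w] -= 1
                n_k[k] -= 1
                # Full conditional of this word's topic given all other assignments.
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                # Record the new assignment.
                z[d][i] = k
                n_dk[d, k] += 1
                n_kw[k, w] += 1
                n_k[k] += 1
    # Point estimate of the topics from the counts of the final sample.
    topics = (n_kw + beta) / (n_k[:, None] + V * beta)
    return topics, z

# Toy corpus (hypothetical): two documents over a 4-word vocabulary.
docs = [np.array([0, 1, 0, 1, 0]), np.array([2, 3, 2, 3, 3])]
topics, z = collapsed_gibbs(docs, K=2, V=4, alpha=0.1, beta=0.01, n_iter=20)
```

A practical sampler would also discard burn-in iterations and average over several samples rather than reading off only the last one.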
  26. 37.

    FAQs on LDA

    ๏ Is my data topic-model “friendly”?
    ๏ Why did LDA fail on my data?
    ๏ How many documents do I need to learn 100 topics?

    ๏ Machine learning folklore …
  28. 41.

    Convergence behavior of the posterior

    ๏ How does the convergence behavior of the posterior change as data increases?
    ๏ → Introduce a metric that describes a contracting neighborhood centred at the true topic values, on which the posterior distribution will be shown to place most of its probability mass.
    ๏ The faster the contraction, the more efficient the statistical inference.
  29. 42.

    … but it’s difficult to see individual topics

    ๏ Issue of identifiability
    ๏ “Label-switching” issue: one can only identify the topic collection up to a permutation.
    ๏ Any vector that can be expressed as a convex combination of the topic parameters would be hard to identify and analyze.
  33. 46.

    Latent topic polytope in LDA

    ๏ Topic polytope: the convex hull of the topics.
    ๏ Distance between two polytopes: the “minimum-matching” Euclidean distance.
    ๏ Intuitively, this metric is a stable measure of the dissimilarity between two topic polytopes.

    [Figures from [Tang+ 2014]]
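The slide only names the metric, so the following sketch assumes one common reading of it: a Hausdorff-style distance between the two sets of polytope vertices, where each vertex is matched to its nearest vertex in the other polytope and the worst such match is reported. The function name `min_matching_distance` is hypothetical, and the exact definition should be checked against [Nguyen 2012] and [Tang+ 2014].

```python
import numpy as np

def min_matching_distance(A, B):
    """A, B: arrays of shape (K, V) and (K', V), one polytope vertex (topic) per row."""
    # Pairwise Euclidean distances between every vertex of A and every vertex of B.
    dist = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    # Farthest any vertex of A is from its nearest vertex of B, and vice versa.
    a_to_b = dist.min(axis=1).max()
    b_to_a = dist.min(axis=0).max()
    return max(a_to_b, b_to_a)

true_topics = np.array([[1.0, 0.0], [0.0, 1.0]])
permuted = true_topics[::-1]                        # same topics, labels switched
overfitted = np.vstack([true_topics, [0.5, 0.5]])   # one redundant extra topic

d_perm = min_matching_distance(true_topics, permuted)
d_over = min_matching_distance(true_topics, overfitted)
```

Under this reading the distance ignores the ordering of topics, which is why it sidesteps the label-switching issue, while an extra topic that sits away from every true vertex is penalized.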
  34. 49.

    Upper bound for the learning rate

    [Figures from [Tang+ 2014]]

    ๏ G*: true topic polytope
    ๏ K*: true number of topics
    ๏ D: number of documents
    ๏ N: length of documents
  38. 53.

    Observations from Theorem 1

    ๏ From (3), we should have log D < N (the length of documents should be at least on the order of log D, up to a constant factor).
    ๏ In the empirical study, the last term of (5) does not appear to play a noticeable role → the 3rd term may be an artefact of the proof technique?
    ๏ In practice the actual rate could be faster than the given upper bound. However, this looseness of the upper bound occurs only in the exponent; the dependence on 1/D and 1/N should remain because of a lower bound → Sec. 3.1.4 & [Nguyen 2012].
    ๏ Condition A2: well-separated topics → small β.
    ๏ The convergence rate does not depend on the number of topics K → once K is known, or the topics are well-separated, LDA inference is statistically efficient.
    ๏ In practice we do not know K*: since under-fitting results in a persistent error even with an infinite amount of data, we are most likely to prefer the over-fitted setting (K >> K*).
  45. 61.

    Theorem for general situations

    ๏ When neither condition A1 nor A2 in Theorem 1 holds: the upper bound deteriorates with K.
    ๏ Cf. [Nguyen 2012] for more detail.
  46. 65.

    Empirical study: metrics

    ๏ When the number of vertices of a polytope in general position is smaller than the number of dimensions, all such vertices are also the extreme points of their convex hull.
    ๏ Distance between two polytopes: “minimum-matching” Euclidean distance.
  48. 67.

    Experiments on synthetic data

    ๏ Create synthetic data sets by the LDA generative process.
    ๏ Default settings:
       ๏ true number of topics K*: 3
       ๏ vocabulary size |V|: 5,000
       ๏ symmetric Dirichlet prior for topic proportions: 1
       ๏ symmetric Dirichlet prior for word distributions: 0.01
    ๏ Model inference: collapsed Gibbs sampling.
    ๏ Learning error: posterior mean of the metric.
    ๏ Reported results: averaged over 30 simulations.
  52. 71.

    Scenario I: fixed N and increasing D

    ๏ N = 500
    ๏ D = 10~7,000
    ๏ K
       ๏ = 3 = K*: exact-fitted
       ๏ = 10: over-fitted
    ๏ β
       ๏ = 0.01: well-separated topics
       ๏ = 1: more word-diffuse, less distinguishable topics

    Main varying term (compared in graphs)

    [Graphs: error curves for K = K* and K > K*]

    1. Same β but different K: when LDA is over-fitted (i.e. K > K*), the performance degrades significantly.
    2. Same K but different β: when β is larger, the error curves decay faster while little data is available. As more data becomes available, the decay slows down and then flattens out. By contrast, a small β results in a more efficient learning rate.
    3. K = K*: the error rate seems to match (log D / D)^0.5 quite well. In the over-fitted case, the rate is slower.
  54. 73.

    Scenario II: fixed D and increasing N

    ๏ N = 10~1,400
    ๏ D = 1,000
    ๏ K
       ๏ = 3 = K*: exact-fitted
       ๏ = 5: over-fitted
    ๏ β
       ๏ = 0.01: well-separated topics
       ๏ = 1: more word-diffuse, less distinguishable topics

    Main varying term (compared in graphs)

    Behavior is similar to Scenario I.

    In the over-fitted cases (K > K*), the error fails to vanish even as N becomes large, possibly due to the log D / D term in the upper bound.
  57. 77.

    Scenario III: N = D, both increasing

    ๏ N = D: 10~1,300
    ๏ K = {3, 5}
    ๏ β = {0.01, 1}

    Similar to the previous scenarios, LDA is most effective in the exact-fitted setting (K = K*) with sparse topics (β small).

    When both conditions fail, the error fails to converge to zero even as the data size D = N increases.

    The empirical error decays at a faster rate than indicated by the upper bound (log D / D)^0.5 from Thm. 1.

    A rough estimate could be Ω(1/D), which actually matches the theoretical lower bound of the error contraction rate (cf. Thm. 3 in [Nguyen 2012]).

    This suggests that the upper bound given in Thm. 1 could be quite conservative in certain configurations and scenarios.
  61. 81.

    Exponential exponents of the error rate

    ๏ 2 scenarios:
       ๏ Fixed N = 5 and increasing D.
       ๏ D = N and both increasing.

    Exact-fitted (K = K*): the slope of the log error seems close to 1 → matches the lower bound Ω(1/D).

    Over-fitted (K > K*): the slopes tend toward the range bounded by 1/(2K) = 0.1 and 2/K = 0.4 → approximations of the exponents of the lower / upper bounds given by the theory.
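One simple way to read an exponent off an error curve is a least-squares fit of log(error) against log(D). The sketch below assumes that approach; the slides do not say how the slopes were computed, and the helper name `loglog_slope` and the synthetic error values are hypothetical.

```python
import numpy as np

def loglog_slope(D_values, errors):
    """Least-squares slope of log(error) against log(D), returned as a decay exponent."""
    slope, _intercept = np.polyfit(np.log(D_values), np.log(errors), deg=1)
    return -slope

D_values = np.array([10, 100, 1000, 10000])
errors = 3.0 / D_values  # hypothetical errors decaying exactly like 1/D
exponent = loglog_slope(D_values, errors)
```

For an error that behaves like D^(-c), the fitted exponent recovers c, so a value near 1 corresponds to the 1/D rate and values such as 0.1 to 0.4 correspond to the slower over-fitted rates quoted above.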
  62. 82.

    Experiments on real data sets

    ๏ Wikipedia, New York Times articles, and Twitter.
    ๏ To test the effects of the four limiting factors: N, D, α, β.
    ๏ Ground-truth topics unknown → use PMI or perplexity.
  64. 84.

    [Graphs for New York Times, Wikipedia, and Twitter: fixed D, increasing N; fixed N, increasing D; fixed N & D, varying α; fixed N & D, varying β]

    Results are consistent with the theory and with the empirical analysis on synthetic data.

    With extreme data (e.g. very short or very few documents), or when the hyperparameters are not set appropriately, performance suffers.

    The results suggest favorable ranges of the parameters: small β, and small α (Wikipedia) or large α (NYT, Twitter).
  65. 85.

    Implications and guidelines: 1 & 2

    1. Number of documents: D
       ๏ It is impossible to guarantee identification of topics from a small D, no matter how long the documents are.
       ๏ Once D is sufficiently large, a further increase may not significantly improve the result unless N is also suitably increased.
       ๏ In practice, LDA achieves comparable results even if only thousands of documents are sampled from a much larger collection.
    2. Length of documents: N
       ๏ Poor results are expected when N is small, even if D is large.
       ๏ Ideally, N needs to be sufficiently long, but need not be too long.
       ๏ In practice, for very long documents, one can sample a fraction of each document and LDA still yields comparable topics.
  66. 86.

    Implications and guidelines: 3, 4, & 5

    3. Number of topics: K
       ๏ If K > K*, inference may become inefficient.
       ๏ In theory, the convergence rate deteriorates quickly to a nonparametric rate, depending on the number of topics used to fit LDA → be careful not to use too large a K.
    4. Topic / document separation: LDA performs well when …
       ๏ Topics are well-separated.
       ๏ Individual documents are associated mostly with a small subset of topics.
    5. Hyperparameters
       ๏ If you think each document is associated with few topics, set α small (e.g. 0.1).
       ๏ If the topics are known to be word-sparse, set β small (e.g. 0.01) → more efficient learning.
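To see why a small α corresponds to documents that concentrate on few topics, one can sample topic proportions from a symmetric Dirichlet and look at how much weight the single largest topic receives. This is an illustrative sketch only; the helper name `mean_largest_weight` and the values K = 20 and 1,000 samples are assumptions, not settings from the paper.

```python
import numpy as np

def mean_largest_weight(alpha, K=20, n_docs=1000, seed=0):
    """Average weight of the single largest topic in each sampled document."""
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(np.full(K, alpha), size=n_docs)  # shape (n_docs, K)
    return theta.max(axis=1).mean()

sparse = mean_largest_weight(alpha=0.1)   # small alpha: few dominant topics
diffuse = mean_largest_weight(alpha=1.0)  # alpha = 1: uniform over the simplex
```

With the smaller α the largest topic takes a much bigger share of each document on average, which is the "few topics per document" regime described in guideline 5. The same reasoning applies to β and word-sparse topics.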
  67. 87.

    Limitations of existing results

    1. Geometrically intuitive assumptions
       ๏ E.g. in reality we don’t know how well separated the topics are, or whether their convex hull is geometrically degenerate.
       ๏ → It may be beneficial to impose additional geometric constraints on the prior.
    2. True / approximated posterior
       ๏ Here we considered the true posterior distribution.
       ๏ In practice, the posterior is obtained by approximation techniques → error.
  68. 88.

    To summarize …

    1. Theoretical results to explain the convergence behavior of LDA.
       ๏ “How does posterior converge as data increases?”
       ๏ Limiting factors: number of documents, length of docs, number of topics, …
    2. Empirical study to support the theory.
       ๏ Synthetic data: various settings e.g. number of docs / topics, length of docs, …
       ๏ Real data sets: Wikipedia, the New York Times, and Twitter.
    3. Guidelines for the practical use of LDA.
       ๏ Number of docs, length of docs, number of topics
       ๏ Topic / document separation, Dirichlet parameters, …
  69. 89.

    Some references (1)

    ๏ [Blei&Lafferty 2009] Topic Models
       http://www.cs.princeton.edu/~blei/papers/BleiLafferty2009.pdf
    ๏ [Blei 2011] Introduction to Probabilistic Topic Models
       https://www.cs.princeton.edu/~blei/papers/Blei2011.pdf
    ๏ [Blei 2012] Review Articles: Probabilistic Topic Models
       Communications of The ACM
       http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf
    ๏ [Blei 2012] Probabilistic Topic Models
       Machine Learning Summer School
       http://www.cs.princeton.edu/~blei/blei-mlss-2012.pdf
    ๏ Topic Models by David Blei (video)
       https://www.youtube.com/watch?v=DDq3OVp9dNA
    ๏ What is a good explanation of Latent Dirichlet Allocation? - Quora
       http://www.quora.com/What-is-a-good-explanation-of-Latent-Dirichlet-Allocation
    ๏ The LDA Buffet is Now Open; or, Latent Dirichlet Allocation for English Majors by Matthew L. Jockers
       http://www.matthewjockers.net/2011/09/29/the-lda-buffet-is-now-open-or-latent-dirichlet-allocation-for-english-majors/
    ๏ [Mochihashi&Ishiguro 2013] Probabilistic Topic Models (in Japanese)
       The Institute of Statistical Mathematics, FY2012 (H24) open lecture
       http://www.ism.ac.jp/~daichi/lectures/ISM-2012-TopicModels-daichi.pdf
  70. 90.

    Some references (2)

    ๏ [Blei+ 2003] Latent Dirichlet Allocation
       Journal of Machine Learning Research
       http://machinelearning.wustl.edu/mlpapers/paper_files/BleiNJ03.pdf
    ๏ [Nguyen 2012] Posterior contraction of the population polytope in finite admixture models
       arXiv preprint arXiv:1206.0068
       http://arxiv.org/abs/1206.0068
    ๏ [Tang+ 2014] Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis
       Proceedings of the 31st International Conference on Machine Learning (ICML)
       http://jmlr.org/proceedings/papers/v32/tang14.pdf