
[Tang+ 2014] "Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis"

Jian Tang, Zhaoshi Meng, XuanLong Nguyen, Qiaozhu Mei and Ming Zhang.
"Understanding the Limiting Factors of Topic Modelling via Posterior Contraction Analysis"
31st International Conference on Machine Learning (ICML), Beijing, June 2014.
http://jmlr.org/proceedings/papers/v32/tang14.pdf

Sorami Hisamoto

August 20, 2014


Transcript

  1. Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis
    Jian Tang, Zhaoshi Meng, XuanLong Nguyen, Qiaozhu Mei and Ming Zhang.
    31st International Conference on Machine Learning (ICML), Beijing, June 2014.
    Sorami Hisamoto
    August 20, 2014.


  2. Summary
    1. Theoretical results to explain the convergence behavior of LDA.
    ๏ “How does posterior converge as data increases?”
    ๏ Limiting factors: number of documents, length of docs, number of topics, …
    2. Empirical study to support the theory.
    ๏ Synthetic data: various settings e.g. number of docs / topics, length of docs, …
    ๏ Real data sets: Wikipedia, the New York Times, and Twitter.
    3. Guidelines for the practical use of LDA.
    ๏ Number of docs, length of docs, number of topics
    ๏ Topic / document separation, Dirichlet parameters, …

  6. Posterior Contraction Analysis
     Topic Modeling
     Empirical Study


  8. What is topic modeling?
     ๏ Modeling the latent “topics” of data.
     ๏ Many applications, not limited to text.
     ๏ LDA: the basic topic model (next slide)
     (Figure from [Blei+ 2003]: data, e.g. a document, is mapped to topics, e.g. word distributions.)


  10. Latent Dirichlet allocation (LDA) [Blei+ 2003]
      ๏ It assumes that each document consists of multiple topics.
      ๏ A “topic” is defined as a distribution over a fixed vocabulary.
      Two-stage generation process for each document (a code sketch follows below):
      1. Randomly choose a distribution over topics.
      2. For each word in the document:
      a) Randomly choose a topic from the distribution over topics in step 1.
      b) Randomly choose a word from the corresponding topic.
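      As a concrete illustration of this two-stage process, here is a minimal numpy sketch of the full LDA generative model (my own illustrative code, not from the paper; the topics themselves are also drawn from a symmetric Dirichlet prior, matching the graphical model shown later):

          import numpy as np

          def generate_lda_corpus(D, N, K, V, alpha=1.0, beta=0.01, seed=0):
              """Generate a synthetic corpus with LDA's two-stage process (sketch).

              Returns (true_topics, docs): a K x V array of topic-word distributions
              and a list of D documents, each a list of N word ids.
              """
              rng = np.random.default_rng(seed)
              true_topics = rng.dirichlet(np.full(V, beta), size=K)  # each topic: a distribution over V words
              docs = []
              for _ in range(D):
                  theta = rng.dirichlet(np.full(K, alpha))           # step 1: distribution over topics
                  z = rng.choice(K, size=N, p=theta)                 # step 2a: a topic per word position
                  words = [int(rng.choice(V, p=true_topics[k])) for k in z]  # step 2b: a word per topic
                  docs.append(words)
              return true_topics, docs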



  17. (Figure from [Blei 2011]: a topic is a distribution over the vocabulary; Step 1 chooses a distribution over topics; Step 2a chooses a topic from that distribution; Step 2b chooses a word from the chosen topic.)


  25. Graphical model representation (figures from [Blei 2011]): the topics, the per-document topic proportions, the per-word topic assignments, and the observed words, together with the joint distribution of hidden and observed variables (written out below).
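      For reference, the joint distribution mentioned on this slide, written in the standard notation of [Blei 2011] / [Blei 2012] (β are the topics, θ_d the per-document topic proportions, z_{d,n} the per-word topic assignments, w_{d,n} the observed words):

          p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})
            = \prod_{k=1}^{K} p(\beta_k)
              \prod_{d=1}^{D} p(\theta_d)
              \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n})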


  32. Geometric interpretation (figure from [Blei+ 2003])
      Topics are points in the word simplex.
      Step 1: choose a distribution over topics. Step 2a: choose a topic from that distribution. Step 2b: choose a word from the topic.
      LDA: finding the optimal sub-simplex to represent documents.


  33. “Reversing” the generation process
      ๏ We are interested in the posterior distribution:
      ๏ the latent topic structure, given the observed documents.
      ๏ But exact inference is intractable → approximate it:
      ๏ 1. Sampling-based methods (e.g. Gibbs sampling; a minimal sampler sketch follows below)
      ๏ 2. Variational methods (e.g. variational Bayes)
      ๏ etc.
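      The paper's experiments use collapsed Gibbs sampling for inference. Purely as an illustration of that idea, here is a bare-bones collapsed Gibbs sampler in numpy (my own sketch, not the authors' implementation):

          import numpy as np

          def collapsed_gibbs_lda(docs, K, V, alpha=1.0, beta=0.01, n_iters=200, seed=0):
              """Minimal collapsed Gibbs sampler for LDA (illustrative sketch only).

              docs: list of documents, each a list of word ids in [0, V).
              Returns (topics, theta): K x V topic-word distributions and
              D x K document-topic proportions, estimated from the final sample.
              """
              rng = np.random.default_rng(seed)
              D = len(docs)
              ndk = np.zeros((D, K))          # document-topic counts
              nkw = np.zeros((K, V))          # topic-word counts
              nk = np.zeros(K)                # tokens per topic
              z = [rng.integers(K, size=len(doc)) for doc in docs]   # random initial assignments

              for d, doc in enumerate(docs):
                  for w, k in zip(doc, z[d]):
                      ndk[d, k] += 1
                      nkw[k, w] += 1
                      nk[k] += 1

              for _ in range(n_iters):
                  for d, doc in enumerate(docs):
                      for i, w in enumerate(doc):
                          k = z[d][i]
                          ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1   # remove this token
                          # full conditional p(z = k | all other assignments)
                          p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                          k = rng.choice(K, p=p / p.sum())
                          z[d][i] = k
                          ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1   # add it back

              topics = (nkw + beta) / (nk[:, None] + V * beta)
              theta = (ndk + alpha) / (ndk.sum(axis=1, keepdims=True) + K * alpha)
              return topics, theta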


  37. FAQs on LDA
      ๏ Is my data topic-model “friendly”?
      ๏ Why did LDA fail on my data?
      ๏ How many documents do I need to learn 100 topics?
      ๏ Machine learning folklore …

  39. Posterior Contraction Analysis
      Topic Modeling
      Empirical Study


  41. Convergence behavior of the posterior
      ๏ How does the posterior's convergence behavior change as the data increases?
      ๏ → The paper introduces a metric describing a contracting neighborhood, centred at the true topic values, on which the posterior distribution will be shown to place most of its probability mass.
      ๏ The faster the contraction, the more efficient the statistical inference.


  42. … but it is difficult to analyze individual topics
      ๏ Issue of identifiability
      ๏ “Label-switching” issue: one can only identify the topic collection up to a permutation.
      ๏ Any vector that can be expressed as a convex combination of the topic parameters would be hard to identify and analyze.


  46. Latent topic polytope in LDA
      Topic polytope: the convex hull of the topics (the topics are its vertices).
      Distance between two polytopes: “minimum-matching” Euclidean distance (a code sketch follows below).
      Figures from [Tang+ 2014]
      * Intuitively, this metric is a stable measure of the dissimilarity between two topic polytopes.
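      A numpy sketch of the minimum-matching Euclidean distance as I read it from [Tang+ 2014] / [Nguyen 2012]: match each vertex (topic) of one polytope to its nearest vertex in the other, take the worst such match, and symmetrize (illustrative code, not the authors'):

          import numpy as np

          def min_matching_distance(topics_a, topics_b):
              """Minimum-matching Euclidean distance between two topic polytopes (sketch).

              topics_a, topics_b: arrays of shape (K_a, V) and (K_b, V) whose rows
              are the vertices (topics) of each polytope.
              """
              diff = topics_a[:, None, :] - topics_b[None, :, :]
              dist = np.sqrt((diff ** 2).sum(-1))      # pairwise distances, shape (K_a, K_b)
              a_to_b = dist.min(axis=1).max()          # worst vertex of A vs its nearest vertex in B
              b_to_a = dist.min(axis=0).max()          # worst vertex of B vs its nearest vertex in A
              return max(a_to_b, b_to_a)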


  48. Geometric interpretation (figure from [Blei+ 2003]): the topic polytope is the sub-simplex spanned by the topics inside the word simplex.


  49. Upper bound for the learning rate
      (Figures from [Tang+ 2014]; the bound itself is Theorem 1 in the paper.)
      G*: true topic polytope
      K*: true number of topics
      D: number of documents
      N: length of documents


  53. Observations from Theorem 1
      ๏ From (3), we should have log D < N: the length of documents should be at least on the order of log D, up to a constant factor (a small numeric illustration follows below).
      ๏ In the empirical study, the last term of (5) does not appear to play a noticeable role → the 3rd term may be an artefact of the proof technique.
      ๏ In practice the actual rate could be faster than the given upper bound. However, this looseness of the upper bound only occurs in the exponent; the dependence on 1/D and 1/N should remain, due to a lower bound → Sec. 3.1.4 & [Nguyen 2012].
      ๏ Condition A2: well-separated topics → small β.
      ๏ The convergence rate does not depend on the number of topics K → once K is known, or topics are well-separated, LDA inference is statistically efficient.
      ๏ In practice we do not know K*: under-fitting results in a persistent error even with an infinite amount of data, so we are most likely to prefer the over-fitted setting (K >> K*).
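      A quick numeric reading of the first point (my own illustration, not from the paper): log D grows very slowly, so even for large corpora a modest document length already satisfies N > log D.

          import math

          for D in (1_000, 100_000, 10_000_000):
              print(f"D = {D:,}  ->  log D = {math.log(D):.1f}")
          # prints log D of roughly 6.9, 11.5 and 16.1: a few dozen words per document already exceeds log D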



  61. Theorem for general situations
      When neither condition A1 nor A2 in Theorem 1 holds, the upper bound deteriorates with K.
      cf. [Nguyen 2012] for more detail.


  62. Posterior Contraction Analysis
      Topic Modeling
      Empirical Study



  66. Empirical study: metrics
      Distance between two polytopes: “minimum-matching” Euclidean distance (as sketched above).
      When the number of vertices of a polytope in general position is smaller than the number of dimensions, all such vertices are also the extreme points of their convex hull.


  67. Experiments on synthetic data
      ๏ Create synthetic data sets by the LDA generative process.
      ๏ Default settings:
      ๏ true number of topics K*: 3
      ๏ vocabulary size |V|: 5,000
      ๏ symmetric Dirichlet prior for topic proportions (α): 1
      ๏ symmetric Dirichlet prior for word distributions (β): 0.01
      ๏ Model inference: collapsed Gibbs sampling.
      ๏ Learning error: posterior mean of the metric.
      ๏ Reported results: averaged over 30 simulations (a sketch of one such run follows below).
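      A rough sketch of how one setting of this synthetic study could be reproduced, reusing the generate_lda_corpus, collapsed_gibbs_lda and min_matching_distance sketches from earlier slides (all of those are my own illustrative helpers, not the authors' code; the error below is a point estimate rather than the posterior mean used in the paper):

          import numpy as np

          def scenario_error(D, N, K_fit, K_true=3, V=5000, alpha=1.0, beta=0.01, n_sims=30):
              """Average learning error for one (D, N, K_fit) setting (illustrative sketch)."""
              errors = []
              for s in range(n_sims):
                  true_topics, docs = generate_lda_corpus(D, N, K_true, V, alpha, beta, seed=s)
                  est_topics, _ = collapsed_gibbs_lda(docs, K_fit, V, alpha, beta)
                  errors.append(min_matching_distance(true_topics, est_topics))
              return float(np.mean(errors))

          # e.g. Scenario I: fixed N = 500, increasing D, exact-fitted vs over-fitted
          # for D in (10, 100, 1000, 7000):
          #     print(D, scenario_error(D, N=500, K_fit=3), scenario_error(D, N=500, K_fit=10))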



  71. Scenario I: fixed N and increasing D
      ๏ N = 500
      ๏ D = 10 to 7,000 (main varying term, compared in the graphs)
      ๏ K
      ๏ = 3 = K*: exact-fitted
      ๏ = 10: over-fitted
      ๏ β
      ๏ = 0.01: well-separated topics
      ๏ = 1: more word-diffuse, less distinguishable topics
      (Graph panels compare β = 0.01 vs β = 1, and K = K* vs K > K*.)
      1. Same β but different K: when LDA is over-fitted (i.e. K > K*), the performance degenerates significantly.
      2. Same K but different β: with a larger β, the error curves decay faster while little data is available; as more data becomes available the decay slows, then flattens out. By contrast, a small β results in a more efficient learning rate.
      3. K = K*: the error rate seems to match (log D / D)^0.5 quite well. In the over-fitted case, the rate is slower.


  73. Scenario II: fixed D and increasing N
      ๏ N = 10 to 1,400 (main varying term, compared in the graphs)
      ๏ D = 1,000
      ๏ K
      ๏ = 3 = K*: exact-fitted
      ๏ = 5: over-fitted
      ๏ β
      ๏ = 0.01: well-separated topics
      ๏ = 1: more word-diffuse, less distinguishable topics
      Behavior similar to Scenario I. In over-fitted cases (K > K*), the error fails to vanish even as N becomes large, possibly due to the log D / D term in the upper bound.


  77. Scenario III: N = D, both increasing
      ๏ N = D: 10 to 1,300
      ๏ K = {3, 5}
      ๏ β = {0.01, 1}
      As in the previous scenarios, LDA is most effective in the exact-fitted setting (K = K*) with sparse topics (small β). When both conditions fail, the error fails to converge to zero even as the data size D = N increases.
      The empirical error decays faster than indicated by the upper bound (log D / D)^0.5 from Thm. 1. A rough estimate could be Ω(1/D), which matches the theoretical lower bound of the error contraction rate (cf. Thm. 3 in [Nguyen 2012]). This suggests that the upper bound in Thm. 1 could be quite conservative in certain configurations and scenarios.


  81. Exponential exponents of the error rate
      ๏ 2 scenarios:
      ๏ Fixed N = 5 and increasing D.
      ๏ D = N, both increasing.
      Exact-fitted (K = K*): the slope of the log error seems close to 1 → matches the lower bound Ω(1/D).
      Over-fitted (K > K*): the slopes tend toward the range bounded by 1/(2K) = 0.1 and 2/K = 0.4 → approximations of the exponents of the lower/upper bounds given by the theory.


  82. Experiments on real data sets
      ๏ Wikipedia, New York Times articles, and Twitter.
      ๏ To test the effects of the four limiting factors: N, D, α, β.
      ๏ Ground-truth topics unknown → use PMI (defined below) or perplexity.
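      The coherence metric is not spelled out on the slide; for reference, the standard pointwise mutual information between two words, which topic-coherence scores are commonly built from by averaging over each topic's top word pairs (this is the common definition, not necessarily the exact variant used in [Tang+ 2014]):

          \mathrm{PMI}(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_i)\, p(w_j)}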


  84. Results on real data sets: the New York Times, Wikipedia, and Twitter, under four settings: fixed D with increasing N; fixed N with increasing D; fixed N & D with varying α; fixed N & D with varying β.
      Results are consistent with the theory and with the empirical analysis on synthetic data.
      With extreme data (e.g. very short or very few documents), or when hyperparameters are not appropriately set, performance suffers.
      Results suggest favorable ranges of the parameters: small β; small α (Wikipedia) or large α (NYT, Twitter).


  85. Implications and guidelines: 1 & 2
      1. Number of documents: D
      ๏ Impossible to guarantee identification of topics from a small D, no matter how long the documents are.
      ๏ Once D is sufficiently large, further increases may not significantly improve the result, unless N is also suitably increased.
      ๏ In practice, LDA achieves comparable results even if thousands of documents are sampled from a much larger collection.
      2. Length of documents: N
      ๏ Poor results are expected when N is small, even if D is large.
      ๏ Ideally, N needs to be sufficiently long, but it need not be too long.
      ๏ In practice, for very long documents, one can sample a fraction of each document and LDA still yields comparable topics.


  86. Implications and guidelines: 3, 4, & 5
      3. Number of topics: K
      ๏ If K > K*, inference may become inefficient.
      ๏ In theory, the convergence rate deteriorates quickly to a nonparametric rate, depending on the number of topics used to fit LDA → be careful not to use too large a K.
      4. Topic / document separation: LDA performs well when …
      ๏ Topics are well-separated.
      ๏ Individual documents are associated mostly with a small subset of topics.
      5. Hyperparameters
      ๏ If you think each document is associated with few topics, set α small (e.g. 0.1).
      ๏ If the topics are known to be word-sparse, set β small (e.g. 0.01) → more efficient learning (see the example below).
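      As a concrete illustration of guideline 5, this is how such priors could be set with gensim's LdaModel (my own example with toy data; gensim's alpha is the document-topic Dirichlet α and eta is the topic-word Dirichlet β):

          from gensim import corpora, models

          # toy corpus: a list of tokenized documents (placeholder data)
          texts = [["topic", "model", "lda"], ["word", "distribution", "topic"]]
          dictionary = corpora.Dictionary(texts)
          corpus = [dictionary.doc2bow(t) for t in texts]

          # few topics per document -> small alpha; word-sparse topics -> small beta (eta)
          lda = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
                                alpha=0.1, eta=0.01, passes=10)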


  87. Limitations of existing results
      1. Geometrically intuitive assumptions
      ๏ e.g. in reality we do not know how well-separated the topics are, nor whether their convex hull is geometrically degenerate.
      ๏ → it may be beneficial to impose additional geometric constraints on the prior.
      2. True / approximated posterior
      ๏ Here the true posterior distribution is considered.
      ๏ In practice, the posterior is obtained by approximation techniques → additional error.


  88. To summarize …
    1. Theoretical results to explain the convergence behavior of LDA.
    ๏ “How does posterior converge as data increases?”
    ๏ Limiting factors: number of documents, length of docs, number of topics, …
    2. Empirical study to support the theory.
    ๏ Synthetic data: various settings e.g. number of docs / topics, length of docs, …
    ๏ Real data sets: Wikipedia, the New York Times, and Twitter.
    3. Guidelines for the practical use of LDA.
    ๏ Number of docs, length of docs, number of topics
    ๏ Topic / document separation, Dirichlet parameters, …

  89. Some references (1)
      ๏ [Blei & Lafferty 2009] Topic Models. http://www.cs.princeton.edu/~blei/papers/BleiLafferty2009.pdf
      ๏ [Blei 2011] Introduction to Probabilistic Topic Models. https://www.cs.princeton.edu/~blei/papers/Blei2011.pdf
      ๏ [Blei 2012] Review Articles: Probabilistic Topic Models. Communications of the ACM. http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf
      ๏ [Blei 2012] Probabilistic Topic Models. Machine Learning Summer School. http://www.cs.princeton.edu/~blei/blei-mlss-2012.pdf
      ๏ Topic Models by David Blei (video). https://www.youtube.com/watch?v=DDq3OVp9dNA
      ๏ What is a good explanation of Latent Dirichlet Allocation? - Quora. http://www.quora.com/What-is-a-good-explanation-of-Latent-Dirichlet-Allocation
      ๏ The LDA Buffet is Now Open; or, Latent Dirichlet Allocation for English Majors, by Matthew L. Jockers. http://www.matthewjockers.net/2011/09/29/the-lda-buffet-is-now-open-or-latent-dirichlet-allocation-for-english-majors/
      ๏ [Mochihashi & Ishiguro 2013] Probabilistic Topic Models. The Institute of Statistical Mathematics, FY2012 (H24) public lecture. http://www.ism.ac.jp/~daichi/lectures/ISM-2012-TopicModels-daichi.pdf


  90. Some references (2)
      ๏ [Blei+ 2003] Latent Dirichlet Allocation. Journal of Machine Learning Research. http://machinelearning.wustl.edu/mlpapers/paper_files/BleiNJ03.pdf
      ๏ [Nguyen 2012] Posterior contraction of the population polytope in finite admixture models. arXiv preprint arXiv:1206.0068. http://arxiv.org/abs/1206.0068
      ๏ [Tang+ 2014] Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis. Proceedings of the 31st International Conference on Machine Learning (ICML). http://jmlr.org/proceedings/papers/v32/tang14.pdf
