
Nonparametric Topic Modelling


Latent Dirichlet Allocation, typically referred to as Topic Modeling, has proven to be a profoundly useful model in many areas of application. It uncovers latent "topics" in a corpus, leveraging the properties of the Dirichlet distribution to encourage sparseness. This Dirichlet distribution is of fixed size, making the choice of number of topics an important model parameter. By swapping out this fixed-size distribution for the up-to-infinite-sized Dirichlet process, the number of topics will be estimated by the model as well. Although this requires some extra book-keeping, the Gibbs sampling update is not substantially more difficult. At the end of this talk, you should understand why Latent Dirichlet Allocation and the Hierarchical Dirichlet Process have been so effective on so many tasks, and be able to implement them yourself.

MunichDataGeeks

April 14, 2015

Transcript

  1. Topic Modeling in general
     Set of documents: “Lorem ipsum dolor sit amet...”, ...
     Set of topics: Sports, Politics, Health / Fitness, Technology, Education, Finance
     Topics per document: 70% Sports, 30% Education | 50% Technology, 50% Politics | 100% Sports | 80% Education, 20% Finance
  2. A topic is a list of words
     Sports: football, goal, score, points, victory, team, ...
     Politics: unemployment, reform, election, taxes, officials, ...
     Health / Fitness: diet, exercise, heart, healthy, vegetables, ...
     Technology: computer, tech, startup, ORM, TrustYou, ...
     Education: school, university, teacher, student, classes, exams, ...
     Finance: stock, purchase, company, market, funds, budget, ...
  3. Topic Modeling Variants
     • PLSI
     • LDA
       ◦ by far the most popular variant
       ◦ the full graphical model
     • HDP
       ◦ easier than it sounds
       ◦ how to implement it
     • Implementations
       ◦ off-the-shelf
       ◦ fancy algorithms
     • Supervised or semi-supervised variants
  4. LDA: Latent Dirichlet Allocation
     • “Latent” because we don’t know the topics in advance
     • “Dirichlet” because the Dirichlet distribution is used
     • “Allocation” because we allocate (assign) a latent variable to each word
  5. Intuition
     LDA is a model that mathematically encodes:
     • Topics should have only a “small number” of relevant words
     • Documents should be fully represented by only a “small number” of topics
     Given a set of documents and parameters, LDA will try to satisfy both requirements as well as possible
  6. The graphical model
     Full dependency graph to generate the documents:
     • For each document i = 1, …, N we have a θ (θ is the document’s distribution over topics)
     • For each word j = 1, …, L in that document we have a k and a w (w is the word, k is the topic that word is assigned to)
     • For each topic 1, …, K we have a ϕ (ϕ is the distribution of words in the topic)
     (Plate diagram: w and its topic k sit with θ inside the word plate j = 1..L and the document plate i = 1..N; the ϕ plate ranges over the K topics; β and γ are the hyperparameters.)
     • θ ~ Dir(β)
     • k ~ Mult(θ)
     • ϕ_k ~ Dir(γ)
     • w ~ Mult(ϕ_k)
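For concreteness, a minimal sketch of this generative process in Python. The corpus sizes (N, L, K, V) and the hyperparameter values are illustrative placeholders, not numbers from the talk.

```python
# Minimal sketch of the LDA generative process above, using numpy.
# All sizes and hyperparameters here are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)
N, L, K, V = 100, 50, 10, 1000   # documents, words per doc, topics, vocabulary size
beta, gamma = 0.1, 0.01          # document-topic and topic-word priors

phi = rng.dirichlet([gamma] * V, size=K)            # phi_k ~ Dir(gamma): word dist per topic
documents = []
for i in range(N):
    theta = rng.dirichlet([beta] * K)               # theta ~ Dir(beta): topic dist per document
    ks = rng.choice(K, size=L, p=theta)             # k ~ Mult(theta): topic of each word
    words = [rng.choice(V, p=phi[k]) for k in ks]   # w ~ Mult(phi_k)
    documents.append(words)
```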
  7. How to do it (Gibbs Sampling)
     1. Assign each individual word in each document to a particular topic (random is fine to start)
     2. One word at a time, reassign the topic probabilistically so that, in general, the generative model is well satisfied
     3. Stop when the topic distribution is more-or-less stable
  8. Iterative improvement of the model
     1. Start with a random assignment of words to topics
     2. For each word:
        a. Remove the word from the model
        b. Draw θ and ϕ based on all *other* assignments of words to topics
        c. Reassign a new topic k based on those estimates
     3. Repeat step 2 for a long while
     Estimating θ and ϕ is where choosing a Dirichlet distribution comes in handy
  9. Dirichlet makes estimation easy
     From before, in the sampling step: “Compute estimates of θ and ϕ based on all *other* assignments of words to topics”
     This estimation step is unnecessary with a Dirichlet prior, because the Dirichlet is the conjugate prior of the multinomial
     We can integrate θ and ϕ out, and actually only need to count
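Stated concretely: this is the standard Dirichlet-multinomial posterior predictive, written here in the talk's notation (β as the document-topic prior, n_{D,k} the count of topic k among the *other* words of document D, K the number of topics) rather than copied from the slides. Integrating θ out against its posterior leaves an expression that depends only on counts and the prior.

```latex
% Dirichlet-multinomial posterior predictive (standard identity):
p(k \mid \text{other assignments in } D)
  = \int \theta_k \,\mathrm{Dir}\!\left(\theta \mid \{\, n_{D,k'} + \beta \,\}_{k'=1}^{K}\right) d\theta
  = \frac{n_{D,k} + \beta}{\sum_{k'} n_{D,k'} + K\beta}
```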
  10. Resampling topics
      To reassign the topic of word w in document D, compute, for each topic k:
      p(k | θ, D) ~= (count(k in D) + β) / (size(D) + K*β)
        (topics that aren’t in the document are unlikely)
      p(w | k, ϕ) ~= (count(w in ϕ_k) + γ) / (size(ϕ_k) + V*γ)   [1]
        (topics this word isn’t in are unlikely)
      Bayes’ rule: p(k | θ, D, w, ϕ) ~= p(k | θ, D) * p(w | k, ϕ)
      Choose the new k based on this final distribution.
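A sketch of this resampling step for a single word, combining it with the remove/reassign/add loop from slide 8. The array names (doc_topic_counts, topic_word_counts, topic_sizes, doc_sizes) are illustrative, not from the talk; β and γ play the same roles as above.

```python
# Minimal sketch of one collapsed Gibbs update for LDA, in the talk's notation.
import numpy as np

def resample_topic(w, d, old_k, doc_topic_counts, topic_word_counts,
                   topic_sizes, doc_sizes, beta, gamma, rng):
    K, V = topic_word_counts.shape

    # Remove the word's current assignment from all counts.
    doc_topic_counts[d, old_k] -= 1
    topic_word_counts[old_k, w] -= 1
    topic_sizes[old_k] -= 1
    doc_sizes[d] -= 1

    # p(k | D): topics absent from the document get small (but nonzero) weight.
    p_k_given_doc = (doc_topic_counts[d] + beta) / (doc_sizes[d] + K * beta)
    # p(w | k): topics this word doesn't appear in get small weight.   [1]
    p_w_given_k = (topic_word_counts[:, w] + gamma) / (topic_sizes + V * gamma)

    p = p_k_given_doc * p_w_given_k
    p /= p.sum()
    new_k = rng.choice(K, p=p)           # choose the new k from the final distribution

    # Add the word back under its new topic.
    doc_topic_counts[d, new_k] += 1
    topic_word_counts[new_k, w] += 1
    topic_sizes[new_k] += 1
    doc_sizes[d] += 1
    return new_k
```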
  11. Implementation notes
      • Typically β ~ 0.1, γ ~ 0.01
      • β should be larger if documents are shorter; γ should in general be smaller than β
      • # of topics should represent the diversity of the documents, but it’s typically in the low hundreds
      • Trimming out stop words is EXTREMELY important to topic coherence
        ◦ topics group together low-frequency words into smooth features
        ◦ high-frequency words are disruptive to this process
          ▪ leave them out as separate features
          ▪ subsample them to remove their dominance
  12. A note on topic sizes
      Some topics, invariably, get more assignments than others. In general, there are a small number of topics which are very common, and there’s an exponential-like dropoff in topic size
  13. What happens if you use too many topics
      • The same shape emerges, but the tail is filled with small topics
      • These small topics tend to accumulate “homeless” terms from a small handful of documents
        ◦ terms that aren’t common in any of the large topics
        ◦ often they’re just rare or uncommon terms
      • Eventually these small topics are full of garbage terms, because all available topics get used
  14. Example topics - decent
      • rome italy pantheon florence milan duomo vatican sights sites euro square train station europe ...
      • memories club ages boys adult ride ice cream family vacation teens mini golf teenagers child burgers ...
      • hustle bustle crowds peace craziness quieter madness chaos calm tourists escape requests break ...
      • golfers greens golf club holes golfer rounds game groups clubhouse clubs hole courses surroundings ...
      (from runs with 150 topics and 500 topics)
  15. Example topics - garbage
      • priceline bartender hotel restaurant microwave waitress hot tub nice hotel king room discount bar area fitness center restaurant staff convention ...
      • mandalay bay mandalay thehotel tram wave pool delano excalibur blues mb south end monorail tubes aquarium ...
      • nyny roller coaster “new york new york” ny ny new york york mgm grand arcade spa suite piano bar pizza ...
      • deli bathtub mile shower head love business center rating room key fitness room ill midnight furnishings away ...
      (from runs with 150 topics and 500 topics)
  16. Topic size distribution in a hotel reviews corpus (50 topics, β ~ 0.1, γ ~ 0.01)
      [Plot: topic number vs. log(number of words assigned to topic)]
  17. Topic size distribution in a hotel reviews corpus (500 topics, β ~ 0.1, γ ~ 0.01)
      [Plot: topic number vs. log(number of words assigned to topic)]
  18. HDP: Hierarchical Dirichlet Process
      • Swap the fixed-size Dirichlet distribution prior for an infinite Dirichlet process prior
        ◦ Infinite in the “no fixed (finite) size” sense
      • Hierarchical because we swap out both the multinomial inside a document and the Dirichlet distribution it was drawn from
        ◦ We have a DP for each document, each drawn from a global DP of topics
  19. The Dirichlet Process: Polya’s Urn
      • An urn contains one black ball
      • At each iteration, we draw a ball from the urn
        ◦ If the ball is black, we choose a new color and put a ball of that color in the urn
        ◦ If the ball is not black, we note the color and put a ball of the same color in
        ◦ Either way, we replace the ball we drew
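A small simulation of this urn scheme, as a sketch; here alpha plays the role of the black ball's weight (later slides note it need not be an integer), and the number of draws and the seed are arbitrary choices.

```python
# Polya urn sketch: one "black" ball of weight alpha creates new colors;
# existing colors are reinforced each time they are drawn.
import random
from collections import Counter

def polya_urn(n_draws, alpha=1.0, seed=0):
    rng = random.Random(seed)
    color_counts = Counter()   # balls of each non-black color currently in the urn
    next_color = 0
    for _ in range(n_draws):
        total = alpha + sum(color_counts.values())
        if rng.random() < alpha / total:
            # Drew the black ball: add a ball of a brand-new color.
            color_counts[next_color] += 1
            next_color += 1
        else:
            # Drew a colored ball: note its color, add another ball of that color.
            colors, weights = zip(*color_counts.items())
            c = rng.choices(colors, weights=weights)[0]
            color_counts[c] += 1
    return color_counts

print(polya_urn(1000))   # a few large colors, then a long tail of small ones
```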
  20. The Dirichlet Process
      Characterization of the urn-filling:
      • New colors become less likely each iteration
      • Colors that have been drawn often are more likely to be drawn again
      • In general, we expect an exponential drop-off in the number of balls for each color
  21. Why should this work?
      • When sampling, we first remove a ball from the urn completely, ignoring its color
      • We do the “draw a ball, note the color, …” procedure to replace it
      • There is selective pressure for minority colors to become even smaller, and eventually disappear
        ◦ only a small number of colors get used
        ◦ the typical number of colors increases slowly with data size
  22. The graphical model
      Full dependency graph to generate the documents:
      • Now words are grouped together inside documents with a Dirichlet process G_D
      • Each group in each document gets a topic from a Dirichlet process G_C
      • α is, essentially, how many black balls are in the urn (not restricted to integer values)
      • Topics are drawn from fixed-size Dirichlet priors, as before
      (Plate diagram: w and its group d sit with G_D inside the word plate j = 1..L and the document plate i = 1..N; G_C is global; α and γ are the hyperparameters.)
      • G_C ~ DP(α, γ)
      • G_D ~ DP(α, G_C)
      • d ~ G_D
      • ϕ ~ Dir(γ)
      • w ~ Mult(ϕ_d)
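A rough sketch of this two-level process in its urn form: within a document, a word joins an existing group in proportion to the group's size or opens a new group with weight α; a new group picks an existing topic in proportion to how many groups already use it, or a new topic with weight α. All names and sizes are illustrative, and this is only one way to read the slide's notation.

```python
# HDP generative sketch in urn (Chinese-restaurant-franchise) form.
import numpy as np

rng = np.random.default_rng(0)
N, L, V = 100, 50, 1000          # documents, words per document, vocabulary size
alpha, gamma = 1.0, 0.01         # black-ball weight and topic-word prior

topic_phis = []                  # phi ~ Dir(gamma), created lazily as topics appear
topic_group_counts = []          # corpus level: how many groups use each topic
documents = []

def pick(weights):
    w = np.asarray(weights, dtype=float)
    return rng.choice(len(w), p=w / w.sum())

for i in range(N):
    group_sizes, group_topics, words = [], [], []
    for j in range(L):
        # d ~ G_D: existing group (weight = size) or new group (weight = alpha)
        g = pick(group_sizes + [alpha])
        if g == len(group_sizes):
            # new group: pick its topic from G_C, existing or brand new
            t = pick(topic_group_counts + [alpha])
            if t == len(topic_group_counts):
                topic_phis.append(rng.dirichlet([gamma] * V))
                topic_group_counts.append(0)
            topic_group_counts[t] += 1
            group_sizes.append(0)
            group_topics.append(t)
        group_sizes[g] += 1
        words.append(rng.choice(V, p=topic_phis[group_topics[g]]))  # w ~ Mult(phi_d)
    documents.append(words)
```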
  23. Resampling groups for words
      To reassign the group of word w in document D:
      p(d | D) ~=  count(d in D)   if d ≠ d_new
                   α               if d = d_new
      p(w | d, ϕ) ~=  (count(w in ϕ_d) + γ) / (size(ϕ_d) + V*γ)     [1]   if d ≠ d_new
                      E[ (count(w in ϕ_d) + γ) / (size(ϕ_d) + V*γ) ]      if d = d_new
      Bayes’ rule: p(d | D, w, ϕ) ~= p(d | D) * p(w | d, ϕ)
  24. Resampling topics for groups
      To reassign the topic of group d in corpus C:
      p(t | C) ~=  count(t in C)   if t ≠ t_new
                   α               if t = t_new
      p(d | t, ϕ) ~=  Prod_{w ϵ d} ( [1] )       if t ≠ t_new
                      E[ Prod_{w ϵ d} ( [1] ) ]  if t = t_new
      Implementation: we only sample new topics when we’ve selected a new group, so we don’t actually take a product
  25. Removing Old Groups / Topics
      Before resampling the group for a word, we remove that word from the model, updating the counts as we do so. If this was the only word in a particular group, the group disappears. (The same happens when reassigning groups to topics.) We should remove these “phantom” groups
  26. LDA Implementation: Bookkeeping
      • Topic assignment for each word in each document (|C|)
      • Counts of all words in each topic (V × K)
      • Counts of all topics in each document (N × K)
      Remove a term: subtract 1 from the appropriate element in each of the above matrices
      Add a term: add 1 to the appropriate element in each of the above matrices
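A sketch of setting this bookkeeping up from a random initial assignment (step 1 of the Gibbs sampler), using the same illustrative array names as the resampling sketch earlier; none of these names come from the talk.

```python
# Build the count structures for collapsed Gibbs LDA from a random assignment.
import numpy as np

def initialize(documents, K, V, rng):
    N = len(documents)
    assignments = []                                 # topic of each word in each document
    doc_topic_counts = np.zeros((N, K), dtype=int)   # topics per document
    topic_word_counts = np.zeros((K, V), dtype=int)  # words per topic
    topic_sizes = np.zeros(K, dtype=int)
    doc_sizes = np.array([len(doc) for doc in documents])
    for i, doc in enumerate(documents):
        ks = rng.integers(K, size=len(doc))          # random is fine to start
        assignments.append(ks)
        for w, k in zip(doc, ks):
            doc_topic_counts[i, k] += 1              # "add a term": +1 in each structure
            topic_word_counts[k, w] += 1
            topic_sizes[k] += 1
    return assignments, doc_topic_counts, topic_word_counts, topic_sizes, doc_sizes
```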
  27. HDP Implementation: Bookkeeping
      • Document group assignment for each word in each document (|C|)
      • Sizes of each group in each document (varies)
      • Topic assignment for each document group (varies in size)
      • Sizes of each topic (varies)
      • Counts of all words in each topic (a variable number of vectors, each of size V)
  28. HDP Implementation: Sampling
      Remove a term: subtract 1 from the appropriate counts; delete a group / topic when necessary
      Add a term: add 1 to the appropriate counts; create a group / topic when necessary
  29. Lazy alternative to HDP
      • Run LDA with more topics than you could possibly need
      • Set β = α / (# of topics)
      • Delete garbage topics
        ◦ Automatically:
          ▪ Total count of the topic and # of unique terms assigned seem to be useful heuristics
          ▪ Try (# unique terms) / (total count)
        ◦ Manually: more accurate, reasonably fast
      • If required, re-run sampling with fixed topics to get appropriate document distributions
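A sketch of the automatic heuristic mentioned above, assuming the topic_word_counts matrix from the earlier sketches; the threshold value is purely illustrative and would need tuning on a real corpus.

```python
# Flag likely "garbage" topics by the (# unique terms) / (total count) ratio.
import numpy as np

def flag_garbage_topics(topic_word_counts, threshold=0.5):
    totals = topic_word_counts.sum(axis=1)            # total count per topic
    uniques = (topic_word_counts > 0).sum(axis=1)     # unique terms per topic
    ratio = uniques / np.maximum(totals, 1)           # (# unique terms) / (total count)
    return np.where(ratio > threshold)[0]             # diffuse topics full of one-off terms
```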
  30. Off-the-shelf LDA
      PLDA (https://code.google.com/p/plda/)
      • C++ implementation based on MPI
      • Extremely fast and easy to distribute to many processors / networked computers
      • Straightforward text interface
        ◦ The first pass produces only topics; you need to run a second executable to get document distributions
        ◦ Special characters (including some punctuation) seem to break the text interface; use only integer counts and plain text characters for safety; YMMV
  31. Other speedups
      • Bayesian nonparametrics: fast updates, but setting hyperparameters isn’t as intuitive
      • Locality-sensitive hashing: can reduce the cost of sampling to less than 1 clock cycle, i.e. less than the cost of the bookkeeping itself