Nonparametric Topic Modeling
Daniel Peterson - Munich DataGeeks

Topic Modeling in general
• Set of documents (placeholder text on the slide: "Lorem ipsum dolor sit amet...")
• Set of topics: Sports, Politics, Health / Fitness, Technology, Education, Finance
• Topics per document, for example:
  ◦ 70% Sports, 30% Education
  ◦ 50% Technology, 50% Politics
  ◦ 100% Sports
  ◦ 80% Education, 20% Finance

A topic is a list of words
• Sports: football, goal, score, points, victory, team, ...
• Politics: unemployment, reform, election, taxes, officials, ...
• Health / Fitness: diet, exercise, heart, healthy, vegetables, ...
• Technology: computer, tech, startup, ORM, TrustYou, ...
• Education: school, university, teacher, student, classes, exams, ...
• Finance: stock, purchase, company, market, funds, budget, ...

Topic Modeling Variants
• PLSI
• LDA
  ◦ by far the most popular variant
  ◦ the full graphical model
• HDP
  ◦ easier than it sounds
  ◦ how to implement it
• Implementations
  ◦ off-the-shelf
  ◦ fancy algorithms
• Supervised or semi-supervised variants

LDA: Latent Dirichlet Allocation
• “Latent” because we don’t know the topics in advance
• “Dirichlet” because the Dirichlet distribution is used
• “Allocation” because we allocate (assign) a latent variable to each word

Intuition
LDA is a model that mathematically encodes:
• Topics should have only a “small number” of relevant words
• Documents should be fully represented by only a “small number” of topics
Given a set of documents and parameters, LDA will try to satisfy both requirements as well as possible

3-parameter Dirichlet
(Image from Yee Whye Teh)

The graphical model
Full dependency graph to generate the documents:
• For each document i = 1, …, N we have a θ (θ is the document’s distribution over topics)
• For each word j = 1, …, L in that document we have a w and a k (w is the word, k is the topic that word is assigned to)
• For each topic 1, …, K we have a ϕ (ϕ is the distribution of words in the topic)
(Plate diagram: nodes β, θ, k, w, ϕ, γ, with plates over words j = 1..L, documents i = 1..N, and topics j = 1..K)
• θ ~ Dir(β)
• k ~ Mult(θ)
• ϕ_k ~ Dir(γ)
• w ~ Mult(ϕ_k)
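The generative story above can be sketched in a few lines of NumPy. This is only an illustration, not code from the talk: the function name `generate_corpus`, the fixed document length, and the toy parameter values are assumptions made for the example.

```python
import numpy as np

def generate_corpus(n_docs, doc_len, n_topics, vocab_size, beta, gamma, seed=0):
    """Sample a toy corpus from the LDA generative model (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # One word distribution phi_k per topic: phi_k ~ Dir(gamma).
    phi = rng.dirichlet(np.full(vocab_size, gamma), size=n_topics)
    docs, topics = [], []
    for _ in range(n_docs):
        # The document's distribution over topics: theta ~ Dir(beta).
        theta = rng.dirichlet(np.full(n_topics, beta))
        # A topic for every word position: k ~ Mult(theta).
        k = rng.choice(n_topics, size=doc_len, p=theta)
        # The word itself: w ~ Mult(phi_k).
        w = np.array([rng.choice(vocab_size, p=phi[t]) for t in k])
        docs.append(w)
        topics.append(k)
    return docs, topics, phi

# Toy values chosen for the example (assumed, not recommended settings).
docs, topics, phi = generate_corpus(
    n_docs=5, doc_len=20, n_topics=3, vocab_size=50, beta=0.5, gamma=0.1
)
```

Inference runs this story in reverse: only the words are observed, and the topic assignments are what the sampler has to recover.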

How to do it (Gibbs Sampling)
1. Assign each individual word in each document to a particular topic (random is fine to start)
2. One word at a time, reassign the topic probabilistically so that, in general, the generative model is well-satisfied
3. Stop when the topic distribution is more-or-less stable

Iterative improvement of the model
1. Start with a random assignment of words to topics
2. For each word:
  a. Remove the word from the model
  b. Draw θ and ϕ based on all *other* assignments of words to topics
  c. Reassign a new topic k based on those estimates
3. Repeat step 2 for a long while
Estimating θ and ϕ is where choosing a Dirichlet distribution comes in handy

Dirichlet makes estimation easy
From before, in the sampling step: “Compute estimates of θ and ϕ based on all *other* assignments of words to topics”
This is unnecessary with Dirichlet, because the Dirichlet is the conjugate prior of the multinomial
We can integrate out θ and ϕ instead of estimating them, and actually only need to count
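A short worked version of the integration step, as a sketch: assume a symmetric Dir(β) prior over K topics, and let n_k be the number of words in the document currently assigned to topic k, not counting the word being resampled (n_k is notation introduced here, not on the slides). The posterior over θ is again a Dirichlet, so the integral reduces to a ratio of counts:

```latex
\theta \mid n \sim \mathrm{Dir}(\beta + n_1, \dots, \beta + n_K),
\qquad
p(k \mid n) = \int \theta_k \, p(\theta \mid n) \, d\theta
            = \frac{n_k + \beta}{\sum_{j=1}^{K} n_j + K\beta}
```

The same argument applied to ϕ gives a count ratio over the vocabulary.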

Resampling topics
To reassign the topic of word w in document D, compute, for each k:
p(k | θ, D) ~= (count(k in D) + β) / (size(D) + K*β)
  (topics that aren’t in the document are unlikely)
p(w | k, ϕ) ~= (count(w in ϕ_k) + γ) / (size(ϕ_k) + V*γ)   [1]
  (topics this word isn’t in are unlikely)
Bayes’ rule: p(k | θ, D, w, ϕ) ~= p(k | θ, D) * p(w | k, ϕ)
Here K is the number of topics and V is the vocabulary size.
Choose the new k based on this final distribution.
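As a sketch of this resampling step, the function below computes the normalized distribution over topics for one word from count arrays, then draws a new topic. The function name, the array layout, and the toy counts are assumptions for illustration; the normalizer uses K, the number of topics. The counts are assumed to already exclude the word being resampled.

```python
import numpy as np

def topic_distribution(doc_topic_counts, topic_word_counts, topic_sizes,
                       word, beta, gamma):
    """Normalized p(k | D, w) for one word (illustrative sketch).

    doc_topic_counts:  shape (K,), count(k in D)
    topic_word_counts: shape (K, V), count(w in topic k)
    topic_sizes:       shape (K,), total words assigned to each topic
    """
    n_topics, vocab_size = topic_word_counts.shape
    # p(k | D): topics that aren't in the document are unlikely.
    p_topic = (doc_topic_counts + beta) / (doc_topic_counts.sum() + n_topics * beta)
    # p(w | k): topics this word isn't in are unlikely.
    p_word = (topic_word_counts[:, word] + gamma) / (topic_sizes + vocab_size * gamma)
    weights = p_topic * p_word
    return weights / weights.sum()

# Toy counts: 2 topics, 3 vocabulary terms (made up for the example).
doc_topic_counts = np.array([3, 0])
topic_word_counts = np.array([[2, 1, 0], [0, 0, 4]])
topic_sizes = topic_word_counts.sum(axis=1)
probs = topic_distribution(doc_topic_counts, topic_word_counts, topic_sizes,
                           word=0, beta=0.1, gamma=0.01)
rng = np.random.default_rng(0)
new_topic = rng.choice(len(probs), p=probs)
```

In the toy case the word is common in topic 0 and the document already uses topic 0, so almost all of the probability mass lands there.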

Implementation notes
• Typically β ~ 0.1, γ ~ 0.01
• β should be larger if documents are shorter; γ should in general be smaller than β
• # of topics should represent diversity of documents, but it’s typically in the low hundreds
• Trimming out stop words is EXTREMELY important to topic coherence
  ◦ topics group together low-frequency words into smooth features
  ◦ high frequency words are disruptive to this process
    ▪ leave them out as separate features
    ▪ subsample them to remove their dominance
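One simple way to trim high-frequency words, sketched below, is a document-frequency cutoff. This is a swapped-in technique for illustration rather than the talk's own preprocessing: the function name `drop_frequent_terms` and the 50% threshold are assumptions.

```python
from collections import Counter

def drop_frequent_terms(docs, max_doc_fraction=0.5):
    """Remove terms that occur in more than max_doc_fraction of the documents."""
    # Count each term once per document it appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    limit = max_doc_fraction * len(docs)
    frequent = {term for term, n in doc_freq.items() if n > limit}
    trimmed = [[term for term in doc if term not in frequent] for doc in docs]
    return trimmed, frequent

docs = [
    ["the", "hotel", "pool"],
    ["the", "golf"],
    ["the", "rome", "hotel"],
    ["the", "train"],
]
trimmed, frequent = drop_frequent_terms(docs)
```

The removed terms are returned as well, so they can still be kept as separate features if needed.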

A note on topic sizes
Some topics, invariably, get more assignments than others. In general, there are a small number of topics which are very common, and there’s an exponential-like dropoff in topic size

What happens if you use too many topics
• The same shape emerges, but the tail is filled with small topics
• These small topics tend to accumulate just “homeless” terms from a small handful of documents
  ◦ terms that aren’t common in any of the large topics
  ◦ often they’re just rare or uncommon terms
• Eventually these small topics are full of garbage terms, because all topics available are used

Example topics - decent
• rome italy pantheon florence milan duomo vatican sights sites euro square train station europe ...
• memories club ages boys adult ride ice cream family vacation teens mini golf teenagers child burgers ...
• hustle bustle crowds peace craziness quieter madness chaos calm tourists escape requests break ...
• golfers greens golf club holes golfer rounds game groups clubhouse clubs hole courses surroundings ...
(Slide column labels: 150 Topics Used, 500 Topics Used)

Example topics - garbage
• priceline bartender hotel restaurant microwave waitress hot tub nice hotel king room discount bar area fitness center restaurant staff convention ...
• mandalay bay mandalay thehotel tram wave pool delano excalibur blues mb south end monorail tubes aquarium ...
• nyny roller coaster “new york new york” ny ny new york york mgm grand arcade spa suite piano bar pizza ...
• deli bathtub mile shower head love business center rating room key fitness room ill midnight furnishings away ...
(Slide column labels: 150 Topics Used, 500 Topics Used)

Topic size distribution in a hotel reviews corpus (50 topics, β ~ 0.1, γ ~ 0.01)
(Plot: x-axis is topic number, y-axis is log(number of words assigned to topic))

Topic size distribution in a hotel reviews corpus (500 topics, β ~ 0.1, γ ~ 0.01)
(Plot: x-axis is topic number, y-axis is log(number of words assigned to topic))

HDP: Hierarchical Dirichlet Process
• Swap the fixed-size Dirichlet distribution prior for an infinite Dirichlet process prior
  ◦ Infinite in the “no fixed (finite) size” sense
• Hierarchical because we swap the multinomial inside a document, and the Dirichlet distribution it was drawn from
  ◦ We have a DP for each document, each drawn from a global DP of topics

The Dirichlet Process: Polya’s Urn
• An urn contains one black ball
• At each iteration, we draw a ball from the urn
  ◦ If the ball is black, we choose a new color and put a ball of that color in the urn
  ◦ If the ball is not black, we note the color and put a ball of the same color in
  ◦ Either way, we replace the ball we drew
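The urn is easy to simulate, which helps build intuition for how many colors appear and how uneven their sizes are. The sketch below is an illustration only: the function name `polya_urn` is made up, and `alpha` stands for the weight of the black ball (1.0 matches the single black ball described above).

```python
import random
from collections import Counter

def polya_urn(n_draws, alpha=1.0, seed=0):
    """Simulate the urn; return the number of balls per color (black excluded)."""
    rng = random.Random(seed)
    counts = Counter()
    for i in range(n_draws):
        # After i draws the urn holds the black ball (weight alpha) plus i colored balls.
        total = alpha + i
        if rng.random() < alpha / total:
            # Black ball drawn: introduce a new color.
            counts[len(counts)] += 1
        else:
            # Colored ball drawn: add another ball of the same color.
            colors = list(counts)
            weights = [counts[c] for c in colors]
            color = rng.choices(colors, weights=weights)[0]
            counts[color] += 1
    return counts

counts = polya_urn(1000)
sizes = sorted(counts.values(), reverse=True)
```

Printing `sizes` for a few seeds shows a handful of large colors followed by a tail of small ones.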

The Dirichlet Process
Characterization of the urn-filling:
• New colors become less likely each iteration
• Colors that have been drawn often are more likely to be drawn again
• In general, we expect an exponential dropoff in number of balls for each color

Why should this work?
• When sampling, we first remove a ball from the urn completely, ignoring its color
• We do the “draw a ball, note the color, …” procedure to replace it
• There is selective pressure for minority colors to become even smaller, and eventually disappear
  ◦ only a small number of colors get used
  ◦ the typical number of colors increases slowly with data size

The graphical model
Full dependency graph to generate the documents:
• Now words are grouped together inside documents with Dirichlet process G_D
• Each group in each document gets a topic from Dirichlet process G_C
• α is, essentially, how many black balls are in the urn (not restricted to integer values)
• Topics are drawn from fixed-size Dirichlet priors, as before
(Plate diagram: nodes α, γ, G_C, α, G_D, d, w, with plates over words j = 1..L and documents i = 1..N)
• G_C ~ DP(α, γ)
• G_D ~ DP(α, G_C)
• d ~ G_D
• ϕ ~ Dir(γ)
• w ~ Mult(ϕ_d)

Resampling groups for words
To reassign the group of word w in document D:
p(d | D) ~=
  count(d in D),   d ≠ d_new
  α,               d = d_new
p(w | d, ϕ) ~=
  (count(w in ϕ_d) + γ) / (size(ϕ_d) + V*γ)   [1],   d ≠ d_new
  E[(count(w in ϕ_d) + γ) / (size(ϕ_d) + V*γ)],      d = d_new
Bayes’ rule: p(d | D, w, ϕ) ~= p(d | D) * p(w | d, ϕ)
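A sketch of how these group weights might be computed. The slide leaves the expectation for a new group unspecified; the code below assumes it is taken over the topic a new group would receive, weighting existing topics by how many groups use them and a brand-new topic by α, with an empty topic giving every word probability 1/V. All names and toy counts are hypothetical.

```python
import numpy as np

def group_weights(group_sizes, group_topics, topic_group_counts,
                  topic_word_counts, topic_sizes, word, alpha, gamma):
    """Normalized p(d | D, w) over existing groups plus one new group (last entry)."""
    vocab_size = topic_word_counts.shape[1]
    # [1]: smoothed probability of the word under each existing topic.
    p_word = (topic_word_counts[:, word] + gamma) / (topic_sizes + vocab_size * gamma)
    # Existing groups: count(d in D) times the word probability under the group's topic.
    existing = group_sizes * p_word[group_topics]
    # New group (assumption): average the word probability over the topic it would get.
    topic_prior = np.append(topic_group_counts, alpha).astype(float)
    topic_prior /= topic_prior.sum()
    p_word_with_new = np.append(p_word, 1.0 / vocab_size)
    new = alpha * np.dot(topic_prior, p_word_with_new)
    weights = np.append(existing, new)
    return weights / weights.sum()

# Toy state: one document with 2 groups, a corpus with 2 topics and 3 terms.
topic_word_counts = np.array([[5, 0, 1], [0, 6, 0]])
probs = group_weights(
    group_sizes=np.array([4, 1]),
    group_topics=np.array([0, 1]),
    topic_group_counts=np.array([3, 2]),
    topic_word_counts=topic_word_counts,
    topic_sizes=topic_word_counts.sum(axis=1),
    word=0, alpha=1.0, gamma=0.01,
)
```

If the last entry is drawn, a new group is created, and only then does a topic need to be sampled for it.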

Resampling topics for groups
To reassign the topic of group d in corpus C:
p(t | C) ~=
  count(t in C),   t ≠ t_new
  α,               t = t_new
p(d | t, ϕ) ~=
  Prod_{w ϵ d} ( [1] ),      t ≠ t_new
  E[Prod_{w ϵ d} ( [1] )],   t = t_new
Implementation: We only sample new topics when we’ve selected a new group, so we don’t actually take a product

Removing Old Groups / Topics
Before resampling the group for a word, we remove that word from the model, updating the counts as we do so.
If this was the only word in a particular group, the group disappears. (The same happens when assigning groups to topics.)
We should remove these “phantom” groups

LDA Implementation: Bookkeeping
• Topic assignment for each word in each document (|C|)
• Counts of all words in each topic (V x K)
• Counts of all topics in each document (N x K)
Remove a term: subtract 1 from the appropriate element in the above matrices
Add a term: add 1 to the appropriate element in the above matrices
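The bookkeeping can be sketched as a small class holding the assignments and the two count matrices. The class name `LdaCounts` and its methods are hypothetical; the shapes follow the convention of V vocabulary terms, K topics, and N documents.

```python
import numpy as np

class LdaCounts:
    """Topic assignments plus the two count matrices used by the sampler."""

    def __init__(self, docs, assignments, n_topics, vocab_size):
        self.docs = docs
        self.assignments = [list(a) for a in assignments]
        self.word_topic = np.zeros((vocab_size, n_topics), dtype=int)  # V x K
        self.doc_topic = np.zeros((len(docs), n_topics), dtype=int)    # N x K
        for d, words in enumerate(docs):
            for i in range(len(words)):
                self.add(d, i, self.assignments[d][i])

    def add(self, d, i, k):
        """Assign word i of document d to topic k and add 1 to both counts."""
        w = self.docs[d][i]
        self.assignments[d][i] = k
        self.word_topic[w, k] += 1
        self.doc_topic[d, k] += 1

    def remove(self, d, i):
        """Subtract 1 from both counts for word i of document d; return its old topic."""
        k = self.assignments[d][i]
        w = self.docs[d][i]
        self.word_topic[w, k] -= 1
        self.doc_topic[d, k] -= 1
        return k

# Toy corpus: 2 documents, 3 vocabulary terms, 2 topics (made up for the example).
counts = LdaCounts(docs=[[0, 1, 1], [2, 0]], assignments=[[0, 1, 1], [0, 0]],
                   n_topics=2, vocab_size=3)
old_topic = counts.remove(0, 1)   # take the word out of the model
counts.add(0, 1, 0)               # put it back under a (re)sampled topic
```

Each resampling step is one `remove`, one draw from the topic distribution, and one `add`, so the cost per word is dominated by computing the distribution.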

HDP Implementation: Bookkeeping
• Document group assignment for each word in each document (|C|)
• Sizes of each group in each document (varies)
• Topic assignment for each document group (varies in size)
• Sizes of each topic (varies)
• Counts of all words in each topic (variable number of vectors, each of size V)

HDP Implementation: Sampling
Remove a term: Subtract 1 from appropriate counts, delete a group / topic when necessary
Add a term: Add 1 to appropriate counts, create a group / topic when necessary
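Because the number of groups is not fixed, a dictionary is a natural container for the per-document group sizes. The sketch below covers only that one piece of the bookkeeping; the class name `DocumentGroups` and its methods are hypothetical, and the topic level would need the same create / delete logic.

```python
class DocumentGroups:
    """Group sizes for one document; groups are created and deleted on demand."""

    def __init__(self):
        self.sizes = {}     # group id -> number of words in the group
        self.next_id = 0

    def add(self, group_id=None):
        """Add one word to a group; passing None creates a new group first."""
        if group_id is None:
            group_id = self.next_id
            self.next_id += 1
            self.sizes[group_id] = 0
        self.sizes[group_id] += 1
        return group_id

    def remove(self, group_id):
        """Remove one word; delete the group if it is now empty. Returns True if deleted."""
        self.sizes[group_id] -= 1
        if self.sizes[group_id] == 0:
            del self.sizes[group_id]
            return True
        return False

groups = DocumentGroups()
a = groups.add()        # first word starts a new group
groups.add(a)           # second word joins it
b = groups.add()        # third word starts another group
deleted = groups.remove(b)
```

When `remove` reports a deletion, the caller is expected to also decrement the count for the topic that group was assigned to.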

Lazy alternative to HDP
• Run LDA with more topics than you could possibly need.
• Set β = α/(# of topics)
• Delete garbage topics
  ◦ Automatically:
    ▪ Total count of topic and # of unique terms assigned seem to be useful heuristics
    ▪ Try (# unique terms)/(total count)
  ◦ Manually: More accurate, reasonably fast
• If required, re-run sampling with fixed topics to get appropriate document distributions
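The (# unique terms)/(total count) heuristic is a one-liner on the topic-word count matrix. The sketch below only computes the score; which values count as garbage is left open, since any threshold would have to be tuned per corpus. The function name and toy counts are assumptions.

```python
import numpy as np

def unique_term_ratio(topic_word_counts):
    """(# unique terms) / (total count) for each topic; empty topics score 0."""
    unique_terms = (topic_word_counts > 0).sum(axis=1)
    total = topic_word_counts.sum(axis=1)
    # Guard against division by zero for topics with no assignments.
    return unique_terms / np.maximum(total, 1)

# Toy counts: topic 0 reuses a few terms heavily, topic 1 holds four one-off terms.
topic_word_counts = np.array([[50, 30, 20, 0],
                              [1, 1, 1, 1]])
scores = unique_term_ratio(topic_word_counts)
```

Empty topics score 0 here and are best dropped in a separate check on the total count.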

Off-the-shelf LDA
PLDA (https://code.google.com/p/plda/)
• C++ implementation based on MPI
• Extremely fast and easy to distribute to many processors / networked computers
• Straightforward text interface
  ◦ The first pass produces only topics; you need to run a second executable to get document distributions
  ◦ Special characters (including some punctuation) seem to break the text interface; use only integer counts and plain text characters for safety; YMMV

Other speedups
• Bayesian Nonparametrics: Fast updates, but setting hyperparameters isn’t as intuitive
• Locality-Sensitive Hashing: can reduce the cost of sampling to less than 1 clock cycle, reducing it to less than the cost of bookkeeping
