Mining Topics in Documents
with Topic Modelling and Python
@MarcoBonzanini
London Python meetup - September 2019
Demo on: github.com/bonzanini/topic-modelling
Slide 2
• Sept 2016: Intro to NLP
• Sept 2017: Intro to Word Embeddings
• Sept 2018: Intro to NLG
• Sept 2019: Intro to Topic Modelling
• Sept 2020: Intro to … ???
Slide 3
Nice to meet you
• Data Science consultant:
NLP, Machine Learning,
Data Engineering
• Corporate training:
Python + Data Science
• PyData London chairperson
Slide 4
PyData London Conference
15-17 May 2020
@PyDataLondon
Slide 5
This presentation
• Introduction to Topic Modelling
• Depending on time/interest:
Happy to discuss broader applications of NLP
• The audience (tell me about you):
- new-ish to NLP?
- new-ish to Python tools for NLP?
github.com/bonzanini/topic-modelling
Slide 6
Motivation
Suppose you:
• have a huge number of (text) documents
• want to know what they’re talking about
• can’t read them all
Slide 7
Topic Modelling
• Bird’s-eye view of the whole corpus (dataset of docs)
• Unsupervised learning
pros: no need for labelled data
cons: how to evaluate the model?
Slide 8
Topic Modelling
Input:
- a collection of documents
- a number of topics K
Slide 9
Topic Modelling
Output:
- K topics
- their word distributions
Example topics:
- movie, actor, soundtrack, director, …
- goal, match, referee, champions, …
- price, invest, market, stock, …
Slide 10
Distributional Hypothesis
• “You shall know a word by the company it keeps”
— J. R. Firth, 1957
• “Words that occur in similar contexts tend to have
similar meanings”
— Z. Harris, 1954
• Context approximates Meaning
Slide 11
Term-document matrix
        Word 1   Word 2   …   Word N
Doc 1     1        7            2
Doc 2     3        0            5
…
Doc N     0        4            2
Slide 12
Latent Dirichlet Allocation
• Commonly used topic modelling approach
• Key idea:
each document is a distribution over topics
each topic is a distribution over words
Slide 13
Latent Dirichlet Allocation
• “Latent” as in hidden:
only words are visible, other variables are hidden
• “Dirichlet Allocation”:
topic mixtures are assumed to follow a specific
probability distribution (a Dirichlet prior)
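A quick numpy illustration of the Dirichlet prior: each sample is a valid probability vector, so it can play the role of one document’s topic mixture (the alpha values here are arbitrary):

```python
# Draw one sample from a 3-dimensional Dirichlet distribution.
# The result is non-negative and sums to 1 - a topic mixture.
import numpy as np

rng = np.random.default_rng(42)
alpha = [0.1, 0.1, 0.1]  # small alpha -> sparse mixtures, few dominant topics
theta = rng.dirichlet(alpha)
print(theta, theta.sum())
```

With larger alpha values the samples spread mass more evenly across topics; small values concentrate each document on a handful of topics.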
Slide 14
Topic Model Evaluation
• How good is my topic model?
“Unsupervised learning”… is there a correct answer?
• Extrinsic metrics: what’s the task?
• Intrinsic metrics: e.g. topic coherence
• More interesting:
- how useful is my topic model?
- data visualisation can help to get some insights
Slide 15
Topic Coherence
• Gives a numeric score for the quality of a topic
• Relationship with Information Theory
(Pointwise Mutual Information)
• Used to find the best number of topics for a corpus
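The PMI connection can be sketched by hand (a toy corpus and a simplified score, not any particular library’s exact coherence formula):

```python
# Pointwise Mutual Information between two words over a toy corpus.
# Real coherence metrics aggregate scores like this over a topic's
# top words; this shows only the core idea.
import math

docs = [
    {"movie", "actor", "director"},
    {"movie", "soundtrack"},
    {"goal", "referee"},
    {"movie", "actor"},
]

def pmi(w1, w2, docs):
    n = len(docs)
    p1 = sum(w1 in d for d in docs) / n
    p2 = sum(w2 in d for d in docs) / n
    p12 = sum(w1 in d and w2 in d for d in docs) / n
    return math.log(p12 / (p1 * p2)) if p12 > 0 else float("-inf")

# "movie" and "actor" co-occur often -> positive PMI
print(pmi("movie", "actor", docs))
```

Words that belong to a coherent topic tend to co-occur in the same documents, so they get high pairwise PMI; averaging such scores over a topic’s top words gives a coherence score.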
Slide 16
Demo
Slide 17
Conclusions
• Topic Modelling gives you a bird’s-eye view of a
collection of documents
• It doesn’t give you:
- a “name” for each topic (you have to work that out)
- the exact number of topics (you have to find that out)
• Excellent tool for exploratory analysis and
knowledge discovery