Mining topics in documents with topic modelling and Python @ London Python meetup

Mining Topics in Documents with Topic Modelling and Python
@MarcoBonzanini
London Python meetup - September 2019
Demo on: github.com/bonzanini/topic-modelling

• Sept 2016: Intro to NLP
• Sept 2017: Intro to Word Embeddings
• Sept 2018: Intro to NLG
• Sept 2019: Intro to Topic Modelling
• Sept 2020: Intro to … ???

Nice to meet you
• Data Science consultant: NLP, Machine Learning, Data Engineering
• Corporate training: Python + Data Science
• PyData London chairperson

PyData London Conference 15-17 May 2020 @PyDataLondon

This presentation
• Introduction to Topic Modelling
• Depending on time/interest: happy to discuss broader applications of NLP
• The audience (tell me about you):
  - new-ish to NLP?
  - new-ish to Python tools for NLP?
github.com/bonzanini/topic-modelling

Motivation
Suppose you:
• have a huge number of (text) documents
• want to know what they’re talking about
• can’t read them all

Topic Modelling
• Bird’s-eye view on the whole corpus (dataset of docs)
• Unsupervised learning
  - pros: no need for labelled data
  - cons: how to evaluate the model?

Topic Modelling
Input:
- a collection of documents
- a number of topics K

Topic Modelling
Output:
- K topics
- their word distributions
    movie, actor, soundtrack, director, …
    goal, match, referee, champions, …
    price, invest, market, stock, …

Distributional Hypothesis
• “You shall know a word by the company it keeps” (J. R. Firth, 1957)
• “Words that occur in similar context, tend to have similar meaning” (Z. Harris, 1954)
• Context approximates Meaning
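To make "context approximates meaning" concrete, here is a minimal sketch that represents each word by the counts of its neighbouring words and compares words with cosine similarity. The function names, the window size and the toy sentences are illustrative assumptions, not code from the talk's demo.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy corpus (hypothetical): "coffee" and "tea" share the same neighbours.
sentences = [
    "i drink hot coffee every morning",
    "i drink hot tea every morning",
    "the referee stopped the match",
]
vectors = context_vectors(sentences)
similar = cosine(vectors["coffee"], vectors["tea"])
different = cosine(vectors["coffee"], vectors["referee"])
print(f"coffee vs tea: {similar:.2f}, coffee vs referee: {different:.2f}")
```

Words that share neighbours end up with similar vectors, which is the intuition topic models build on at the document level.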

Term-document matrix

         Word 1   Word 2   …   Word N
Doc 1       1        7     …      2
Doc 2       3        0     …      5
…
Doc N       0        4     …      2
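A matrix like the one above can be built with nothing more than word counts. The sketch below assumes naive whitespace tokenisation and a made-up three-document corpus; a real pipeline would normally also remove stop words and normalise tokens.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in doc i."""
    tokenised = [doc.lower().split() for doc in docs]
    vocabulary = sorted({word for tokens in tokenised for word in tokens})
    matrix = []
    for tokens in tokenised:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Hypothetical toy corpus, one string per document.
docs = [
    "the match had a late goal",
    "the director praised the actor",
    "stock market price and market risk",
]
vocabulary, matrix = term_document_matrix(docs)
for doc_id, row in enumerate(matrix, start=1):
    # Show only the non-zero counts for readability.
    print(f"Doc {doc_id}:", {w: c for w, c in zip(vocabulary, row) if c})
```

Rows are documents and columns are words, matching the layout on the slide.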

Latent Dirichlet Allocation
• Commonly used topic modelling approach
• Key idea:
  - each document is a distribution over topics
  - each topic is a distribution over words
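The key idea can be read as a recipe for generating a document: pick a topic mixture for the document, then for each word pick a topic from that mixture and a word from that topic. The sketch below follows that recipe with hand-written topics; the vocabulary, the topic probabilities, the `alpha` values and the function names are all illustrative assumptions.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from a Dirichlet via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, vocabulary, alpha, length, rng):
    """Generate `length` words: document -> topic mixture -> topic -> word."""
    theta = sample_dirichlet(alpha, rng)  # this document's distribution over topics
    words = []
    for _ in range(length):
        topic = rng.choices(range(len(topics)), weights=theta)[0]
        word = rng.choices(vocabulary, weights=topics[topic])[0]
        words.append(word)
    return theta, words


vocabulary = ["movie", "actor", "goal", "match", "price", "stock"]
# Each topic is a distribution over the vocabulary (each row sums to 1).
topics = [
    [0.5, 0.5, 0.0, 0.0, 0.0, 0.0],  # cinema
    [0.0, 0.0, 0.5, 0.5, 0.0, 0.0],  # football
    [0.0, 0.0, 0.0, 0.0, 0.5, 0.5],  # finance
]
rng = random.Random(42)
theta, words = generate_document(topics, vocabulary, alpha=[0.5] * 3, length=8, rng=rng)
print("topic mixture:", [round(p, 2) for p in theta])
print("document:", " ".join(words))
```

Training a topic model runs this story in reverse: only the words are observed, and the mixtures and topics have to be inferred.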

Latent Dirichlet Allocation
• “Latent” as in hidden: only words are visible, other variables are hidden
• “Dirichlet Allocation”: each document’s topic mixture is assumed to be drawn from a specific probability distribution (a Dirichlet prior)
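One common way to estimate the hidden variables from the visible words is collapsed Gibbs sampling: repeatedly re-sample the topic of each word given all the other assignments. The following is a small, unoptimised sketch of that technique in pure Python, not the implementation used in the demo; the function name, the hyperparameter values (`alpha`, `beta`) and the toy corpus are assumptions.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenised docs (lists of words)."""
    rng = random.Random(seed)
    vocabulary = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocabulary)}
    vocab_size = len(vocabulary)

    doc_topic = [[0] * num_topics for _ in docs]                 # words in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # times word v is assigned to topic k
    topic_total = [0] * num_topics                               # words assigned to topic k
    assignments = []

    # Start from a random topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        current = []
        for w in doc:
            k = rng.randrange(num_topics)
            current.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(current)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Weight of each topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed estimates of the two hidden sets of distributions.
    topics = [
        {vocabulary[v]: (topic_word[k][v] + beta) / (topic_total[k] + vocab_size * beta)
         for v in range(vocab_size)}
        for k in range(num_topics)
    ]
    doc_topics = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha) for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return topics, doc_topics


# Hypothetical toy corpus: two cinema-like and two football-like documents.
docs = [
    "movie actor director soundtrack movie actor".split(),
    "goal match referee champions goal match".split(),
    "movie director actor soundtrack".split(),
    "match goal champions referee".split(),
]
topics, doc_topics = lda_gibbs(docs, num_topics=2)
for k, dist in enumerate(topics):
    top_words = sorted(dist, key=dist.get, reverse=True)[:4]
    print(f"Topic {k}:", ", ".join(top_words))
```

On real corpora a library implementation is the practical choice; the point here is only to show that the inputs are the documents plus K, and the outputs are the two sets of distributions.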

Topic Model Evaluation
• How good is my topic model? “Unsupervised learning”… is there a correct answer?
• Extrinsic metrics: what’s the task?
• Intrinsic metrics: e.g. topic coherence
• More interesting:
  - how useful is my topic model?
  - data visualisation can help to get some insights

Topic Coherence
• It gives a score of the topic quality
• Relationship with Information Theory (Pointwise Mutual Information)
• Used to find the best number of topics for a corpus
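As a rough illustration of the link with Pointwise Mutual Information, the sketch below scores a topic by averaging the PMI of every pair of its top words, with probabilities estimated from document-level co-occurrence. This is a simplified variant written for this example: established coherence measures differ in how they count co-occurrences and smooth the estimates, and the function name, the `epsilon` value and the toy corpus are assumptions.

```python
import math
from itertools import combinations


def pmi_coherence(top_words, docs, epsilon=1e-12):
    """Average pairwise PMI of a topic's top words, using document co-occurrence."""
    doc_sets = [set(doc) for doc in docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all the given words.
        return sum(all(w in s for w in words) for s in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2 = prob(w1), prob(w2)
        if p1 == 0 or p2 == 0:
            continue  # skip words that never occur in the corpus
        # epsilon avoids log(0) when the two words never co-occur.
        scores.append(math.log((prob(w1, w2) + epsilon) / (p1 * p2)))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical tokenised corpus.
docs = [
    "goal match referee".split(),
    "goal match champions".split(),
    "price market stock".split(),
    "movie actor director".split(),
]
coherent = pmi_coherence(["goal", "match", "referee"], docs)
mixed = pmi_coherence(["goal", "price", "actor"], docs)
print(f"coherent topic: {coherent:.2f}, mixed topic: {mixed:.2f}")
```

A topic whose top words tend to appear in the same documents scores higher than one that mixes unrelated words. Comparing the average score across models trained with different K is one way to choose the number of topics.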

Demo

Conclusions
• Topic Modelling gives you a bird’s-eye view on a collection of documents
• It doesn’t give you:
  - a “name” for each topic (you have to find out)
  - the exact number of topics (you have to find out)
• Excellent tool for exploratory analysis and knowledge discovery

THANK YOU @MarcoBonzanini @PyDataLondon

Marco Bonzanini
