
Mining topics in documents with topic modelling and Python @ London Python meetup


Introduction to topic modelling in Python - presentation given at the London Python meetup in September 2019 (https://www.meetup.com/LondonPython/events/264921863/)

Title: What are they talking about? Mining topics in documents with topic modelling and Python

This presentation is a practical introduction to topic modelling in Python. It tackles the problem of analysing large text data sets in order to identify topics of interest and their related keywords.


Marco Bonzanini

September 26, 2019



  1. Mining Topics in Documents with Topic Modelling and Python
    London Python meetup - September 2019
    Demo on: github.com/bonzanini/topic-modelling


  2. • Sept 2016: Intro to NLP
    • Sept 2017: Intro to Word Embeddings
    • Sept 2018: Intro to NLG
    • Sept 2019: Intro to Topic Modelling
    • Sept 2020: Intro to … ???


  3. Nice to meet you
    • Data Science consultant: NLP, Machine Learning, Data Engineering
    • Corporate training: Python + Data Science
    • PyData London chairperson


  4. PyData London Conference
    15-17 May 2020


  5. This presentation
    • Introduction to Topic Modelling
    • Depending on time/interest: happy to discuss broader applications of NLP
    • The audience (tell me about you):
      - new-ish to NLP?
      - new-ish to Python tools for NLP?


  6. Motivation
    Suppose you:
    • have a huge number of (text) documents
    • want to know what they’re talking about
    • can’t read them all


  7. Topic Modelling
    • Bird’s-eye view on the whole corpus (dataset of docs)
    • Unsupervised learning
      - pros: no need for labelled data
      - cons: how to evaluate the model?


  8. Topic Modelling

    - a collection of documents
    - a number of topics K


  9. Topic Modelling

    - K topics
    - their word distributions
    movie, actor, director, …
    goal, match, champions, …
    price, invest, stock, …


  10. Distributional Hypothesis
    • “You shall know a word by the company it keeps” (J. R. Firth, 1957)
    • “Words that occur in similar contexts tend to have similar meanings” (Z. Harris, 1954)
    • Context approximates Meaning
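
As a rough illustration of "context approximates meaning", the sketch below counts the words that appear near a target word and compares the resulting context vectors with cosine similarity. The sentences, the window size and the helper names (`context_vector`, `cosine`) are all invented for this example and are not taken from the talk or its demo.

```python
import math
from collections import Counter

# Hypothetical toy sentences, made up for illustration only
sentences = [
    "the striker scored a goal",
    "the forward scored a goal",
    "the bank raised the price",
]

def context_vector(target, sentences, window=2):
    """Count the words found within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token == target:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + window + 1]
                counts.update(left + right)
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

striker = context_vector("striker", sentences)
forward = context_vector("forward", sentences)
bank = context_vector("bank", sentences)

# Words used in the same contexts end up with more similar vectors
print(cosine(striker, forward), cosine(striker, bank))
```

In this toy data "striker" and "forward" share exactly the same neighbours, so they score higher than "striker" and "bank".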


  11. Term-document matrix
             Word 1   Word 2   Word N
    Doc 1      1        7        2
    Doc 2      3        0        5
    Doc N      0        4        2
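
A matrix like the one on this slide can be built with nothing more than word counts. The following is a minimal sketch using only the standard library; the toy documents, the whitespace tokenisation and the function name `term_document_matrix` are assumptions made for illustration, and real pipelines usually add steps such as stop-word removal.

```python
from collections import Counter

# Hypothetical toy corpus, made up for illustration only
docs = [
    "the actor praised the director of the movie",
    "the match ended with a late goal",
    "the stock price fell as investors sold",
]

def term_document_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts word j in document i."""
    tokenised = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenised for word in tokens})
    matrix = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Counter returns 0 for words that do not occur in this document
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

vocabulary, matrix = term_document_matrix(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is a document and each column a word, which is the layout shown on the slide.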


  12. Latent Dirichlet Allocation
    • Commonly used topic modelling approach
    • Key idea:
      - each document is a distribution over topics
      - each topic is a distribution over words


  13. Latent Dirichlet Allocation
    • “Latent” as in hidden: only words are visible, other variables are hidden
    • “Dirichlet Allocation”: topics are assumed to be distributed with a specific probability (Dirichlet prior)
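
One way to make the "hidden variables" concrete is to run the LDA story forwards: draw a topic mixture for a document from a Dirichlet prior, then draw each word by first picking a hidden topic and then a visible word from that topic. The sketch below does only this generative step and does not fit a model. The vocabulary, the hand-written topic distributions and the `alpha` values are assumptions for illustration; in full LDA the topic-word distributions are themselves drawn from a Dirichlet prior and then inferred from data.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, vocabulary, alpha, length, rng):
    """Generate one document following the LDA generative story."""
    # Hidden: the document's own mixture over topics
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # Hidden: the topic assigned to this word position
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # Visible: the word, drawn from the chosen topic's word distribution
        words.append(rng.choices(vocabulary, weights=topics[z])[0])
    return theta, words

# Hypothetical vocabulary and topics, made up for illustration only
vocabulary = ["movie", "actor", "goal", "match", "price", "stock"]
topics = [
    [0.45, 0.45, 0.025, 0.025, 0.025, 0.025],  # "cinema"-like topic
    [0.025, 0.025, 0.45, 0.45, 0.025, 0.025],  # "football"-like topic
    [0.025, 0.025, 0.025, 0.025, 0.45, 0.45],  # "finance"-like topic
]

rng = random.Random(42)
theta, words = generate_document(
    topics, vocabulary, alpha=[0.5, 0.5, 0.5], length=10, rng=rng
)
print(theta)
print(words)
```

Fitting LDA works in the opposite direction: only `words` is observed, and the model estimates `theta` and `topics`.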


  14. Topic Model Evaluation
    • How good is my topic model? “Unsupervised learning”… is there a correct answer?
    • Extrinsic metrics: what’s the task?
    • Intrinsic metrics: e.g. topic coherence
    • More interesting:
      - how useful is my topic model?
      - data visualisation can help to get some insights


  15. Topic Coherence
    • It gives a score of the topic quality
    • Relationship with Information Theory (Pointwise Mutual Information)
    • Used to find the best number of topics for a corpus
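
To give a feel for the link with Pointwise Mutual Information, the sketch below scores a topic by averaging the PMI of every pair of its top words, with probabilities estimated from document-level co-occurrence. This is a deliberately simplified assumption: the documents, the `eps` smoothing and the function names are invented here, and coherence measures used in practice typically differ in detail (sliding windows, normalisation).

```python
import math
from itertools import combinations

# Hypothetical toy corpus; each document is a set of tokens
documents = [
    {"movie", "actor", "director"},
    {"movie", "actor", "film"},
    {"goal", "match", "team"},
    {"goal", "match", "champions"},
]

def pmi(word_a, word_b, documents, eps=1e-12):
    """PMI = log( P(a, b) / (P(a) * P(b)) ), estimated per document."""
    n = len(documents)
    p_a = sum(word_a in doc for doc in documents) / n
    p_b = sum(word_b in doc for doc in documents) / n
    p_ab = sum(word_a in doc and word_b in doc for doc in documents) / n
    # eps avoids log(0) when the two words never co-occur
    return math.log((p_ab + eps) / (p_a * p_b + eps))

def coherence(top_words, documents):
    """Average PMI over all pairs of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    return sum(pmi(a, b, documents) for a, b in pairs) / len(pairs)

coherent_topic = ["movie", "actor"]
mixed_topic = ["movie", "goal"]
print(coherence(coherent_topic, documents))
print(coherence(mixed_topic, documents))
```

Words that tend to appear together score higher than words that never co-occur, so one can train models with different values of K and compare their average coherence.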


  16. Demo


  17. Conclusions
    • Topic Modelling gives you a bird’s-eye view on a collection of documents
    • It doesn’t give you:
      - a “name” for each topic (you have to find out)
      - the exact number of topics (you have to find out)
    • Excellent tool for exploratory analysis and knowledge discovery


