Topic Modelling workshop @ PyCon UK 2019

Title: What are they talking about? Mining topics in documents with topic modelling and Python

This tutorial tackles the problem of analysing large data sets of unstructured textual data, with the aim of identifying and understanding topics of interest and their related keywords.

Topic modelling is a technique that provides a bird's-eye view of a large collection of text documents. The purpose is to identify abstract topics and capture hidden semantic structures. Topic modelling techniques can be used in exploratory analysis, to better understand the semantics of a corpus even in the absence of explicit labels.

In this tutorial, we'll walk through the whole pipeline of pre-processing textual data, applying topic modelling techniques, and evaluating the output. The focus will be on classic approaches like Latent Dirichlet Allocation (LDA), with practical examples in Python using the library Gensim.

The tutorial is tailored to beginner users of Natural Language Processing (NLP) tools and people who are interested in knowing more about NLP tools and techniques.

By attending this tutorial, participants will learn:
- how to run an end-to-end NLP pipeline on the problem of topic mining
- how to capture semantic structures in text with topic modelling
- how to assess the output of topic modelling techniques applied to textual data

If you're planning to attend the tutorial, please download the material beforehand: https://github.com/bonzanini/topic-modelling

Marco Bonzanini

September 15, 2019


  1. Topic Modelling

    (and Natural Language Processing)

    PyCon UK 2019

  2. Nice to meet you
    • Data Science consultant: NLP, Machine Learning, Data Engineering
    • Corporate training: Python + Data Science
    • PyData London chairperson

  3. This tutorial
    • Introduction to Topic Modelling
    • Depending on time/interest: happy to discuss broader applications of NLP
    • The audience (tell me about you):
      - new-ish to NLP?
      - new-ish to Python tools for NLP?

  4. Motivation
    Suppose you:
    • have a huge number of (text) documents
    • want to know what they’re talking about
    • can’t read them all

  5. Topic Modelling
    • Bird’s-eye view on the whole corpus (dataset of docs)
    • Unsupervised learning
      - pros: no need for labelled data
      - cons: how to evaluate the model?

  6. Topic Modelling

    Input:
    - a collection of documents
    - a number of topics K

  7. Topic Modelling

    Output:
    - K topics
    - their word distributions, e.g.
      movie, actor, director, …
      goal, match, champions, …
      price, invest, stock, …

  8. Distributional Hypothesis
    • “You shall know a word by the company it keeps” (J. R. Firth, 1957)
    • “Words that occur in similar contexts tend to have similar meaning” (Z. Harris, 1954)
    • Context approximates Meaning

  9. Term-document matrix
             Word 1   Word 2   Word N
    Doc 1      1        7        2
    Doc 2      3        0        5
    Doc N      0        4        2
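
A matrix like the one above can be built by counting word occurrences per document. The following is a minimal sketch using only the Python standard library; the toy documents and the whitespace tokenisation are assumptions made for illustration, not material from the workshop.

```python
from collections import Counter

# Toy corpus (hypothetical documents, for illustration only)
docs = [
    "the actor praised the director",
    "the match ended with a late goal",
    "the director cast the actor again",
]

# Naive tokenisation: lowercase, then split on whitespace
tokenised = [doc.lower().split() for doc in docs]

# Vocabulary: sorted list of the unique words in the corpus
vocab = sorted({word for tokens in tokenised for word in tokens})

# One row per document, one column per vocabulary word
matrix = []
for tokens in tokenised:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocab])

print(vocab)
for i, row in enumerate(matrix, start=1):
    print(f"Doc {i}", row)
```

Each row sums to the length of its document, and most cells are zero. That sparsity is why libraries usually store such matrices in a sparse format.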

  10. Latent Dirichlet Allocation
    • Commonly used topic modelling approach
    • Key idea:
      - each document is a distribution over topics
      - each topic is a distribution over words
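
Read together, the two statements say that the probability of seeing word w in document d is a mixture over the K topics. The symbols below (theta for a document's topic proportions, phi for a topic's word probabilities, z for the topic assignment) follow a common LDA convention and are an assumption, since the slides do not introduce any notation.

```latex
p(w \mid d) \;=\; \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d)
           \;=\; \sum_{k=1}^{K} \phi_{k,w}\, \theta_{d,k}
```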

  11. Latent Dirichlet Allocation
    • “Latent” as in hidden: only words are visible, other variables are hidden
    • “Dirichlet Allocation”: topics are assumed to be distributed with a specific probability (Dirichlet prior)
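
One way to see the role of the hidden variables is to simulate the generative story that LDA assumes. The sketch below is not a training procedure and is not Gensim code. It only draws a toy document from Dirichlet-distributed topic and word proportions, using the standard library. The vocabulary, the number of topics and the prior value 0.5 are all hypothetical choices.

```python
import random

random.seed(0)  # fixed seed so repeated runs give the same draws


def sample_dirichlet(alpha):
    """Draw one sample from a Dirichlet distribution via normalised Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


vocab = ["movie", "actor", "goal", "match", "price", "stock"]
K = 3  # number of topics (hypothetical)

# Each topic is a distribution over words
topics = [sample_dirichlet([0.5] * len(vocab)) for _ in range(K)]

# Each document is a distribution over topics
doc_topics = sample_dirichlet([0.5] * K)


def generate_document(n_words):
    """Generate words by first picking a hidden topic, then a word from that topic."""
    words = []
    for _ in range(n_words):
        z = random.choices(range(K), weights=doc_topics)[0]   # hidden topic
        w = random.choices(vocab, weights=topics[z])[0]       # visible word
        words.append(w)
    return words


doc = generate_document(10)
print(doc)
```

Only `doc` would be observed in practice. Fitting an LDA model means inferring plausible values for `topics` and `doc_topics` from many such documents.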

  12. Topic Model Evaluation
    • How good is my topic model? “Unsupervised learning”… is there a correct answer?
    • Extrinsic metrics: what’s the task?
    • Intrinsic metrics: e.g. topic coherence
    • More interesting:
      - how useful is my topic model?
      - data visualisation can help to get some insights

  13. Topic Coherence
    • Gives a score for the quality of a topic
    • Related to Information Theory (Pointwise Mutual Information)
    • Can be used to choose the number of topics for a corpus
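
To make the link with Pointwise Mutual Information concrete, here is a simplified sketch that scores a topic by the average PMI of its top-word pairs, estimated from document co-occurrence. It is an illustrative assumption about how such a score can be computed, not the exact measure used in the workshop or in any particular library. The toy documents and the smoothing constant `eps` are hypothetical.

```python
import math
from itertools import combinations

# Toy corpus: each document reduced to its set of words (hypothetical data)
docs = [
    {"movie", "actor", "director"},
    {"movie", "actor", "goal"},
    {"goal", "match", "champions"},
    {"match", "goal", "actor"},
]


def doc_freq(*words):
    """Number of documents that contain all the given words."""
    return sum(1 for d in docs if all(w in d for w in words))


def pmi(w1, w2, eps=1e-12):
    """PMI of two words; eps avoids log(0) when they never co-occur.

    Assumes both words occur in at least one document.
    """
    n = len(docs)
    p1 = doc_freq(w1) / n
    p2 = doc_freq(w2) / n
    p12 = doc_freq(w1, w2) / n
    return math.log((p12 + eps) / (p1 * p2))


def coherence(top_words):
    """Average PMI over all pairs of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    return sum(pmi(a, b) for a, b in pairs) / len(pairs)


print(coherence(["movie", "actor"]))   # words that co-occur
print(coherence(["movie", "match"]))   # words that never co-occur
```

A topic whose top words tend to appear in the same documents gets a higher score. Comparing the average score across models trained with different values of K is one way to pick the number of topics.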

  14. Demo
