Topic Modelling workshop @ PyCon UK 2019

Title: What are they talking about? Mining topics in documents with topic modelling and Python

Abstract:
This tutorial tackles the problem of analysing large collections of unstructured text, with the aim of identifying and understanding topics of interest and their related keywords.

Topic modelling is a technique that provides a bird's-eye view of a large collection of text documents. The purpose is to identify abstract topics and capture hidden semantic structures. Topic modelling techniques can be used in the exploratory analysis of a document collection, to better understand its semantics even in the absence of explicit labels.

In this tutorial, we'll walk through the whole pipeline of pre-processing textual data, applying topic modelling techniques, and evaluating the output. The focus will be on classic approaches like Latent Dirichlet Allocation (LDA), with practical examples in Python using the library Gensim.

The tutorial is tailored to beginners with Natural Language Processing (NLP) tools, and to anyone interested in learning more about NLP tools and techniques.

By attending this tutorial, participants will learn:
- how to run an end-to-end NLP pipeline on the problem of topic mining
- how to capture semantic structures in text with topic modelling
- how to assess the output of topic modelling techniques applied to textual data

If you're planning to attend the tutorial, please download the material beforehand: https://github.com/bonzanini/topic-modelling

Marco Bonzanini

September 15, 2019


Transcript

  1. Topic Modelling

    (and Natural Language Processing)

    workshop
    @MarcoBonzanini
    PyCon UK 2019
    github.com/bonzanini/topic-modelling


  2. Nice to meet you
    • Data Science consultant:

    NLP, Machine Learning,

    Data Engineering
    • Corporate training:

    Python + Data Science
    • PyData London chairperson

  3. This tutorial
    • Introduction to Topic Modelling
    • Depending on time/interest:

    Happy to discuss broader applications of NLP
    • The audience (tell me about you):

    - new-ish to NLP?

    - new-ish to Python tools for NLP?

  4. Motivation
    Suppose you:
    • have a huge number of (text) documents
    • want to know what they’re talking about
    • can’t read them all

  5. Topic Modelling
    • Bird’s-eye view on the whole corpus (dataset of docs)
    • Unsupervised learning

    pros: no need for labelled data

    cons: how to evaluate the model?

  6. Topic Modelling
    Input:

    - a collection of documents
    - a number of topics K

  7. Topic Modelling
    Output:

    - K topics
    - their word distributions
    Example topics (as their top words):

    - movie, actor, soundtrack, director, …
    - goal, match, referee, champions, …
    - price, invest, market, stock, …

  8. Distributional Hypothesis
    • “You shall know a word by the company it keeps”

    — J. R. Firth, 1957
    • “Words that occur in similar contexts tend to have
    similar meaning”

    — Z. Harris, 1954
    • Context approximates Meaning

  9. Term-document matrix
              Word 1   Word 2   Word N
    Doc 1        1        7        2
    Doc 2        3        0        5
    Doc N        0        4        2
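The matrix above can be built in a few lines of plain Python. A minimal sketch, with a toy tokenised corpus that is purely illustrative:

```python
from collections import Counter

# Toy corpus: each document is a list of tokens (illustrative only)
docs = [
    ["movie", "actor", "movie", "director"],
    ["goal", "match", "referee"],
    ["market", "stock", "movie", "market"],
]

# Vocabulary: every distinct word, in a fixed column order
vocab = sorted({word for doc in docs for word in doc})

# Term-document matrix: one row per document, one column per word,
# each cell holding the raw count of that word in that document
matrix = [[Counter(doc)[word] for word in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

Real pipelines would delegate this to a vectoriser (e.g. Gensim's `doc2bow`), but the underlying structure is exactly this counts table.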

  10. Latent Dirichlet Allocation
    • Commonly used topic modelling approach
    • Key idea:

    each document is a distribution over topics

    each topic is a distribution over words

  11. Latent Dirichlet Allocation
    • “Latent” as in hidden:

    only words are visible, other variables are hidden
    • “Dirichlet Allocation”:

    the topic proportions are assumed to follow a
    specific probability distribution (a Dirichlet prior)

  12. Topic Model Evaluation
    • How good is my topic model?

    “Unsupervised learning”… is there a correct answer?
    • Extrinsic metrics: what’s the task?
    • Intrinsic metrics: e.g. topic coherence
    • More interesting:

    - how useful is my topic model?

    - data visualisation can help to get some insights

  13. Topic Coherence
    • Gives a score for the quality of each topic
    • Relationship with Information Theory

    (Pointwise Mutual Information)
    • Used to find the best number of topics for a corpus

  14. Demo
