
Topic Modelling workshop @ PyCon UK 2019

Title: What are they talking about? Mining topics in documents with topic modelling and Python

Abstract:
This tutorial tackles the problem of analysing large data sets of unstructured textual data, with the aim of identifying and understanding topics of interest and their related keywords.

Topic modelling is a technique that provides a bird's-eye view of a large collection of text documents. The purpose is to identify abstract topics and capture hidden semantic structures. Topic modelling techniques can be used in exploratory analysis to better understand a dataset's semantics, even in the absence of explicit labels.

In this tutorial, we'll walk through the whole pipeline of pre-processing textual data, applying topic modelling techniques, and evaluating the output. The focus will be on classic approaches like Latent Dirichlet Allocation (LDA), with practical examples in Python using the Gensim library.

The tutorial is tailored to beginners in Natural Language Processing (NLP) and to anyone interested in learning more about NLP tools and techniques.

By attending this tutorial, participants will learn:
- how to run an end-to-end NLP pipeline on the problem of topic mining
- how to capture semantic structures in text with topic modelling
- how to assess the output of topic modelling techniques applied to textual data

If you're planning to attend the tutorial, please download the material beforehand: https://github.com/bonzanini/topic-modelling

Marco Bonzanini

September 15, 2019



Transcript

  1. Nice to meet you
     • Data Science consultant: NLP, Machine Learning, Data Engineering
     • Corporate training: Python + Data Science
     • PyData London chairperson
  2. This tutorial
     • Introduction to Topic Modelling
     • Depending on time/interest: happy to discuss broader applications of NLP
     • The audience (tell me about you):
       - new-ish to NLP?
       - new-ish to Python tools for NLP?
  3. Motivation
     Suppose you:
     • have a huge number of (text) documents
     • want to know what they're talking about
     • can't read them all
  4. Topic Modelling
     • Bird's-eye view on the whole corpus (dataset of docs)
     • Unsupervised learning
       pros: no need for labelled data
       cons: how to evaluate the model?
  5. Topic Modelling
     Input:
     - a collection of documents
     - a number of topics K
  6. Topic Modelling
     Output:
     - K topics
     - their word distributions, e.g.:
       movie, actor, soundtrack, director, …
       goal, match, referee, champions, …
       price, invest, market, stock, …
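
     A minimal end-to-end sketch of this input/output contract with Gensim (the toy documents and variable names below are illustrative assumptions, not taken from the slides):

     from gensim.corpora import Dictionary
     from gensim.models import LdaModel

     # Input: a collection of documents (toy examples) and a number of topics K
     documents = [
         "the actor praised the director and the soundtrack of the movie",
         "the referee awarded a penalty and the champions scored a late goal",
         "investors watched the stock price climb as the market rallied",
     ]
     tokenised = [doc.lower().split() for doc in documents]  # naive tokenisation

     dictionary = Dictionary(tokenised)                       # word <-> id mapping
     corpus = [dictionary.doc2bow(doc) for doc in tokenised]  # bag-of-words vectors

     K = 3
     lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=K, random_state=42)

     # Output: K topics, each described by its word distribution
     for topic_id, words in lda.print_topics(num_words=5):
         print(topic_id, words)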
  7. Distributional Hypothesis
     • "You shall know a word by the company it keeps" — J. R. Firth, 1957
     • "Words that occur in similar contexts tend to have similar meanings" — Z. Harris, 1954
     • Context approximates Meaning
  8. Term-document matrix

             Word 1   Word 2   Word N
     Doc 1      1        7        2
     Doc 2      3        0        5
     Doc N      0        4        2
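
     The same kind of matrix can be built programmatically; a small sketch with Gensim (hypothetical toy documents, so the counts will differ from the table above):

     from gensim.corpora import Dictionary
     from gensim.matutils import corpus2dense

     docs = [
         "the cat sat on the mat".split(),
         "the dog chased the cat".split(),
         "the dog sat on the log".split(),
     ]

     dictionary = Dictionary(docs)
     bow = [dictionary.doc2bow(d) for d in docs]

     # corpus2dense returns a (num_terms x num_docs) matrix of raw term counts
     matrix = corpus2dense(bow, num_terms=len(dictionary), num_docs=len(docs))
     for word, word_id in sorted(dictionary.token2id.items()):
         print(f"{word:>8}", matrix[word_id].astype(int))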
  9. Latent Dirichlet Allocation
     • Commonly used topic modelling approach
     • Key idea:
       each document is a distribution over topics
       each topic is a distribution over words
  10. Latent Dirichlet Allocation
      • "Latent" as in hidden: only the words are visible, the other variables are hidden
      • "Dirichlet Allocation": topics are assumed to follow a specific probability distribution (a Dirichlet prior)
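
      Both distributions can be inspected on a fitted model; a sketch assuming the lda, corpus and dictionary objects from the earlier Gensim example:

      # Each document is a distribution over topics
      print(lda.get_document_topics(corpus[0]))   # e.g. [(0, 0.83), (2, 0.12), ...]

      # Each topic is a distribution over words
      print(lda.show_topic(0, topn=5))            # e.g. [('movie', 0.061), ('actor', 0.054), ...]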
  11. Topic Model Evaluation
      • How good is my topic model?
        "Unsupervised learning"… is there a correct answer?
      • Extrinsic metrics: what's the task?
      • Intrinsic metrics: e.g. topic coherence
      • More interesting:
        - how useful is my topic model?
        - data visualisation can help to get some insights
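
      The slide doesn't name a visualisation tool; one commonly used option with Gensim models is pyLDAvis, sketched here as an assumption (reusing the lda, corpus and dictionary objects from the earlier example):

      import pyLDAvis
      import pyLDAvis.gensim_models as gensimvis   # in older releases the module is pyLDAvis.gensim

      # Interactive view of topics, their top words and inter-topic distances
      vis = gensimvis.prepare(lda, corpus, dictionary)
      pyLDAvis.save_html(vis, "lda_vis.html")      # open the HTML file in a browser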
  12. Topic Coherence
      • Gives a score of topic quality
      • Related to Information Theory (Pointwise Mutual Information)
      • Used to find the best number of topics for a corpus
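
      A sketch of using Gensim's CoherenceModel to compare candidate values of K (reusing tokenised, corpus and dictionary from the earlier example; 'c_v' is one common measure, and PMI-based variants such as 'c_uci' and 'c_npmi' are also available):

      from gensim.models import LdaModel
      from gensim.models.coherencemodel import CoherenceModel

      for k in (2, 3, 4, 5):
          lda_k = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=42)
          cm = CoherenceModel(model=lda_k, texts=tokenised,
                              dictionary=dictionary, coherence='c_v')
          print(k, cm.get_coherence())
      # Pick the K with the highest (or plateauing) coherence score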