Mining topics in documents with topic modelling and Python @ London Python meetup

Introduction to topic modelling in Python - presentation given at the London Python meetup in September 2019 (https://www.meetup.com/LondonPython/events/264921863/)

Title: What are they talking about? Mining topics in documents with topic modelling and Python

Abstract:
This presentation is a practical introduction to topic modelling in Python. It tackles the problem of analysing large text data sets to identify topics of interest and their related keywords.

Demo:
https://github.com/bonzanini/topic-modelling

Marco Bonzanini

September 26, 2019

Transcript

  1. Mining Topics in Documents with Topic Modelling and Python
    @MarcoBonzanini
    London Python meetup - September 2019
    Demo on: github.com/bonzanini/topic-modelling

  2. • Sept 2016: Intro to NLP
    • Sept 2017: Intro to Word Embeddings
    • Sept 2018: Intro to NLG
    • Sept 2019: Intro to Topic Modelling
    • Sept 2020: Intro to … ???

  3. Nice to meet you
    • Data Science consultant: NLP, Machine Learning, Data Engineering
    • Corporate training: Python + Data Science
    • PyData London chairperson

  4. PyData London Conference
    15-17 May 2020
    @PyDataLondon

  5. This presentation
    • Introduction to Topic Modelling
    • Depending on time/interest: happy to discuss broader applications of NLP
    • The audience (tell me about you):
      - new-ish to NLP?
      - new-ish to Python tools for NLP?
    github.com/bonzanini/topic-modelling

  6. Motivation
    Suppose you:
    • have a huge number of (text) documents
    • want to know what they’re talking about
    • can’t read them all

  7. Topic Modelling
    • Bird’s-eye view on the whole corpus (dataset of docs)
    • Unsupervised learning
      - pros: no need for labelled data
      - cons: how to evaluate the model?

  8. Topic Modelling
    Input:

    - a collection of documents
    - a number of topics K

  9. Topic Modelling
    Output:
    - K topics
    - their word distributions
    Example topics:
      movie, actor, soundtrack, director, …
      goal, match, referee, champions, …
      price, invest, market, stock, …

  10. Distributional Hypothesis
    • “You shall know a word by the company it keeps” (J. R. Firth, 1957)
    • “Words that occur in similar context, tend to have similar meaning” (Z. Harris, 1954)
    • Context approximates Meaning

  11. Term-document matrix
              Word 1   Word 2   Word N
    Doc 1        1        7        2
    Doc 2        3        0        5
    Doc N        0        4        2
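
A term-document matrix like the one above can be built with the standard library alone. The following is a minimal sketch, assuming whitespace tokenisation and a three-document toy corpus invented for illustration (neither comes from the talk); the function name `term_document_matrix` is hypothetical.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (vocabulary, matrix), where matrix[i][j] counts word j in doc i."""
    # Naive tokenisation: lowercase and split on whitespace (an assumption).
    tokenised = [doc.lower().split() for doc in docs]
    # One column per distinct word, in alphabetical order.
    vocabulary = sorted({word for tokens in tokenised for word in tokens})
    matrix = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Counter returns 0 for words that do not occur in this document.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Toy corpus, invented for illustration.
docs = [
    "goal match goal referee",
    "movie actor director movie",
    "stock market price stock market",
]

vocabulary, matrix = term_document_matrix(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column one word, so most entries are zero. Real corpora would typically need proper tokenisation, stop-word removal and a sparse representation.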

  12. Latent Dirichlet Allocation
    • Commonly used topic modelling approach
    • Key idea:
      - each document is a distribution over topics
      - each topic is a distribution over words

  13. Latent Dirichlet Allocation
    • “Latent” as in hidden: only words are visible, other variables are hidden
    • “Dirichlet Allocation”: topics are assumed to be distributed with a specific probability (Dirichlet prior)
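
The generative story that LDA assumes can be sketched in a few lines. This is an illustration of the model's assumptions, not of how a topic model is fitted, and not the code from the demo. The function names, the vocabulary and the values of `alpha` and `beta` (the Dirichlet prior parameters) are all assumptions chosen for the example.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) with `size` components."""
    # Normalising independent Gamma(alpha, 1) draws yields a Dirichlet sample.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocabulary, num_topics, num_docs, doc_length,
                    alpha=0.5, beta=0.5, seed=0):
    """Generate documents following the LDA generative process."""
    rng = random.Random(seed)
    # Each topic is a distribution over words.
    topics = [sample_dirichlet(beta, len(vocabulary), rng)
              for _ in range(num_topics)]
    doc_topic_mixes, corpus = [], []
    for _ in range(num_docs):
        # Each document is a distribution over topics.
        theta = sample_dirichlet(alpha, num_topics, rng)
        words = []
        for _ in range(doc_length):
            # Pick a hidden topic for this position, then a word from that topic.
            topic_id = rng.choices(range(num_topics), weights=theta)[0]
            word = rng.choices(vocabulary, weights=topics[topic_id])[0]
            words.append(word)
        doc_topic_mixes.append(theta)
        corpus.append(words)
    return topics, doc_topic_mixes, corpus


# Vocabulary invented for illustration.
vocab = ["movie", "actor", "goal", "match", "price", "stock"]
topics, mixes, corpus = generate_corpus(vocab, num_topics=3, num_docs=4, doc_length=10)
for words in corpus:
    print(" ".join(words))
```

Only `corpus` would be observed in practice. Fitting LDA means inferring the hidden `topics` and `mixes` from the words alone, which is what topic modelling libraries do.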

  14. Topic Model Evaluation
    • How good is my topic model? “Unsupervised learning”… is there a correct answer?
    • Extrinsic metrics: what’s the task?
    • Intrinsic metrics: e.g. topic coherence
    • More interesting:
      - how useful is my topic model?
      - data visualisation can help to get some insights

  15. Topic Coherence
    • It gives a score for the quality of each topic
    • Relationship with Information Theory (Pointwise Mutual Information)
    • Used to find the best number of topics for a corpus
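
To make the link with Pointwise Mutual Information concrete, here is a minimal sketch of one possible PMI-based coherence score: the average PMI over all pairs of a topic's top words, with probabilities estimated from document-level co-occurrence. The talk does not specify which coherence measure is used, so the function name `topic_coherence`, the smoothing constant `eps` and the toy documents are assumptions for illustration.

```python
import math
from itertools import combinations


def topic_coherence(top_words, docs, eps=1e-12):
    """Average pairwise PMI of a topic's top words over a set of documents."""
    doc_sets = [set(doc.lower().split()) for doc in docs]
    n = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all the given words.
        return sum(all(w in s for w in words) for s in doc_sets) / n

    scores = []
    for a, b in combinations(top_words, 2):
        p_a, p_b = prob(a), prob(b)
        if p_a == 0 or p_b == 0:
            continue  # PMI is undefined for words that never occur.
        # PMI(a, b) = log( p(a, b) / (p(a) * p(b)) ); eps avoids log(0).
        scores.append(math.log((prob(a, b) + eps) / (p_a * p_b)))
    return sum(scores) / len(scores) if scores else 0.0


# Toy documents, invented for illustration.
docs = [
    "goal match referee",
    "match referee champions goal",
    "stock market price",
    "market stock invest",
]

coherent = topic_coherence(["goal", "match", "referee"], docs)
mixed = topic_coherence(["goal", "stock", "referee"], docs)
print(coherent, mixed)
```

Words that tend to appear in the same documents score higher than words drawn from unrelated themes. To choose the number of topics, one could fit models for several values of K and compare their average coherence.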

  16. Demo

  17. Conclusions
    • Topic Modelling gives you a bird’s-eye view on a collection of documents
    • It doesn’t give you:
      - a “name” for each topic (you have to find out)
      - the exact number of topics (you have to find out)
    • Excellent tool for exploratory analysis and knowledge discovery

  18. THANK YOU
    @MarcoBonzanini
    @PyDataLondon
