Mining topics in documents with topic modelling and Python @ London Python meetup

Introduction to topic modelling in Python - presentation given at the London Python meetup in September 2019 (https://www.meetup.com/LondonPython/events/264921863/)

Title: What are they talking about? Mining topics in documents with topic modelling and Python

Abstract:
This presentation is a practical introduction to topic modelling in Python, tackling the problem of analysing large data sets of text, in order to identify topics of interest and related keywords.

Demo:
https://github.com/bonzanini/topic-modelling

Marco Bonzanini

September 26, 2019

Transcript

  1. Mining Topics in Documents with Topic Modelling and Python
     @MarcoBonzanini
     London Python meetup - September 2019
     Demo on: github.com/bonzanini/topic-modelling
  2. • Sept 2016: Intro to NLP
     • Sept 2017: Intro to Word Embeddings
     • Sept 2018: Intro to NLG
     • Sept 2019: Intro to Topic Modelling
     • Sept 2020: Intro to … ???
  3. Nice to meet you
     • Data Science consultant: NLP, Machine Learning, Data Engineering
     • Corporate training: Python + Data Science
     • PyData London chairperson
  4. PyData London Conference 15-17 May 2020 @PyDataLondon

  5. This presentation
     • Introduction to Topic Modelling
     • Depending on time/interest: Happy to discuss broader applications of NLP
     • The audience (tell me about you):
       - new-ish to NLP?
       - new-ish to Python tools for NLP?
     github.com/bonzanini/topic-modelling
  6. Motivation
     Suppose you:
     • have a huge number of (text) documents
     • want to know what they’re talking about
     • can’t read them all
  7. Topic Modelling
     • Bird’s-eye view on the whole corpus (dataset of docs)
     • Unsupervised learning
       pros: no need for labelled data
       cons: how to evaluate the model?
  8. Topic Modelling
     Input:
     - a collection of documents
     - a number of topics K
  9. Topic Modelling
     Output:
     - K topics
     - their word distributions
     movie, actor, soundtrack, director, …
     goal, match, referee, champions, …
     price, invest, market, stock, …
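As a rough illustration of what this output looks like in code, the sketch below represents each topic as a mapping from words to probabilities and extracts the highest-weighted keywords. The probabilities and the helper name `top_words` are made up for illustration; they are not taken from the talk or the demo repository.

```python
def top_words(topic, n=3):
    """Return the n most probable words of a topic (a word -> probability dict)."""
    ranked = sorted(topic.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]


# Hypothetical output of a topic model with K = 3 topics.
# Each topic is a probability distribution over the vocabulary.
topics = [
    {"movie": 0.4, "actor": 0.3, "director": 0.2, "goal": 0.1},
    {"goal": 0.5, "match": 0.3, "referee": 0.15, "movie": 0.05},
    {"price": 0.35, "market": 0.3, "stock": 0.25, "match": 0.1},
]

for k, topic in enumerate(topics):
    print(f"Topic {k}: {', '.join(top_words(topic))}")
```

Note that the model only returns the distributions; deciding that the first topic is about "cinema" is left to the person reading the keywords.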
  10. Distributional Hypothesis
      • “You shall know a word by the company it keeps” (J. R. Firth, 1957)
      • “Words that occur in similar context, tend to have similar meaning” (Z. Harris, 1954)
      • Context approximates Meaning
  11. Term-document matrix

               Word 1   Word 2   Word N
      Doc 1       1        7        2
      Doc 2       3        0        5
      Doc N       0        4        2
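A term-document matrix like the one on this slide can be built with a few lines of standard-library Python. This is a minimal sketch that assumes whitespace tokenisation and lowercasing; the function name `term_document_matrix` and the toy documents are illustrative only, and a real pipeline would normally add stop-word removal and other preprocessing.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (vocabulary, matrix), where matrix[i][j] counts term j in document i."""
    tokenised = [doc.lower().split() for doc in docs]
    # Sorted vocabulary gives every term a stable column index.
    vocabulary = sorted({token for doc in tokenised for token in doc})
    matrix = []
    for doc in tokenised:
        counts = Counter(doc)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Toy corpus, made up for illustration.
docs = [
    "goal match referee goal",
    "movie actor director movie",
    "market stock price market stock",
]
vocabulary, matrix = term_document_matrix(docs)

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column one word, so most cells are zero for any realistic vocabulary; libraries typically store this matrix in a sparse format for that reason.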
  12. Latent Dirichlet Allocation
      • Commonly used topic modelling approach
      • Key idea:
        each document is a distribution of topics
        each topic is a distribution of words
  13. Latent Dirichlet Allocation
      • “Latent” as in hidden: only words are visible, other variables are hidden
      • “Dirichlet Allocation”: topics are assumed to be distributed with a specific probability (Dirichlet prior)
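The two LDA slides can be made concrete by simulating the generative story that LDA assumes: draw a topic mixture for the document from a Dirichlet prior, then for each word position pick a topic and then a word from that topic. The sketch below uses only the standard library and shows the generative direction, not inference. The vocabulary, the hyperparameter values and the function names are assumptions chosen for illustration; fitting a model to real text means inverting this process, which is what topic modelling libraries do.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet distribution by normalising gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [draw / total for draw in draws]


def generate_document(topic_word, vocabulary, alpha, length, rng):
    """Generate one document following the process LDA assumes."""
    k = len(topic_word)
    # Hidden variable: the document's distribution over the K topics.
    theta = sample_dirichlet([alpha] * k, rng)
    words = []
    for _ in range(length):
        # Hidden variable: the topic assigned to this word position.
        topic = rng.choices(range(k), weights=theta)[0]
        # Visible variable: the word itself, drawn from that topic.
        words.append(rng.choices(vocabulary, weights=topic_word[topic])[0])
    return theta, words


rng = random.Random(42)
vocabulary = ["movie", "actor", "goal", "match", "price", "stock"]
K = 3  # number of topics, chosen by the user

# Each topic is a distribution over the vocabulary, also Dirichlet-distributed.
topic_word = [sample_dirichlet([0.5] * len(vocabulary), rng) for _ in range(K)]
theta, words = generate_document(topic_word, vocabulary, alpha=0.5, length=10, rng=rng)

print("topic mixture:", [round(p, 2) for p in theta])
print("document:", " ".join(words))
```

Only `words` would be observed in practice; `theta` and `topic_word` are the latent quantities that inference tries to recover.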
  14. Topic Model Evaluation
      • How good is my topic model? “Unsupervised learning”… is there a correct answer?
      • Extrinsic metrics: what’s the task?
      • Intrinsic metrics: e.g. topic coherence
      • More interesting:
        - how useful is my topic model?
        - data visualisation can help to get some insights
  15. Topic Coherence
      • It gives a score of the topic quality
      • Relationship with Information Theory (Pointwise Mutual Information)
      • Used to find the best number of topics for a corpus
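To show how Pointwise Mutual Information relates to coherence, here is a simplified sketch that averages the PMI of every pair of a topic's top words, using document-level co-occurrence. This is one possible simplification rather than the exact measure used in the talk: published coherence measures differ in how they count co-occurrences and in how they smooth the estimates. The function name `pmi_coherence`, the smoothing constant `eps` and the toy corpus are all assumptions.

```python
import math
from itertools import combinations


def pmi_coherence(top_words, docs, eps=1e-12):
    """Average PMI over all pairs of top words, using document co-occurrence."""
    doc_sets = [set(doc.lower().split()) for doc in docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing all the given words.
        return sum(all(w in doc for w in words) for doc in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        joint = prob(w1, w2)
        independent = prob(w1) * prob(w2)
        # PMI = log( P(w1, w2) / (P(w1) * P(w2)) ), with eps to avoid log(0).
        scores.append(math.log((joint + eps) / (independent + eps)))
    return sum(scores) / len(scores)


# Toy corpus, made up for illustration.
docs = [
    "goal match referee",
    "goal match champions",
    "movie actor director",
    "movie actor soundtrack",
    "price market stock",
    "stock market invest",
]

coherent = pmi_coherence(["goal", "match"], docs)  # words that co-occur
mixed = pmi_coherence(["goal", "stock"], docs)     # words that never co-occur
print(coherent > mixed)
```

To compare different numbers of topics, one would train a model for each candidate K, score its topics in this way, and look at how the average coherence changes with K.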
  16. Demo

  17. Conclusions
      • Topic Modelling gives you a bird’s-eye view on a collection of documents
      • It doesn’t give you:
        - a “name” for each topic (you have to find out)
        - the exact number of topics (you have to find out)
      • Excellent tool for exploratory analysis and knowledge discovery
  18. THANK YOU @MarcoBonzanini @PyDataLondon