
America's Next Topic Model at PyData Berlin August 2016

How to choose the best LDA topic model using Gensim and pyLDAvis. Presented at the PyData Berlin meetup, 10 August 2016.

Lev Konstantinovskiy

August 10, 2016

Transcript

  1. America’s Next Topic Model
    Lev Konstantinovskiy
    Community Manager at Gensim
    @teagermylk
    http://rare-technologies.com/


  2. Streaming Topic Modelling and Word2vec in Python


  3. About
    Lev Konstantinovskiy
    @teagermylk
    https://github.com/tmylk
    Graduate school drop-out in Algebraic Geometry
    Worked in Trading IT
    Graduate of Galvanize Data Science Bootcamp in San Francisco
    Community manager and consultant at RaRe Technologies.
    NLP consulting and open-source development of Natural Language
    Processing packages in Python.


  4. Business problem
    The main corporate asset is data. But how do you use it?
    Imagine you have 10 gigabytes of text. What is in it?
    How do you navigate it? Which keywords should you search for?


  5. The business problem solved by topic modelling
    A bird's-eye view of internal company documents.
    Drill down into individual documents by topic,
    rather than by keywords alone!


  6. Business value: similar content
    LDA Topic Models using Gensim from Andrius Knispelis https://vimeo.com/140431085


  7. RaRe Technologies Ltd.


  8. Let’s look at one specific Topic Modelling approach:
    Latent Dirichlet Allocation
    See the paper for details: “Latent Dirichlet Allocation”, David Blei, Andrew Ng, Michael Jordan, 2003.
    http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf


  9. 1. Let’s assume: documents are just bags of words
    a a although although although ambiguity and are arent at bad be be be beats beautiful better
    better better better better better better better break cases complex complex complicated counts
    dense do do dutch easy enough errors explain explain explicit explicitly face first flat good great
    guess hard honking idea idea idea if if implementation implementation implicit in is is is is is is is
    is is is it it its lets may may more namespaces nested never never never not now now obvious
    obvious of of often one one one only pass practicality preferably purity readability refuse right
    rules should should silenced silently simple sparse special special temptation than than than than
    than than than than that the the the the the there those to to to to to ugly unless unless way way
    youre
    Example from Timothy Hopper: Understanding Probabilistic Topic Models By Simulation
    https://www.youtube.com/watch?v=_R66X_udxZQ
    Do you recognise this text?


  10. text = ('beautiful is better than ugly explicit is better than implicit '
            'simple is better than complex complex is better than complicated '
            'flat is better than nested sparse is better than dense '
            'readability counts special cases arent special enough to break '
            'the rules although practicality beats purity errors should never '
            'pass silently unless explicitly silenced in the face of ambiguity '
            'refuse the temptation to guess there should be one and preferably '
            'only one obvious way to do it although that way may not be obvious '
            'at first unless youre dutch now is better than never although '
            'never is often better than right now if the implementation is hard '
            'to explain its a bad idea if the implementation is easy to explain '
            'it may be a good idea namespaces are one honking great idea lets '
            'do more of those')
    # Sort the words and join them back together:
    print(' '.join(sorted(text.split(' '))))
    It is “The Zen of Python” with word order removed


  11. 2. Let's assume that: a topic is a set of words
    Politics        Weather   Sports
    ------------    -------   ------------
    Conservative    Rain      Olympics
    Trump           Sunny     Football
    Clinton         Beach     Championship
    ...             ...       ...
    More formally: a topic is a probability distribution over all words, most of them with tiny probability.
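    In Gensim you can inspect this distribution directly. A minimal sketch, assuming an already-trained LdaModel called lda (a hypothetical name):

    # Print the 10 most probable words of topic 0 with their probabilities.
    # `lda` is an assumed, already-trained gensim LdaModel.
    for word, prob in lda.show_topic(0, topn=10):
        print('%-15s %.4f' % (word, prob))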


  12. 3. Let's assume that: documents are written by following recipes
    Recipe:
        Politics 50%    Weather 30%    Sports 20%
    DOCUMENT:
        Conservative Liberal Rain President Sun Snow Minister Olympics Football
    Politics        Weather   Sports
    ------------    -------   ------------
    Conservative    Rain      Olympics
    Trump           Sunny     Football
    Clinton         Beach     Championship
    ...             ...       ...
    LDA Topic Models using Gensim from Andrius Knispelis https://vimeo.com/140431085


  13. The fingerprint of a document is just its recipe
    Dimensionality reduction: from #{vocabulary_size} dimensions down to just three.
    Recipe:
        Politics 50%    Weather 30%    Sports 20%
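    In Gensim, a document's recipe is its topic distribution. A minimal sketch, assuming the trained lda model and a Gensim dictionary from before (hypothetical names):

    # Get the 'recipe' (topic distribution) of a new document.
    # `lda` and `dictionary` are assumed to already exist.
    bow = dictionary.doc2bow('minister discussed football in the rain'.split())
    print(lda.get_document_topics(bow))
    # hypothetical output: [(0, 0.5), (1, 0.3), (2, 0.2)]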


  14. Behind the scenes: how to come up with topics?
    Go backwards with Bayesian inference!
    Find the model that describes the documents best:
    "Put words that appear together into the same topic."
    Politics        Weather   Sports
    ------------    -------   ------------
    Conservative    Rain      Olympics
    Trump           Sunny     Football
    Clinton         Beach     Championship
    ...             ...       ...
    DOCUMENT 1:    Conservative Liberal Rain President Sun Snow Minister Olympics Football
    DOCUMENT 2:    Trump Clinton Storm Swimming Hockey
    DOCUMENT 1000: UEFA Arsenal Football


  15. Hyperparameters of Latent Dirichlet Allocation
    The user has to fix some parameters (see the sketch below):
    - Number of topics
    - Are docs about a few topics or many? (low or high alpha)
    - Do topics have a few words or many? (low or high beta)
    - Advanced: pre-seeded topics, number of iterations, asymmetric priors, multicore training, etc.
    See the paper for details: “Latent Dirichlet Allocation”, David Blei, Andrew Ng, Michael Jordan, 2003.
    http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
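    A minimal sketch of where these hyperparameters appear in Gensim's LDA API, assuming a bag-of-words corpus and a dictionary have already been built (hypothetical variable names):

    from gensim.models import LdaModel

    lda = LdaModel(
        corpus=corpus,        # assumed bag-of-words corpus
        id2word=dictionary,   # assumed gensim Dictionary
        num_topics=3,         # number of topics
        alpha='auto',         # document-topic prior: low = few topics per doc
        eta='auto',           # topic-word prior (beta): low = few words per topic
        iterations=50,        # inference iterations per chunk of documents
        passes=10)            # passes over the whole corpus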


  16. Now that we have several models trained with
    different hyperparameters...
    Which model is the best?
    Evaluating unsupervised learning.


  17. Manual/qualitative evaluation


  18. From the Latent Dirichlet Allocation paper by David M. Blei.
    Words coloured according to their topic.


  19. Colouring words in Gensim
    bow_water = ['bank', 'water', 'river', 'tree']
    color_words(good_lda_model, bow_water)
    # -> bank river water tree   (each word coloured by its topic)
    color_words(bad_lda_model, bow_water)
    # -> bank river water tree
    River bank or financial bank?
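    The colouring boils down to asking the model for each word's most probable topic. A rough sketch of that idea (a hypothetical helper, not the notebook's exact code):

    # Map each word to its most probable topic; one colour is then
    # assigned per topic.
    def most_likely_topic(lda_model, dictionary, word):
        topics = lda_model.get_term_topics(dictionary.token2id[word])
        return max(topics, key=lambda t: t[1])[0] if topics else None

    for word in ['bank', 'water', 'river', 'tree']:
        print(word, most_likely_topic(good_lda_model, dictionary, word))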


  20. Credit for the word-colouring API goes to our Google Summer of Code student
    Bhargav Srinivasa, a student at BITS Pilani university in Goa, India.
    Project: Dynamic Topic Modelling in Python.


  21. pyLDAvis
    Link to Jupyter notebook: http://small.cat/cat
    Tutorial video from the creator of pyLDAvis, Ben Mabey: https://www.youtube.com/watch?v=tGxW2BzC_DU
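    A minimal sketch of feeding a Gensim model into pyLDAvis inside a Jupyter notebook, assuming the lda, corpus and dictionary objects from earlier (recent pyLDAvis releases renamed the module to pyLDAvis.gensim_models):

    import pyLDAvis
    import pyLDAvis.gensim   # pyLDAvis.gensim_models in newer releases

    pyLDAvis.enable_notebook()   # render the visualisation inline
    vis = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
    pyLDAvis.display(vis)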


  22. pyLDAvis: no topic selected.
    The blue bar is just the word's frequency in the corpus.


  23. pyLDAvis: topic selected.
    Jeopardy questions about languages.
    The red bar is how frequent the word is in the topic.
    The blue bar is how frequent the word is in the entire corpus.


  24. pyLDAvis: interactive
    Walk from World History (18) to American History (54).
    The red bar is how frequent the word is in the topic.
    The blue bar is how frequent the word is in the entire corpus.


  25. Automatic/quantitative evaluation.


  26. Automated model selection
    See "Reading Tea Leaves: How Humans Interpret Topic Models" by Chang, Boyd-Graber et al.
    (Chart: model fit vs. human opinion.)


  27. Topic coherence = human opinion
    Coherence is how often the topic words appear 'together' in the corpus.
    The trick is that there are many ways to define 'together'... Just use 'c_v' coherence - it is the best one.
    Paper: "Exploring the Space of Topic Coherence Measures" by M. Röder et al.
    http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf


  28. Coherence example
    Corpus = [ "the game is a team sport", "the game is played with a ball", "the game demands great physical efforts" ]
    Topic to evaluate = {game, sport, ball, team}
    Many different coherence measures can be tried. Here are two simple ones (one is sketched below).
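    As an illustration, a minimal sketch of one simple measure (hypothetical code, not necessarily the slide's exact measures): score the topic by how often pairs of its words co-occur in the same document.

    from itertools import combinations

    corpus = ["the game is a team sport",
              "the game is played with a ball",
              "the game demands great physical efforts"]
    topic = ["game", "sport", "ball", "team"]

    # For every pair of topic words, count the documents containing both.
    docs = [set(doc.split()) for doc in corpus]
    pairs = list(combinations(topic, 2))
    hits = sum(sum(1 for d in docs if w1 in d and w2 in d) for w1, w2 in pairs)
    print(hits / len(pairs))   # average pairwise co-occurrence, ~0.67 here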


  29. Gensim coherence API
    All you need is a corpus and a set of words to find out how coherent the set is.
    from gensim.models import CoherenceModel

    goodcm = CoherenceModel(model=goodLdaModel, texts=texts,
                            dictionary=dictionary, coherence='c_v')
    print(goodcm.get_coherence())
    # 0.552164532134

    badcm = CoherenceModel(model=badLdaModel, texts=texts,
                           dictionary=dictionary, coherence='c_v')
    print(badcm.get_coherence())
    # 0.5269189184


  30. Credit for adding topic coherence to Gensim goes to our incubator student
    Devashish Deshpande, a student at BITS Pilani in Goa, India.
    http://rare-technologies.com/incubator/


  31. Summary: how to choose your next topic model
    Manually:
    - Colour words
    - pyLDAvis
    Automatically:
    - Topic coherence c_v (see the sketch below)
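    Putting the pieces together, a sketch of automatic selection by c_v coherence, assuming corpus, dictionary and tokenised texts already exist (hypothetical names):

    from gensim.models import CoherenceModel, LdaModel

    best_model, best_score = None, float('-inf')
    for num_topics in (5, 10, 20, 40):
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
        score = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                               coherence='c_v').get_coherence()
        if score > best_score:
            best_model, best_score = lda, score
    print(best_model.num_topics, best_score)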


  32. RARE Training
    • Customized, interactive corporate training hosted on-site for technical teams of 5-15 developers, engineers, analysts and data scientists.
    • 2-day intensives include Python Best Practices and Practical Machine Learning; a 1-day intensive covers Topic Modelling.
    Industry-leading instructors: RNDr. Radim Řehůřek, Ph.D. and Gordon Mohr, BA in CS & Econ.
    For more information, email [email protected]


  33. Lev Konstantinovskiy @teagermylk
    [email protected]
    See you at our PyCon UK and PyCon India Sprints!
    NLP Consulting and Corporate training
