America's Next Topic Model at PyData Berlin August 2016

How to choose the best LDA topic model using Gensim and pyLDAVis. Presented at the PyData Berlin meetup, 10 August 2016.

Lev Konstantinovskiy

August 10, 2016

Transcript

  1. America’s Next Topic Model

    Lev Konstantinovskiy, Community Manager at Gensim
    @teagermylk http://rare-technologies.com/
  2. Streaming Topic Modelling and Word2vec in Python

  3. About Lev Konstantinovskiy

    @teagermylk https://github.com/tmylk
    Graduate school drop-out in Algebraic Geometry.
    Worked in Trading IT.
    Graduate of Galvanize Data Science Bootcamp in San Francisco.
    Community manager and consultant at RaRe Technologies: NLP consulting and open-source development of Natural Language Processing packages in Python.
  4. Business problem: the main corporate asset is data. But how to use it?

    Imagine you have 10 gigabytes of text. What is in it? How do you navigate it? Which keywords do you search for?
  5. Business Problem solved by Topic Modelling

    Bird’s eye view of internal company documents.
    Drill down into individual documents by topic, rather than just by keywords!
  6. Business value: similar content

    From “LDA Topic Models using Gensim” by Andrius Knispelis https://vimeo.com/140431085
  7. RaRe Technologies Ltd.

  8. Let’s look at one specific Topic Modelling approach: Latent Dirichlet Allocation

    See paper for details: “Latent Dirichlet Allocation”, David Blei, Andrew Ng, Michael Jordan, 2003 http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
  9. 1. Let’s assume: documents are just bags of words

    a a although although although ambiguity and are arent at bad be be be beats beautiful better better better better better better better better break cases complex complex complicated counts dense do do dutch easy enough errors explain explain explicit explicitly face first flat good great guess hard honking idea idea idea if if implementation implementation implicit in is is is is is is is is is is it it its lets may may more namespaces nested never never never not now now obvious obvious of of often one one one only pass practicality preferably purity readability refuse right rules should should silenced silently simple sparse special special temptation than than than than than than than than that the the the the the there those to to to to to ugly unless unless way way youre

    Example from Timothy Hopper: Understanding Probabilistic Topic Models By Simulation https://www.youtube.com/watch?v=_R66X_udxZQ
    Do you recognise this text?
  10. It is “The Zen of Python” with word order removed

    import collections
    text = 'beautiful is better than ugly explicit is better than implicit simple is better than complex complex is better than complicated flat is better than nested sparse is better than dense readability counts special cases arent special enough to break the rules although practicality beats purity errors should never pass silently unless explicitly silenced in the face of ambiguity refuse the temptation to guess there should be one and preferably only one obvious way to do it although that way may not be obvious at first unless youre dutch now is better than never although never is often better than right now if the implementation is hard to explain its a bad idea if the implementation is easy to explain it may be a good idea namespaces are one honking great idea lets do more of those'
    ' '.join(sorted(text.split(' ')))
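A bag of words can also be stored as word counts rather than a sorted string. The sketch below is an illustration added to the transcript, not code from the slides; it uses only a short excerpt of the text and `collections.Counter`.

```python
from collections import Counter

# A short excerpt of the slide's text (the full text works the same way).
text = "beautiful is better than ugly explicit is better than implicit"

# Word order is discarded; only the count of each word remains.
bag = Counter(text.split())

# The same words in a different order produce an identical bag.
shuffled = "explicit is better than implicit beautiful is better than ugly"
same_bag = Counter(shuffled.split()) == bag

print(bag.most_common(3))
print(same_bag)
```

Two documents that differ only in word order are indistinguishable under this assumption, which is exactly what the sorted text on the slide demonstrates.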
  11. 2. Let’s assume that: Topic is a set of words

    Politics      Weather  Sports
    Conservative  Rain     Olympics
    Trump         Sunny    Football
    Clinton       Beach    Championship
    ….            ….       ….

    More formally: a topic is a probability distribution over all words, most of them with tiny probability.
  12. 3. Let’s assume that: Documents are written by following recipes

    Recipe: Politics 50%, Weather 30%, Sports 20%
    DOCUMENT: Conservative Liberal Rain President Sun Snow Minister Olympics Football

    Politics      Weather  Sports
    Conservative  Rain     Olympics
    Trump         Sunny    Football
    Clinton       Beach    Championship
    ….            ….       ….

    From “LDA Topic Models using Gensim” by Andrius Knispelis https://vimeo.com/140431085
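The recipe idea can be simulated in a few lines. This is a minimal sketch, not code from the talk: the word probabilities inside each topic are made up for illustration, and only the 50/30/20 recipe comes from the slide.

```python
import random

# Hypothetical toy topics: each maps words to probabilities (summing to 1).
topics = {
    "politics": {"conservative": 0.4, "liberal": 0.3, "president": 0.3},
    "weather": {"rain": 0.5, "sun": 0.3, "snow": 0.2},
    "sports": {"olympics": 0.5, "football": 0.5},
}

# The recipe from the slide: 50% politics, 30% weather, 20% sports.
recipe = {"politics": 0.5, "weather": 0.3, "sports": 0.2}


def generate_document(recipe, topics, length, rng):
    """For each word slot, draw a topic from the recipe, then a word from that topic."""
    words = []
    for _ in range(length):
        topic = rng.choices(list(recipe), weights=list(recipe.values()))[0]
        word_probs = topics[topic]
        word = rng.choices(list(word_probs), weights=list(word_probs.values()))[0]
        words.append(word)
    return words


rng = random.Random(0)  # fixed seed so the run is repeatable
document = generate_document(recipe, topics, length=9, rng=rng)
print(" ".join(document))
```

LDA assumes every document was produced this way and then works backwards from the observed words to the recipes and topics.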
  13. The fingerprint of the document is just the Recipe

    Dimensionality reduction: from #{vocabulary_size} dimensions -> just three
    Recipe: Politics 50%, Weather 30%, Sports 20%
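Once each document is reduced to its recipe, documents can be compared in that small topic space, for example to find similar content. The sketch below assumes made-up recipes and uses cosine similarity as one possible choice of comparison; the talk does not say which measure is used.

```python
import math

# Hypothetical recipes (topic proportions) in the order: politics, weather, sports.
doc_a = [0.5, 0.3, 0.2]
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.0, 0.1, 0.9]


def cosine_similarity(u, v):
    """Cosine of the angle between two topic-proportion vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_ab = cosine_similarity(doc_a, doc_b)  # two politics-heavy documents
sim_ac = cosine_similarity(doc_a, doc_c)  # politics-heavy vs sports-heavy
print(round(sim_ab, 3), round(sim_ac, 3))
```

The comparison costs three multiplications per pair regardless of vocabulary size, which is the practical payoff of the dimensionality reduction.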
  14. Behind the scenes: How to come up with topics?

    Go backwards with Bayesian Inference! Find the model that describes the documents best!
    - “Put words that appear together into the same topic.”

    Politics      Weather  Sports
    Conservative  Rain     Olympics
    Trump         Sunny    Football
    Clinton       Beach    Championship
    ….            ….       ….

    DOCUMENT 1: Conservative Liberal Rain President Sun Snow Minister Olympics Football
    DOCUMENT 2: Trump Clinton Storm Swimming Hockey
    DOCUMENT 1000: UEFA Arsenal Football
  15. Hyperparameters of Latent Dirichlet Allocation

    User has to fix some parameters:
    - Number of topics
    - Are docs about a few topics or many? (low or high alpha)
    - Do topics have a few words or many? (low or high beta)
    - Advanced: pre-seed topics, # of iterations, asymmetric priors, multicore etc.

    See paper for details: “Latent Dirichlet Allocation”, David Blei, Andrew Ng, Michael Jordan, 2003 http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
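The effect of alpha can be seen by sampling recipes directly from a symmetric Dirichlet prior. This is an illustrative sketch using only the standard library, not part of the talk; the topic count of 5 and the two alpha values are arbitrary choices.

```python
import random


def sample_dirichlet(alpha, num_topics, rng):
    """Sample topic proportions from a symmetric Dirichlet via normalised Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]


def average_largest_share(alpha, num_topics=5, samples=2000, seed=0):
    """Average share of the dominant topic across many sampled recipes."""
    rng = random.Random(seed)
    shares = [max(sample_dirichlet(alpha, num_topics, rng)) for _ in range(samples)]
    return sum(shares) / samples


low_alpha = average_largest_share(alpha=0.1)
high_alpha = average_largest_share(alpha=10.0)
print(f"alpha=0.1:  dominant topic share ~ {low_alpha:.2f}")
print(f"alpha=10.0: dominant topic share ~ {high_alpha:.2f}")
```

With low alpha one topic tends to dominate each recipe (documents are about a few topics); with high alpha the shares are spread more evenly. Beta plays the same role for the word distribution inside each topic.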
  16. Now that we have several models trained with different hyperparameters...

    Which model is the best? Evaluating unsupervised learning.
  17. Manual/qualitative evaluation

  18. From the Latent Dirichlet Allocation paper by David M. Blei

    Words colored according to their topic
  19. Colouring words in Gensim

    bow_water = ['bank', 'water', 'river', 'tree']
    color_words(good_lda_model, bow_water)
    bank river water tree
    color_words(bad_lda_model, bow_water)
    bank river water tree
    ? river bank or financial bank ?
  20. Credit for word colouring API goes to our Google Summer of Code student Bhargav Srinivasa

    Student at BITS Pilani university in Goa, India.
    Project: Dynamic Topic Modelling in Python
  21. PyLDAVis

    Link to Jupyter notebook: http://small.cat/cat
    Tutorial video from creator of PyLDAVis Ben Mabey https://www.youtube.com/watch?v=tGxW2BzC_DU
  22. PyLDAVis: No topic selected

    Blue bar is just frequency in the corpus.
  23. PyLDAVis: Topic selected. Jeopardy questions about languages

    Red bar is how frequent the word is in the topic. Blue bar is how frequent the word is in the entire corpus.
  24. PyLDAVis: Interactive Walk from World History (18) to American History (54)

    Red bar is how frequent the word is in the topic. Blue bar is how frequent the word is in the entire corpus.
  25. Automatic/quantitative evaluation.

  26. Automated model selection

    See “Reading Tea Leaves: How Humans Interpret Topic Models” by Chang, Boyd-Graber et al.
    Model fit ≠ human opinion
  27. Topic coherence = human opinion

    Coherence is how often the topic words appear ‘together’ in the corpus.
    The trick is: there are many ways to define ‘together’...
    Just use ‘c_v coherence’ - it is the best one.
    Paper: “Exploring the Space of Topic Coherence Measures” by M. Röder et al. http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf
  28. Coherence example

    Corpus = [
      “the game is a team sport”,
      “the game is played with a ball”,
      “the game demands great physical efforts”
    ]
    Topic to evaluate = {game, sport, ball, team}
    Many-many coherence measures can be tried. Here are two simple ones.
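The two measures shown on the slide did not survive in the transcript. As a stand-in, the sketch below assumes a UMass-style and a PMI-style score, both simplified to whole-document co-occurrence on the corpus above. These are illustrative simplifications, not Gensim's implementation and not the c_v measure.

```python
import math
from itertools import combinations

corpus = [
    "the game is a team sport",
    "the game is played with a ball",
    "the game demands great physical efforts",
]
docs = [set(text.split()) for text in corpus]


def doc_freq(*words):
    """Number of documents that contain all of the given words."""
    return sum(1 for doc in docs if all(w in doc for w in words))


def umass_style(topic):
    """Mean of log((D(w1, w2) + 1) / D(w1)) over word pairs, w1 listed first."""
    scores = [
        math.log((doc_freq(w1, w2) + 1) / doc_freq(w1))
        for w1, w2 in combinations(topic, 2)
    ]
    return sum(scores) / len(scores)


def pmi_style(topic, eps=1e-12):
    """Mean pointwise mutual information over word pairs, with eps smoothing."""
    n = len(docs)
    scores = []
    for w1, w2 in combinations(topic, 2):
        joint = doc_freq(w1, w2) / n
        independent = (doc_freq(w1) / n) * (doc_freq(w2) / n)
        scores.append(math.log((joint + eps) / independent))
    return sum(scores) / len(scores)


# Every topic word must occur in the corpus, otherwise a document frequency is zero.
topic = ["game", "sport", "ball", "team"]
print(umass_style(topic), pmi_style(topic))

# 'sport' and 'team' share a document; 'sport' and 'ball' never do.
print(pmi_style(["sport", "team"]), pmi_style(["sport", "ball"]))
```

Both scores reward word pairs that occur in the same documents; they differ in how they normalise by the individual word frequencies, which is one of the "many ways to define together".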
  29. Gensim coherence API

    All you need is a corpus and a set of words to find out how coherent the set is.

    from gensim.models.coherencemodel import CoherenceModel

    # goodLdaModel, badLdaModel, texts and dictionary are assumed to exist already
    goodcm = CoherenceModel(model=goodLdaModel, texts=texts, dictionary=dictionary, coherence='c_v')
    print(goodcm.get_coherence())  # 0.552164532134
    badcm = CoherenceModel(model=badLdaModel, texts=texts, dictionary=dictionary, coherence='c_v')
    print(badcm.get_coherence())  # 0.5269189184
  30. Credit for adding Topic Coherence to Gensim

    Our incubator student Devashish Deshpande. Student at BITS in Goa, India.
    http://rare-technologies.com/incubator/
  31. Summary: How to choose your next Topic Model

    Manually:
    - Colour words
    - pyLDAVis
    Automatically:
    - Topic coherence C_v
  32. RARE Training

    • Customized, interactive corporate training hosted on-site for technical teams of 5-15 developers, engineers, analysts and data scientists
    • 2-day intensives include Python Best Practices and Practical Machine Learning, and 1-day intensive Topic Modelling
    Industry-leading instructors: RNDr. Radim Řehůřek, Ph.D.; Gordon Mohr, BA in CS & Econ
    For more information email training@rare-technologies.com
  33. Lev Konstantinovskiy @teagermylk lev@rare-technologies.com

    See you at our PyCon UK and PyCon India Sprints!
    NLP Consulting and Corporate training