Slide 1

Slide 1 text

America’s Next Topic Model Lev Konstantinovskiy Community Manager at Gensim @teagermylk http://rare-technologies.com/

Slide 2

Slide 2 text

Streaming Topic Modelling and Word2vec in Python

Slide 3

Slide 3 text

About Lev Konstantinovskiy
@teagermylk https://github.com/tmylk
- Graduate school drop-out in Algebraic Geometry
- Worked in Trading IT
- Graduate of Galvanize Data Science Bootcamp in San Francisco
- Community manager and consultant at RaRe Technologies: NLP consulting and open-source development of Natural Language Processing packages in Python

Slide 4

Slide 4 text

Business problem: the main corporate asset is data. But how to use it? Imagine you have 10 gigabytes of text. What is in them? How do you navigate them? Which keyword do you search for?

Slide 5

Slide 5 text

Business problem solved by Topic Modelling:
- Bird’s-eye view of internal company documents
- Drill down into individual documents by topic, rather than just keywords!

Slide 6

Slide 6 text

Business value: similar content. (Slide from “LDA Topic Models using Gensim” by Andrius Knispelis, https://vimeo.com/140431085)

Slide 7

Slide 7 text

RaRe Technologies Ltd.

Slide 8

Slide 8 text

Let’s look at one specific Topic Modelling approach: Latent Dirichlet Allocation. See the paper for details: “Latent Dirichlet Allocation”, David Blei, Andrew Ng, Michael Jordan, 2003. http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf

Slide 9

Slide 9 text

1. Let’s assume: documents are just bags of words.

a a although although although ambiguity and are arent at bad be be be beats beautiful better better better better better better better better break cases complex complex complicated counts dense do do dutch easy enough errors explain explain explicit explicitly face first flat good great guess hard honking idea idea idea if if implementation implementation implicit in is is is is is is is is is is it it its lets may may more namespaces nested never never never not now now obvious obvious of of often one one one only pass practicality preferably purity readability refuse right rules should should silenced silently simple sparse special special temptation than than than than than than than than that the the the the the there those to to to to to ugly unless unless way way youre

Example from Timothy Hopper: “Understanding Probabilistic Topic Models By Simulation”, https://www.youtube.com/watch?v=_R66X_udxZQ

Do you recognise this text?

Slide 10

Slide 10 text

text = 'beautiful is better than ugly explicit is better than implicit simple is better than complex complex is better than complicated flat is better than nested sparse is better than dense readability counts special cases arent special enough to break the rules although practicality beats purity errors should never pass silently unless explicitly silenced in the face of ambiguity refuse the temptation to guess there should be one and preferably only one obvious way to do it although that way may not be obvious at first unless youre dutch now is better than never although never is often better than right now if the implementation is hard to explain its a bad idea if the implementation is easy to explain it may be a good idea namespaces are one honking great idea lets do more of those'

# Throw away word order: split into words, sort alphabetically, rejoin.
print(' '.join(sorted(text.split(' '))))

It is “The Zen of Python” with the word order removed.
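For reference, the same bag-of-words idea expressed with gensim; a minimal sketch reusing the `text` variable from the snippet above (Dictionary and doc2bow are standard gensim APIs):

from gensim.corpora import Dictionary

texts = [text.split(' ')]            # gensim expects a list of tokenised documents
dictionary = Dictionary(texts)       # maps each word to an integer id
bow = dictionary.doc2bow(texts[0])   # bag of words: (word_id, count) pairs, order discarded
print(bow[:5])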

Slide 11

Slide 11 text

2. Let’s assume that: a topic is a set of words.

Politics: Conservative, Trump, Clinton, …
Weather: Rain, Sunny, Beach, …
Sports: Olympics, Football, Championship, …

More formally: a topic is a probability distribution over all words, most of them with tiny probability.
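As a toy illustration (the numbers below are made up), a topic in code is just a probability distribution over the whole vocabulary, peaked on a few words; with a trained gensim model you can inspect one via show_topic:

# Hypothetical probabilities for the "Politics" topic above.
politics_topic = {
    'conservative': 0.05, 'trump': 0.04, 'clinton': 0.04,
    'rain': 0.0001, 'football': 0.0001,
    # ...every remaining vocabulary word carries some tiny probability
}

# With a trained gensim model:
# lda.show_topic(topic_id, topn=10)  -> top (word, probability) pairs for that topic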

Slide 12

Slide 12 text

3. Let’s assume that: documents are written by following recipes.

Recipe: Politics 50%, Weather 30%, Sports 20%

DOCUMENT: Conservative Liberal Rain President Sun Snow Minister Olympics Football

Topics as before:
Politics: Conservative, Trump, Clinton, …
Weather: Rain, Sunny, Beach, …
Sports: Olympics, Football, Championship, …

(Slide from “LDA Topic Models using Gensim” by Andrius Knispelis, https://vimeo.com/140431085)
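A minimal simulation of this generative recipe (toy topics and numbers, purely illustrative): roll the recipe to pick a topic, then pick a word from that topic, and repeat:

import random

random.seed(0)

topics = {
    'politics': ['conservative', 'liberal', 'president', 'minister'],
    'weather':  ['rain', 'sun', 'snow'],
    'sports':   ['olympics', 'football'],
}
recipe = {'politics': 0.5, 'weather': 0.3, 'sports': 0.2}

document = []
for _ in range(9):
    # First choose a topic according to the recipe, then draw a word from it.
    topic = random.choices(list(recipe), weights=list(recipe.values()))[0]
    document.append(random.choice(topics[topic]))
print(' '.join(document))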

Slide 13

Slide 13 text

The fingerprint of the document is just the recipe. This is dimensionality reduction: from #{vocabulary_size} dimensions down to just three.

Recipe: Politics 50%, Weather 30%, Sports 20%
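In gensim, the recipe/fingerprint of a new document is what you get back when you pass its bag of words through a trained model; a minimal sketch, assuming an already-trained `lda` model and its `dictionary` (a training sketch appears two slides below):

bow = dictionary.doc2bow('trump clinton rain'.split())
print(lda[bow])   # e.g. [(0, 0.5), (1, 0.3), (2, 0.2)] -- the document's recipe over topics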

Slide 14

Slide 14 text

Behind the scenes: how to come up with the topics? Go backwards with Bayesian inference! Find the model that describes the documents best: “Put words that appear together into the same topic.”

DOCUMENT 1: Conservative Liberal Rain President Sun Snow Minister Olympics Football
DOCUMENT 2: Trump Clinton Storm Swimming Hockey
…
DOCUMENT 1000: UEFA Arsenal Football

Inferred topics:
Politics: Conservative, Trump, Clinton, …
Weather: Rain, Sunny, Beach, …
Sports: Olympics, Football, Championship, …
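A minimal end-to-end sketch of going backwards with gensim's LdaModel on a toy corpus (real corpora need far more documents and passes; topic quality on three documents is meaningless):

from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [['conservative', 'liberal', 'rain', 'president', 'sun', 'snow'],
        ['trump', 'clinton', 'storm', 'swimming', 'hockey'],
        ['uefa', 'arsenal', 'football']]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Infer the topics that best describe the documents.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, passes=10)
for topic_id in range(3):
    print(lda.show_topic(topic_id, topn=3))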

Slide 15

Slide 15 text

Hyperparameters of Latent Dirichlet Allocation. The user has to fix some parameters:
- Number of topics
- Are docs about a few topics or many? (low or high alpha)
- Do topics have a few words or many? (low or high beta)
- Advanced: pre-seed topics, # of iterations, asymmetric priors, multicore etc.
(A gensim sketch of these arguments follows below.) See the paper for details: “Latent Dirichlet Allocation”, David Blei, Andrew Ng, Michael Jordan, 2003. http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
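A sketch of how these hyperparameters map onto gensim's LdaModel arguments, reusing `corpus` and `dictionary` from the previous sketch (the 'auto' priors are a gensim convenience, not part of the original paper):

from gensim.models import LdaModel

lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=20,    # number of topics
    alpha='auto',     # document-topic prior: low = few topics per doc; 'auto' learns an asymmetric prior
    eta='auto',       # topic-word prior (the beta above): low = few words per topic
    passes=10,        # number of passes over the corpus
)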

Slide 16

Slide 16 text

Now that we have several models trained with different hyperparameters... Which model is the best? Evaluating unsupervised learning.

Slide 17

Slide 17 text

Manual/qualitative evaluation

Slide 18

Slide 18 text

From the “Latent Dirichlet Allocation” paper by David M. Blei et al., with words coloured according to their topic.

Slide 19

Slide 19 text

Colouring words in Gensim (in the original slide each output word is coloured by its topic; the colours do not survive in this text version):

bow_water = ['bank', 'water', 'river', 'tree']
color_words(good_lda_model, bow_water)
# bank river water tree  <- each word coloured by a sensible topic
color_words(bad_lda_model, bow_water)
# bank river water tree  <- the colouring is confused

Is it a river bank or a financial bank?
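A minimal sketch of the idea behind the colouring (not the actual notebook helper shown above): colour each word by its most probable topic, which gensim exposes via get_term_topics; `lda` and `dictionary` are assumed trained:

def dominant_topic(lda, dictionary, word):
    # Return the id of the topic in which this word is most probable.
    word_id = dictionary.token2id[word]
    term_topics = lda.get_term_topics(word_id, minimum_probability=0)
    return max(term_topics, key=lambda t: t[1])[0] if term_topics else None

for word in ['bank', 'water', 'river', 'tree']:
    print(word, '-> topic', dominant_topic(lda, dictionary, word))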

Slide 20

Slide 20 text

Credit for the word colouring API goes to our Google Summer of Code student Bhargav Srinivasa, a student at BITS Pilani university, Goa campus, India. Project: Dynamic Topic Modelling in Python.

Slide 21

Slide 21 text

PyLDAVis. Link to the Jupyter notebook: http://small.cat/cat. Tutorial video from the creator of pyLDAvis, Ben Mabey: https://www.youtube.com/watch?v=tGxW2BzC_DU
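A minimal pyLDAvis sketch, assuming a trained gensim `lda`, `corpus` and `dictionary` as in the earlier sketches (in newer pyLDAvis releases the gensim helper module is called pyLDAvis.gensim_models rather than pyLDAvis.gensim):

import pyLDAvis
import pyLDAvis.gensim as gensimvis   # pyLDAvis.gensim_models in newer versions

vis = gensimvis.prepare(lda, corpus, dictionary)
pyLDAvis.display(vis)   # renders the interactive visualisation in a Jupyter notebook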

Slide 22

Slide 22 text

PyLDAVis: no topic selected. The blue bar is just the word’s frequency in the whole corpus.

Slide 23

Slide 23 text

PyLDAVis: a topic selected (Jeopardy questions about languages). The red bar is how frequent the word is within the topic; the blue bar is how frequent it is in the entire corpus.

Slide 24

Slide 24 text

PyLDAVis is interactive: walk from World History (18) to American History (54). The red bar is how frequent the word is within the topic; the blue bar is how frequent it is in the entire corpus.

Slide 25

Slide 25 text

Automatic/quantitative evaluation.

Slide 26

Slide 26 text

Automated model selection. See “Reading Tea Leaves: How Humans Interpret Topic Models” by Chang, Boyd-Graber et al. Its key finding: model fit (perplexity) does not always agree with human opinion of topic quality.

Slide 27

Slide 27 text

Topic coherence = human opinion. Coherence measures how often the topic’s words appear ‘together’ in the corpus. The trick is that there are many ways to define ‘together’... Just use ‘c_v’ coherence - it is the best one. Paper: “Exploring the Space of Topic Coherence Measures” by M. Röder et al. http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf

Slide 28

Slide 28 text

Coherence example.
Corpus = [ “the game is a team sport”, “the game is played with a ball”, “the game demands great physical efforts” ]
Topic to evaluate = {game, sport, ball, team}
Many coherence measures can be tried. Here are two simple ones (a co-occurrence counting sketch follows below).
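To make the flavour of such measures concrete, here is a sketch of the simplest possible kind: for every pair of topic words, count the documents in which both occur (this illustrates the idea, not the exact formulas from the paper):

from itertools import combinations

corpus = ['the game is a team sport',
          'the game is played with a ball',
          'the game demands great physical efforts']
topic = ['game', 'sport', 'ball', 'team']

docs = [set(doc.split()) for doc in corpus]
for w1, w2 in combinations(topic, 2):
    # Count documents containing both words of the pair.
    co_docs = sum(1 for doc in docs if w1 in doc and w2 in doc)
    print(w1, w2, co_docs)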

Slide 29

Slide 29 text

Gensim coherence API. All you need is a corpus and a set of words to find out how coherent the set is.

from gensim.models import CoherenceModel

goodcm = CoherenceModel(model=goodLdaModel, texts=texts, dictionary=dictionary, coherence='c_v')
print(goodcm.get_coherence())
# 0.552164532134

badcm = CoherenceModel(model=badLdaModel, texts=texts, dictionary=dictionary, coherence='c_v')
print(badcm.get_coherence())
# 0.5269189184

Slide 30

Slide 30 text

Credit for adding Topic Coherence to Gensim goes to our incubator student Devashish Deshpande, a student at BITS Pilani, Goa campus, India. http://rare-technologies.com/incubator/

Slide 31

Slide 31 text

Summary: how to choose your next Topic Model.
Manually:
- Colour words
- pyLDAVis
Automatically:
- Topic coherence c_v

Slide 32

Slide 32 text

RARE Training
•customized, interactive corporate training hosted on-site for technical teams of 5-15 developers, engineers, analysts and data scientists
•2-day intensives include Python Best Practices and Practical Machine Learning, plus a 1-day intensive on Topic Modelling
Industry-leading instructors: RNDr. Radim Řehůřek, Ph.D. and Gordon Mohr, BA in CS & Econ.
For more information email [email protected]

Slide 33

Slide 33 text

Lev Konstantinovskiy @teagermylk [email protected] See you at our PyCon UK and PyCon India Sprints! NLP Consulting and Corporate training