America's Next Topic Model at PyData Berlin August 2016

America’s Next Topic Model
Lev Konstantinovskiy, Community Manager at Gensim
@teagermylk http://rare-technologies.com/

Streaming Topic Modelling and Word2vec in Python

About Lev Konstantinovskiy
@teagermylk https://github.com/tmylk
- Graduate school drop-out in Algebraic Geometry
- Worked in Trading IT
- Graduate of Galvanize Data Science Bootcamp in San Francisco
- Community manager and consultant at RaRe Technologies: NLP consulting and open-source development of Natural Language Processing packages in Python

Business problem
The main corporate asset is data. But how to use it?
Imagine you have 10 gigabytes of text. What is in it? How do you navigate it? Which keyword do you search for?

Business Problem solved by Topic Modelling
Bird’s eye view of internal company documents.
Drill down into individual documents by topic, rather than just keywords!

Business value: similar content
From “LDA Topic Models using Gensim” by Andrius Knispelis https://vimeo.com/140431085

RaRe Technologies Ltd.

Let’s look at one specific Topic Modelling approach: Latent Dirichlet Allocation
See paper for details: “Latent Dirichlet Allocation”, David Blei, Andrew Ng, Michael Jordan, 2003 http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf

1. Let’s assume: documents are just bags of words
'a a although although although ambiguity and are arent at bad be be be beats beautiful better better better better better better better better break cases complex complex complicated counts dense do do dutch easy enough errors explain explain explicit explicitly face first flat good great guess hard honking idea idea idea if if implementation implementation implicit in is is is is is is is is is is it it its lets may may more namespaces nested never never never not now now obvious obvious of of often one one one only pass practicality preferably purity readability refuse right rules should should silenced silently simple sparse special special temptation than than than than than than than than that the the the the the there those to to to to to ugly unless unless way way youre'
Example from Timothy Hopper: “Understanding Probabilistic Topic Models By Simulation” https://www.youtube.com/watch?v=_R66X_udxZQ
Do you recognise this text?

# The original text, lower-cased and with punctuation removed
text = (
    'beautiful is better than ugly '
    'explicit is better than implicit '
    'simple is better than complex '
    'complex is better than complicated '
    'flat is better than nested '
    'sparse is better than dense '
    'readability counts '
    'special cases arent special enough to break the rules '
    'although practicality beats purity '
    'errors should never pass silently '
    'unless explicitly silenced '
    'in the face of ambiguity refuse the temptation to guess '
    'there should be one and preferably only one obvious way to do it '
    'although that way may not be obvious at first unless youre dutch '
    'now is better than never '
    'although never is often better than right now '
    'if the implementation is hard to explain its a bad idea '
    'if the implementation is easy to explain it may be a good idea '
    'namespaces are one honking great idea lets do more of those'
)
# Sorting the words throws away their order and leaves only the bag of words
print(' '.join(sorted(text.split())))
It is “The Zen of Python” with word order removed
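A bag of words can also be stored as word counts instead of a sorted string. The following is a minimal sketch using the standard library's collections.Counter on the first two lines of the text; the variable names are illustrative and not from the talk.

```python
from collections import Counter

# First two lines of "The Zen of Python", lower-cased, punctuation removed.
excerpt = 'beautiful is better than ugly explicit is better than implicit'

# A bag of words keeps how often each word occurs and forgets the order.
bag = Counter(excerpt.split())

print(bag.most_common(3))
# Words that never occur simply have a count of zero.
print(bag['better'], bag['ugly'], bag['dutch'])
```

Two documents with the same counts are identical as far as a bag-of-words model is concerned, whatever order their words came in.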

2. Let’s assume that: a topic is a set of words
Politics      Weather  Sports
Conservative  Rain     Olympics
Trump         Sunny    Football
Clinton       Beach    Championship
…             …        …
More formally: a topic is a probability distribution over all words, most of them with tiny probability

3. Let’s assume that: documents are written by following recipes
From “LDA Topic Models using Gensim” by Andrius Knispelis https://vimeo.com/140431085
Recipe:
Politics  Weather  Sports
50%       30%      20%
DOCUMENT: Conservative Liberal Rain President Sun Snow Minister Olympics Football
Topics:
Politics      Weather  Sports
Conservative  Rain     Olympics
Trump         Sunny    Football
Clinton       Beach    Championship
…             …        …
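The recipe idea can be sketched as a tiny simulation: for every word position, pick a topic according to the recipe, then pick a word from that topic. The word probabilities below are made up for illustration, and the function name write_document is hypothetical; full LDA additionally draws each document's recipe from a Dirichlet prior, which this sketch skips.

```python
import random

# Hypothetical topics: each maps words to probabilities (illustrative numbers).
topics = {
    'politics': {'conservative': 0.4, 'trump': 0.3, 'clinton': 0.3},
    'weather': {'rain': 0.4, 'sunny': 0.3, 'beach': 0.3},
    'sports': {'olympics': 0.4, 'football': 0.3, 'championship': 0.3},
}

# The recipe from the slide: 50% politics, 30% weather, 20% sports.
recipe = {'politics': 0.5, 'weather': 0.3, 'sports': 0.2}


def write_document(recipe, topics, length, rng):
    """Generate a document word by word: pick a topic, then a word from it."""
    words = []
    for _ in range(length):
        topic = rng.choices(list(recipe), weights=list(recipe.values()))[0]
        word_probs = topics[topic]
        word = rng.choices(list(word_probs), weights=list(word_probs.values()))[0]
        words.append(word)
    return words


rng = random.Random(2016)  # fixed seed so the run is repeatable
document = write_document(recipe, topics, length=9, rng=rng)
print(' '.join(document))
```

Over many generated words, roughly half should come from the politics topic, which is all the recipe claims.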

The fingerprint of the document is just the Recipe
Dimensionality reduction: from #{vocabulary_size} dimensions -> just three
Recipe:
Politics  Weather  Sports
50%       30%      20%
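Once every document is reduced to its recipe, finding similar content becomes a comparison of short vectors. A minimal sketch, assuming cosine similarity as the comparison (the talk does not say which measure is used); only the first recipe is from the slide, the other two are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Recipes as (politics, weather, sports) proportions.
doc_a = [0.5, 0.3, 0.2]     # the recipe from the slide
doc_b = [0.6, 0.3, 0.1]     # hypothetical: another mostly political document
doc_c = [0.05, 0.05, 0.9]   # hypothetical: a sports document

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

The two mostly political documents score much closer to each other than to the sports document, even though no individual words were compared.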

Behind the scenes: How to come up with topics?
Go backwards with Bayesian Inference! Find the model that describes the documents best!
- “Put words that appear together into the same topic.”
Politics      Weather  Sports
Conservative  Rain     Olympics
Trump         Sunny    Football
Clinton       Beach    Championship
…             …        …
DOCUMENT 1: Conservative Liberal Rain President Sun Snow Minister Olympics Football
DOCUMENT 2: Trump Clinton Storm Swimming Hockey
DOCUMENT 1000: UEFA Arsenal Football
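To make "going backwards" concrete, here is a minimal sketch of one inference method, collapsed Gibbs sampling. This is an assumption for illustration: the talk does not say which inference algorithm is meant, and Gensim's own LDA uses online variational Bayes instead. The toy documents are hypothetical, built from words on the slides, and the function name gibbs_lda is not from any library.

```python
import random
from collections import defaultdict


def gibbs_lda(docs, num_topics, alpha=0.1, beta=0.01, sweeps=200, seed=0):
    """Collapsed Gibbs sampling for LDA on a list of tokenised documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]             # words per topic in each doc
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                           # total words per topic
    assignments = []

    # Start from a random topic for every word.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the word's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Favour topics already common in this doc and for this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


# Hypothetical toy corpus built from words on the slides.
docs = [
    ['trump', 'clinton', 'conservative', 'liberal', 'president', 'minister'],
    ['rain', 'sun', 'snow', 'storm', 'rain', 'sun'],
    ['olympics', 'football', 'hockey', 'swimming', 'football', 'uefa'],
    ['clinton', 'trump', 'president', 'liberal'],
    ['snow', 'storm', 'rain'],
    ['arsenal', 'uefa', 'football', 'hockey'],
]

doc_topic, topic_word = gibbs_lda(docs, num_topics=3)
for k, counts in enumerate(topic_word):
    top = sorted(counts, key=counts.get, reverse=True)[:4]
    print(k, top)
```

The sampling weight is the product of "how much this document already uses the topic" and "how much this topic already uses the word", which is the slide's rule of putting words that appear together into the same topic.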

Hyperparameters of Latent Dirichlet Allocation
User has to fix some parameters:
- Number of topics
- Are docs about a few topics or many? (low or high alpha)
- Do topics have a few words or many? (low or high beta)
- Advanced: pre-seed topics, # of iterations, asymmetric priors, multicore etc.
See paper for details: “Latent Dirichlet Allocation”, David Blei, Andrew Ng, Michael Jordan, 2003 http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
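The effect of alpha can be seen by sampling recipes from the Dirichlet prior directly. A minimal sketch using only the standard library, with a symmetric prior over three topics; the helper names and the alpha values 0.1 and 10 are chosen for illustration and are not from the talk.

```python
import random


def sample_dirichlet(alpha, num_topics, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha) prior."""
    # Normalised independent Gamma(alpha, 1) draws follow a Dirichlet.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_share(alpha, num_topics=3, samples=2000, seed=0):
    """Average share of the dominant topic across many sampled recipes."""
    rng = random.Random(seed)
    largest = [max(sample_dirichlet(alpha, num_topics, rng)) for _ in range(samples)]
    return sum(largest) / samples


low = mean_largest_share(alpha=0.1)    # docs are mostly about one topic
high = mean_largest_share(alpha=10.0)  # docs mix all topics fairly evenly
print('low alpha :', round(low, 2))
print('high alpha:', round(high, 2))
```

With low alpha the dominant topic takes most of each recipe; with high alpha the shares stay close to one third each. Beta plays the same role for the words inside a topic.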

Now that we have several models trained with different hyperparameters...
Which model is the best? Evaluating unsupervised learning.

Manual/qualitative evaluation

From the “Latent Dirichlet Allocation” paper by David M. Blei et al.: words coloured according to their topic

Colouring words in Gensim
bow_water = ['bank', 'water', 'river', 'tree']
color_words(good_lda_model, bow_water)
bank river water tree
color_words(bad_lda_model, bow_water)
bank river water tree
? river bank or financial bank ?
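The definition of color_words is not shown on the slide. The sketch below is not that helper and does not use Gensim; it only illustrates the idea, under the assumption that each word is coloured with the topic that maximises p(topic | document) * p(word | topic). All probabilities are made up.

```python
# Hypothetical model: p(word | topic) for two topics (made-up numbers).
word_given_topic = {
    'nature': {'bank': 0.10, 'water': 0.30, 'river': 0.30, 'tree': 0.30},
    'finance': {'bank': 0.40, 'money': 0.30, 'loan': 0.30},
}


def colour_words(doc_topics, words):
    """Assign each word the topic maximising p(topic | doc) * p(word | topic)."""
    colours = {}
    for word in words:
        scores = {
            topic: share * word_given_topic[topic].get(word, 0.0)
            for topic, share in doc_topics.items()
        }
        colours[word] = max(scores, key=scores.get)
    return colours


# Hypothetical recipes for a document about water and one about money.
water_doc = {'nature': 0.9, 'finance': 0.1}
finance_doc = {'nature': 0.1, 'finance': 0.9}

print(colour_words(water_doc, ['bank', 'water', 'river', 'tree']))
print(colour_words(finance_doc, ['bank', 'money', 'loan']))
```

The same word "bank" gets a different colour depending on the document around it, which is what a good model should do and what a bad model fails to do.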

Credit for the word colouring API goes to our Google Summer of Code student Bhargav Srinivasa, a student at BITS Pilani in Goa, India.
Project: Dynamic Topic Modelling in Python

PyLDAVis
Link to Jupyter notebook: http://small.cat/cat
Tutorial video from the creator of PyLDAVis, Ben Mabey: https://www.youtube.com/watch?v=tGxW2BzC_DU

PyLDAVis: No topic selected.
Blue bar is just frequency in the corpus

PyLDAVis: Topic selected. Jeopardy questions about languages
Red bar is how frequent the word is in the topic.
Blue bar is how frequent the word is in the entire corpus

PyLDAVis: Interactive Walk from World History (18) to American History (54).
Red bar is how frequent the word is in the topic.
Blue bar is how frequent the word is in the entire corpus

Automatic/quantitative evaluation.

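Automated model selection
See “Reading Tea Leaves: How Humans Interpret Topic Models” by Chang, Boyd-Graber et al.
Model fit vs. human opinion: the paper finds that models with a better held-out likelihood are not necessarily the ones humans find more interpretable.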

Topic coherence = human opinion
Coherence is how often the topic words appear ‘together’ in the corpus.
The trick is: there are many ways to define ‘together’...
Just use ‘c_v coherence’: it is the measure that correlated best with human ratings in the paper below.
Paper: “Exploring the Space of Topic Coherence Measures” by M. Röder et al. http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf

Coherence example
Corpus = [
“the game is a team sport”,
“the game is played with a ball”,
“the game demands great physical efforts”
]
Topic to evaluate = {game, sport, ball, team}
Many coherence measures can be tried. Here are two simple ones.
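The transcript does not preserve which two measures the slide showed. As a sketch, assume a UMass-style score (log of a smoothed conditional probability) and a PMI-style score, both simplified to count co-occurrence per whole document rather than in sliding windows. The function names are illustrative.

```python
import math
from itertools import combinations

corpus = [
    'the game is a team sport',
    'the game is played with a ball',
    'the game demands great physical efforts',
]
topic = ['game', 'sport', 'ball', 'team']

# Each document becomes a set of words: only presence matters here.
docs = [set(text.split()) for text in corpus]


def doc_freq(*words):
    """Number of documents that contain all of the given words."""
    return sum(1 for doc in docs if all(w in doc for w in words))


def umass_pair(w1, w2):
    """UMass-style score: log of smoothed P(w1 | w2) from document counts."""
    return math.log((doc_freq(w1, w2) + 1) / doc_freq(w2))


def pmi_pair(w1, w2, eps=1e-12):
    """PMI-style score: log of P(w1, w2) / (P(w1) * P(w2)), one window per doc."""
    n = len(docs)
    joint = doc_freq(w1, w2) / n
    return math.log((joint + eps) / ((doc_freq(w1) / n) * (doc_freq(w2) / n)))


pairs = list(combinations(topic, 2))
umass = sum(umass_pair(a, b) for a, b in pairs) / len(pairs)
pmi = sum(pmi_pair(a, b) for a, b in pairs) / len(pairs)
print('UMass-style coherence:', umass)
print('PMI-style coherence:  ', pmi)
```

In this corpus "sport" and "team" share a document while "sport" and "ball" never do, so the first pair raises the topic's score and the second pair lowers it. The eps smoothing only avoids log(0) for pairs that never co-occur.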

Gensim coherence API
All you need is a corpus and a set of words to find out how coherent the set is.
from gensim.models import CoherenceModel
# goodLdaModel, badLdaModel, texts and dictionary are assumed to exist already
goodcm = CoherenceModel(model=goodLdaModel, texts=texts, dictionary=dictionary, coherence='c_v')
print(goodcm.get_coherence())
# 0.552164532134
badcm = CoherenceModel(model=badLdaModel, texts=texts, dictionary=dictionary, coherence='c_v')
print(badcm.get_coherence())
# 0.5269189184

Credit for adding Topic Coherence to Gensim: our incubator student Devashish Deshpande, a student at BITS in Goa, India.
http://rare-technologies.com/incubator/

Summary: How to choose your next Topic Model
Manually:
- Colour words
- pyLDAVis
Automatically:
- Topic coherence C_v

RARE Training
• customized, interactive corporate training hosted on-site for technical teams of 5-15 developers, engineers, analysts and data scientists
• 2-day intensives include Python Best Practices and Practical Machine Learning, and 1-day intensive Topic Modelling
Industry-leading instructors: RNDr. Radim Řehůřek, Ph.D.; Gordon Mohr, BA in CS & Econ
For more information email [email protected]

Lev Konstantinovskiy @teagermylk [email protected]
See you at our PyCon UK and PyCon India Sprints!
NLP Consulting and Corporate training
