About
Lev Konstantinovskiy, @teagermylk, https://github.com/tmylk
- Graduate school drop-out in Algebraic Geometry
- Worked in Trading IT
- Graduate of Galvanize Data Science Bootcamp in San Francisco
- Community manager and consultant at RaRe Technologies: NLP consulting and open-source development of Natural Language Processing packages in Python
Business problem
The main corporate asset is data. But how do you use it? Imagine you have 10 gigabytes of text. What is in it? How do you navigate it? Which keywords should you search for?
Business problem solved by Topic Modelling
- Bird’s-eye view of internal company documents
- Drill down into individual documents by topic, rather than just by keywords!
Let’s look at one specific Topic Modelling approach: Latent Dirichlet Allocation See paper for details: “Latent Dirichlet Allocation” David Blei, Andrew Ng, Michael Jordan 2003 http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
1. Let’s assume: documents are just bags of words

a a although although although ambiguity and are arent at bad be be be beats beautiful better better better better better better better better break cases complex complex complicated counts dense do do dutch easy enough errors explain explain explicit explicitly face first flat good great guess hard honking idea idea idea if if implementation implementation implicit in is is is is is is is is is is it it its lets may may more namespaces nested never never never not now now obvious obvious of of often one one one only pass practicality preferably purity readability refuse right rules should should silenced silently simple sparse special special temptation than than than than than than than than that the the the the the there those to to to to to ugly unless unless way way youre

Do you recognise this text?
Example from Timothy Hopper: Understanding Probabilistic Topic Models By Simulation https://www.youtube.com/watch?v=_R66X_udxZQ
import collections

text = 'beautiful is better than ugly explicit is better than implicit simple is better than complex complex is better than complicated flat is better than nested sparse is better than dense readability counts special cases arent special enough to break the rules although practicality beats purity errors should never pass silently unless explicitly silenced in the face of ambiguity refuse the temptation to guess there should be one and preferably only one obvious way to do it although that way may not be obvious at first unless youre dutch now is better than never although never is often better than right now if the implementation is hard to explain its a bad idea if the implementation is easy to explain it may be a good idea namespaces are one honking great idea lets do more of those'

' '.join(sorted(text.split(' ')))

It is “The Zen of Python” with the word order removed.
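The bag-of-words assumption can also be expressed as word counts. The snippet below is a minimal sketch that uses only a short excerpt of the text above; it shows that two sentences with the same words in a different order produce the same bag.

```python
from collections import Counter

# A short excerpt of the lower-cased, punctuation-free text from the slide.
text = "beautiful is better than ugly explicit is better than implicit"

# A bag of words keeps only how often each word occurs, not where it occurs.
bag = Counter(text.split())
print(bag.most_common(3))

# Word order does not matter: these two sentences have identical bags.
same_bag = Counter("beautiful is better than ugly".split()) == Counter(
    "ugly is better than beautiful".split()
)
print(same_bag)
```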
2. Let’s assume that: a topic is a set of words

Politics      Weather  Sports
Conservative  Rain     Olympics
Trump         Sunny    Football
Clinton       Beach    Championship
…             …        …

More formally: a topic is a probability distribution over all words, most of them with tiny probability.
3. Let’s assume that: documents are written by following recipes

Recipe: Politics 50%, Weather 30%, Sports 20%

Politics      Weather  Sports
Conservative  Rain     Olympics
Trump         Sunny    Football
Clinton       Beach    Championship
…             …        …

DOCUMENT: Conservative Liberal Rain President Sun Snow Minister Olympics Football

LDA Topic Models using Gensim, from Andrius Knispelis: https://vimeo.com/140431085
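The recipe idea can be simulated in a few lines. This is only a sketch: the word probabilities inside each topic are made-up numbers for illustration, and `write_document` is a hypothetical helper, not a Gensim function. For every word, it first picks a topic according to the recipe and then picks a word from that topic.

```python
import random

# Hypothetical topics: each maps words to probabilities that sum to 1.
topics = {
    "politics": {"conservative": 0.4, "liberal": 0.3, "president": 0.2, "minister": 0.1},
    "weather": {"rain": 0.4, "sun": 0.4, "snow": 0.2},
    "sports": {"olympics": 0.5, "football": 0.5},
}

# The recipe from the slide: how much of each topic goes into the document.
recipe = {"politics": 0.5, "weather": 0.3, "sports": 0.2}


def write_document(recipe, topics, n_words, rng):
    """Generate a document word by word: pick a topic, then a word from it."""
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(recipe), weights=list(recipe.values()))[0]
        word_probs = topics[topic]
        word = rng.choices(list(word_probs), weights=list(word_probs.values()))[0]
        words.append(word)
    return words


rng = random.Random(0)  # fixed seed so the run is repeatable
document = write_document(recipe, topics, n_words=9, rng=rng)
print(" ".join(document))
```

Topic modelling runs this story in reverse: given only the documents, it estimates the topics and each document's recipe.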
The fingerprint of the document is just the Recipe
Dimensionality reduction: from #{vocabulary_size} dimensions down to just three.
Recipe: Politics 50%, Weather 30%, Sports 20%
Behind the scenes: how to come up with topics?
Go backwards with Bayesian inference! Find the model that describes the documents best.
- “Put words that appear together into the same topic.”

Politics      Weather  Sports
Conservative  Rain     Olympics
Trump         Sunny    Football
Clinton       Beach    Championship
…             …        …

DOCUMENT 1: Conservative Liberal Rain President Sun Snow Minister Olympics Football
DOCUMENT 2: Trump Clinton Storm Swimming Hockey
DOCUMENT 1000: UEFA Arsenal Football
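To make "going backwards" concrete, here is a toy collapsed Gibbs sampler for LDA in pure Python. This is a swapped-in technique chosen because it is short: Gensim's LdaModel uses online variational Bayes, not Gibbs sampling. The function `gibbs_lda` and its default values are assumptions made for this sketch, and three tiny documents are far too little data to recover meaningful topics; the point is only to show the mechanics of reassigning each word to a topic based on its document and on the other words in that topic.

```python
import random


def gibbs_lda(docs, n_topics, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Toy collapsed Gibbs sampler; returns doc-topic and topic-word counts."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]                 # words per topic in each doc
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                               # total words per topic
    assignments = []

    # Start from a random topic assignment for every word.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the word's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Prefer topics common in this document and topics that like this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


docs = [
    "conservative liberal rain president sun snow minister olympics football".split(),
    "trump clinton storm swimming hockey".split(),
    "uefa arsenal football".split(),
]
doc_topic, topic_word = gibbs_lda(docs, n_topics=3)
for d, counts in enumerate(doc_topic):
    print(f"document {d}: words per topic = {counts}")
```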
Hyperparameters of Latent Dirichlet Allocation
User has to fix some parameters:
- Number of topics
- Are docs about a few topics or many? (low or high alpha)
- Do topics have a few words or many? (low or high beta)
- Advanced: pre-seed topics, # of iterations, asymmetric priors, multicore etc.
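The effect of a low or high alpha can be seen by sampling recipes from a symmetric Dirichlet prior. The sketch below uses only the standard library; `sample_recipe` and `mean_largest_share` are hypothetical helper names, and the alpha values 0.1 and 10.0 are arbitrary choices for illustration. With a low alpha most recipes are dominated by one topic; with a high alpha the topics get similar shares. In Gensim's LdaModel the corresponding arguments are `num_topics`, `alpha` and `eta` (the slide's beta).

```python
import random


def sample_recipe(alpha, n_topics, rng):
    """Draw topic proportions from a symmetric Dirichlet(alpha) prior."""
    # Normalising independent Gamma(alpha, 1) draws yields a Dirichlet sample.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_topics)]
    total = sum(draws)
    return [x / total for x in draws]


def mean_largest_share(alpha, n_topics=3, n_samples=1000, seed=0):
    """Average share of the dominant topic over many sampled recipes."""
    rng = random.Random(seed)
    shares = (max(sample_recipe(alpha, n_topics, rng)) for _ in range(n_samples))
    return sum(shares) / n_samples


low = mean_largest_share(alpha=0.1)
high = mean_largest_share(alpha=10.0)
print(f"alpha=0.1:  dominant topic share averages {low:.2f}")
print(f"alpha=10.0: dominant topic share averages {high:.2f}")
```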
Colouring words in Gensim

bow_water = ['bank', 'water', 'river', 'tree']

color_words(good_lda_model, bow_water)
bank river water tree

color_words(bad_lda_model, bow_water)
bank river water tree

? river bank or financial bank ?
Credit for the word colouring API goes to our Google Summer of Code student Bhargav Srinivasa, a student at BITS Pilani university in Goa, India. Project: Dynamic Topic Modelling in Python
PyLDAVis: Topic selected. Jeopardy questions about languages Red bar is how frequent the word is in the topic. Blue bar is how frequent the word is in the entire corpus
PyLDAVis: Interactive. Walk from World History (18) to American History (54). Red bar is how frequent the word is in the topic. Blue bar is how frequent the word is in the entire corpus
Topic coherence = human opinion
Coherence is how often the topic words appear ‘together’ in the corpus. The trick is that there are many ways to define ‘together’...
Just use ‘c_v’ coherence: it is the measure that agreed best with human ratings in the paper below.
Paper: “Exploring the Space of Topic Coherence Measures” by M. Röder et al. http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf
Coherence example
Corpus = [
  “the game is a team sport”,
  “the game is played with a ball”,
  “the game demands great physical efforts”
]
Topic to evaluate = {game, sport, ball, team}
Many coherence measures can be tried. Here are two simple ones.
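As an illustration, the sketch below scores the example topic with two commonly used simple measures, a UMass-style score and a UCI-style pointwise mutual information. These may not be the exact two measures shown on the original slide. As a simplifying assumption, ‘together’ here means ‘in the same document’ for both measures (the UCI measure is normally computed over sliding windows), and the helper names are hypothetical.

```python
import math
from itertools import combinations

corpus = [
    "the game is a team sport",
    "the game is played with a ball",
    "the game demands great physical efforts",
]
topic = ["game", "sport", "ball", "team"]

# Treat each document as a set of words.
doc_sets = [set(doc.split()) for doc in corpus]
n_docs = len(doc_sets)


def doc_freq(*words):
    """Number of documents that contain all of the given words."""
    return sum(1 for doc in doc_sets if all(w in doc for w in words))


def umass_pair(earlier, later):
    """UMass-style score, conditioned on the earlier (higher-ranked) topic word."""
    return math.log((doc_freq(earlier, later) + 1) / doc_freq(earlier))


def pmi_pair(w_i, w_j, eps=1e-12):
    """UCI-style pointwise mutual information; eps avoids log(0)."""
    p_ij = doc_freq(w_i, w_j) / n_docs
    p_i = doc_freq(w_i) / n_docs
    p_j = doc_freq(w_j) / n_docs
    return math.log((p_ij + eps) / (p_i * p_j))


# Average each pairwise score over all pairs of topic words.
pairs = list(combinations(topic, 2))
umass = sum(umass_pair(a, b) for a, b in pairs) / len(pairs)
uci = sum(pmi_pair(a, b) for a, b in pairs) / len(pairs)
print(f"UMass-style: {umass:.3f}  UCI-style: {uci:.3f}")
```

Pairs that never share a document, such as sport and ball, pull both averages down; that is the sense in which a coherent topic is one whose words appear together.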
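Gensim coherence API
All you need is a corpus and a set of words to find out how coherent the set is.

from gensim.models import CoherenceModel

# Score the topics of the good model with c_v coherence.
goodcm = CoherenceModel(model=goodLdaModel, texts=texts, dictionary=dictionary, coherence='c_v')
print(goodcm.get_coherence())
0.552164532134

# Score the topics of the bad model the same way; a lower value means less coherent topics.
badcm = CoherenceModel(model=badLdaModel, texts=texts, dictionary=dictionary, coherence='c_v')
print(badcm.get_coherence())
0.5269189184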
Credit for adding Topic Coherence to Gensim Our incubator student Devashish Deshpande. Student at BITS in Goa, India http://rare-technologies.com/incubator/
RARE Training
• Customized, interactive corporate training hosted on-site for technical teams of 5-15 developers, engineers, analysts and data scientists
• 2-day intensives include Python Best Practices and Practical Machine Learning, and a 1-day intensive on Topic Modelling
Industry-leading instructors: RNDr. Radim Řehůřek, Ph.D.; Gordon Mohr, BA in CS & Econ
For more information email [email protected] m