PyCon Slovakia - Gensim

Slides of the talk I presented at PyCon SK on topic modelling and Gensim.

Bhargav Srinivasa

March 12, 2017

Transcript

  1. BUSINESS PROBLEM SOLVED! You can drill down into individual documents by broad topics rather than with keywords!
  2. • Topic models generate topics
     • Topics are a collection of words
     • Documents are a collection of topics
     (Diagram labels: DOCUMENTS, TOPICS, WORDS)
  3. BUT WHAT ABOUT DOCUMENTS? DOCUMENTS ARE ‘MADE’ WITH THE FOLLOWING RECIPES
     Topics:
       Politics: Conservative, Trump, Clinton, …
       Weather: Rain, Sunny, Beach, …
       Sports: Olympics, Football, Championship, …
     Recipe: Politics 50%, Weather 30%, Sports 20%
     DOCUMENT: Conservative, Liberal, Rain, President, Sun, Snow, Minister, Olympics, Football
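The "recipe" idea on this slide can be sketched in a few lines of plain Python. This is only an illustration of the generative story, not how Gensim works internally: the helper name `generate_document` is hypothetical, and it assumes every word within a topic is equally likely, whereas a real LDA model learns a separate probability for each word in each topic.

```python
import random

# Topics are collections of words (word lists taken from the slide).
TOPICS = {
    "Politics": ["Conservative", "Trump", "Clinton"],
    "Weather": ["Rain", "Sunny", "Beach"],
    "Sports": ["Olympics", "Football", "Championship"],
}

# The recipe: the share of each topic in one document (from the slide).
RECIPE = {"Politics": 0.5, "Weather": 0.3, "Sports": 0.2}


def generate_document(recipe, topics, length, seed=None):
    """Hypothetical helper: draw `length` words by first picking a topic
    according to the recipe, then picking a word from that topic.

    Assumption: words inside a topic are drawn uniformly at random.
    """
    rng = random.Random(seed)
    names = list(recipe)
    weights = [recipe[name] for name in names]
    words = []
    for _ in range(length):
        topic = rng.choices(names, weights=weights, k=1)[0]
        words.append(rng.choice(topics[topic]))
    return words


document = generate_document(RECIPE, TOPICS, length=10, seed=0)
print(document)
```

Topic modelling runs this story in reverse: given only the documents, it tries to recover plausible topics and recipes.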
  4. HOW DID WE DO THIS MAGIC?
     Topics:
       Politics: Conservative, Trump, Clinton, …
       Weather: Rain, Sunny, Beach, …
       Sports: Olympics, Football, Championship, …
     DOCUMENT 1: Conservative, Liberal, Rain, President, Sun, Snow, Minister, Olympics, Football
     DOCUMENT 2: Trump, Clinton, Storm, Swimming, Hockey
     DOCUMENT 1000: UEFA, PSG, Lille, Football
     We just go backwards by using Bayesian inference :) We find the model that describes the documents best: “Put words that appear together into the same topic”.
  5. PARAMETERS OF LATENT DIRICHLET ALLOCATION?
     • model = LdaModel(corpus=corpus, num_topics=10)
     • and you are good to go!
     • You can do more, of course.
  6. Is it a RIVER BANK or a FINANCIAL BANK? EXAMPLE
     If the sentence contains the following words:
     sentence = "BANK WATER RIVER TREE"
     and we want to find out whether a model is good or bad:
     color_words(bad_lda, sentence)
     BANK WATER WATER TREE
     color_words(good_lda, sentence)
     BANK WATER WATER TREE
     (On the slide each word is colored by its topic; the colors are not preserved in this transcript.)
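`color_words` is a helper used in the talk, not a function of the Gensim library, and its source is not part of this transcript. The following is a rough sketch of how such a helper could be built on `LdaModel.get_document_topics(..., per_word_topics=True)`, which returns the most probable topics for each word of a document. The names `most_likely_topics` and `color_words_sketch`, the ANSI color choice, and the lowercasing step (which assumes the dictionary was built from lowercase tokens) are all assumptions made for this example.

```python
# ANSI escape codes, one per topic id (an arbitrary choice for this sketch).
ANSI_COLORS = ["\033[31m", "\033[34m", "\033[32m", "\033[33m"]
ANSI_RESET = "\033[0m"


def most_likely_topics(model, dictionary, words):
    """Hypothetical helper: map each known word to its most probable topic id.

    `model` is expected to be a trained gensim LdaModel and `dictionary`
    the gensim Dictionary it was trained with.
    """
    bow = dictionary.doc2bow(words)
    # With per_word_topics=True the second return value is a list of
    # (word_id, [topic_id, ...]) pairs, topics sorted by relevance.
    _, word_topics, _ = model.get_document_topics(bow, per_word_topics=True)
    best = {}
    for word_id, topic_ids in word_topics:
        if topic_ids:
            best[dictionary[word_id]] = topic_ids[0]
    return best


def color_words_sketch(model, dictionary, sentence):
    """Hypothetical stand-in for the talk's color_words helper."""
    # Assumption: the dictionary holds lowercase tokens.
    words = sentence.lower().split()
    best = most_likely_topics(model, dictionary, words)
    colored = []
    for word in words:
        topic = best.get(word)
        if topic is None:
            # Word unknown to the model: leave it uncolored.
            colored.append(word)
        else:
            color = ANSI_COLORS[topic % len(ANSI_COLORS)]
            colored.append(color + word + ANSI_RESET)
    return " ".join(colored)
```

With a trained model and its dictionary, `print(color_words_sketch(model, dictionary, "BANK WATER RIVER TREE"))` would show each known word in the color of its most probable topic. A good model should give "bank" the same color as "water" and "river" in this sentence.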
  7. AUTOMATIC EVALUATION - COHERENCE
     If you’re lazy and want numbers:
     from gensim.models import CoherenceModel
     goodcm = CoherenceModel(model=goodLdaModel, texts=texts, dictionary=dictionary, coherence='c_v')
     print(goodcm.get_coherence())  # 0.552164532134
     badcm = CoherenceModel(model=badLdaModel, texts=texts, dictionary=dictionary, coherence='c_v')
     print(badcm.get_coherence())  # 0.5269189184
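To give an intuition for what a coherence number measures, here is a small pure-Python sketch of a UMass-style score. It rewards topics whose top words tend to occur in the same documents. This is a simplified illustration, not the `c_v` measure used on the slide and not Gensim's implementation; the function name `umass_style_coherence` and the toy data are made up for the example.

```python
import math


def umass_style_coherence(topic_words, texts):
    """Average of log((D(w_i, w_j) + 1) / D(w_j)) over ordered word pairs,
    where D counts the documents containing all the given words.

    Simplified sketch: higher (closer to zero) means the topic's words
    co-occur more often in the same documents.
    """
    doc_sets = [set(text) for text in texts]

    def doc_freq(*words):
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    scores = []
    for i in range(1, len(topic_words)):
        for j in range(i):
            w_i, w_j = topic_words[i], topic_words[j]
            d_j = doc_freq(w_j)
            if d_j == 0:
                # Skip words that never occur; the ratio is undefined.
                continue
            scores.append(math.log((doc_freq(w_i, w_j) + 1) / d_j))
    return sum(scores) / len(scores) if scores else float("nan")


# Toy tokenized documents (illustrative only).
texts = [
    ["bank", "money", "loan"],
    ["bank", "money", "finance"],
    ["river", "water", "tree"],
    ["river", "water", "bank"],
]

good_topic = ["bank", "money", "loan"]   # words that appear together
bad_topic = ["money", "water", "loan"]   # words that rarely co-occur

good_score = umass_style_coherence(good_topic, texts)
bad_score = umass_style_coherence(bad_topic, texts)
print(good_score, bad_score)
```

The topic whose words share documents gets the higher score, which is the same kind of comparison the slide makes between the good and the bad model.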
  8. WHAT HAVE WE LEARNED?
     • How to make sense of unstructured textual data
     • How to do it very easily with Python and Gensim
     • How to color words in a document
     • How to find the best model!
     • What else can we do?
  9. • Dynamic Topic Modelling!
     • Vector Space Text Modelling - Doc2Vec and Word2Vec!
     • LSI, HDP for more topic modelling!
     • And why is this useful or important?
     • Document Similarity over time-periods
     • Easy text analysis with a variety of algorithms
  10. WHAT NEXT?
     • Gensim and RaRe Technologies, the parent company, regularly hold Sprints and Tutorials
     • Variety of tutorials, practical implementations on the GitHub page
     • Would be very happy to see Gensim being used to solve business problems in Slovakia
     • Even happier to see everyone contributing to Gensim :)
  11. ABOUT ME
     • Google Summer of Code 2016 with Gensim
     • Currently research assistant at INRIA Lille, France.
     • My French is pretty bad
     • Still - Dobrý deň!
     • My Slovak is worse
     https://github.com/bhargavvader
     https://twitter.com/bhargavvader