PyCon Slovakia - Gensim

PYTHON AND GENSIM INTRODUCE: TOPIC MODELLING!

Gensim does Natural Language Processing
• It’s fast
• It’s easy

BUSINESS PROBLEM

BUSINESS PROBLEM SOLVED!
You can drill down into individual documents by broad topics rather than by keywords!

Topic Modelling for research!

• Topic models generate topics
• Topics are a collection of words
• Documents are a collection of topics
DOCUMENTS → TOPICS → WORDS

Politics      Weather   Sports
Conservative  Rain      Olympics
Trump         Sunny     Football
Clinton       Beach     Championship
….            ….        ….

BUT WHAT ABOUT DOCUMENTS?
Documents are ‘made’ with the following recipes.

Politics      Weather   Sports
Conservative  Rain      Olympics
Trump         Sunny     Football
Clinton       Beach     Championship
….            ….        ….

Recipe: Politics 50%, Weather 30%, Sports 20%

DOCUMENT: Conservative Liberal Rain President Sun Snow Minister Olympics Football

Recipe: Politics 50%, Weather 30%, Sports 20%
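
The recipe idea can be sketched in a few lines of plain Python. This is only an illustration of the generative story, not how Gensim works internally: the function name `generate_document` is hypothetical, and it assumes every word inside a topic is equally likely, whereas a real topic model learns a probability for each word.

```python
import random

# Topics are collections of words (taken from the slide).
topics = {
    "Politics": ["Conservative", "Trump", "Clinton"],
    "Weather": ["Rain", "Sunny", "Beach"],
    "Sports": ["Olympics", "Football", "Championship"],
}

# The recipe: how much of each topic goes into the document.
recipe = {"Politics": 0.5, "Weather": 0.3, "Sports": 0.2}


def generate_document(recipe, topics, length, rng):
    """Draw `length` words: pick a topic by its recipe weight, then a word from it."""
    names = list(recipe)
    weights = [recipe[name] for name in names]
    words = []
    for _ in range(length):
        topic = rng.choices(names, weights=weights, k=1)[0]
        # Assumption: words within a topic are equally likely.
        words.append(rng.choice(topics[topic]))
    return words


rng = random.Random(0)  # fixed seed so the sketch is repeatable
document = generate_document(recipe, topics, 9, rng)
print(document)
```

Topic modelling runs this story in reverse: given only the documents, it estimates the topics and the recipes.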

HOW DID WE DO THIS MAGIC?

Politics      Weather   Sports
Conservative  Rain      Olympics
Trump         Sunny     Football
Clinton       Beach     Championship
….            ….        ….

DOCUMENT 1: Conservative Liberal Rain President Sun Snow Minister Olympics Football
DOCUMENT 2: Trump Clinton Storm Swimming Hockey
DOCUMENT 1000: UEFA PSG Lille Football

We just go backwards by using Bayesian Inference :)
We find the model that describes the documents best.
“Put words that appear together into the same topic”

Latent Dirichlet Allocation.

PARAMETERS OF LATENT DIRICHLET ALLOCATION?
• model = LdaModel(corpus=corpus, num_topics=10)
• and you are good to go!
• You can do more, of course.

EVALUATING TOPIC MODELS
BUT WHICH MODEL IS THE BEST?
1. Manual evaluation!
2. Topic coherence

COLOR WORDS ACCORDING TO TOPIC!

EXAMPLE: Is it a RIVER BANK or a FINANCIAL BANK?
Suppose the sentence contains the following words:
    sentence = "BANK WATER RIVER TREE"
and we want to find out whether a model is good or bad:
    color_words(bad_lda, sentence)   BANK WATER RIVER TREE
    color_words(good_lda, sentence)  BANK WATER RIVER TREE
Each word is shown in the color of the topic the model assigns to it.

AUTOMATIC EVALUATION - COHERENCE
If you’re lazy and want numbers:

    goodcm = CoherenceModel(model=goodLdaModel, texts=texts, dictionary=dictionary, coherence='c_v')
    print(goodcm.get_coherence())
    # 0.552164532134

    badcm = CoherenceModel(model=badLdaModel, texts=texts, dictionary=dictionary, coherence='c_v')
    print(badcm.get_coherence())
    # 0.5269189184

The higher coherence score belongs to the better model.
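
To get a feel for what a coherence number measures, here is a hand-rolled, simplified UMass-style score in plain Python. It is not Gensim's `c_v` measure and will not reproduce the numbers above; the function name `umass_style_coherence` and the toy documents are hypothetical. The idea: a topic is coherent when its top words tend to occur in the same documents. The sketch assumes every top word occurs in at least one document.

```python
import math


def umass_style_coherence(top_words, documents):
    """Average log((co-document count + 1) / document count) over ordered word pairs."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    pairs = 0
    for i in range(1, len(top_words)):
        for j in range(i):
            later, earlier = top_words[i], top_words[j]
            # Assumption: `earlier` occurs in at least one document.
            score += math.log((doc_freq(later, earlier) + 1.0) / doc_freq(earlier))
            pairs += 1
    return score / pairs


documents = [
    ["bank", "river", "water", "tree"],
    ["river", "water", "tree", "bank"],
    ["bank", "money", "loan", "finance"],
    ["money", "loan", "finance", "bank"],
]

good_topic = ["river", "water", "tree"]  # words that appear together
bad_topic = ["river", "money", "tree"]   # mixes the two senses

good_score = umass_style_coherence(good_topic, documents)
bad_score = umass_style_coherence(bad_topic, documents)
print(good_score, bad_score)
```

The topic whose words co-occur gets the higher score, which is the same comparison the `CoherenceModel` calls make between the good and the bad model.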

WHAT HAVE WE LEARNED?
• How to make sense of unstructured textual data
• How to do it very easily with Python and Gensim
• How to color words in a document
• How to find the best model!
• What else can we do?

• Dynamic Topic Modelling!
• Vector Space Text Modelling - Doc2Vec and Word2Vec!
• LSI, HDP for more topic modelling!
• And why is this useful or important?
• Document Similarity over time-periods
• Easy text analysis with a variety of algorithms

WHAT NEXT?
• Gensim and RaRe Technologies, the parent company, regularly hold Sprints and Tutorials
• A variety of tutorials and practical implementations are on the GitHub page
• Would be very happy to see Gensim being used to solve business problems in Slovakia
• Even happier to see everyone contributing to Gensim :)

ABOUT ME
• Google Summer of Code 2016 with Gensim
• Currently research assistant at INRIA Lille, France.
• My French is pretty bad
• My Slovak is worse
• Still - Dobrý deň!
https://github.com/bhargavvader
https://twitter.com/bhargavvader

THANK YOU!

Bhargav Srinivasa
