Business Problem solved by Topic Modelling
- Bird’s eye view of internal company documents
- Drill down into individual documents by topic, rather than just keywords!
Colouring words in Gensim
(each word is drawn in the colour of its most likely topic)

    bow_water = ['bank', 'water', 'river', 'tree']

    color_words(goodLdaModel, bow_water)
    bank river water tree

    color_words(badLdaModel, bow_water)
    bank river water tree

River bank or financial bank?
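The idea behind word colouring can be sketched without Gensim. This is a minimal illustration, not the `color_words` helper used on the slide: the table `TOPIC_WORD`, the functions `word_topic` and `colour_words`, and all the numbers are invented for the example. It assumes that a word's topic in a given document is roughly the topic that maximises (topic weight in the document) × (probability of the word under the topic), which is a simplification of what LDA inference does.

```python
# Hypothetical topic-word probabilities; invented numbers, not a trained model.
TOPIC_WORD = {
    "nature":  {"bank": 0.10, "water": 0.30, "river": 0.30, "tree": 0.30},
    "finance": {"bank": 0.40, "loan": 0.30, "money": 0.30},
}

# ANSI escape codes for terminal colours (blue and red).
TOPIC_COLOUR = {"nature": "\033[34m", "finance": "\033[31m"}
RESET = "\033[0m"


def word_topic(word, doc_topic_mix):
    """Pick the topic with the highest (document weight * word probability).

    Returns None when no topic gives the word a non-zero score.
    """
    scores = {
        topic: doc_topic_mix.get(topic, 0.0) * probs.get(word, 0.0)
        for topic, probs in TOPIC_WORD.items()
    }
    if max(scores.values()) == 0:
        return None
    return max(scores, key=scores.get)


def colour_words(words, doc_topic_mix):
    """Wrap each word in the colour of its chosen topic; unknown words stay plain."""
    coloured = []
    for word in words:
        topic = word_topic(word, doc_topic_mix)
        if topic is None:
            coloured.append(word)
        else:
            coloured.append(TOPIC_COLOUR[topic] + word + RESET)
    return " ".join(coloured)


bow_water = ["bank", "water", "river", "tree"]
# In a mostly-nature document, 'bank' is coloured like 'river'.
print(colour_words(bow_water, {"nature": 0.9, "finance": 0.1}))
# In a mostly-finance document, the same word takes the finance colour.
print(colour_words(["bank", "loan"], {"nature": 0.1, "finance": 0.9}))
```

The same word can land in different topics depending on the document around it, which is why the colour of 'bank' is a quick visual check on a model.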
Topic coherence = human opinion
- Coherence is how often the topic words appear ‘together’ in the corpus.
- There are many ways to define ‘together’; ‘c_v’ is the best one.

    from gensim.models import CoherenceModel

    goodcm = CoherenceModel(model=goodLdaModel, texts=texts, dictionary=dictionary, coherence='c_v')
    print(goodcm.get_coherence())
    0.552164532134

    badcm = CoherenceModel(model=badLdaModel, texts=texts, dictionary=dictionary, coherence='c_v')
    print(badcm.get_coherence())
    0.5269189184
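To make ‘together’ concrete, here is a small sketch of one of the simplest possible definitions: two words appear together when they occur in the same document. This is not ‘c_v’, which is more involved; it is a plain pairwise document-overlap (Jaccard) score. The function name `pairwise_coherence` and the toy `texts` are made up for the example.

```python
from itertools import combinations


def pairwise_coherence(topic_words, texts):
    """Average Jaccard overlap of the document sets of each pair of topic words.

    For every pair, the score is
    (documents containing both words) / (documents containing either word).
    """
    doc_sets = [set(doc) for doc in texts]
    scores = []
    for w1, w2 in combinations(topic_words, 2):
        both = sum(1 for d in doc_sets if w1 in d and w2 in d)
        either = sum(1 for d in doc_sets if w1 in d or w2 in d)
        scores.append(both / either if either else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy tokenised corpus, invented for illustration.
texts = [
    ["river", "bank", "water"],
    ["water", "tree", "river"],
    ["bank", "loan", "money"],
    ["money", "loan", "interest"],
]

# Words that always show up in the same documents score high ...
print(pairwise_coherence(["river", "water"], texts))
# ... words that never share a document score zero.
print(pairwise_coherence(["river", "loan"], texts))
```

A topic whose top words keep company in the corpus scores high, and a topic that mixes unrelated words scores low, which is the behaviour a human reader would expect.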
Lev Konstantinovskiy
@teagermylk
- Topic coherence by our incubator student Devashish Deshpande
- Word colouring by our Google Summer of Code student Bhargav Srinivasa
See you at PyCon UK Sprints! Monday 19 September
Topic Model of Harry Potter
Chapter 1 of Book 1 introduces the Dursley family and has Dumbledore discuss the death of Harry’s parents.
- 40% Muggle topic
- 30% Voldemort topic
- 30% Harry topic