America's Next Topic Model Lightning talk (5 mins)

Presented at PyData London, 5 July 2016

Lev Konstantinovskiy

July 05, 2016

Transcript

  1. America’s Next Topic Model
    Lev Konstantinovskiy
    Community Manager at Gensim
    @teagermylk
    http://rare-technologies.com/

  2. Streaming
    Topic Modelling and Word2vec in Python

  3. The questions I get asked all the time:
    - Why is your hair blue?
    - How to choose the best Topic Model?

  4. Business Problem solved by Topic Modelling
    Bird’s-eye view of internal company documents.
    Drill down into individual documents by topic,
    rather than just by keywords!

  5. From the “Latent Dirichlet Allocation” paper by David M. Blei et al.
    Words colored according to their topic

  6. Colouring words in Gensim
    bow_water = ['bank','water','river', 'tree']
    color_words(goodLdaModel, bow_water)
    bank river water tree
    color_words(badLdaModel, bow_water)
    bank river water tree
    River bank or financial bank?
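
The idea behind word colouring can be sketched without a trained model. The snippet below is a toy illustration only: the topic names, the probabilities, and the helpers `dominant_topic` and `color_words_sketch` are all hypothetical and are not the `color_words` function used on the slide. Each word gets the colour of its most probable topic, and stays uncoloured when the top two topics are too close to call.

```python
# Toy sketch: colour each word by its most probable topic.
# All topic names and probabilities are made up for illustration;
# they are NOT output from a trained LDA model.

ANSI = {"nature": "\033[32m", "finance": "\033[34m"}  # green, blue
RESET = "\033[0m"

# Hypothetical P(word | topic) tables for two models.
good_model = {
    "nature": {"bank": 0.10, "water": 0.30, "river": 0.35, "tree": 0.25},
    "finance": {"bank": 0.02, "water": 0.01, "river": 0.01, "tree": 0.01},
}
bad_model = {
    "nature": {"bank": 0.05, "water": 0.30, "river": 0.35, "tree": 0.25},
    "finance": {"bank": 0.05, "water": 0.01, "river": 0.01, "tree": 0.01},
}


def dominant_topic(model, word, margin=0.01):
    """Return the most probable topic for `word`, or None when the
    top two topics are within `margin` of each other (ambiguous)."""
    ranked = sorted(
        ((probs.get(word, 0.0), topic) for topic, probs in model.items()),
        reverse=True,
    )
    if len(ranked) > 1 and ranked[0][0] - ranked[1][0] < margin:
        return None
    return ranked[0][1]


def color_words_sketch(model, words):
    """Wrap each word in the ANSI colour of its dominant topic."""
    out = []
    for word in words:
        topic = dominant_topic(model, word)
        out.append(word if topic is None else ANSI[topic] + word + RESET)
    return " ".join(out)


words = ["bank", "water", "river", "tree"]
print(color_words_sketch(good_model, words))
print(color_words_sketch(bad_model, words))
```

In this toy setup the second model leaves "bank" uncoloured because both topics give it the same probability, which mirrors the ambiguity the slide points at.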

  7. Automated model selection
    See “Reading Tea Leaves: How Humans Interpret Topic Models” by Chang, Boyd-Graber et al.
    [Figure: model fit vs. human opinion]

  8. Topic coherence = human opinion
    Coherence is how often the topic words appear ‘together’ in the
    corpus. Many ways to define ‘together’ - ‘c_v’ is the best one.
    from gensim.models import CoherenceModel
    # Score the good model
    goodcm = CoherenceModel(model=goodLdaModel, texts=texts,
                            dictionary=dictionary, coherence='c_v')
    print(goodcm.get_coherence())
    0.552164532134
    # Score the bad model
    badcm = CoherenceModel(model=badLdaModel, texts=texts,
                           dictionary=dictionary, coherence='c_v')
    print(badcm.get_coherence())
    0.5269189184
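
To make "appear together" concrete, here is a minimal sketch of one simple way to score it: the average normalised pointwise mutual information (NPMI) over pairs of a topic's top words, counting co-occurrence per document. This is a simplified stand-in and not the c_v measure itself, which is more elaborate. The function name `npmi_coherence` and the tiny corpus are assumptions made for illustration.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, texts, eps=1e-12):
    """Average NPMI over all pairs of `top_words`.

    Two words count as appearing 'together' when they occur in the same
    document. This is a simplification, not gensim's c_v measure.
    """
    docs = [set(doc) for doc in texts]
    n_docs = len(docs)

    def prob(*words):
        # Fraction of documents that contain every given word.
        return sum(all(w in doc for w in words) for doc in docs) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # never together: lowest possible NPMI
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / (-math.log(p12) + eps))
    return sum(scores) / len(scores)


# Hypothetical toy corpus: two 'river' documents and two 'money' documents.
texts = [
    ["river", "bank", "water"],
    ["river", "water", "tree"],
    ["bank", "money", "loan"],
    ["money", "loan", "interest"],
]

coherent_topic = ["river", "water", "tree"]
mixed_topic = ["river", "money", "tree"]
print(npmi_coherence(coherent_topic, texts))
print(npmi_coherence(mixed_topic, texts))
```

On this toy corpus the topic whose words share documents scores higher than the one that mixes river and money vocabulary, which is the behaviour a coherence measure is meant to capture.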

  9. Summary: How to choose your next Topic Model:
    Manually:
    - Colour words
    - pyLDAvis
    Automatically:
    - Topic coherence C_v

  10. Why is your hair blue?
    Trying to fit in at PyCon in Portland, Oregon

  11. Lev Konstantinovskiy @teagermylk
    Topic coherence by our incubator student Devashish Deshpande
    Word colouring by our Google Summer of Code student Bhargav Srinivasa
    See you at PyCon UK Sprints! Monday 19 September

  12. Topic Model of Harry Potter
    1. (the Muggle topic) 50% “Muggle”, 25% “Dursley”, 10% “Privet”, 5% “Mudblood”…
    2. (the Voldemort topic) 65% “Voldemort”, 12% “Death”, 10% “Horcrux”, 5% “Snake”…
    3. (the Harry topic) 42% “Harry Potter”, 15% “Scar”, 7% “Quidditch”, 7% “Gryffindor”…

  13. Topic Model of Harry Potter
    Chapter 1 of Book 1: introduces the Dursley family and has
    Dumbledore discuss the death of Harry’s parents.
    - 40% Muggle topic
    - 30% Voldemort topic
    - 30% Harry topic
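
The two Harry Potter slides fit together as a mixture: the chance of seeing a word in a chapter is the chapter's topic weights multiplied by each topic's word probabilities, summed over topics. The sketch below works this out with the numbers from the slides. It assumes, purely for illustration, that a word not listed under a topic has probability 0 there (the slides elide the rest of each topic), and `word_probability` is a hypothetical helper name.

```python
# Word probabilities per topic, as listed on the slide.
# Assumption: words not listed under a topic are treated as probability 0.
topics = {
    "Muggle": {"Muggle": 0.50, "Dursley": 0.25, "Privet": 0.10, "Mudblood": 0.05},
    "Voldemort": {"Voldemort": 0.65, "Death": 0.12, "Horcrux": 0.10, "Snake": 0.05},
    "Harry": {"Harry Potter": 0.42, "Scar": 0.15, "Quidditch": 0.07, "Gryffindor": 0.07},
}

# Topic mixture of Chapter 1 of Book 1, as listed on the slide.
chapter_1 = {"Muggle": 0.40, "Voldemort": 0.30, "Harry": 0.30}


def word_probability(word, doc_topics, topics):
    """P(word | document) = sum over topics of P(topic | document) * P(word | topic)."""
    return sum(
        weight * topics[topic].get(word, 0.0)
        for topic, weight in doc_topics.items()
    )


for word in ["Dursley", "Voldemort", "Harry Potter"]:
    print(word, round(word_probability(word, chapter_1, topics), 3))
```

Under these assumptions "Dursley" gets 0.40 × 0.25 = 0.10 in Chapter 1, since only the Muggle topic lists it.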
