Same content. Different words.

Introducing Word Mover's Distance in gensim.

Lev Konstantinovskiy

April 05, 2016

Transcript

  1. Same content. Different words.
    Lev Konstantinovskiy, Community Manager at Gensim
    @teagermylk http://rare-technologies.com/
  2. Streaming We turn NLP papers into industrial Python code.

  3. Credits
    Olavur Mortensen, Applied Mathematics student at DTU in Copenhagen
    RaReTech Incubator program: added WMD to Gensim
    http://rare-technologies.com/incubator/
  4. Business Problem
    All reviewers are raving about the same thing:
    “The Sicilian gelato was extremely rich”
    “The Italian ice-cream was very velvety”
    What about Ambiance, Service and Prices? Let’s filter “gelato” out and add other aspects!
    Credit: Sudeep Das (@datamusing) applied WMD to restaurant reviews.
    http://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/
  5. Ways to find similar documents
    • Count common words (bag of words, TF-IDF)
      ◦ #Dimensions = #Vocabulary (thousands)
    Stuck if no words in common: “Gelato” != “Ice-cream”
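
The "stuck if no words in common" problem can be seen in a few lines. This is a minimal sketch using raw word counts and cosine similarity; the function name `bow_cosine` and the pre-tokenized, stop-word-free reviews are assumptions made for illustration, not part of the talk.

```python
import math
from collections import Counter

def bow_cosine(doc_a, doc_b):
    """Cosine similarity between two token lists using raw word counts."""
    counts_a, counts_b = Counter(doc_a), Counter(doc_b)
    # Only words present in both documents contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Assumption: stop words ("the", "was") were removed beforehand.
review_1 = ["sicilian", "gelato", "extremely", "rich"]
review_2 = ["italian", "ice-cream", "very", "velvety"]

# The two reviews share no tokens, so bag of words sees them as unrelated.
print(bow_cosine(review_1, review_2))
# A review compared with itself is maximally similar.
print(bow_cosine(review_1, review_1))
```

Both reviews say the same thing, yet the count-based score cannot tell, because every dimension of the vocabulary-sized vector is treated as unrelated to every other.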
  6. Ways to find similar documents
    • Low-dimensional latent features
      ◦ Eigen-values (LSI)
      ◦ Probability (LDA)
    Nice features. That is basically what most of Gensim is. But there is something better now… WMD!
  7. New way to find similar documents
    • Word Mover’s Distance
      ◦ Built on top of Google’s word2vec
      ◦ Well-used concept in other fields, known as Earth Mover’s Distance
    Beats BOW, TF-IDF, LDA, LSI in k Nearest Neighbours classification tasks.
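
To make the idea concrete, here is a toy sketch of Word Mover's Distance on the two gelato reviews. Everything in it is an assumption for illustration: the 2-D vectors in `toy_vectors` are invented (real word2vec vectors are learned), and `toy_wmd` is a hypothetical brute-force helper, not gensim's implementation. It relies on a simplification: when both documents have the same number of distinct words with equal weights, an optimal flow is a one-to-one matching, so trying every permutation finds the minimum.

```python
import math
from itertools import permutations

# Hypothetical 2-D "embeddings" invented for this sketch.
toy_vectors = {
    "sicilian": (0.0, 1.0), "italian": (0.0, 1.5),
    "gelato": (3.0, 0.0), "ice-cream": (3.0, 0.4),
    "extremely": (6.0, 5.0), "very": (6.0, 5.3),
    "rich": (9.0, 2.0), "velvety": (9.0, 2.6),
    "service": (20.0, 20.0), "waiter": (20.0, 21.0),
    "slow": (25.0, 20.0), "rude": (25.0, 22.0),
}

def toy_wmd(doc_a, doc_b):
    """Brute-force WMD for equal-length documents of distinct, equally weighted words."""
    n = len(doc_a)
    if n == 0 or n != len(doc_b):
        raise ValueError("documents must be non-empty and of equal length")
    if len(set(doc_a)) != n or len(set(doc_b)) != n:
        raise ValueError("words within a document must be distinct")
    # Each word carries weight 1/n; try every one-to-one matching.
    best_total = min(
        sum(math.dist(toy_vectors[a], toy_vectors[b]) for a, b in zip(doc_a, perm))
        for perm in permutations(doc_b)
    )
    return best_total / n

gelato_review = ["sicilian", "gelato", "extremely", "rich"]
icecream_review = ["italian", "ice-cream", "very", "velvety"]
service_review = ["service", "waiter", "slow", "rude"]

# No words in common, but every word has a close neighbour: small distance.
print(toy_wmd(gelato_review, icecream_review))
# A review about a different aspect is much further away.
print(toy_wmd(gelato_review, service_review))
```

The brute force is only workable for a handful of words. General documents have unequal lengths and weights, which turns the matching into a transportation problem that needs a proper solver.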
  8. Word Mover’s distance http://jmlr.org/proceedings/papers/v37/kusnerb15.pdf https://github.com/mkusner/wmd

  9. Word Mover’s distance http://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/

  10. Distance in gensim

  11. Finding similar reviews
    from gensim.similarities import WmdSimilarity
    similar_reviews = WmdSimilarity(reviews, model, num_best=10)
    similar_reviews['Very good, you should seat outdoor.']
  12. Thanks!
    Lev Konstantinovskiy github.com/tmylk @teagermylk
    Events coming up:
    - Word2vec successor Swivel, DS Journal Club Meetup, 21 April
    - Word embeddings talk at PyData London conference, 7-8 May
    - Word2vec tutorial at PyData Berlin, 20-21 May
  13. Extra slides

  14. Ways to find similar documents
    • Google’s Doc2vec
      ◦ Built on top of word2vec
      ◦ Document tags are just extra words in the document
    Hard to tune. Slow inference.
  15. Earth Mover’s Distance
    How do you best move piles of sand to fill up holes of the same total volume?
    Stated by Monge in 1781. Solved by Kantorovich in
    [Image: APS/Alan Stonebraker]
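
The sand-and-holes question has a simple answer when everything sits on a line at unit-spaced positions: the minimum work is the sum of the absolute running surplus, because whatever surplus has accumulated up to position i must be carried across the gap to position i+1. The sketch below covers only this 1-D special case; `emd_1d` is a hypothetical helper name, and the general problem behind WMD needs a transport solver.

```python
from itertools import accumulate

def emd_1d(piles, holes):
    """Minimum work to fill holes with sand on a line of unit-spaced positions.

    piles[i] is the volume of sand at position i, holes[i] the volume needed there.
    Work is volume moved times distance travelled.
    """
    if len(piles) != len(holes):
        raise ValueError("piles and holes must cover the same positions")
    if sum(piles) != sum(holes):
        raise ValueError("total volumes must match")
    surplus = [p - h for p, h in zip(piles, holes)]
    # The running surplus after position i has to cross the gap to position i + 1.
    return sum(abs(carry) for carry in accumulate(surplus))

# Three units of sand at position 0; holes of volume 1 and 2 at positions 1 and 2.
# Moving 1 unit over distance 1 and 2 units over distance 2 costs 1 + 4 = 5.
print(emd_1d([3, 0, 0], [0, 1, 2]))
```

Integer volumes are used so that the equal-total check is exact. With floating-point volumes a tolerance would be needed.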
  16. Google’s Word2vec algorithm
    • Word becomes a vector in 100-dimensional space.
    • king - man + woman = queen
    http://nbviewer.jupyter.org/github/fbkarsdorp/doc2vec/blob/master/doc2vec.ipynb
    http://radimrehurek.com/2014/02/word2vec-tutorial
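
The analogy arithmetic can be illustrated with plain tuples. The 2-D vectors below are invented for this sketch (roughly "royalty" and "gender" axes), and `analogy` is a hypothetical helper; trained word2vec vectors only approximately satisfy such relations, which is why the nearest remaining word by cosine similarity is returned rather than an exact match.

```python
import math

# Hypothetical 2-D vectors, invented for illustration only.
toy_vectors = {
    "king": (1.0, 1.0), "man": (0.0, 1.0), "woman": (0.0, -1.0),
    "queen": (0.9, -1.0), "prince": (1.0, 0.8), "apple": (-1.0, 0.1),
}

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(positive, negative):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    dims = len(next(iter(toy_vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, toy_vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, toy_vectors[word])]
    excluded = set(positive) | set(negative)
    candidates = [w for w in toy_vectors if w not in excluded]
    return max(candidates, key=lambda w: cosine(target, toy_vectors[w]))

# king - man + woman
answer = analogy(positive=["king", "woman"], negative=["man"])
print(answer)
```

Excluding the query words from the candidates matters: without it, the nearest vector to the target is often one of the inputs themselves.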
  17. Word Mover’s distance http://vene.ro/blog/word-movers-distance-in-python.html