Slide 1

Slide 1 text

Same content. Different words. Lev Konstantinovskiy Community Manager at Gensim @teagermylk http://rare-technologies.com/

Slide 2

Slide 2 text

Streaming We turn NLP papers into industrial Python code.

Slide 3

Slide 3 text

Credits Olavur Mortensen Applied Mathematics student DTU in Copenhagen RaReTech Incubator program Added WMD to Gensim http://rare-technologies.com/incubator/

Slide 4

Slide 4 text

Business Problem All reviewers are raving about the same thing “The Sicilian gelato was extremely rich” “The Italian ice-cream was very velvety” What about Ambiance, Service and Prices? Let’s filter “gelato” out and add other aspects! Credit: Sudeep Das @datamusing applied WMD to restaurant reviews. http://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/

Slide 5

Slide 5 text

Ways to find similar documents ● Count common words ( bag of words, TF-IDF) ○ #Dimensions = #Vocabulary (thousands) Stuck if no words in common. “Gelato” != “Ice-cream”
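To make the "stuck if no words in common" point concrete, here is a minimal sketch of bag-of-words cosine similarity on the two review sentences from the business problem slide. The tokenizer and the two-word stop list are assumptions made for this illustration, not part of Gensim.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop list (an assumption for this sketch).
STOP_WORDS = {"the", "was"}


def bag_of_words(text):
    """Lowercase, keep alphabetic tokens (hyphens allowed), drop stop words."""
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors stored as Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


gelato = bag_of_words("The Sicilian gelato was extremely rich")
ice_cream = bag_of_words("The Italian ice-cream was very velvety")

# The two reviews share no content words, so the word-count vectors
# are orthogonal even though the sentences mean nearly the same thing.
overlap = cosine_similarity(gelato, ice_cream)
print(overlap)
```

Each distinct word gets its own dimension, so "gelato" and "ice-cream" can never contribute to the same coordinate.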

Slide 6

Slide 6 text

Ways to find similar documents ● Low-dimensional latent features ○ Eigen-values (LSI) ○ Probability (LDA) Nice features. That is basically what most of Gensim is. But there is something better now… WMD!

Slide 7

Slide 7 text

New way to find similar documents ● Word Mover’s Distance ○ Built on top of Google’s word2vec ○ A well-established concept in other fields, where it is known as Earth Mover’s Distance Beats BOW, TF-IDF, LDA, LSI in k-nearest-neighbours classification tasks.
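A rough sketch of the idea, under strong simplifying assumptions: the 2-D vectors below are invented for illustration (real word2vec vectors are learned and have many more dimensions), and the function only handles two documents with the same number of distinct words and uniform word weights. In that special case the optimal transport is a one-to-one matching of words, so a brute-force search over pairings gives the exact distance. Gensim's implementation solves the general transport problem instead.

```python
import math
from itertools import permutations

# Hypothetical 2-D "embeddings", made up for this sketch only.
TOY_VECTORS = {
    "sicilian": (1.0, 0.0), "italian": (1.0, 0.2),
    "gelato": (0.0, 1.0), "ice-cream": (0.1, 1.0),
    "extremely": (2.0, 2.0), "very": (2.0, 2.1),
    "rich": (3.0, 0.0), "velvety": (3.0, 0.5),
}


def toy_wmd(doc_a, doc_b):
    """Word Mover's Distance for equal-length docs of distinct words.

    Every word carries weight 1/n, so the cheapest way to "move" doc_a
    onto doc_b is the pairing of words with the smallest total distance.
    """
    if len(doc_a) != len(doc_b):
        raise ValueError("this sketch needs documents of equal length")
    cheapest_total = min(
        sum(math.dist(TOY_VECTORS[a], TOY_VECTORS[b])
            for a, b in zip(doc_a, pairing))
        for pairing in permutations(doc_b)
    )
    return cheapest_total / len(doc_a)


gelato_review = ["sicilian", "gelato", "extremely", "rich"]
ice_cream_review = ["italian", "ice-cream", "very", "velvety"]

# Small but non-zero: no words in common, yet every word has a close
# counterpart in the other review.
print(toy_wmd(gelato_review, ice_cream_review))
```

Unlike a bag-of-words comparison, the distance stays small here because each word only has to travel to a nearby word in embedding space.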

Slide 8

Slide 8 text

Word Mover’s distance http://jmlr.org/proceedings/papers/v37/kusnerb15.pdf https://github.com/mkusner/wmd

Slide 9

Slide 9 text

Word Mover’s distance http://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/

Slide 10

Slide 10 text

Distance in gensim

Slide 11

Slide 11 text

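Finding similar reviews
from gensim.similarities import WmdSimilarity
# reviews: tokenized documents (lists of words); model: trained word2vec embeddings
similar_reviews = WmdSimilarity(reviews, model, num_best=10)
# The query is tokenized the same way as the indexed reviews
similar_reviews['Very good, you should seat outdoor.'.lower().split()]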

Slide 12

Slide 12 text

Thanks! Lev Konstantinovskiy github.com/tmylk @teagermylk
Events coming up:
- Word2vec successor Swivel DS Journal Club Meetup 21 April
- Word embeddings talk at PyData London conference 7-8 May
- Word2vec tutorial at PyData Berlin 20-21 May

Slide 13

Slide 13 text

Extra slides

Slide 14

Slide 14 text

Ways to find similar documents ● Google’s Doc2vec ○ Built on top of word2vec ○ Document tags are just extra words in the document Hard to tune. Slow inference.

Slide 15

Slide 15 text

Earth Mover’s Distance How do you best move piles of sand to fill up holes of the same total volume? Stated by Monge in 1781. Solved by Kantorovich in [Image: APS/Alan Stonebraker]
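The sand-and-holes picture can be sketched in one dimension, where the problem has a simple closed form: sweep left to right and count how much sand has to be carried across each gap. The function name and the unit spacing between positions are assumptions of this sketch; the general multi-dimensional case needs a transport solver.

```python
def emd_1d(piles, holes):
    """Earth Mover's Distance between sand piles and holes on a line.

    piles[i] and holes[i] are the amounts at position i, with positions
    spaced one unit apart. Total sand must equal total hole volume.
    """
    if len(piles) != len(holes):
        raise ValueError("piles and holes must cover the same positions")
    if abs(sum(piles) - sum(holes)) > 1e-9:
        raise ValueError("total sand must match total hole volume")

    work = 0.0
    carried = 0.0  # surplus (or deficit) carried past the current position
    for pile, hole in zip(piles, holes):
        carried += pile - hole
        work += abs(carried)  # moving `carried` units one step costs |carried|
    return work


# Three units of sand at the left end, a hole of volume three at the right end:
# every unit travels two steps.
print(emd_1d([3, 0, 0], [0, 0, 3]))
```

Word Mover's Distance uses the same notion of work, with words as positions and distances between word vectors as the cost of moving between them.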

Slide 16

Slide 16 text

Google’s Word2vec algorithm ● Word becomes a vector in 100-dimensional space. ● king - man + woman = queen http://nbviewer.jupyter.org/github/fbkarsdorp/doc2vec/blob/master/doc2vec.ipynb http://radimrehurek.com/2014/02/word2vec-tutorial
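The analogy arithmetic can be illustrated with a toy example. The 2-D vectors below are hand-made for this sketch (one axis loosely standing for "royalty", the other for "gender"); real word2vec vectors are learned from text and typically have around 100 or more dimensions. The helper name `analogy` is hypothetical.

```python
import math

# Hand-made toy vectors, invented for illustration only.
TOY_VECTORS = {
    "king": (0.9, 0.8),
    "queen": (0.9, -0.8),
    "man": (0.1, 0.8),
    "woman": (0.1, -0.8),
    "gelato": (-0.7, 0.1),
}


def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def analogy(positive, negative):
    """Return the word closest to sum(positive) - sum(negative).

    The input words themselves are excluded from the candidates,
    which is the usual convention for analogy queries.
    """
    target = [0.0, 0.0]
    for word in positive:
        target = [t + x for t, x in zip(target, TOY_VECTORS[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, TOY_VECTORS[word])]
    candidates = [w for w in TOY_VECTORS if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(TOY_VECTORS[w], target))


# king - man + woman
print(analogy(positive=["king", "woman"], negative=["man"]))
```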

Slide 17

Slide 17 text

http://vene.ro/blog/word-movers-distance-in-python.html Word Mover’s distance