"How to get the similarity you need with next gen of word embeddings" PyData Berlin 2017

Lev Konstantinovskiy

July 02, 2017
Transcript

  1. Next generation of word
    embeddings
    Lev Konstantinovskiy
    A gensim contributor
    @teagermylk
    Notebook: https://goo.gl/uefW9f

  2. Streaming
    Word2vec and Topic Modelling in Python

  3. Gensim Open Source Package
    ● Numerous Industry Adopters
    ● 210 Code contributors, 4000 Github stars
    ● 200 Messages per month on the mailing list
    ● 200 People chatting on Gitter
    ● 500 Academic citations

  4. Parul Sethi
    Undergraduate student
    University of Delhi, India
    Google Summer of Code 2017
    Added WordRank to Gensim
    Credits

  6. Which words are similar?

  7. “Coast” vs “Shore”
    ????

  8. “Coast” vs “Shore”
    similar

  9. “Clothes” vs “Closet”
    ????

  10. “Clothes” vs “Closet”
    similar
    in the sense “related”

  11. “Clothes” vs “Closet”
    different!
    Why?

  12. “Clothes” vs “Closet”
    different!
    not “interchangeable”

  13. Business Problems

  14. Business Problems
    “What does Elizabeth think about Mr Darcy?”
    “Male characters in Pride and Prejudice?”

  15. Two Different
    Business Problems
    1) What words are in the topic of “Darcy”?
    2) What are the Named Entities in the text?

  16. Pride and Prejudice models

  17. Algorithm            Related    Related   Interchangeable  Interchangeable
                           analogies  (WS-353)  analogies        (SimLex-999)
      Word2Vec window=2    21         0.63      36               0.33
      Word2Vec window=15   40         0.69      40               0.31
      Word2Vec window=50   42         0.68      34               0.26

      A big window means less “interchangeable”.

  18. Closest word to “king”
    Trained on Wikipedia, 17M words.
    Same window size, different algorithms; the slide’s table groups the
    nearest neighbours into Attribute / Interchangeable / Both.

  19. Theory of word embeddings

  20. What is a word embedding?
    ‘Word embedding’ = ‘word vectors’ = ‘distributed representations’
    It is a dense representation of words in a low-dimensional vector space.
    One-hot representation:
    king = [1 0 0 0 .. 0 0 0 0 0]
    queen = [0 1 0 0 0 0 0 0 0]
    book = [0 0 1 0 0 0 0 0 0]
    Distributed representation:
    king = [0.9457, 0.5774, 0.2224]
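The contrast can be shown in a few lines of numpy. The dense values for “queen” below are made up for illustration, not taken from a trained model:

```python
import numpy as np

# One-hot vectors: every pair of distinct words is orthogonal,
# so cosine similarity is always 0 and carries no information.
king_1hot = np.array([1.0, 0.0, 0.0])
queen_1hot = np.array([0.0, 1.0, 0.0])

# Dense (distributed) vectors: words with similar meanings
# can end up with nearby vectors.
king_dense = np.array([0.9457, 0.5774, 0.2224])
queen_dense = np.array([0.9200, 0.6100, 0.2500])  # illustrative values

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(king_1hot, queen_1hot))   # one-hot: always 0
print(cosine(king_dense, queen_dense)) # dense: close to 1 here
```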

  21. Awesome 3D viz in TensorBoard

  22. Awesome 3D viz in TensorBoard
    “Projections are meaningless.”
    Matti Lyra, 2017

  23. How to come up with the
    vectors?

  24. Our corpus (aka dataset)
    cute kitten purred
    cute furry cat purred
    and miaowed
    cute kitten miaowed
    loud furry dog ran

  25. Co-occurrence matrix
    ... and the cute kitten purred and then ...
    ... the cute furry cat purred and miaowed ...
    ... that the cute kitten miaowed and she ...
    ... the loud furry dog ran and bit ...
    kitten context words: [cute, purred, miaowed].
    cat context words: [cute, furry, miaowed].
    dog context words: [loud, furry, ran, bit].
    Credit: Ed Grefenstette https://github.com/oxford-cs-deepnlp-2017/

  26. Co-occurrence matrix
    kitten context words: [cute, purred, miaowed].
    cat context words: [cute, furry, miaowed].
    dog context words: [loud, furry, ran, bit].
    Credit: Ed Grefenstette https://github.com/oxford-cs-deepnlp-2017/

    X =           cute  furry  bit
        kitten      2     0     0
        cat         1     1     0
        dog         0     1     1
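As a sketch, the matrix can be built from the toy context windows above with nothing but the standard library (co-occurrence counted within each window):

```python
from collections import Counter
from itertools import permutations

# Toy corpus from the slide; each line is treated as one context window.
corpus = [
    "and the cute kitten purred and then",
    "the cute furry cat purred and miaowed",
    "that the cute kitten miaowed and she",
    "the loud furry dog ran and bit",
]

# Count how often each (word, context word) pair shares a window.
counts = Counter()
for line in corpus:
    tokens = line.split()
    for w, c in permutations(tokens, 2):
        counts[w, c] += 1

rows, cols = ["kitten", "cat", "dog"], ["cute", "furry", "bit"]
X = [[counts[w, c] for c in cols] for w in rows]
# Reproduces the matrix on the slide:
# kitten [2, 0, 0], cat [1, 1, 0], dog [0, 1, 1]
```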

  27. Dimensionality reduction
    More precisely: in word2vec, u·v approximates PMI(X) − log n; in GloVe, log(X).

    X =           cute  furry  bit   …
        kitten      2     0     0    …
        cat         1     1     0    …
        dog         0     1     1    …
        …           …     …     …    …

    X = U * V
    Dims: vocab × vocab = (vocab × small) * (small × vocab)
    The first row of U is the word embedding of “kitten”.
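A minimal sketch of such a factorisation, using a truncated SVD on the toy matrix (plain numpy; the rank k = 2 is an arbitrary choice for illustration):

```python
import numpy as np

# Toy co-occurrence matrix from the slide
# (rows: kitten, cat, dog; columns: cute, furry, bit).
X = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])

# Truncated SVD: keep only the top k singular values, so X ~= U_k @ V_k.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k = U[:, :k] * s[:k]   # vocab x small
V_k = Vt[:k, :]          # small x vocab

kitten_vec = U_k[0]      # first row of U_k is the "kitten" embedding
approx = U_k @ V_k       # low-rank reconstruction of X
```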

  28. Dimensionality reduction
    More precisely, u·v approximates PMI(X) − log n, where n is the
    negative sampling parameter.
    Co-occurrence score in word2vec = U_word * V_context
    Dims: count = (small vector) * (small vector)
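The shifted PMI that u·v is said to approximate can be computed directly from the toy counts (n = 5 negative samples is an illustrative choice, not a value from the talk):

```python
import numpy as np

# Toy co-occurrence counts from the slides
# (rows: kitten, cat, dog; columns: cute, furry, bit).
X = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
n_negative = 5  # word2vec's negative sampling parameter (illustrative)

total = X.sum()
p_wc = X / total                          # joint probability p(w, c)
p_w = p_wc.sum(axis=1, keepdims=True)     # marginal p(w)
p_c = p_wc.sum(axis=0, keepdims=True)     # marginal p(c)

with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))      # -inf where the count is zero
shifted_pmi = pmi - np.log(n_negative)    # the quantity u.v approximates
```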

  29. FastText: a word is a sum of its parts
    Similarity score in FastText = Σ over all subwords of w: U_subword * V_context
    going = go + oi + in + ng + goi + oin + ing
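The subword decomposition on the slide is just the character 2- and 3-grams of the word. A sketch (the real FastText also adds boundary markers such as `<go` and `ng>`, omitted here for simplicity):

```python
def char_ngrams(word, n):
    """All character n-grams of a word (no boundary symbols)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# FastText represents a word as the sum of its subword vectors.
subwords = char_ngrams("going", 2) + char_ngrams("going", 3)
print(subwords)
# ['go', 'oi', 'in', 'ng', 'goi', 'oin', 'ing'] -- as on the slide
```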

  30. FastText beats word2vec because it models morphology
    Credit: Takahiro Kubo http://qiita.com/icoxfog417/items/42a95b279c0b7ad26589
    It is slower, because there are many more vectors to consider!

  31. FastText is better at interchangeable words
    [Table on the slide compares word2vec and FastText on related-words
    accuracy, interchangeable-words accuracy, and training time.]

  32. FastText Gensim Wrapper
    Same API as word2vec.
    Out-of-vocabulary words can also be used, provided they have
    at least one character n-gram present in the training data.

  33. FastText is slower than Python word2vec,
    even without n-grams...
    But Python is slower than C++, right? :)

  34. Many ways to get a vector for a word
    - Word2vec
    - FastText
    - WordRank
    - Factorise the co-occurrence matrix: SVD/LSI
    - GloVe
    - EigenWords
    - VarEmbed

  35. WordRank is a Ranking Algorithm
    Word2vec
    Input: Context Cute
    Output: Word Kitten
    Classification problem
    WordRank
    Input: Context Cute
    Output: Ranking
    1. Kitten
    2. Cat
    3. Dog
    Robust: a mistake at the top of the
    ranking costs more than a mistake at
    the bottom.
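WordRank's exact loss is not shown on the slide; as an illustration of the idea of rank-sensitive costs, here is a DCG-style weight that decays with rank position, so errors near the top of the list are penalised more:

```python
import math

def rank_weight(rank):
    """Illustrative DCG-style weight: decays as the rank position grows.

    This mimics the intuition behind WordRank's robustness (top-of-list
    mistakes cost more), not its actual loss function.
    """
    return 1.0 / math.log2(rank + 1)

penalty_top = rank_weight(1)    # mistake at the top of the ranking
penalty_low = rank_weight(10)   # mistake further down
# penalty_top > penalty_low: top-of-list mistakes cost more
```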

  36. The Gensim WordRank API is the same as for word2vec and
    FastText

  37. Algorithm   Train time  Passes   Related   Related   Interchangeable  Interchangeable
                  (sec)       through  accuracy  (WS-353)  accuracy         (SimLex-999)
                              corpus
      Word2Vec    18          6        4.69      0.37      2.77             0.17
      FastText    50          6        6.57      0.36      36.95            0.13
      WordRank    4 hrs       91       15.26     0.39      4.23             0.09

      Evaluation on a 1M-word corpus

  38. Algorithm   Train time  Passes   Related   Related   Interchangeable  Interchangeable
                  (sec)       through  accuracy  (WS-353)  accuracy         (SimLex-999)
                              corpus
      Word2Vec    402         6        40.34     41.48     0.69             41.1
      FastText    942         6        25.75     57.33     0.67             45.2
      WordRank    ~42 hours   91       54.52     39.83     0.71             44.7

      Evaluation on a 17M-word corpus

  39. How to get the similarity you need

    My similar words must be:  Related                      Interchangeable
    I want to describe
    the word’s:                Topic                        Function
    I want to:                 Know what a doc is about     Recognize names
    Then I should run:         WordRank (works even on a    Word2vec skipgram with a
                               small corpus, 1M words) or   small window, or FastText,
                               Word2vec skipgram with a     or VarEmbed
                               big window (needs a large
                               corpus, >5M words)

  40. How to get the similarity you need
    “Similar words” are:
    - Interchangeable → FastText, Word2vec with a small context window, or VarEmbed
    - Related → Got a million words? Yes: WordRank.
      With a much larger corpus: Word2vec with a large context window.

  41. Thanks for listening!
    Lev Konstantinovskiy
    github.com/tmylk
    @teagermylk
