
Next generation of word embeddings

A short talk on when to use Word2vec, WordRank and FastText, with a brief review of the theory behind word2vec. Code: https://gist.github.com/tmylk/14f887f8585e9f89ab5896a10308447c


Lev Konstantinovskiy

March 22, 2017

Transcript

  1. Next generation of word embeddings. Lev Konstantinovskiy, Community Manager at Gensim. @teagermylk, http://rare-technologies.com/
  2. Streaming Word2vec and Topic Modelling in Python

  3. Gensim Open Source Package: numerous industry adopters; 170 code contributors; 4,000 GitHub stars; 200 messages per month on the mailing list; 150 people chatting on Gitter; 500 academic citations.
  4. Gensim coding sprint. Date: April. Location: somewhere in BH, to be announced. Topic: learn machine learning by improving our tutorials. Interested? Contact me on pydatabh.slack.com, Twitter @teagermylk, or lev@rare-technologies.com.
  5. Credits: Parul Sethi, undergraduate student at the University of Delhi, India, added WordRank to Gensim through the RaReTech Incubator program. http://rare-technologies.com/incubator/
  6. (image-only slide, no transcript)
  7. Part 1: Different word embeddings. Part 2: Theory of word2vec.

  8. Business Problems

  9. Business Problems: "What is Dona Flor like?" "List all female characters in 'Dona Flor e seus dois maridos'?"
  10. Two different business problems: 1) What words are in the topic of "Dona Flor"? 2) What are the named entities in the text?
  11. The "Dona Flor e seus dois maridos" (DFDM) corpus is only 170k words, so results are so-so.

  12. Teodoro

  13. What is the closest word to "king"? Trained on Wikipedia (17m words). (Table of nearest neighbours, with each word labelled as an attribute of "king", interchangeable with it, or both.)
  14. How to get the similarity you need (a gensim sketch follows below):
      - My similar words must be associated, i.e. I want to describe the word's topic and know what a document is about: run WordRank (works even on a small corpus, ~1m words), or Word2vec skip-gram with a big window (needs a large corpus, >5m words).
      - My similar words must be interchangeable, i.e. I want to describe the word's function, e.g. recognize names: run Word2vec skip-gram with a small window, or FastText, or VarEmbed.
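A minimal sketch of the two recipes above using gensim's Word2Vec (the corpus file name is hypothetical, and parameter names are from gensim 3.x; gensim 4+ renamed size to vector_size). WordRank is available in gensim as a wrapper, but the window-size contrast is easiest to show with skip-gram alone:

    from gensim.models import Word2Vec

    # Hypothetical corpus: one tokenised sentence per line.
    sentences = [line.split() for line in open('corpus.txt')]

    # "Associated"/topic similarity: skip-gram with a big window
    # (needs a large corpus, >5m words).
    topical = Word2Vec(sentences, sg=1, window=15, size=100, min_count=5)

    # "Interchangeable"/function similarity: skip-gram with a small window.
    functional = Word2Vec(sentences, sg=1, window=2, size=100, min_count=5)

    print(topical.wv.most_similar('king', topn=5))
    print(functional.wv.most_similar('king', topn=5))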
  15. Part 2. Theory of Word2vec

  16. Word2vec is a big victory of unsupervised learning. Google ran word2vec on 100 billion unlabelled words, then shared the trained model. Thanks to Google for cutting our training time to zero! :)
  17. Word embeddings can be used for: automated text tagging, recommendation engines, synonyms and search query expansion, machine translation, and plain feature engineering.
  18. What is a word embedding? 'Word embedding' = 'word vectors' = 'distributed representations'. It is a dense representation of words in a low-dimensional vector space. One-hot representation: king = [1 0 0 0 ... 0 0 0 0], queen = [0 1 0 0 ... 0 0 0 0], book = [0 0 1 0 ... 0 0 0 0]. Distributed representation: king = [0.9457, 0.5774, 0.2224].
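To make the contrast concrete, here is a toy numpy sketch (all values except the "king" vector from the slide are made up): one-hot vectors are mutually orthogonal and carry no similarity information, while dense vectors can encode similarity through the dot product.

    import numpy as np

    # One-hot: each word is a distinct axis in a vocabulary-sized space.
    one_hot = {'king':  np.eye(9)[0],
               'queen': np.eye(9)[1],
               'book':  np.eye(9)[2]}
    print(one_hot['king'] @ one_hot['queen'])   # 0.0 for every pair

    # Distributed: low-dimensional dense vectors (values made up here;
    # in practice they are learned from data).
    dense = {'king':  np.array([0.9457, 0.5774, 0.2224]),
             'queen': np.array([0.9091, 0.6011, 0.3310]),
             'book':  np.array([0.1361, 0.9021, 0.0325])}
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    print(cos(dense['king'], dense['queen']))   # high: similar words
    print(cos(dense['king'], dense['book']))    # lower: dissimilar words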
  19. Disclaimer: word2vec is not the only word embedding in the world. There are many other ways to get a vector for a word: factorising the co-occurrence matrix (SVD/LSA), GloVe, EigenWords, WordRank, VarEmbed, FastText.
  20. (image-only slide, no transcript)
  21. How to come up with an embedding? Use the "distributional hypothesis": "You shall know a word by the company it keeps" (J. R. Firth, 1957). Richard Socher's NLP course: http://cs224d.stanford.edu/lectures/CS224d-Lecture2.pdf
  22. Usual procedure (see the toy sketch below): 1. Initialise random vectors. 2. Pick an objective function. 3. Do gradient descent.
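A toy numpy sketch of that recipe (not word2vec itself; the objective here is made up purely to show the shape of the procedure):

    import numpy as np

    rng = np.random.default_rng(0)
    v = rng.normal(scale=0.1, size=(2, 3))       # 1. initialise random vectors

    for _ in range(100):                         # 3. gradient descent...
        diff = v[0] - v[1]                       # 2. ...on a made-up objective:
        grad = np.stack([2 * diff, -2 * diff])   #    minimise ||v0 - v1||^2
        v -= 0.1 * grad

    print(((v[0] - v[1]) ** 2).sum())            # ~0: the objective is minimised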
  23. For the theory, take Richard Socher's CS224D free online class: http://cs224d.stanford.edu/lectures/CS224d-Lecture2.pdf
  24. The word2vec algorithm: given "The fox jumped over the lazy dog", maximize the likelihood of seeing the context words given the word "over": P(the|over), P(fox|over), P(jumped|over), P(the|over), P(lazy|over), P(dog|over). Used with permission from @chrisemoody, http://www.slideshare.net/ChristopherMoody3/word2vec-lda-and-introducing-a-new-hybrid-algorithm-lda2vec
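A sketch of how skip-gram reads the sentence: every word takes a turn as the centre, and each neighbour inside a fixed window becomes a (centre, context) training pair. With window=3 the pairs for "over" match the probabilities on the slide:

    sentence = "the fox jumped over the lazy dog".split()

    def skipgram_pairs(tokens, window=3):
        for i, center in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    yield center, tokens[j]

    print([p for p in skipgram_pairs(sentence) if p[0] == 'over'])
    # [('over', 'the'), ('over', 'fox'), ('over', 'jumped'),
    #  ('over', 'the'), ('over', 'lazy'), ('over', 'dog')]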
  25. The probability should depend on the word vectors: P(fox|over) becomes P(v_fox | v_over). Used with permission from @chrisemoody, http://www.slideshare.net/ChristopherMoody3/word2vec-lda-and-introducing-a-new-hybrid-algorithm-lda2vec
  26. A twist: two vectors for every word. The probability should depend on whether the word is the input or the output: P(v_OUT | v_IN). In "The fox jumped over the lazy dog", "over" is the input word, v_IN. Used with permission from @chrisemoody, http://www.slideshare.net/ChristopherMoody3/word2vec-lda-and-introducing-a-new-hybrid-algorithm-lda2vec
  27. Twist: two vectors for every word. For the first context word, P(v_OUT | v_IN) = P(v_the | v_over). Used with permission from @chrisemoody, http://www.slideshare.net/ChristopherMoody3/word2vec-lda-and-introducing-a-new-hybrid-algorithm-lda2vec
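Gensim actually keeps both sets of vectors after training. A sketch, assuming gensim 3.x with the default negative sampling (attribute names differ across versions); model here would be a Word2Vec instance trained as in the earlier example:

    # IN vectors: what you normally query and what most_similar() uses.
    v_in = model.wv['over']

    # OUT vectors: the output-layer weights (negative-sampling case).
    v_out = model.syn1neg[model.wv.vocab['over'].index]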
  28.-34. (Animation: the same slide repeated while v_OUT steps through each remaining context word of the sentence.)
  35. How to define P(v_OUT | v_IN)? First, define similarity. How similar are two vectors? Just the dot product, for unit-length vectors: v_OUT · v_IN. Used with permission from @chrisemoody, http://www.slideshare.net/ChristopherMoody3/word2vec-lda-and-introducing-a-new-hybrid-algorithm-lda2vec
  36. Get a probability in [0, 1] out of a similarity in [-1, 1] with a softmax: P(v_OUT | v_IN) = exp(v_OUT · v_IN) / Σ_w exp(v_w · v_IN), where the denominator is the normalization term over all output words. Used with permission from @chrisemoody, http://www.slideshare.net/ChristopherMoody3/word2vec-lda-and-introducing-a-new-hybrid-algorithm-lda2vec
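The softmax from the slide, sketched in numpy: dot-product similarities against every OUT vector are exponentiated and normalised so they sum to 1 over the whole vocabulary (all vectors here are random placeholders):

    import numpy as np

    def p_out_given_in(v_in, out_matrix):
        # P(each output word | input word) = softmax(out_matrix @ v_in)
        scores = out_matrix @ v_in        # similarity to every OUT vector
        scores -= scores.max()            # shift for numerical stability
        e = np.exp(scores)
        return e / e.sum()                # normalisation over all out words

    rng = np.random.default_rng(0)
    out_vectors = rng.normal(size=(1000, 100))   # toy vocabulary of 1000 words
    v_over = rng.normal(size=100)
    print(p_out_given_in(v_over, out_vectors).sum())   # 1.0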
  37. Word2vec is great! Vector arithmetic: king - man + woman ≈ queen. Slide from @chrisemoody, http://www.slideshare.net/ChristopherMoody3/word2vec-lda-and-introducing-a-new-hybrid-algorithm-lda2vec
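The arithmetic is easy to try in gensim. A sketch using gensim's downloader module (added to gensim after this talk) to fetch the Google News vectors mentioned earlier:

    import gensim.downloader as api

    wv = api.load('word2vec-google-news-300')   # large one-off download

    # king - man + woman ≈ ?
    print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))
    # 'queen' typically tops the list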
  38. More directions: Sudeep Das (@datamusing), http://www.slideshare.net/SparkSummit/using-data-science-to-transform-opentable-into-delgado-das

  39. Consistent directions. Mikolov et al., "Distributed Representations of Words and Phrases and their Compositionality", 2013.
  40. Explore word2vec yourself http://rare-technologies.com/word2vec-tutorial/#app

  41. Thanks! Lev Konstantinovskiy, github.com/tmylk, @teagermylk. Contact me if interested in the Gensim sprint in BH (a full-day coding event).