Next gen of word embeddings (London, 45 mins)

How to get interchangeable or related words
Theory behind word2vec, FastText, WordRank

Lev Konstantinovskiy

May 07, 2017

Transcript

  1. Next generation of word embeddings. Lev Konstantinovskiy, Community Manager at Gensim. @teagermylk http://rare-technologies.com/
  2. Gensim Open Source Package • Numerous industry adopters • 170 code contributors, 4000 GitHub stars • 200 messages per month on the mailing list • 150 people chatting on Gitter • 500 academic citations
  3. Credits: Parul Sethi, undergraduate student, University of Delhi, India. RaReTech Incubator program. Added WordRank to Gensim. http://rare-technologies.com/incubator/
  4. Two Different Business Problems: 1) What words are in the topic of “Darcy”? 2) What are the Named Entities in the text?
  5. What is a word embedding? ‘Word embedding’ = ‘word vectors’ = ‘distributed representations’. It is a dense representation of words in a low-dimensional vector space.
      One-hot representation:
        king  = [1 0 0 0 .. 0 0 0 0 0]
        queen = [0 1 0 0 0 0 0 0 0]
        book  = [0 0 1 0 0 0 0 0 0]
      Distributed representation:
        king = [0.9457, 0.5774, 0.2224]
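The contrast between the two representations can be sketched in a few lines of numpy. The `king` vector is the one from the slide; the `queen` and `book` dense vectors are invented for illustration:

```python
import numpy as np

vocab = ["king", "queen", "book"]
# One-hot: each word gets its own axis, so every pair is equally dissimilar.
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Distributed: dense low-dimensional vectors (queen/book values are made up
# for illustration; only king's comes from the slide).
dense = {
    "king":  np.array([0.9457, 0.5774, 0.2224]),
    "queen": np.array([0.9012, 0.6103, 0.2540]),
    "book":  np.array([0.1021, 0.2934, 0.9210]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(one_hot["king"], one_hot["queen"]))  # 0.0: one-hot carries no similarity
print(cosine(dense["king"], dense["queen"]) > cosine(dense["king"], dense["book"]))  # True
```

With one-hot vectors, every pair of distinct words has cosine similarity exactly zero; the dense vectors can place “king” nearer to “queen” than to “book”.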
  6. Co-occurrence matrix.
      ... and the cute kitten purred and then ...
      ... the cute furry cat purred and miaowed ...
      ... that the cute kitten miaowed and she ...
      ... the loud furry dog ran and bit ...
      kitten context words: [cute, purred, miaowed]. cat context words: [cute, furry, miaowed]. dog context words: [loud, furry, ran, bit].
      Credit: Ed Grefenstette https://github.com/oxford-cs-deepnlp-2017/
  7. Co-occurrence matrix. kitten context words: [cute, purred, miaowed]. cat context words: [cute, furry, miaowed]. dog context words: [loud, furry, ran, bit].
      X =
                cute  furry  bit
      kitten      2     0     0
      cat         1     1     0
      dog         0     1     1
      Credit: Ed Grefenstette https://github.com/oxford-cs-deepnlp-2017/
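Counting co-occurrences like this is a short loop. The sketch below uses the slide's four toy sentences; the window size of 3 is my choice, picked so the counts reproduce the slide's matrix:

```python
from collections import defaultdict

# The four toy sentences from the slide.
sentences = [
    "and the cute kitten purred and then".split(),
    "the cute furry cat purred and miaowed".split(),
    "that the cute kitten miaowed and she".split(),
    "the loud furry dog ran and bit".split(),
]

WINDOW = 3  # symmetric context window (chosen so counts match the slide)

cooc = defaultdict(int)  # (word, context_word) -> co-occurrence count
for sent in sentences:
    for i, word in enumerate(sent):
        lo, hi = max(0, i - WINDOW), min(len(sent), i + WINDOW + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[(word, sent[j])] += 1

# Matches the X matrix: kitten/cute = 2, cat/furry = 1, dog/bit = 1.
print(cooc[("kitten", "cute")], cooc[("cat", "furry")], cooc[("dog", "bit")])
```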
  8. Dimensionality reduction. More precisely, in word2vec U*V approximates PMI(X) - log n; in GloVe, log(X).
      X =
                cute  furry  bit  ...
      kitten      2     0     0   ...
      cat         1     1     0   ...
      dog         0     1     1   ...
      ...        ...   ...   ...  ...
      X = U * V. Dims: vocab x vocab = (vocab x small) * (small x vocab).
      The first row of U is the word embedding of “kitten”.
  9. Dimensionality reduction. More precisely, U*V approximates PMI(X) - log n, where n is the negative sampling parameter.
      Co-occurrence score in word2vec = U_word * V_context. Dims: count = (small vector) * (small vector).
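The factorisation X = U * V can be demonstrated with a truncated SVD of the toy matrix (one of several ways to factorise a co-occurrence matrix; word2vec and GloVe learn their factors differently, as noted above):

```python
import numpy as np

# Toy co-occurrence matrix from the earlier slide
# (rows: kitten, cat, dog; columns: cute, furry, bit).
X = np.array([[2., 0., 0.],
              [1., 1., 0.],
              [0., 1., 1.]])

U, s, Vt = np.linalg.svd(X)
k = 2                      # "small" dimension
U_k = U[:, :k] * s[:k]     # word vectors:    vocab x small
V_k = Vt[:k, :]            # context vectors: small x vocab

kitten_vec = U_k[0]        # first row of U = embedding of "kitten"
X_approx = U_k @ V_k       # rank-2 approximation of X
err = np.linalg.norm(X - X_approx)  # equals the discarded singular value
print(np.round(X_approx, 2))
```

Keeping only the top k singular directions gives the best rank-k approximation of X; the Frobenius error is exactly the discarded singular value.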
  10. FastText: a word is a sum of its parts.
      Co-occurrence score in FastText = sum of U_subword * V_context over all subwords of w.
      going = go + oi + in + ng + goi + oin + ing
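The subword decomposition on the slide can be generated with a small helper. Note this mirrors the slide's simplified example (n = 2..3, no boundary markers); real FastText wraps the word in "<" and ">" markers and defaults to n = 3..6:

```python
def char_ngrams(word, nmin=2, nmax=3):
    """All character n-grams of `word` for nmin <= n <= nmax.
    Simplified to match the slide; FastText proper adds '<'/'>' boundary
    markers and uses n = 3..6 by default."""
    return [word[i:i + n]
            for n in range(nmin, nmax + 1)
            for i in range(len(word) - n + 1)]

print(char_ngrams("going"))
# ['go', 'oi', 'in', 'ng', 'goi', 'oin', 'ing']
```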
  11. FastText Gensim Wrapper. Same API as word2vec. Out-of-vocabulary words can also be used, provided they have at least one character n-gram present in the training data.
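The mechanism behind that out-of-vocabulary lookup can be sketched in plain numpy (an illustration of the idea, not the gensim API; the n-gram vectors here are random stand-ins for learned ones):

```python
import numpy as np

rng = np.random.default_rng(0)
# Pretend these n-gram vectors were learned during training on "going".
ngram_vectors = {ng: rng.normal(size=3)
                 for ng in ["go", "oi", "in", "ng", "goi", "oin", "ing"]}

def oov_vector(word, nmin=2, nmax=3):
    """Compose a vector for an unseen word from its known character
    n-gram vectors; fails only if no n-gram was seen in training."""
    grams = [word[i:i + n]
             for n in range(nmin, nmax + 1)
             for i in range(len(word) - n + 1)]
    known = [ngram_vectors[g] for g in grams if g in ngram_vectors]
    if not known:
        raise KeyError("no known character n-grams for %r" % word)
    return np.mean(known, axis=0)

vec = oov_vector("doing")  # unseen word, but shares "oi", "in", "ng", "oin", "ing"
```

This is why an out-of-vocabulary word only needs one shared n-gram: its vector is assembled from the subword vectors rather than looked up whole.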
  12. Many ways to get a vector for a word: Word2vec, FastText, WordRank, factorising the co-occurrence matrix (SVD/LSI), GloVe, EigenWords, VarEmbed.
  13. WordRank is a Ranking Algorithm.
      Word2vec: input is the context “cute”, output is the word “kitten” (a classification problem).
      WordRank: input is the context “cute”, output is a ranking: 1. kitten, 2. cat, 3. dog.
      Robust: a mistake at the top of the ranking costs more than a mistake at the bottom.
  14. Evaluation on a 1-million-word corpus.
      Algorithm   Train time   Passes   Related accuracy   Related (WS-353)   Interchangeable accuracy   Interchangeable (SimLex-999)
      Word2Vec    18 sec       6        4.69               0.37               2.77                       0.17
      FastText    50 sec       6        6.57               0.36               36.95                      0.13
      WordRank    4 hrs        91       15.26              0.39               4.23                       0.09
  15. Evaluation on a 17-million-word corpus.
      Algorithm   Train time   Passes   Related accuracy   Related (WS-353)   Interchangeable accuracy   Interchangeable (SimLex-999)
      Word2Vec    402 sec      6        40.34              41.48              0.69                       41.1
      FastText    942 sec      6        25.75              57.33              0.67                       45.2
      WordRank    ~42 hrs      91       54.52              39.83              0.71                       44.7
  16. A big window means less “interchangeable”.
      Algorithm               Related accuracy   Related (WS-353)   Interchangeable accuracy   Interchangeable (SimLex-999)
      Word2Vec, window = 2    21                 0.63               36                         0.33
      Word2Vec, window = 15   40                 0.69               40                         0.31
      Word2Vec, window = 50   42                 0.68               34                         0.26
  17. How to get the similarity you need.
      If my similar words must be Related (describing the word’s topic, e.g. to know what a doc is about): run WordRank (works even on a small corpus, ~1M words), or word2vec skip-gram with a big window (needs a large corpus, >5M words).
      If my similar words must be Interchangeable (describing the word’s function, e.g. to recognise names): run word2vec skip-gram with a small window, FastText, or VarEmbed.
  18. How to get the similarity you need (decision chart). “Similar words” are either Related or Interchangeable; the other question is: have you got a million words?
      Related: WordRank, or word2vec with a large context window.
      Interchangeable: FastText, word2vec with a small context window, or VarEmbed.
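The decision chart can be paraphrased as a small function. This is my rough encoding of the guidance in the last two slides (the 5M-word threshold comes from the previous slide), not an official rule:

```python
def recommend_embedding(goal, corpus_words):
    """Paraphrase of the talk's decision chart.
    goal: 'related' (topical similarity) or 'interchangeable' (same function).
    corpus_words: size of the training corpus in words."""
    if goal == "related":
        recs = ["WordRank"]  # works even on a small (~1M word) corpus
        if corpus_words > 5_000_000:
            recs.append("word2vec skip-gram, big window")
        return recs
    if goal == "interchangeable":
        return ["word2vec skip-gram, small window", "FastText", "VarEmbed"]
    raise ValueError("goal must be 'related' or 'interchangeable'")

print(recommend_embedding("related", 1_000_000))
print(recommend_embedding("interchangeable", 17_000_000))
```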
  19. Thanks for listening! Lev Konstantinovskiy, github.com/tmylk, @teagermylk.
      Coding sprint today, 13:30 - 15:30: “Learn NLP by running Gensim tutorials”; “Turn your FullFact project into a Gensim tutorial”; fix an easy bug on GitHub.