Next gen of word embeddings (London, 45 mins)

How to get interchangeable or related words
Theory behind word2vec, FastText and WordRank

Lev Konstantinovskiy

May 07, 2017

Transcript

  1. Next generation of word embeddings
    Lev Konstantinovskiy
    Community Manager at Gensim
    @teagermylk
    http://rare-technologies.com/

  2. Streaming
    Word2vec and Topic Modelling in Python

  3. Gensim Open Source Package
    ● Numerous Industry Adopters
● 170 Code contributors, 4000 GitHub stars
    ● 200 Messages per month on the mailing list
    ● 150 People chatting on Gitter
    ● 500 Academic citations

  4. Credits
    Parul Sethi
    Undergraduate student
    University of Delhi, India
    RaReTech Incubator program
    Added WordRank to Gensim
    http://rare-technologies.com/incubator/

  5. (image-only slide)

  6. Which words are similar?

  7. “Coast” vs “Shore”
    ????

  8. “Coast” vs “Shore”
    similar

  9. “Clothes” vs “Closet”
    ????

  10. “Clothes” vs “Closet”
    similar
    in the sense “related”

  11. “Clothes” vs “Closet”
    different!
    Why?

  12. “Clothes” vs “Closet”
    different!
    not “interchangeable”

  13. Business Problems

  14. Business Problems
    “What does Elizabeth think about Mr Darcy?”
    “Male characters in Pride and Prejudice?”

  15. Two Different Business Problems
    1) What words are in the topic of “Darcy”?
    2) What are the Named Entities in the text?

  16. Pride and Prejudice models

  17. Closest word to “king”?
    Trained on Wikipedia, 17m words
    (results table columns: Attribute | Interchangeable | Both)

  18. Theory of word embeddings

  19. What is a word embedding?
    ‘Word embedding’ = ‘word vectors’ = ‘distributed representations’
    It is a dense representation of words in a low-dimensional vector space.
    One-hot representation:
    king = [1 0 0 0 .. 0 0 0 0 0]
    queen = [0 1 0 0 0 0 0 0 0]
    book = [0 0 1 0 0 0 0 0 0]
    Distributed representation:
    king = [0.9457, 0.5774, 0.2224]
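
    A tiny illustration of the two representations (not from the deck; the dense values are copied from the slide, everything else is made up for the example):

    import numpy as np

    vocab = ["king", "queen", "book"]

    # One-hot: one dimension per vocabulary word, a single 1, the rest 0.
    one_hot_king = np.zeros(len(vocab))
    one_hot_king[vocab.index("king")] = 1.0          # [1. 0. 0.]

    # Distributed: a short dense vector of real numbers (values from the slide).
    dense_king = np.array([0.9457, 0.5774, 0.2224])

    print(one_hot_king)
    print(dense_king)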

  20. Awesome 3D viz in TensorBoard

  21. How to come up with the vectors?

  22. Co-occurrence matrix
    ... and the cute kitten purred and then ...
    ... the cute furry cat purred and miaowed ...
    ... that the cute kitten miaowed and she ...
    ... the loud furry dog ran and bit ...
    kitten context words: [cute, purred, miaowed].
    cat context words: [cute, furry, miaowed].
    dog context words: [loud, furry, ran, bit].
    Credit: Ed Grefenstette https://github.com/oxford-cs-deepnlp-2017/

  23. Co-occurrence matrix
    kitten context words: [cute, purred, miaowed].
    cat context words: [cute, furry, miaowed].
    dog context words: [loud, furry, ran, bit].
    Credit: Ed Grefenstette https://github.com/oxford-cs-deepnlp-2017/
    X =
             cute  furry  bit
    kitten     2     0     0
    cat        1     1     0
    dog        0     1     1
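
    A minimal sketch (not from the talk) of how the counts in X come about, using plain Python; “cute” is listed twice for “kitten” because it co-occurs with it in two of the example sentences:

    from collections import defaultdict

    # Context words collected from the example sentences above.
    contexts = {
        "kitten": ["cute", "purred", "cute", "miaowed"],
        "cat":    ["cute", "furry", "miaowed"],
        "dog":    ["loud", "furry", "ran", "bit"],
    }

    # X[word][context] = number of times the pair co-occurs.
    X = defaultdict(lambda: defaultdict(int))
    for word, context_words in contexts.items():
        for c in context_words:
            X[word][c] += 1

    print(X["kitten"]["cute"])   # 2
    print(X["dog"]["bit"])       # 1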

  24. Dimensionality reduction
    More precisely, in word2vec u*v approximates PMI(X) - log n; in GloVe it approximates log(X)
             cute  furry  bit  ...
    kitten     2     0     0   ...
    cat        1     1     0   ...
    dog        0     1     1   ...
    ...       ...   ...   ...  ...
    X = U * V
    Dims: vocab x vocab = (vocab x small) * (small x vocab)
    First row of U is the word embedding of “kitten”
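
    To make the “vocab x vocab = (vocab x small) * (small x vocab)” step concrete, here is a sketch that factorises the toy matrix X with a plain truncated SVD. Word2vec and GloVe do not literally run SVD, but the shape of the result is the same:

    import numpy as np

    # The toy co-occurrence matrix from the previous slides.
    X = np.array([[2, 0, 0],    # kitten
                  [1, 1, 0],    # cat
                  [0, 1, 1]])   # dog

    k = 2                                  # the "small" dimension
    u, s, vt = np.linalg.svd(X, full_matrices=False)
    U = u[:, :k] * s[:k]                   # one row per word: the word vectors
    V = vt[:k, :]                          # one column per context word

    print(U[0])        # the embedding of "kitten" (first row of U)
    print(U @ V)       # low-rank approximation of X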

  25. Dimensionality reduction
    More precisely, u*v approximates PMI(X) - log n, where n is the negative sampling parameter
    Co-occurrence score in word2vec = U_word * V_context
    Dims: count = (small vector) * (small vector)
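
    A minimal gensim sketch that exposes the negative-sampling parameter n as `negative` (parameter names follow gensim 4.x, e.g. `vector_size`; older releases call it `size`). The toy corpus is just the example sentences from earlier:

    from gensim.models import Word2Vec

    sentences = [["the", "cute", "kitten", "purred"],
                 ["the", "cute", "furry", "cat", "purred", "and", "miaowed"],
                 ["the", "loud", "furry", "dog", "ran", "and", "bit"]]

    # sg=1 selects skip-gram, negative=5 is the negative-sampling parameter n.
    model = Word2Vec(sentences, vector_size=50, window=5, sg=1,
                     negative=5, min_count=1, epochs=50)

    print(model.wv["kitten"][:5])            # first few dimensions of U_word
    print(model.wv.most_similar("kitten"))   # nearest neighbours by cosine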

  26. FastText: word is a sum of its parts
    Co-occurrence score in FastText = sum over all subwords of w of U_subword * V_context
    going = go + oi + in + ng + goi + oin + ing
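
    A small sketch of the subword split, producing the 2- and 3-grams shown above (the real fastText adds `<` and `>` boundary symbols, which are left out here to match the slide):

    def char_ngrams(word, nmin=2, nmax=3):
        """Return all character n-grams of `word` for nmin <= n <= nmax."""
        grams = []
        for n in range(nmin, nmax + 1):
            grams += [word[i:i + n] for i in range(len(word) - n + 1)]
        return grams

    print(char_ngrams("going"))
    # ['go', 'oi', 'in', 'ng', 'goi', 'oin', 'ing']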

  27. FastText better than word2vec because of morphology
    Credit: Takahiro Kubo http://qiita.com/icoxfog417/items/42a95b279c0b7ad26589
    Slower because many more vectors to consider!

  28. FastText is better at interchangeable words
    (results chart: Related words accuracy | Interchangeable words accuracy | Training time)

  29. FastText Gensim Wrapper
    Same API as word2vec.
    Out-of-vocabulary words can also be used, provided they have
    at least one character n-gram present in the training data.
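
    A minimal sketch with gensim’s FastText class (gensim 4.x names; the wrapper mentioned on the slide wrapped Facebook’s fastText binary in gensim.models.wrappers, and newer releases ship a native implementation with the same query API). Training looks like word2vec, and an out-of-vocabulary word gets a vector assembled from its character n-grams:

    from gensim.models import FastText

    sentences = [["the", "cute", "kitten", "purred"],
                 ["the", "cute", "furry", "cat", "miaowed"]]

    model = FastText(sentences, vector_size=50, window=3,
                     min_count=1, epochs=50)

    print(model.wv["kitten"][:5])    # in-vocabulary word
    print(model.wv["kittens"][:5])   # OOV, built from shared n-grams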

  30. FastText slower than Python word2vec.
    Even without n-grams...
    But Python is slower than C, right? :)

  31. Many ways to get a vector for a word
    - Word2vec
    - FastText
    - WordRank
    - Factorise the co-occurrence matrix: SVD/LSI
    - GloVe
    - EigenWords
    - VarEmbed

  32. WordRank is a Ranking Algorithm
    Word2vec
    Input: Context Cute
    Output: Word Kitten
    Classification problem
    WordRank
    Input: Context Cute
    Output: Ranking
    1. Kitten
    2. Cat
    3. Dog
    Robust: a mistake at the top of the ranking
    costs more than a mistake at the bottom.
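
    A toy sketch (not either algorithm’s real training code, and the vectors are made up) of the difference in outputs: scoring every candidate word against a context vector and sorting gives a ranking rather than a single predicted word:

    import numpy as np

    context = np.array([0.9, 0.1])                 # made-up context vector
    word_vecs = {"kitten": np.array([0.8, 0.2]),   # made-up word vectors
                 "cat":    np.array([0.7, 0.3]),
                 "dog":    np.array([0.1, 0.9])}

    scores = {w: float(v @ context) for w, v in word_vecs.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    print(ranking)   # ['kitten', 'cat', 'dog']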

  33. Gensim WordRank API is the same as word2vec and FastText
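
    A sketch of what “same API” means in practice: once any of these models is trained, the vectors answer the same queries (the exact attribute holding the KeyedVectors has moved between gensim releases, so this assumes `model.wv`-style access as in the snippets above):

    def show_neighbours(kv, word, topn=3):
        """Print nearest neighbours from any gensim KeyedVectors object."""
        for neighbour, score in kv.most_similar(word, topn=topn):
            print(f"{word} -> {neighbour}: {score:.2f}")

    # The call looks the same whether the vectors came from Word2Vec,
    # FastText or WordRank:
    # show_neighbours(model.wv, "kitten")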

  34. Evaluation on 1 mln words corpus
    Algorithm | Train time (sec) | Passes through corpus | Related accuracy | Related (WS-353) | Interchangeable accuracy | Interchangeable (SimLex-999)
    Word2Vec  | 18               | 6                     | 4.69             | 0.37             | 2.77                     | 0.17
    FastText  | 50               | 6                     | 6.57             | 0.36             | 36.95                    | 0.13
    WordRank  | 4 hrs            | 91                    | 15.26            | 0.39             | 4.23                     | 0.09

  35. Evaluation on 17 mln words corpus
    Algorithm | Train time (sec) | Passes through corpus | Related accuracy | Related (WS-353) | Interchangeable accuracy | Interchangeable (SimLex-999)
    Word2Vec  | 402              | 6                     | 40.34            | 41.48            | 0.69                     | 41.1
    FastText  | 942              | 6                     | 25.75            | 57.33            | 0.67                     | 45.2
    WordRank  | ~42 hours        | 91                    | 54.52            | 39.83            | 0.71                     | 44.7

  36. Big window means less “interchangeable”
    Algorithm             | Related accuracy | Related (WS-353) | Interchangeable accuracy | Interchangeable (SimLex-999)
    Word2Vec, window = 2  | 21               | 0.63             | 36                       | 0.33
    Word2Vec, window = 15 | 40               | 0.69             | 40                       | 0.31
    Word2Vec, window = 50 | 42               | 0.68             | 34                       | 0.26
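
    A sketch of the experiment behind this table: the only change between rows is the `window` parameter (toy corpus and gensim 4.x parameter names assumed; the slide’s numbers come from a real corpus scored against WS-353 and SimLex-999):

    from gensim.models import Word2Vec

    corpus = [["the", "cute", "kitten", "purred", "and", "miaowed"],
              ["the", "furry", "cat", "sat", "on", "the", "warm", "mat"],
              ["the", "loud", "dog", "ran", "after", "the", "cat"]]

    for window in (2, 15, 50):
        model = Word2Vec(corpus, vector_size=50, window=window,
                         sg=1, min_count=1, epochs=100)
        print(window, model.wv.most_similar("cat", topn=2))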

  37. How to get the similarity you need
    If my similar words must be Related:
      - I want to describe the word’s topic
      - I want to know what a doc is about
      - Then I should run WordRank (even on a small corpus, 1m words)
        or Word2vec skipgram with a big window (needs a large corpus, >5m words)
    If my similar words must be Interchangeable:
      - I want to describe the word’s function
      - I want to recognize names
      - Then I should run Word2vec skipgram with a small window, or FastText, or VarEmbed

  38. How to get the similarity you need
    “Similar words” are: Related or Interchangeable? Got a million words: yes or no?
    Interchangeable: FastText, Word2vec with a small context, or VarEmbed
    Related: WordRank, or Word2vec with a large context

  39. Thanks for listening!
    Lev Konstantinovskiy
    github.com/tmylk
    @teagermylk
    Coding sprint today 13:30 - 15:30
    “Learn NLP by running Gensim tutorials”
    “Turn your FullFact project into a Gensim tutorial”
    Fix an easy bug on GitHub.
