"How to get the similarity you need with next gen of word embeddings" PyData Berlin 2017

Lev Konstantinovskiy

July 02, 2017
Transcript

  1. Next generation of word
    embeddings
    Lev Konstantinovskiy
    A gensim contributor
    @teagermylk
    Notebook: https://goo.gl/uefW9f

  2. Streaming
    Word2vec and Topic Modelling in Python

  3. Gensim Open Source Package
    ● Numerous Industry Adopters
    ● 210 Code contributors, 4000 Github stars
    ● 200 Messages per month on the mailing list
    ● 200 People chatting on Gitter
    ● 500 Academic citations

  4. Parul Sethi
    Undergraduate student
    University of Delhi, India
    Google Summer of Code 2017
    Added WordRank to Gensim
    Credits

  6. Which words are similar?

  7. “Coast” vs “Shore”
    ????

  8. “Coast” vs “Shore”
    similar

  9. “Clothes” vs “Closet”
    ????

  10. “Clothes” vs “Closet”
    similar
    in the sense “related”

  11. “Clothes” vs “Closet”
    different!
    Why?

  12. “Clothes” vs “Closet”
    different!
    not “interchangeable”

  13. Business Problems

  14. Business Problems
    “What does Elizabeth think about Mr Darcy?”
    “Male characters in Pride and Prejudice?”

  15. Two Different
    Business Problems
    1) What words are in the topic of “Darcy”?
    2) What are the Named Entities in the text?

  16. Pride and Prejudice models

  17. Algorithm            Related    Related   Interchangeable  Interchangeable
                           analogies  (WS-353)  analogies        (SimLex-999)
      Word2Vec window=2    21         0.63      36               0.33
      Word2Vec window=15   40         0.69      40               0.31
      Word2Vec window=50   42         0.68      34               0.26

      A big window means less “interchangeable”.

  18. Closest word to “king”
    Trained on Wikipedia, 17M words.
    Same window size, different algorithms; the slide’s table groups the
    nearest neighbours into Attribute / Interchangeable / Both.

  19. Theory of word embeddings

  20. What is a word embedding?
    ‘Word embedding’ = ‘word vectors’ = ‘distributed representations’
    It is a dense representation of words in a low-dimensional vector space.
    One-hot representation:
    king = [1 0 0 0 .. 0 0 0 0 0]
    queen = [0 1 0 0 0 0 0 0 0]
    book = [0 0 1 0 0 0 0 0 0]
    Distributed representation:
    king = [0.9457, 0.5774, 0.2224]
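The contrast can be shown in a few lines of numpy. The dense values for “queen” below are made up for illustration, not taken from a trained model:

```python
import numpy as np

# One-hot vectors: every pair of distinct words is orthogonal,
# so cosine similarity is always 0 and carries no information.
king_1hot = np.array([1.0, 0.0, 0.0])
queen_1hot = np.array([0.0, 1.0, 0.0])

# Dense (distributed) vectors: words with similar meanings
# can end up with nearby vectors.
king_dense = np.array([0.9457, 0.5774, 0.2224])
queen_dense = np.array([0.9200, 0.6100, 0.2500])  # illustrative values

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(king_1hot, queen_1hot))   # one-hot: always 0
print(cosine(king_dense, queen_dense)) # dense: close to 1 here
```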

  21. Awesome 3D viz in TensorBoard

  22. Awesome 3D viz in TensorBoard
    “Projections are meaningless.”
    Matti Lyra, 2017

  23. How to come up with the
    vectors?

  24. Our corpus (aka dataset)
    cute kitten purred
    cute furry cat purred
    and miaowed
    cute kitten miaowed
    loud furry dog ran

  25. Co-occurrence matrix
    ... and the cute kitten purred and then ...
    ... the cute furry cat purred and miaowed ...
    ... that the cute kitten miaowed and she ...
    ... the loud furry dog ran and bit ...
    kitten context words: [cute, purred, miaowed].
    cat context words: [cute, furry, miaowed].
    dog context words: [loud, furry, ran, bit].
    Credit: Ed Grefenstette https://github.com/oxford-cs-deepnlp-2017/

  26. Co-occurrence matrix
    kitten context words: [cute, purred, miaowed].
    cat context words: [cute, furry, miaowed].
    dog context words: [loud, furry, ran, bit].
    Credit: Ed Grefenstette https://github.com/oxford-cs-deepnlp-2017/

    X =           cute  furry  bit
        kitten      2     0     0
        cat         1     1     0
        dog         0     1     1
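As a sketch, the matrix can be built from the toy context windows above with nothing but the standard library (co-occurrence counted within each window):

```python
from collections import Counter
from itertools import permutations

# Toy corpus from the slide; each line is treated as one context window.
corpus = [
    "and the cute kitten purred and then",
    "the cute furry cat purred and miaowed",
    "that the cute kitten miaowed and she",
    "the loud furry dog ran and bit",
]

# Count how often each (word, context word) pair shares a window.
counts = Counter()
for line in corpus:
    tokens = line.split()
    for w, c in permutations(tokens, 2):
        counts[w, c] += 1

rows, cols = ["kitten", "cat", "dog"], ["cute", "furry", "bit"]
X = [[counts[w, c] for c in cols] for w in rows]
# Reproduces the matrix on the slide:
# kitten [2, 0, 0], cat [1, 1, 0], dog [0, 1, 1]
```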

  27. Dimensionality reduction
    More precisely: in word2vec, u·v approximates PMI(X) − log n; in GloVe, log(X).

    X =           cute  furry  bit   …
        kitten      2     0     0    …
        cat         1     1     0    …
        dog         0     1     1    …
        …           …     …     …    …

    X = U * V
    Dims: vocab × vocab = (vocab × small) * (small × vocab)
    The first row of U is the word embedding of “kitten”.
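A minimal sketch of such a factorisation, using a truncated SVD on the toy matrix (plain numpy; the rank k = 2 is an arbitrary choice for illustration):

```python
import numpy as np

# Toy co-occurrence matrix from the slide
# (rows: kitten, cat, dog; columns: cute, furry, bit).
X = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])

# Truncated SVD: keep only the top k singular values, so X ~= U_k @ V_k.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k = U[:, :k] * s[:k]   # vocab x small
V_k = Vt[:k, :]          # small x vocab

kitten_vec = U_k[0]      # first row of U_k is the "kitten" embedding
approx = U_k @ V_k       # low-rank reconstruction of X
```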

  28. Dimensionality reduction
    More precisely, u·v approximates PMI(X) − log n, where n is the
    negative sampling parameter.
    Co-occurrence score in word2vec = U_word * V_context
    Dims: count = (small vector) * (small vector)
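The shifted PMI that u·v is said to approximate can be computed directly from the toy counts (n = 5 negative samples is an illustrative choice, not a value from the talk):

```python
import numpy as np

# Toy co-occurrence counts from the slides
# (rows: kitten, cat, dog; columns: cute, furry, bit).
X = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
n_negative = 5  # word2vec's negative sampling parameter (illustrative)

total = X.sum()
p_wc = X / total                          # joint probability p(w, c)
p_w = p_wc.sum(axis=1, keepdims=True)     # marginal p(w)
p_c = p_wc.sum(axis=0, keepdims=True)     # marginal p(c)

with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))      # -inf where the count is zero
shifted_pmi = pmi - np.log(n_negative)    # the quantity u.v approximates
```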

  29. FastText: a word is a sum of its parts
    Similarity score in FastText = Σ over all subwords of w: U_subword * V_context
    going = go + oi + in + ng + goi + oin + ing
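The subword decomposition on the slide is just the character 2- and 3-grams of the word. A sketch (the real FastText also adds boundary markers such as `<go` and `ng>`, omitted here for simplicity):

```python
def char_ngrams(word, n):
    """All character n-grams of a word (no boundary symbols)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# FastText represents a word as the sum of its subword vectors.
subwords = char_ngrams("going", 2) + char_ngrams("going", 3)
print(subwords)
# ['go', 'oi', 'in', 'ng', 'goi', 'oin', 'ing'] -- as on the slide
```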

  30. FastText beats word2vec because it models morphology
    Credit: Takahiro Kubo http://qiita.com/icoxfog417/items/42a95b279c0b7ad26589
    It is slower, because there are many more vectors to consider!

  31. FastText is better at interchangeable words
    [Table on the slide compares word2vec and FastText on related-words
    accuracy, interchangeable-words accuracy, and training time.]

  32. FastText Gensim Wrapper
    Same API as word2vec.
    Out-of-vocabulary words can also be used, provided they have
    at least one character n-gram present in the training data.

  33. FastText is slower than Python word2vec,
    even without n-grams...
    But Python is slower than C++, right? :)

  34. Many ways to get a vector for a word
    - Word2vec
    - FastText
    - WordRank
    - Factorise the co-occurrence matrix: SVD/LSI
    - GloVe
    - EigenWords
    - VarEmbed

  35. WordRank is a Ranking Algorithm
    Word2vec
    Input: Context Cute
    Output: Word Kitten
    Classification problem
    WordRank
    Input: Context Cute
    Output: Ranking
    1. Kitten
    2. Cat
    3. Dog
    Robust: a mistake at the top of the
    ranking costs more than a mistake at
    the bottom.
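WordRank's exact loss is not shown on the slide; as an illustration of the idea of rank-sensitive costs, here is a DCG-style weight that decays with rank position, so errors near the top of the list are penalised more:

```python
import math

def rank_weight(rank):
    """Illustrative DCG-style weight: decays as the rank position grows.

    This mimics the intuition behind WordRank's robustness (top-of-list
    mistakes cost more), not its actual loss function.
    """
    return 1.0 / math.log2(rank + 1)

penalty_top = rank_weight(1)    # mistake at the top of the ranking
penalty_low = rank_weight(10)   # mistake further down
# penalty_top > penalty_low: top-of-list mistakes cost more
```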

  36. The Gensim WordRank API is the same as for word2vec and
    FastText

  37. Algorithm   Train time  Passes   Related   Related   Interchangeable  Interchangeable
                  (sec)       through  accuracy  (WS-353)  accuracy         (SimLex-999)
                              corpus
      Word2Vec    18          6        4.69      0.37      2.77             0.17
      FastText    50          6        6.57      0.36      36.95            0.13
      WordRank    4 hrs       91       15.26     0.39      4.23             0.09

      Evaluation on a 1M-word corpus

  38. Algorithm   Train time  Passes   Related   Related   Interchangeable  Interchangeable
                  (sec)       through  accuracy  (WS-353)  accuracy         (SimLex-999)
                              corpus
      Word2Vec    402         6        40.34     41.48     0.69             41.1
      FastText    942         6        25.75     57.33     0.67             45.2
      WordRank    ~42 hours   91       54.52     39.83     0.71             44.7

      Evaluation on a 17M-word corpus

  39. How to get the similarity you need

    My similar words must be:  Related                      Interchangeable
    I want to describe
    the word’s:                Topic                        Function
    I want to:                 Know what a doc is about     Recognize names
    Then I should run:         WordRank (works even on a    Word2vec skipgram with a
                               small corpus, 1M words) or   small window, or FastText,
                               Word2vec skipgram with a     or VarEmbed
                               big window (needs a large
                               corpus, >5M words)

  40. How to get the similarity you need
    “Similar words” are:
    - Interchangeable → FastText, Word2vec with a small context window, or VarEmbed
    - Related → Got a million words? Yes: WordRank.
      With a much larger corpus: Word2vec with a large context window.

  41. Thanks for listening!
    Lev Konstantinovskiy
    github.com/tmylk
    @teagermylk
