Next generation of word embeddings (Rio, 30 mins)

Lev Konstantinovskiy

April 14, 2017
Transcript

  1. Next generation of word embeddings
    Lev Konstantinovskiy
    Community Manager at Gensim
    @teagermylk
    http://rare-technologies.com/

  2. Streaming
    Word2vec and Topic Modelling in Python


  3. Gensim Open Source Package
    ● Numerous Industry Adopters
    ● 170 Code contributors, 4000 GitHub stars
    ● 200 Messages per month on the mailing list
    ● 150 People chatting on Gitter
    ● 500 Academic citations


  4. Credits
    Parul Sethi
    Undergraduate student
    University of Delhi, India
    RaReTech Incubator program
    Added WordRank to Gensim
    http://rare-technologies.com/incubator/


  5. Part 1. Different word embeddings
    Part 2. Theory of word2vec


  6. Business Problems


  7. Business Problems
    “What is Dona Flor like?”
    “List all the female characters in
    ‘Dona Flor e seus dois maridos’.”


  8. Two Different
    Business Problems
    1) What words are in the topic of “Dona Flor”?
    2) What are the Named Entities in the text?


  9. “Dona Flor e seus dois maridos” (DFDM) is only 170k words,
    so results are so-so


  10. Pride & Prejudice
    It is a case universally acknowledged, that
    a single woman in defiance of a good
    sense, must be in use of a son.


  11. Pride & Prejudice
    By Lynn Cherny
    http://www.ghostweather.com/files/word2vecpride/
    It is a case universally acknowledged, that a single woman in defiance
    of a good sense, must be in use of a son.
    It is a truth universally acknowledged, that a single man in possession
    of a good fortune, must be in want of a wife.


  12. Closest word to “king”?
    Trained on Wikipedia, 17m words.
    (Table columns: Attribute / Interchangeable / Both)


  13. How to get the similarity you need
    If my similar words must be Associated (describing the word’s Topic,
    to know what a doc is about), then I should run WordRank (even on a
    small corpus, ~1m words) or word2vec skip-gram with a big window
    (needs a large corpus, >5m words).
    If my similar words must be Interchangeable (describing the word’s
    Function, to recognize names), then I should run word2vec skip-gram
    with a small window, FastText, or VarEmbed.


  14. Part 2. Theory of Word2vec


  15. Word2vec is a big victory of
    unsupervised learning in industry.
    [GANs will get there in 3 years too :)]


  16. The famous Google News model.
    Google ran word2vec on 100 billion unlabelled words,
    then shared their trained model.
    Thanks to Google for cutting our training time to zero! :)


  17. Word embeddings can be used for:
    - automated text tagging
    - recommendation engines
    - synonyms and search query expansion
    - machine translation
    - plain feature engineering


  18. What is a word embedding?
    ‘Word embedding’ = ‘word vectors’ = ‘distributed representations’
    It is a dense representation of words in a low-dimensional vector space.
    One-hot representation:
    king = [1 0 0 0 .. 0 0 0 0 0]
    queen = [0 1 0 0 0 0 0 0 0]
    book = [0 0 1 0 0 0 0 0 0]
    Distributed representation:
    king = [0.9457, 0.5774, 0.2224]

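The contrast can be made concrete in a few lines of NumPy. The dense numbers below are toy values (including the [0.9457, 0.5774, 0.2224] from the slide), not real trained vectors:

```python
import numpy as np

# One-hot: every word is orthogonal to every other word, so the dot
# product of any two distinct words is always 0 -- no notion of similarity.
king_1hot  = np.array([1, 0, 0])
queen_1hot = np.array([0, 1, 0])
book_1hot  = np.array([0, 0, 1])
print(king_1hot @ queen_1hot, king_1hot @ book_1hot)  # 0 0

# Dense: low-dimensional real-valued vectors can place related words
# close together (toy numbers, not a trained embedding).
king  = np.array([0.9457, 0.5774, 0.2224])
queen = np.array([0.8810, 0.6013, 0.1983])
book  = np.array([0.1021, 0.0980, 0.9512])

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(king, queen) > cos(king, book))  # True
```

In the one-hot space "king" is exactly as far from "queen" as from "book"; in the dense space the geometry can carry meaning.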

  19. Disclaimer:
    word2vec is not the only word embedding in the world.
    Many other ways to get a vector for a word:
    - Factorise the co-occurrence matrix (SVD/LSA)
    - GloVe
    - EigenWords
    - WordRank
    - VarEmbed
    - FastText


  20. How to come up with an embedding?
    Use the “Distributional hypothesis”:
    “You shall know a word by the company it keeps”
    - J. R. Firth, 1957
    Richard Socher’s NLP course http://cs224d.stanford.edu/lectures/CS224d-Lecture2.pdf


  21. Usual procedure
    1. Initialise random vectors.
    2. Pick an objective function.
    3. Do gradient descent.

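The three steps can be sketched in a few lines of NumPy. This is a toy skip-gram with one negative sample per pair; the sentence, dimensionality, learning rate, and the extra "negative" words are all made up for illustration, not part of the real word2vec recipe:

```python
import numpy as np

rng = np.random.default_rng(0)
sentence = "the fox jumped over the lazy dog".split()
vocab = sorted(set(sentence)) + ["banana", "piano"]  # extras used as negatives
idx = {w: i for i, w in enumerate(vocab)}
dim = 8

# 1. Initialise random vectors (two per word: IN for centers, OUT for contexts).
v_in = rng.normal(scale=0.1, size=(len(vocab), dim))
v_out = rng.normal(scale=0.1, size=(len(vocab), dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# 2. Objective: observed (center, context) pairs should score high,
#    randomly sampled "negative" words should score low.
# 3. Gradient descent: per-pair stochastic updates.
pairs = [("over", w) for w in sentence if w != "over"]
lr = 0.5
for _ in range(200):
    for center, context in pairs:
        c = idx[center]
        negative = idx[rng.choice(["banana", "piano"])]
        for target, label in ((idx[context], 1.0), (negative, 0.0)):
            grad = label - sigmoid(v_in[c] @ v_out[target])
            v_out[target] += lr * grad * v_in[c]
            v_in[c] += lr * grad * v_out[target]

prob = sigmoid(v_in[idx["over"]] @ v_out[idx["fox"]])
print(round(prob, 3))  # close to 1: "fox" is now a likely context of "over"
```

After training, the model assigns a high probability to real context words of "over" and a low one to the distractors.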

  22. For the theory, take Richard Socher’s CS224D free online class
    Richard Socher’s NLP course http://cs224d.stanford.edu/lectures/CS224d-Lecture2.pdf


  23. The word2vec algorithm
    “The fox jumped over the lazy dog”
    Maximize the likelihood of seeing the context words given the word “over”:
    P(the|over)
    P(fox|over)
    P(jumped|over)
    P(the|over)
    P(lazy|over)
    P(dog|over)
    Used with permission from @chrisemoody
    http://www.slideshare.net/ChristopherMoody3/word2vec-lda-and-introducing-a-new-hybrid-algorithm-lda2vec

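Enumerating those (center, context) pairs is a small exercise. The window size of 3 below is an assumption chosen so that all six other words of the slide's sentence count as context for "over":

```python
def skipgram_pairs(tokens, window=3):
    """All (center, context) pairs within `window` words of each center."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the fox jumped over the lazy dog".split()
over_pairs = [p for p in skipgram_pairs(sentence) if p[0] == "over"]
print(over_pairs)
# [('over', 'the'), ('over', 'fox'), ('over', 'jumped'),
#  ('over', 'the'), ('over', 'lazy'), ('over', 'dog')]
```

Note that "the" appears twice: duplicate pairs are kept, which is exactly why frequent co-occurrences dominate the objective.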

  24. Probability should depend on the word vectors.
    Used with permission from @chrisemoody
    http://www.slideshare.net/ChristopherMoody3/word2vec-lda-and-introducing-a-new-hybrid-algorithm-lda2vec
    P(fox|over)
    P(v_fox | v_over)


  25. A twist: two vectors for every word
    Used with permission from @chrisemoody
    http://www.slideshare.net/ChristopherMoody3/word2vec-lda-and-introducing-a-new-hybrid-algorithm-lda2vec
    Should depend on whether it’s the input or the output.
    P(v_OUT | v_IN)
    “The fox jumped over the lazy dog”
    v_IN


  26. Twist: two vectors for every word
    Used with permission from @chrisemoody
    http://www.slideshare.net/ChristopherMoody3/word2vec-lda-and-introducing-a-new-hybrid-algorithm-lda2vec
    Should depend on whether it’s the input or the output.
    P(v_OUT | v_IN)
    “The fox jumped over the lazy dog”
    v_IN
    v_OUT
    = P(v_THE | v_OVER)


  34. How to define P(v_OUT | v_IN)? First, define similarity.
    How similar are two vectors?
    Just the dot product, for unit-length vectors:
    v_OUT · v_IN
    Used with permission from @chrisemoody
    http://www.slideshare.net/ChristopherMoody3/word2vec-lda-and-introducing-a-new-hybrid-algorithm-lda2vec


  35. Get a probability in [0, 1] out of a similarity in [-1, 1]:
    exponentiate the score and divide by a normalization term over all OUT words (a softmax).
    Used with permission from @chrisemoody
    http://www.slideshare.net/ChristopherMoody3/word2vec-lda-and-introducing-a-new-hybrid-algorithm-lda2vec

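A minimal sketch of that normalization, with toy random vectors (the dimensionality and vocabulary size are arbitrary). All vectors are scaled to unit length so the dot product is exactly the similarity in [-1, 1]:

```python
import numpy as np

rng = np.random.default_rng(1)
# One IN vector for the center word, OUT vectors for a five-word vocabulary.
v_in = rng.normal(size=3)
v_in /= np.linalg.norm(v_in)
v_out = rng.normal(size=(5, 3))
v_out /= np.linalg.norm(v_out, axis=1, keepdims=True)

scores = v_out @ v_in                          # similarities in [-1, 1]
probs = np.exp(scores) / np.exp(scores).sum()  # softmax: values in (0, 1)
print(probs.sum())                             # 1.0 (up to float rounding)
```

The exponential keeps every entry positive and the division makes the whole vocabulary's probabilities sum to one.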

  36. Word2vec is great!
    Vector arithmetic
    Slide from @chrisemoody
    http://www.slideshare.net/ChristopherMoody3/word2vec-lda-and-introducing-a-new-hybrid-algorithm-lda2vec

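The famous king - man + woman ≈ queen arithmetic can be shown with hand-made toy vectors. These numbers are NOT trained embeddings: one axis loosely encodes "royalty" and the other "gender", purely to illustrate the operation:

```python
import numpy as np

vocab = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([2.0, 0.0]),
    "queen": np.array([2.0, 1.0]),
    "dog":   np.array([0.0, 0.2]),
}

target = vocab["king"] - vocab["man"] + vocab["woman"]
# Nearest remaining word to the result of the arithmetic.
nearest = min(
    (w for w in vocab if w not in ("king", "man", "woman")),
    key=lambda w: np.linalg.norm(vocab[w] - target),
)
print(nearest)  # queen
```

In Gensim the same query is expressed with `most_similar(positive=["king", "woman"], negative=["man"])`, which also excludes the input words from the answer.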

  37. @datamusing Sudeep Das
    http://www.slideshare.net/SparkSummit/using-data-science-to-transform-opentable-into-delgado-das
    More directions


  38. Consistent directions
    Mikolov et al. Distributed Representations of Words and Phrases and their Compositionality 2013


  39. Explore word2vec yourself
    http://rare-technologies.com/word2vec-tutorial/#app


  40. Facebook’s FastText:
    a word is the sum of its parts
    Credit: Takahiro Kubo http://qiita.com/icoxfog417/items/42a95b279c0b7ad26589
    Better than word2vec! But slower…
    Download and play with Portuguese model.

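The "parts" are character n-grams. A small sketch of FastText-style subword extraction (FastText's defaults use n-grams of length 3 to 6, with `<` and `>` as word boundary markers); the word's vector is then roughly the sum of its n-gram vectors, so even unseen words get a vector:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """FastText-style subword units: wrap the word in boundary markers
    < and >, then take every character n-gram of length n_min..n_max."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]

grams = char_ngrams("maridos")
print(grams[:7])  # ['<ma', 'mar', 'ari', 'rid', 'ido', 'dos', 'os>']
```

Sharing n-grams like "dos>" across words is what lets FastText handle rich morphology (e.g. Portuguese plurals) better than plain word2vec.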

  41. Thanks for listening!
    Lev Konstantinovskiy
    github.com/tmylk
    @teagermylk
    Gensim T-shirt question:
    Please answer by raising your hand.
    How many words are in “Dona Flor e seus dois maridos”?
