
Understanding and Expanding Word2Vec

Word2Vec has proven useful for many NLP tasks because the vectors it learns capture many meaningful relationships. This has spawned a large number of variations suited to distinct tasks, much as LDA did ten years earlier. In this talk, I'll walk briefly through how Word2Vec works, its assumptions and shortfalls, and some simple, successful variants.

MunichDataGeeks

July 28, 2016

Transcript

1. What is Word2Vec?
   • produces vectors to capture context
     ◦ input: a text corpus
     ◦ output: word and context vectors for each vocabulary item
   • W, C: fixed-dimensional vectors for words and contexts
   • context: words seen before and after the target
   • capture: strongly predict actual context words, but not others

2. How does it work?
   Language Model Inspiration:
   • Attempt to predict the context words c that we observe near word w
   • More precisely, increase P(w, c) for observed (w, c) pairs
     P(w, c) = σ(w⋅c) = 1/(1 + exp(-w⋅c))
     → The probability comes from a vector dot product

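A minimal sketch of that probability in numpy, assuming 300-dimensional vectors; the vectors here are random placeholders, not trained ones:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=300)   # vector for word w (random placeholder)
c = rng.normal(scale=0.1, size=300)   # vector for context word c (random placeholder)

# P(w, c) = sigma(w . c): how strongly the model believes c really occurs near w.
print(sigmoid(w @ c))
```
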
3. Neural Network LM similarity
   • Learn an unknown matrix
   • (stochastic) gradient updates to optimize an objective function
   • Words predict their context
     ◦ (left and right neighbors)

4. Negative Sampling
   • Goal is contrast
     ◦ make P(w, c_s) larger than P(w, c_u)
     ◦ we don't need real probabilities → much simpler!
   • Need a contrasting set
     ◦ ~15 random samples of unseen context words seem to be enough
     ◦ can sample and include in the objective

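A toy sketch of drawing such a contrasting set, assuming a made-up vocabulary and uniform sampling (the modified sampling distribution appears on a later slide):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran", "fast"]  # made-up vocabulary
k = 15                      # number of negative samples per observed (w, c_s) pair
observed_context = "cat"    # the context c_s we actually saw

# Draw k "unseen" contexts c_u at random to contrast against c_s.
candidates = [w for w in vocab if w != observed_context]
negatives = rng.choice(candidates, size=k, replace=True)
print(list(negatives))
```
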
5. How does it work?
   • For each observed word w_i and context c_s
     ◦ Compute w_i⋅c_s for the current vectors
     ◦ Compute w_i⋅c_u for k unseen context vectors c_u (to contrast with)
   • Compute the objective (for a single example):
     S(w_i, c_s) = log σ(w_i⋅c_s) + Σ_{u=1..k} log σ(-w_i⋅c_u)
   • Update vectors to maximize the objective using gradient updates
     ◦ Increase w_i⋅c_s and lower w_i⋅c_u

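A hedged numpy sketch of this single-example objective and one gradient-ascent step, with toy dimensions and randomly initialized vectors (learning rate and k are illustrative, not the talk's settings):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
dim, k, lr = 50, 15, 0.05                    # toy dimension, negatives, learning rate

w   = rng.normal(scale=0.1, size=dim)        # word vector w_i
c_s = rng.normal(scale=0.1, size=dim)        # observed context vector c_s
C_u = rng.normal(scale=0.1, size=(k, dim))   # k sampled "unseen" context vectors c_u

def objective(w, c_s, C_u):
    # S(w_i, c_s) = log sigma(w_i . c_s) + sum_u log sigma(-w_i . c_u)
    return np.log(sigmoid(w @ c_s)) + np.log(sigmoid(-C_u @ w)).sum()

print("before:", objective(w, c_s, C_u))

# One stochastic gradient-ascent step on this single example.
g_pos = 1.0 - sigmoid(w @ c_s)               # pull w and c_s together
g_neg = sigmoid(C_u @ w)                     # push w away from each c_u
w_grad   = g_pos * c_s - C_u.T @ g_neg       # dS/dw
c_s_grad = g_pos * w                         # dS/dc_s
C_u_grad = -np.outer(g_neg, w)               # dS/dc_u (one row per negative)

w, c_s, C_u = w + lr * w_grad, c_s + lr * c_s_grad, C_u + lr * C_u_grad
print("after: ", objective(w, c_s, C_u))     # should be (slightly) larger
```
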
6. Why is this reasonable?
   Recall: P(w_i, c_j) = σ(w_i⋅c_j)
   Desire: P(w_i, c_s) is large → we want w_i⋅c_s to be large
   Desire: P(w_i, c_u) is small → we want -w_i⋅c_u to be large (w_i⋅c_u is small)
   Define: S(w_i, c_s) = log σ(w_i⋅c_s) + Σ_{u=1..k} log σ(-w_i⋅c_u)
   This function encodes our desires!

7. Additional Word2Vec boosts
   • Modified sampling frequency, (P(c_u))^(¾), for negative samples
     ◦ Gives rare words an extra chance of being sampled
     ◦ Broadens the exposure of negative samples
   • Subsampling frequent terms
     ◦ No theoretical effect on the objective function
     ◦ Reduces the amount of computation dramatically
   • Join common phrases (word2phrase)
     ◦ Its scoring equation looks surprisingly similar to shifted PMI

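A rough sketch of the two sampling tweaks on a toy corpus, assuming the unigram^(¾) distribution for negatives and the keep-probability sqrt(t / f(w)) subsampling heuristic (the threshold t is exaggerated here so the tiny example actually drops words):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

tokens = "the cat sat on the mat while the dog ran past the cat".split()
counts = Counter(tokens)
vocab = list(counts)
freq = np.array([counts[w] for w in vocab], dtype=float)

# Negative-sampling distribution: unigram counts raised to 3/4, then renormalized;
# rare words get a relatively larger share than under the raw unigram distribution.
p_neg = freq ** 0.75
p_neg /= p_neg.sum()
print(list(rng.choice(vocab, size=15, p=p_neg)))

# Subsampling of frequent terms: keep token w with probability sqrt(t / f(w)),
# i.e. discard it with probability 1 - sqrt(t / f(w)).
t = 0.05
p_word = dict(zip(vocab, freq / freq.sum()))
kept = [w for w in tokens if rng.random() < min(1.0, np.sqrt(t / p_word[w]))]
print(kept)
```
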
8. Word2Vec as a matrix
   Taking the dot products w_i⋅c_j of all word and context vectors, Word2Vec computes a matrix
   W⋅C = A = [a_ij], where a_ij = w_i⋅c_j
   As training converges, a_ij approaches
   a*_ij = PMI(w_i, c_j) - log k = log( P(w_i, c_j) / (P(w_i)P(c_j)) ) - log k

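A small sketch of this matrix view, computing the empirical shifted PMI from windowed co-occurrence counts of a made-up corpus (window size, corpus, and k are illustrative):

```python
import numpy as np

tokens = "the cat sat on the mat while the dog sat on the rug".split()
vocab = sorted(set(tokens))
idx = {w: i for i, w in enumerate(vocab)}
V, window, k = len(vocab), 2, 15

# Count (word, context) co-occurrences within a symmetric window.
counts = np.zeros((V, V))
for i, w in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            counts[idx[w], idx[tokens[j]]] += 1

total = counts.sum()
p_wc = counts / total
p_w = counts.sum(axis=1, keepdims=True) / total
p_c = counts.sum(axis=0, keepdims=True) / total

# Shifted PMI: a*_ij = log( P(w_i, c_j) / (P(w_i) P(c_j)) ) - log k
with np.errstate(divide="ignore"):
    shifted_pmi = np.log(p_wc / (p_w * p_c)) - np.log(k)   # -inf where a pair was never seen
print(np.round(shifted_pmi, 2))
```
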
9. Why is this reasonable?
   • PMI increases with pair frequency, but decreases with individual word frequency
   • Subtracting log k increases the contrast further
     ◦ Expect some pairs to have a positive PMI just by chance
     ◦ Require stronger evidence of shared behavior
   • PMI performs well on many word-related tasks

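A quick worked check of these properties, with made-up probabilities:

```python
import numpy as np

def pmi(p_wc, p_w, p_c):
    # PMI(w, c) = log( P(w, c) / (P(w) P(c)) )
    return np.log(p_wc / (p_w * p_c))

# Same pair probability, but the second pair involves a much more frequent word:
print(pmi(0.001, 0.01, 0.01))                # two rare words -> PMI ~ 2.30
print(pmi(0.001, 0.10, 0.01))                # one frequent word -> PMI ~ 0.00

# Subtracting log k (k = 15, log 15 ~ 2.71) demands stronger evidence of association:
print(pmi(0.001, 0.01, 0.01) - np.log(15))   # drops below zero (~ -0.41)
```
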
10. Training as Matrix Factorization
    • We want W⋅C = A = [a_ij]
    • We know a_ij → a*_ij, and we know how to compute a*_ij from corpus counts
      → This is a matrix factorization problem! [3]
      Decompose A* (|V|×|V|) = W (|V|×300) ⋅ C (300×|V|)
    • The boost from the modified negative-sampling distribution translates to the matrix
      interpretation: use (P(c_u))^(¾) in the denominator of PMI [4]

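A minimal sketch of the factorization step via truncated SVD, using a random stand-in for the positive shifted PMI matrix A*; splitting the singular values symmetrically between W and C is one common choice, not necessarily the talk's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in for a positive shifted PMI matrix A* (|V| x |V|); in practice this
# would be max(PMI(w_i, c_j) - log k, 0) computed from corpus counts.
V, d = 8, 3                                   # toy vocabulary size and embedding dimension
A_star = np.maximum(rng.normal(size=(V, V)), 0.0)

# Truncated SVD gives a rank-d decomposition A* ~ W . C.
U, S, Vt = np.linalg.svd(A_star)
W = U[:, :d] * np.sqrt(S[:d])                 # |V| x d word matrix
C = np.sqrt(S[:d])[:, None] * Vt[:d]          # d x |V| context matrix

print("rank-%d reconstruction error: %.3f" % (d, np.linalg.norm(A_star - W @ C)))
```
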
11. Differences from analogous models
    Compared to a neural language model:
    • No non-linear hidden layer needed
    • A noise-contrastive objective function replaces the softmax
    Compared to matrix factorization:
    • The objective penalizes (a_ij - a*_ij) more for common (w_i, c_j) pairs; this
      weights the reconstruction error differently
    • PMI is negative infinity for unseen (w_i, c_j) pairs; this makes positive PMI
      more practical

12. Related Work, as extensions
    • GloVe [5]
      ◦ Add bias terms to a_ij for each word and each context
      ◦ Average word and context vectors for the final result
    • Doc2Vec [6]
      ◦ Add rows to A for each document, and consider the context words contained within
      ◦ Not all words w need to be contexts
    • Word2Vecf [7]
      ◦ Base contexts on, e.g., a dependency parse
      ◦ No need for C to overlap with W

13. Questions?
    References:
    [1] Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
    [2] Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems (2013).
    [3] Levy, Omer, and Yoav Goldberg. "Neural word embedding as implicit matrix factorization." Advances in Neural Information Processing Systems (2014).
    [4] Levy, Omer, Yoav Goldberg, and Ido Dagan. "Improving distributional similarity with lessons learned from word embeddings." Transactions of the Association for Computational Linguistics 3 (2015): 211-225.
    [5] Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "GloVe: Global Vectors for Word Representation." EMNLP (2014).
    [6] Le, Quoc V., and Tomas Mikolov. "Distributed representations of sentences and documents." ICML (2014).
    [7] Levy, Omer, and Yoav Goldberg. "Dependency-based word embeddings." ACL (2014).