
Word Embeddings


word embeddings, distributed representation, distributional hypothesis, pointwise mutual information, singular value decomposition, word2vec, word analogy, GloVe, fastText

Naoaki Okazaki

August 07, 2020

Transcript

  1. Word Embeddings. Naoaki Okazaki, School of Computing, Tokyo Institute of Technology. [email protected] PowerPoint template designed by https://ppt.design4u.jp/template/
  2. Deep Neural Networks (DNNs) and Natural Language Processing (NLP)
     [Figure: "very good movie" / とても よい 映画 ("very good movie"), illustrating word embeddings (representing a word as a vector), semantic composition (computing the vector of a phrase from its constituent words), and the encoder-decoder model (generating a sequence of words from the composed vector)]
     - DNNs made breakthroughs in speech processing and computer vision
       - Reduced the error rate of image recognition by more than 10% (ILSVRC 2012)
     - At first, DNNs had limited impact on NLP
       - Natural languages have symbols that represent semantic information
     - Recently, DNNs have successfully been applied to various tasks
       - DNNs achieve state-of-the-art performance on most NLP tasks
       - DNNs learn vector representations of text and generate text (e.g., a sequence of words) from those representations
  3. Word embedding
     [Figure: each word of "very good movie" is mapped to a vector $\in \mathbb{R}^d$]
     - Represents a word with a vector of real numbers
     - Embeds a word into a neural network
     - Expresses semantic and syntactic aspects of a word
  4. Distributed representation (Hinton+ 1986)
     - Local representation
       - Assigns a unit (neuron, dimension, symbol) to every concept
     - Distributed representation
       - Each concept is represented by multiple units (micro-features)
       - Each unit commits to multiple concepts
     [Figure: concepts distributed over units #249, #809, #18329]
  5. Distributional hypothesis (Harris 1954; Firth 1957)
     [Figure: concordance lines from a corpus, e.g. "... packed with people drinking beer or wine. Many restaurants ...", "... in miles per hour, pints of beer, and inches for clothes ...", "... principal grapes for the red wines are the grenache, mourved...", "... four or more glasses of red wine per week had a 50 percent ...": the words "beer" and "wine" tend to appear in similar contexts]
     "You shall know a word by the company it keeps"
     Z Harris. 1954. Distributional structure. Word, 10(23):146-162.
     J Firth. 1957. A synopsis of linguistic theory 1930-1955. In Studies in Linguistic Analysis, pp. 1-32.
  6. Word-context matrix
     Context: words appearing within ±h word offsets of the target word.
     Each cell gives the frequency of co-occurrence of the word (row) with the context word (column); for example, "train" co-occurred with "drink" three times. The row vector represents the meaning of the word "beer".

     Word \ Context   have   new   drink   bottle   ride   speed   read
     beer               36    14      72       57      3       0      1
     wine              108    14      92       86      0       1      2
     car               578   284       3        2     37      44      3
     train             291    94       3        0     72      43      2
     book              841   201       0        0      2       1    338
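     As an aside (not on the slide), here is a minimal Python sketch of how such a word-context matrix could be collected, assuming a tokenized corpus and a window of ±h offsets; the function name and the toy corpus are illustrative.

     # Minimal sketch: counting (target word, context word) co-occurrences
     # within a +-h window over a tokenized toy corpus.
     from collections import defaultdict

     def word_context_counts(corpus, h=2):
         """corpus: a list of sentences, each a list of tokens."""
         counts = defaultdict(int)
         for sentence in corpus:
             for i, word in enumerate(sentence):
                 lo, hi = max(0, i - h), min(len(sentence), i + h + 1)
                 for j in range(lo, hi):
                     if j != i:
                         counts[(word, sentence[j])] += 1  # (target, context) pair
         return counts

     toy_corpus = [["people", "drink", "beer", "in", "pubs"],
                   ["read", "a", "new", "book"]]
     print(word_context_counts(toy_corpus)[("beer", "drink")])  # -> 1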
  7. Measure the similarity of two vectors with cos
     Given two vectors $\vec{u}$ and $\vec{v}$ whose angle is $\theta$,
     $\vec{u} \cdot \vec{v} = \|\vec{u}\|\,\|\vec{v}\| \cos\theta$, therefore $\cos\theta = \dfrac{\vec{u} \cdot \vec{v}}{\|\vec{u}\|\,\|\vec{v}\|}$
     The value of $\cos\theta$ is:
     - $\theta \to 0$ (same direction): $\cos\theta \to +1$
     - $\theta \to \pi/2$ (orthogonal): $\cos\theta \to 0$
     - $\theta \to \pi$ (opposite direction): $\cos\theta \to -1$
     In this way, $\cos\theta$ can measure the similarity of two vectors within the range $[-1, +1]$.
     [Figure: vectors $\vec{u}$ and $\vec{v}$ with angle $\theta$ and the projection $\cos\theta$]
  8. Let's compute cosine similarity
     Using the word-context matrix above and $\cos\theta = \dfrac{\vec{u} \cdot \vec{v}}{\|\vec{u}\|\,\|\vec{v}\|}$:
     Cosine similarity between "beer" and "wine":
     $\cos\theta = \dfrac{36 \times 108 + 14 \times 14 + 72 \times 92 + 57 \times 86 + 3 \times 0 + 0 \times 1 + 1 \times 2}{\sqrt{36^2 + 14^2 + 72^2 + 57^2 + 3^2 + 0^2 + 1^2}\,\sqrt{108^2 + 14^2 + 92^2 + 86^2 + 0^2 + 1^2 + 2^2}} = 0.941$
     Cosine similarity between "beer" and "train":
     $\cos\theta = \dfrac{36 \times 291 + 14 \times 94 + 72 \times 3 + 57 \times 0 + 3 \times 72 + 0 \times 43 + 1 \times 2}{\sqrt{36^2 + 14^2 + 72^2 + 57^2 + 3^2 + 0^2 + 1^2}\,\sqrt{291^2 + 94^2 + 3^2 + 0^2 + 72^2 + 43^2 + 2^2}} = 0.387$
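     A small NumPy sketch that reproduces these two numbers; the row and column ordering is the one reconstructed in the word-context matrix above.

     import numpy as np

     M = np.array([[ 36,  14, 72, 57,  3,  0,   1],   # beer
                   [108,  14, 92, 86,  0,  1,   2],   # wine
                   [578, 284,  3,  2, 37, 44,   3],   # car
                   [291,  94,  3,  0, 72, 43,   2],   # train
                   [841, 201,  0,  0,  2,  1, 338]])  # book

     def cosine(u, v):
         return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

     print(round(cosine(M[0], M[1]), 3))  # beer vs. wine  -> 0.941
     print(round(cosine(M[0], M[3]), 3))  # beer vs. train -> 0.387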
  9. Positive Pointwise Mutual Information (PPMI) (Bullinaria+ 2007)
     $\mathrm{PPMI}_{i,j} = \max\left(0, \log \dfrac{P(i,j)}{P(i)\,P(j)}\right) = \max\bigl(0, \log \#(i,j) + \log \#(*,*) - \log \#(*,j) - \log \#(i,*)\bigr)$
     where $P(i,j) = \#(i,j)/\#(*,*)$, $P(i) = \#(i,*)/\#(*,*)$, $P(j) = \#(*,j)/\#(*,*)$,
     $\#(i,*) = \sum_j \#(i,j)$, $\#(*,j) = \sum_i \#(i,j)$, $\#(*,*) = \sum_{i,j} \#(i,j)$
     PPMI discounts frequent words and frequent context words.

     Word \ Context   have   new   drink   bottle   ride   speed   read
     beer             0      0     2.04    1.97     0      0       0
     wine             0      0     1.78    1.87     0      0       0
     car              0.09   0.49  0       0        0.13   0.55    0
     train            0.03   0.02  0       0        1.43   1.16    0
     book             0.09   0     0       0        0      0       0.85

     cos(beer, wine) = 0.99 > 0.941    cos(beer, train) = 0.00 < 0.387
     J Bullinaria and J Levy. 2007. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39:510-526.
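     A sketch of the PPMI reweighting in NumPy, applied to the matrix M from the previous sketch; with natural logarithms it reproduces the values in the table above (e.g., PPMI(beer, drink) ≈ 2.04).

     import numpy as np

     def ppmi(M):
         total = M.sum()                     # #(*,*)
         row = M.sum(axis=1, keepdims=True)  # #(i,*)
         col = M.sum(axis=0, keepdims=True)  # #(*,j)
         with np.errstate(divide="ignore"):
             pmi = np.log(M * total / (row * col))
         pmi[~np.isfinite(pmi)] = 0.0        # unobserved pairs (log 0)
         return np.maximum(pmi, 0.0)

     P = ppmi(M.astype(float))               # M from the previous sketch
     print(np.round(P[0], 2))                # beer row: [0. 0. 2.04 1.97 0. 0. 0.]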
  10. Latent Semantic Analysis (LSA) (Deerwester, 1990)
     - Singular Value Decomposition (SVD) of the word-context matrix: $M = U \Sigma V^{\mathsf{T}}$
       ($U$: unitary matrix, $\Sigma$: diagonal matrix of singular values, $V^{\mathsf{T}}$: unitary matrix)
     - Truncate $\Sigma$ to the top $k$ singular values: $M_k = U_k \Sigma_k V_k^{\mathsf{T}}$ ($k$-rank approximation)
       ($M_k$ is a minimizer of $\|M - M'\|$ among rank-$k$ matrices $M'$)
     - Use the rows of $U_k \Sigma_k$ as $k$-dimensional word vectors:
       $M_k M_k^{\mathsf{T}} = U_k \Sigma_k V_k^{\mathsf{T}} \bigl(U_k \Sigma_k V_k^{\mathsf{T}}\bigr)^{\mathsf{T}} = (U_k \Sigma_k)(U_k \Sigma_k)^{\mathsf{T}}$,
       i.e., the inner products of $U_k \Sigma_k$ are equal to those of $M_k$
     S Deerwester, S Dumais, G Furnas, T Landauer, R Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407.
  11. Low-rank approximation by SVD ($k = 3$)
     [Figure: SVD of the original word-context matrix (rows: beer, wine, car, train, book); the 3-rank approximation truncates $\Sigma$ to three singular values and uses only the first three columns of $U$ and the first three rows of $V^{\mathsf{T}}$]
     cos(beer, wine) = 0.96    cos(beer, train) = 0.37
     Truncated SVD (Halko, 2011) finds the top-$k$ singular values of a matrix efficiently (for example, sklearn.decomposition.TruncatedSVD).
     N Halko, P G Martinsson, and J A Tropp. 2011. Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review, 53(2), 217-288.
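     A sketch using sklearn.decomposition.TruncatedSVD, mentioned on the slide, reusing M and cosine() from the earlier sketches; fit_transform returns the rows of $U_k \Sigma_k$, i.e., the 3-dimensional word vectors.

     from sklearn.decomposition import TruncatedSVD

     # 3-rank approximation of the raw co-occurrence matrix M
     svd = TruncatedSVD(n_components=3, random_state=0)
     W = svd.fit_transform(M.astype(float))  # rows of U_k * Sigma_k
     print(W.shape)                          # (5, 3): beer, wine, car, train, book
     print(cosine(W[0], W[1]), cosine(W[0], W[3]))  # the slide reports 0.96 and 0.37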
  12. Skip-gram with Negative Sampling (SGNS) (Mikolov+ 2013)
     [Figure: the sentence "... pubs offer draught beer, cider, and wine ..."; the word vector of "beer" predicts its surrounding context words (positive examples), while randomly sampled words such as "city", "game", and "season" serve as negative examples]
     - Word vector $w \in \mathbb{R}^d$, context vector $\tilde{c} \in \mathbb{R}^d$
     - Each word vector predicts the $2h$ context words around it
     - Sample $k$ words as negative examples from the unigram distribution
     - Update rule: update the vectors such that word vectors predict the positive context words and do not predict the negative words
     T Mikolov, I Sutskever, K Chen, G Corrado, and J Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS 2013, pp. 3111-3119.
  13. SGD algorithm for updating vectors
     - Initialization: $t \leftarrow 0$
       - Word vectors ($w$): initialize with random values in $[0, 1]$
       - Context vectors ($\tilde{c}$): initialize with zero
     - Repeat from the head to the tail of the training corpus: $t \leftarrow t + 1$
       - Learning rate $\eta_t = \eta_0 \left(1 - \dfrac{t}{T + 1}\right)$
       - For each context word $c$ connected with the target word $w$ (label $= 1$ for a positive context word, $0$ for a negative sample):
         $g = \text{label} - \sigma(w \cdot \tilde{c})$  (positive: push the inner product $\to +\infty$; negative: push the inner product $\to -\infty$)
         $w \leftarrow w + \eta_t\, g\, \tilde{c}$,  $\tilde{c} \leftarrow \tilde{c} + \eta_t\, g\, w$
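     A minimal sketch of one such update step, assuming the standard word2vec form of the gradient sketched above ($g = \text{label} - \sigma(w \cdot \tilde{c})$); the function names and arguments are illustrative.

     import numpy as np

     def sigmoid(x):
         return 1.0 / (1.0 + np.exp(-x))

     def sgns_update(w, c_tilde, label, eta):
         """One SGD step for a (word, context) pair.
         label = 1 for a true context word, 0 for a negative sample."""
         g = label - sigmoid(w @ c_tilde)  # 1 - sigma(.) for positives, -sigma(.) for negatives
         dw, dc = eta * g * c_tilde, eta * g * w
         w += dw        # pushes the inner product up (positive) or down (negative)
         c_tilde += dc
         return w, c_tilde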
  14. Demo with word vectors
     - English: GoogleNews-vectors-negative300.bin.gz
       - Trained on the Google News dataset (100B words)
       - https://code.google.com/archive/p/word2vec/
     - Japanese: (trained by me)
       - Trained on Japanese Wikipedia articles (400M words)
     - Use gensim to manipulate them in Python
     https://github.com/chokkan/deeplearning/blob/master/notebook/word2vec_ja.ipynb
     https://github.com/chokkan/deeplearning/blob/master/notebook/word2vec_en.ipynb
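     A sketch of loading and querying the English vectors with gensim (KeyedVectors.load_word2vec_format); the file path assumes the archive above has been downloaded locally.

     from gensim.models import KeyedVectors

     wv = KeyedVectors.load_word2vec_format(
         "GoogleNews-vectors-negative300.bin.gz", binary=True)
     print(wv.most_similar("beer", topn=5))  # nearest neighbours by cosine similarity
     print(wv.similarity("beer", "wine"))    # cosine similarity between two words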
  15. Evaluation on the word analogy task (Mikolov+ 2013)
     Example of a semantic analogy: Athens : Greece = Tokyo : Japan
     Example of a syntactic analogy: cool : cooler = deep : deeper
     T Mikolov, I Sutskever, K Chen, G Corrado, and J Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS 2013, pp. 3111-3119.
  16. Word vectors exhibit additive composition
     Famous example: king − man + woman ≈ queen (Mikolov+ 2013)
     T Mikolov, I Sutskever, K Chen, G Corrado, and J Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS 2013, pp. 3111-3119.
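     With the gensim KeyedVectors object wv from the demo above, the same analogy can be queried directly; most_similar with positive/negative word lists implements this additive composition.

     print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
     # "queen" is typically the top-ranked answer on the Google News vectors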
  17. The objective function of SGNS
     - The objective function (MLE):
       $L = -\sum_{w \in C} \sum_{c \in C(w)} \log P(c \mid w)$
       ($C$: corpus, a sequence of words; $C(w)$: the set of words appearing within the offset $\pm h$ from the word $w$)
     - $P(c \mid w)$ is modeled by softmax (the probability of predicting $c \in V$ from $w$):
       $P(c \mid w) = \dfrac{\exp(w \cdot \tilde{c})}{\sum_{c' \in V} \exp(w \cdot \tilde{c}')}$
       Too heavy to compute, as this requires the sum of exponentials of inner products between the word and all words $c' \in V$
     - Approximate $\log P(c \mid w)$ with logistic regressions:
       $\log P(c \mid w) \approx \log \sigma(w \cdot \tilde{c}) + k\, \mathbb{E}_{\bar{c} \sim P_n}\bigl[\log \sigma(-w \cdot \tilde{\bar{c}})\bigr]$
       (sample a word $\bar{c}$ from the unigram distribution, $k$ times)
  18. SGNS is equivalent to Shifted PMI (Levy+ 2014)
     - SGNS models a co-occurrence matrix $M$ with
       $M_{w,c} = \mathrm{PMI}(w, c) - \log k \approx w \cdot \tilde{c}$
       (PMI shifted towards negative values by $\log k$)
     - This is similar to training word vectors by building a co-occurrence matrix using PMI
     - The previous approach (PMI) could therefore also realize additive composition
     O Levy and Y Goldberg. 2014. Neural word embedding as implicit matrix factorization. In NIPS 2014, pp. 2177-2185.
  19. Derivation of Shifted PMI (Levy+ 2014)
     The objective function of SGNS ($\#(w,c)$: co-occurrence frequency of $w$ with $c$; $\#(w)$, $\#(c)$: frequencies of $w$ and $c$; we assume $|D| = \#(*,*)$):
     $L = -\sum_{w \in C} \sum_{c \in C(w)} \Bigl( \log \sigma(w \cdot \tilde{c}) + k\, \mathbb{E}_{c_N \sim P_n}\bigl[\log \sigma(-w \cdot \tilde{c}_N)\bigr] \Bigr) = -\sum_{w \in V} \sum_{c \in V} \#(w,c) \log \sigma(w \cdot \tilde{c}) - \sum_{w \in V} \#(w) \cdot k\, \mathbb{E}_{c_N \sim P_n}\bigl[\log \sigma(-w \cdot \tilde{c}_N)\bigr]$
     Compute the expectation explicitly:
     $\mathbb{E}_{c_N \sim P_n}\bigl[\log \sigma(-w \cdot \tilde{c}_N)\bigr] = \sum_{c_N \in V} \dfrac{\#(c_N)}{|D|} \log \sigma(-w \cdot \tilde{c}_N) = \dfrac{\#(c)}{|D|} \log \sigma(-w \cdot \tilde{c}) + \sum_{c_N \in V \setminus \{c\}} \dfrac{\#(c_N)}{|D|} \log \sigma(-w \cdot \tilde{c}_N)$
     Extract the portion of the objective function related to $w$ and $c$ (we can ignore the rest):
     $L(w,c) = -\#(w,c) \log \sigma(w \cdot \tilde{c}) - k \cdot \#(w) \cdot \dfrac{\#(c)}{|D|} \log \sigma(-w \cdot \tilde{c})$
     Let $x = w \cdot \tilde{c}$. Compute the gradient of $L(w,c)$ with respect to $x$ by using $\dfrac{d}{dx} \log \sigma(x) = 1 - \sigma(x) = \sigma(-x)$:
     $\dfrac{\partial L(w,c)}{\partial x} = -\#(w,c)\,\sigma(-x) + k\,\#(w)\,\dfrac{\#(c)}{|D|}\,\sigma(x)$
     Find the point where the gradient is zero:
     $\left(1 + \dfrac{k\,\#(w)\,\#(c)}{|D|\,\#(w,c)}\right) \sigma(x) = 1 \;\Leftrightarrow\; \left(1 + \dfrac{k\,\#(w)\,\#(c)}{|D|\,\#(w,c)}\right) \dfrac{1}{1 + e^{-x}} = 1 \;\Leftrightarrow\; e^{-x} = \dfrac{k\,\#(w)\,\#(c)}{|D|\,\#(w,c)}$
     Therefore:
     $x = w \cdot \tilde{c} = \log \dfrac{|D|\,\#(w,c)}{\#(w)\,\#(c)\,k} = \log \dfrac{|D|\,\#(w,c)}{\#(w)\,\#(c)} - \log k = \mathrm{PMI}(w,c) - \log k$
     O Levy and Y Goldberg. 2014. Neural word embedding as implicit matrix factorization. In NIPS 2014, pp. 2177-2185.
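     Given this equivalence, the factorized matrix can also be built explicitly; below is a sketch of the shifted positive PMI matrix (PMI shifted by log k and clipped at zero, following Levy and Goldberg's SPPMI), reusing the co-occurrence matrix M from the earlier sketches.

     import numpy as np

     def shifted_ppmi(M, k=5):
         """Shifted positive PMI: max(PMI(w, c) - log k, 0)."""
         total = M.sum()
         row = M.sum(axis=1, keepdims=True)
         col = M.sum(axis=0, keepdims=True)
         with np.errstate(divide="ignore"):
             pmi = np.log(M * total / (row * col))
         pmi[~np.isfinite(pmi)] = -np.inf   # unobserved pairs end up at zero after clipping
         return np.maximum(pmi - np.log(k), 0.0)

     S = shifted_ppmi(M.astype(float), k=5)  # factorizing S (e.g., by SVD) approximates SGNS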
  20. GloVe (Pennington+ 2014)
     Minimize
     $J = \sum_{i,j=1}^{|V|} f(X_{ij}) \bigl( w_i \cdot \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \bigr)^2$
     $f(x) = \begin{cases} (x / x_{\max})^{\alpha} & (x < x_{\max}) \\ 1 & (\text{otherwise}) \end{cases}$
     - $X_{ij}$: co-occurrence frequency between words $i$ and $j$; $|V|$: total number of words
     - $w_i$: vector #1 of word $i$; $\tilde{w}_j$: vector #2 of word $j$; $b_i$, $\tilde{b}_j$: biases for words $i$ and $j$
     - Similarly to SGNS, each word has two vectors assigned. This study uses $(w + \tilde{w})$ after training the vectors (this treatment improves the performance)
     - $x_{\max} = 100$, $\alpha = 0.75$; minimized by AdaGrad
     J Pennington, R Socher, and C Manning. 2014. GloVe: Global vectors for word representation. In EMNLP 2014, pp. 1532-1543.
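     A sketch of the weighting function f and the contribution of a single (i, j) cell to the objective, using the hyperparameters reported on the slide (x_max = 100, α = 0.75); variable names are illustrative.

     import numpy as np

     def f(x, x_max=100.0, alpha=0.75):
         """GloVe weighting function."""
         return (x / x_max) ** alpha if x < x_max else 1.0

     def glove_cell_loss(w_i, w_j_tilde, b_i, b_j_tilde, x_ij):
         """Contribution of one cell with X_ij = x_ij > 0 to the objective J."""
         return f(x_ij) * (w_i @ w_j_tilde + b_i + b_j_tilde - np.log(x_ij)) ** 2

     print(f(50))   # (0.5)**0.75 ~= 0.59: rare co-occurrences are down-weighted
     print(f(500))  # 1.0: frequent co-occurrences are clipped to equal importance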
  21. Rationale of $w_i \cdot \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}$ (1/4)
     - Consider representing a relation of words $i$ and $j$ on some aspect by using a context word $k$
       - E.g., the relation between ice and steam on thermodynamics
     - The ratio $P_{ik} / P_{jk}$, where $P_{ik} = P(k \mid i)$, may be more useful than $P_{ik}$ alone to capture the characteristics of words $i$ and $j$
       - E.g., solid and gas are more useful context words than water and fashion (Pennington+ 2014)
     J Pennington, R Socher, and C Manning. 2014. GloVe: Global vectors for word representation. In EMNLP 2014, pp. 1532-1543.
  22. Rationale of $w_i \cdot \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}$ (2/4)
     - Let $w_i$, $w_j$, $\tilde{w}_k$ be the vectors of words $i$, $j$, $k$ (the context vector $\tilde{w}_k$ is different from $w_k$)
     - In order to represent $P_{ik} / P_{jk}$ with word vectors:
       $F(w_i - w_j, \tilde{w}_k) = P_{ik} / P_{jk}$
       (represent the contrast between the characteristics of words $i$ and $j$ with vector subtraction; we will decide the form of $F$ later)
     - The simplest way to cast the type of the left-hand side (vectors) into that of the right-hand side (a scalar):
       $F\bigl((w_i - w_j) \cdot \tilde{w}_k\bigr) = P_{ik} / P_{jk}$
  23. Rationale of $w_i \cdot \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}$ (3/4)
     - Use $\exp : \mathbb{R} \to \mathbb{R}^{+}$ as $F$:
       $\exp\bigl((w_i - w_j) \cdot \tilde{w}_k\bigr) = \dfrac{\exp(w_i \cdot \tilde{w}_k)}{\exp(w_j \cdot \tilde{w}_k)} = \dfrac{P_{ik}}{P_{jk}}$
     - Therefore, $\exp(w_i \cdot \tilde{w}_k) = P_{ik} = X_{ik} / X_i$
     - Take the logarithm of both sides:
       $w_i \cdot \tilde{w}_k = \log X_{ik} - \log X_i$
  24. Rationale of $w_i \cdot \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}$ (4/4)
     - Words and contexts should be interchangeable: consider $w \leftrightarrow \tilde{w}$ and $X \leftrightarrow X^{\mathsf{T}}$ at the same time
     - In $w_i \cdot \tilde{w}_k = \log X_{ik} - \log X_i$, words and contexts are not interchangeable, because we have no constant for $k$
     - Represent $\log X_i$ as a bias term $b_i$ and introduce a new bias term $\tilde{b}_k$ for $\tilde{w}_k$:
       $w_i \cdot \tilde{w}_k = \log X_{ik} - b_i - \tilde{b}_k \;\Leftrightarrow\; w_i \cdot \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}$
  25. Rationale of $f(X_{ij})$
     - We cannot compute $\log X_{ij}$ when $X_{ij} = 0$
       - Most elements in $X$ are 0 (sparse matrix)
       - We ignore unobserved statistics
     - We should not give too much weight to rare co-occurrences
       - It is hard to reproduce rare co-occurrences with vectors
       - Hence the weight $(X_{ij} / x_{\max})^{\alpha}$ when $X_{ij} < x_{\max}$
     - We should not give too much weight to frequent co-occurrences either
       - Treat frequent co-occurrences with the same importance
       - Clip the weight to 1 when $X_{ij} \geq x_{\max}$
  26. Tricks used in implementations (Levy+ 2015)

     Group                Name  Description                  Values                 PPMI    SVD     SGNS   GloVe
     Preprocessing        win   Window size (h)              h ∈ {2, 5, 10}         ✓       ✓       ✓      ✓
                          dyn   Weighted (dynamic) context   with (1/h), none       ✓       ✓       ✓      ✓ *1
                          sub   Subsampling                  with, none             ✓       ✓       ✓      ✓
                          del   Rare word removal            with, none             ✓       ✓       ✓      ✓
     Association measure  neg   Negative samples             k ∈ {1, 5, 15}         ✓ *2    ✓ *2    ✓
                          cds   Distribution correction      α ∈ {1, 0.75}          ✓ *3    ✓ *3    ✓
     Postprocessing       w+c   Vector summation             w, (w + c̃)                     ✓       ✓      ✓
                          eig   Weighted SVs                 p ∈ {0, 0.5, 1.0}              ✓
                          nrm   Normalization *4             both, col, row, none   ✓       ✓       ✓      ✓

     *1: The same weighting method implemented in word2vec
     *2: These are set by shifted PPMI
     *3: These are implemented by modifying the denominator of PMIs
     *4: Normalization for each word vector was the best
     O Levy, Y Goldberg, and I Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics (TACL), 3:211-225.
  27. Tips for training word embeddings (Levy+ 2015)
     - Use context distribution smoothing (cds = 0.75)
     - Use SVD with symmetric variants (eig = 0 or 0.5)
     - No effect with neg > 1 in shifted PPMI
     - SGNS is a robust baseline
       - It does not underperform in any scenario
       - It trains word embeddings the fastest with the cheapest memory consumption
     - Larger numbers of negative samples are better in SGNS
     - w+c is worth trying in SGNS and GloVe
       - It may result in substantial gains (but sometimes in losses)
     O Levy, Y Goldberg, and I Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics (TACL), 3:211-225.
  28. Different evaluations favor different embeddings (Schnabel+ 2015)
     Task: a human worker chooses the most similar word among the candidates computed by word embeddings.
     - GloVe was poor at adverbs for some reason
     - CBOW suffers from larger candidate sets (50 NN)
     T Schnabel, I Labutov, D Mimno, T Joachims. 2015. Evaluation methods for unsupervised word embeddings. In EMNLP 2015, pp. 298-307.
  29. Different tasks favor different embeddings (Schnabel+ 2015)
     - There is no almighty word embedding for all tasks
     - In order to improve the performance on a task, we should fine-tune word embeddings on the target task (Schnabel+ 2015)
     T Schnabel, I Labutov, D Mimno, T Joachims. 2015. Evaluation methods for unsupervised word embeddings. In EMNLP 2015, pp. 298-307.
  30. fastText (Bojanowski+ 2017)
     - SGNS and GloVe are unaware of the internal letters of words
     - fastText extends SGNS to consider letter n-grams (subword units)
       - E.g., <offer> → <of, off, ffe, fer, er>
       - The word vector is the sum of the word's own vector and its subword vectors; the update procedure is the same as in SGNS
     - The use of subword units is also effective in machine translation
     P Bojanowski, E Grave, A Joulin, T Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics (TACL), 5:135-146.
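     A sketch of extracting the character n-grams of a word wrapped in the boundary markers < and >; with n = 3 it reproduces the <of, off, ffe, fer, er> example above (the fastText paper uses n = 3 to 6).

     def char_ngrams(word, n_min=3, n_max=6):
         """Character n-grams of a word wrapped in boundary symbols < and >."""
         token = "<" + word + ">"
         return [token[i:i + n]
                 for n in range(n_min, n_max + 1)
                 for i in range(len(token) - n + 1)]

     print(char_ngrams("offer", 3, 3))  # ['<of', 'off', 'ffe', 'fer', 'er>']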
  31. Comparison between SGNS and fastText (Bojanowski+ 2017)
     - fastText (sisg) favors syntactic analogies more than semantic analogies
     - fastText (sisg) outperforms the other methods except on WS353 in English
     P Bojanowski, E Grave, A Joulin, T Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics (TACL), 5:135-146.
  32. Summary
     - Word embeddings capture syntactic and semantic information to some extent
     - The underlying idea is the distributional hypothesis
       - You shall know a word by the company it keeps
       - You shall know a word by predicting its company
     - There is no almighty word embedding for all downstream tasks
     - Next question: can we represent a phrase/sentence with a vector?