Word Embeddings
Naoaki Okazaki
School of Computing, Tokyo Institute of Technology

Deep Neural Networks (DNNs) and Natural Language Processing (NLP)
DNNs made breakthroughs in speech processing and computer vision
  Reduced the error rate of image recognition by more than 10% (ILSVRC 2012)
At first, DNNs had limited impact on NLP
  Natural languages have symbols that represent semantic information
Recently, DNNs have been applied successfully to various tasks
  DNNs achieve state-of-the-art performance on most NLP tasks
  DNNs learn vector representations of text and generate text (e.g., a sequence of words) from the representations
[Figure: the phrase "very good movie" passes through three stages, with the Japanese output とても よい 映画 ("very good movie")]
  Word embeddings (representing a word as a vector)
  Semantic composition (computing the vector of phrases from constituent words)
  Encoder-decoder model (generating a sequence of words from the composed vector)

Word embedding
Represents a word with a vector of real numbers
Embeds a word into a neural network
Expresses semantic and syntactic aspects of a word
[Figure: each word of "very good movie" is mapped to a real-valued vector]

Distributed representation (Hinton+ 1986)
Local representation
  Assigns a unit (neuron, dimension, symbol) to every concept
Distributed representation
  Each concept is represented by multiple units (micro-features)
  Each unit is involved in representing multiple concepts
[Figure: units labelled #249, #809, and #18329 among many units]

Limitation of dictionary: compositionality
http://wordnetweb.princeton.edu/perl/webwn?s=apple+tree
Neither "apple tea" nor "apple production" is in the dictionary

Distributional hypothesis (Harris 1954; Firth 1957)
"You shall know a word by the company it keeps"
Contexts of "beer":
  … packed with people drinking beer or wine. Many restaurants …
  into alcoholic drinks such as beer or hard liquor and derive …
  … in miles per hour, pints of beer, and inches for clothes. M…
  …ns and for pints for draught beer, cider, and milk sales. The
  carbonated beverages such as beer and soft drinks in non-ref…
  …g of a few young people to a beer blast or fancy formal part…
  …c and alcoholic drinks, like beer and mead, contributed to a…
  People are depicted drinking beer, listening to music, flirt…
  … and for the pint of draught beer sold in pubs (see Metricat…
Contexts of "wine":
  … ith people drinking beer or wine. Many restaurants can be f…
  …gan to drink regularly, host wine parties and consume prepar…
  principal grapes for the red wines are the grenache, mourved…
  … four or more glasses of red wine per week had a 50 percent …
  …e would drink two bottles of wine in an evening. According t…
  …. Teran is the principal red wine grape in these regions. In…
  …a beneficial compound in red wine that other types of alcohol …
  Colorino and even the white wine grapes like Trebbiano and …
  In Shakespearean theatre, red wine was used in a glass contai…
Z Harris. 1954. Distributional structure. Word, 10(2-3):146-162.
J Firth. 1957. A synopsis of linguistic theory 1930-1955. In Studies in Linguistic Analysis, pp. 1-32.

Word-context matrix
Rows: words; columns: context words

        have   new  drink  bottle  ride  speed  read
beer      36    14     72      57     3      0     1
wine     108    14     92      86     0      1     2
car      578   284      3       2    37     44     3
train    291    94      3       0    72     43     2
book     841   201      0       0     2      1   338

$m_{i,j}$: frequency of co-occurrences of the word $i$ with context word $j$ (for example, "train" co-occurred with "drink" three times)
Each row vector represents the meaning of a word (for example, the first row represents the meaning of the word "beer")
Context: words appearing within ±$h$ word offsets of the target word
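
A minimal sketch of how such co-occurrence counts can be collected, assuming a tokenized corpus and a symmetric window of ±h words. The toy corpus and the function name `cooccurrence_counts` are made up for illustration.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, h=2):
    """Count how often each context word appears within +/-h positions of each target word."""
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - h)
        hi = min(len(tokens), i + h + 1)
        for j in range(lo, hi):
            if j != i:  # the target word is not its own context
                counts[word][tokens[j]] += 1
    return counts

# Toy corpus (hypothetical, for illustration only).
tokens = "people drink beer and people drink wine".split()
counts = cooccurrence_counts(tokens, h=2)
print(dict(counts["beer"]))
```

Each `counts[word]` plays the role of one row of the word-context matrix; missing entries are implicit zeros.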

Measure the similarity of two vectors with cos
Given two vectors $u$ and $v$ whose angle is $\theta$,
  $u \cdot v = \|u\| \|v\| \cos\theta$
Therefore,
  $\cos\theta = \frac{u \cdot v}{\|u\| \|v\|}$
The value of $\cos\theta$ is,
  $\theta \to 0$ (same direction): $\cos\theta \to +1$
  $\theta \to \pi/2$ (orthogonal): $\cos\theta \to 0$
  $\theta \to \pi$ (opposite direction): $\cos\theta \to -1$
In this way, $\cos\theta$ can measure the similarity of two vectors within the range $[-1, +1]$
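
The three limiting cases can be checked with a small sketch. The vectors below are arbitrary illustrative values, and the function assumes neither vector is all zeros.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same = cosine([1.0, 2.0], [2.0, 4.0])        # same direction
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])  # right angle
opposite = cosine([1.0, 2.0], [-1.0, -2.0])  # opposite direction
print(same, orthogonal, opposite)
```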

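Latent Semantic Analysis (LSA) (Deerwester+ 1990)
Singular Value Decomposition (SVD) on the word-context matrix $M$ ($m \times n$)
  $M = U \Sigma V^\top$
  $U$ ($m \times m$): unitary matrix
  $\Sigma$ ($m \times n$): diagonal matrix with singular values
  $V^\top$ ($n \times n$): unitary matrix
Truncate $\Sigma$ with $d$ singular values
  $M_d = U_d \Sigma_d V_d^\top$ ($d$-rank approximation)
  ($M_d$ is a minimizer of $\|M - M_d\|_F$ among rank-$d$ matrices)
Use $U_d \Sigma_d$ as $d$-dimensional vectors
  $M M^\top = U \Sigma V^\top (U \Sigma V^\top)^\top = U \Sigma V^\top V \Sigma^\top U^\top = (U \Sigma)(U \Sigma)^\top$
  Inner product of $U \Sigma$ is equal to that of $M$
S Deerwester, S Dumais, G Furnas, T Landauer, R Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407.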

Low-rank approximation by SVD (rank 3)
[Figure: SVD on the original matrix and its 3-rank approximation for the words beer, wine, car, train, and book; truncating with three singular values uses up to three columns and up to three rows of the factor matrices]
cos(beer, wine) = 0.96
cos(beer, train) = 0.37
Truncated SVD (Halko+ 2011) finds the top singular values of the matrix efficiently (for example, sklearn.decomposition.TruncatedSVD)
N Halko, P G Martinsson, and J A Tropp. 2011. Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review, 53(2), 217-288.
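
A minimal sketch of LSA with NumPy, assuming a small dense matrix of made-up counts (4 words by 5 context words). It uses the exact `numpy.linalg.svd` rather than a randomized truncated SVD, and the helper names `lsa_vectors` and `low_rank` are hypothetical.

```python
import numpy as np

def lsa_vectors(M, d):
    """Return one d-dimensional vector per row of M (the left singular vectors scaled by singular values)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :d] * s[:d]  # scales column j of U by the j-th singular value

def low_rank(M, d):
    """Return the rank-d approximation of M built from the top d singular values."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :d] * s[:d]) @ Vt[:d, :]

# Made-up counts for illustration only.
M = np.array([
    [8.0, 6.0, 0.0, 1.0, 2.0],
    [7.0, 9.0, 1.0, 0.0, 3.0],
    [0.0, 1.0, 9.0, 7.0, 2.0],
    [1.0, 0.0, 6.0, 8.0, 3.0],
])
W_full = lsa_vectors(M, d=4)  # no truncation: inner products match those of M
W_2 = lsa_vectors(M, d=2)     # 2-dimensional word vectors
M_2 = low_rank(M, d=2)
print(W_2)
```

With no truncation the inner products between rows are preserved exactly; with truncation, the Frobenius error of the approximation equals the norm of the discarded singular values.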

Skip-gram with Negative Sampling (SGNS) (Mikolov+ 2013)
Each word has a word vector $v_w \in \mathbb{R}^d$ and a context vector $\tilde{v}_c \in \mathbb{R}^d$
Each word vector predicts $2h$ context words (positive examples)
Sample $k$ words as negative words from the unigram distribution
Update vectors such that word vectors do not predict the negative words
[Figure: a corpus window containing draught, offer, pubs, beer, cider, and, wine, with positive context words and sampled negative words (last, use, place, people, make, ...) connected to the target word by the update rule]
T Mikolov, I Sutskever, K Chen, G Corrado, and J Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS 2013, pp. 3111–3119.

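SGD algorithm for updating vectors
Initialization:
  $t \leftarrow 0$
  Word vectors ($v_w$): initialize with random values in $[0, 1]$
  Context vectors ($\tilde{v}_c$): initialize with zero
Repeat from the head to the tail of the training corpus:
  $t \leftarrow t + 1$
  Learning rate $\alpha = \alpha_0 \left(1 - \frac{t}{T + 1}\right)$ ($T$: number of words in the training corpus)
  For each $c$ connected with the target word $w$:
    $g = y - \sigma(v_w \cdot \tilde{v}_c)$, where $y = 1$ if $c$ is positive and $y = 0$ if $c$ is negative
      Positive: pushes the inner product $v_w \cdot \tilde{v}_c \to +\infty$
      Negative: pushes the inner product $v_w \cdot \tilde{v}_c \to -\infty$
    $v_w \leftarrow v_w + \alpha g \tilde{v}_c$
    $\tilde{v}_c \leftarrow \tilde{v}_c + \alpha g v_w$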
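
The update loop can be sketched in plain Python. This is a simplified illustration, not the word2vec implementation: the toy corpus, the hyperparameter values, the multiple passes (`epochs`), and the name `train_sgns` are assumptions, and negatives are drawn by picking a random corpus position, which samples from the unigram distribution.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_sgns(tokens, dim=8, h=2, k=3, alpha0=0.05, epochs=20, seed=0):
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    word_vec = {w: [rng.random() for _ in range(dim)] for w in vocab}  # random values in [0, 1)
    ctx_vec = {w: [0.0] * dim for w in vocab}                          # zeros
    total_steps = epochs * len(tokens)
    t = 0
    for _ in range(epochs):
        for i, w in enumerate(tokens):
            t += 1
            alpha = alpha0 * (1.0 - t / (total_steps + 1))  # linearly decaying learning rate
            for j in range(max(0, i - h), min(len(tokens), i + h + 1)):
                if j == i:
                    continue
                # One positive pair (label 1) plus k sampled negatives (label 0).
                pairs = [(tokens[j], 1.0)]
                pairs += [(rng.choice(tokens), 0.0) for _ in range(k)]
                for c, y in pairs:
                    v, u = word_vec[w], ctx_vec[c]
                    g = y - sigmoid(sum(a * b for a, b in zip(v, u)))
                    for n in range(dim):
                        v_old = v[n]  # keep the old value so both updates use pre-update vectors
                        v[n] += alpha * g * u[n]
                        u[n] += alpha * g * v_old
    return word_vec, ctx_vec

tokens = "pubs offer draught beer cider and wine".split() * 5
word_vec, ctx_vec = train_sgns(tokens)
print(word_vec["beer"])
```

The sketch does not skip negatives that happen to coincide with the positive word, which a practical implementation would likely do.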

Demo with word vectors
English: GoogleNews-vectors-negative300.bin.gz
  Trained on Google News dataset (100B words)
  https://code.google.com/archive/p/word2vec/
Japanese: (trained by the author)
  Trained on Japanese Wikipedia articles (400M words)
Use gensim for manipulating them in Python
  https://github.com/chokkan/deeplearning/blob/master/notebook/word2vec_ja.ipynb
  https://github.com/chokkan/deeplearning/blob/master/notebook/word2vec_en.ipynb

Evaluation on the word analogy task (Mikolov+ 2013)
Example of semantic analogy: Athens : Greece = Tokyo : Japan
Example of syntactic analogy: cool : cooler = deep : deeper

Word vectors exhibit additive composition
Famous example: king − man + woman ≈ queen (Mikolov+ 2013)
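
A small sketch of how an analogy is answered with vector arithmetic and cosine similarity. The 2-dimensional vectors are hand-made so that the arithmetic works out; they are not trained embeddings, and the function name `analogy` is hypothetical. Excluding the three query words from the candidates is a common convention assumed here.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, a, b, c):
    """Solve a : b = c : ? by ranking words by cosine to vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]  # query words are excluded
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hand-made toy vectors: first axis roughly "royalty", second axis roughly "gender".
vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.1],
}
answer = analogy(vectors, "man", "king", "woman")  # king - man + woman
print(answer)
```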

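The objective function of SGNS
The objective function (MLE)
  $\ell = -\sum_{w \in D} \sum_{c \in C_w} \log p(c|w)$
  $D$: corpus (sequence of words)
  $C_w$: a set of words appearing within the offset ±$h$ from the word $w$
$p(c|w)$ is modeled by softmax
  Probability to predict $c \in V$ from $w$ ($V$: vocabulary):
  $p(c|w) = \frac{\exp(v_w \cdot \tilde{v}_c)}{\sum_{c' \in V} \exp(v_w \cdot \tilde{v}_{c'})}$
  Too heavy computation, as this requires the sum over exponentials of inner products between the word $w$ and all words $c' \in V$
Approximate $\log p(c|w)$ with logistic regressions
  $\log p(c|w) \approx \log \sigma(v_w \cdot \tilde{v}_c) + k \, \mathbb{E}_{r \sim P_n} \left[ \log \sigma(-v_w \cdot \tilde{v}_r) \right]$
  Sample a word $r$ from the unigram distribution $P_n$ ($k$ times)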
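
The two quantities can be compared on toy numbers. This is a sketch under stated assumptions: the vectors and the unigram distribution are made up, and the expectation over negative words is computed exactly instead of being sampled. The function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def log_sigmoid(x):
    # log(1 / (1 + exp(-x))) written with log1p for accuracy
    return -math.log1p(math.exp(-x))

def softmax_log_prob(v_w, ctx_vecs, c):
    """Exact log p(c|w): needs a sum over the whole vocabulary."""
    log_z = math.log(sum(math.exp(dot(v_w, v_c)) for v_c in ctx_vecs.values()))
    return dot(v_w, ctx_vecs[c]) - log_z

def negative_sampling_log_prob(v_w, ctx_vecs, c, unigram, k):
    """Logistic-regression surrogate: one positive term plus k times the expected negative term."""
    expectation = sum(p * log_sigmoid(-dot(v_w, ctx_vecs[r])) for r, p in unigram.items())
    return log_sigmoid(dot(v_w, ctx_vecs[c])) + k * expectation

# Made-up values for illustration only.
v_beer = [0.5, 1.0]
ctx_vecs = {"drink": [1.0, 1.0], "ride": [-1.0, 0.5], "read": [0.0, -1.0]}
unigram = {"drink": 0.5, "ride": 0.3, "read": 0.2}

exact = softmax_log_prob(v_beer, ctx_vecs, "drink")
approx = negative_sampling_log_prob(v_beer, ctx_vecs, "drink", unigram, k=2)
print(exact, approx)
```

The surrogate is not a normalized log-probability, so the two values are not expected to coincide; the point is that the second one avoids the vocabulary-wide normalizer when the expectation is estimated from a few samples.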

SGNS is equivalent to Shifted PMI (Levy+ 2014)
SGNS models a co-occurrence matrix $M$,
  $M_{w,c} = \mathrm{PMI}(w, c) - \log k \approx v_w \cdot \tilde{v}_c$
  Subtracting $\log k$ shifts PMI in the negative direction
This is similar to training word vectors by building a co-occurrence matrix by using PMI
  The previous approach (PMI) could also realize additive composition
O Levy and Y Goldberg. 2014. Neural word embedding as implicit matrix factorization. In NIPS 2014, pp. 2177–2185.
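
A minimal sketch of computing shifted PMI from co-occurrence counts. The counts are made up, the function name `shifted_pmi` is hypothetical, and only observed pairs are returned because PMI is undefined (minus infinity) for pairs that never co-occur.

```python
import math

def shifted_pmi(pair_counts, k=1):
    """Map each observed (word, context) pair to PMI(w, c) - log k."""
    total = sum(pair_counts.values())  # number of observed pairs
    word_totals, ctx_totals = {}, {}
    for (w, c), n in pair_counts.items():
        word_totals[w] = word_totals.get(w, 0) + n
        ctx_totals[c] = ctx_totals.get(c, 0) + n
    return {
        (w, c): math.log(n * total / (word_totals[w] * ctx_totals[c])) - math.log(k)
        for (w, c), n in pair_counts.items()
    }

# Made-up counts for illustration only.
pair_counts = {
    ("beer", "drink"): 8, ("beer", "ride"): 2,
    ("car", "drink"): 2, ("car", "ride"): 8,
}
pmi = shifted_pmi(pair_counts, k=1)   # plain PMI
spmi = shifted_pmi(pair_counts, k=5)  # shifted by log 5
print(pmi[("beer", "drink")], spmi[("beer", "drink")])
```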

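Derivation of Shifted PMI (Levy+ 2014)
The objective function of SGNS ($\#(w, c)$: co-occurrence frequency of $w$ with $c$; $\#(w)$: frequency of $w$),
  $\ell = -\sum_{w \in D} \sum_{c \in C_w} \log \sigma(v_w \cdot \tilde{v}_c) - \sum_{w \in D} \sum_{c \in C_w} k \, \mathbb{E}_{r \sim P_n} \left[ \log \sigma(-v_w \cdot \tilde{v}_r) \right]$
  $= -\sum_{w \in V} \sum_{c \in V} \#(w, c) \log \sigma(v_w \cdot \tilde{v}_c) - \sum_{w \in V} \#(w) \cdot k \, \mathbb{E}_{r \sim P_n} \left[ \log \sigma(-v_w \cdot \tilde{v}_r) \right]$
Compute the expectation explicitly,
  $\mathbb{E}_{r \sim P_n} \left[ \log \sigma(-v_w \cdot \tilde{v}_r) \right] = \sum_{r \in V} \frac{\#(r)}{|D|} \log \sigma(-v_w \cdot \tilde{v}_r)$
  $= \frac{\#(c)}{|D|} \log \sigma(-v_w \cdot \tilde{v}_c) + \sum_{r \in V \setminus \{c\}} \frac{\#(r)}{|D|} \log \sigma(-v_w \cdot \tilde{v}_r)$
Extract the portion of the objective function related to $w$ and $c$ (we can ignore the sum over $r \in V \setminus \{c\}$),
  $\ell(w, c) = -\#(w, c) \log \sigma(v_w \cdot \tilde{v}_c) - \#(w) \cdot k \cdot \frac{\#(c)}{|D|} \log \sigma(-v_w \cdot \tilde{v}_c)$
Let $x = v_w \cdot \tilde{v}_c$. Compute the gradient of $\ell(w, c)$ with respect to $x$ by using $\frac{d}{dx} \log \sigma(x) = \sigma(-x)$,
  $\frac{\partial \ell(w, c)}{\partial x} = -\#(w, c) \, \sigma(-x) + k \frac{\#(w) \#(c)}{|D|} \sigma(x)$
Setting the gradient to zero and using $\sigma(x) / \sigma(-x) = e^x$,
  $x = \log \left( \frac{\#(w, c) |D|}{\#(w) \#(c)} \cdot \frac{1}{k} \right) = \log \frac{\#(w, c) |D|}{\#(w) \#(c)} - \log k = \mathrm{PMI}(w, c) - \log k$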
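
The stationary point can be checked numerically. The sketch below assumes made-up counts and a hypothetical function name; it evaluates the derivative of the (w, c) portion of the objective at x = PMI(w, c) − log k and on both sides of it.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def local_gradient(x, n_wc, n_w, n_c, size, k):
    """Derivative with respect to x of -#(w,c) log s(x) - k #(w) #(c) / |D| log s(-x), s = sigmoid."""
    return -n_wc * sigmoid(-x) + k * n_w * n_c / size * sigmoid(x)

# Made-up counts: #(w,c) = 8, #(w) = 10, #(c) = 10, |D| = 20, k = 5 negative samples.
n_wc, n_w, n_c, size, k = 8, 10, 10, 20, 5
x_star = math.log(n_wc * size / (n_w * n_c)) - math.log(k)  # PMI(w, c) - log k

grad_at_optimum = local_gradient(x_star, n_wc, n_w, n_c, size, k)
grad_left = local_gradient(x_star - 1.0, n_wc, n_w, n_c, size, k)
grad_right = local_gradient(x_star + 1.0, n_wc, n_w, n_c, size, k)
print(grad_at_optimum, grad_left, grad_right)
```

The gradient vanishes at the shifted PMI value and changes sign around it, which is consistent with that value being the minimizer of the local objective.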

GloVe (Pennington+ 2014)
Minimize (by AdaGrad)
  $J = \sum_{i,j=1}^{V} f(X_{i,j}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{i,j} \right)^2$
  $f(x) = (x / x_{\max})^\alpha$ (if $x < x_{\max}$), $1$ (otherwise)
  $x_{\max} = 100$, $\alpha = 0.75$
  $X_{i,j}$: co-occurrence frequency between words $i$ and $j$
  $V$: total number of words
  $w_i$: vector of word $i$ (vector #1)
  $\tilde{w}_j$: vector of word $j$ (vector #2)
  $b_i$: bias for word $i$
  $\tilde{b}_j$: bias for word $j$
Similarly to SGNS, each word has two vectors assigned
  This study uses ($w_i + \tilde{w}_i$) after training the vectors (this treatment improves the performance)
J Pennington, R Socher, and C Manning. 2014. GloVe: Global vectors for word representation. In EMNLP 2014, pp. 1532–1543.
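
A minimal sketch of the weighting function and of the loss contribution of a single word pair, using the values 100 and 0.75 from the slide as defaults. The function names and the example numbers are hypothetical, and pairs with zero co-occurrence are simply skipped.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weight of a co-occurrence count: grows as (x / x_max)^alpha and is clipped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_tilde_j, b_i, b_tilde_j, x_ij, x_max=100.0, alpha=0.75):
    """Weighted squared error between the model score and log of the co-occurrence count."""
    if x_ij == 0:
        return 0.0  # unobserved pairs contribute nothing (and log 0 is undefined)
    score = sum(a * b for a, b in zip(w_i, w_tilde_j)) + b_i + b_tilde_j
    return glove_weight(x_ij, x_max, alpha) * (score - math.log(x_ij)) ** 2

rare = glove_weight(1.0)        # small weight for a rare co-occurrence
frequent = glove_weight(500.0)  # clipped to 1 for a frequent co-occurrence
loss = glove_pair_loss([1.0, 0.0], [1.0, 0.0], 0.0, 0.0, x_ij=1.0)
print(rare, frequent, loss)
```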

Rationale of $w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{i,j}$ (1/4) (Pennington+ 2014)
Consider representing a relation of words $i$ and $j$ on an aspect by using a context word $k$
  E.g., relation between ice and steam on thermodynamics
$P_{i,k} / P_{j,k}$ may be more useful than $P_{i,k} = P(k|i)$ to capture the characteristics of words $i$ and $j$
  E.g., solid and gas are more useful than water and fashion

Rationale of $f(X_{i,j})$
We cannot compute $\log X_{i,j}$ when $X_{i,j} = 0$
  Most elements in $X$ are 0 (sparse matrix)
  We ignore unobserved statistics
We should not give much weight to rare co-occurrences
  Hard to reproduce rare co-occurrences with vectors
  Use the weight $(X_{i,j} / x_{\max})^\alpha$ when $X_{i,j} < x_{\max}$
We should not give too much weight to frequent ones
  Treat frequent co-occurrences with the same importance
  Clip the weight to 1 when $X_{i,j} \ge x_{\max}$

[Table (partially recoverable): hyperparameters grouped into preprocessing, association measure, and postprocessing]
  eig: weighted SVs, values in {0, 0.5, 1.0}
  nrm: normalization (*4), values: both, col, row, none
*1: The same weighting method implemented in word2vec
*2: These are set by shifted PPMI
*3: These are implemented by modifying the denominator of PMIs
*4: Normalization for each word vector was the best
O Levy, Y Goldberg, and I Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics (TACL), 3:211-225.

Tips for training word embeddings (Levy+ 2015)
Use context distribution smoothing (cds=0.75)
Use SVD with symmetric variants (eig=0 or 0.5)
No effect with neg > 1 in Shifted PPMI
SGNS is a robust baseline
  It does not underperform in any scenario
  It trains word embeddings the fastest with the lowest memory consumption
More negative samples are better in SGNS
Worth trying w+c (adding word and context vectors) in SGNS and GloVe
  May result in substantial gains (but sometimes in losses)
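
Context distribution smoothing raises each context count to the power cds before normalizing, which moves probability mass from frequent to rare context words. A minimal sketch with made-up counts and a hypothetical function name:

```python
def smoothed_unigram(counts, cds=0.75):
    """Return the distribution proportional to count ** cds (cds=1.0 gives the plain unigram distribution)."""
    powered = {w: n ** cds for w, n in counts.items()}
    z = sum(powered.values())
    return {w: p / z for w, p in powered.items()}

# Made-up counts for illustration only.
counts = {"the": 1000, "beer": 10}
raw = smoothed_unigram(counts, cds=1.0)
smooth = smoothed_unigram(counts, cds=0.75)
print(raw["beer"], smooth["beer"])
```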

Different evaluations favor different embeddings (Schnabel+ 2015)
Task: a human worker chooses the most similar word among the candidates computed by word embeddings
GloVe was poor at adverbs for some reason
CBOW suffers with larger candidate sets (50 nearest neighbors)
T Schnabel, I Labutov, D Mimno, and T Joachims. 2015. Evaluation methods for unsupervised word embeddings. In EMNLP 2015, pp. 298-307.

Different tasks favor different embeddings (Schnabel+ 2015)
No single word embedding is best for all tasks
In order to improve the performance on a task, we should fine-tune word embeddings on the target task

fastText (Bojanowski+ 2017)
SGNS and GloVe are unaware of internal letters in words
Extend SGNS to consider letter $n$-grams (subword units)
  The word vector is the sum of the subword vectors
  The update procedure is the same
The use of subword units is also effective in machine translation
[Figure: subword units (off, ffe, fer, er>) whose subword vectors are summed into the word vector, with context vectors for pubs, draught, show, age, take]
P Bojanowski, E Grave, A Joulin, and T Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics (TACL), 5:135-146.
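
A minimal sketch of extracting letter n-grams with the boundary markers `<` and `>` and summing their vectors. The n-gram range of 3 to 6, the inclusion of the whole marked word as an extra unit, the plain dict standing in for the subword table, and all function names are assumptions made for illustration.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return all letter n-grams of the word wrapped in boundary markers."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]

def subword_units(word, n_min=3, n_max=6):
    """n-grams plus the whole marked word, without duplicates and in a stable order."""
    return list(dict.fromkeys(char_ngrams(word, n_min, n_max) + [f"<{word}>"]))

def word_vector(word, subword_vectors, dim, n_min=3, n_max=6):
    """Sum the vectors of the subword units that are present in the table."""
    total = [0.0] * dim
    for unit in subword_units(word, n_min, n_max):
        vec = subword_vectors.get(unit)
        if vec is not None:
            total = [a + b for a, b in zip(total, vec)]
    return total

trigrams = char_ngrams("offer", 3, 3)
# Hypothetical subword table with two known trigrams.
table = {"<of": [1.0, 0.0], "fer": [0.0, 2.0]}
vec = word_vector("offer", table, dim=2)
print(trigrams, vec)
```

Because the vector is assembled from subword units, a word that never appeared in training can still receive a vector as long as some of its n-grams are known.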

Comparison between SGNS and fastText (Bojanowski+ 2017)
fastText (sisg) favors syntactic analogy more than semantic analogy
fastText (sisg) outperforms the other methods except on WS353 in English

Summary
Word embeddings capture syntactic and semantic information to some extent
The underlying idea is the distributional hypothesis
  You shall know a word by the company it keeps
  You shall know a word by predicting its company
No single word embedding is best for all downstream tasks
Next question
  Can we represent a phrase/sentence with a vector?