Slide 1

Slide 1 text

Word Embeddings Naoaki Okazaki School of Computing, Tokyo Institute of Technology okazaki@c.titech.ac.jp PowerPoint template designed by https://ppt.design4u.jp/template/

Slide 2

Slide 2 text

Deep Neural Networks (DNNs) and Natural Language Processing (NLP)

(Figure: the phrase "very good movie" is translated into Japanese "とても よい 映画" ("very good movie") through three stages)
- Word embeddings: representing a word as a vector
- Semantic composition: computing the vector of a phrase from its constituent words
- Encoder-decoder model: generating a sequence of words from the composed vector

- DNNs made breakthroughs in speech processing and computer vision
  - Reduced the error rate of image recognition by more than 10% (ILSVRC 2012)
- At first, DNNs had limited impact on NLP
  - Natural languages have symbols that represent semantic information
- Recently, DNNs have been applied successfully to various tasks
  - DNNs achieve state-of-the-art performance on most NLP tasks
  - DNNs learn vector representations of text and generate text (e.g., a sequence of words) from those representations

Slide 3

Slide 3 text

Word embedding

(Figure: each word of "very good movie" is mapped to a vector in ℝ^d)
- Represents a word with a vector of real numbers
- Embeds a word into a neural network
- Expresses semantic and syntactic aspects of a word

Slide 4

Slide 4 text

Distributed representation (Hinton+ 1986)

- Local representation
  - Assigns a unit (neuron, dimension, symbol) to every concept (e.g., units #249, #809, #18329 in the figure)
- Distributed representation
  - Each concept is represented by multiple units (micro-features)
  - Each unit commits to multiple concepts

Slide 5

Slide 5 text

Lexical dictionary

Slide 6

Slide 6 text

Lexical dictionary
http://wordnetweb.princeton.edu/perl/webwn?s=bass

Slide 7

Slide 7 text

Limitation of dictionary: named entities
http://wordnetweb.princeton.edu/perl/webwn?s=apple
No sense of “Apple” as a company

Slide 8

Slide 8 text

Limitation of dictionary: neologism
http://wordnetweb.princeton.edu/perl/webwn?s=tweet
No sense of “tweet” as posting a short text

Slide 9

Slide 9 text

Limitation of dictionary: compositionality
http://wordnetweb.princeton.edu/perl/webwn?s=apple+tree
Neither “apple tea” nor “apple production” appears in the dictionary

Slide 10

Slide 10 text

Distributional Hypothesis and Word-Context Matrix

Slide 11

Slide 11 text

Distributional hypothesis (Harris 1954; Firth 1957)

Contexts of "beer" in a corpus:
… packed with people drinking beer or wine. Many restaurants …
… into alcoholic drinks such as beer or hard liquor and derive …
… in miles per hour, pints of beer, and inches for clothes. M…
…ns and for pints for draught beer, cider, and milk sales. The…
… carbonated beverages such as beer and soft drinks in non-ref…
…g of a few young people to a beer blast or fancy formal part…
…c and alcoholic drinks, like beer and mead, contributed to a…
… People are depicted drinking beer, listening to music, flirt…
… and for the pint of draught beer sold in pubs (see Metricat…

Contexts of "wine" in a corpus:
…ith people drinking beer or wine. Many restaurants can be f…
…gan to drink regularly, host wine parties and consume prepar…
… principal grapes for the red wines are the grenache, mourved…
… four or more glasses of red wine per week had a 50 percent …
…e would drink two bottles of wine in an evening. According t…
…. Teran is the principal red wine grape in these regions. In…
…a beneficial compound in red wine that other types of alcohol…
… Colorino and even the white wine grapes like Trebbiano and …
… In Shakespearean theatre, red wine was used in a glass contai…

"You shall know a word by the company it keeps."

Z Harris. 1954. Distributional structure. Word, 10(23):146-162.
J Firth. 1957. A synopsis of linguistic theory 1930-1955. In Studies in Linguistic Analysis, pp. 1-32.

Slide 12

Slide 12 text

Word-context matrix

          have   new  drink  bottle  ride  speed  read
  beer      36    14     72      57     3      0     1
  wine     108    14     92      86     0      1     2
  car      578   284      3       2    37     44     3
  train    291    94      3       0    72     43     2
  book     841   201      0       0     2      1   338

Rows: words; columns: context words, i.e., words appearing within ±h word offsets of the target word.
Each cell holds the frequency of co-occurrence of the word with the context word (for example, "train" co-occurred with "drink" three times).
The row vector of "beer" represents the meaning of the word "beer".
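As a concrete illustration of how such a matrix can be built, here is a minimal Python sketch that counts co-occurrences within a ±h window; the function name, the toy corpus, and the plain-count weighting are illustrative choices, not part of the slides.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, h=2):
    """Count, for every word, how often each context word appears
    within +-h positions of it (plain counts, no distance weighting)."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - h), min(len(tokens), i + h + 1)):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

# Toy corpus; each row of the result plays the role of a row of the
# word-context matrix above.
corpus = [["people", "drink", "beer"], ["people", "read", "a", "book"]]
print(dict(cooccurrence_counts(corpus, h=2)["beer"]))  # {'people': 1, 'drink': 1}
```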

Slide 13

Slide 13 text

Measure the similarity of two vectors with cos θ

Given two vectors u and v whose angle is θ,
  u ⋅ v = ‖u‖ ‖v‖ cos θ
Therefore,
  cos θ = (u ⋅ v) / (‖u‖ ‖v‖)
The value of cos θ behaves as follows:
- θ → 0 (same direction): cos θ → +1
- θ → π/2 (orthogonal): cos θ → 0
- θ → π (opposite direction): cos θ → −1
In this way, cos θ can measure the similarity of two vectors within the range [−1, +1].

Slide 14

Slide 14 text

Let’s compute cosine similarity

Using the row vectors of the word-context matrix (contexts: have, new, drink, bottle, ride, speed, read):
  beer = (36, 14, 72, 57, 3, 0, 1), wine = (108, 14, 92, 86, 0, 1, 2), train = (291, 94, 3, 0, 72, 43, 2)

  cos θ = (u ⋅ v) / (‖u‖ ‖v‖)

Cosine similarity between "beer" and "wine":
  cos θ = (36×108 + 14×14 + 72×92 + 57×86 + 3×0 + 0×1 + 1×2)
          / (√(36² + 14² + 72² + 57² + 3² + 0² + 1²) √(108² + 14² + 92² + 86² + 0² + 1² + 2²))
        = 0.941

Cosine similarity between "beer" and "train":
  cos θ = (36×291 + 14×94 + 72×3 + 57×0 + 3×72 + 0×43 + 1×2)
          / (√(36² + 14² + 72² + 57² + 3² + 0² + 1²) √(291² + 94² + 3² + 0² + 72² + 43² + 2²))
        = 0.387
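The two values above can be checked with a few lines of NumPy; the row vectors follow the context order used in the reconstructed matrix (have, new, drink, bottle, ride, speed, read).

```python
import numpy as np

beer  = np.array([ 36, 14, 72, 57,  3,  0, 1], dtype=float)
wine  = np.array([108, 14, 92, 86,  0,  1, 2], dtype=float)
train = np.array([291, 94,  3,  0, 72, 43, 2], dtype=float)

def cos(u, v):
    """Cosine similarity: (u . v) / (|u| |v|)."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(round(cos(beer, wine), 3))   # 0.941
print(round(cos(beer, train), 3))  # 0.387
```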

Slide 15

Slide 15 text

Positive Pointwise Mutual Information (PPMI) (Bullinaria+ 2007)

  PPMI(w, c) = max(0, log [P(w, c) / (P(w) P(c))])
             = max(0, log #(w, c) + log #(∗, ∗) − log #(w, ∗) − log #(∗, c))

where
  P(w, c) = #(w, c) / #(∗, ∗),  P(w) = #(w, ∗) / #(∗, ∗),  P(c) = #(∗, c) / #(∗, ∗),
  #(w, ∗) = Σ_c #(w, c),  #(∗, c) = Σ_w #(w, c),  #(∗, ∗) = Σ_{w,c} #(w, c)

PPMI discounts frequent words and frequent context words.

PPMI-transformed word-context matrix:

          have   new  drink  bottle  ride  speed  read
  beer    0      0     2.04    1.97  0     0      0
  wine    0      0     1.78    1.87  0     0      0
  car     0.09   0.49  0       0     0.13  0.55   0
  train   0.03   0.02  0       0     1.43  1.16   0
  book    0.09   0     0       0     0     0      0.85

cos(beer, wine) = 0.99 > 0.941
cos(beer, train) = 0.00 < 0.387

J Bullinaria and J Levy. 2007. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39:510–526.
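The PPMI matrix above can be reproduced from the raw counts with NumPy; natural logarithms give the values shown on the slide (the log base is not stated there, so it is inferred from the numbers).

```python
import numpy as np

# Raw counts (rows: beer, wine, car, train, book;
# columns: have, new, drink, bottle, ride, speed, read).
X = np.array([
    [ 36,  14, 72, 57,  3,  0,   1],
    [108,  14, 92, 86,  0,  1,   2],
    [578, 284,  3,  2, 37, 44,   3],
    [291,  94,  3,  0, 72, 43,   2],
    [841, 201,  0,  0,  2,  1, 338],
], dtype=float)

total = X.sum()                      # #(*,*)
row = X.sum(axis=1, keepdims=True)   # #(w,*)
col = X.sum(axis=0, keepdims=True)   # #(*,c)

with np.errstate(divide="ignore"):   # log(0) -> -inf, clipped by the max below
    pmi = np.log(X * total / (row * col))
ppmi = np.maximum(pmi, 0)

print(ppmi.round(2)[0])   # beer row: 0, 0, 2.04, 1.97, 0, 0, 0
```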

Slide 16

Slide 16 text

Latent Semantic Analysis (LSA) (Deerwester, 1990)

- Singular Value Decomposition (SVD) on X: X = U Σ V^T
  - U: unitary matrix; Σ: diagonal matrix with singular values; V^T: unitary matrix
- Truncate Σ to the top k singular values: X_k = U_k Σ_k V_k^T (k-rank approximation)
  - (X_k is a minimizer of ‖X − Y‖ among rank-k matrices Y)
- Use U_k Σ_k as k-dimensional word vectors
  - X_k X_k^T = U_k Σ_k V_k^T (U_k Σ_k V_k^T)^T = (U_k Σ_k)(U_k Σ_k)^T
  - The inner products of the rows of U_k Σ_k are equal to those of the rows of X_k

S Deerwester, S Dumais, G Furnas, T Landauer, R Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407.

Slide 17

Slide 17 text

Low-rank approximation by SVD (k = 3)

(Figure: SVD of the original word-context matrix and its 3-rank approximation, keeping only the top three singular values, i.e., the first three columns of U and the first three rows of Σ V^T; rows: beer, wine, car, train, book)

cos(beer, wine) = 0.96
cos(beer, train) = 0.37

Truncated SVD (Halko, 2011) finds the top-k singular values of a matrix efficiently (for example, sklearn.decomposition.TruncatedSVD).

N Halko, P G Martinsson, and J A Tropp. 2011. Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review, 53(2), 217-288.
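A minimal sketch with scikit-learn's TruncatedSVD (which implements the randomized algorithm of Halko et al.); whether the slide decomposed the raw counts or the PPMI matrix is not stated, so the similarity values printed here may differ slightly from the figures above.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Word-context count matrix from the running example.
X = np.array([
    [ 36,  14, 72, 57,  3,  0,   1],
    [108,  14, 92, 86,  0,  1,   2],
    [578, 284,  3,  2, 37, 44,   3],
    [291,  94,  3,  0, 72, 43,   2],
    [841, 201,  0,  0,  2,  1, 338],
], dtype=float)

svd = TruncatedSVD(n_components=3, random_state=0)
vectors = svd.fit_transform(X)   # rows of U_k Sigma_k: 3-dimensional word vectors

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

beer, wine, car, train, book = vectors
print(cos(beer, wine), cos(beer, train))
```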

Slide 18

Slide 18 text

word2vec

Slide 19

Slide 19 text

Skip-gram with Negative Sampling (SGNS) (Mikolov+ 2013)

- Each word has a word vector v_w ∈ ℝ^d and a context vector ṽ_c ∈ ℝ^d
- Each word vector predicts the 2h context words surrounding the target word in the corpus (positive examples; e.g., "draught", "pubs", "cider" around "beer")
- Sample k words as negative examples from the unigram distribution
- Update the vectors so that the word vector predicts its context words but does not predict the negative words

T Mikolov, I Sutskever, K Chen, G Corrado, and J Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS 2013, pp. 3111–3119.

Slide 20

Slide 20 text

SGD algorithm for updating vectors

- Initialization:
  - t ← 0
  - Word vectors v_w: initialize with random values in [0, 1]
  - Context vectors ṽ_c: initialize with zero
- Repeat from the head to the tail of the training corpus:
  - t ← t + 1
  - Learning rate: η_t = η_0 (1 − t / (T + 1)), where T is the total number of training tokens
  - For each context word and negative sample c connected with the target word w:
    - g = 1 − σ(v_w ⋅ ṽ_c) for a context word (pushes the inner product toward +∞)
    - g = −σ(v_w ⋅ ṽ_c) for a negative sample (pushes the inner product toward −∞)
    - v_w ← v_w + η_t g ṽ_c
    - ṽ_c ← ṽ_c + η_t g v_w
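A minimal NumPy sketch of one such update step, assuming the standard word2vec convention (label 1 for a true context word, 0 for a negative sample, so g = label − σ(v_w ⋅ ṽ_c)); variable names and the toy dimensionality are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(v_w, ctx_vecs, labels, eta):
    """One update for a target word vector and its positive/negative contexts.

    For each pair, g = label - sigmoid(v_w . v_c); the context vector is moved
    along v_w immediately, while the gradient for v_w is accumulated and
    applied at the end (as in the word2vec implementation)."""
    grad_w = np.zeros_like(v_w)
    for c, label in zip(ctx_vecs, labels):
        g = label - sigmoid(v_w @ c)
        grad_w += g * c
        c += eta * g * v_w        # context vector update
    v_w += eta * grad_w           # word vector update

# Toy usage: d = 5, one context word and two negative samples.
rng = np.random.default_rng(0)
v_w = rng.random(5)                         # word vector: random values in [0, 1)
contexts = [np.zeros(5) for _ in range(3)]  # context vectors: initialized to zero
sgns_step(v_w, contexts, labels=[1, 0, 0], eta=0.025)
print(v_w, contexts[0])
```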

Slide 21

Slide 21 text

Demo with word vectors

- English: GoogleNews-vectors-negative300.bin.gz
  - Trained on the Google News dataset (100B words)
  - https://code.google.com/archive/p/word2vec/
- Japanese: (trained by me)
  - Trained on Japanese Wikipedia articles (400M words)
- Use gensim for manipulating them in Python
  - https://github.com/chokkan/deeplearning/blob/master/notebook/word2vec_ja.ipynb
  - https://github.com/chokkan/deeplearning/blob/master/notebook/word2vec_en.ipynb
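A minimal gensim sketch of the demo that the next slides walk through (similarity, nearest neighbours, analogy); it assumes the Google News model file above has already been downloaded to the working directory.

```python
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True)

print(wv.similarity("beer", "wine"))       # cosine similarity of two words
print(wv.most_similar("beer", topn=5))     # finding similar words
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# word analogy: king - man + woman, expected to be close to "queen"
```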

Slide 22

Slide 22 text

Plotted word vectors

Slide 23

Slide 23 text

Measure the similarity of word vectors

Slide 24

Slide 24 text

Finding similar words

Slide 25

Slide 25 text

Word analogy

Slide 26

Slide 26 text

Evaluation on the word analogy task (Mikolov+ 2013)

- Example of a semantic analogy: Athens : Greece = Tokyo : Japan
- Example of a syntactic analogy: cool : cooler = deep : deeper

T Mikolov, I Sutskever, K Chen, G Corrado, and J Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS 2013, pp. 3111–3119.

Slide 27

Slide 27 text

Word vectors exhibit additive composition

Famous example: king − man + woman ≈ queen (Mikolov+ 2013)

T Mikolov, I Sutskever, K Chen, G Corrado, and J Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS 2013, pp. 3111–3119.

Slide 28

Slide 28 text

The objective function of SGNS

- The objective function (MLE):
    L = − Σ_{w∈D} Σ_{c∈C(w)} log P(c|w)
  where D is the corpus (a sequence of words) and C(w) is the set of words appearing within the offset ±h from the word w.
- P(c|w) is modeled by softmax:
    P(c|w) = exp(v_w ⋅ ṽ_c) / Σ_{c′∈V} exp(v_w ⋅ ṽ_c′)
  This is too heavy to compute, as it requires the sum over exponentials of inner products between the word and all words c′ ∈ V.
- Approximate log P(c|w) with logistic regressions:
    log P(c|w) ≈ log σ(v_w ⋅ ṽ_c) + k E_{c̃∼P} [log σ(−v_w ⋅ ṽ_c̃)]
  where the negative words c̃ are sampled from the unigram distribution P (k times).
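A small NumPy sketch of the approximated term for a single (w, c) pair, with the expectation replaced by an average over k sampled negative vectors; the scaling by k and the random toy vectors are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(v_w, v_c, neg_vecs):
    """Negative-sampling approximation of log P(c|w) for one (w, c) pair."""
    k = len(neg_vecs)
    positive = np.log(sigmoid(v_w @ v_c))
    negative = np.mean([np.log(sigmoid(-v_w @ v)) for v in neg_vecs])
    return positive + k * negative

rng = np.random.default_rng(0)
d, k = 50, 5
print(sgns_objective(rng.normal(scale=0.1, size=d),
                     rng.normal(scale=0.1, size=d),
                     [rng.normal(scale=0.1, size=d) for _ in range(k)]))
```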

Slide 29

Slide 29 text

SGNS is equivalent to Shifted PMI (Levy+ 2014)

- SGNS implicitly models a co-occurrence matrix M with
    M_{w,c} = PMI(w, c) − log k ≈ v_w ⋅ ṽ_c
  (PMI shifted in the negative direction by log k)
- This is similar to training word vectors by building a co-occurrence matrix using PMI
- The previous approach (PMI) could also realize additive composition

O Levy and Y Goldberg. 2014. Neural word embedding as implicit matrix factorization. In NIPS 2014, pp. 2177–2185.
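Following this view, the matrix that SGNS factorizes can be built directly from counts; the sketch below clips negative values at zero, which is the shifted positive PMI (SPPMI) variant proposed in the same paper rather than the exact factorized matrix.

```python
import numpy as np

def shifted_ppmi(X, k=5):
    """max(PMI(w, c) - log k, 0) from a word-context count matrix X;
    k plays the role of the number of negative samples in SGNS."""
    total = X.sum()
    row = X.sum(axis=1, keepdims=True)
    col = X.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore"):
        pmi = np.log(X * total / (row * col))
    return np.maximum(pmi - np.log(k), 0)
```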

Slide 30

Slide 30 text

Derivation of Shifted PMI (Levy+ 2014)

The objective function of SGNS (#(w, c): co-occurrence frequency of w with c; #(w), #(c): frequencies of w and c; |D|: number of tokens in the corpus):

  L = − Σ_{w∈D} Σ_{c∈C(w)} [ log σ(v_w ⋅ ṽ_c) + k E_{c̃∼P} log σ(−v_w ⋅ ṽ_c̃) ]
    = − Σ_w Σ_c #(w, c) log σ(v_w ⋅ ṽ_c) − Σ_w #(w) ⋅ k E_{c̃∼P} log σ(−v_w ⋅ ṽ_c̃)

Compute the expectation explicitly:

  E_{c̃∼P} [log σ(−v_w ⋅ ṽ_c̃)]
    = Σ_{c̃∈V} (#(c̃)/|D|) log σ(−v_w ⋅ ṽ_c̃)
    = (#(c)/|D|) log σ(−v_w ⋅ ṽ_c) + Σ_{c̃∈V∖{c}} (#(c̃)/|D|) log σ(−v_w ⋅ ṽ_c̃)

Extract the portion of the objective function related to w and c (we can ignore the rest):

  L(w, c) = −#(w, c) log σ(v_w ⋅ ṽ_c) − k ⋅ #(w) ⋅ (#(c)/|D|) log σ(−v_w ⋅ ṽ_c)

Let x = v_w ⋅ ṽ_c. Compute the gradient of L(w, c) with respect to x by using (log σ(x))′ = σ(−x) = 1 − σ(x):

  ∂L(w, c)/∂x = −#(w, c) (1 − σ(x)) + k ⋅ #(w) ⋅ (#(c)/|D|) σ(x)
              = #(w, c) [ σ(x) (1 + k #(w) #(c) / (#(w, c) |D|)) − 1 ]

Find the point where the gradient is zero:

  σ(x) (1 + k #(w) #(c) / (#(w, c) |D|)) = 1
  ⇔ (1 + k #(w) #(c) / (#(w, c) |D|)) ⋅ 1/(1 + e^{−x}) = 1
  ⇔ e^{−x} = k #(w) #(c) / (#(w, c) |D|)

Therefore (assuming |D| = #(∗, ∗)):

  x = v_w ⋅ ṽ_c = log ( #(w, c) |D| / (k #(w) #(c)) )
    = log ( #(w, c) #(∗, ∗) / (#(w) #(c)) ) − log k
    = PMI(w, c) − log k

O Levy and Y Goldberg. 2014. Neural word embedding as implicit matrix factorization. In NIPS 2014, pp. 2177–2185.

Slide 31

Slide 31 text

GloVe

Slide 32

Slide 32 text

GloVe (Pennington+ 2014)

Minimize:

  J = Σ_{i,j=1}^{|V|} f(X_ij) (w_i ⋅ w̃_j + b_i + b̃_j − log X_ij)²

  f(x) = (x / x_max)^α  (if x < x_max);  1  (otherwise)

- X_ij: co-occurrence frequency between words i and j
- |V|: total number of words (vocabulary size)
- w_i, w̃_j: vector #1 of word i and vector #2 of word j
- b_i, b̃_j: biases for words i and j
- x_max = 100, α = 0.75; J is minimized by AdaGrad
- Similarly to SGNS, each word has two vectors assigned; this study uses (w_i + w̃_i) after training the vectors (this treatment improves the performance)

J Pennington, R Socher, and C Manning. 2014. Glove: Global vectors for word representation. In EMNLP-2014, pp. 1532–1543.
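The weighting function and the contribution of one (i, j) pair to J can be written down directly; the toy vectors and the zero biases below are illustrative, and in actual training every w_i, w̃_j, b_i, b̃_j would be optimized by AdaGrad.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x) = (x / x_max)^alpha if x < x_max, and 1 otherwise."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_j_tilde, b_i, b_j_tilde, x_ij):
    """f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2 for one pair."""
    return glove_weight(x_ij) * (w_i @ w_j_tilde + b_i + b_j_tilde - np.log(x_ij)) ** 2

rng = np.random.default_rng(0)
w_i, w_j_tilde = rng.normal(scale=0.1, size=50), rng.normal(scale=0.1, size=50)
print(glove_pair_loss(w_i, w_j_tilde, 0.0, 0.0, x_ij=25.0))
```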

Slide 33

Slide 33 text

Rationale of w_i ⋅ w̃_j + b_i + b̃_j − log X_ij (1/4) (Pennington+ 2014)

- Consider representing a relation of words i and j on some aspect by using a context word k
  - E.g., the relation between ice and steam on thermodynamics
- The ratio P_ik / P_jk, where P_ik = P(k|i), may be more useful than P_ik alone to capture the characteristics of words i and j
  - E.g., solid and gas are more useful than water and fashion

J Pennington, R Socher, and C Manning. 2014. Glove: Global vectors for word representation. In EMNLP-2014, pp. 1532–1543.

Slide 34

Slide 34 text

Rationale of w_i ⋅ w̃_j + b_i + b̃_j − log X_ij (2/4)

- Let w_i, w_j, w̃_k be the vectors of words i, j, k (w̃_k is different from w_k; we will decide the form of F later)
- In order to represent P_ik / P_jk with word vectors:
    F(w_i − w_j, w̃_k) = P_ik / P_jk
  (represent the contrast of the characteristics of words i and j with vector subtraction)
- The simplest way to cast the type of the left-hand side (vectors) into that of the right-hand side (scalar):
    F((w_i − w_j) ⋅ w̃_k) = P_ik / P_jk

Slide 35

Slide 35 text

Rationale of w_i ⋅ w̃_j + b_i + b̃_j − log X_ij (3/4)

- Use exp: ℝ → ℝ+ as F:
    exp((w_i − w_j) ⋅ w̃_k) = exp(w_i ⋅ w̃_k) / exp(w_j ⋅ w̃_k) = P_ik / P_jk
- Therefore:
    exp(w_i ⋅ w̃_k) = P_ik = X_ik / X_i
- Take the logarithm of both sides:
    w_i ⋅ w̃_k = log X_ik − log X_i

Slide 36

Slide 36 text

Rationale of w_i ⋅ w̃_j + b_i + b̃_j − log X_ij (4/4)

- Words and contexts should be interchangeable
  - Consider w ↔ w̃ and X ↔ X^T at the same time
- In w_i ⋅ w̃_k = log X_ik − log X_i, words and contexts are not interchangeable
  - Because we have no constant for k
- Represent log X_i as a bias term b_i, and introduce a new bias term b̃_k for w̃_k:
    w_i ⋅ w̃_k = log X_ik − b_i − b̃_k
    w_i ⋅ w̃_k + b_i + b̃_k = log X_ik

Slide 37

Slide 37 text

Rationale of f(X_ij)

- We cannot compute log X_ij when X_ij = 0
  - Most elements in X are 0 (sparse matrix)
  - We ignore unobserved statistics: f(0) = 0
- We should not give too much weight to rare co-occurrences
  - It is hard to reproduce rare co-occurrences with vectors
  - Scale the weight to (X_ij / x_max)^α when X_ij < x_max
- We should not give too much weight to frequent co-occurrences either
  - Treat frequent co-occurrences with the same importance
  - Clip the weight to 1 when X_ij ≥ x_max

Slide 38

Slide 38 text

Advanced topics

Slide 39

Slide 39 text

Tricks used in implementations (Levy+ 2015)

Preprocessing:
  win  Window size (h): h ∈ {2, 5, 10} — PPMI, SVD, SGNS, GloVe
  dyn  Weighted context: with (1/h), none — PPMI, SVD, SGNS, GloVe *1
  sub  Subsampling: with, none — PPMI, SVD, SGNS, GloVe
  del  Rare word removal: with, none — PPMI, SVD, SGNS, GloVe

Association measure:
  neg  Negative samples: k ∈ {1, 5, 15} — PPMI *2, SVD *2, SGNS
  cds  Distribution correction: α ∈ {1, 0.75} — PPMI *3, SVD *3, SGNS

Postprocessing:
  w+c  Vector summation: w, w + c̃ — SVD, SGNS, GloVe
  eig  Weighted SVs: {0, 0.5, 1.0} — SVD
  nrm  Normalization *4: both, col, row, none — PPMI, SVD, SGNS, GloVe

*1: The same weighting method implemented in word2vec
*2: These are set by shifted PPMI
*3: These are implemented by modifying the denominator of PMIs
*4: Normalization of each word vector was the best

O Levy, Y Goldberg, and I Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics (TACL), 3:211-225.

Slide 40

Slide 40 text

Tips for training word embeddings (Levy+ 2015)

- Use context distribution smoothing (cds = 0.75)
- Use SVD with symmetric variants (eig = 0 or 0.5)
- neg > 1 has no effect in Shifted PPMI
- SGNS is a robust baseline
  - It does not underperform in any scenario
  - It trains word embeddings the fastest with the cheapest memory consumption
- Larger numbers of negative samples are better in SGNS
- Worth trying w+c in SGNS and GloVe
  - May result in substantial gains (but sometimes in losses)

O Levy, Y Goldberg, and I Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics (TACL), 3:211-225.

Slide 41

Slide 41 text

Different evaluations favor different embeddings (Schnabel+ 2015)

Task: a human worker chooses the most similar word among the candidates computed by word embeddings
- GloVe was poor at adverbs for some reason
- CBOW suffers when the candidate set is larger (50 nearest neighbors)

T Schnabel, I Labutov, D Mimno, T Joachims. Evaluation methods for unsupervised word embeddings. In EMNLP 2015, pp. 298-307.

Slide 42

Slide 42 text

Different tasks favor different embeddings (Schnabel+ 2015)

- There is no almighty word embedding for all tasks
- In order to improve the performance on a task, we should fine-tune word embeddings on the target task

T Schnabel, I Labutov, D Mimno, T Joachims. Evaluation methods for unsupervised word embeddings. In EMNLP 2015, pp. 298-307.

Slide 43

Slide 43 text

fastText (Bojanowski+ 2017)

- SGNS and GloVe are unaware of the internal letters of words
- fastText extends SGNS to consider letter n-grams (subword units): the vector of a word is the sum of its word vector and its subword vectors, and the update procedure against context vectors is the same as in SGNS
- The use of subword units is also effective in machine translation

P Bojanowski, E Grave, A Joulin, T Mikolov. 2017. Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics (TACL), 5:135-146.
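A short sketch of the subword extraction step, using the boundary markers and the 3-6 n-gram range described in the paper; in fastText the vectors of these n-grams (plus the whole-word sequence) are summed to form the word vector.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word with boundary markers < and >,
    plus the special sequence for the whole word."""
    marked = "<" + word + ">"
    grams = {marked[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(marked) - n + 1)}
    grams.add(marked)
    return grams

# For "where" and n = 3 this yields <wh, whe, her, ere, re> (plus <where>).
print(sorted(g for g in char_ngrams("where") if len(g) == 3))
```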

Slide 44

Slide 44 text

Comparison between SGNS and fastText (Bojanowski+ 2017)

- fastText (sisg) favors syntactic analogy more than semantic analogy
- fastText (sisg) outperforms the other methods except on WS353 in English

P Bojanowski, E Grave, A Joulin, T Mikolov. 2017. Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics (TACL), 5:135-146.

Slide 45

Slide 45 text

Demo with fastText

Note: "Trumponomics" is a coined word.

Slide 46

Slide 46 text

Summary

- Word embeddings capture syntactic and semantic information to some extent
- The underlying idea is the distributional hypothesis
  - You shall know a word by the company it keeps
  - You shall know a word by predicting its companies
- There is no almighty word embedding for all downstream tasks
- Next question: can we represent a phrase/sentence with a vector?