Slide 48
Slide 48 text
Formalization of QKV attention
Construct $\hat{\mathbf{Q}} = (\hat{\mathbf{q}}_1, \ldots, \hat{\mathbf{q}}_T)$ as sums of $\mathbf{V} = (\mathbf{v}_1, \ldots, \mathbf{v}_S)$ weighted by the dot products between $\mathbf{Q} = (\mathbf{q}_1, \ldots, \mathbf{q}_T)$ and $\mathbf{K} = (\mathbf{k}_1, \ldots, \mathbf{k}_S)$.
$\hat{\mathbf{Q}} = \mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathbf{V}\,\mathrm{softmax}(c\,\mathbf{K}^\top \mathbf{Q})$
where $\mathbf{Q} = (\mathbf{q}_1, \ldots, \mathbf{q}_T) \in \mathbb{R}^{d \times T}$, $\hat{\mathbf{Q}} = (\hat{\mathbf{q}}_1, \ldots, \hat{\mathbf{q}}_T) \in \mathbb{R}^{d \times T}$, $\mathbf{K} = (\mathbf{k}_1, \ldots, \mathbf{k}_S) \in \mathbb{R}^{d \times S}$, $\mathbf{V} = (\mathbf{v}_1, \ldots, \mathbf{v}_S) \in \mathbb{R}^{d \times S}$
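A minimal NumPy sketch of this formula, using the column-vector convention above; the scaling constant $c$ is assumed here to be $1/\sqrt{d}$, as in standard scaled dot-product attention (the slide only names it $c$):

```python
import numpy as np

def attention(Q, K, V):
    """QKV attention: Q_hat = V softmax(c K^T Q).

    Q: (d, T) queries, K: (d, S) keys, V: (d, S) values, one vector per column.
    Returns Q_hat of shape (d, T).
    """
    d = Q.shape[0]
    c = 1.0 / np.sqrt(d)                         # assumed scaling c = 1/sqrt(d)
    scores = c * (K.T @ Q)                       # (S, T) dot products k_s . q_t
    scores -= scores.max(axis=0, keepdims=True)  # shift for numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=0, keepdims=True)            # softmax over the S key positions
    return V @ A                                 # (d, T) weighted sums of value vectors
```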
Self-attention in encoders (reconstruct $\mathbf{H}$ by attending to $\mathbf{H}$):
$\mathbf{Q} = \mathbf{W}_Q \mathbf{H}$, $\mathbf{K} = \mathbf{W}_K \mathbf{H}$, $\mathbf{V} = \mathbf{W}_V \mathbf{H}$, with $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \in \mathbb{R}^{d \times d}$ ($S = T = I$)
Self-attention in decoders (reconstruct $\mathbf{Z}$ by attending to $\mathbf{Z}$):
$\mathbf{Q} = \mathbf{W}_Q \mathbf{Z}$, $\mathbf{K} = \mathbf{W}_K \mathbf{Z}$, $\mathbf{V} = \mathbf{W}_V \mathbf{Z}$, with $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \in \mathbb{R}^{d \times d}$ ($S = T = J$)
Cross attention (reconstruct $\mathbf{H}$ by attending to $\mathbf{H}$ from $\mathbf{Z}$):
$\mathbf{Q} = \mathbf{W}_Q \mathbf{Z}$, $\mathbf{K} = \mathbf{W}_K \mathbf{H}$, $\mathbf{V} = \mathbf{W}_V \mathbf{H}$, with $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \in \mathbb{R}^{d \times d}$ ($S = I$, $T = J$)
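The three variants differ only in which sequence the queries, keys, and values are projected from. A short continuation of the sketch above (reusing the `attention` function defined there; the $d \times d$ projection matrices `Wq`, `Wk`, `Wv` are hypothetical NumPy arrays):

```python
def self_attention(X, Wq, Wk, Wv):
    """Self-attention: Q, K, V all come from the same sequence
    (X = H in the encoder, X = Z in the decoder), so S = T."""
    return attention(Wq @ X, Wk @ X, Wv @ X)

def cross_attention(Z, H, Wq, Wk, Wv):
    """Cross attention: queries from decoder states Z (T = J),
    keys and values from encoder states H (S = I)."""
    return attention(Wq @ Z, Wk @ H, Wv @ H)

# Example shapes (illustrative values only)
d, I, J = 64, 10, 7
H = np.random.randn(d, I)                  # encoder states, I columns
Z = np.random.randn(d, J)                  # decoder states, J columns
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
out = cross_attention(Z, H, Wq, Wk, Wv)    # shape (d, J): one output per decoder position
```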
1. Multi-head attention