Encoder Decoder Models

language modeling, Recurrent Neural Network Language Model (RNNLM), encoder-decoder models, sequence-to-sequence models, attention mechanism, reading comprehension, question answering, headline generation, multi-task learning, character-based RNN, byte-pair encoding, SentencePiece, Convolutional Sequence to Sequence (ConvS2S), Transformer, coverage, round-trip translation

Naoaki Okazaki

August 07, 2020

Transcript

  1. Encoder Decoder Models

    Naoaki Okazaki, School of Computing, Tokyo Institute of Technology (okazaki@c.titech.ac.jp). PowerPoint template designed by https://ppt.design4u.jp/template/
  2. Main task: Machine Translation (MT)

    - Translate a text in one language into another
    - Basic idea: "How do Computers Learn a New Language? An Introduction to Statistical Machine Translation", https://www.youtube.com/watch?v=_ghMKb6iDMM (6:29)
    - Example: こんにちは, Hello, 您好, Hola
  3. Statistical Machine Translation (SMT)

    Building probabilistic models:
    - Supervision data (parallel corpus): 私は動物園に行った. / I went to the zoo. 彼らは東京に行った. / They went to Tokyo.
    - Translation model (Japanese to English), a conditional probability over word pairs: (I, 私は) = 0.8, (they, 彼らは) = 0.8, (went, 行った) = 0.9, (to, に) = 0.9, (the zoo, 動物園) = 0.8, (Tokyo, 東京) = 0.8
    - Supervision data (monolingual corpus): "I went to Tokyo to meet my friend last Sunday. It was the first time since …… We went to the zoo near Ueno for ……"
    - Language model (naturalness in English): "the of" = .012243, "the in" = .007208, "the to" = .005042, ……, "was it" = .000522, "to went" = .000080
    - Input: 私は東京に行った. Output: I went to Tokyo.
  4. DNNs applied to MT

    - Replace the probabilistic models with Deep Neural Networks (DNNs)
    - Input: 私は東京に行った. Output: I went to Tokyo.
    - It is not as simple as introducing DNN architectures that were successful in other research fields (e.g., computer vision)
  5. Connection to the previous lecture

    - Embeddings for phrases and sentences seem to be useful for solving tasks
    - Is it possible to generate a sentence (sequence of words) from embeddings?
    - Yes, encoder-decoder models can do that!
    - Figure: an encoder (Enc) reads "very good movie" and a decoder (Dec) generates "とても 良い 映画"
  6. Language modeling

  7. Demo: Text generation with GPT-2

    - https://github.com/graykode/gpt-2-Pytorch
    - Text generated by giving the first paragraph of the Wikipedia article on "Harry Potter" (https://en.wikipedia.org/wiki/Harry_Potter)
  8. Language model (LM)

    - For a given word sequence $w_1, \dots, w_T$, LMs compute the joint probability $P(w_1, \dots, w_T)$
    - Example: which word fills the blank in "I have a ___."? $\operatorname{argmax}_{x \in V} P(\text{I}, \text{have}, \text{a}, x)$, where $V$ is the set of all words in the vocabulary (pen, dog, PC, ……, what)
    - Used to assess the naturalness of a sentence (sequence of words) generated by machine translation, speech recognition, etc.
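  9. Probabilistic language models

    - Chain rule: $P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})$
    - Each factor predicts the next word after the word sequence $w_1, \dots, w_{t-1}$ and is estimated from counts: $P(w_t \mid w_1, \dots, w_{t-1}) = \frac{\#(w_1, \dots, w_{t-1}, w_t)}{\#(w_1, \dots, w_{t-1})}$
    - Example: $P(\text{This, is, a, pen}) = P(\text{This} \mid \text{BOS})\, P(\text{is} \mid \text{This})\, P(\text{a} \mid \text{This is})\, P(\text{pen} \mid \text{This is a})\, P(\text{EOS} \mid \text{This is a pen})$
    - ☹ Data sparseness problem: insufficient statistics to estimate the probability with a longer sequence of words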
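  10. $n$-gram probabilistic language modeling

    - Approximation: $P(w_1, \dots, w_T) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-n+1}, \dots, w_{t-1})$
    - Remedy the data sparseness problem by compromising with a shorter context: predict the next word after a word sequence $w_{t-n+1}, \dots, w_{t-1}$ of length $n-1$
    - Estimated from counts: $P(w_t \mid w_{t-n+1}, \dots, w_{t-1}) = \frac{\#(w_{t-n+1}, \dots, w_{t-1}, w_t)}{\#(w_{t-n+1}, \dots, w_{t-1})}$. We have more counts!
    - Example with 2-gram: $P(\text{This, is, a, pen}) = P(\text{This} \mid \text{BOS})\, P(\text{is} \mid \text{This})\, P(\text{a} \mid \text{is})\, P(\text{pen} \mid \text{a})\, P(\text{EOS} \mid \text{pen})$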
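The count-based 2-gram estimate on this slide can be sketched in a few lines. This is a minimal illustration, not the lecture's code: the toy corpus, the function names, and the absence of smoothing are all assumptions made for the example.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Count contexts and bigrams over sentences padded with BOS/EOS."""
    context_counts = Counter()
    bigram_counts = Counter()
    for words in sentences:
        padded = ["BOS"] + words + ["EOS"]
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts

def bigram_prob(context_counts, bigram_counts, prev, cur):
    """P(cur | prev) = #(prev, cur) / #(prev); 0 for an unseen context (no smoothing)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, cur)] / context_counts[prev]

def sentence_prob(context_counts, bigram_counts, words):
    """Product of bigram probabilities, including the BOS and EOS transitions."""
    padded = ["BOS"] + words + ["EOS"]
    prob = 1.0
    for prev, cur in zip(padded, padded[1:]):
        prob *= bigram_prob(context_counts, bigram_counts, prev, cur)
    return prob

# Toy corpus (hypothetical data for illustration only).
corpus = [
    ["This", "is", "a", "pen"],
    ["This", "is", "a", "dog"],
    ["That", "is", "a", "pen"],
]
contexts, bigrams = train_bigram_lm(corpus)
p_seen = sentence_prob(contexts, bigrams, ["This", "is", "a", "pen"])
p_unseen = sentence_prob(contexts, bigrams, ["That", "is", "a", "dog"])
```

Note that "That is a dog" never occurs in the toy corpus but still receives a non-zero probability, because every one of its bigrams has been observed. This is the sense in which a shorter context gives more counts.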
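  11. Sentence generation with LM

    - Find the word sequence $y_1, \dots, y_T$: $\operatorname{argmax}_{y_1, \dots, y_T \in V} P(y_1, \dots, y_T)$
    - However, we cannot specify a desired output
    - Generate a sentence $y_1, \dots, y_T$ conditioned on an input $x_1, \dots, x_S$: $\operatorname{argmax}_{y_1, \dots, y_T \in V} P(y_1, \dots, y_T)\, P(y_1, \dots, y_T \mid x_1, \dots, x_S)$
    - $P(y_1, \dots, y_T \mid x_1, \dots, x_S)$ is the translation model in machine translation
      - $P(y_1, \dots, y_T \mid x_1, \dots, x_S)$: whether the output is a correct translation of the input
      - $P(y_1, \dots, y_T)$: whether the generated sentence is natural in the language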
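  12. Sentence generation as a search problem

    - Sentence generation has $O(|V|^T)$ time complexity: $\operatorname{argmax}_{y_1, \dots, y_T \in V} P(y_1, \dots, y_T)\, P(y_1, \dots, y_T \mid x_1, \dots, x_S)$
    - It is unrealistic to enumerate all possible candidates
      - Usually, $|V| > 10{,}000$ and $T$ is 20~100
    - Search word by word (i.e., greedy / beam search)
    - Figure: a search tree that starts from BOS and expands candidate words (a, b, …, I, …) step by step along "I have a pen"; the prefixes are scored by $P(y_1) P(y_1 \mid x)$, $P(y_1, y_2) P(y_1, y_2 \mid x)$, $P(y_1, y_2, y_3) P(y_1, y_2, y_3 \mid x)$, $P(y_1, y_2, y_3, y_4) P(y_1, y_2, y_3, y_4 \mid x)$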
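The word-by-word search can be sketched as a small beam search; greedy search is the special case with a beam of size 1. This is a simplified sketch under stated assumptions: prefixes are scored with a single next-token model (`next_log_probs`, a hypothetical callback) rather than the two-factor score on the slide, and the toy table `TABLE` is invented for illustration.

```python
import math

def beam_search(next_log_probs, beam_size=2, max_len=10, bos="BOS", eos="EOS"):
    """Keep the `beam_size` best prefixes at every step.

    `next_log_probs(prefix)` must return {token: log P(token | prefix)}.
    Returns (log-probability, token list) of the best hypothesis found.
    """
    beams = [(0.0, [bos])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            for token, logp in next_log_probs(prefix).items():
                candidates.append((score + logp, prefix + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_size]:
            if seq[-1] == eos:
                finished.append((score, seq))   # hypothesis is complete
            else:
                beams.append((score, seq))      # keep expanding
        if not beams:
            break
    finished.extend(beams)  # unfinished prefixes compete if max_len was reached
    return max(finished, key=lambda c: c[0])

# Toy next-token distributions that depend only on the last token (hypothetical).
TABLE = {
    "BOS": {"a": 0.6, "I": 0.4},
    "a": {"pen": 0.5, "dog": 0.5},
    "I": {"have": 1.0},
    "have": {"EOS": 1.0},
    "pen": {"EOS": 1.0},
    "dog": {"EOS": 1.0},
}

def toy_model(prefix):
    return {tok: math.log(p) for tok, p in TABLE[prefix[-1]].items()}

greedy_score, greedy_seq = beam_search(toy_model, beam_size=1)
beam_score, beam_seq = beam_search(toy_model, beam_size=2)
```

In this toy table, greedy search commits to "a" (the locally best first word) and ends with probability 0.3, while a beam of size 2 keeps "I" alive and finds the sequence with probability 0.4.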
  13. Issues in LM (before the DNN era)

    - Data sparseness
      - Rare words suffer from insufficient statistics
      - The insufficiency gets worse when using $n$-grams (word combinations)
      - Addressed by smoothing methods (e.g., Good-Turing, Kneser-Ney)
    - Surface variations
      - Surface variations with the same meaning have different probabilities
      - For example, $P(\text{girl} \mid \text{clever})$ and $P(\text{girl} \mid \text{smart})$ are estimated independently even though 'clever' and 'smart' have similar meanings
      - Addressed by 'class' models that merge similar words into a group
    - Long-distance dependency
      - $n$-gram models cannot consider dependencies longer than $n$ words
    - Neural LMs address these issues using distributed representations (word embeddings and their compositions)
  14. Recurrent Neural Network Language Model (Mikolov+ 2010)

    - Figure: the RNN reads BOS, I, have, a, pen and, through a softmax at every step, outputs the distributions $p_1, \dots, p_5$ over the next word
    - The number of dimensions of the output layer is $|V|$, where $V$ is the set of possible words. Each element represents the probability of generating the corresponding word
    - The probability of a sequence of words is the product of the token prediction probabilities: $P(w_1, \dots) = p_1(\text{I}) \times p_2(\text{have}) \times p_3(\text{a}) \times p_4(\text{pen}) \times p_5(\text{EOS})$
    - T Mikolov, M Karafiát, L Burget, J Černocký, S Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH, pp. 1045-1048.
  15. Encoder

  16. Recurrent Neural Networks (RNNs) (Sutskever+ 2011)

    - Example: the RNN reads "John loves Mary much" and predicts a label (☺) from the last hidden vector
    - Word embeddings: represent a word with a vector $x_t \in \mathbb{R}^d$
    - Recurrent computation: compose a hidden vector $h_t$ from an input word $x_t$ and the hidden vector $h_{t-1}$ at the previous timestep: $h_t = g(W_{xh} x_t + W_{hh} h_{t-1})$, with $h_0 = 0$ ($g$ is an activation function)
    - Fully-connected layer for a task: make a prediction from the hidden vector $h_4$, which is composed from all words in the sentence, by using a fully-connected layer $W_{hy}$ and softmax
    - The parameters $W_{xh}$, $W_{hh}$, $W_{hy}$ are shared over the entire sequence. They are trained by the supervision signal $(x_1, \dots, x_4, y)$, using backpropagation
    - I Sutskever, J Martens, G Hinton. 2011. Generating text with recurrent neural networks. In ICML, pp. 1017–1024.
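The recurrent computation and the final prediction can be sketched with NumPy as follows. This is an illustrative forward pass only: the names `W_xh`, `W_hh`, `W_hy`, the choice of `tanh` as the activation, the dimensions, and the random parameters are assumptions, and no training is shown.

```python
import numpy as np

def rnn_encode(X, W_xh, W_hh):
    """Run h_t = tanh(W_xh x_t + W_hh h_{t-1}) over the rows of X, starting from h_0 = 0."""
    h = np.zeros(W_hh.shape[0])
    for x_t in X:
        h = np.tanh(W_xh @ x_t + W_hh @ h)   # same parameters at every timestep
    return h

def softmax(z):
    z = z - z.max()                          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d, d_h, n_classes = 4, 3, 2                  # embedding size, hidden size, labels (arbitrary)
X = rng.normal(size=(4, d))                  # four word embeddings, e.g. "John loves Mary much"
W_xh = rng.normal(size=(d_h, d))
W_hh = rng.normal(size=(d_h, d_h))
W_hy = rng.normal(size=(n_classes, d_h))

h_last = rnn_encode(X, W_xh, W_hh)           # hidden vector composed from all words
y = softmax(W_hy @ h_last)                   # class distribution from the last hidden vector
```

The loop makes the sequential dependency explicit: step `t` cannot start before step `t-1` has finished, which is the parallelization issue raised later in the lecture.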
  17. Convolutional Neural Network (CNN) (Kim 2014)

    - Figure: convolution filters slide over the word embeddings of "It is a very good movie indeed", followed by max pooling and a softmax layer that predicts a label (☺)
    - Max pooling: $c_i = \max_{1 < t < T-\delta+1} p_{t,i}$; each dimension is the maximum of the values $p_{t,i}$ over timesteps
    - Y Kim. 2014. Convolutional neural networks for sentence classification. In EMNLP, pp. 1746-1751.
  18. Encoding

    - These models can be decomposed into:
      - Encoding (variable-length input to feature vector): $v = f(x_1, \dots, x_T)$ ($f$ is a part of the NN)
      - Solving the task (e.g., classifying the text using the feature vector): $y = g(v)$ ($g$ is also a part of the NN)
    - Figure: the RNN (input "I have a pen") and the CNN with max pooling (input "It is a very good movie indeed") from the previous two slides, each split into an encoding part and a softmax part
  19. Encoder decoder models

  20. Using RNNLM for generating sentences

    - Predict a sequence of words for a given input, in addition to scoring the naturalness of the generated sentence
    - Figure: the RNNLM receives "BOS I have a pen" and predicts "I have a pen EOS", conditioned on an input
  21. Encoder decoder model (EncDec) (Sutskever+ 2014; Cho+ 2014)

    - Figure: the encoder reads "I have a pen" into a representation of the input; the decoder receives "BOS ペン を 持つ" and predicts "ペン を 持つ EOS" (※ this illustration omits the matrices of RNNs)
    - Encode an input sentence into a feature vector, and generate a sentence by decoding (predicting) a word sequence from the feature
    - Also known as sequence-to-sequence model
    - Machine translation is realized by a single NN!
      - Machine translation had been a mix of various theories and methods before neural machine translation (NMT)
    - I Sutskever, O Vinyals, Q V Le. 2014. Sequence to sequence learning with neural networks. In NIPS, pp. 3104–3112.
    - K Cho, B van Merrienboer, C Gulcehre, D Bahdanau, F Bougares, H Schwenk, Y Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, pp. 1724–1734.
  22. Caption generation (Vinyals+ 2015)

    - O Vinyals, A Toshev, S Bengio, D Erhan. 2015. Show and tell: A neural image caption generator. In CVPR.
  23. Chatbot (Vinyals+ 2015)

    - Supervision data: OpenSubtitles
      - Scripts extracted from movie subtitles (6G sentences)
    - A chat example from the EncDec model
    - O Vinyals, Q V Le. 2015. A neural conversational model. In ICML Deep Learning Workshop.
  24. Summary

    - Encoder-decoder architecture
      - An encoder converts an input sentence into a feature vector
      - A decoder generates a sentence based on the vector
    - We can train an encoder-decoder model in an end-to-end fashion
    - An (autoregressive) decoder predicts a token sequence by feeding predicted tokens into the input layer
    - We can connect different modalities (e.g., language and vision) in a single NN as long as they are represented as vectors
  25. Attention mechanism

  26. Weakness of EncDec

    - EncDec represents a variable-length input with a fixed-size vector
      - EncDec cannot adapt the size of the representation to the amount of information in the input
      - EncDec struggles with longer sentences
    - Figure: the encoder reads "I have a pen"; the decoder receives "BOS ペン を 持つ" and predicts "ペン を 持つ EOS"
  27. The idea of attention mechanism

    - Figure (animation steps (1) to (5)): the encoder reads "This is a pen"; the decoder receives "BOS これ は ペン" and predicts "これ は ペン EOS", adding a weighted sum of the encoder hidden vectors at each step
    - At each timestep in the decoder, predict a word using the weighted sum of all hidden vectors in the input
    - The attention mechanism determines the weights automatically from the decoder state
    - The decoder now has access to all hidden vectors in the input
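  28. Attention mechanism (Bahdanau+ 2015, Luong+ 2015)

    - Figure: encoder steps This ($s = 1$), is ($s = 2$), a ($s = 3$), pen ($s = 4$); decoder steps BOS ($t = 1$), これ ($t = 2$)
    - Different variables of time steps are used for the encoder ($s$) and decoder ($t$)
    - Attention weights: $a_t(s) = \frac{\exp \mathrm{score}(h_t, \bar{h}_s)}{\sum_{s'} \exp \mathrm{score}(h_t, \bar{h}_{s'})}$, with $\mathrm{score}(h_t, \bar{h}_s) = h_t \cdot \bar{h}_s$
    - Context vector: $c_t = \sum_s a_t(s)\, \bar{h}_s$
    - Attentional hidden state: $\tilde{h}_t = \tanh(W_c [c_t; h_t])$
    - Prediction: $p(y_t) = \mathrm{softmax}(W_y \tilde{h}_t)$
    - Computation flow (Luong+ 2015): $h_{t-1} \to h_t \to a_t \to c_t \to \tilde{h}_t \to y_t \to h_{t+1}$
    - $\mathrm{score}(h_t, \bar{h}_s)$: how much the decoder at time step $t$ needs information from time step $s$ in the encoder
    - D Bahdanau, K Cho, Y Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
    - M-T Luong, H Pham, C D Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP, pp. 1412-1421.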
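  29. Computing attention scores

    - Attention: $a_t(s) = \frac{\exp \mathrm{score}(h_t, \bar{h}_s)}{\sum_{s'} \exp \mathrm{score}(h_t, \bar{h}_{s'})}$
      - Scores are normalized into a probability distribution
    - Various approaches for computing the attention score $\mathrm{score}(h_t, \bar{h}_s)$:
      - $h_t \cdot \bar{h}_s$ (dot)
      - $h_t^\top W_a \bar{h}_s$ (product)
      - $v_a \cdot \tanh(W_a [h_t; \bar{h}_s])$ (concat)
    - $W_a$ and $v_a$ are parameters (trained by backpropagation)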
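The dot-score variant and the weighted sum of encoder states can be sketched as follows. This is a minimal NumPy illustration; the function name, the tiny hand-picked vectors, and the row-per-timestep layout of `H_enc` are assumptions made for the example.

```python
import numpy as np

def dot_attention(h_t, H_enc):
    """Attention with score(h_t, h_s) = h_t . h_s.

    h_t:   decoder hidden state, shape (d,)
    H_enc: encoder hidden states, one row per source timestep, shape (S, d)
    Returns the attention weights (S,) and the context vector (d,).
    """
    scores = H_enc @ h_t                    # one score per source timestep
    scores = scores - scores.max()          # stabilize the exponentials
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ H_enc               # weighted sum of encoder states
    return weights, context

# Three encoder states and one decoder state (hypothetical values).
H_enc = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
h_t = np.array([2.0, 0.0])
weights, context = dot_attention(h_t, H_enc)
```

The first and third encoder states have the same dot product with `h_t`, so they receive the same weight, and the second one (orthogonal to `h_t`) receives the smallest weight.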
  30. Attention has an advantage on longer sentences (Luong+ 2015)

    - local-p: attention mechanism that predicts the focal range of the input sequence based on the hidden state of the decoder
    - M-T Luong, H Pham, C D Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP, pp. 1412-1421.
  31. Attention roughly represents alignments (Luong+ 2015)

    - Figure panels: Global attention, Local monotonic focus, Local predictive focus, Gold alignment
    - M-T Luong, H Pham, C D Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP, pp. 1412-1421.
  32. Show, attend and tell (Xu+ 2015)

    - K Xu, J Ba, R Kiros, K Cho, A Courville, R Salakhutdinov, R Zemel, Y Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In ICML, pp. 2048-2057.
  33. Convolutional Neural Network for encoder-decoder models

  34. RNN/LSTM and CNN

    - It is hard to parallelize RNN/LSTM over time steps
      - ☺ RNN may capture distant dependencies of tokens
      - ☹ Need to traverse the full distance of the path of words
      - ☹ Hard to parallelize
    - It is easy to parallelize CNN over time steps
      - ☺ We can compute convolutions in parallel
      - ☹ CNN may not capture dependencies beyond the window
  35. ByteNet (Kalchbrenner+ 2016)

    - ☺ Requires a logarithmic number of traversals for handling distant dependencies
    - N Kalchbrenner, L Espeholt, K Simonyan, A van den Oord, A Graves, K Kavukcuoglu. 2016. Neural Machine Translation in Linear Time. arXiv:1610.10099.
  36. Convolutional Sequence to Sequence (ConvS2S) (Gehring+ 2017)

    - Encoder-decoder model built only with CNNs
    - Figure: the encoder reads "これ は ペン です EOS" and the decoder receives "BOS This is a pen" and predicts up to EOS; "_" marks dummy (padding) tokens
    - The rotation animation represents the composition of a hidden state of the decoder by attending to the hidden states in the encoder
    - Predict a word → compose the decoder vector → predict the next word. In order to realize this, we put dummy tokens (_)
    - J Gehring, M Auli, D Grangier, D Yarats, Y N Dauphin. 2017. Convolutional sequence to sequence learning. In ICML, pp. 1243-1252.
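  37. Vector composition in ConvS2S

    - Position embedding: the input at position $i$ is the sum of the word embedding and the position embedding, $x_i = w_i + p_i$ (e.g., これ + <1>, は + <2>)
    - Gated Linear Unit (GLU): $h'_i = A_i \otimes \sigma(B_i)$, where $A_i$ and $B_i$ are two linear transformations of the inputs in the window around position $i$
    - Residual connection: $h_i = h'_i + x_i$
    - Encoder and decoder use the same architecture
    - Their experiments use 20-layer CNNs with the window length of 3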
  38. Transformer

  39. Transformer: "Attention is all you need" (Vaswani+ 2017)

    - A Vaswani, N Shazeer, N Parmar, J Uszkoreit, L Jones, A N Gomez, L Kaiser, I Polosukhin. 2017. Attention is all you need. In NIPS, pp. 5998–6008.
    - https://research.googleblog.com/2017/08/transformer-novel-neural-network.html
  40. The architecture of Transformer (Vaswani+ 2017)

    - A Vaswani, N Shazeer, N Parmar, J Uszkoreit, L Jones, A N Gomez, L Kaiser, I Polosukhin. 2017. Attention is all you need. In NIPS, pp. 5998–6008.
  41. The architecture of Transformer

    - Example: the encoder reads "John loves Mary"; the decoder receives "BOS ジョン は" and predicts "ジョン は メアリー"
    - Encoder layer: positional encoding → self attention → residual + layer-norm → feedforward → residual + layer-norm
    - Decoder layer: positional encoding → masked self attention → residual + layer-norm → source-target attention → residual + layer-norm → feedforward → residual + layer-norm
  42. Positional encoding

    - Transformer has neither recurrence nor convolution
      - We need to inject position information into hidden states in some way
    - Add a positional encoding $PE_t \in \mathbb{R}^d$ to the token embedding $e_t \in \mathbb{R}^d$ at position $t$ to represent positions of hidden states in the encoder and decoder: $x_t = e_t + PE_t$
      - $d$: a constant representing the number of dimensions of the vectors
    - Value of the $i$-th dimension of the vector: $PE_{t,i} = \sin(\omega_k t)$ if $i = 2k$ and $PE_{t,i} = \cos(\omega_k t)$ if $i = 2k + 1$, where $\omega_k = \frac{1}{10000^{2k/d}}$
    - Figure modified from the figure in (Vaswani+ 2017)
    - A Vaswani, N Shazeer, N Parmar, J Uszkoreit, L Jones, A N Gomez, L Kaiser, I Polosukhin. 2017. Attention is all you need. In NIPS, pp. 5998–6008.
  43. Properties of positional encoding

    - $PE_{t,i} = \sin(\omega_k t)$ if $i = 2k$ and $PE_{t,i} = \cos(\omega_k t)$ if $i = 2k + 1$, where $\omega_k = \frac{1}{10000^{2k/d}}$
    - Values of lower dimensions change a lot across positions, but those of higher dimensions do not
      - It looks like a continuous version of binary code
    - Positional encodings from close positions yield similar values
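The sinusoidal encoding can be computed for all positions at once. The following is a minimal NumPy sketch that assumes an even number of dimensions `d`; the function name and the sizes are illustrative, not taken from the lecture.

```python
import numpy as np

def positional_encoding(max_len, d):
    """Return a (max_len, d) matrix with PE[t, 2k] = sin(w_k t) and PE[t, 2k+1] = cos(w_k t).

    w_k = 1 / 10000^(2k/d). Assumes d is even.
    """
    positions = np.arange(max_len)[:, None]      # t = 0, 1, ..., shape (max_len, 1)
    two_k = np.arange(0, d, 2)                   # even dimension indices 2k
    omega = 1.0 / (10000.0 ** (two_k / d))       # one frequency per sin/cos pair
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(positions * omega)      # even dimensions
    pe[:, 1::2] = np.cos(positions * omega)      # odd dimensions
    return pe

pe = positional_encoding(max_len=50, d=16)
# Distance between encodings of neighbouring vs. far-apart positions.
near = np.linalg.norm(pe[10] - pe[11])
far = np.linalg.norm(pe[10] - pe[40])
```

Comparing `near` and `far` gives a quick check of the last property on the slide: encodings of close positions are more similar than encodings of distant ones.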
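  44. Self attention (in encoders)

    - Figure: for each token of "John loves Mary", the scores against all tokens are scaled by $1/\sqrt{d_k}$, normalized with a softmax, multiplied (×) with the value vectors, and summed (+)
    - Each token (timestep) attends to every token (timestep) in the input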
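  45. Self attention expressed as scaled dot-product attention

    - Generalized expression of attention mechanism:
      - Compute a matching score between two vectors (query and key)
      - Convert the matching score into a weight
      - Sum the vectors (values) weighted by the weights
    - We represent an input ($n$ tokens) with a matrix $X = (x_1, \dots, x_n) \in \mathbb{R}^{n \times d}$
    - The encoder converts $X$ into query, key, and value: $Q = XW^Q$, $K = XW^K$, $V = XW^V$ ($W^Q \in \mathbb{R}^{d \times d_k}$, $W^K \in \mathbb{R}^{d \times d_k}$, $W^V \in \mathbb{R}^{d \times d_v}$, $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{n \times d_k}$, $V \in \mathbb{R}^{n \times d_v}$)
    - Parallel computation of self attention over inputs with scaled dot-product: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$
    - The scaling by $1/\sqrt{d_k}$ prevents a dot product from being too large, which causes a zero gradient
    - Figure: for two tokens, query × key gives a 2 × 2 score matrix, the softmax turns it into a weight matrix, and weight × value gives the output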
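The matrix form of self attention can be sketched directly in NumPy. This is an illustrative single-head forward pass; the parameter names `W_q`, `W_k`, `W_v`, the sizes, and the random initialization are assumptions for the example.

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    Z = Z - Z.max(axis=-1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self attention: softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    weights = softmax_rows(Q @ K.T / np.sqrt(d_k))   # (n, n): row i attends over all tokens
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d, d_k, d_v = 3, 4, 2, 2                          # arbitrary toy sizes
X = rng.normal(size=(n, d))                          # n token vectors
W_q = rng.normal(size=(d, d_k))
W_k = rng.normal(size=(d, d_k))
W_v = rng.normal(size=(d, d_v))
output, weights = self_attention(X, W_q, W_k, W_v)
```

All `n` rows of the weight matrix are obtained from two matrix products and one softmax, with no loop over timesteps, which is what makes the computation parallel over the sequence.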
  46. Multi-head self-attention

    - The scaled dot-product attention computes a single weight matrix, i.e., a single pattern of how to attend to the input
      - It may be useful to have multiple perspectives of how to attend to the input
      - Let's consider multiple attentions, altering the queries, keys, and values
    - Project an input $X \in \mathbb{R}^{n \times d}$ into $h$ sub-spaces of queries, keys, and values: $Q_i = XW_i^Q$, $K_i = XW_i^K$, $V_i = XW_i^V$ ($i \in \{1, \dots, h\}$, $W_i^Q \in \mathbb{R}^{d \times d_k}$, $W_i^K \in \mathbb{R}^{d \times d_k}$, $W_i^V \in \mathbb{R}^{d \times d_v}$)
    - Compute an attention head $H_i \in \mathbb{R}^{n \times d_v}$ with scaled dot-product attention: $H_i = \mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{Attention}(XW_i^Q, XW_i^K, XW_i^V)$
    - Project the attention heads back to the $d$-dimensional space with $W^O \in \mathbb{R}^{h d_v \times d}$: $(H_1 \oplus H_2 \oplus \dots \oplus H_h) W^O$ ($\oplus$: concatenation)
    - Figure: $h$ parallel scaled dot-product attention blocks followed by concatenation
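The projection into sub-spaces, the per-head attention, and the final concatenation can be sketched as follows. This is an illustrative NumPy version in which the per-head matrices are stacked along a leading axis; that layout, the names, and the toy sizes are assumptions for the example, not the lecture's implementation.

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=-1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o):
    """W_q, W_k: (h, d, d_k); W_v: (h, d, d_v); W_o: (h * d_v, d)."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):       # one iteration per head
        Q, K, V = X @ Wq_i, X @ Wk_i, X @ Wv_i
        weights = softmax_rows(Q @ K.T / np.sqrt(Q.shape[-1]))
        heads.append(weights @ V)                     # head i, shape (n, d_v)
    concatenated = np.concatenate(heads, axis=-1)     # (n, h * d_v)
    return concatenated @ W_o                         # back to (n, d)

rng = np.random.default_rng(0)
n, d, h = 3, 8, 2
d_k = d_v = d // h                                    # as in the base setting, d_k = d / h
X = rng.normal(size=(n, d))
W_q = rng.normal(size=(h, d, d_k))
W_k = rng.normal(size=(h, d, d_k))
W_v = rng.normal(size=(h, d, d_v))
W_o = rng.normal(size=(h * d_v, d))
out = multi_head_self_attention(X, W_q, W_k, W_v, W_o)
```

The explicit loop over heads is kept for readability; practical implementations usually fold the heads into one batched matrix product.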
  47. Why self-attention? (Vaswani+ 2017)

    - Self-attention is usually faster than RNNs ($n < d$)
      - "NLP researchers scared $n^2$ much, but the Google engineer didn't"
    - Self-attention is parallelizable over a sequence
    - Self-attention connects all positions with $O(1)$ steps
      - RNN requires $O(n)$ computations
      - CNN requires $O(\log n)$ convolution operations
    - A Vaswani, N Shazeer, N Parmar, J Uszkoreit, L Jones, A N Gomez, L Kaiser, I Polosukhin. 2017. Attention is all you need. In NIPS, pp. 5998–6008.
  48. Residual connection (He+ 2016)

    - Suppose that we want to learn a function $h(x)$
    - We consider another mapping: $f(x) = h(x) - x$
      - Then, the original mapping is $h(x) = f(x) + x$
    - We hypothesize that training $f(x)$ is easier than training $h(x)$
      - If an identity mapping is the default, pushing $f(x) = 0$ may be easier
    - We can view $f(x) + x$ as a feedforward neural network with shortcut connections
    - Useful to build a deeper network
      - Gradients flow on shortcut connections
    - Proposed in ResNet (He+ 2016)
    - K He, X Zhang, S Ren, J Sun. 2016. Deep Residual Learning for Image Recognition. In CVPR.
  49. Layer normalization (Ba+ 2016)

    - Ensure zero mean and unit variance of a vector $x^{(\mathrm{new})}$ computed from $x \in \mathbb{R}^d$: $x^{(\mathrm{new})} \leftarrow \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$, $\mu = \frac{1}{d} \sum_{i=1}^{d} x_i$, $\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2$
    - This is used in various places in Transformer
      - A mean $\mu$ and variance $\sigma^2$ are computed at each time step
    - How it works(?) (adapted from the analysis of batch normalization (Bjorck+ 2018))
      - Large activations in a lower layer cannot be propagated uncontrollably to upper layers because of the normalization operation
      - This prevents gradients from exploding (i.e., becoming too large)
      - This enables higher learning rates (recall that the amount of a parameter update is a product of a learning rate and a gradient)
      - A large learning rate leads to larger noise in SGD (proportional to the square of the learning rate)
      - A larger SGD noise prevents the network from getting "trapped" in sharp minima and biases it towards wider minima with better generalization
    - J L Ba, J R Kiros, G E Hinton. 2016. Layer Normalization. arXiv:1607.06450.
    - J Bjorck, C Gomes, B Selman, K Q Weinberger. 2018. Understanding Batch Normalization. In NIPS, pp. 7694-7705.
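The normalization itself is a short computation per time step. Below is a minimal NumPy sketch that follows the formula on this slide; it omits the learnable gain and bias of the original layer normalization paper, and the value of `eps` and the example input are assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row (one time step) to zero mean and (almost) unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)      # population variance, i.e. the 1/d form
    return (x - mu) / np.sqrt(var + eps)

# Two time steps whose activations differ by a factor of 10 (hypothetical values).
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
x_new = layer_norm(x)
```

Both rows map to almost the same normalized vector even though the second is ten times larger, which illustrates why large activations in a lower layer cannot propagate uncontrollably.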
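  50. Feedforward layer

    - Two linear transformations with ReLU in between: $\mathrm{FFN}(x) = \max(0, xW_1 + b_1) W_2 + b_2$, with $W_1 \in \mathbb{R}^{d \times d_{ff}}$, $b_1 \in \mathbb{R}^{d_{ff}}$, $W_2 \in \mathbb{R}^{d_{ff} \times d}$, $b_2 \in \mathbb{R}^{d}$
      - Linear transformation → ReLU → linear transformation
    - The original paper sets $d_{ff} = 4d$ in the experiments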
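The position-wise feedforward layer is a direct transcription of the formula. This is a minimal NumPy sketch with random weights and zero biases, which are assumptions chosen only to make the example runnable.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every row (position) of x."""
    hidden = np.maximum(0.0, x @ W1 + b1)    # linear transformation followed by ReLU
    return hidden @ W2 + b2                  # second linear transformation

rng = np.random.default_rng(0)
d = 4
d_ff = 4 * d                                 # inner size, four times the model size
x = rng.normal(size=(3, d))                  # three positions
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)
y = ffn(x, W1, b1, W2, b2)
```

Each position is transformed independently with the same weights, so the layer mixes information across dimensions but not across positions.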
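  51. Masked self-attention (in decoders)

    - Figure: for each token of "ジョン は メアリー", the scores are scaled by $1/\sqrt{d_k}$, scores for later tokens are masked, and the remaining scores are normalized with a softmax, multiplied (×) with the value vectors, and summed (+)
    - Ignore attention scores looking at later tokens, as decoders do not know future tokens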
  52. Masked self attention (in decoders)

    - When training an encoder-decoder model, we give all source and target tokens in the input layer at a time
      - We want to complete all computation as matrix operations for better parallelization
      - However, we should not look at future tokens
    - Before computing the softmax, we set all elements in the score matrix that point to future tokens to $-\infty$ (e.g., $-10^9$) (masking)
    - Figure: the score matrix for the target sequence "BOS ジョン は メリー が 大好き" (rows: query positions, columns: key positions), with the upper-triangular part masked
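The masking step can be sketched as follows, using the large negative constant mentioned on the slide in place of negative infinity. This is an illustrative NumPy example; the function names and the all-zero score matrix are assumptions.

```python
import numpy as np

def causal_mask(n):
    """Boolean (n, n) matrix that is True where the key position is later than the query position."""
    return np.triu(np.ones((n, n), dtype=bool), k=1)

def masked_softmax(scores, mask, neg=-1e9):
    """Replace masked scores with a large negative value, then apply a row-wise softmax."""
    scores = np.where(mask, neg, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

n = 3
scores = np.zeros((n, n))                    # equal raw scores, to make the effect visible
weights = masked_softmax(scores, causal_mask(n))
```

With equal raw scores, the first position can attend only to itself, the second spreads its weight over two positions, and the third over three, while every weight on a future token is zero.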
  53. Encoder-decoder attention (in decoders)

    - Figure: each decoder token (ジョン, は, メアリー, を) computes scores against the encoder tokens "John loves Mary", scales them by $1/\sqrt{d_k}$, applies a softmax, multiplies (×) the weights with the encoder value vectors, and sums (+)
    - Every position in the decoder can attend over all positions in the input sequence, similarly to the typical attention mechanism in encoder-decoder models
  54. Hyper-parameters

    Parameter                                   Base    Big
    # layers ($N$)                              6       6
    # dimensions ($d$)                          512     1024
    # dimensions for FFN ($d_{ff}$)             2048    4096
    # attention heads ($h$)                     8       16
    # dimensions of keys/queries ($d_k$)        64      64
    # dimensions of values ($d_v$)              64      64
    Dropout rate ($P_{drop}$)                   0.1     0.3
    # training steps                            100K    300K
    # total parameters                          65M     213M

    - Some training tips exist for Transformer
      - E.g., the learning rate is increased linearly for the first warm-up steps, and then decreased proportionally to the inverse square root of the step number
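The warm-up schedule in the last tip can be written as a single function. The sketch below uses the formula and the 4,000 warm-up steps from Vaswani+ (2017); the function name is hypothetical, and other implementations may rescale the schedule.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5), for step >= 1."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

lr_half_warmup = transformer_lr(2000)    # on the linear warm-up ramp
lr_peak = transformer_lr(4000)           # the two branches of min() meet here
lr_later = transformer_lr(16000)         # inverse-square-root decay
```

During warm-up the second branch of `min` is smaller and grows linearly with the step; after `warmup_steps` the first branch takes over, so quadrupling the step number halves the learning rate.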
  55. Task performance (Vaswani+ 2017)

    - Transformer established the new state-of-the-art performance on En-De translation even with the base model (fewer parameters than big)
    - A Vaswani, N Shazeer, N Parmar, J Uszkoreit, L Jones, A N Gomez, L Kaiser, I Polosukhin. 2017. Attention is all you need. In NIPS, pp. 5998–6008.
  56. Coreference handling in self attention (Vaswani+ 2017)

    - "The animal didn't cross the street because it was too tired."
    - "The animal didn't cross the street because it was too wide."
    - A Vaswani, N Shazeer, N Parmar, J Uszkoreit, L Jones, A N Gomez, L Kaiser, I Polosukhin. 2017. Attention is all you need. In NIPS, pp. 5998–6008.
  57. GPT

  58. What is GPT (Radford+ 2018)?

    - Generative Pre-Training (GPT)
    - A generic language model that is transferable to various NLP tasks
      - A single model for different tasks
      - Question answering, document classification, semantic similarity, …
      - GPT-3 has been a hot topic recently (in 2020)
    - Pretraining and finetuning (a kind of transfer learning)
      - Pretraining learns parameters that are generic to the language
      - Finetuning learns task-specific parameters on supervision data, leveraging the parameters acquired in pretraining
    - Based on Transformer decoder
    - A Radford, K Narasimhan, T Salimans, I Sutskever. 2018. Improving Language Understanding by Generative Pre-Training. Technical report. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
  59. GPT-3 (Brown+ 2020)

    - https://twitter.com/sharifshameem/status/1282676454690451457
    - T B Brown, B Mann, N Ryder, M Subbiah, et al. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165.
  60. Architecture of GPT (Radford+ 2018)

    - GPT uses a Transformer decoder across different tasks
    - Pretraining is based on language modeling
    - Finetuning adds an output layer for a target task to the pretrained model and trains it using supervision data
    - A Radford, K Narasimhan, T Salimans, I Sutskever. 2018. Improving Language Understanding by Generative Pre-Training. Technical report. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
  61. Pretraining: Language modeling

    - Figure: the input sequence "Once upon a time there was a girl who really loved" is converted to input embeddings $h_t^0$ (token embedding $W_t$ + position embedding $P_t$, $t = 1, \dots, 11$) and passed through GPT (a multi-layer Transformer decoder) to obtain output embeddings $h_t$, from which the output sequence "upon a time there was a girl who really loved books" is predicted
    - The model (Transformer decoder) is trained to predict the next token for each time step by using a large corpus as the supervision data
  62. Example of finetuning: Textual entailment

    - Figure: the input sequence "Tokyo Tech is located in Ookayama $ Japan has a university" is converted to input embeddings $h_t^0$ (token embedding $W_t$ + position embedding $P_t$, $t = 1, \dots, 11$) and passed through GPT (a multi-layer Transformer decoder); a softmax on the output embeddings predicts the label "Entail"
    - After training the language model, add linear and softmax layers to predict labels of a target task, and adapt the parameters using the supervision data
    - In addition to the objective of the target task, we also train the model with the objective of language modeling on the supervision data
  63. Training the GPT model (Radford+ 2018)

    - Pretraining
      - BooksCorpus dataset (7,000 unique books) and 1B Words Benchmark
    - Finetuning
    - Detail of the Transformer architecture
      - 12-layer decoder-only transformer with masked self attention
      - Number of dimensions = 768 (12 attention heads)
      - Vocabulary of 40,000 subword tokens built by Byte-Pair-Encoding (BPE)
      - 117M parameters in total
    - A Radford, K Narasimhan, T Salimans, I Sutskever. 2018. Improving Language Understanding by Generative Pre-Training. Technical report. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
  64. Evaluation results (Radford+ 2018)

    - Natural Language Inference: SoTA on all datasets
      - Improvements: 1.5% on MNLI, 5% on SciTail, 5.8% on QNLI, and 0.6% on SNLI
    - Question answering and commonsense reasoning: SoTA on all datasets
      - Improvements: 8.9% on Story Cloze, and 5.7% overall on RACE
    - Semantic similarity: SoTA on two out of three datasets
    - Classification: SoTA on GLUE benchmark (72.8 ← 68.9)
    - Performance drastically drops without pre-training (see the table on the slide)
    - A Radford, K Narasimhan, T Salimans, I Sutskever. 2018. Improving Language Understanding by Generative Pre-Training. Technical report. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
  65. GPT-2 (Radford+ 2019)

    - The paper explores whether the language model trained on a text can solve NLP tasks such as question answering without finetuning
    - The architecture is the same as GPT
      - But the Transformer architecture is changed from Post-LN to Pre-LN
    - Training of the language model
      - 8M high-quality documents (40GB) crawled from the Web
    - Figure: Post-LN applies Layer Norm after each residual addition (Attention → add → Layer Norm → FFN → add → Layer Norm); Pre-LN applies Layer Norm before each sub-layer (Layer Norm → Attention → add → Layer Norm → FFN → add)
    - A Radford, J Wu, R Child, D Luan, D Amodei, I Sutskever. 2019. Language Models are Unsupervised Multitask Learners. Technical report. https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
  66. Performance of GPT-2 (Radford+ 2019)

    - Model sizes: 117M (12 layers, 768 dims); 345M (24 layers, 1024 dims); 762M (36 layers, 1280 dims); 1542M (48 layers, 1600 dims)
    - A Radford, J Wu, R Child, D Luan, D Amodei, I Sutskever. 2019. Language Models are Unsupervised Multitask Learners. Technical report. https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
  67. Answers generated by GPT-2 on the dev set of Natural Questions

    - A Radford, J Wu, R Child, D Luan, D Amodei, I Sutskever. 2019. Language Models are Unsupervised Multitask Learners. Technical report. https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
  68. GPT-3 (Brown+ 2020)

    - The paper explores whether the language model trained on a text can solve NLP tasks in zero-shot, one-shot, or few-shot settings, without updating parameters for the task
    - T B Brown, B Mann, N Ryder, M Subbiah, et al. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165.
  69. The architecture of GPT-3

    - The architecture is the same as GPT-2
      - But GPT-3 uses alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to Sparse Transformer
    - The GPT-3 models are extremely large
      - Training GPT-3 (175B) requires $3.14 \times 10^{23}$ FLOPs
      - "Even at theoretical 28 TFLOPS for V100 and lowest 3 year reserved cloud pricing we could find, this will take 355 GPU-years and cost $4.6M for a single training run." [1]
    - T B Brown, B Mann, N Ryder, M Subbiah, et al. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165.
    - [1] OpenAI's GPT-3 Language Model: A Technical Overview. https://lambdalabs.com/blog/demystifying-gpt-3/
  70. Performance of GPT-3

    - T B Brown, B Mann, N Ryder, M Subbiah, et al. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165.
  71. Limitations even with GPT-3 (Brown+ 2020)

    - Inferior performance to the finetuning approach on some tasks
    - Notable weaknesses in text synthesis
      - Repetitions/contradictions at the document level, lost coherence over sufficiently long passages, and non-sequitur sentences
    - Difficulty with "common sense physics"
      - Difficult to answer a question like "If I put cheese into the fridge, will it melt?"
    - Structural and algorithmic limitations
      - No bi-directional architecture (unlike BERT), which is disadvantageous for some tasks (e.g., fill-in-the-blank tasks) that require re-reading or carefully considering a long passage and then generating a very short answer
    - Poor sample efficiency during pre-training
      - Pre-training requires much more text than a human sees in their lifetime
      - Test-time sample efficiency is closer to that of humans (one/zero-shot), though
    - Other limitations that are shared by most deep learning systems
      - Interpretability of decisions, biases of the data, gender, etc.
    - T B Brown, B Mann, N Ryder, M Subbiah, et al. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165.
  72. BERT

  73. What is BERT (Devlin+ 2019)?

    - Bidirectional Encoder Representations from Transformers (BERT)
      - → Embeddings from Language Models (ELMo)
    - A generic model for various NLP tasks
      - Question answering, document classification, semantic inference, …
      - Became a popular methodology, achieving state-of-the-art performance
    - Pretraining and finetuning (a kind of transfer learning)
      - Pretraining learns parameters that are generic to the language
      - Finetuning learns task-specific parameters on supervision data, leveraging the parameters acquired in pretraining
    - Based on Transformer encoder
      - BERT is not an encoder-decoder model (it has no decoder)
    - A kind of contextualized word embeddings
      - Word embeddings that can represent context
    - J Devlin, M-W Chang, K Lee, K Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL, pp. 4171-4186.
  74. Pretraining and finetuning (Devlin+ 2019)

    - BERT uses a unified architecture across different tasks
    - Pretraining is based on bidirectional language modeling
    - Starting with a pretrained model, finetuning updates output layers (sometimes tailored for target tasks) as well as all internal parameters
    - J Devlin, M-W Chang, K Lee, K Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL, pp. 4171-4186.
  75. Pretraining task 1: Masked language model

    - Idea: train the model so that it can solve the Cloze task (e.g., given "My dog is [ ].", BERT predicts [ ] = cute)
    - Obtain supervision data by masking tokens in large corpora
      - BooksCorpus (800M words) and English Wikipedia (2,500M words)
    - Procedure:
      - Choose 15% of token positions at random for prediction
      - Apply one of the following operations to each chosen position (example: "My dog is cute")
        - [80%]: Replace the target token with [MASK] ("My dog is [MASK]")
        - [10%]: Replace the target token with a random token ("My dog is apple")
        - [10%]: Keep the target token unchanged ("My dog is cute")
    - The random and unchanged cases are included because the [MASK] token does not appear in downstream tasks
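The masking procedure can be sketched in plain Python. This is a simplified illustration, not BERT's actual preprocessing code: each position is selected independently with probability 0.15 (an approximation of choosing 15% of the positions), and the function name, the toy vocabulary, and the fixed random seed are assumptions.

```python
import random

def mask_tokens(tokens, vocab, rng, select_prob=0.15, mask_token="[MASK]"):
    """Return (inputs, targets) for masked language modeling.

    targets[i] holds the original token at positions chosen for prediction
    and None elsewhere (those positions are not predicted).
    """
    inputs = list(tokens)
    targets = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                          # position not chosen for prediction
        targets[i] = tok
        r = rng.random()
        if r < 0.8:
            inputs[i] = mask_token            # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)     # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, targets

rng = random.Random(0)
vocab = ["my", "dog", "is", "cute", "he", "likes", "play", "##ing", "apple", "cat"]
tokens = [rng.choice(vocab) for _ in range(10000)]   # synthetic token stream
inputs, targets = mask_tokens(tokens, vocab, rng)

n_selected = sum(t is not None for t in targets)
n_masked = sum(x == "[MASK]" for x in inputs)
```

The loss would be computed only at positions where `targets[i]` is not `None`, which matches the statement that the other tokens are not predicted.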
  76. Masked language modeling (15% × 80%): [MASK] input 75 [CLS]

    my dog [MASK] cute [SEP] he likes [MASK] ##ing [SEP] + + + + + + + + + + + + + + + + + + + + + + Token embeddings Segment embeddings Position embeddings Input embeddings Output embeddings BERT (Transformer encoder) Input sequence [CLS] 1 2 4 3 [SEP] 1 ′ 2 ′ 4 ′ 3 ′ [SEP] ′ 1 2 4 3 [SEP] 1 ′ 2 ′ 4 ′ 3 ′ [SEP] ′ 𝑃𝑃0 𝑃𝑃1 𝑃𝑃2 𝑃𝑃3 𝑃𝑃4 𝑃𝑃5 𝑃𝑃6 𝑃𝑃7 𝑃𝑃8 𝑃𝑃9 𝑃𝑃10 𝑆𝑆A 𝑆𝑆A 𝑆𝑆A 𝑆𝑆A 𝑆𝑆A 𝑆𝑆A 𝑆𝑆B 𝑆𝑆B 𝑆𝑆B 𝑆𝑆B 𝑆𝑆B Sentence 1 Sentence 2 is play The model is trained to predict the masked tokens (we don’t predict other tokens)
  77. Masked language modeling (15% × 10%): random input 76 [CLS]

    my dog look cute [SEP] he likes cat ##ing [SEP] + + + + + + + + + + + + + + + + + + + + + + Token embeddings Segment embeddings Position embeddings Input embeddings Output embeddings BERT (Transformer encoder) Input sequence [CLS] 1 2 4 3 [SEP] 1 ′ 2 ′ 4 ′ 3 ′ [SEP] ′ 1 2 4 3 [SEP] 1 ′ 2 ′ 4 ′ 3 ′ [SEP] ′ 𝑃𝑃0 𝑃𝑃1 𝑃𝑃2 𝑃𝑃3 𝑃𝑃4 𝑃𝑃5 𝑃𝑃6 𝑃𝑃7 𝑃𝑃8 𝑃𝑃9 𝑃𝑃10 𝑆𝑆A 𝑆𝑆A 𝑆𝑆A 𝑆𝑆A 𝑆𝑆A 𝑆𝑆A 𝑆𝑆B 𝑆𝑆B 𝑆𝑆B 𝑆𝑆B 𝑆𝑆B Sentence 1 Sentence 2 is play The model is trained to predict the target tokens (we don’t predict other tokens)
  78. Masked language modeling (15% × 10%): original input

    [Figure: BERT (Transformer encoder) on the unchanged input sequence "[CLS] my dog is cute [SEP] he likes play ##ing [SEP]". Each input embedding is the sum of a token embedding, a segment embedding (S_A for Sentence 1, S_B for Sentence 2), and a position embedding (P_0 to P_10). Prediction targets at the chosen positions: "is" and "play".] The model is trained to predict the target tokens (we don’t predict other tokens)
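In all three cases the loss is computed only at the chosen target positions. A minimal sketch, assuming the model output is available as one token-to-probability dictionary per position (a simplification chosen for readability; `mlm_loss` is a hypothetical helper):

```python
import math


def mlm_loss(predicted_probs, targets):
    """Average negative log-likelihood over target positions only.

    predicted_probs: list with one {token: probability} dict per position.
    targets: {position: original token} for the chosen positions.
    """
    if not targets:
        return 0.0
    total = -sum(math.log(predicted_probs[i][tok]) for i, tok in targets.items())
    return total / len(targets)


# Toy predictions for "[CLS] my dog [MASK] cute"; only position 3 is a target.
probs = [{"[CLS]": 1.0}, {"my": 1.0}, {"dog": 1.0}, {"is": 0.5, "was": 0.5}, {"cute": 1.0}]
print(mlm_loss(probs, {3: "is"}))
```

Changing the predicted distribution at a non-target position leaves the loss unchanged, which is what "we don’t predict other tokens" amounts to.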
  79. Pretraining task 2: Next sentence prediction

     Idea: Train the model so that it can classify whether two given sentences are consecutive or not
     Obtain supervision data by extracting sentences from large corpora
       BooksCorpus (800M words) and English Wikipedia (2,500M words)
     Procedure:
       Choose two sentences that are consecutive 50% of the time
       Choose two sentences that are not consecutive 50% of the time
     Examples:
       "My dog is cute. He likes playing." → BERT → Yes
       "My dog is cute. I went to the station." → BERT → No
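The pair-sampling procedure can be sketched as follows. This is a simplified illustration: negatives are drawn from the same small list of sentences rather than from a large corpus, and the function name `make_nsp_examples` is hypothetical.

```python
import random


def make_nsp_examples(sentences, rng):
    """Build (sentence_a, sentence_b, label) triples for next sentence prediction."""
    if len(sentences) < 3:
        raise ValueError("need at least three sentences to sample a negative")
    examples = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the sentence that actually follows.
            examples.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            # Negative example: any sentence other than itself or the true next one.
            candidates = [j for j in range(len(sentences)) if j not in (i, i + 1)]
            j = rng.choice(candidates)
            examples.append((sentences[i], sentences[j], "NotNext"))
    return examples


sentences = [
    "My dog is cute.",
    "He likes playing.",
    "I went to the station.",
    "The train was late.",
]
for example in make_nsp_examples(sentences, random.Random(0)):
    print(example)
```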
  80. Next sentence prediction

    [Figure: BERT (Transformer encoder) on the input sequence "[CLS] my dog is cute [SEP] he likes play ##ing [SEP]". Each input embedding is the sum of a token embedding, a segment embedding (S_A for Sentence 1, S_B for Sentence 2), and a position embedding (P_0 to P_10). Output label: IsNext (or NotNext otherwise).]
  81. Finetuning

    [Figure: BERT (Transformer encoder) mapping the input embeddings of "[CLS] 1 2 …" to output embeddings.]
     BERT models handle both single-text and text-pair tasks
       Self-attention allows bidirectional cross attention between two sentences
     We can view output embeddings as feature representations of input text
       Output embedding at a token position: contextual word embedding of the token at that position
       Output embedding at [CLS]: embedding for a single sentence or a sentence pair
     We reuse the model architecture and parameters for downstream tasks
       Finetune BERT models on target tasks
       We modify the label definition and output layers for each downstream task
  82. Finetuning task type 1: Sentence pair classification

    [Figure: BERT (Transformer encoder) on the input sequence "[CLS] Tok1 Tok2 … TokN [SEP] Tok1 Tok2 … TokM [SEP]" (Sentence 1 followed by Sentence 2); a Label is predicted from the output embeddings.]
    Task example: Multi-Genre Natural Language Inference (MultiNLI)
     Sentence 1: “At the other end of Pennsylvania Avenue, people began to line up for a White House tour.”
     Sentence 2: “People formed a line at the end of Pennsylvania Avenue.”
     Label: entailment
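One common way to realize such an output layer is a linear layer plus softmax applied to the output embedding at [CLS], as in the BERT paper. The sketch below assumes that setup; the weights, the two-dimensional embedding, and the function names are toy values for illustration, not trained parameters.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(cls_embedding, weights, bias, labels):
    """Linear layer on the [CLS] output embedding, followed by softmax."""
    scores = [
        sum(w * x for w, x in zip(row, cls_embedding)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(scores)
    best = max(range(len(labels)), key=lambda k: probs[k])
    return labels[best], probs


labels = ["entailment", "neutral", "contradiction"]
weights = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # toy output-layer weights
bias = [0.0, 0.0, 0.0]
label, probs = classify([1.0, 0.0], weights, bias, labels)
print(label, probs)
```

During finetuning, these output-layer parameters are trained together with all BERT parameters; single sentence classification can use the same kind of head with a different label set.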
  83. Finetuning task type 2: Single sentence classification

    [Figure: BERT (Transformer encoder) on the input sequence "[CLS] Tok1 Tok2 … TokN"; a Label is predicted from the output embeddings.]
    Task example: Stanford Sentiment Treebank (SST)
     Input sentence: “You’ll probably love it.”
     Label: positive
  84. Finetuning task type 3: Question answering

    [Figure: BERT (Transformer encoder) on the input sequence "[CLS] Tok1 Tok2 … TokN [SEP] Tok1 Tok2 … TokM [SEP]" (Question followed by Paragraph); START and END positions of the answer are predicted from the output embeddings.]
    Task example: Stanford Question Answering Dataset (SQuAD) https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/Doctor_Who.html
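Given a START score and an END score for every paragraph token, the answer span can be chosen as the pair (i, j) with i ≤ j that maximizes the summed score. The sketch below assumes the scores are already computed; the numbers and the optional `max_len` limit are made up for illustration.

```python
def best_span(start_scores, end_scores, max_len=None):
    """Return (start, end) indices with start <= end maximizing start + end score."""
    best = None
    for i, s in enumerate(start_scores):
        for j in range(i, len(end_scores)):
            if max_len is not None and j - i + 1 > max_len:
                break
            score = s + end_scores[j]
            if best is None or score > best[0]:
                best = (score, i, j)
    return best[1], best[2]


paragraph = ["the", "Doctor", "Who", "series"]  # toy paragraph tokens
start_scores = [0.1, 2.0, 0.3, 0.0]
end_scores = [3.0, 0.2, 1.5, 0.1]
i, j = best_span(start_scores, end_scores)
print(i, j, " ".join(paragraph[i:j + 1]))
```

The constraint i ≤ j matters here: token 0 has the highest END score, but pairing it with the best START position would give an invalid span.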
  85. Finetuning task type 4: Single sentence tagging

    [Figure: BERT (Transformer encoder) on the input sequence "[CLS] Tok1 Tok2 … TokN"; one tag is predicted per token from the output embeddings, e.g., O B-PER I-PER O B-ORG I-ORG I-ORG O O O.]
    Task example: Named Entity Recognition (NER) (as sequential labeling)
     Input: “In March 2005, the New York Times acquired About, Inc .”
     Output: O B-TEMP I-TEMP O B-ORG I-ORG I-ORG I-ORG O B-ORG
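To show how per-token tags become entity spans, here is a small decoder for the B-/I-/O scheme. It is a sketch under one assumption about the example: the ten output tags are aligned with the first ten tokens of the input, with the comma treated as its own token. The helper name `bio_to_spans` is hypothetical.

```python
def bio_to_spans(tokens, tags):
    """Group tokens into (entity_type, text) spans according to B-/I-/O tags."""
    spans = []
    current_type, current_tokens = None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span, closing any open one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [tok]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(tok)
        else:
            # O tag (or a stray I- tag, treated as outside): close any open span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Assumed alignment of the example tags with the first ten tokens.
tokens = ["In", "March", "2005", ",", "the", "New", "York", "Times", "acquired", "About"]
tags = ["O", "B-TEMP", "I-TEMP", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "B-ORG"]
print(bio_to_spans(tokens, tags))
```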
  86. Performance on downstream tasks

    [Figures: result tables for the GLUE benchmark [1], SQuAD 1.0 (Q&A), and CoNLL 2003 (NER)]
    J Devlin, M-W Chang, K Lee, K Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL, pp. 4171-4186.
    [1] https://gluebenchmark.com/leaderboard
  87. Summary

     The attention mechanism addresses the drawback of the fixed-size vector representation in encoder-decoder models
       A decoder can extract features directly from encoder states
       Parameters in the attention mechanism are trained on the target task (without explicit supervision data for the attention mechanism)
     RNNs/LSTMs are difficult to parallelize across timesteps
       Encoder-decoder models can instead be built from CNNs and positional encoding
       Transformer removes recurrent computation by using self-attention and positional encoding
     GPT and BERT are Transformer models applicable to various NLP tasks
       GPT is a uni-directional language model based on the Transformer decoder
       BERT is a bi-directional model based on the Transformer encoder