
Encoder Decoder Models


language modeling, Recurrent Neural Network Language Model (RNNLM), encoder-decoder models, sequence-to-sequence models, attention mechanism, reading comprehension, question answering, headline generation, multi-task learning, character-based RNN, byte-pair encoding, SentencePiece, Convolutional Sequence to Sequence (ConvS2S), Transformer, coverage, round-trip translation

Naoaki Okazaki

August 07, 2020


Transcript

  1. Encoder Decoder Models
    Naoaki Okazaki
    School of Computing,
    Tokyo Institute of Technology
    [email protected]
    PowerPoint template designed by https://ppt.design4u.jp/template/


  2. Main task: Machine Translation (MT)
    1
     Translate text written in one language into another language
     Basic idea
     How do Computers Learn a New Language? An Introduction to
    Statistical Machine Translation
     https://www.youtube.com/watch?v=_ghMKb6iDMM (6:29)
    こんにちは
    Hello
    您好
    Hola


  3. Statistical Machine Translation (SMT)
    2
    Input: 私は東京に行った.   Output: I went to Tokyo.
    Building probabilistic models
    Translation model (Japanese to English), P(English sentence | Japanese sentence), estimated from supervision data (parallel corpus):
      私は動物園に行った. 彼らは東京に行った.
      I went to the zoo. They went to Tokyo.
      I 私は = 0.8, they 彼らは = 0.8
      went 行った = 0.9, to に = 0.9
      the zoo 動物園 = 0.8, Tokyo 東京 = 0.8
    Language model (naturalness in English), estimated from supervision data (monolingual corpus):
      I went to Tokyo to meet my friend last Sunday. It was the first time since ……
      We went to the zoo near Ueno for ……
      the of = .012243, the in = .007208, the to = .005042, … …,
      was it = .000522, to went = .000080,

  4. DNNs applied to MT
    3
    Replace the probabilistic models with DNNs
    Input: I went to Tokyo.   Output: 私は東京に行った.
    P(output | input): modeled with Deep Neural Networks (DNNs)
    It is not as simple as introducing DNN architectures
    that were successful in other research fields (e.g.,
    computer vision)

  5. Connection to the previous lecture
    4
     Embeddings for phrases and sentences seem to be useful
    for solving tasks
     Is it possible to generate a sentence (sequence of words)
    from embeddings?
     Yes, encoder-decoder models can do that!
    [Figure: the phrase "very good movie" is fed to an encoder (Enc), and a decoder (Dec) generates "とても 良い 映画" from the resulting embedding]

  6. Language modeling
    5


  7. Demo: Text-generation with GPT-2
    6
    https://github.com/graykode/gpt-2-Pytorch
    Text generated by giving the first paragraph of the Wikipedia article of “Harry Potter”
    https://en.wikipedia.org/wiki/Harry_Potter


  8. Language model (LM)
    7
    • For a given word sequence w_1, …, w_N, LMs compute the joint probability P(w_1, …, w_N)
    • Example: which word w fills the blank in "I have a ___."?
      argmax_{w ∈ V} P(I, have, a, w)
      where V is the set of all words in the vocabulary (pen, dog, PC, ……, what)
    • Used to assess the naturalness of a sentence (sequence of words) generated by machine
      translation, speech recognition, etc.

  9. Probabilistic language models
    8
    P(w_1, …, w_N) = ∏_{t=1}^{N} P(w_t | w_1, …, w_{t-1})
    Predict the next word w_t after the word sequence w_1, …, w_{t-1}:
      P(w_t | w_1, …, w_{t-1}) = #(w_1, …, w_{t-1}, w_t) / #(w_1, …, w_{t-1})
    ☹ Data sparseness problem: Insufficient statistics to estimate
    the probability with a longer sequence of words
    Example: P(This, is, a, pen)
      = P(This | BOS) P(is | This) P(a | This is) P(pen | This is a) P(EOS | This is a pen)

  10. -gram probabilistic language modeling
    9
    P(w_1, …, w_N) ≈ ∏_{t=1}^{N} P(w_t | w_{t-n+1}, …, w_{t-1})
    Remedy the data sparseness problem by compromising
    with a shorter context
    Predict the next word w_t after a word sequence w_{t-n+1}, …, w_{t-1} of length n − 1:
      P(w_t | w_{t-n+1}, …, w_{t-1}) = #(w_{t-n+1}, …, w_{t-1}, w_t) / #(w_{t-n+1}, …, w_{t-1})
    We have more counts!
    Example with 2-grams: P(This, is, a, pen)
      = P(This | BOS) P(is | This) P(a | is) P(pen | a) P(EOS | pen)
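Not part of the original slides: a minimal count-based bigram LM sketch in Python, using a hypothetical toy corpus, to illustrate the count-ratio estimate above (smoothing is omitted).

```python
from collections import Counter

# Hypothetical toy corpus; real LMs are estimated from large monolingual corpora.
corpus = [["BOS", "This", "is", "a", "pen", "EOS"],
          ["BOS", "This", "is", "a", "dog", "EOS"]]

bigram, context = Counter(), Counter()
for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        bigram[(prev, cur)] += 1   # #(w_{t-1}, w_t)
        context[prev] += 1         # #(w_{t-1})

def p(cur, prev):
    # P(w_t | w_{t-1}) = #(w_{t-1}, w_t) / #(w_{t-1}); zero for unseen contexts (no smoothing)
    return bigram[(prev, cur)] / context[prev] if context[prev] else 0.0

def sentence_prob(sent):
    prob = 1.0
    for prev, cur in zip(sent, sent[1:]):
        prob *= p(cur, prev)
    return prob

print(sentence_prob(["BOS", "This", "is", "a", "pen", "EOS"]))  # 0.5 with this toy corpus
```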

  11. Sentence generation with LM
    10
    • Find the word sequence w_1, …, w_N that maximizes the probability:
      argmax_{w_1, …, w_N ∈ V} P(w_1, …, w_N)
    • However, we cannot specify a desired output
    • Generate a sentence y_1, …, y_M conditioned on an input x_1, …, x_N:
      argmax_{y_1, …, y_M ∈ V} P(y_1, …, y_M | x_1, …, x_N)
    • P(y_1, …, y_M | x_1, …, x_N) corresponds to the translation model in machine translation
      • Translation model: whether the output is a correct translation of the input
      • Language model: whether the generated sentence is natural as language

  12. Sentence generation as a search problem
    11
    • Sentence generation has O(|V|^M) time complexity:
      argmax_{y_1, …, y_M ∈ V} P(y_1, …, y_M | x_1, …, x_N)
    • Unrealistic to enumerate all possible candidates
      • Usually, |V| > 10,000 and M is 20~100
    • Search one word after another (i.e., greedy / beam search; see the sketch after this slide)
    [Figure: search tree rooted at BOS; candidates y_1 are scored with P(y_1 | x), then extended to
    (y_1, y_2) with P(y_1, y_2 | x), then (y_1, y_2, y_3), (y_1, y_2, y_3, y_4), …, e.g., "I have a pen"]
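Not from the slides: a minimal beam-search sketch in Python. The `step_logprobs` interface is a hypothetical stand-in for a conditional LM that returns log P(token | prefix, input) for the next position.

```python
def beam_search(step_logprobs, beam_size=3, max_len=20, bos="BOS", eos="EOS"):
    """step_logprobs(prefix) is assumed to return {token: log P(token | prefix, input)}."""
    beams = [([bos], 0.0)]            # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == eos:     # hypothesis already complete
                finished.append((prefix, score))
                continue
            for tok, lp in step_logprobs(prefix).items():
                candidates.append((prefix + [tok], score + lp))
        if not candidates:            # every beam has produced EOS
            beams = []
            break
        # keep only the beam_size best partial hypotheses instead of all |V|^M candidates
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    finished.extend(beams)            # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])
```

Greedy search is the special case beam_size = 1.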

  13. Issues in LM (before the DNN era)
    12
    • Data sparseness
      • Rare words suffer from the insufficiency of statistics
      • The insufficiency gets worse when using n-grams (word combinations)
      • Addressed by smoothing methods (e.g., Good-Turing, Kneser-Ney)
    • Surface variations
      • Surface variations with the same meaning have different probabilities
      • For example, P(girl | clever) and P(girl | smart) are independent even if
        'clever' and 'smart' have similar meanings
      • Addressed by 'class' models that merge similar words into a group
    • Long-distance dependency
      • n-gram models cannot consider dependencies longer than n words
    • Neural LMs address these issues using distributed representations
      (word embeddings and their compositions)

  14. Recurrent Neural Network Language Model (Mikolov+ 2010)
    13
    [Figure: RNNLM unrolled over "BOS I have a pen"; the hidden state is updated at each step
    (h_0, h_1, h_2, h_3, h_4), and a softmax layer at each step outputs P(I), P(have), P(a),
    P(pen), P(EOS) conditioned on the preceding words]
    The number of dimensions of the output layer is |V|, where V is the set of possible words.
    Each element presents the probability of generating the corresponding word.
    The probability of a sequence of words is a product of token prediction probabilities:
      P(I, have, a, pen, EOS) = P(I | BOS) × P(have | BOS I) × P(a | BOS I have)
        × P(pen | BOS I have a) × P(EOS | BOS I have a pen)
    T Mikolov, M Karafiát, L Burget, J Černocký, S Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH, pp. 1045-1048.
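Not in the original slides: a minimal PyTorch sketch of an RNN language model in the spirit of Mikolov+ (2010); layer sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)      # output layer has |V| dimensions

    def forward(self, tokens):                            # tokens: (batch, seq_len) word ids
        states, _ = self.rnn(self.embed(tokens))          # h_t summarizes w_1 .. w_t
        return self.out(states)                           # next-word logits at every step

# Training minimizes cross-entropy between logits[:, :-1] and tokens[:, 1:],
# i.e., each position predicts the following token.
```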

  15. Encoder
    14


  16. Recurrent Neural Networks (RNNs) (Sutskever+ 2011)
    15
    I Sutskever, J Martens, G Hinton. 2011. Generating text with recurrent neural networks. In ICML, pp. 1017–1024.
    [Figure: the sentence "John loves Mary much" processed by an RNN; each word x_t is mapped to
    an embedding, the hidden vectors h_1, h_2, h_3, h_4 are composed recurrently, and a softmax
    layer makes a prediction from h_4]
    Word embeddings: represent a word x_t with a vector
    Recurrent computation: compose a hidden vector h_t from an input word x_t
    and the hidden vector h_{t-1} at the previous timestep (with h_0 = 0):
      h_t = g(W^{(xh)} x_t + W^{(hh)} h_{t-1})
    Fully-connected layer for a task: make a prediction from the hidden vector h_4,
    which is composed from all words in the sentence, by using a fully-connected layer and softmax:
      y = softmax(W^{(hy)} h_4)
    The parameters W^{(xh)}, W^{(hh)}, W^{(hy)} are shared over the entire sequence.
    They are trained from the supervision signal, backpropagating through h_1, …, h_4.

  17. Convolutional Neural Network (CNN) (Kim 2014)
    16
    Y Kim. 2014. Convolutional neural networks for sentence classification. In EMNLP, pp. 1746-1751.
    [Figure: the sentence "It is a very good movie indeed"; a convolution over each window
    x_{i:i+k-1} of k consecutive word embeddings produces feature values c_{i,j}]
    Max pooling: each dimension j is the maximum of the values c_{i,j} over timesteps:
      ĉ_j = max_{1 ≤ i ≤ T-k+1} c_{i,j}
    The prediction is made by a fully-connected layer with softmax over ĉ.

  18. Encoding
    17
    • These models can be decomposed into
      • Encoding (variable-length input to feature vector)
        • v = encode(x_1, …, x_T) (encode is a part of the NN)
      • Solving the task (e.g., classify the text using the feature vector)
        • y = solve(v) (solve is also a part of the NN)
    [Figure: the RNN over "I have a pen" and the CNN over "It is a very good movie indeed" from the
    previous slides, each split into an encoding part that produces a feature vector and a
    task-specific softmax layer]

  19. Encoder decoder models
    18


  20. Using RNNLM for generating sentences
    19
    Predict a sequence of words for a given input x, in addition to scoring the
    naturalness of the generated sentence
    [Figure: the RNNLM fed "BOS I have a pen" plus the input x, predicting "I have a pen EOS"]

  21. Encoder decoder model (EncDec) (Sutskever+ 2014; Cho+ 2014)
    20
    I Sutskever, O Vinyals, Q V Le. 2014. Sequence to sequence learning with neural networks. In NIPS, pp. 3104–3112.
    K Cho, van B Merrienboer, C Gulcehre, D Bahdanau, F Bougares, H Schwenk, Y Bengio. 2014. Learning phrase representations using RNN encoder–
    decoder for statistical machine translation. In EMNLP, pp. 1724–1734.
    I have a
    ペン を 持つ
    pen BOS ペン を 持つ
    EOS
    Encoder
    Decoder
    ※ This illustration omits the matrices of RNNs
    Representation
    of the input
    • Encode an input sentence into a feature vector, and generate a sentence by
      decoding (predicting) a word sequence from the feature
    • Also known as a sequence-to-sequence model
    • Machine translation is realized by a single NN!
      • Machine translation had been a mix of various theories and methods before
        neural machine translation (NMT)
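Not from the slides: a minimal PyTorch sketch of an RNN encoder-decoder trained with teacher forcing; the GRU cells and layer sizes are illustrative choices, not the exact models of Sutskever+ or Cho+.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=128, hid=256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, tgt_vocab)

    def forward(self, src, tgt_in):
        # Encode the source sentence into a single fixed-size representation.
        _, h = self.encoder(self.src_embed(src))
        # Decode conditioned on that representation; tgt_in = BOS + reference (teacher forcing).
        states, _ = self.decoder(self.tgt_embed(tgt_in), h)
        return self.out(states)          # next-token logits at every decoder step
```

At test time, decoding feeds each predicted token back into the decoder (greedy or beam search, as in the earlier sketch).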

  22. Caption generation (Vinyals+ 2015)
    21
    O Vinyals, A Toshev, S Bengio, D Erhan. 2015. Show and tell: A neural image caption generator. In CVPR.


  23. Chatbot (Vinyals+ 2015)
    22
     Supervision data: OpenSubtitles
     Scripts extracted from movie subtitles (6G sentences)
     A chat example from the EncDec model
    O Vinyals, Q V Le. 2015. A neural conversational model, In ICML Deep Learning Workshop.


  24. Summary
     Encoder-decoder architecture
     An encoder converts an input sentence into a feature vector
     A decoder generates a sentence based on the vector
     We can train an encoder-decoder model in an end-to-end fashion
     (An autoregressive) decoder predicts a token sequence by
    feeding predicted tokens into the input layer
     We can connect different modalities (e.g., language and
    vision) in a single NN as long as they are represented as
    vectors
    23


  25. Attention mechanism
    24


  26. Weakness of EncDec
    25
    • EncDec represents a variable-length input with a
      fixed-size vector
      • EncDec has no flexibility about the amount of information in
        the input
      • EncDec therefore struggles with longer sentences
    [Figure: the same "I have a pen" → "ペン を 持つ" encoder-decoder, where the whole input is
    squeezed into a single vector]

  27. The idea of attention mechanism
    26
    This is a pen BOS
    +
    これ は ペン
    BOS これ は ペン
    EOS
    At each timestep in the
    decoder, predict a word
    using the weighted sum of
    all hidden vectors in the
    input
    Attention mechanism
    determines the weights
    automatically from the
    decoder state
    The decoder now has access to all hidden vectors
    in the input

  28. Attention mechanism (Bahdanau+ 2015, Luong+ 2015)
    27
    [Figure: encoder states \bar{h}_s for "This is a pen" at source positions s = 1, …, 4 and decoder
    states h_t for "BOS これ" at target positions t = 1, 2; the decoder attends to all encoder states
    when predicting "は"]
    α_t(s) = exp(score(h_t, \bar{h}_s)) / Σ_{s'} exp(score(h_t, \bar{h}_{s'}))
    c_t = Σ_s α_t(s) \bar{h}_s
    \tilde{h}_t = tanh(W_c [c_t; h_t])
    y_t = softmax(W_y \tilde{h}_t)
    score(h_t, \bar{h}_s) = h_t⊤ \bar{h}_s
    • Different time-step variables are used for the encoder (s) and the decoder (t)
    • Computation flow (Luong+ 2015): h_{t-1} → h_t → c_t → \tilde{h}_t → h_{t+1}
    • score(h_t, \bar{h}_s): how much the decoder at time step t needs information from
      time step s in the encoder
    D Bahdanau, K Cho, Y Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
    M-T Luong, H Pham, C D Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP, pp. 1412-1421.

  29. Computing attention scores
    28
    • Attention weights:
      α_t(s) = exp(score(h_t, \bar{h}_s)) / Σ_{s'} exp(score(h_t, \bar{h}_{s'}))
      • Scores are normalized into a probability distribution
    • Various approaches for computing the attention score:
      score(h_t, \bar{h}_s) = h_t⊤ \bar{h}_s                  (dot)
                            = h_t⊤ W_a \bar{h}_s              (product)
                            = v_a⊤ tanh(W_a [h_t; \bar{h}_s]) (concat)
    • W_a and v_a are parameters (trained by backpropagation)
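Not part of the slides: a PyTorch sketch of one decoder step of dot-product (Luong-style) attention; tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def luong_dot_attention(dec_h, enc_hs):
    """dec_h: (batch, d) decoder state h_t; enc_hs: (batch, src_len, d) encoder states."""
    scores = torch.bmm(enc_hs, dec_h.unsqueeze(2)).squeeze(2)    # score(h_t, h_s) for every s
    alpha = F.softmax(scores, dim=1)                             # attention weights α_t(s)
    context = torch.bmm(alpha.unsqueeze(1), enc_hs).squeeze(1)   # c_t = Σ_s α_t(s) h_s
    return context, alpha

# The attentional state is then tanh(W_c [c_t; h_t]), followed by the output softmax.
```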

  30. Attention has an advantage on longer sentences
    29
    (Luong+ 2015)
    local-p: Attention mechanism that predicts
    the focal range of the input sequence
    based on the hidden state of the decoder
    M-T Luong, H Pham, C D Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP, pp. 1412-1421.


  31. Attention roughly represents alignments
    30
    Global attention Local monotonic focus
    Gold alignment
    Local predictive focus
    (Luong+ 2015)
    M-T Luong, H Pham, C D Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP, pp. 1412-1421.


  32. Show, attend and tell (Xu+ 2015)
    31
    (Xu+ 2015)
    K Xu, J Ba, R Kiros, K Cho, A Courville, R Salakhutdinov, R Zemel, Y Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with
    Visual Attention. In ICML, pp. 2048-2057.


  33. Convolutional Neural Network
    for encoder-decoder models
    32


  34. RNN/LSTM and CNN
    33
     It is hard to parallelize RNN/LSTM for time steps
     It is easy to parallelize CNN for time steps
    ☺ RNN may capture
    distant dependencies
    of tokens
    ☹ Need to traverse
    the full distance of
    the path of words
    ☹ Hard to parallelize
    ☺ We can compute
    convolutions in
    parallel
    ☹ CNN may not
    capture dependencies
    beyond the window


  35. ByteNet (Kalchbrenner+ 16)
    34
    ☺ Requires only O(log N) traverses (via dilated convolutions) for
    handling distant dependencies
    (Kalchbrenner+ 2016)
    N Kalchbrenner, L Espeholt, K Simonyan, A van den Oord, A Graves, K Kavukcuoglu. 2016. Neural Machine Translation in Linear Time.
    arXiv:1610.10099.

  36. Convolutional Sequence to Sequence (ConvS2S) (Gehring+ 17)
    35
    これ は ペン です
    _ EOS _
    Encoder
    _ BOS This is
    _ a pen
    Decoder
    EOS
    A rotation animation (in the original slides) represents the composition of a hidden state
    of the decoder by attending to the hidden states of the encoder
    Predict a word → Compose the decoder vector → Predict the next word
    In order to realize this, we insert dummy tokens _
    An encoder-decoder model built only with CNNs
    J Gehring, M Auli, D Grangier, D Yarats, Y N Dauphin. 2017. Convolutional sequence to sequence learning. In ICML. pp. 1243-1252.

  37. Vector composition in ConvS2S
    36
    [Figure: the tokens これ<1> は<2> with their position indices; word and position embeddings are
    summed before the convolution layers]
    Position embedding:
      e_i = w_i + p_i
    Gated Linear Unit (GLU):
      h'_i = (E_i W + b) ⊗ σ(E_i V + c)
      where E_i is the window of input embeddings around position i, ⊗ is element-wise
      multiplication, and σ(·) acts as a gate
    Residual connection:
      h_i = h'_i + e_i
    Encoder and decoder use the same architecture.
    Their experiments use 20-layer CNNs with a window length of 3.

  38. Transformer
    37


  39. Transformer: “Attention is all you need” (Vaswani+ 2017)
    38
    A Vaswani, N Shazeer, N Parmar, J Uszkoreit, L Jones, A N. Gomez, L Kaiser, I Polosukhin. 2017. Attention is all you need. In NIPS, pp. 5998–6008.
    https://research.googleblog.com/2017/08/transformer-novel-neural-network.html


  40. The architecture of Transformer
    39
    (Vaswani+ 2017)
    A Vaswani, N Shazeer, N Parmar, J Uszkoreit, L Jones, A N. Gomez, L Kaiser, I Polosukhin. 2017. Attention is all you need. In NIPS, pp. 5998–6008.


  41. The architecture of Transformer
    40
    loves Mary
    John ジョン は
    BOS
    ジョン は メアリー
    Positional
    encoding
    Self
    attention
    Residual +
    Layer-norm
    Feedforward
    Source-target
    attention
    Positional
    encoding
    Masked self
    attention
    Residual +
    Layer-norm
    Residual +
    Layer-norm
    Residual +
    Layer-norm
    Feedforward
    Residual +
    Layer-norm


  42. Positional encoding
    41
    • Transformer has no recurrence nor convolution
      • We need to inject position information into the hidden states in some way
    • Add a positional encoding p_t ∈ ℝ^d to the token embedding w_t ∈ ℝ^d at position t
      to represent positions of hidden states in the encoder and decoder:
      x_t = w_t + p_t
      • d: a constant representing the number of dimensions of the vectors
    • The i-th dimension of p_t is
      p_{t,i} = sin(ω_k t)  (i = 2k),   cos(ω_k t)  (i = 2k + 1),   where ω_k = 1 / 10000^{2k/d}
    [Figure: value of the i-th dimension of the positional-encoding vector as a function of
    position; modified from the figure in (Vaswani+ 2017)]
    A Vaswani, N Shazeer, N Parmar, J Uszkoreit, L Jones, A N. Gomez, L Kaiser, I Polosukhin. 2017. Attention is all you need. In NIPS, pp. 5998–6008.
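Not from the slides: a NumPy sketch of the sinusoidal positional encoding above (assumes an even model dimension d).

```python
import numpy as np

def positional_encoding(max_len, d):
    pe = np.zeros((max_len, d))
    pos = np.arange(max_len)[:, None]                    # positions t = 0 .. max_len-1
    omega = 1.0 / 10000 ** (np.arange(0, d, 2) / d)      # ω_k = 1 / 10000^{2k/d}
    pe[:, 0::2] = np.sin(pos * omega)                    # even dimensions: sin(ω_k t)
    pe[:, 1::2] = np.cos(pos * omega)                    # odd dimensions:  cos(ω_k t)
    return pe

# x_t = w_t + pe[t]: the encoding is simply added to the token embeddings.
```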

  43. Properties of positional encoding
    42
    p_{t,i} = sin(ω_k t)  (i = 2k),   cos(ω_k t)  (i = 2k + 1),   where ω_k = 1 / 10000^{2k/d}
    • Values of lower dimensions change a lot, but those of higher ones do not
      • It looks like a continuous version of binary code
    • Positional encodings from close positions yield similar values

  44. Self attention (in encoders)
    43
    [Figure: self attention over "John loves Mary"; for each token, scores against all tokens are
    scaled by 1/√d, normalized with a softmax, and used to form a weighted sum of the value vectors]
    Each token (timestep) attends to every token (timestep) in the input

  45. Self attention expressed as scaled dot-product attention
    44
    • Generalized expression of attention mechanism:
      • Compute a matching score between two vectors (query and key)
      • Convert the matching score into a weight
      • Add the vectors (values) weighted by those weights
    • We represent an input (T tokens) with a matrix X = (x_1, …, x_T) ∈ ℝ^{T×d}
    • The encoder converts X into queries, keys, and values: Q = XW^Q, K = XW^K, V = XW^V
      (W^Q ∈ ℝ^{d×d_k}, W^K ∈ ℝ^{d×d_k}, W^V ∈ ℝ^{d×d_v}, Q ∈ ℝ^{T×d_k}, K ∈ ℝ^{T×d_k}, V ∈ ℝ^{T×d_v})
    • Parallel computation of self attention over all inputs with the scaled dot-product:
      Attention(Q, K, V) = softmax(QK⊤ / √d_k) V
      • The scaling by 1/√d_k prevents a dot product from being too large, which would
        saturate the softmax and cause a (near-)zero gradient
    [Figure: a two-token worked example; query × key → score matrix → softmax weights →
    weighted sum of values → output]
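Not part of the slides: a PyTorch sketch of scaled dot-product attention; the optional boolean mask (True = hidden position) anticipates the masked self-attention used in decoders.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5            # (..., T_q, T_k) matching scores
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))     # e.g., hide future tokens in a decoder
    weights = F.softmax(scores, dim=-1)                      # attention weights per query
    return weights @ V                                       # weighted sum of the values
```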






  46. Multi-head self-attention
    45
    • Project an input X ∈ ℝ^{T×d} into h sub-spaces of queries, keys, and values:
      Q_i = XW_i^Q,  K_i = XW_i^K,  V_i = XW_i^V
      (i ∈ {1, …, h}, W_i^Q ∈ ℝ^{d×d_k}, W_i^K ∈ ℝ^{d×d_k}, W_i^V ∈ ℝ^{d×d_v})
    • Compute an attention head M_i ∈ ℝ^{T×d_v} with scaled dot-product attention:
      M_i = Attention(Q_i, K_i, V_i) = Attention(XW_i^Q, XW_i^K, XW_i^V)
    • Project the attention heads back to the d-dimensional space with W^O ∈ ℝ^{hd_v×d}:
      MultiHead(X) = (M_1 ⨁ M_2 ⨁ … ⨁ M_h) W^O   (⨁: concatenation)
    • The scaled dot-product attention computes a single weight matrix, i.e., a single
      pattern of how to attend to the input
    • It may be useful to have multiple perspectives of how to attend to the input
    • Let's consider multiple attentions, altering the queries, keys, and values
    [Figure: h scaled dot-product attention heads computed in parallel, followed by concatenation
    and the output projection]
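Not from the slides: a self-contained PyTorch sketch of multi-head self-attention; d_model and n_heads are illustrative, and the per-head projections are packed into single linear layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_k = n_heads, d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model)        # packs W_i^Q for all heads
        self.W_k = nn.Linear(d_model, d_model)        # packs W_i^K
        self.W_v = nn.Linear(d_model, d_model)        # packs W_i^V
        self.W_o = nn.Linear(d_model, d_model)        # output projection W^O

    def forward(self, X, mask=None):                  # X: (batch, T, d_model)
        B, T, _ = X.shape
        split = lambda t: t.view(B, T, self.h, self.d_k).transpose(1, 2)   # -> (B, h, T, d_k)
        Q, K, V = split(self.W_q(X)), split(self.W_k(X)), split(self.W_v(X))
        scores = Q @ K.transpose(-2, -1) / self.d_k ** 0.5                 # one score matrix per head
        if mask is not None:
            scores = scores.masked_fill(mask, float("-inf"))
        heads = F.softmax(scores, dim=-1) @ V                              # (B, h, T, d_k)
        concat = heads.transpose(1, 2).reshape(B, T, -1)                   # M_1 ⊕ … ⊕ M_h
        return self.W_o(concat)
```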

  47. Why self-attention?
    46
    • Self-attention is usually faster than RNNs (when the sequence length n is smaller
      than the dimensionality d)
      • "NLP researchers scared 2 much, but the Google engineer didn't"
    • Self-attention is parallelizable over a sequence
    • Self-attention connects all positions in O(1) steps
      • An RNN requires O(n) sequential computations
      • A CNN requires O(log_k n) convolution operations (with dilated convolutions)
    (Vaswani+ 2017)
    A Vaswani, N Shazeer, N Parmar, J Uszkoreit, L Jones, A N. Gomez, L Kaiser, I Polosukhin. 2017. Attention is all you need. In NIPS, pp. 5998–6008.

  48. Residual connection (He+ 16)
    47
    • Suppose that we want to learn a function h(x)
    • We consider another mapping: f(x) = h(x) − x
      • Then, the original mapping is h(x) = f(x) + x
      • We hypothesize that training f(x) is easier than h(x)
      • If an identity mapping is the default, pushing f(x) towards 0 may be easier
    • We can view f(x) + x as a feedforward neural network with shortcut
      connections
    • Useful to build a deeper network
      • Gradients flow on the shortcut connections
    • Proposed in ResNet (He+ 2016)
    [Figure: x passes through f(x) and is added back via a shortcut connection, giving f(x) + x]
    K He, X Zhang, S Ren, J Sun. 2016. Deep Residual Learning for Image Recognition. In CVPR.

  49. Layer normalization (Ba+ 16)
    48
    • Ensure zero mean and unit variance of a vector a^(new) computed from a ∈ ℝ^d:
      a^(new) ← (a − μ) / √(σ² + ε),   μ = (1/d) Σ_{i=1}^{d} a_i,   σ² = (1/d) Σ_{i=1}^{d} (a_i − μ)²
    • This is used in various places in Transformer
      • A mean μ and a variance σ² are computed at each time step
    • How it works(?) (adapted from batch normalization (Bjorck+ 2018))
      • Large activations in a lower layer cannot be propagated uncontrollably to upper
        layers because of the normalization operation
      • This prevents gradients from exploding (e.g., becoming too large)
      • This enables higher learning rates (recall that the amount of a parameter update is
        a product of a learning rate and a gradient)
      • A large learning rate leads to a larger noise in SGD (proportional to the squared learning rate)
      • A larger SGD noise prevents the network from getting "trapped" in sharp minima
        and biases it towards wider minima with better generalization
    J. L. Ba, J. R. Kiros, G. E. Hinton. 2016. Layer Normalization. arXiv:1607.06450.
    J. Bjorck, C. Gomes, B. Selman, K. Q. Weinberger. 2018. Understanding Batch Normalization. In NIPS, pp. 7694-7705.
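Not from the slides: a PyTorch sketch of the normalization step above, over the last dimension of a tensor (the learnable gain and bias of Ba+ 2016 are omitted).

```python
import torch

def layer_norm(a, eps=1e-5):
    mu = a.mean(dim=-1, keepdim=True)                   # μ: mean over the d components
    var = a.var(dim=-1, unbiased=False, keepdim=True)   # σ²: variance over the d components
    return (a - mu) / torch.sqrt(var + eps)
```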

  50. Feedforward layer
    49
    • Two linear transformations with ReLU in between:
      FFN(x) = max(0, xW_1 + b_1) W_2 + b_2
      (W_1 ∈ ℝ^{d×d_ff}, b_1 ∈ ℝ^{d_ff}, W_2 ∈ ℝ^{d_ff×d}, b_2 ∈ ℝ^d)
      • Linear transformation → ReLU → Linear transformation
      • The original paper sets d_ff = 4d in the experiments
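Not from the slides: a one-line PyTorch sketch of the position-wise feedforward block, with d_ff = 4 × d_model as in the paper.

```python
import torch.nn as nn

def feed_forward(d_model=512, d_ff=2048):
    # Linear -> ReLU -> Linear, applied independently at every position
    return nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
```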

  51. Masked self-attention (in decoders)
    50
    [Figure: masked self attention over the decoder inputs "ジョン は メアリー"; scores are scaled
    by 1/√d, and the scores pointing at later tokens are masked before the softmax and the
    weighted sum]
    Ignore attention scores looking at later tokens, as the decoder does not know future tokens

  52. Masked self attention (in decoders)
    51
    • When training an encoder-decoder model, we feed all source and target
      tokens to the input layer at once
      • We want to complete all computation as matrix operations for better
        parallelization
    • However, we must not look at future tokens
    • Before computing the softmax, we set all elements of the score matrix that
      point to future tokens to −∞ (e.g., −10^9) (masking); see the mask sketch below
    [Figure: the target sentence "BOS ジョン は メリー が 大好き" predicting "ジョン は メリー が
    大好き EOS"; the mask matrix lets each position attend only to itself and earlier positions]
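Not from the slides: a sketch of how such a mask can be built in PyTorch; the boolean matrix marks the future positions whose scores are replaced by −∞ before the softmax (compatible with the scaled_dot_product_attention sketch earlier).

```python
import torch

def causal_mask(T):
    # True above the diagonal = "this query must not see that (future) key"
    return torch.triu(torch.ones(T, T), diagonal=1).bool()

# causal_mask(4)[2] -> tensor([False, False, False, True]): position 3 may not attend to position 4.
```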

  53. Encoder-decoder attention (in decoders)
    52
    [Figure: encoder-decoder attention; each decoder state (for "ジョン", "は", "メアリー", "を")
    computes scaled scores against the encoder states of "John loves Mary", applies a softmax,
    and takes the weighted sum]
    Every position in the decoder can attend over all positions in the input sequence,
    similarly to the typical attention mechanism in encoder-decoder models

  54. Hyper-parameters
    53
    Parameter                          Base   Big
    # layers (N)                       6      6
    # dimensions (d_model)             512    1024
    # dimensions for FFN (d_ff)        2048   4096
    # attention heads (h)              8      16
    # dimensions of keys/queries (d_k) 64     64
    # dimensions of values (d_v)       64     64
    Dropout rate P_drop                0.1    0.3
    # training steps                   100K   300K
    # total parameters                 65M    213M
    • Some training tips exist for Transformer
      • E.g., the learning rate is increased linearly for the first warm-up steps, and then
        decreased proportionally to the inverse square root of the step number

  55. Task performance
    54
    Transformer established the new state-of-the-art performance on En-De
    translation even with the base model (fewer parameters than big)
    (Vaswani+ 2017)
    A Vaswani, N Shazeer, N Parmar, J Uszkoreit, L Jones, A N. Gomez, L Kaiser, I Polosukhin. 2017. Attention is all you need. In NIPS, pp. 5998–6008.


  56. Coreference handling in self attention
    55
    The animal didn’t cross the street
    because it was too tired.
    The animal didn’t cross the street
    because it was too wide.
    A Vaswani, N Shazeer, N Parmar, J Uszkoreit, L Jones, A N. Gomez, L Kaiser, I Polosukhin. 2017. Attention is all you need. In NIPS, pp. 5998–6008.
    (Vaswani+ 2017)


  57. GPT
    56


  58. What is GPT (Radford+ 2018)?
    57
     A generic language model that is transferable to various NLP tasks
     A single model for different tasks
     Question answering, document classification, semantic similarity, …
     GPT-3 has been a hot topic recently (in 2020)
     Pretraining and finetuning (a kind of transfer learning)
     Pretraining learns parameters that are generic to the language
     Finetuning learns task-specific parameters on supervision data,
    leveraging the parameters acquired in pretraining
     Based on Transformer decoder
     Generative Pre-Training (GPT)
    A Radford, K Narasimhan, T Salimans, I Sutskever. 2018. Improving Language Understanding by Generative Pre-Training. Technical report.
    https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf


  59. GPT-3 (Brown+ 2020)
    58
    https://twitter.com/sharifshameem/status/1282676454690451457
    T. B. Brown, B. Mann, N. Ryder, M. Subbiahet, et. al. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165


  60. Architecture of GPT
    59
    (Radford+ 2018)
     GPT uses Transformer decoder across different tasks
     Pretraining is based on language modeling
    • Adding an output layer to a pretrained model for a target task, finetuning trains
      the output layers using supervision data
    A Radford, K Narasimhan, T Salimans, I Sutskever. 2018. Improving Language Understanding by Generative Pre-Training. Technical report.
    https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf

  61. Pretraining: Language modeling
    60
    [Figure: GPT (a Transformer decoder with L layers) reading the input sequence
    "Once upon a time there was a girl who really loved"; the input embedding at each position is
    the sum of a token embedding W_t and a position embedding P_t, and the output embedding h_t at
    each position predicts the next token, yielding "upon a time there was a girl who really loved books"]
    The model (Transformer decoder) is trained to predict the next token for each time step by
    using a large corpus as the supervision data

  62. Example of finetuning: Textual entailment
    61
    [Figure: the premise and hypothesis are concatenated into one input sequence
    "Tokyo Tech is located in Ookayama $ Japan has a university" (with a delimiter token $);
    token and position embeddings are summed, fed to GPT (the Transformer decoder), and a
    linear + softmax layer on top of the final output embedding predicts the label "Entail"]
    After training the language model, add a linear and a softmax layer to predict the
    labels of a target task, and adapt the parameters using the supervision data
    In addition to the objective of the target task, we also train the model with
    the objective of language modeling on the supervision data

  63. Training the GPT model
    62
     Pretraining
     BooksCorpus dataset (7,000 unique books) and 1B Words Benchmark
     Finetuning
     Details of the Transformer architecture
     12-layer decoder-only Transformer with masked self-attention
     Number of dimensions = 768 (12 attention heads)
     Vocabulary of 40,000 subword tokens built by Byte-Pair Encoding (BPE)
     117M parameters in total
    A Radford, K Narasimhan, T Salimans, I Sutskever. 2018. Improving Language Understanding by Generative Pre-Training. Technical report.
    https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
    (Radford+ 2018)


  64. Evaluation results
    63
     Natural Language Inference: SoTA on all datasets
     Improvements: 1.5% on MNLI, 5% on SciTail, 5.8% on QNLI, and 0.6% on SNLI
     Question answering and commonsense reasoning: SoTA on all datasets
     Improvements: 8.9% on Story Cloze, and 5.7% overall on RACE
     Semantic similarity: SoTA on two out of three datasets
     Classification: SoTA on GLUE benchmark (72.8 ← 68.9)
     Performance drastically drops without pre-training (see the table below)
    (Radford+ 2018)
    A Radford, K Narasimhan, T Salimans, I Sutskever. 2018. Improving Language Understanding by Generative Pre-Training. Technical report.
    https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf


  65. GPT-2 (Radford+ 2019)
    64
    • The paper explores whether a language model trained on text
      can solve NLP tasks such as question answering without finetuning
    • The architecture is the same as GPT
      • But the Transformer architecture is changed from Post-LN to Pre-LN
    • Training of the language model
      • 8M high-quality documents (40GB) crawled from the Web
    [Figure: Post-LN vs. Pre-LN blocks — layer normalization is applied after the attention/FFN
    sub-layers (Post-LN) or before them (Pre-LN)]
    A Radford, J Wu, R Child, D Luan, D Amodei, I Sutskever. 2019. Language Models are Unsupervised Multitask Learners. Technical report,
    https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

  66. Performance of GPT-2
    65
    117M (12 layers, 768 dims); 345M (24 layers, 1024 dims); 762M (36 layers, 1280 dims); 1542M (48 layers, 1600 dims)
    A Radford, J Wu, R Child, D Luan, D Amodei, I Sutskever. 2019. Language Models are Unsupervised Multitask Learners. Technical report,
    https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
    (Radford+ 2019)


  67. Answers generated by GPT-2 on the dev set of Natural Questions
    66
    A Radford, J Wu, R Child, D Luan, D Amodei, I Sutskever. 2019. Language Models are Unsupervised Multitask Learners. Technical report,
    https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf


  68. GPT-3 (Brown+ 2020)
    67
    The paper explores whether a language model trained on text can solve NLP
    tasks in a zero-shot, one-shot, or few-shot setting, without updating
    parameters for the task
    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, et al. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165.

  69. The architecture of GPT-3
    68
    • The architecture is the same as GPT-2
      • But GPT-3 uses alternating dense and locally banded sparse attention
        patterns in the layers of the transformer, similar to Sparse Transformer
    • The GPT-3 models are extremely large
      • Training GPT-3 (175B) requires 3.14 × 10^23 flops
      • "Even at theoretical 28 TFLOPS for V100 and lowest 3 year reserved cloud
        pricing we could find, this will take 355 GPU-years and cost $4.6M for a
        single training run." [1]
    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, et al. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165.
    [1] OpenAI's GPT-3 Language Model: A Technical Overview. https://lambdalabs.com/blog/demystifying-gpt-3/

  70. Performance of GPT-3
    69
    T. B. Brown, B. Mann, N. Ryder, M. Subbiahet, et. al. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165.


  71. Limitations even with GPT-3 (Brown+ 2020)
    70
     Inferior performance on some tasks to finetuning approach
     Notable weaknesses in text synthesis
     Repetitions/contradictions at the document level, lost coherence over sufficiently
    long passages, and non-sequitur sentences
     Difficulty with “common sense physics”
     Difficult to answer a question like “If I put cheese into the fridge, will it melt?”
     Structural and algorithmic limitations
     No bi-directional architecture (unlike BERT), which is disadvantageous to some
    tasks (e.g., fill-in-the-blank tasks) that require re-reading or carefully considering a
    long passage and then generating a very short answer
     Poor sample efficiency during pre-training
     Pre-training requires much more text than a human reads in their lifetime
     Test-time sample efficiency is closer to that of humans (one/zero-shot) though
     Other limitations that are shared by most deep learning systems
     Interpretability of decisions, biases of the data, gender, etc.
    T. B. Brown, B. Mann, N. Ryder, M. Subbiahet, et. al. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165.


  72. BERT
    71


  73. What is BERT (Devlin+ 2019)?
    72
     A generic model for various NLP tasks
     Question answering, document classification, semantic inference, …
     Became a popular methodology, achieving state-of-the-art performance
     Pretraining and finetuning (a kind of transfer learning)
     Pretraining learns parameters that are generic to the language
     Finetuning learns task-specific parameters on supervision data,
    leveraging the parameters acquired in pretraining
     Based on Transformer encoder
     BERT is not an encoder-decoder model (without a decoder)
     A kind of contextualized word embeddings
     Word embeddings that can represent context
     Bidirectional Encoder Representations from Transformer (BERT)
     → Embeddings from Language Models (ELMo)
    J Devlin, M-W Chang, K Lee, K Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL, pp.
    4171-4186.


  74. Pretraining and finetuning
    73
    (Devlin+ 2019)
     BERT uses a unified architecture across different tasks
     Pretraining is based on bidirectional language modeling
     Starting with a pretrained model, finetuning updates output layers
    (sometimes tailored for target tasks) as well as all internal parameters
    J Devlin, M-W Chang, K Lee, K Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL, pp.
    4171-4186.


  75. Pretraining task 1: Masked language model
    74
    • Idea: Train the model so that it can solve the Cloze task
      My dog is cute → My dog is [ ].  →  predict [ ] = cute
    • Obtain supervision data by masking tokens in large corpora
      • BooksCorpus (800M words) and English Wikipedia (2,500M words)
    • Procedure (see the masking sketch below):
      • Choose 15% of the token positions at random for prediction
      • Choose one of the following operations for each chosen position:
        • [80%]: Replace the target token with [MASK]         → My dog is [MASK]
        • [10%]: Replace the target token with a random token → My dog is apple
        • [10%]: Keep the target token unchanged              → My dog is cute
      • These treatments are used because the [MASK] token does not
        appear in downstream tasks
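Not from the slides: a Python sketch of the 80/10/10 masking procedure for one tokenized sentence; `vocab` is a hypothetical list of candidate replacement tokens.

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15):
    inputs, targets = list(tokens), [None] * len(tokens)   # None = position is not predicted
    for i in range(len(tokens)):
        if random.random() >= mask_rate:                   # only ~15% of positions are selected
            continue
        targets[i] = tokens[i]                             # the model must recover the original token
        r = random.random()
        if r < 0.8:
            inputs[i] = "[MASK]"                           # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = random.choice(vocab)               # 10%: replace with a random token
        # remaining 10%: keep the token unchanged
    return inputs, targets

# Example: mask_tokens(["my", "dog", "is", "cute"], vocab=["apple", "play", "cat"])
```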

  76. Masked language modeling (15% × 80%): [MASK] input
    75
    [Figure: input sequence "[CLS] my dog [MASK] cute [SEP] he likes [MASK] ##ing [SEP]";
    the input embedding at each position is the sum of a token embedding, a segment embedding
    (S_A for sentence 1, S_B for sentence 2), and a position embedding; BERT (the Transformer
    encoder) produces output embeddings, and the masked positions are trained to predict the
    original tokens "is" and "play"]
    The model is trained to predict the masked tokens (we don't predict the other tokens)

  77. Masked language modeling (15% × 10%): random input
    76
    [Figure: input sequence "[CLS] my dog look cute [SEP] he likes cat ##ing [SEP]", where
    "look" and "cat" are random replacement tokens; the model is trained to predict the
    original tokens "is" and "play" at those positions]
    The model is trained to predict the target tokens (we don't predict the other tokens)

  78. Masked language modeling (15% × 10%): original input
    77
    [Figure: input sequence "[CLS] my dog is cute [SEP] he likes play ##ing [SEP]", kept
    unchanged; the model is still trained to predict "is" and "play" at the selected positions]
    The model is trained to predict the target tokens (we don't predict the other tokens)

  79. Pretraining task 2: Next sentence prediction
    78
    • Idea: Train the model so that it can classify whether two given
      sentences are consecutive or not
    • Obtain supervision data by extracting sentences from large corpora
      • BooksCorpus (800M words) and English Wikipedia (2,500M words)
    • Procedure:
      • Choose two sentences that are consecutive 50% of the time
      • Choose two sentences that are not consecutive 50% of the time
    "My dog is cute. He likes playing."      → BERT → Yes
    "My dog is cute. I went to the station." → BERT → No

  80. Next sentence prediction
    79
    [Figure: input sequence "[CLS] my dog is cute [SEP] he likes play ##ing [SEP]" with token,
    segment, and position embeddings summed at each position; the output embedding at [CLS]
    is trained to predict IsNext (or NotNext otherwise)]

  81. Finetuning
    80
    [Figure: BERT (the Transformer encoder) maps the input embeddings ([CLS], x_1, x_2, …)
    to output embeddings (C, T_1, T_2, …)]
    • BERT models are flexible to tasks of a single text or a text pair
      • Self-attention allows bidirectional cross attention between two sentences
    • We can view the output embeddings as feature representations of the input text
      • T_i: contextual word embedding of the token at position i
      • C: embedding for the single sentence or the sentence pair
    • We reuse the model architecture and parameters for downstream tasks
      • Finetune BERT models on target tasks
      • We modify the label definition and output layers for a downstream task

  82. Finetuning task type 1: Sentence pair classification
    81
    [Figure: input sequence "[CLS] Tok1 … TokN [SEP] Tok1 … TokM [SEP]" (sentence 1 and sentence 2)
    fed to BERT (the Transformer encoder); the output embedding at [CLS] predicts the label]
    Task example: Multi-Genre Natural Language Inference (MultiNLI)
    • Sentence 1: "At the other end of Pennsylvania Avenue, people began to line up for a
      White House tour."
    • Sentence 2: "People formed a line at the end of Pennsylvania Avenue."
    • Label: entailment

  83. Finetuning task type 2: Single sentence classification
    82
    [Figure: input sequence "[CLS] Tok1 Tok2 … TokN" fed to BERT (the Transformer encoder);
    the output embedding at [CLS] predicts the label]
    Task example: Stanford Sentiment Treebank (SST)
    • Input sentence: "You'll probably love it."
    • Label: positive

  84. Finetuning task type 3: Question answering
    83
    [Figure: input sequence "[CLS] Tok1 … TokN [SEP] Tok1 … TokM [SEP]" (question and paragraph)
    fed to BERT (the Transformer encoder); the output embeddings of the paragraph tokens predict
    the START and END positions of the answer span]
    Stanford Question Answering Dataset (SQuAD)
    https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/Doctor_Who.html

  85. Finetuning task type 4: Single sentence tagging
    84
    [Figure: input sequence "[CLS] Tok1 Tok2 … TokN" fed to BERT (the Transformer encoder);
    the output embedding of each token predicts a tag such as O, B-PER, I-PER, B-ORG, I-ORG]
    Task example: Named Entity Recognition (NER) (as sequential labeling)
    • Input: "In March 2005, the New York Times acquired About, Inc."
    • Output: O B-TEMP I-TEMP O B-ORG I-ORG I-ORG I-ORG O B-ORG

  86. Performance on downstream tasks
    85
    GLUE benchmark [1]
    SQuAD 1.0 (Q&A)
    CoNLL 2003 (NER)
    J Devlin, M-W Chang, K Lee, K Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL, pp.
    4171-4186.
    [1] https://gluebenchmark.com/leaderboard


  87. Summary
     Attention mechanism addresses the drawback of fixed-size vector
    representation in encoder-decoder models
     A decoder can extract features directly from encoder states
     Parameters in attention mechanism are trained by a target task (without
    explicit supervision data for attention mechanism)
     RNN/LSTM is difficult to parallelize across timesteps
     Encoder-decoder models using CNN and positional encoding
     Transformer removes recurrent computation by using self-attention
    and positional encoding
     GPT and BERT are Transformer models applicable to various NLP tasks
     GPT is a uni-directional language model based on Transformer decoder
     BERT is a bi-directional model based on Transformer encoder
    86
