
Building a Better Transformer

Talk given at the Nagoya-area NLP Seminar, June 2022

Shun Kiyono

June 07, 2022

Transcript

  1. Building a Better Transformer
     RIKEN AIP
     Shun Kiyono

  2. Acknowledgments
     • Sosuke Kobayashi (Preferred Networks / Tohoku University)
     • Sho Takase (Tokyo Institute of Technology)
     • Prof. Jun Suzuki (Tohoku University)
     They helped shape the theme of this talk and polish the slides.
     I would like to take this opportunity to express my deep gratitude.

  3. Recap: the Transformer [Vaswani+2017]
     • Stack many Attention → Feed-forward blocks, plus assorted tricks
     • The backbone of sequence-to-sequence models and pretrained language models
     • Recently also very active in speech and vision
     • ∴ Improving the Transformer itself has a large impact
     • It may even have the potential to change the world
     [Figure: Transformer block — Multi-Head Attention → Add & Norm → Feed Forward → Add & Norm, on top of Input Embedding + positional encoding]

  4. A bit more Transformer recap
     Figure from https://jalammar.github.io/illustrated-transformer/
     Attention: information is exchanged across tokens
     Feed-forward: computation happens within each token

  5. By the way,
     what makes a Transformer "good"?

  6. A good Transformer is a high-performing Transformer
     • What this talk mainly covers
     • Methods that aim for task-agnostic performance gains
     • What this talk mostly does not cover
     • Methods focused on a specific task or dataset
     • Methods for improving computational efficiency
     • e.g., parameter reduction
     • e.g., faster attention
     • e.g., quantization
     (This is just the working definition for this talk)
     [Figure: Transformer block diagram]

  7. Outline of this talk: recent work on each component
     [Figure: Transformer block diagram annotated with the four components below]
     ① Position representations
     ② Attention
     ③ Feed-forward
     ④ Layer Normalization

  8. ① Position representations
     Absolute positions? Relative positions? Or something else…

  9. Absolute Position Embedding (APE)
     • Each position is represented by its own dedicated vector
     • e.g., combinations of sin/cos waves [Vaswani+2017]
     • 😀 Simple, fast, and easy to implement
     • A position is just a lookup into an embedding matrix
     • 😰 Does not generalize to sequences of unseen length
     • i.e., extrapolation is difficult
     [Figure from [Neishi+2019]: BLEU scores on test data split by sentence length for (a) ASPEC English-to-Japanese and (b) WMT2014 English-to-German; the gray-colored area marks unseen lengths with no training data]
     [Figure: Transformer block diagram; "John yelled at Kevin" with absolute positions 0 1 2 3 4]
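     For reference, a minimal PyTorch sketch (not code from the talk) of the sin/cos APE of [Vaswani+2017]; the tensor shapes and max_len are illustrative assumptions.

```python
import math
import torch

def sinusoidal_ape(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) table of sin/cos position embeddings [Vaswani+2017]."""
    position = torch.arange(max_len).unsqueeze(1)                    # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    table = torch.zeros(max_len, d_model)
    table[:, 0::2] = torch.sin(position * div_term)                  # even dimensions
    table[:, 1::2] = torch.cos(position * div_term)                  # odd dimensions
    return table

# Positions are a plain lookup into this table, so nothing is defined
# beyond max_len -- which is why extrapolation to unseen lengths is hard.
x = torch.randn(2, 50, 512)              # (batch, seq_len, d_model)
x = x + sinusoidal_ape(50, 512)          # broadcasts over the batch dimension
```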

  10. Relative Position Embedding (RPE)
     • The distance between each pair of tokens is used inside Attention
     • 😀 Robust to unseen sequence lengths
     • Keyword: shift invariance
     • 😰 Slower than absolute position embeddings
     • 😰 Hard to combine with attention variants
     • Performer, Linformer, etc.
     [Figure: the relative distances (-2 -1 0 1 2) between the tokens of "John yelled at Kevin" are fed into Attention as additional key/value features]
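     As a rough illustration of how relative distances enter Attention, here is a simplified single-head sketch in the spirit of Shaw-style RPE (my own simplification, not code from the talk; the clipping distance and shapes are assumptions).

```python
import torch
import torch.nn.functional as F

def rpe_attention(q, k, v, rel_key_emb, max_dist: int = 16):
    """q, k, v: (n, d); rel_key_emb: (2*max_dist+1, d) embeddings of clipped relative distances."""
    n, d = q.shape
    pos = torch.arange(n)
    # rel[i, j] = clip(j - i), shifted into [0, 2*max_dist] for indexing
    rel = (pos[None, :] - pos[:, None]).clamp(-max_dist, max_dist) + max_dist
    # Content term (q . k) plus a position term (q . embedding of the relative distance)
    logits = q @ k.T + torch.einsum("id,ijd->ij", q, rel_key_emb[rel])
    return F.softmax(logits / d ** 0.5, dim=-1) @ v
```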

  11. Absolute vs. relative positions: summary so far
                           Absolute   Relative   <New method>
     Performance           😰         😀         ?
     Speed                 😀         😰         ?
     Implementation cost   😀         😰         ?
     First, let me introduce our own work

  12. SHAPE: Shifted Absolute Position Embedding
     • Joint work with Sosuke Kobayashi, Jun Suzuki, and Kentaro Inui
     • Accepted at EMNLP 2021
     [Screenshot of the paper's title page (Tohoku University / Preferred Networks) and its Figure 1: overview of position representations — APE, RPE, and SHAPE (APE shifted by a random offset k)]

  13. Bringing shift invariance into absolute position embeddings
     • RPE is robust to unseen lengths → shift invariance seems to be what matters
     [Figure: with relative positions, the representation of "John yelled at Kevin" at positions 0–3 is identical to the one at positions 34–37]
     → Shift invariance: shifting every position in the input space does not change the function's output

  14. Absolute positions + random shift = shift invariance
     • Shifted Absolute Position Embedding (SHAPE)
     • Shift the absolute positions by a random offset k ~ U(0, K)
     • The model can no longer rely on absolute positions
     • Instead, it is forced to learn to exploit relative positions
     [Figure: APE assigns positions 0, 1, 2, … to "John yelled at Kevin"; SHAPE shifts all of them by a random offset k]
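     A minimal sketch of this idea (my own illustration, not the paper's code; the embedding setup, the value of K, and the per-batch offset are assumptions).

```python
import torch
import torch.nn as nn

class SHAPE(nn.Module):
    """Learned absolute position embedding, shifted by a random offset during training."""
    def __init__(self, max_len: int, d_model: int, max_shift: int = 100):
        super().__init__()
        self.pe = nn.Embedding(max_len + max_shift, d_model)   # room for position + offset
        self.max_shift = max_shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (batch, seq_len, d_model)
        positions = torch.arange(x.size(1), device=x.device)
        if self.training:
            # k ~ U(0, K): the model can no longer rely on the absolute values of the positions
            positions = positions + torch.randint(0, self.max_shift + 1, (1,), device=x.device)
        return x + self.pe(positions)
```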

  15. A quick recap so far
     [Screenshot of the paper's Figure 1 again: APE, RPE, and SHAPE (APE shifted by a random offset k)]

  16. Preliminary study: does SHAPE acquire shift invariance?
     • Take a trained model and vary the value of k
     • Compare the cosine similarity of the encoder hidden states for each k
     • APE: each k produces different hidden states
     • SHAPE: the value of k does not affect the hidden states
     [Figure: cosine-similarity heatmaps of the encoder hidden states (embedding layer and layers 1–6) for offsets k ∈ {0, 100, 250, 500}; only SHAPE's representations are invariant to k]

  17. Overview of the experimental setup
     • Models: Transformer + APE, RPE, or SHAPE
     • Task: machine translation
     • Training data
     1. Vanilla
     • WMT 2016 En–De [Ott+2018]
     2. Extrapolate
     • Built by removing sequences of length ≥ 50 from the training data
     3. Interpolate (omitted today)
     • Built by concatenating adjacent sequences
     • Validation data: newstest2010-2013
     • Test data: newstest2014-2016
     • Evaluation: sacreBLEU

  18. Experimental setup: datasets
     We want to evaluate on both seen and unseen lengths
     ① Vanilla: WMT EnDe 2016 [Ott+2018]
     • A standard machine translation benchmark
     • Used to confirm baseline performance
     ② Extrapolate: sequences of length ≥ 50 removed from the training data
     • Checks the model's extrapolation ability
     • Is it robust to unseen lengths?
     [Figure: sentence-length histograms of the English and German sides for each dataset]

  19. Results: RPE and SHAPE perform comparably
     • Extrapolate results
     • RPE and SHAPE outperform APE
     • SHAPE matches RPE's performance and is faster than RPE
     • Vanilla results
     • All models perform comparably
     • SHAPE seems safe to use without fear of degradation
     Table 2: BLEU scores on newstest2010-2016 (Valid/Test) and relative training speed
     Dataset        Model    Valid   Test    Speed
     VANILLA        APE†     23.61   30.46   x1.00
                    RPE†     23.67   30.54   x0.91
                    SHAPE†   23.63   30.49   x1.01
     EXTRAPOLATE    APE      22.18   29.22   x1.00
                    RPE      22.97   29.86   x0.91
                    SHAPE    22.96   29.80   x0.99

  20. Performance by length: extrapolation improves
     [Figure: BLEU improvement over the APE baseline, bucketed by source sequence length (0-9 … 70-); the unseen-length zone is on the right]
     • SHAPE and RPE extrapolate better than APE
     • SHAPE and RPE perform comparably

  21. SHAPE: summary
     • SHAPE: Shifted Absolute Position Embedding
     • Adds shift invariance to APE
     • Same speed as APE
     • Same performance as RPE
     • Easy to implement
     Important!
     [Code snippet on the slide: an example SHAPE implementation]
                           Absolute   Relative   SHAPE
     Performance           😰         😀         😀
     Speed                 😀         😰         😀
     Implementation cost   😀         😰         😀

  22. Other recent position representations
     • ALiBi: Attention with Linear Biases [Press+2021]
     • RoPE: Rotary Position Embedding [Su+2021]
     • CAPE: Continuous Augmented Positional Embeddings [Likhomanenko+2021]
     • TUPE: Transformer with Untied Positional Encoding [Ke+2020]
     • Omitted this time
     Q. Aren't these all the same?
     A. They are different…

  23. ALiBi: Attention with Linear Biases [Press+2021]
     • Adds a distance-dependent penalty to the query-key dot products
     • The farther away a token is, the lower its attention score
     • Can be viewed as injecting an inductive bias that favors nearby tokens
     • Better performance than APE
     • Good extrapolation as well
     • Adopted in the BigScience workshop model
     • [Le Scao+2022]
     Figures from [Press+2021]
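     To make the mechanism concrete, a minimal sketch (not from the slides) of the ALiBi bias for causal self-attention; the slope schedule follows [Press+2021] for head counts that are powers of two.

```python
import torch

def alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
    """(num_heads, seq_len, seq_len) additive bias: a distance-proportional penalty per head."""
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    distance = pos[None, :] - pos[:, None]        # distance[i, j] = j - i (<= 0 for past tokens)
    return slopes[:, None, None] * distance       # farther past tokens get a larger penalty

# Used as: logits = q @ k.transpose(-2, -1) / d**0.5 + alibi_bias(n, heads) + causal_mask
```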

  24. RoPE: Rotary Position Embedding [Su+2021]
     • Goal: make the query-key dot product depend on the relative distance between tokens
     • Rotate each token vector by mθ according to its position m
     • Same relative distance between query and key → same angle between them
     • Outperforms absolute position embeddings
     • Adopted in GPT-J, GPT-NeoX, and PaLM
     Figures from [Su+2021]
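     A minimal sketch of the rotation (my own illustration, not from the slides; the interleaved pairing and base constant follow the usual formulation in [Su+2021]).

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate each pair of dimensions of x (seq_len, d) by a position-dependent angle m * theta."""
    seq_len, d = x.shape
    theta = base ** (-torch.arange(0, d, 2).float() / d)       # (d/2,) frequencies
    angle = torch.arange(seq_len).float()[:, None] * theta     # (seq_len, d/2): m * theta
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x_even * angle.cos() - x_odd * angle.sin()  # 2-D rotation of each pair
    out[:, 1::2] = x_even * angle.sin() + x_odd * angle.cos()
    return out

# Applying rope() to both queries and keys makes q_m . k_n depend only on (m - n).
```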

  25. CAPE: Continuous Augmented Positional Embeddings [Likhomanenko+2021]
     • From the camp that wants to keep improving APE
     • Applying the following operations during training helps:
     • Global Shift (randomly shift the positions of the whole input)
     • Local Shift (randomly shift the positions of part of the input)
     • Global Scaling (randomly rescale the positions)
     • Reports gains on image, speech, and language tasks
     • Looks familiar from somewhere… (the Global Shift is essentially SHAPE's random offset)
     Figure from [Likhomanenko+2021]

  26. ② Attention
     Convolution might be enough / Can we make it fixed-length somehow / Attention is the best!

  27. Recap: Attention
     • The core of the Transformer?
     • A mechanism for nicely mixing information within a variable-length sequence
     • Learning long-range dependencies is (said to be) easier
     • Parallelizable, hence faster than RNNs
     • However, computation gets expensive for very long sequences
     • Requires a sequence-length × sequence-length computation
     [Figure: attention scores are computed between Q and K and used to mix V; when the K/V input is the same as the Q input, it is self-attention]
     Figure from [Jaegle+2022]
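     For reference, a minimal single-head sketch (not from the slides) that makes the sequence-by-sequence cost explicit.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    """q: (n_q, d); k, v: (n_kv, d). Self-attention when q, k, v come from the same sequence."""
    scores = q @ k.T / q.shape[-1] ** 0.5      # (n_q, n_kv): the "sequence x sequence" cost
    return F.softmax(scores, dim=-1) @ v       # each output mixes the values by those weights
```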

  28. The current landscape around Attention
     • Replace Attention with something else
     • CNN [Wu+2019]
     • MLP [Sun+2021]
     • Use less Attention
     • Increase the Feed-Forward instead? [Zhang+2020] [Irie+2020]
     • Make Attention itself lighter
     • Efficient Transformers: A Survey
     • Exploit Attention's strengths further
     • Perceiver [Jaegle+2021]
     • Memorizing Transformers [Wu+2022]

  29. Dynamic Convolution [Wu+2019]
     • Probably the best-known work on replacing Attention with CNNs
     • Dynamically generates the convolution kernel from the input at each time step
     • Outperforms Attention on MT, summarization, and language modeling
     • Fine-grained tricks such as parameter sharing seem to matter
     • The maximum kernel width is 31 → does seeing the whole sequence really matter?
     Figure from [Wu+2019]
     Figure adapted from the paper-introduction slides for "Pay Less Attention with Lightweight and Dynamic Convolutions"

  30. CNNs in the pretrain-finetune setting [Tay+2022]
     • Question: are the benefits of pretrain-finetune exclusive to the Transformer?
     • Compared the Transformer with three CNN variants: CNNs beat the Transformer on 6 of 7 tasks
     • Note, though, that the CNN side gets three variants
     • However, on tasks that use multiple sentences (question answering, entailment recognition, …), CNN < Transformer
     • Exchanging information across sentences is hard for CNNs
     Table from [Tay+2022]

  31. Perceiver [Jaegle+2021]
     • Perceiver: a single Attention-based model usable across language, speech, and vision
     • 😩 Gets rid of "LSTM for 1-D inputs, CNN for 2-D inputs, …"
     • Basically a Transformer encoder
     • Long sequences are painful, so the input is compressed to a fixed length via cross-attention
     • Matches strong baselines (e.g., ResNet, ViT) on image classification, audio+video analysis, point-cloud classification, etc.
     • Note, however, that the architecture can only solve classification problems
     [Figure: a learned latent (parameter) array of length N cross-attends to the input of length M; since M >>> N, this is lightweight]
     Figure from [Jaegle+2021]

  32. Perceiver IO [Jaegle+2022]
     • Perceiver IO: also uses cross-attention on the output side of the Perceiver
     • Can be seen as making the input and output sides symmetric
     • Can output a sequence of length o → more tasks become applicable
     • Works generically across tasks, including masked language modeling
     • Note that future information leaks through the output-side query array
     • So it cannot be used for generation as-is
     [Figure: a new (parameter) query array cross-attends to the latents to produce the output; the input is only looked at once]
     Figure from [Jaegle+2022]

  33. Perceiver AR [Hawthorne+2022]
     • Wants to prevent future information from leaking
     • Adds a masking mechanism to the output-side attention, enabling autoregressive generation
     • The usual causal mask
     • Strong results on image generation, language modeling, etc.
     • Comparable to Transformer-XL and the like
     • Training stays fast even as sequences get long
     • Whether it works for sequence-to-sequence tasks is unclear…
     Figures from [Hawthorne+2022]

  34. Memorizing Transformers [Wu+2022]
     • Assumes a Transformer decoder
     • Designs a dedicated attention layer
     • Past hidden states are treated as an external memory and attended to
     • The memory is pruned with kNN, guided by the attention scores
     • The result is finally mixed with regular attention
     • Can generate mathematical symbols and their definitions correctly
     • Or is it just copying rare tokens?
     • Related: [Sun+2021]
     [Figure: no gradients flow through the external memory → it runs fast]
     Figures from [Wu+2022]

  35. ③ FeedForward
     More and more expressive power…

  36. Recap: the FeedForward layer
     • Fully-connected → activation → fully-connected
     • Convention: expand the dimensionality, then project it back
     • Transformer-base: 512 → 2048 → 512
     • Transformer-big: 1024 → 4096 → 1024
     • What is it actually doing?
     • It can be viewed as attention over an external memory
     • [Sukhbaatar+2019]
     • It behaves as a key-value memory
     • [Geva+2021]
     [Figure: Linear (expand the dimension, roughly 4x) → activation function → Linear (project back)]
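     A minimal sketch of this block in PyTorch (illustrative only; the dimensions follow Transformer-base).

```python
import torch.nn as nn

def feed_forward(d_model: int = 512, d_ff: int = 2048) -> nn.Sequential:
    """Position-wise FFN: expand to ~4x the width, apply the activation, project back."""
    return nn.Sequential(
        nn.Linear(d_model, d_ff),   # 512 -> 2048
        nn.ReLU(),                  # ReLU in the vanilla Transformer
        nn.Linear(d_ff, d_model),   # 2048 -> 512
    )
```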

  37. Recent developments: changing the activation function
     • The vanilla Transformer uses ReLU
     • Language models such as GPT and BERT: GeLU is common
     • GPT was probably the first to use it
     • There are also reports that GLU variants help [Shazeer+2020]
     • Though why they are better is unknown…
     • "We offer no explanation as to why these architectures seem to work"
     Figure from [Hendrycks+2016]
     Figure from [Narang+2021]
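     As an example of the GLU family from [Shazeer+2020], a minimal GEGLU feed-forward sketch (my own illustration; the dimensions are assumptions).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GEGLUFeedForward(nn.Module):
    """FFN with a GLU-style gate: output = W2( GELU(x W) * (x V) )."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.w = nn.Linear(d_model, d_ff)    # gated branch
        self.v = nn.Linear(d_model, d_ff)    # linear branch
        self.w2 = nn.Linear(d_ff, d_model)   # projection back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.gelu(self.w(x)) * self.v(x))
```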

  38. Recent developments: increasing the FFN dimensionality
     • Facebook's report at WMT'19 [Ng+2019]
     • After trying many options, enlarging the FFN worked best
     • "We experiment with increasing network capacity by increasing embed dimension, FFN size, number of heads, and number of layers."
     • "We find that using a larger FFN size (8192) gives a reasonable improvement in performance while maintaining a manageable network size."
     • Reproduced and adopted for our WMT'20 submission [Kiyono+2020]
     Table from [Ng+2019]

  39. Beyond expressive power: Mixture of Experts
     • If you simply make the model bigger…
     • 😨 more parameters for more expressive power → more compute per example
     • With Mixture of Experts, on the other hand:
     • 😀 prepare several homogeneous experts → more parameters and more expressive power
     • 😀 only a subset fires for each input → compute stays low
     • Seems to be in vogue in certain circles (Google and Facebook)
     • Switch Transformer [Fedus+2021], BASE Layers [Lewis+2021], Sparse all-MLP [Yu+2022]
     Figure from [Fedus+2021]
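     A minimal sketch of Switch-Transformer-style top-1 routing (my own simplification, not from the slides; real implementations add load-balancing losses and capacity limits).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    """Each token is routed to a single expert FFN, so compute per token stays roughly
    constant while the total parameter count grows with the number of experts."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (num_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)               # routing probabilities
        prob, idx = gate.max(dim=-1)                           # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():                                     # only the chosen expert runs
                out[mask] = prob[mask, None] * expert(x[mask])
        return out
```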

  40. ④ Layer Normalization
     A chaotic world

  41. Recap: Layer Normalization (LN)
     • Compute the mean and variance of each hidden vector, then normalize
     • Contributes to training stability
     • A Transformer without LN is hard to train
     • One LN is placed in each Attention block and each FeedForward block
     [Figure: Transformer block diagram with the two LN locations (① and ②) marked]
     • Some groups also apply LayerNorm to the embeddings
     • Training becomes stable, but performance degrades [Le Scao+2022]
     Figure from [Zhang+2019] (the RNN case)
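     For reference, the computation itself as a minimal sketch (illustrative, not from the slides).

```python
import torch

def layer_norm(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor, eps: float = 1e-5):
    """Normalize each hidden vector (last dim) to zero mean / unit variance, then rescale."""
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta
```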

  42. Where to put LN: Post-LN vs. Pre-LN
     • The vanilla Transformer uses Post-LN (LN after the residual connection)
     • Most recent Transformers use Pre-LN (LN before the sublayer function)
     • Aside: Pre-LN first appeared in Tensor2Tensor
     • Many papers followed, and Pre-LN then became the default → so this matters a lot
     • But is the position of LN really that important?
     Figure from [Xiong+2020]
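     The two placements side by side, as a minimal sketch (illustrative).

```python
import torch.nn as nn

def post_ln_block(x, sublayer, norm: nn.LayerNorm):
    """Post-LN (vanilla Transformer): LayerNorm sits after the residual addition."""
    return norm(x + sublayer(x))

def pre_ln_block(x, sublayer, norm: nn.LayerNorm):
    """Pre-LN: LayerNorm is applied before the sublayer; the residual path stays untouched."""
    return x + sublayer(norm(x))
```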

  43. The troubled relationship between Post-LN and Pre-LN
     Contributions of [Takase+2022]:
     • Experimentally show the performance gap between Post-LN and Pre-LN
     • Identify why training deep (many-layer) Post-LN models is difficult
     • Propose a method that scales to many layers while keeping high performance
                          Performance   Deep stacking
     Post-LN              ○             ×
     Pre-LN               ×             ○
     B2T (proposed)       ○             ○
     [Figure: architecture diagrams of (a) Post-LN (LN after the residual), (b) Pre-LN (LN before the sublayer), and (c) Post-LN with the B2T connection]
     The dilemma: Pre-LN is easy to train but performs worse; Post-LN performs better
     • Isn't Pre-LN good enough?
     • Pre-LN training is stable even with many layers
     – Recent deep models (e.g., GPT) use Pre-LN
     • However, Pre-LN has lower performance
     – When training succeeds, Post-LN outperforms Pre-LN
     • Performance comparison of 6-layer Transformer encoder-decoders
     [Charts: BLEU on translation (WMT En-De) and ROUGE-1 on headline generation; Post-LN above Pre-LN in both]
     Making Post-LN deep is difficult
     [Chart: per-layer gradient norms on the encoder and decoder sides; vanishing gradients occur on the decoder side]
     ※ Quoted from Sho Takase's NLP2022 presentation slides

  44. We want the best of both Post-LN and Pre-LN
     And there happens to be just the right piece of work…

  45. LN is the cause of the vanishing gradients
     • Investigating the cause of the vanishing gradients
     – Examine the gradient norm at each position inside the decoder's 18th layer
     [Figure: a Post-LN block (Attention → Layer Norm → FFN → Layer Norm) with measurement points (1)–(4) marked]
     The gradient decays sharply going from (4) → (3), (2) → (1)
     → crossing an LN strongly attenuates the gradient
     → LN attenuates gradients
     → LN is the cause of the vanishing gradients
     ※ Quoted from Sho Takase's NLP2022 presentation slides

  46. The B2T Connection [Takase+2022] achieves the best of both worlds
     [Figure: (c) Post-LN with the B2T connection — a residual path from the bottom of each layer directly to its final LN]
     Proposed method: add the residual connection shown in red
     • A path from the bottom to the top of each layer: the Bottom-to-Top (B2T) connection
     1. It bypasses everything in the layer except the layer's final LN
     → suppresses the gradient attenuation caused by LN
     2. The final LN is therefore also applied to the layer input
     ※ Quoted from Sho Takase's NLP2022 presentation slides
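     A sketch of one Post-LN layer with a B2T connection, as I read the description above (my own reconstruction from the slide, not the paper's code).

```python
import torch.nn as nn

def post_ln_b2t_layer(x, attn, ffn, ln_attn: nn.LayerNorm, ln_final: nn.LayerNorm):
    """Post-LN layer plus a bottom-to-top residual: the layer input x bypasses
    everything except the layer's final LN, so that LN is also applied to the input."""
    h = ln_attn(x + attn(x))          # usual Post-LN attention sublayer
    return ln_final(ffn(h) + h + x)   # B2T: add x just before the final LN
```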

  47. Comparison experiments: machine translation
     [Charts: BLEU for the 6-layer and 18-layer encoder-decoder settings]
     • 6-layer encoder-decoder: Post-LN outperforms Pre-LN; the B2T connection matches Post-LN
     • 18-layer encoder-decoder: Post-LN training fails; the B2T connection trains successfully and also outperforms the other methods
     ※ Quoted from Sho Takase's NLP2022 presentation slides

  48. Removing LN to fix vanishing gradients: T-Fixup [Huang+2020]
     • T-Fixup: a rather clever initialization
     • LN can be removed from the entire model
     • Learning-rate warmup is no longer needed
     • Training stays stable even with many layers
     • Why was warmup needed in the first place?
     • The gradient of LN is inversely proportional to the norm of its input
     • Early in training, Adam's moment estimates are unstable → large updates 📈
     • As a result, the input norm grows 📈
     • Gradients vanish and training becomes difficult
     • 👍 One hyperparameter (warmup) disappears
     • 👎 But a new one appears: at which step to start decaying the learning rate
     • So isn't it plus-minus zero?
     [Figure from [Huang+2020]: behavior of the gradients]

  49. Deliberately adding more LN: NormFormer [Shleifer+2021]
     • A study reporting that adding LN improves both convergence speed and performance
     • Compared in experiments with large language models
     • Treats the mismatch between the gradient scales of lower and higher layers in Pre-LN as the problem
     • Applies LN after Attention and after the FeedForward to align the scales
     • Why is the scale mismatch a problem, though?
     • Everyone uses different data, so comparing the methods is hard
     • It is all rather chaotic
     Figure from [Shleifer+2021]

  50. A short break: offbeat Transformer variants

  51. Sandwich Transformer [Press+2019]
     [Figure: Transformer block diagram]
     Idea:
     Is stacking (Attention + FeedForward) × N really optimal?
     → Noting that the sublayers can be swapped, try reordering them

  52. Sandwich Transformer [Press+2019]
     • Experimented with randomly reordering the sublayers
     • There seem to be orderings better than the baseline
     • Attention-heavy near the input and FFN-heavy near the output looks good
     • In other words, a "sandwich": self-attention concentrated at the bottom, feed-forward at the top
     • Improves language modeling; no gains on translation
     [Excerpt from [Press+2019]: a sandwich-n-k transformer has 2n sublayers following the pattern s^k (sf)^{n-k} f^k — the first k sublayers are self-attention (s), the last k are feed-forward (f), with the usual interleaving (sf) in between; k = 0 recovers the original transformer. The improvement comes at no extra cost in parameters, data, memory, or computation.]
     Table 3: perplexity on the WikiText-103 test set (bold marks the baseline; lower is better)
     Baseline (Baevski and Auli, 2019)     18.70
     Transformer XL (Dai et al., 2019)     18.30
     kNN-LM (Khandelwal et al., 2019)      15.79
     Baseline (5 runs)                     18.63 ± 0.26
     Sandwich^16_6                         17.96
     [Figures 5 and 6 from [Press+2019]: validation perplexity as a function of the sandwich coefficient k, and performance on the WikiText-103 development set]
     Figures from [Press+2019]

  53. Primer [So+2021]
     • Searches over Transformer architectures with a genetic algorithm
     • The search space is built from low-level operations such as Add, Square, Conv, …
     • Two modifications turned out to be generally useful:
     1. A width-3 depthwise convolution (DConv) applied to the queries, keys, and values
     2. Squaring the output of ReLU
     • Trying to see the intuition behind DConv:
     • You want to look at the local neighborhood before attention
     • But an ordinary convolution does not work
     • Making it wider does not help either
     • Reportedly improves sample efficiency
     (DConv = a convolution that does not mix information across channels)
     Figure from [So+2021]
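     A sketch of Primer's two modifications (illustrative only; note that in a decoder the convolution would need causal padding, which this simple version does not implement).

```python
import torch
import torch.nn as nn

def squared_relu(x: torch.Tensor) -> torch.Tensor:
    """Primer's activation: square the output of ReLU."""
    return torch.relu(x) ** 2

# Width-3 depthwise convolution over the sequence axis (groups=channels means
# no mixing across channels), applied to the projected queries/keys/values.
d_model = 512
dconv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1, groups=d_model)
x = torch.randn(2, 50, d_model)                     # (batch, seq_len, d_model)
y = dconv(x.transpose(1, 2)).transpose(1, 2)        # Conv1d expects (batch, channels, seq_len)
```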

  54. Has the Transformer actually become "better"?
     A look at various replication studies

  55. Does "goodness" generalize across implementations? No
     • Do Transformer Modifications Transfer Across Implementations and Applications? [Narang+2021]
     • Reimplemented many Transformer modifications in a single shared codebase
     • Tested whether the findings (i.e., the "goodness") reported in the original papers could be reproduced
     • The methods that did reproduce fall into one of the following:
     • Simple changes (e.g., the activation function)
     • Changes that add compute (e.g., more layers)
     • Changes that add parameters (e.g., Mixture of Experts)
     • Note that they did not tune hyperparameters
     • Which can be seen as a slightly unfavorable setting for the modifications
     • Their strong claim: if it needs hyperparameter tuning, it is already no good
     Well, that's discouraging…

  56. Does "goodness" generalize across data? No
     • The Impact of Positional Encodings on Multilingual Compression [Ravishankar+2021]
     • Trained multilingual BERT with different position representations (absolute, relative)
     • Evaluated on multilingual tasks (word alignment, sentence extraction, etc.)
     • Absolute positions outperform relative positions
     • Do relative positions make it harder to learn cross-lingual alignment?
     Figure from [Ravishankar+2021]

  57. Does "goodness" generalize across model scales? No
     • Compared Transformer variants while varying the model scale [Anonymous+2021]
     • Vanilla, Universal Transformer, MLP-Mixer, ALBERT, Performer, etc.
     • Result: the vanilla Transformer came out best
     • 😀 At specific scales, some models do beat the Transformer
     • 😨 Nothing beats the Transformer consistently
     • (The paper appeared on ARR and has not been heard from since…)
     Figure from [Anonymous+2021]

  58. Related: Meta's findings from a 175B-parameter language model
     • Hyperparameters tuned in small-scale experiments do not carry over
     • Incidentally, 13B parameters counts as "small-scale" here
     • They tried a learning rate larger than the GPT-3 setting → a big failure
     • Existing model variants do not carry over either
     • NormFormer: the loss stalls partway through training
     • Stabilizing training is what matters
     • GeLU is numerically unstable in FP16 → they settled for ReLU
     • LN on the embeddings stabilizes training but hurts performance
     • Even so, the loss blows up every few thousand updates → a restart is needed each time
     https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/56_percent_update.md
     The logbook is well worth a read

  59. Aside: lessons from WMT 2020
     • During system development we tried various existing methods on the side
     • Most of them did not help
     • Example: filtering the synthetic data
     • We did not want to train on all of the hundreds of millions of synthetic sentences → wanted to extract only the good ones
     • Filtering had no effect on performance either way
     (WMT: the machine translation competition Kiyono took part in)
     [Table from [Kiyono+2020]: BLEU of the submitted system as each technique is added (BASE, +TAGGED-BT, +fine-tuning, x4 ensemble, +R2L models, +reranking) for En↔De and En↔Ja, compared with the best WMT'19 system]
     Table 7 (from [Kiyono+2020]): effectiveness of corpus filtering on En→De (BLEU on newstest)
     Amount of synthetic data used: r (%)    2014    2018    2019
     100                                     33.0    48.0    42.0
     50                                      32.9    48.4    42.3
     33                                      33.1    47.9    42.2
     25                                      32.9    48.5    42.4

  60. Summary

  61. Building a better Transformer
     • Position representations
     • Relative positions over absolute positions
     • Personally, I see potential in ALiBi and SHAPE
     • Attention
     • Handling long sequences remains a challenge
     • Perceiver may take the crown next
     • Feed-Forward
     • Increase the dimensionality, change the activation function, Mixture-of-Experts
     • Layer Normalization
     • Pre-LN is the common choice
     • The pros and cons are still murky…
     • Replicating the "goodness"
     • Almost no method is good across the board

  62. Building a better Transformer?
     • Transformers that are "good" in specific settings are appearing
     • But building one that is good across the board is very hard
     • In other words, the right Transformer configuration differs with data size, task, scale, and implementation
     • …that is the lesson so far
     • As a result, only a handful of techniques are generally adopted
     • Pre-LN, GeLU (and maybe relative positions??)
     • It remains a high-impact research area nonetheless
     • It will likely continue as a mainstream line of research