
Building a Better Transformer (より良いTransformerをつくる)


A talk given at the Nagoya District NLP Seminar, June 2022

Shun Kiyono

June 07, 2022

Transcript

  1. Acknowledgments
     • Sosuke Kobayashi @ Preferred Networks / Tohoku University
     • Sho Takase @ Tokyo Institute of Technology
     • Prof. Jun Suzuki @ Tohoku University
     They helped shape the theme of this talk and polish the slides. I would like to take this opportunity to thank them deeply.
  2. Recap: Transformer [Vaswani+2017]
     • Stack many Attention → Feedforward blocks, plus assorted tricks
     • The backbone of sequence-to-sequence models and pretrained language models
     • Recently also very active in speech and vision
     • ∴ Improving the Transformer itself has a large impact; it could even change the world
     (Diagram: the Transformer block — Input Embedding, Multi-Head Attention, Add & Norm, Feed Forward, Add & Norm)
  3. A good Transformer is a high-performing Transformer (this is only this talk's working definition)
     • What this talk mainly covers: methods that aim at task-agnostic performance improvements
     • What this talk mostly does not cover:
       • methods focused on a specific task or dataset
       • methods for improving computational efficiency, e.g., parameter reduction, faster attention, quantization
  4. Structure of this talk: work on each part of the Transformer
     ① Position representations  ② Attention  ③ Feed-forward  ④ Layer Normalization
     (Diagram: the Transformer block, with each of these four parts highlighted)
  5. Absolute Position Embedding (APE)
     • Represent each position with its own dedicated vector
     • e.g., combinations of sin/cos waves [Vaswani+2017] (see the sketch below)
     • 😀 Simple, fast, easy to implement; a position is just a lookup into an embedding matrix
     • 😰 Does not generalize to sequences of unseen lengths, i.e., extrapolation is difficult
     (Figure from [Neishi+2019]: BLEU scores on ASPEC English-to-Japanese and WMT2014 English-to-German test data split by sentence length; the gray area marks lengths with no training data)
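
A minimal NumPy sketch of the sinusoidal absolute position embedding described above; the function name and the max_len/d_model arguments are illustrative choices, not from the original deck.

    import numpy as np

    def sinusoidal_ape(max_len: int, d_model: int) -> np.ndarray:
        """Build the sin/cos position table of [Vaswani+2017]: row p is the
        embedding added to the token at position p."""
        positions = np.arange(max_len)[:, None]                 # (max_len, 1)
        dims = np.arange(0, d_model, 2)[None, :]                # (1, d_model/2)
        angles = positions / np.power(10000.0, dims / d_model)  # (max_len, d_model/2)
        table = np.zeros((max_len, d_model))
        table[:, 0::2] = np.sin(angles)
        table[:, 1::2] = np.cos(angles)
        return table

    # A position is "just a lookup": the embedding of position 7 is table[7].
    table = sinusoidal_ape(max_len=128, d_model=512)
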
  6. Relative Position Embedding (RPE)
     • Use the distance between each pair of tokens inside the attention computation (a sketch follows below)
     • 😀 Robust to unseen lengths — keyword: shift invariance
     • 😰 Slower than absolute position embeddings
     • 😰 Hard to combine with attention variants such as Performer, Linformer, etc.
     (Figure: relative distances -2 … +2 between tokens serve as additional key/value features for attention)
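
A minimal sketch of one common flavor of relative positions ([Shaw+2018]-style clipped distances); the function and the clipping distance are illustrative, and real implementations differ in how the resulting embeddings enter the attention computation.

    import numpy as np

    def clipped_relative_positions(seq_len: int, max_distance: int) -> np.ndarray:
        """Relative distance j - i between every query position i and key position j,
        clipped to [-max_distance, max_distance] and shifted to start at 0 so that it
        can index a learned embedding table."""
        pos = np.arange(seq_len)
        rel = pos[None, :] - pos[:, None]                        # (L, L)
        return np.clip(rel, -max_distance, max_distance) + max_distance

    # Each entry indexes an embedding that is added to the keys (or, in other
    # variants, directly to the attention logits) inside every attention head.
    buckets = clipped_relative_positions(seq_len=6, max_distance=2)
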
  7. Absolute vs. relative positions: summary so far
                         Absolute  Relative  <new method>
     Performance         😰        😀        ?
     Speed               😀        😰        ?
     Implementation cost 😀        😰        ?
     First, let me introduce our own work.
  8. SHAPE: Shifted Absolute Position Embedding
     • Joint work with Sosuke Kobayashi, Jun Suzuki, and Kentaro Inui (Tohoku University / Preferred Networks, Inc.)
     • Accepted at EMNLP 2021
     (Figure 1 from the paper: overview of position representations — (a) APE, (b) RPE with relative distances as extra key/value features, (c) SHAPE, where the absolute positions are shifted by a random offset k)
  9. Bringing shift invariance into absolute positions
     • RPE is robust to unseen lengths → shift invariance seems to be the key property
     • Shift invariance: shifting the positions of the input does not change the function's output
     • e.g., with relative positions, the representation computed for "John yelled at Kevin" at positions 0-3 is identical to the one computed at positions 34-37
  10. Absolute positions + random shift = shift invariance
     • Shifted Absolute Position Embedding (SHAPE)
     • Randomly shift the absolute positions by an offset k ~ U(0, K)
     • The model can no longer rely on absolute positions
     • It is instead forced to learn to use relative positions
     (Figure: the APE indices 0 … 4 become k-1 … k+3 after shifting by a random offset k)
  11. A quick summary so far
     (Figure 1 from the paper again: (a) APE, (b) RPE with relative distances as extra key/value features, and (c) SHAPE, i.e., APE shifted by a random offset k)
  12. Preliminary study: does SHAPE acquire shift invariance?
     • Take a trained model and vary the value of k
     • For each k, compare the cosine similarity of the encoder hidden states
     • APE: each k produces different hidden states
     • SHAPE: the value of k has no effect on the hidden states
     (Figure: cosine similarities per layer, embedding through layer 6, for k ∈ {0, 100, 250, 500}; only SHAPE's representations are invariant to k)
  13. Overview of the experimental setup
     • Model: Transformer + APE, RPE, or SHAPE
     • Task: machine translation
     • Training data:
       1. Vanilla — WMT 2016 En-De [Ott+2018]
       2. Extrapolate — built by removing sequences of length 50 or more from the training data
       3. Interpolate — built by concatenating adjacent sequences (omitted this time)
     • Dev data: newstest2010-2013
     • Test data: newstest2014-2016
     • Evaluation: sacreBLEU
  14. Experimental setup: datasets — we want to evaluate on both seen and unseen lengths
     ① Vanilla: WMT 2016 En-De [Ott+2018]
       • A standard machine translation benchmark
       • Used to confirm baseline performance
     ② Extrapolate: sequences of length 50 or more removed from the training data
       • Used to check the model's extrapolation ability
       • Is it robust to unseen lengths?
  15. Results: RPE and SHAPE perform on par
     • Extrapolate results
       • RPE and SHAPE outperform APE
       • SHAPE matches RPE while being faster than RPE
     • Vanilla results
       • All models perform on par
       • SHAPE can be used without fear of degrading performance
     (Table 2 from the paper: BLEU on newstest2010-2016)
       Dataset      Model  Valid  Test   Speed
       Vanilla      APE    23.61  30.46  x1.00
                    RPE    23.67  30.54  x0.91
                    SHAPE  23.63  30.49  x1.01
       Extrapolate  APE    22.18  29.22  x1.00
                    RPE    22.97  29.86  x0.91
                    SHAPE  22.96  29.80  x0.99
  16. Performance by sequence length: extrapolation improves
     (Figure: BLEU improvement over the APE baseline, bucketed by source sequence length from 0-9 up to 70+; the 50+ buckets are the unseen-length zone)
     • SHAPE and RPE achieve higher extrapolation performance than APE
     • SHAPE and RPE perform comparably
  17. Summary of SHAPE
     • SHAPE: shifted absolute position embedding
     • Adds shift invariance to APE
     • Same speed as APE
     • Same performance as RPE
     • Easy to implement — this matters (a sketch of the idea follows below)
                         Absolute  Relative  SHAPE
     Performance         😰        😀        😀
     Speed               😀        😰        😀
     Implementation cost 😀        😰        😀
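
A minimal sketch of the SHAPE idea as described above (shift the absolute position indices by a random offset k ~ U(0, K) during training); the function and parameter names are illustrative and the paper's actual implementation may differ in details.

    import numpy as np

    def shape_position_indices(seq_len: int, max_offset: int, training: bool) -> np.ndarray:
        """SHAPE: draw one offset k ~ U(0, K) per sequence during training and add it
        to every position index, so the model cannot rely on absolute positions."""
        k = np.random.randint(0, max_offset + 1) if training else 0
        return np.arange(seq_len) + k

    # The shifted indices are then used for the ordinary APE lookup; the APE table
    # (or sinusoidal formula) just has to cover seq_len + max_offset positions.
    indices = shape_position_indices(seq_len=32, max_offset=500, training=True)
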
  18. Other recent position representations
     • ALiBi: Attention with Linear Biases [Press+2021]
     • RoPE: Rotary Position Embedding [Su+2021]
     • CAPE: Continuous Augmented Positional Embeddings [Likhomanenko+2021]
     • TUPE: Transformer with Untied Positional Encoding [Ke+2020] (omitted this time)
     Q. Aren't these all the same?  A. They do differ…
  19. ALiBi: Attention with Linear Biases [Press+2021]
     • Adds a distance-dependent penalty to the query-key dot products
     • The farther away a token is, the lower its attention score (see the sketch below)
     • Can be seen as injecting an inductive bias that favors nearby tokens
     • Better performance than APE, and good extrapolation as well
     • Adopted by the BigScience workshop's model [Le Scao+2022]
     (Figures from [Press+2021])
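
A minimal NumPy sketch of the ALiBi bias described above, using the geometrically decreasing head slopes from [Press+2021]; the causal masking of future keys is assumed to happen elsewhere.

    import numpy as np

    def alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
        """Head-specific linear penalties added to the attention logits: the farther
        a (past) key is from the query, the larger the subtracted value."""
        slopes = np.array([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
        pos = np.arange(seq_len)
        distance = np.maximum(pos[:, None] - pos[None, :], 0)   # i - j for past keys
        return -slopes[:, None, None] * distance[None, :, :]    # (heads, L, L)

    # Usage: logits = q @ k.T / sqrt(d) + alibi_bias(L, H)[h], then apply the causal mask.
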
  20. RoPE: Rotary Position Embedding [Su+2021]
     • Goal: make the query-key dot product depend on the relative distance between words
     • Rotate each word vector by mθ according to its position (sketch below)
     • Same query-key relative distance → same angle between them
     • Outperforms absolute position embeddings
     • Adopted by GPT-J, GPT-NeoX, and PaLM
     (Figures from [Su+2021])
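
A minimal NumPy sketch of the rotation described above: consecutive (even, odd) feature pairs of a query or key vector are rotated by an angle proportional to the token position, so the dot product depends only on relative distance. Names are illustrative.

    import numpy as np

    def apply_rope(x: np.ndarray, positions: np.ndarray) -> np.ndarray:
        """Rotate each (even, odd) pair of dimensions of x (shape (L, d)) by
        position * theta_i, as in [Su+2021]."""
        d = x.shape[-1]
        theta = 10000.0 ** (-np.arange(0, d, 2) / d)             # (d/2,)
        angles = positions[:, None] * theta[None, :]             # (L, d/2)
        cos, sin = np.cos(angles), np.sin(angles)
        out = np.empty_like(x)
        out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
        out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
        return out

    # Applied to queries and keys before the dot product:
    #   scores = apply_rope(q, pos) @ apply_rope(k, pos).T
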
  21. CAPE: Continuous Augmented Positional Embeddings [Likhomanenko+2021]
     • The "let's make APE better" camp
     • Adds the following operations during training:
       • Global Shift (randomly shift the positions of the entire input)
       • Local Shift (randomly shift the positions of part of the input)
       • Global Scaling (randomly rescale the positions)
     • Reports gains on image, speech, and language tasks
     • Looks familiar from somewhere…
     (Figure from [Likhomanenko+2021])
  22. Recap: Attention
     • The heart of the Transformer?
     • A mechanism for mixing information nicely across a variable-length sequence
     • Said to make learning long-range dependencies easier
     • Parallelizable, hence faster than RNNs
     • However, very long sequences get expensive: the computation scales as sequence × sequence (see the sketch below)
     (Figure from [Jaegle+2022]: attention over Q, K, V; when the query array equals the input, it is self-attention)
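
A minimal single-head self-attention sketch in NumPy, just to make the L × L cost explicit; the projection matrices are passed in as plain (d, d) arrays.

    import numpy as np

    def self_attention(x, wq, wk, wv):
        """x: (L, d). The score matrix is (L, L) -- every token attends to every
        other token, which is what becomes expensive for very long sequences."""
        q, k, v = x @ wq, x @ wk, x @ wv
        scores = q @ k.T / np.sqrt(k.shape[-1])                  # (L, L)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)           # softmax over keys
        return weights @ v                                       # (L, d)
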
  23. The current landscape around attention
     • Replace attention with something else
       • CNN [Wu+2019]
       • MLP [Sun+2021]
     • Use less attention
       • Add more feed-forward instead? [Zhang+2020] [Irie+2020]
     • Make attention itself lighter
       • Efficient Transformers: A Survey
     • Exploit the strengths of attention further
       • Perceiver [Jaegle+2021]
       • Memorizing Transformers [Wu+2022]
  24. Dynamic Convolution [Wu+2019]
     • Probably the best-known "replace attention with a CNN" approach
     • Dynamically generates the convolution kernel from the input at each time step
     • Outperforms attention on MT, summarization, and language modeling
     • Fine details such as parameter sharing seem to matter
     • Maximum kernel width is 31 → seeing (nearly) the whole sequence is what matters?
     (Figure from [Wu+2019], quoted via the paper-introduction slides for "Pay Less Attention with Lightweight and Dynamic Convolutions")
  25. Perceiver [Jaegle+2021]
     • Perceiver: a model that works generically across language, speech, and vision, built on attention
     • 😩 Solves the "LSTM for 1-D, CNN for 2-D, …" problem
     • Essentially a Transformer encoder
     • Long sequences are painful, so the input is compressed to a fixed length via cross-attention with a learned latent (parameter) array; since M >>> N this is lightweight (sketch below)
     • Rivals strong baselines (e.g., ResNets, ViT) on image classification, audio + video analysis, point-cloud classification, etc.
     • Note, however, that the architecture can only solve classification-style problems
     (Figure from [Jaegle+2021])
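
A minimal sketch of the cross-attention read step described above, assuming a learned latent array of N vectors and an input of M vectors with M >> N; the parameter names are illustrative.

    import numpy as np

    def perceiver_read(inputs, latents, wq, wk, wv):
        """Latents (N, d) query the long input (M, d): the score matrix is (N, M),
        i.e. O(M * N) rather than O(M^2), and the output is a fixed-length summary."""
        q = latents @ wq
        k, v = inputs @ wk, inputs @ wv
        scores = q @ k.T / np.sqrt(k.shape[-1])                  # (N, M)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ v                                       # (N, d)
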
  26. Perceiver IO [Jaegle+2022]
     • Perceiver IO: also uses cross-attention on the output side, via a new query parameter array
     • Can be seen as making the input and output sides symmetric
     • Can emit a sequence of length o → many more applicable tasks
     • Works generically across tasks, including masked language modeling
     • The input is read only once
     • Note that the output-side query array leaks future information, so it cannot be used for generation as-is
     (Figure from [Jaegle+2022])
  27. Perceiver AR [Hawthorne+2022]
     • Wants to prevent the leak of future information
     • Adds a masking mechanism to the output-side attention, enabling autoregressive generation
     • The usual causal mask
     • Strong results on image generation, language modeling, etc., rivaling Transformer-XL and friends
     • Training stays fast even as sequences grow long
     • Whether it works for sequence-to-sequence tasks is unclear…
     (Figures from [Hawthorne+2022])
  28. Memorizing Transformers [Wu+2022]
     • Assumes a Transformer decoder
     • Designs a dedicated attention layer
       • Treats past hidden states as an external memory and computes attention over them
       • Prunes the memory roughly with a kNN lookup based on attention scores
       • Finally mixes the result with ordinary attention
     • No gradients flow through the memory → runs fast
     • Can generate mathematical symbols and their definitions well
       • Or is it just copying rare words?
     • Related: [Sun+2021]
     (Figures from [Wu+2022])
  29. Recap: the FeedForward layer
     • Fully connected + activation + fully connected (see the sketch below)
     • Convention: expand the hidden size (~4x) and then project back
       • Transformer-base: 512 → 2048 → 512
       • Transformer-big: 1024 → 4096 → 1024
     • What is it actually doing?
       • Can be viewed as attention over an external memory [Sukhbaatar+2019]
       • Acts as a key-value memory [Geva+2021]
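
A minimal sketch of the position-wise feed-forward block described above (expand, activate, project back), with ReLU and a 4x expansion in the Transformer-base convention; weight shapes are illustrative.

    import numpy as np

    def feed_forward(x, w1, b1, w2, b2):
        """x: (L, 512); w1: (512, 2048); w2: (2048, 512). Expand the hidden size,
        apply ReLU, project back down."""
        hidden = np.maximum(x @ w1 + b1, 0.0)   # (L, 2048)
        return hidden @ w2 + b2                 # (L, 512)
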
  30. Recent progress: changing the activation function
     • The vanilla Transformer uses ReLU
     • Language models such as GPT and BERT: GeLU is common (GPT was probably the first)
     • GLU variants are also reported to help [Shazeer+2020] (a sketch follows below)
     • Why they are better is unclear, though…
       • "We offer no explanation as to why these architectures seem to work"
     (Figure from [Hendrycks+2016]; Figure from [Narang+2021])
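
A minimal sketch of one GLU-style feed-forward variant from [Shazeer+2020] (GEGLU: a GELU-activated gate multiplied elementwise with a second linear projection); the tanh GELU approximation and the weight names are illustrative.

    import numpy as np

    def gelu(z):
        """Tanh approximation of GELU [Hendrycks+2016]."""
        return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z ** 3)))

    def geglu_ffn(x, w_gate, w_up, w_down):
        """GLU-style FFN: gate one projection with GELU of another, then project back."""
        return (gelu(x @ w_gate) * (x @ w_up)) @ w_down
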
  31. Recent progress: increasing the hidden size
     • Facebook's report at WMT'19 [Ng+2019]
     • After trying various things, enlarging the FFN worked well:
       • "We experiment with increasing network capacity by increasing embed dimension, FFN size, number of heads, and number of layers."
       • "We find that using a larger FFN size (8192) gives a reasonable improvement in performance while maintaining a manageable network size"
     • Reproduced and adopted for our WMT'20 submission [Kiyono+2020]
     (Table from [Ng+2019])
  32. Beyond raw capacity: Mixture of Experts
     • If we simply make the model bigger…
       • 😨 more parameters for more capacity → more compute per token
     • With a Mixture of Experts, in contrast:
       • 😀 prepare several homogeneous experts → more parameters and more capacity
       • 😀 activate only a subset per input → less compute (a routing sketch follows below)
     • Seemingly popular in certain circles (Google and Facebook)
       • Switch Transformer [Fedus+2021], Base Layers [Lewis+2021], Sparse all-MLP [Yu+2022]
     (Figure from [Fedus+2021])
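
A minimal sketch of Switch-Transformer-style top-1 routing [Fedus+2021]: a router picks one expert FFN per token, so parameters scale with the number of experts while per-token compute stays roughly constant. Load balancing and capacity limits are omitted; all names are illustrative.

    import numpy as np

    def switch_ffn(x, router_w, experts):
        """x: (L, d); router_w: (d, num_experts); experts: list of callables mapping
        (n, d) -> (n, d). Each token is processed only by its top-1 expert, scaled
        by the router probability."""
        logits = x @ router_w
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        choice = probs.argmax(axis=-1)                  # top-1 expert per token
        out = np.zeros_like(x)
        for e, expert in enumerate(experts):
            mask = choice == e
            if mask.any():
                out[mask] = probs[mask, e:e + 1] * expert(x[mask])
        return out
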
  33. Recap: Layer Normalization (LN)
     • Compute the mean and variance of each hidden state and normalize (sketch below)
     • Contributes to training stability; a Transformer without LN is hard to train
     • Placed once in the attention block and once in the feed-forward block
     • There is also a school that applies LayerNorm to the embeddings: training becomes stable but performance degrades [Le Scao+2022]
     (Figure from [Zhang+2019]: the RNN case)
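
A minimal sketch of Layer Normalization itself, for reference; gamma and beta are the learned gain and bias.

    import numpy as np

    def layer_norm(x, gamma, beta, eps=1e-5):
        """Normalize each hidden vector (last axis) by its own mean and variance,
        then rescale and shift."""
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return gamma * (x - mean) / np.sqrt(var + eps) + beta
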
  34. Where to put the LN: Post-LN vs. Pre-LN
     • The vanilla Transformer is Post-LN: LN after the residual connection
     • Most recent Transformers are Pre-LN: LN before the sublayer (sketch below)
     • Aside: Pre-LN first appeared in Tensor2Tensor and later became the default there → quite important
     • Many papers followed: does the position of the LN really matter that much?
     (Figure from [Xiong+2020])
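
A minimal sketch of the two placements discussed above, written against an abstract sublayer (attention or FFN) and a layer_norm callable.

    def post_ln(x, sublayer, layer_norm):
        """Post-LN (vanilla Transformer): LayerNorm after the residual addition."""
        return layer_norm(x + sublayer(x))

    def pre_ln(x, sublayer, layer_norm):
        """Pre-LN (most recent Transformers): LayerNorm before the sublayer, so the
        residual path itself is never normalized."""
        return x + sublayer(layer_norm(x))
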
  35. The troubled relationship between Post-LN and Pre-LN
     • Contributions of [Takase+2022]:
       • experimentally show the performance gap between Post-LN and Pre-LN
       • show why deep Post-LN models are hard to train
       • propose a method that scales to many layers while keeping the higher performance
                          Performance  Deep stacking
       Post-LN            ○            ×
       Pre-LN             ×            ○
       B2T (proposed)     ○            ○
     • The dilemma: Pre-LN is easy to train but performs worse; Post-LN performs better
       • Pre-LN is stable when stacked deep — recent deep models (e.g., GPT) use Pre-LN
       • But when training succeeds, Post-LN outperforms Pre-LN (6-layer encoder-decoder comparison: BLEU on WMT En-De translation and ROUGE-1 on headline generation)
     • Deep Post-LN is hard to train: the per-layer gradient norms show vanishing gradients on the decoder side
     (Quoted from Sho Takase's NLP2022 presentation slides)
  36. The cause of the vanishing gradients is LN
     • To find the cause, inspect the gradient norm at each position inside the 18th decoder layer
     • The gradient shrinks sharply from (4) → (3) and from (2) → (1), i.e., every time it crosses an LN
     • → LN attenuates the gradient → LN is the cause of the vanishing gradients
     (Quoted from Sho Takase's NLP2022 presentation slides)
  37. B2T Connection [Takase+2022] gets the best of both worlds
     • Proposed method: add an extra residual connection, a bottom-to-top path through each layer (sketch below)
       1. It bypasses every LN in the layer except the last one → suppresses the gradient attenuation caused by LN
       2. The last LN is therefore also applied to the layer input
     (Quoted from Sho Takase's NLP2022 presentation slides)
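
A minimal sketch of a Post-LN layer with the B2T connection as described on the slide (the layer input bypasses everything except the last LN, which therefore also sees that input); this follows the slide's wording and may differ from the paper's exact formulation.

    def post_ln_b2t_layer(x, attention, ffn, ln_attn, ln_final):
        """One layer: ordinary Post-LN attention sub-block, FFN residual with its
        LN deferred, then the bottom-to-top shortcut re-enters just before the
        final LayerNorm."""
        a = ln_attn(x + attention(x))      # Post-LN attention sub-block
        f = a + ffn(a)                     # FFN residual; no LN yet
        return ln_final(f + x)             # B2T: x skips the intermediate LNs
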
  38. Comparison experiments: machine translation
     • 6-layer encoder-decoder: Post-LN outperforms Pre-LN; the B2T connection matches Post-LN
     • 18-layer encoder-decoder: Post-LN training fails; the B2T connection trains successfully and its performance is higher than the other methods
     (Quoted from Sho Takase's NLP2022 presentation slides)
  39. Handling vanishing gradients by removing LN: T-FixUp [Huang+2020]
     • T-FixUp: a rather clever initialization scheme
       • LN can be removed from the entire model
       • Learning-rate warmup is no longer needed
       • Training remains stable even with many layers
     • Why was warmup needed in the first place?
       • The gradient of LN is inversely proportional to the norm of its input
       • Early in training, Adam's moment estimates are unstable → updates grow 📈 → the input norm grows 📈 → gradients vanish and training becomes difficult
     • 👍 One hyperparameter (warmup) can be removed
     • 👎 A new one appears instead: from which step should the learning rate be decayed? Net zero, perhaps?
     (Figure from [Huang+2020]: gradient behavior)
  40. Deliberately adding more LN: Normformer [Shleifer+2021]
     • A study reporting that adding more LN improves convergence speed and performance
     • Compared in experiments with large-scale language models
     • Views the mismatch between the gradient scales of lower and higher layers in Pre-LN as the problem
     • Adds LN after attention and after the feed-forward block to align the scales
     • Why is the scale mismatch a problem in the first place?
     • Everyone uses different data, which makes comparing the methods difficult — the picture is chaotic
     (Figure from [Shleifer+2021])
  41. Sandwich Transformer [Press+2019]
     • Idea: is stacking (Attention + FeedForward) N times really optimal?
     • → Note that the sublayers are interchangeable, and try reordering them
     (Diagram: the standard interleaved Transformer block)
  42. Sandwich Transformer [Press+2019]
     • Experiment: reorder the sublayers randomly
     • There appear to be orderings better than the baseline: more attention sublayers near the input and more feed-forward sublayers near the output looks good
     • In other words, a sandwich^n_k transformer with 2n sublayers matching s^k (sf)^(n-k) f^k: k self-attention sublayers (s) first, k feed-forward sublayers (f) last, the usual interleaved (sf) pattern in between; k = 0 recovers the original transformer (see the sketch below)
     • Improves language modeling, but shows no gains on translation
       (Table 3 from the paper, WikiText-103 test perplexity, lower is better: Baseline [Baevski and Auli, 2019] 18.70; Transformer XL 18.30; kNN-LM 15.79; Baseline over 5 runs 18.63 ± 0.26; Sandwich16_6 17.96)
     (Figures from [Press+2019])
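
A tiny helper reproducing the sublayer ordering s^k (sf)^(n-k) f^k quoted above; purely illustrative.

    def sandwich_ordering(n: int, k: int) -> str:
        """Sublayer string of a sandwich^n_k transformer: k self-attention sublayers
        (s) first, k feed-forward sublayers (f) last, interleaved (sf) in between;
        k = 0 gives the ordinary interleaved Transformer."""
        return "s" * k + "sf" * (n - k) + "f" * k

    print(sandwich_ordering(16, 6))   # the paper's best WikiText-103 model, sandwich16_6
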
  43. Primer [So+2021]
     • Searches over Transformer architectures with an evolutionary algorithm
     • The search space is built from low-level operations such as Add/Square/Conv…
     • Two modifications turned out to be generally useful (sketched below):
       1. A width-3 depthwise convolution (DConv) over the Query, Key, and Value projections — a convolution that does not mix information across channels
       2. Squaring the output of ReLU
     • Putting ourselves in DConv's shoes: we want to look at the local neighborhood before attention, but an ordinary convolution does not work, and neither does simply widening the kernel
     • Reported to improve sample efficiency
     (Figure from [So+2021])
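
Minimal sketches of Primer's two modifications; the depthwise convolution here uses plain zero padding over the sequence axis, whereas the paper applies a causal variant to the per-head Q/K/V projections.

    import numpy as np

    def squared_relu(x):
        """Primer's activation: square the ReLU output."""
        return np.maximum(x, 0.0) ** 2

    def depthwise_conv_width3(x, kernel):
        """Width-3 depthwise convolution over the sequence axis of x (shape (L, d));
        kernel has shape (3, d), so no information is mixed across channels."""
        padded = np.pad(x, ((1, 1), (0, 0)))
        return padded[:-2] * kernel[0] + padded[1:-1] * kernel[1] + padded[2:] * kernel[2]
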
  44. Does "goodness" generalize across implementations? — No
     • Do Transformer Modifications Transfer Across Implementations and Applications? [Narang+2021]
     • Re-implements a wide range of Transformer modifications in a common codebase
     • Tests whether the findings (i.e., the "goodness") reported in the original papers can be reproduced
     • The methods that do reproduce fall into one of the following:
       • simple changes (e.g., activation functions)
       • changes that increase compute (e.g., more layers)
       • changes that increase the parameter count (e.g., Mixture of Experts)
     • Note that no hyperparameter tuning was done — a slightly unfavorable setting for the modifications
     • The strong claim: if a method needs hyperparameter tuning, it has already failed
     • So none of it holds up…
  45. Does "goodness" generalize across data? — No
     • The Impact of Positional Encodings on Multilingual Compression [Ravishankar+2021]
     • Trains multilingual BERT models with different position representations (absolute, relative)
     • Evaluates on multilingual tasks (word alignment, sentence retrieval, etc.)
     • Absolute positions outperform relative positions
     • Do relative positions make it harder to learn alignments across languages?
     (Figure from [Ravishankar+2021])
  46. Does "goodness" generalize across model scale? — No
     • Compares Transformer variants as the model scale is varied
       • Vanilla, Universal Transformer, MLP-Mixer, ALBERT, Performer, etc…
     • Result: the vanilla Transformer was the best overall
       • 😀 At specific scales, some models do beat the Transformer
       • 😨 But nothing beats the Transformer consistently
     • [Anonymous+2021] — nothing has been heard since it appeared on ARR, though…
     (Figure from [Anonymous+2021])
  47. Related: Meta's findings from a 175B-parameter language model
     • Hyperparameters obtained from small-scale experiments do not carry over
       • Incidentally, 13B parameters counts as "small-scale" here
       • They tried a learning rate larger than the GPT-3 setting and it failed badly
     • Existing models do not carry over either
       • Normformer: the loss plateaus partway through training
     • Stabilizing training is crucial
       • With FP16, GeLU is numerically unstable → they compromised on ReLU
       • LayerNorm on the embeddings stabilizes training but hurts performance
       • Even so, the loss diverges every few thousand updates → a restart is needed each time
     • The logbook is worth a read:
       https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/56_percent_update.md
  48. Aside: lessons from WMT 2020 (a machine translation competition Kiyono took part in)
     • While building the system, we tried various existing techniques
     • Most of them did not help
     • Example: filtering the synthetic data
       • We did not want to train on all of the hundreds of millions of synthetic sentences → we wanted to extract only the good data
       • Performance was unaffected whether we filtered or not
     (Table from [Kiyono+2020]: keeping 100%, 50%, 33%, or 25% of the synthetic data changes En→De BLEU on newstest2014/2018/2019 by at most ~0.6)
  49. Building a better Transformer
     • Position representations
       • Relative positions over absolute positions
       • Personally, I see potential in ALiBi and SHAPE
     • Attention
       • Processing long sequences remains a challenge
       • Perceiver might take the crown next
     • Feed-Forward
       • Increase the hidden size, change the activation function, Mixture-of-Experts
     • Layer Normalization
       • Pre-LN is the norm; the pros and cons are still murky…
     • Reproducing "goodness"
       • Almost no method is good across the board
  50. Building a better Transformer?
     • Transformers that are "good" in specific settings do keep appearing
     • But building one that is good across the board is extremely hard
     • In other words, the best Transformer configuration depends on the data size, the task, the scale, and the implementation
     • …that is the takeaway so far
     • As a result, very few modifications have been adopted universally: Pre-LN, GeLU (and relative positions??)
     • It remains a high-impact research area and will likely continue as a mainstream line of work