より良いTransformerをつくる

ΑΓྑ͍TransformerΛͭ͘Δ ཧݚAIP ਗ਼໺ ॢ

ँࣙ • খྛᰜհࢯˏPreferred Networks / ౦๺େ • ߴ੉ᠳࢯˏ౦޻େ • ླ໦५ઌੜˏ౦๺େ
ʹ͸ຊߨԋͷςʔϚઃఆ΍εϥΠυͷਪᏏʹ͝ڠྗ͍͖ͨͩ·ͨ͠ɻ ͜ͷ৔ΛआΓͯਂ͘ײँ͍ͨ͠·͢ɻ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 3

͓͞Β͍ɿTransformer [Vaswani+2017] • Attention→FeedforwardΛͨ͘͞ΜੵΉ+৭ʑ޻෉ • ܥྻม׵Ϟσϧ΍ࣄલ܇࿅ࡁΈݴޠϞσϧͷࠜװ • ࠷ۙ͸Ի੠ɾը૾෼໺Ͱ΋׆ൃ • ∴
Transformerࣗମͷվྑ͸ΠϯύΫτେ • Change the worldͰ͖ΔՄೳੑ͕͋Δ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 4 Multi-Head Attention Add & Norm Feed Forward Add & Norm Input Embedding +

TransformerΛ΋͏গ͓͠͞Β͍ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 5 ਤ͸ https://jalammar.github.io/illustrated-transformer/ ΑΓ ୯ޠؒͰ৘ใΛ΍ΓऔΓ ୯ޠ಺Ͱԋࢉ

໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 6 ͱ͜ΖͰɺ ྑ͍5SBOTGPSNFSͬͯͳʹʁ

ྑ͍Transformer͸ɺੑೳͷߴ͍Transformer • ຊൃදͰओʹ࿩͢͜ͱ • λεΫ൚༻తʹੑೳ޲্ΛͶΒ͏ํ๏࿦ • ຊൃදͰओʹ͸࿩͞ͳ͍͜ͱ • ಛఆͷλεΫɾσʔληοτʹண໨ͨ͠ํ๏࿦ •
ܭࢉޮ཰Λ্͛ΔͨΊͷํ๏࿦ • Ex) ύϥϝʔλ࡟ݮ • Ex) Ξςϯγϣϯͷߴ଎Խ • Ex) ྔࢠԽ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 7 ͋͘·Ͱຊൃදͷఆٛ Multi-Head Attention Add & Norm Feed Forward Add & Norm Input Embedding +

ຊൃදͷߏ੒ɿ֤෦ҐͷऔΓ૊ΈΛ঺հ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 8 Multi-Head Attention Add & Norm Feed
Forward Add & Norm Input Embedding + ᶃ Ґஔදݱ ᶄ Attention ᶅ Feed-forward ᶆ Layer Normalization

ᶃ Ґஔදݱ ઈରҐஔʁ૬ରҐஔʁͦΕͱ΋… ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 9

ઈରҐஔදݱ; Absolute Position Embedding (APE) • ֤ҐஔΛઐ༻ͷϕΫτϧΛ࢖ͬͯදݱ͢Δ • e.g., sin/cos೾ͷ૊Έ߹Θͤ
[Vaswani+2017] • 😀 γϯϓϧɺ଎͍ɺ࣮૷͕؆୯ • Ґஔ͸୯ͳΔߦྻ͔ΒͷࣙॻҾ͖Ͱදݱ • 😰 ະ஌ͷ௕͞ͷܥྻʹରͯ͠൚Խ͠ͳ͍ • i.e., ֎ૠ͕ࠔ೉ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 10 (a) ASPEC English-to-Japanese (b) WMT2014 English-to-German Figure 3: BLEU scores on test data split by the sentence length (no training data in the gray-colored area). Multi-Head Attention Add & Norm Feed Forward Add & Norm Input Embedding + John yelled at Kevin Position 0 1 2 3 4 Figure from [Neishi+2019] ະ஌ͷ௕͞

૬ରҐஔදݱ; Relative Position Embedding (RPE) • ֤τʔΫϯؒͷڑ཭ΛAttention಺෦Ͱར༻ • 😀 ະ஌ͷ௕͞ʹରͯ͠ؤ݈
• Ωʔϫʔυɿγϑτෆมੑ • 😰 ઈରҐஔΑΓ΋ܭࢉ͕࣌ؒ௕͍ • 😰 Attentionͷѥछͱͷ૊Έ߹Θ͕ͤࠔ೉ • Performer, Linformer, etc… ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 11 Multi-Head Attention Add & Norm Feed Forward Add & Norm Input + -2 -1 0 1 2 Key Value Relative Distance Additional Features for Attention John yelled at Kevin ! !"# {%&',)*+,&} ! ."# {%&',)*+,&} Position 0 1 2 3

ઈରҐஔ vs ૬ରҐஔɿ͜͜·Ͱͷ·ͱΊ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 12 ઈରҐஔ ૬ରҐஔ <৽نͷख๏> ੑೳ
😰 😀 ? ଎౓ 😀 😰 ? ࣮૷ίετ 😀 😰 ? ·ͣ͸զʑͷ औΓ૊ΈΛ঺հ

SHAPE: Shifted Absolute Position Embedding • w/ Sosuke Kobayashi, Jun
Suzuki, and Kentaro Inui • EMNLP2021ʹ࠾୒ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 13 Sosuke Kobayashi~,} Jun Suzuki~, Kentaro Inui~, ~ Tohoku University } Preferred Networks, Inc. [email protected], [email protected], jun.suzuki, inui}@tohoku.ac.jp t rucial for building ns in Transformers. ations suffer from test data with un- utational cost. We osition embedding ues. The basic idea invariance, which uccessful position y shifting absolute Multi-Head Attention Add & Norm Feed Forward Add & Norm Input Embedding + John yelled at Kevin Position 0 1 2 3 4 k-1 k k+1 k+2 k+3 Absolute Position Embedding (APE) (a) Shifted APE (SHAPE) (c) Transformer Relative Position Embedding (RPE) (b) Relative Distance John yelled at Kevin ! !"# {%&',)*+,&} ! ."# {%&',)*+,&} -2 -1 0 1 2 Key Value Shifted by random offset k Figure 1: Overview of position representations. (a) APE

ઈରҐஔʹγϑτෆมੑΛऔΓೖΕΔ • ૬ରҐஔ͸ະ஌ͷ௕͞ʹؤ݈→γϑτෆมੑ͕ॏཁͦ͏ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 14 John yelled at Kevin
! !"# {%&',)*+,&} ! ."# {%&',)*+,&} Position 0 1 2 3 Position 34 35 36 37 John yelled at Kevin ! !"#!$ {&'(,*+,-'} ! !/#!$ {&'(,*+,-'} e.g. ૬ରҐஔͷ৔߹: ! !"# {%&',)*+,&} ! !"# {%&',)*+,&} ! !"#!$ {&'(,*+,-'} ! !"#!$ {&'(,*+,-'} ͸ à ্ۭؒͷҐஔγϑτ͕ؔ਺ͷग़ྗʹӨڹ͠ͳ͍…ͱ͍͏ੑ࣭ ͱ౳͍͠

ઈରҐஔ+ϥϯμϜγϑτ=γϑτෆมੑ • Shifted Absolute Position Embedding (SHAPE) • ઈରҐஔΛΦϑηοτ 𝒌~𝓤(𝟎,
𝑲) ͰϥϯμϜʹγϑτ • Ϟσϧ͸ઈରҐஔΛར༻ෆՄೳ • ୅ΘΓʹ૬ରҐஔΛར༻͢ΔֶशΛڧ੍ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 15 Multi-Head Attention Add & Norm Feed Forward Add & Norm Input Embedding + John yelled at Kevin Position 0 1 2 3 4 k-1 k k+1 k+2 k+3 APE Shifted by random offset k SHAPE

Ұ୴ɺ͜͜·Ͱͷ·ͱΊ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 16 ing ers. om un- We ing
dea ich ion ute Multi-Head Attention Add & Norm Feed Forward Add & Norm Input Embedding + John yelled at Kevin Position 0 1 2 3 4 k-1 k k+1 k+2 k+3 Absolute Position Embedding (APE) (a) Shifted APE (SHAPE) (c) Transformer Relative Position Embedding (RPE) (b) Relative Distance John yelled at Kevin ! !"# {%&',)*+,&} ! ."# {%&',)*+,&} -2 -1 0 1 2 Key Value Shifted by random offset k

༧උௐࠪɿSHAPE͸γϑτෆมੑΛ֫ಘ͢Δ͔ʁ • ֶशࡁΈϞσϧʹ͍ͭͯɺ𝑘 ͷ஋ΛมԽͤͯ͞ΈΔ • ֤ 𝑘 ʹ͍ͭͯΤϯίʔμͷίαΠϯྨࣅ౓Λൺֱ • APE:
֤ 𝑘 ͸ҟͳΔӅΕ૚Λੜ੒ • SHAPE: ֤ 𝑘 ͷ஋͸ӅΕ૚ͷ஋ʹӨڹ͠ͳ͍ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 17 0 100 250 500 APE EmbeddingLayer 1 Layer 2 Layer 3 Layer 4 Layer 5 Layer 6 0 100 250 500 0 100 250 500 SHAPE k 0 100 250 500 0 100 250 500 0 100 250 500 0 100 250 500 0 100 250 500 0 100 250 500 0.8 0.9 1.0

࣮ݧઃఆͷ֓ཁ • Ϟσϧ: Transformer + APE, RPE, or SHAPE •
λεΫ:ػց຋༁ • ܇࿅σʔλ 1. Vanilla • WMT 2016 EnDe [Ott+2018] 2. Extrapolate • ௕͞50Ҏ্ͷܥྻΛ܇࿅σʔλ͔Β࡟আͯ͠࡞੒ 3. Interpolate • ྡ઀͢ΔܥྻΛ݁߹ͯ͠࡞੒ • ։ൃσʔλ: newstest2010-2013 • ςετσʔλ: newstest2014-2016 • ධՁ: sacreBLEU ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 18 ࠓճ͸লུ ઈରҐஔ ૬ରҐஔ

࣮ݧઃఆɿσʔληοτ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 19 ط஌&ະ஌ͷ௕͞ ͰධՁ͍ͨ͠ ᶃ Vanilla: WMT EnDe
2016 [Ott+2018] English German • ػց຋༁ͷඪ४తͳϕϯνϚʔΫ • ϕʔεϥΠϯͷੑೳͷ֬ೝ༻ ᶄ Extrapolate: ௕͞50Ҏ্ͷܥྻΛ܇࿅σʔλ͔Β࡟আ English German • Ϟσϧͷ֎ૠੑೳͷ֬ೝ • ະ஌ͷ௕͞ʹରͯ͠ؤ݈͔ʁ

࣮ݧ݁ՌɿRPEͱSHAPE͸ಉ౳ͷੑೳ • Extrapolateͷ݁Ռ • RPEͱSHAPE͕APEΛ্ճΔੑೳ • SHAPE͸RPEͱಉ౳ͷੑೳͰɺRPEΑΓߴ଎ • Vanillaͷ݁Ռ •
શͯͷϞσϧ͕ಉ౳ͷੑೳ • SHAPE͸ੑೳѱԽͷڪΕͳ͘࢖ͬͯྑͦ͞͏ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 20 Figure 2: Cosine similarities of the encoder hidden states with different offsets k 2 {0, 100, 250, 500}. Only the representation of SHAPE is invariant with k. Dataset Model Valid Test Speed VANILLA APE† 23.61 30.46 x1.00 RPE† 23.67 30.54 x0.91 SHAPE† 23.63 30.49 x1.01 EXTRAPOLATE APE 22.18 29.22 x1.00 RPE 22.97 29.86 x0.91 SHAPE 22.96 29.80 x0.99 Table 2: BLEU scores on newstest2010-2016. Valid is Figure 3: BLEU score improveme dation and test sets with respect to length. The gray color means no t

௕͞͝ͱͷੑೳΛௐࠪɿ֎ૠੑೳ͕޲্ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 21 0-9 10-19 20-29 30-39 40-49 50-59
60-69 70- Source Sequence Length (tokens) 0 2 4 6 8 BLEU improvement APE RPE SHAPE ϕʔεϥΠϯ "1& ͔Βͷੑೳ޲্෯ ະ஌ͷ௕͞ κʔϯ • SHAPEͱRPE͸APEΑΓ΋ߴ͍֎ૠੑೳΛൃش • SHAPEͱRPE͸ಉ౳ͷੑೳ

SHAPEͷ·ͱΊ • SHAPE : shifted absolute position embedding • APEʹγϑτෆมੑΛ෇༩
• APEͱಉ౳ͷ଎౓ • RPEͱಉ౳ͷੑೳ • ࣮૷͸؆୯ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 22 ॏཁ SHAPEͷ࣮૷ྫ ઈରҐஔ ૬ରҐஔ SHAPE ੑೳ 😰 😀 😀 ଎౓ 😀 😰 😀 ࣮૷ίετ 😀 😰 😀

ͦͷ΄͔ɺ࠷ۙͷҐஔදݱͨͪ • ALiBi: Attention with Linear Biases [Press+2021] • RoPE:
Rotary Position Embedding [Su+2021] • CAPE: Continuous Augmented Positional Embeddings [Likhomanenko+2021] • TUPE: Transformer with Untied Positional Encoding [Ke+2020] • ࠓճ͸লུ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 23 Q. શ෦ಉ͡͡Όͳ͍Ͱ͔͢ʁ A. ͕͍ͪ·͢…

ALiBi: Attention with Linear Biases [Press+2021] • Query-Keyͷ಺ੵʹ͍ͭͯڑ཭ʹ Ԡͨ͡ϖφϧςΟΛ෇༩ •
ԕ͍τʔΫϯ΄ͲAttentionͷ είΞ͕Լ͕Δ • ۙ๣Λ༏ઌ͢ΔΑ͏ͳInductive biasΛೖΕ͍ͯΔͱΈͳͤΔ • APEΑΓੑೳྑ͍ • ֎ૠੑೳ΋ྑ͍ • BigScienceWSͷϞσϧͰ΋࠾༻ • [Le Scao+2022] ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 24 Figures from [Press+2021]

RoPE: Rotary Position Embedding [Su+2021] • ໨తɿQuery-Keyͷ಺ੵΛ ୯ޠؒͷ૬ରڑ཭ʹґଘͤ͞ ͍ͨ •
ҐஔʹԠͯ͡୯ޠϕΫτϧΛ 𝑚𝜃͚ͩճస • Query-Keyͷ૬ରڑ཭͕ಉ͡ →ͳ͕֯͢ಉ͡ • ઈରҐஔΛ্ճΔੑೳ • GPT-J΍GPT-NeoX΍PaLMͰ ࠾༻ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 25 Figures from [Su+2021]

CAPE: Continuous Augmented Position Embeddings • APEΛྑ͍ͨ͘͠೿ൊ • ܇࿅தʹҎԼͷૢ࡞ΛऔΓೖΕΔͱྑ͍ •
(MPCBM4IJGU ೖྗશମͷҐஔΛϥϯμϜʹͣΒ͢ • Local Shift (ೖྗҰ෦ͷҐஔΛϥϯμϜʹͣΒ͢) • Global Scaling (ॖईΛϥϯμϜʹௐ੔) • ը૾ɺԻ੠ɺݴޠλεΫͰੑೳ޲্Λใࠂ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 26 Ͳ͔͜Ͱݟ͕֮͑… [Likhomanenko+2021] Figure from [Likhomanenko+2021]

ᶄ Attention ৞ΈࠐΈͰ΋ྑ͍ͷͰ͸ɾͳΜͱ͔ݻఆ௕ʹ͍ͨ͠ɾAttention࠷ߴʂ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 28

͓͞Β͍ɿAttention • Transformerͷ֩৺ʁ • Մม௕ܥྻͷதͰ͍͍ײ͡ʹ৘ใΛࠞͥΔͨΊͷ࢓૊Έ • ௕ڑ཭ґଘͷֶश͕͠΍͍͢ʢͱ͞Ε͍ͯΔʣ • ฒྻԽͰ͖ΔͷͰRNNΑΓ΋ߴ଎ •
ͨͩ͠௒௕ڑ཭ʹͳΔͱܭࢉ͕େม • ܥྻYܥྻ ͷܭࢉ͕ඞཁ Process Output array Q x L V Attention K V Attention Q K V V K Q Attention scores O E ͜Ε͕Qͱಉ͡ͱ͖ Self-Attention Figure from [Jaegle+2022]

AttentionΛऔΓר͘࠷ۙͷঢ়گ • AttentionΛผͷԿ͔Ͱஔ͖׵͍͑ͨ • CNN [Wu+2019] • MLP [Sun+2021] •
AttentionΛݮΒ͍ͨ͠ • ୅ΘΓʹFeed-ForwardΛ૿΍͢ʁ [Zhang+2020] [Irie+2020] • AttentionࣗମΛܰྔԽ͍ͨ͠ • Efficient Transformers: A Survey • Attentionͷಛ௃Λ΋ͬͱ׆͔͍ͨ͠ • Perceiver [Jaegle+2021] • Memory Tranformer [Wu+2022] ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 30

Dynamic Convolution [Wu+2019] • AttentionΛCNNͰஔ͖׵͑Α͏ܥͰ͸࠷΋༗໊ʁ • ֤࣌ࠁʹ͍ͭͯೖྗ͔ΒCNNͷΧʔωϧΛ ಈతʹੜ੒ • MTɾཁ໿ɾݴޠϞσϧͰAttentionΑΓߴੑೳ
• ύϥϝʔλڞ༗ͳͲͷࡉ͔͍ςΫχοΫ͕ॏཁͦ͏ • ࠷େΧʔωϧ෯͸31→ܥྻશମΛݟΔͷ͕େࣄʁ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 31 Figure from [Wu+2019] ਤ͸ ࿦จ঺հ: Pay Less Attention with Lightweight and Dynamic Convolutions ΑΓҾ༻

Pretrain-finetuneઃఆʹ͓͚ΔCNN [Tay+2022] • ٙ໰ɿPretrain-finetuneͷԸܙ͸Transformer͚ͩͷ΋ͷ͔ʁ • TransformerͱCNN3छྨΛൺֱɿ6/7λεΫͰCNN͕TransformerΛ্ճΔੑೳ • ͨͩ͠ɺCNNଆ͸3छྨ͋Δͷʹ஫ҙ • ͨͩ͠ɺ࣭໰Ԡ౴΍ؚҙؔ܎ೝࣝͳͲɺෳ਺จΛ༻͍ΔλεΫͰ͸
CNN < Transformer • CNNͰ͸จؒͷ৘ใͷ΍ΓऔΓ͕ࠔ೉ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 32 Table from [Tay+2022]

Perceiver [Jaegle+2021] • Perceiver: ݴޠɾԻ੠ɾը૾Ͱ൚༻తʹ࢖͑ΔϞσϧΛAttentionͰ࣮ݱ • 😩 ʮ1࣍ݩͳΒLSTMɺ2࣍ݩͳΒCNN…ʯΛղܾ • جຊతʹ͸Transformer
Encoder • ௕ڑ཭ܥྻ͕ͭΒ͍ͷͰɺೖྗΛCross-Attentionܦ༝Ͱݻఆ௕ʹ • ը૾෼ྨɺԻ੠+ಈըղੳɺ఺܈෼ྨͳͲͰڧ͍ϕʔεϥΠϯʹඖఢ • ͨͩ͠ɺ෼ྨ໰୊͔͠ղ͚ͳ͍ߏ଄ͳͷʹ஫ҙ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 33 M>>>N ͳͷͰܰྔ ResNetͱ͔ ViTͱ͔ ೖྗ ύϥϝʔλ ߦྻ Figure from [Jaegle+2021]

Perceiver IO [Jaegle+2022] • Perceiver IOɿPerceiverͷग़ྗଆʹ΋Cross AttentionΛ࠾༻ • ೖग़ྗʹରশੑΛ͍࣋ͨͤͯΔͱ΋ݟ͑Δ •
௕͞𝒐ͷܥྻΛग़ྗՄೳàద༻ՄೳλεΫ͕૿Ճ • ϚεΫݴޠϞσϧؚΉ֤छλεΫͰ൚༻తʹಈ࡞ • ग़ྗଆͷQuery ArrayͰ͸ະདྷͷ৘ใ͕ϦʔΫ͍ͯ͠Δͷʹ஫ҙ • ͳͷͰɺͦͷ··Ͱ͸ੜ੒ʹ࢖͑ͳ͍ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 34 ৽ن ύϥϝʔλߦྻ ೖྗΛݟΔͷ͸ Ұճ͚ͩ Figure from [Jaegle+2022]

Perceiver AR [Hawthorne+2022] • ະདྷͷ৘ใ͕ϦʔΫ͢ΔͷΛ๷͍͗ͨ • ग़ྗଆͷAttentionʹ͸ϚεΫػߏΛ௥Ճ͠ɺ autoregressiveͳੜ੒ΛՄೳʹ • ͍ͭ΋ͷCausal
Mask • ը૾ੜ੒ɺݴޠϞσϧͳͲͰߴੑೳ • Transformer-XLͳͲʹඖఢ • ܥྻ͕௕͘ͳͬͯ΋ֶश͸ߴ଎ • ܥྻม׵λεΫ౳Ͱಈ͘ͷ͔͸Α͘෼͔Βͳ͍… ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 35 Figures from [Hawthorne+2022]

Memorizing Transformers [Wu+2022] • Transformer DecoderΛ૝ఆ • ઐ༻ͷAttention૚Λઃܭ • աڈͷӅΕ૚Λ֎෦ϝϞϦͱ
Έͳͯ͠AttentionΛܭࢉ • AttentionείΞΛݩʹkNNͰ ద౰ʹࢬמΓ • ࠷ޙʹී௨ͷAttentionͱࠞͥΔ • ਺ֶه߸΍ͦͷఆٛΛ ͏·͘ੜ੒Ͱ͖Δ • ୯ͳΔ௿ස౓ޠͷίϐʔʁ • ؔ࿈ɿ[Sun+2021] ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 36 ޯ഑͸ྲྀΕͳ͍ →ߴ଎ʹಈ࡞ Figures from [Wu+2022]

ᶅ FeedForward ΋ͬͱ΋ͬͱදݱྗΛ… ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 37

͓͞Β͍ɿFeedForward૚ • શ݁߹+Activation+શ݁߹ • ׳शɿ࣍ݩ਺Λ্͛ͯݩʹ໭͢ • Transformer-base: 512à2048à5412 • Transformer-big:
1024à4096à1024 • ࣮ࡍʹ͸ԿΛ΍͍ͬͯΔʁ • ֎෦ϝϞϦ΁ͷAttentionͱΈͳͤΔ • [Sukhbaatar+2019] • Key-ValueϝϞϦͱͯ͠ಈ࡞ • [Geva+2021] ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 38 Multi-Head Attention Add & Norm Feed Forward Add & Norm Input Embedding + Linear Activation Function Linear X ࣍ݩ਺Λ্͛Δ ʙ4ഒఔ౓ ݩʹ໭͢

࠷ۙͷਐలɿ׆ੑԽؔ਺Λม͑ͯΈΔ • Vanilla TransformerͰ͸ReLUΛ࠾༻ • GPT΍BERTͳͲͷݴޠϞσϧɿGeLUΛଟ͘ݟ͔͚Δ • ॳग़͸GPTͷ͸ͣ • GLUܥ͕ޮ͘ͱ͍͏ใࠂ΋
[Shazeer+2020] • Կނྑ͍ͷ͔͸෼͔Βͳ͍͕… • “We offer no explanation as to why these architectures seem to work” ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 39 Figure from [Hendrycks+2016] Figure from [Narang+2021]

࠷ۙͷਐలɿ࣍ݩ਺Λ্͛Δ • WMT’19 ʹ͓͚ΔFacebookͷใࠂ[Ng+2019] • ৭ʑࢼͨ݁͠ՌɺFFNΛ૿΍͢ͱྑ͔ͬͨ • We experiment with
increasing network capacity by increasing embed dimension, FFN size, number of heads, and number of layers. • We find that using a larger FFN size (8192) gives a reasonable improvement in performance while maintaining a manageable network size • WMT’20 ࢀՃ࣌ʹ௥ࢼɾ࠾༻ [Kiyono+2020] ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 40 Table from [Ng+2019]

දݱྗͷͦͷઌ΁ɿMixture of Experts • ී௨ʹϞσϧΛେ͖͍ͯ͘͘͠ͱ… • 😨 දݱྗUPͷͨΊύϥϝλ૿→Ϟσϧͷܭࢉ࣌ؒ૿ • ҰํɺMixture
of Expertsͷ৔߹ • 😀 ෳ਺ͷHomogenousͳExpertΛ༻ҙ→ύϥϝλ૿ɾදݱྗUP • 😀 ೖྗʹԠͯ͡Ұ෦ΛൃՐ→ܭࢉ࣌ؒ࡟ݮ • Ұ෦ք۾ʢGoogleͱFacebookʣͰྲྀߦ͍ͬͯΔ༷ࢠ • Switch Transformer [Fedus+2021], Base Layers [Lewis+2021], Sparse all-MLP [Yu+2022] ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 41 Figure from [Fedus+2021]

ᶆ Layer Normalization ࠞಱͱͨ͠ੈք ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 42

͓͞Β͍ɿLayer Normalization (LN) • ֤ӅΕ૚͔Βฏۉͱ෼ࢄΛܭࢉ →ਖ਼نԽ • ֶशͷ҆ఆԽʹߩݙ • LNͳ͠ͷTransformer͸ֶशࠔ೉
• AttentionͱFeedForwardϒϩοΫ ʹҰͭͣͭ഑ஔ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 43 Multi-Head Attention Add & Norm Feed Forward Add & Norm Input Embedding + ຒΊࠐΈʹ LayerNorm͢Δ ྲྀ೿΋ ֶश͸҆ఆ͢Δ͕ ੑೳ͕ѱԽ͢Δ [Le Scao+2022] LNϙΠϯτᶃ LNϙΠϯτᶄ Figure from [Zhang+2019] RNNͷ৔߹

LNΛͲ͜ʹஔ͔͘ʁɿPost-LNͱPre-LN • Vanilla Transformer͸Post-LN • ࠷ۙͷTransformer͸Pre-LN͕ଟ͍ • ༨ஊɿPre-LNͷॳग़͸Tensor2Tensor • ͜ͷ͋ͱ࿦จ͕ͨ͘͞Μग़ͨ
໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 44 Post-LN ResidualޙʹLN Pre-LN ؔ਺ͷલʹLN ͦͷޙPre-LN͕ σϑΥϧτʹ ˠͱͯ΋ॏཁ Ռͨͯ͠LNͷҐஔ͕ͦΜͳʹॏཁͳͷ͔ʁ Figure from [Xiong+2020]

Post-LNͱPre-LNͷ೰·͍ؔ͠܎ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 45 2 st-LN e-LN ual 後に Norm
本研究の貢献・Post-LN と Pre-LN の性能差を実験的に⽰す・多層 Post-LN の学習が難しい原因を⽰す・⾼い性能を維持しつつ多層化する⼿法を提案性能多層化 Post-LN ◦ × Pre-LN × ◦ B2T（提案⼿法） ◦ ◦ Layer Norm Attention FFN Layer Norm Layer Norm Attention FFN Layer Norm Layer Norm Attention × N × N Attention Layer Norm Layer Norm FFN Layer Norm (a) Post-LN er Norm ention Layer Norm Attention Attention Layer Norm Attention Layer Norm Layer Norm Attention Layer Norm Attention (a) Post-LN (b) Pre-LN (c) Post-LN with B2T connection ention er Norm er Norm Attention Layer Norm Layer Norm FFN Attention Layer Norm × N FFN er Norm Layer Norm Attention FFN Layer Norm Layer Norm Attention FFN Layer Norm Layer Norm Attention × N (b) Pre-LN (c) Post-LN with B2T connection 適⽤前に Norm ೰ΈɿPre-LN͸ֶश͕༰қ͕ͩੑೳ͕௿͍ 1PTU-/ͷ΄͏͕ੑೳ͕ߴ͍ Pre-LN で良いのでは︖ Pre-LN は多層にしたときの学習が安定 – 近年の多層なモデル（例︓GPT）はPre-LN しかし Pre-LN は性能が低い – Post-LN は学習が成功した場合 Pre-LN より⾼性能 • 6層 Transformer エンコーダ・デコーダの性能⽐較 7 Post-LN Post-LN Pre-LN Pre-LN 翻訳（WMT英-独）でのBLEU ⾒出し⽂⽣成でのROUGE-1 ※ߴ੉ᠳࢯͷNLP2022ൃදεϥΠυΑΓҾ༻ 1PTU-/ͷଟ૚Խ͸೉͍͠ エンコーダ側デコーダ側各層の勾配のノルムデコーダ側で勾配消失が発⽣ 101 100 10-1 10-1 100

Post-LNͱPre-LNͷྑ͍ͱ͜औΓΛ͍ͨ͠ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 46 ͪΐ͏Ͳྑ͍ݚڀ͕…

ޯ഑ফࣦͷݪҼ͸LN • 勾配消失の原因を探る – デコーダの18層⽬の各位置における勾配のノルムを調べる Layer Norm Attention FFN Layer
Norm Layer Norm Attention (4) → (3)，(2) → (1) で勾配が⼤きく減衰 → LN をまたぐと勾配が⼤きく減衰 → LN が勾配を減衰させる → LN が勾配消失の原因 ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 47 ※ߴ੉ᠳࢯͷNLP2022ൃදεϥΠυΑΓҾ༻

B2T Connection [Takase+2022] Ͱྑ͍ͱ͜औΓΛ࣮ݱ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 48 Layer Norm Attention
FFN Layer Norm Layer Norm Attention FFN Layer Norm Layer Norm Attention × N × N (c) Post-LN with B2T connection ͸ Pre-LNɼ(c) ͸ Post-LN ʹఏҊख๏Λ૊Έ 提案⼿法︓⾚字の Residual Connection を追加各層の下から上までの経路︓Bottom-to-Top Connection 1. 各層の最後の LN 以外を迂回する → LN による勾配の減衰を抑制 2. 最後の LN は⼊⼒にも適⽤ ※ߴ੉ᠳࢯͷNLP2022ൃදεϥΠυΑΓҾ༻

ൺֱ࣮ݧɿػց຋༁λεΫ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 49 ※ߴ੉ᠳࢯͷNLP2022ൃදεϥΠυΑΓҾ༻ B2T connection は学習に成功性能も他の⼿法より⾼い
• あ学習失敗 6層エンコーダ・デコーダ 18層エンコーダ・デコーダ Post-LN は Pre-LN よりも⾼い性能 B2T connection は Post-LN と同程度の性能

LNΛফͯ͠ޯ഑ফࣦʹରॲɿT-FixUp [Huang+2020] ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 50 • T-FixUpɿ͔ͳΓݡ͍ॳظԽ • LNΛϞσϧશମ͔ΒຣফͰ͖Δ •
ֶश཰ͷWarmup͕ཁΒͳ͘ͳΔ • ଟ૚ʹͯ͠΋ֶश͕҆ఆ͢Δ • Warmup͸ͳͥඞཁ͔ͩͬͨʁ • LNͷޯ഑͸ೖྗͷϊϧϜʹ൓ൺྫ • ֶशॳظ͸Adamͷਪఆྔ͕ෆ҆ఆ→ߋ৽ྔ📈 • ݁ՌɺೖྗͷϊϧϜ͕📈 • ޯ഑͕ফֶ͑ͯशࠔ೉ʹ • 👍ϋΠύϥ (Warmup) ͕ফͤΔ • 👎 Կεςοϓ͔Βֶश཰ΛԼ͛Δ͔ʁ ͱ͍͏ϋΠύϥ͕૿͍͑ͯΔ • ϓϥϚΠθϩͰ͸ʁ Figure from [Huang+2020] ޯ഑ͷ༷ࢠ

͋͑ͯLNΛ૿΍͢ɿNormformer [Shleifer+2021] • LNΛ૿΍͢ͱऩଋ଎౓ͱੑೳ͕վળ͢Δ ͱ͍͏ݚڀ • େن໛ݴޠϞσϧͰͷ࣮ݧͰൺֱ • Pre-LNʹ͓͚Δ௿૚ͷޯ഑ͱ ߴ૚ͷޯ഑ͷεέʔϧͷෆҰகΛ໰୊ࢹ
• AttentionͱFeedForwardޙʹLNͯ͠ εέʔϧΛҰகͤ͞Δ • εέʔϧͷෆҰக͕ͳͥ໰୊ͳͷ͔ʁ • ΈΜͳ࢖ͬͯΔσʔλ͕ҧ͏ɿ֤ख๏ͷ ൺֱ͕ࠔ೉ • ࠞಱͱ͍ͯ͠Δ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 51 Figure from [Shleifer+2021]

ͻͱ΍͢ΈɿมΘΓछTransformer ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 52

Sandwitch Transformer [Press+2019] ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 53 Multi-Head Attention Add &
Norm Feed Forward Add & Norm Input Embedding + ΞΠσΞ (Attention+FeedForward)ΛN૚ੵΉɺ͸ຊ౰ʹ࠷ద͔ʁ →֤αϒϨΠϠ͸Մ׵Ͱ͋Δ͜ͱʹண໨͠ɺ ॱ൪ΛೖΕସ͑ͯΈΔ

Sandwitch Transformer [Press+2019] • ॱ൪ΛϥϯμϜʹೖΕସ͑ͯΈ࣮ͯݧ • ϕʔεϥΠϯΑΓ΋ྑ͍ߏ଄͕͋ΔͬΆ͍ • ೖྗଆʹAttentionଟΊɺग़ྗଆʹFFNଟΊ͕ ྑͦ͞͏
• ͭ·Γ • ݴޠϞσϧͰੑೳ޲্ɺ຋༁Ͱ͸ޮՌͳ͠ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 54 which is comparable to the performance of the base- ine with the same number of parameters. We next generalize this model and the original nterleaved transformer, creating the family of sandwich transformers. A sandwichn k transformer con- ists of 2n sublayers in total (n of each type), con- orming to the regular expression sk(sf)n k fk. The first k sublayers are purely self-attention (s), while the last k are feedforward sublayers (f). In etween, we use the original interleaving pattern sf) to fill the remaining 2(n k) sublayers. When k = 0, we get the original transformer model, and when k = n 1 (its maximal value) we get the Model Test Baseline (Baevski and Auli, 2019) 18.70 Transformer XL (Dai et al., 2019) 18.30 kNN-LM (Khandelwal et al., 2019) 15.79 Baseline (5 Runs) 18.63 ± 0.26 Sandwich16 6 17.96 Table 3: Performance on the WikiText-103 test set. We compare the best sandwich transformer to the unmod- ified, interleaved transformer baseline (Baevski and Auli, 2019) trained over 5 random seeds and to other previously reported results. than the average baseline transformer. Of those, 6 models outperform the best baseline transformer (k = 5, 6, 8, 9, 10, 11). The best performance of 17.84 perplexity is obtained when k = 6. We compare this model to the baseline on WikiText-103’s test set. Table 3 shows that, despite its simple design, the sandwich transformer outperforms the original transformer baseline by roughly double the gap be- tween the baseline (Baevski and Auli, 2019) and Transformer XL (Dai et al., 2019). This improvement comes at no extra cost in parameters, data, memory, or computation; we did not even change any of the original hyperparameters, including the number of training epochs. To check whether this advantage is consistent, we train 4 more sandwich16 6 models with different Figure 5: The transformer’s sandwich coefficient (k) and validation perplexity, for k 2 {1, . . . , 15}. The dotted line is the average baseline model’s perplexity (trained with different random seeds), whereas the dashed line represents the best baseline model. Figure 6: Performance on the WikiText-103 develop- ment set of the Sandwich16 transformer and the base- Figures from [Press+2019] ଠࣈ͕Baseline ௿͍΄Ͳྑ͍ੑೳ

Primer [So+2021] • TransformerͷΞʔΩςΫνϟΛҨ఻తΞϧΰϦζϜͰ୳ࡧ • ୳ࡧۭؒΛAdd/Square/Conv…ͳͲͷ௿ϨϕϧԋࢉͰߏ੒ • ൚༻తʹྑ͔֦ͬͨு͸2ͭ 1. Query,
Key, Valueʹରͯ͠෯3ͷDepthwise Convolution (DConv) 2. ReLUͷग़ྗΛೋ৐͢Δ • DConvͷؾ࣋ͪʹͳͬͯΈΔ • Attentionͷલʹۙ๣Λݟ͍ͨ • Ͱ΋ී௨ͷConv͡ΌͩΊ • ෯Λ޿ͯ͘͠΋ͩΊ • αϯϓϧޮ཰্͕͕Δͱͷ͜ͱ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 55 νϟωϧؒͰ ৘ใΛࠞͥͳ͍Conv Figure from [So+2021]

Transformer͸”ྑ͘”ͳͬͨͷ͔ʁ ͞·͟·ͳ௥ࢼͷ঺հ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 56

ʮྑ͞ʯ͸࣮૷Λ·͍ͨͰ൚Խ͢Δ͔ʁɿNo • Do Transformer Modifications Transfer Across Implementations and Applications?
[Narang+2021] • ͞·͟·ͳTransformerͷ֦ுΛڞ௨ͷίʔυϕʔεʹ࠶࣮૷ • ݩ࿦จͷใࠂ͢Δ஌ݟʢi.e., ྑ͞ʣΛ࠶ݱͰ͖Δ͔௥ࢼ • ࠶ݱͰ͖Δख๏͸ҎԼͷ͍ͣΕ͔ͱใࠂ • γϯϓϧͳมߋʢ׆ੑԽؔ਺ͳͲʣ • ܭࢉྔͷ૿ՃΛ൐͏΋ͷʢଟ૚ԽͳͲʣ • ύϥϝʔλ਺ͷ૿ՃΛ൐͏΋ͷʢ.JYUVSFPG&YQFSUTͳͲʣ • ϋΠύϥͷௐ੔Λ͍ͯ͠ͳ͍͜ͱʹ஫ҙ • ֦ுख๏ͨͪʹͱͬͯएׯෆརͳઃఆͱΈͳͤΔ • ϋΠύϥͷௐ੔͕ඞཁͳ࣌఺ͰμϝͰ͠ΐɺͱ͍͏ڧ͍ओு ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 57 ͩΊ͡ΌΜ…

ʮྑ͞ʯ͸σʔλʹؔͯ͠൚Խ͢Δ͔ʁɿNo • The Impact of Positional Encodings on Multilingual Compression
• [Ravishankar+2021] • Ґஔදݱख๏ʢઈରҐஔɺ૬ରҐஔʣΛม͑ͯଟݴޠBERTΛ܇࿅ • ଟݴޠλεΫʢ୯ޠΞϥΠϯϝϯτɾจநग़ͳͲʣͰධՁ • ઈରҐஔ͕૬ରҐஔΛ্ճΔੑೳ • ૬ରҐஔ͕ଟݴޠؒͷΞϥΠϯϝϯτͷֶशΛࠔ೉ʹ͍ͯ͠Δʁ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 58 Figure from [Ravishankar+2021]

ʮྑ͞ʯ͸Ϟσϧͷεέʔϧʹؔͯ͠൚Խ͢Δ͔ʁɿNo • Transformerͷѥछʹ͍ͭͯϞσϧͷεέʔϧΛม͑ͨͱ͖ͷੑೳΛൺֱ • Vanilla, Universal Transformer, MLP-Mixer, ALBERT, Performer,
etc… • ݁ՌɺVanilla Transformer͕Ұ൪ྑ͔ͬͨ • 😀 ಛఆͷεέʔϧͰTransformerΑΓྑ͍Ϟσϧ͸ଘࡏ • 😨 TransformerΑΓҰ؏ͯ͠ྑ͍΋ͷ͸ଘࡏ͠ͳ͍ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 59 [Anonymous+2021] ARRʹొ৔ͯ͠Ҏ߱ Իࠫଡͳ͕ͩ͠… Figure from [Anonymous+2021]

ؔ࿈ɿ175BݴޠϞσϧʹ͓͚ΔMetaͷ஌ݟ • খن໛࣮ݧ͔ΒಘͨϋΠύϥ͕௨༻͠ͳ͍ • ͪͳΈʹɿ13Bύϥϝλ͸”খن໛” • ֶश཰ΛGPT-3ઃఆΑΓେ͖ͯ͘͠Έ͚ͨͲେࣦഊ • طଘͷϞσϧ͕௨༻͠ͳ͍ •
Normformerɿֶश్தͰloss͕ఀ଺ • ֶशͷ҆ఆԽ͕ॏཁ • FP16ͩͱGeLU͕਺஋తʹ҆ఆ͠ͳ͍ à ReLUͰଥڠ • EmbeddingʹLN͢Δͱֶश͕҆ఆ͢Δ͚Ͳੑೳ͕ѱ͍ • ؤுͬͯ΋਺ઍUpdate͝ͱʹloss͕ൃࢄ͢Δ à ͦͷ౎౓restart͕ඞཁ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 60 https://github.com/facebookresearch/metaseq/blob/ main/projects/OPT/chronicles/56_percent_update.md LogBook͸Ұಡͷ Ձ஋͋Γ

༨ஊɿWMT 2020ͷܦݧ͔Β • γεςϜ։ൃͷ࠷தʹ͋Ε͜Εطଘख๏Λࢼ͢ • େମ͏·͍͔͘ͳ͍ • ྫɿٙࣅσʔλͷϑΟϧλϦϯά • ਺ԯจͷٙࣅσʔλશ෦Ͱֶशͨ͘͠ͳ͍→ྑ͍σʔλΛऔΓग़͍ͨ͠
• ϑΟϧλϦϯάͷ༗ແʹΑΒͣੑೳʹӨڹͳ͠ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 61 ਗ਼໺ͷࢀՃͨ͠ ػց຋༁ͷίϯϖ Setting En!De De!En En!Ja Ja!En ASE (Section 3.1) 42.4 42.0 19.7 21.6 ASE (l = 9)+TAGGED-BT (Section 3.3) 42.7 42.5 22.0 23.9 b) + fine-tuning (Section 3.4) 44.9 42.3 23.1 24.4 c) ⇥ 4 (Section 3.5) 45.5 42.8 23.9 25.4 d) + 4 ⇥ (c)-R2L (Section 3.6) 45.4 43.6 24.2 25.9 e) + reranking (Section 3.7) 45.7 43.8 24.9 26.2 he best system in WMT’19 44.9 42.8 - - of each technique: we use newstest2019 and official validation set for En$De and En$Ja result from WMT’19 is unavailable for En$Ja, because this task has newly appeared this g / ID BLEU chrF able 5) 37.5 0.647 able 5) 43.8 0.690 able 5) 40.1 0.343 ENSEMBLE 25.5 0.536 e on WMT’20 Test Set: refer to newstest Amount of Synthetic Data Used: r (%) 2014 2018 2019 100 33.0 48.0 42.0 50 32.9 48.4 42.3 33 33.1 47.9 42.2 25 32.9 48.5 42.4 Table 7: Effectiveness of corpus filtering on En!De. newstest Table from [Kiyono+2020]

·ͱΊ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 62

ΑΓྑ͍TransformerΛͭ͘Δ • Ґஔදݱ • ઈରҐஔΑΓ΋૬ରҐஔ • ݸਓతʹ͸ALiBiͱSHAPEʹՄೳੑΛײ͡Δ • Attention •
௕͍ܥྻͷॲཧʹ՝୊ • Perceiver͕࣍ͷ೼ݖΛऔΔ͔΋͠Εͳ͍ • Feed-Forward • ࣍ݩΛ૿΍͢ɺ׆ੑԽؔ਺ɺMixture-of-Experts • Layer Normalization • PreLN͕Ұൠత • ྑ͠ѱ͠͸ࠞಱͱ͍ͯ͠Δ… • ྑ͞ͷ௥ࢼ • ൚༻తʹྑ͍ख๏͸΄ͱΜͲແ͍ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 63

ΑΓྑ͍TransformerΛͭ͘Δʁ • ಛఆͷઃఆͰ”ྑ͍”Transformer͸ग़͖͍ͯͯΔ • ͔͠͠ɺ൚༻తʹྑ͍΋ͷΛͭ͘Δͷ͸ඇৗʹࠔ೉ • ݴ͍׵͑ΔͱɺσʔλྔɾλεΫɾεέʔϧɾ࣮૷ʹΑͬͯ ྑ͍Transformerͷઃఆ͸ҧ͏ • …ͱ͍͏ͷ͕͜Ε·Ͱͷ஌ݟ
• ݁ՌɿҰൠతʹ࠾༻͞Ε͍ͯΔख๏͸͔ᷮ͘͝ • Pre-LNɾGeLUʢɾ૬ରҐஔʁʁʣ • ΠϯύΫτͷେ͖͍ݚڀ෼໺Ͱ͋Δ͜ͱʹมΘΓ͸ͳ͍ • ࠓޙ΋ओྲྀͷݚڀͱͯ͠ଓ͖ͦ͏ ໊ݹ԰஍۠NLPηϛφʔ 2022/06/07 64

より良いTransformerをつくる

より良いTransformerをつくる

More Decks by Shun Kiyono

Other Decks in Research

Featured

Transcript