Slide 1

Slide 1 text

Building a Better Transformer — Shun Kiyono, RIKEN AIP

Slide 2

Slide 2 text

Acknowledgments • Sosuke Kobayashi (Preferred Networks / Tohoku University) • Sho Takase (Tokyo Institute of Technology) • Prof. Jun Suzuki (Tohoku University) • They helped shape the theme of this talk and polish the slides; I would like to take this opportunity to thank them deeply. (Nagoya-area NLP Seminar, 2022/06/07)

Slide 3

Slide 3 text

Recap: the Transformer [Vaswani+2017] • Stack many Attention → Feed-forward blocks, plus assorted tricks • The backbone of sequence-to-sequence models and pretrained language models • Recently also very active in speech and vision • ∴ Improving the Transformer itself has a large impact • It could even change the world

Slide 4

Slide 4 text

A slightly longer recap of the Transformer • Figure from https://jalammar.github.io/illustrated-transformer/ • Attention exchanges information across tokens; the feed-forward layer computes within each token

Slide 5

Slide 5 text

By the way, what makes a Transformer a "good" Transformer?

Slide 6

Slide 6 text

A good Transformer is a high-performing Transformer (this is just this talk's working definition) • What this talk mainly covers: methods that aim for task-agnostic performance gains • What this talk mostly does not cover: methods targeting a specific task or dataset, and methods for computational efficiency (e.g., parameter reduction, faster attention, quantization)

Slide 7

Slide 7 text

Structure of this talk: work on each component of the Transformer • (1) Position representations • (2) Attention • (3) Feed-forward • (4) Layer Normalization

Slide 8

Slide 8 text

(1) Position representations — absolute? relative? or something else entirely...

Slide 9

Slide 9 text

Absolute Position Embedding (APE) • Represents each position with a dedicated vector • e.g., combinations of sin/cos waves [Vaswani+2017] • 😀 Simple, fast, and easy to implement: positions are just a lookup into a table • 😰 Does not generalize to sequences of unseen length, i.e., extrapolation is difficult • [Figure from [Neishi+2019]: BLEU on test data split by sentence length for (a) ASPEC English-to-Japanese and (b) WMT2014 English-to-German; performance drops in the unseen-length region with no training data]
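To make the "just a table lookup" point concrete, here is a minimal sketch of the sinusoidal APE from [Vaswani+2017]; the function name and shapes are illustrative assumptions.

```python
import torch

def sinusoidal_ape(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) table of sin/cos position vectors."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (L, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dims
    inv_freq = 1.0 / (10000 ** (i / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * inv_freq)
    pe[:, 1::2] = torch.cos(pos * inv_freq)
    return pe

# Usage: positions are just a lookup added to the token embeddings.
# x: (batch, seq_len, d_model)
# x = x + sinusoidal_ape(x.size(1), x.size(2)).to(x.device)
```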

Slide 10

Slide 10 text

Relative Position Embedding (RPE) • Uses the distance between each pair of tokens inside attention • 😀 Robust to unseen lengths — keyword: shift invariance • 😰 Slower to compute than absolute positions • 😰 Hard to combine with attention variants (Performer, Linformer, etc.)

Slide 11

Slide 11 text

Absolute vs. relative positions: summary so far • Performance: absolute 😰, relative 😀, <new method> ? • Speed: absolute 😀, relative 😰, ? • Implementation cost: absolute 😀, relative 😰, ? • First, let me introduce our own work

Slide 12

Slide 12 text

SHAPE: Shifted Absolute Position Embedding • with Sosuke Kobayashi, Jun Suzuki, and Kentaro Inui (Tohoku University / Preferred Networks) • Accepted at EMNLP 2021 • [Figure 1 from the paper: overview of position representations — (a) SHAPE (APE shifted by a random offset k), (b) RPE (relative distances), and (c) the Transformer with APE]

Slide 13

Slide 13 text

Bringing shift invariance into absolute positions • RPE is robust to unseen lengths → shift invariance seems to be the key property • Shift invariance: shifting the positions of the input does not change the function's output • e.g., with relative positions, the attention computed over positions 0–3 equals the attention computed over positions 34–37, because only relative distances enter the computation

Slide 14

Slide 14 text

Absolute positions + random shift = shift invariance • Shifted Absolute Position Embedding (SHAPE) • Randomly shift the absolute positions by an offset k ~ U(0, K) • The model can no longer rely on absolute positions • It is instead forced to learn to rely on relative positions (a sketch follows below)
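A minimal sketch of the SHAPE idea, assuming one random offset per sequence; the function name and signature are illustrative, not the official implementation.

```python
import torch

def shape_positions(batch_size: int, seq_len: int, max_offset: int,
                    training: bool = True) -> torch.Tensor:
    """Return position indices of shape (batch_size, seq_len)."""
    pos = torch.arange(seq_len).unsqueeze(0).expand(batch_size, -1)  # 0..L-1
    if training:
        # k ~ U(0, K), one offset per sequence; the model can no longer rely
        # on absolute values and is pushed toward shift-invariant solutions.
        k = torch.randint(0, max_offset + 1, (batch_size, 1))
        pos = pos + k
    return pos

# The shifted indices are then fed to the usual APE lookup
# (sinusoidal table or learned embedding) in place of 0..L-1.
```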

Slide 15

Slide 15 text

A quick summary so far • [Figure 1 from the paper again: (a) SHAPE — APE shifted by a random offset k, (b) RPE, and (c) the Transformer with APE]

Slide 16

Slide 16 text

Preliminary study: does SHAPE actually acquire shift invariance? • Take a trained model and vary the value of k • Compare the cosine similarity of the encoder hidden states for each k • APE: each k produces different hidden states • SHAPE: the value of k does not affect the hidden states • [Figure: per-layer cosine similarities of encoder hidden states for k ∈ {0, 100, 250, 500}, APE vs. SHAPE]

Slide 17

Slide 17 text

Overview of the experimental setup • Models: Transformer + APE, RPE, or SHAPE • Task: machine translation • Training data: 1. Vanilla — WMT 2016 En-De [Ott+2018]; 2. Extrapolate — built by removing sequences of length 50 or more from the training data; 3. Interpolate — built by concatenating adjacent sequences (omitted in this talk) • Dev data: newstest2010-2013 • Test data: newstest2014-2016 • Evaluation: sacreBLEU

Slide 18

Slide 18 text

Experimental setup: datasets • We want to evaluate on both known and unseen lengths • (1) Vanilla: WMT 2016 En-De [Ott+2018] — a standard MT benchmark, used to confirm baseline performance • (2) Extrapolate: sequences of length 50 or more removed from the training data — used to check extrapolation, i.e., robustness to unseen lengths

Slide 19

Slide 19 text

Results: RPE and SHAPE perform on par • Extrapolate: RPE and SHAPE outperform APE; SHAPE matches RPE while running faster • Vanilla: all models perform on par, so SHAPE can be used without fear of degradation • [Table 2 from the paper — BLEU (Valid / Test) and relative speed: Vanilla: APE 23.61 / 30.46 (x1.00), RPE 23.67 / 30.54 (x0.91), SHAPE 23.63 / 30.49 (x1.01); Extrapolate: APE 22.18 / 29.22 (x1.00), RPE 22.97 / 29.86 (x0.91), SHAPE 22.96 / 29.80 (x0.99)]

Slide 20

Slide 20 text

Performance by length: extrapolation improves • [Figure: BLEU improvement over the APE baseline by source length bucket (0-9 through 70+); lengths of 50 and above are the unseen-length zone] • SHAPE and RPE extrapolate better than APE • SHAPE and RPE perform on par

Slide 21

Slide 21 text

SHAPE: summary • SHAPE = shifted absolute position embedding • Gives APE shift invariance (the important part) • Same speed as APE • Same performance as RPE • Easy to implement (the slide shows an example implementation; see also the sketch earlier) • Performance: APE 😰, RPE 😀, SHAPE 😀 • Speed: APE 😀, RPE 😰, SHAPE 😀 • Implementation cost: APE 😀, RPE 😰, SHAPE 😀

Slide 22

Slide 22 text

Other recent position representations • ALiBi: Attention with Linear Biases [Press+2021] • RoPE: Rotary Position Embedding [Su+2021] • CAPE: Continuous Augmented Positional Embeddings [Likhomanenko+2021] • TUPE: Transformer with Untied Positional Encoding [Ke+2020] (omitted this time) • Q. Aren't these all the same? A. They are not...

Slide 23

Slide 23 text

ALiBi: Attention with Linear Biases [Press+2021] • Adds a distance-dependent penalty to the query-key dot products • The farther away a token is, the lower its attention score • Can be viewed as an inductive bias that favors nearby tokens • Better performance than APE, and extrapolates well • Also adopted in the BigScience workshop model [Le Scao+2022] • Figures from [Press+2021]
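A minimal sketch of the ALiBi bias as I understand it: a per-head linear penalty added to the attention scores, proportional to how far back the key is. The simple 2^-(h+1) slope schedule matches the paper's default for 8 heads; everything else here is an illustrative assumption.

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Return a (num_heads, seq_len, seq_len) additive bias for causal attention."""
    # Head-specific slopes: a geometric sequence 2^-1, 2^-2, ...
    slopes = torch.tensor([2.0 ** -(h + 1) for h in range(num_heads)])
    q_pos = torch.arange(seq_len).unsqueeze(1)   # (L, 1)
    k_pos = torch.arange(seq_len).unsqueeze(0)   # (1, L)
    distance = (q_pos - k_pos).clamp(min=0)      # how far back the key is
    # More distant keys get a larger negative bias -> lower attention score.
    return -slopes.view(-1, 1, 1) * distance.float()

# Usage: scores = q @ k.transpose(-2, -1) / d ** 0.5 + alibi_bias(H, L)
```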

Slide 24

Slide 24 text

RoPE: Rotary Position Embedding [Su+2021] • Goal: make the query-key dot product depend on the relative distance between tokens • Rotates each token vector by an angle mθ determined by its position m • Same relative distance between query and key → same angle between them • Outperforms absolute positions • Adopted in GPT-J, GPT-NeoX, and PaLM • Figures from [Su+2021]
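A minimal sketch of the rotation: pairs of dimensions are rotated by position-dependent angles, so the query-key dot product depends only on the relative offset. Names and the pairing convention are illustrative assumptions.

```python
import torch

def apply_rope(x: torch.Tensor) -> torch.Tensor:
    """x: (batch, seq_len, dim) with even dim; returns the rotated tensor."""
    _, seq_len, dim = x.shape
    theta = 10000 ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    m = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.outer(m, theta)               # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]          # split into 2D pairs
    # Rotate each (x1, x2) pair by its position-dependent angle m * theta_i.
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Applied to the queries and keys (not the values) before the dot product.
```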

Slide 25

Slide 25 text

CAPE: Continuous Augmented Positional Embeddings [Likhomanenko+2021] • The camp that wants to make APE better • Adds the following operations during training: Global Shift (randomly shift the positions of the whole input), Local Shift (randomly shift the positions of parts of the input), Global Scaling (randomly rescale the positions) • Reports gains on vision, speech, and language tasks • Looks familiar from somewhere... • Figure from [Likhomanenko+2021]

Slide 26

Slide 26 text

(2) Attention — maybe convolution is enough / can we somehow make it fixed-length / Attention is the best!

Slide 27

Slide 27 text

Recap: Attention • The core of the Transformer? • A mechanism for mixing information nicely within a variable-length sequence • Said to make learning long-range dependencies easier • Parallelizable, hence faster than RNNs • However, computation becomes painful for very long inputs: it needs a sequence × sequence score matrix • [Figure from [Jaegle+2022]: when the keys/values come from the same array as the queries, this is self-attention]
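A minimal sketch of scaled dot-product attention, mainly to show where the sequence × sequence cost comes from; masking and multiple heads are omitted.

```python
import torch
import torch.nn.functional as F

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q: (B, L_q, d), k/v: (B, L_k, d). Self-attention when q, k, v share a source."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (B, L_q, L_k): quadratic in length
    weights = F.softmax(scores, dim=-1)           # mix information across positions
    return weights @ v                            # (B, L_q, d)
```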

Slide 28

Slide 28 text

The current landscape around attention • Replace attention with something else: CNN [Wu+2019], MLP [Sun+2021] • Use less attention: increase the feed-forward instead? [Zhang+2020] [Irie+2020] • Make attention itself lighter: see Efficient Transformers: A Survey • Exploit attention's strengths further: Perceiver [Jaegle+2021], Memorizing Transformers [Wu+2022]

Slide 29

Slide 29 text

Dynamic Convolution [Wu+2019] • Probably the best-known attempt to replace attention with a CNN • Generates the convolution kernel dynamically from the input at each time step • Outperforms attention on MT, summarization, and language modeling • Fine-grained techniques such as parameter sharing seem important • The maximum kernel width is 31 → is seeing the whole sequence what matters? • Figure from [Wu+2019], via the paper-review slides for Pay Less Attention with Lightweight and Dynamic Convolutions

Slide 30

Slide 30 text

CNNs in the pretrain-finetune setting [Tay+2022] • Question: are the benefits of pretrain-finetune exclusive to the Transformer? • Compared a Transformer against three kinds of CNNs: CNNs beat the Transformer on 6 of 7 tasks • Note, though, that the CNN side gets three variants • Also note that on tasks involving multiple sentences, such as question answering and entailment recognition, CNN < Transformer: CNNs struggle to exchange information across sentences • Table from [Tay+2022]

Slide 31

Slide 31 text

Perceiver [Jaegle+2021] • A model that works across language, speech, and vision, built from attention • 😩 Resolves the "1D → LSTM, 2D → CNN, ..." dilemma • Basically a Transformer encoder • Long sequences are painful, so the input is compressed to a fixed length via cross-attention with a learned parameter (latent) array — cheap because the latent size N is far smaller than the input size M • Matches strong baselines (ResNet, ViT, ...) on image classification, audio+video analysis, point-cloud classification, and more • Note, however, that the architecture can only handle classification • Figure from [Jaegle+2021]
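A minimal sketch of the latent bottleneck idea: a small learned latent array attends to a long input via cross-attention, so the cost is O(M × N) with N << M rather than O(M²). The module layout and names are my own illustrative assumptions, not the DeepMind implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentBottleneck(nn.Module):
    def __init__(self, num_latents: int = 64, dim: int = 256):
        super().__init__()
        # Learned latent array: the fixed-length side of the cross-attention.
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.q_proj = nn.Linear(dim, dim)
        self.kv_proj = nn.Linear(dim, 2 * dim)

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        """inputs: (B, M, dim) with possibly huge M; returns (B, N, dim)."""
        b, dim = inputs.size(0), inputs.size(-1)
        q = self.q_proj(self.latents).expand(b, -1, -1)    # (B, N, dim)
        k, v = self.kv_proj(inputs).chunk(2, dim=-1)       # (B, M, dim) each
        scores = q @ k.transpose(-2, -1) / dim ** 0.5      # (B, N, M): linear in M
        return F.softmax(scores, dim=-1) @ v               # fixed-length output
```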

Slide 32

Slide 32 text

Perceiver IO [Jaegle+2022] • Uses cross-attention on the output side as well (a new parameter matrix of output queries) • Can be seen as making the input and output sides symmetric • Can emit an output sequence of length o → many more applicable tasks • Works across a range of tasks, including masked language modeling • The input is read only once • Note that future information leaks through the output-side query array, so it cannot be used for generation as-is • Figure from [Jaegle+2022]

Slide 33

Slide 33 text

Perceiver AR [Hawthorne+2022] • Wants to stop future information from leaking • Adds a masking mechanism (the usual causal mask) to the output-side attention, enabling autoregressive generation • Strong on image generation, language modeling, and more, rivaling Transformer-XL • Training stays fast even as sequences grow • Whether it works for sequence-to-sequence tasks is unclear... • Figures from [Hawthorne+2022]

Slide 34

Slide 34 text

Memorizing Transformers [Wu+2022] • Assumes a Transformer decoder • Designs a dedicated attention layer that treats past hidden states as an external memory and attends over them • Prunes the memory with a kNN lookup based on attention scores • Finally mixes the result with ordinary attention • No gradients flow into the memory → it runs fast • Generates mathematical symbols and their definitions well • Or is it just copying rare words? • Related: [Sun+2021] • Figures from [Wu+2022]

Slide 35

Slide 35 text

(3) Feed-forward — more and more expressive power...

Slide 36

Slide 36 text

Recap: the feed-forward layer • Linear + activation + linear • Convention: expand the dimension (roughly 4x) and project back • Transformer-base: 512 → 2048 → 512 • Transformer-big: 1024 → 4096 → 1024 • What is it actually doing? • It can be viewed as attention over an external memory [Sukhbaatar+2019] • It behaves as a key-value memory [Geva+2021]
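A minimal sketch of the position-wise feed-forward block with the conventional 4x expansion (Transformer-base sizes assumed for illustration).

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)     # e.g., 512 -> 2048 (Transformer-base)
        self.act = nn.ReLU()
        self.down = nn.Linear(d_ff, d_model)   # back to 512

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Applied independently at every position; no mixing across tokens.
        return self.down(self.act(self.up(x)))
```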

Slide 37

Slide 37 text

Recent progress: changing the activation function • The vanilla Transformer uses ReLU • Language models such as GPT and BERT mostly use GeLU (the first use was probably GPT) • There are also reports that GLU variants help [Shazeer+2020] • Why they work is unknown: "We offer no explanation as to why these architectures seem to work" • Figures from [Hendrycks+2016] and [Narang+2021]
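A minimal sketch of one GLU-style feed-forward block (GEGLU, one of the variants reported in [Shazeer+2020]); the exact variant, names, and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GEGLUFeedForward(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.value = nn.Linear(d_model, d_ff)
        self.gate = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The GeLU-activated gate multiplies the linear "value" branch.
        return self.down(F.gelu(self.gate(x)) * self.value(x))
```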

Slide 38

Slide 38 text

Recent progress: increasing the dimension • Facebook's report at WMT'19 [Ng+2019]: after trying many things, enlarging the FFN helped • "We experiment with increasing network capacity by increasing embed dimension, FFN size, number of heads, and number of layers." • "We find that using a larger FFN size (8192) gives a reasonable improvement in performance while maintaining a manageable network size" • We replicated this and adopted it in our WMT'20 submission [Kiyono+2020] • Table from [Ng+2019]

Slide 39

Slide 39 text

Beyond that expressive power: Mixture of Experts • Simply growing the model: 😨 more parameters for expressiveness → more compute per example • With Mixture of Experts: 😀 prepare several homogeneous experts → more parameters and expressiveness; 😀 only a subset fires for each input → compute stays low • Seems popular in certain circles (Google and Facebook): Switch Transformer [Fedus+2021], BASE Layers [Lewis+2021], Sparse all-MLP [Yu+2022] • Figure from [Fedus+2021]
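A minimal sketch of Switch-style top-1 routing: each token is dispatched to one expert FFN chosen by a learned router, so parameters grow with the number of experts while per-token compute stays roughly constant. This loops over experts for clarity and omits load-balancing losses and capacity limits; all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (num_tokens, d_model) -- tokens already flattened."""
        probs = F.softmax(self.router(x), dim=-1)   # (tokens, experts)
        gate, choice = probs.max(dim=-1)            # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = choice == e
            if mask.any():
                # Scale by the gate value so the router also receives gradients.
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out
```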

Slide 40

Slide 40 text

(4) Layer Normalization — a world of chaos

Slide 41

Slide 41 text

Recap: Layer Normalization (LN) • Computes the mean and variance of each hidden vector and normalizes it • Contributes to training stability: a Transformer without LN is hard to train • Placed once in the attention block and once in the feed-forward block • Some also apply LayerNorm to the embeddings, which stabilizes training but hurts performance [Le Scao+2022] • Figure from [Zhang+2019] (the RNN case)
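A minimal sketch of layer normalization: per hidden vector, subtract the mean, divide by the standard deviation, then apply a learned scale and bias (equivalent in spirit to torch.nn.LayerNorm).

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(d_model))
        self.bias = nn.Parameter(torch.zeros(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return self.gain * (x - mean) / torch.sqrt(var + self.eps) + self.bias
```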

Slide 42

Slide 42 text

Where should LN go? Post-LN vs. Pre-LN • The vanilla Transformer uses Post-LN (LN after the residual addition) • Most recent Transformers use Pre-LN (LN before each sublayer function), which has since become the default → so this choice is very important • Aside: Pre-LN first appeared in Tensor2Tensor, and many papers followed • But is the position of LN really that important? • Figure from [Xiong+2020]
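A minimal sketch contrasting the two placements for one sublayer f (attention or feed-forward); `ln` and `f` are assumed to be given modules.

```python
def post_ln_block(x, f, ln):
    # Vanilla Transformer: LN is applied after the residual addition.
    return ln(x + f(x))

def pre_ln_block(x, f, ln):
    # Pre-LN: LN is applied to the sublayer input; the residual path stays
    # un-normalized, which is what makes deep stacks easier to train.
    return x + f(ln(x))
```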

Slide 43

Slide 43 text

The troublesome relationship between Post-LN and Pre-LN • (From Sho Takase's NLP2022 slides) Contributions of that work: experimentally show the performance gap between Post-LN and Pre-LN; identify why training deep Post-LN is hard; propose a method that goes deep while keeping high performance • Performance / depth: Post-LN ○ / ×, Pre-LN × / ○, B2T (proposed) ○ / ○ • The dilemma: Pre-LN is easy to train but performs worse; Post-LN performs better • Isn't Pre-LN good enough? Pre-LN is stable when stacked deep, and recent deep models (e.g., GPT) use it; but when Post-LN training succeeds, it outperforms Pre-LN (comparison on 6-layer Transformer encoder-decoders: BLEU on WMT En-De translation and ROUGE-1 on headline generation) • Deep Post-LN is hard to train: per-layer gradient norms show vanishing gradients on the decoder side

Slide 44

Slide 44 text

We want the best of both Post-LN and Pre-LN • Conveniently, there is just the right piece of work...

Slide 45

Slide 45 text

LN is the cause of the vanishing gradients • (From Sho Takase's NLP2022 slides) Investigating the cause: examine the gradient norm at each point inside the 18th decoder layer • The gradient decays sharply from (4) → (3),(2) → (1), i.e., every time it crosses an LN → LN attenuates the gradient → LN is the cause of the vanishing gradients

Slide 46

Slide 46 text

B2T Connection [Takase+2022]: getting the best of both • (From Sho Takase's NLP2022 slides) Proposed method: add an extra residual connection running from the bottom to the top of each layer (Bottom-to-Top connection) • 1. It bypasses everything except the last LN in each layer → suppresses the gradient attenuation caused by LN • 2. The last LN is therefore also applied to the layer input
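A rough sketch of one Post-LN layer with a B2T connection, based on my reading of the slide: the layer input x bypasses the intermediate LN and is added back just before the layer's final LN. Treat this as an assumption and see [Takase+2022] for the exact formulation.

```python
def post_ln_b2t_layer(x, attn, ffn, ln1, ln2):
    a = ln1(x + attn(x))          # usual Post-LN attention sub-block
    # B2T: add the bottom (x) back at the top, so gradients have a path that
    # crosses only the final LN of the layer.
    return ln2(a + ffn(a) + x)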

Slide 47

Slide 47 text

Comparison experiments: machine translation • (From Sho Takase's NLP2022 slides) 6-layer encoder-decoder: Post-LN outperforms Pre-LN, and the B2T connection matches Post-LN • 18-layer encoder-decoder: plain Post-LN fails to train, while the B2T connection trains successfully and outperforms the other methods

Slide 48

Slide 48 text

Removing LN to deal with vanishing gradients: T-FixUp [Huang+2020] • T-FixUp: a rather clever initialization • LN can be removed from the entire model • Learning-rate warmup is no longer needed • Training stays stable even when deep • Why was warmup needed in the first place? LN's gradient is inversely proportional to the norm of its input; early in training Adam's estimates are unstable → large updates 📈 → the input norm grows 📈 → gradients vanish and training becomes difficult • 👍 Removes a hyperparameter (warmup) • 👎 But adds a new one: from which step should the learning rate start decaying? • So does it come out even? • Figure from [Huang+2020] (gradient behavior)

Slide 49

Slide 49 text

Deliberately adding more LN: Normformer [Shleifer+2021] • A study reporting that adding LN improves convergence speed and performance, with experiments on large language models • Treats the mismatch between gradient scales in lower and upper layers under Pre-LN as the problem • Adds LN after attention and after the feed-forward block to align the scales • Why is the scale mismatch a problem in the first place? • Everyone uses different data, which makes comparing these methods hard • The situation is chaotic • Figure from [Shleifer+2021]

Slide 50

Slide 50 text

A short break: unusual Transformer variants

Slide 51

Slide 51 text

Sandwich Transformer [Press+2019] • Idea: is stacking N identical (attention + feed-forward) layers really optimal? • Noting that the sublayers are interchangeable, try reordering them

Slide 52

Slide 52 text

Sandwich Transformer [Press+2019] • Experiments with randomly reordered sublayers • Some orderings appear better than the baseline • Putting more attention near the input and more feed-forward near the output seems to help • In other words, a "sandwich" pattern: k self-attention sublayers, then interleaved (attention, feed-forward) pairs, then k feed-forward sublayers • Improves language modeling, but has no effect on translation • [Table 3 from the paper — WikiText-103 test perplexity, lower is better, baseline in bold: Baseline (Baevski and Auli, 2019) 18.70, Transformer XL (Dai et al., 2019) 18.30, kNN-LM (Khandelwal et al., 2019) 15.79, Baseline (5 runs) 18.63 ± 0.26, Sandwich16_6 17.96] • Figures from [Press+2019]

Slide 53

Slide 53 text

Primer [So+2021] • Searches the Transformer architecture with an evolutionary algorithm • The search space is built from low-level operations such as Add, Square, Conv, ... • Two modifications turned out to be broadly useful: 1. a width-3 depthwise convolution (DConv) applied to the queries, keys, and values; 2. squaring the output of ReLU • The intuition behind DConv: you want to look at the local neighborhood before attention, but an ordinary convolution (which mixes information across channels) does not work, and widening the kernel does not help either • Reported to improve sample efficiency • Figure from [So+2021]
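A minimal sketch of Primer's two modifications: squaring the ReLU output, and a width-3 depthwise (per-channel) convolution along the sequence, as would be applied to per-head Q/K/V. Shapes, names, and the exact placement are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def squared_relu(x: torch.Tensor) -> torch.Tensor:
    # Primer's activation: ReLU(x)^2 in the feed-forward block.
    return F.relu(x) ** 2

class DConv(nn.Module):
    """Depthwise convolution along the sequence: groups=channels means no
    information is mixed across channels, only across neighboring positions."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size - 1, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (batch, seq_len, channels) -> same shape, causal over the sequence."""
        y = self.conv(x.transpose(1, 2))   # Conv1d expects (B, C, L)
        y = y[..., : x.size(1)]            # trim the extra padding to stay causal
        return y.transpose(1, 2)
```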

Slide 54

Slide 54 text

Has the Transformer actually become "better"? A look at various replication studies

Slide 55

Slide 55 text

Does "goodness" generalize across implementations? No • Do Transformer Modifications Transfer Across Implementations and Applications? [Narang+2021] • Reimplemented many Transformer modifications in a single shared codebase and tested whether the findings (i.e., the "goodness") reported in the original papers could be reproduced • The methods that did reproduce fall into one of: simple changes (e.g., activation functions), changes that add compute (e.g., more layers), or changes that add parameters (e.g., Mixture of Experts) • Note that hyperparameters were not retuned, which is arguably a slightly unfavorable setting for the modified methods • The authors' strong claim: if a method needs hyperparameter tuning to work, that is already a failure • Well, that's discouraging...

Slide 56

Slide 56 text

Does "goodness" generalize across data? No • The Impact of Positional Encodings on Multilingual Compression [Ravishankar+2021] • Trained multilingual BERT with different position representations (absolute vs. relative) • Evaluated on multilingual tasks (word alignment, sentence retrieval, etc.) • Absolute positions outperform relative positions • Do relative positions make it harder to learn cross-lingual alignment? • Figure from [Ravishankar+2021]

Slide 57

Slide 57 text

Does "goodness" generalize across model scales? No • Compares Transformer variants as the model scale changes: Vanilla, Universal Transformer, MLP-Mixer, ALBERT, Performer, etc. • Result: the vanilla Transformer came out best overall • 😀 At specific scales, some models do beat the Transformer • 😨 But none are consistently better than the Transformer across scales • [Anonymous+2021] — nothing has been heard since it appeared at ARR, but... • Figure from [Anonymous+2021]

Slide 58

Slide 58 text

Related: Meta's findings from the 175B language model (OPT) • Hyperparameters tuned at small scale do not carry over (incidentally, 13B parameters counts as "small scale") • A learning rate larger than the GPT-3 setting led to a complete failure • Existing model variants do not carry over either: with Normformer, the loss plateaued mid-training • Stabilizing training is crucial: GeLU is numerically unstable in FP16, so they settled for ReLU; LayerNorm on the embeddings stabilizes training but hurts performance • Even so, the loss diverges every few thousand updates, requiring a restart each time • The logbook is worth a read: https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/56_percent_update.md

Slide 59

Slide 59 text

Aside: lessons from WMT 2020 (the MT shared task Kiyono took part in) • While building the system we tried all sorts of existing methods • Most of them did not help • Example: filtering synthetic data — we did not want to train on all of the hundreds of millions of synthetic sentences, so we tried to extract only the good ones • Performance was unaffected whether we filtered or not • [Table 7 from [Kiyono+2020] — effectiveness of corpus filtering on En→De, BLEU on newstest2014/2018/2019 by fraction r of synthetic data kept: 100% 33.0/48.0/42.0, 50% 32.9/48.4/42.3, 33% 33.1/47.9/42.2, 25% 32.9/48.5/42.4]

Slide 60

Slide 60 text

Summary

Slide 61

Slide 61 text

Building a better Transformer • Position representations: relative beats absolute; personally I see promise in ALiBi and SHAPE • Attention: handling long sequences is still a challenge; Perceiver might take the next crown • Feed-forward: larger dimensions, better activation functions, Mixture of Experts • Layer Normalization: Pre-LN is the norm, but the trade-offs remain murky... • Replication of "goodness": almost no method is better across the board

Slide 62

Slide 62 text

Building a better Transformer? • Transformers that are "good" in specific settings do keep appearing • But building one that is universally better is extremely hard • In other words, the best Transformer configuration depends on the data size, the task, the scale, and the implementation — that is the takeaway so far • As a result, only a handful of techniques have been adopted across the board: Pre-LN, GeLU (and perhaps relative positions??) • It remains a high-impact research area, and will likely continue as a mainstream line of work