[Vaswani+2017]
• 😀 Simple, fast, and easy to implement
• A position is represented by a simple dictionary lookup from an embedding matrix
• 😰 Does not generalize to sequences of unseen length
• i.e., extrapolation is difficult
Figure from [Neishi+2019]: BLEU scores on test data split by sentence length — (a) ASPEC English-to-Japanese, (b) WMT2014 English-to-German (gray-colored area: unseen lengths with no training data)
Diagram: Transformer block (Input Embedding + position, Multi-Head Attention, Add & Norm, Feed Forward) with positions 0-4 for "John yelled at Kevin"
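As a rough sketch of how that lookup works (PyTorch-style; names and sizes here are illustrative, not the original implementation): a learned absolute position embedding is just an embedding table indexed by position ids, so any position beyond the trained maximum has no entry, which is why extrapolation fails.

```python
# Minimal sketch of a learned absolute position embedding (APE); names are illustrative.
import torch
import torch.nn as nn

class LearnedAPE(nn.Module):
    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        # One learned vector per absolute position 0 .. max_len - 1.
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, token_emb: torch.Tensor) -> torch.Tensor:
        # token_emb: (batch, seq_len, d_model)
        seq_len = token_emb.size(1)
        # The position representation is a plain dictionary lookup into the table.
        positions = torch.arange(seq_len, device=token_emb.device)
        return token_emb + self.pos_emb(positions)  # broadcast over the batch dimension

ape = LearnedAPE(max_len=512, d_model=8)
x = torch.randn(2, 5, 8)   # "John yelled at Kevin ." -> positions 0..4
out = ape(x)               # fine: 5 <= 512
# A sequence longer than max_len would index past the table: no embedding exists
# for those positions, which is the extrapolation problem noted above.
```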
Suzuki, and Kentaro Inui (Tohoku University / Preferred Networks, Inc.)
• Accepted at EMNLP 2021
Screenshot of the paper's first page, including Figure 1: overview of position representations — (a) Absolute Position Embedding (APE), (b) Relative Position Embedding (RPE) over relative key/value distances, (c) Shifted APE (SHAPE), i.e., APE shifted by a random offset k
𝑲), shifting the positions randomly by this offset
• The model can no longer exploit absolute positions
• Instead, it is forced to learn to use relative positions
Diagram: Transformer block with Input Embedding + APE for "John yelled at Kevin"; the position ids 0-4 are shifted by a random offset k before the lookup (SHAPE)
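A minimal sketch of this idea (PyTorch-style; the module name, offset range, and per-sequence sampling are illustrative assumptions, not the authors' code): during training, every position index of a sequence is shifted by a random offset k before the APE lookup.

```python
# Sketch of SHAPE (Shifted Absolute Position Embedding); hyperparameters are illustrative.
import torch
import torch.nn as nn

class SHAPE(nn.Module):
    def __init__(self, max_len: int, d_model: int, max_offset: int = 100):
        super().__init__()
        # The table must cover the largest shifted index: max_len - 1 + max_offset.
        self.pos_emb = nn.Embedding(max_len + max_offset, d_model)
        self.max_offset = max_offset

    def forward(self, token_emb: torch.Tensor) -> torch.Tensor:
        batch, seq_len, _ = token_emb.shape
        positions = torch.arange(seq_len, device=token_emb.device).expand(batch, seq_len)
        if self.training:
            # One random offset k per sequence: positions 0, 1, 2, ... become k, k+1, k+2, ...
            # Absolute values are now uninformative, so the model is pushed to rely on
            # relative distances between tokens instead.
            k = torch.randint(0, self.max_offset + 1, (batch, 1), device=token_emb.device)
            positions = positions + k
        return token_emb + self.pos_emb(positions)
```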
All models achieve roughly the same performance
• SHAPE looks attractive: no risk of performance degradation
Figure 2 from [Kobayashi+2021]: cosine similarities of the encoder hidden states with different offsets k ∈ {0, 100, 250, 500}; only the representation of SHAPE is invariant with k
Table 2 from [Kobayashi+2021]: BLEU scores on newstest2010-2016 (Valid / Test / Speed)
  VANILLA:      APE 23.61 / 30.46 / x1.00   RPE 23.67 / 30.54 / x0.91   SHAPE 23.63 / 30.49 / x1.01
  EXTRAPOLATE:  APE 22.18 / 29.22 / x1.00   RPE 22.97 / 29.86 / x0.91   SHAPE 22.96 / 29.80 / x0.99
Figure 3 from [Kobayashi+2021]: BLEU score improvements on the validation and test sets with respect to sentence length (gray: no training data)
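The invariance check behind Figure 2 can be sketched as follows; `encode` is a hypothetical helper standing in for a trained encoder run with a chosen position offset.

```python
# Sketch of the shift-invariance check: compare encoder states computed with
# offset 0 and with offset k via per-token cosine similarity.
import torch
import torch.nn.functional as F

def shift_invariance(encode, tokens, k: int) -> torch.Tensor:
    # `encode(tokens, offset)` is hypothetical: it returns (seq_len, d_model) hidden states.
    h0 = encode(tokens, offset=0)   # hidden states with unshifted positions
    hk = encode(tokens, offset=k)   # hidden states with all positions shifted by k
    # Values near 1.0 mean the representation ignores the absolute offset
    # (as SHAPE does); lower values mean it depends on absolute positions.
    return F.cosine_similarity(h0, hk, dim=-1)
```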
However, computation becomes expensive for long sequences
• Requires (sequence length × sequence length) computation for the attention scores
• When the array supplying K and V is the same as the one supplying Q, this is self-attention
Figure from [Jaegle+2022]: attention over queries Q, keys K, and values V producing the attention scores and the output array
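A bare-bones sketch (single head, no batching; the projection setup is illustrative) showing where the quadratic cost comes from: the score matrix QKᵀ has one entry per pair of positions.

```python
# Sketch of scaled dot-product self-attention; the (L, L) score matrix is the
# sequence_length x sequence_length computation mentioned above.
import math
import torch

def self_attention(x: torch.Tensor, wq, wk, wv) -> torch.Tensor:
    # x: (L, d); wq/wk/wv: (d, d) projection matrices (illustrative, single head)
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.transpose(0, 1) / math.sqrt(k.size(-1))  # (L, L) -- quadratic in L
    return torch.softmax(scores, dim=-1) @ v                # (L, d)

L, d = 1024, 64
x = torch.randn(L, d)
w = [torch.randn(d, d) for _ in range(3)]
out = self_attention(x, *w)  # the (L, L) score matrix dominates memory/compute for long L
```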
[Shazeer+2020]
• It is not clear why they work well, but…
• "We offer no explanation as to why these architectures seem to work"
Figure from [Hendrycks+2016]; Figure from [Narang+2021]
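For concreteness, a minimal sketch of one of the GLU feed-forward variants studied in [Shazeer+2020] (GEGLU); layer names and bias handling here are illustrative rather than the paper's exact formulation.

```python
# Sketch of a GEGLU feed-forward block: the FFN input is projected twice,
# and the GELU-activated projection gates the linear one.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GEGLUFeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff)  # gating branch (GELU-activated)
        self.w_up = nn.Linear(d_model, d_ff)    # linear branch
        self.w_down = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # FFN_GEGLU(x) = (GELU(x W_gate) * (x W_up)) W_down
        return self.w_down(F.gelu(self.w_gate(x)) * self.w_up(x))
```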
increasing network capacity by increasing embed dimension, FFN size, number of heads, and number of layers.
• "We find that using a larger FFN size (8192) gives a reasonable improvement in performance while maintaining a manageable network size"
• Tried and adopted for the WMT'20 submission [Kiyono+2020]
Table from [Ng+2019]
• In short:
• Improves performance for language modeling, but shows no effect for translation
Excerpt from [Press+2019]: a sandwich transformer with 2n sublayers (n of each type) conforms to the regular expression s^k (sf)^(n-k) f^k — the first k sublayers are purely self-attention (s), the last k are feedforward (f), and the interleaved pattern (sf) fills the middle; k = 0 recovers the original transformer
Table 3 from [Press+2019]: perplexity on the WikiText-103 test set (bold marks the baseline; lower is better)
  Baseline (Baevski and Auli, 2019)   18.70
  Transformer XL (Dai et al., 2019)   18.30
  kNN-LM (Khandelwal et al., 2019)    15.79
  Baseline (5 runs)                   18.63 ± 0.26
  Sandwich^16_6                       17.96
• The best sandwich transformer (k = 6) outperforms the interleaved baseline by roughly double the gap between the baseline and Transformer XL, at no extra cost in parameters, data, memory, or computation
Figures 5 and 6 from [Press+2019]: sandwich coefficient k vs. validation perplexity for k ∈ {1, ..., 15}, compared against the average and best baseline models
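The s^k (sf)^(n-k) f^k ordering quoted above is easy to reproduce; here is a small sketch in plain Python (the function name is illustrative).

```python
# Sketch of the sandwich ordering from [Press+2019]: 2n sublayers,
# n self-attention ('s') and n feed-forward ('f'), arranged as s^k (sf)^(n-k) f^k.
def sandwich_pattern(n: int, k: int) -> str:
    assert 0 <= k <= n - 1, "k ranges from 0 (interleaved) to n - 1"
    return "s" * k + "sf" * (n - k) + "f" * k

print(sandwich_pattern(n=16, k=0))  # baseline: sfsf...sf (fully interleaved)
print(sandwich_pattern(n=16, k=6))  # sandwich^16_6: 6 leading s's, 6 trailing f's
```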
• Performance is unaffected regardless of whether corpus filtering is applied
• (From the machine translation competition that Kiyono et al. participated in)
Table from [Kiyono+2020]: BLEU of each technique (BASE, tagged back-translation, fine-tuning, ensembling, R2L models, reranking) on En↔De and En↔Ja, and results on the WMT'20 test set
Table 7 from [Kiyono+2020]: effectiveness of corpus filtering on En→De — amount of synthetic data used, r (%), vs. newstest BLEU
  r (%)   2014   2018   2019
  100     33.0   48.0   42.0
  50      32.9   48.4   42.3
  33      33.1   47.9   42.2
  25      32.9   48.5   42.4