
SNLP2020_sandwich

Sho Takase
September 16, 2020
Transcript

  1. Improving Transformer Models by Reordering their Sublayers
     Ofir Press, Noah A. Smith, Omer Levy (ACL 2020)
     Presenter: Sho Takase (Tokyo Institute of Technology), 2020/9/25
     Figures and tables are taken from the paper and from [Vaswani+ 17].
  2. Overview
     • Reorders the self-attention (s) and feedforward (f) sublayers of
       the Transformer [Vaswani+ 17]
     • Finding: for language modeling, putting more s at the bottom and
       more f at the top works better
       – No effect on translation
     [Figure 1 from the paper: a transformer (a) is composed of
      interleaved self-attention and feedforward sublayers,
      sfsf...sf; the sandwich transformer (b), a reordering of those
      sublayers, sss...sfsf...fff, performs better on language
      modeling. Input flows from left to right.]
     Original: interleaved (sf repeated).
     Sandwich type: s-heavy at the bottom, f-heavy at the top
     (performs well on language modeling).
  3. Where this work stands
     • One of a line of studies rethinking the Transformer architecture
       – Many discuss the placement of layer normalization
         [Wang+ 19, Nguyen+ 19, Xiong+ 19]
       – Others rethink how the attention matrix is built [Tay+ 20]
     • An extremely empirical paper
       – Even the paper admits it cannot explain why performance improves:
         "At the time of writing, we do not have an explanation for why
         sublayer reordering improves performance"
     • Personally I like it, but it feels like a one-off trick
       – Hard to say what general lesson can be read out of it...
  4. First experiments
     • With the parameter count held fixed:
       – Randomly permute the order of the s and f sublayers
       – Also try randomly changing the numbers of s and f
         (normally the ratio is 1:1)
     • Run language modeling and inspect the results
       – Experiments on WikiText-103 (about 100M tokens)
       – This gives the intuition that placing s mostly at the bottom
         and f mostly at the top is good
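The random search on this slide can be sketched as follows (a hypothetical helper, not the authors' code; it only covers the fixed-count case — when the numbers of s and f also vary, they must be chosen so the total parameter count still matches the baseline):

```python
import random

def random_ordering(n_s=16, n_f=16, seed=None):
    """Sample a random ordering of a fixed multiset of sublayers.

    's' = self-attention, 'f' = feedforward. Keeping the number of
    each sublayer type fixed keeps the parameter count fixed.
    """
    rng = random.Random(seed)
    layers = ["s"] * n_s + ["f"] * n_f
    rng.shuffle(layers)
    return "".join(layers)

order = random_ordering(seed=0)
print(order)       # some permutation of 16 s's and 16 f's
print(len(order))  # 32
```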
  5. Results of random reordering
     Bold: baseline (standard order) results.
     Red box: results better than the baseline average (18.65).
     [Table 1 from the paper: 20 randomly generated models with 16
      self-attention (s) and 16 feedforward (f) sublayers, plus the 5
      baselines (the standard transformer sfsf...sf trained with
      different random seeds), with their perplexities on the
      WikiText-103 development set; PPL ranges from 20.74 down to 18.19.]
     [Table 2 from the paper: randomly generated models with the same
      number of parameters as the baseline (sublayer counts also vary),
      with PPL from 22.80 down to 18.12; baselines in bold.]
     [Figure 2 from the paper: dev-set perplexities of the 20 randomly
      generated models and of the 5 baselines.]
     Models are written as strings, with s and f denoting self-attention
     and feedforward sublayers; a three-layer transformer, for example,
     is sfsfsf, with computation flowing from input on the left to
     output on the right. Any string in the regular language (s|f)*
     defines a valid network.
     Left table: order randomly permuted.
     Right table: sublayer counts also randomly changed.
     About 1/3 of the randomly reordered models beat the baseline
     → perhaps a good ordering exists?
  6. What structure should we use?
     • For the language-modeling results:
       – Split the models by the average PPL of the standard Transformer
       – Split each model into top and bottom halves by parameter count
         and count the s and f sublayers in each half
     [Panel labels: bottom half of the parameters / top half of the
      parameters; worse than average / better than average.]
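The counting described on this slide can be sketched as below (a hypothetical helper, not the authors' code; the relative sublayer sizes are an illustrative assumption):

```python
def half_counts(order, weights=(1.0, 2.0)):
    """Count the s and f sublayers falling in the bottom half of a
    model, where 'half' is measured in parameters, not in sublayers.

    weights = (params per s sublayer, params per f sublayer), in
    arbitrary units; the 1:2 ratio here is an illustrative assumption.
    """
    w = {"s": weights[0], "f": weights[1]}
    total = sum(w[c] for c in order)
    running, counts = 0.0, {"s": 0, "f": 0}
    for c in order:  # input-to-output order, bottom first
        running += w[c]
        if running <= total / 2:
            counts[c] += 1
    return counts["s"], counts["f"]

print(half_counts("ssff"))  # (2, 0): both s sublayers sit in the bottom half
```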
  7. Sandwich Transformer
     • Placing s mostly at the bottom and f mostly at the top seems good
       – With n sublayers of each type, the pattern is written
         s^k (sf)^(n-k) f^k
       – k is called the sandwich coefficient
     From the paper: a model of 16 self-attention sublayers followed by
     16 feedforward sublayers (s^16 f^16) achieves 18.82 perplexity,
     comparable to the baseline with the same number of parameters.
     Generalizing this model and the original interleaved transformer
     gives the family of sandwich transformers. A sandwich transformer
     with coefficient k consists of 2n sublayers in total (n of each
     type), conforming to the regular expression s^k (sf)^(n-k) f^k:
     the first k sublayers are purely self-attention (s), the last k are
     feedforward (f), and in between the original interleaving pattern
     (sf) fills the remaining 2(n-k) sublayers. When k = 0 we recover
     the original transformer, and when k = n-1 (its maximal value) we
     get the s^n f^n model. Sandwich transformers are trained for n = 16.
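The pattern s^k (sf)^(n-k) f^k above can be generated with a one-line string expression (a sketch, not the authors' code):

```python
def sandwich(n, k):
    """Sublayer pattern of a sandwich transformer: s^k (sf)^(n-k) f^k.

    2n sublayers in total (n of each type). k = 0 gives the standard
    interleaved transformer (sf)^n; k = n-1, its maximum, gives s^n f^n.
    """
    assert 0 <= k <= n - 1, "sandwich coefficient out of range"
    return "s" * k + "sf" * (n - k) + "f" * k

print(sandwich(3, 0))  # sfsfsf  (interleaved baseline)
print(sandwich(3, 2))  # sssfff
```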
  8. Experiments on WikiText-103
     [Figure 5 from the paper: the transformer's sandwich coefficient k
      vs. validation perplexity, for k in {1, ..., 15}.]
     Results as the sandwich coefficient k varies; the horizontal lines
     are the 5-run mean and the best value of the standard Transformer.
  9. Experiments on other data
     • Language modeling: performance improves
     • Translation (WMT 14 En-De): no effect

     Language modeling on the Toronto Books Corpus (Table 4 from the
     paper; test-set perplexity; the baseline [Baevski and Auli, 2019]
     is trained over 5 random seeds, the sandwich coefficient is tuned
     on the validation set, and the model is run on the test set only
     once):
       Model                              PPL
       Baseline (5 runs)                  11.89 ± 0.35
       kNN-LM (Khandelwal et al., 2019)   10.89
       Sandwich^16_7                      10.83
     The sandwich transformer is applied to this different domain while
     retaining the other architectural aspects and hyperparameter
     settings of Baevski and Auli (2019). The Toronto Books Corpus
     (Zhu et al., 2015), roughly 700M tokens, was previously used to
     train GPT (Radford et al., 2018) and, combined with Wikipedia,
     BERT (Devlin et al., 2019).

     Character-level language modeling (Table 5 from the paper; BPC on
     the text8 and enwik8 test sets; the baseline [Sukhbaatar et al.,
     2019] is trained over 5 random seeds, and the sandwich coefficient
     is tuned on each benchmark's validation set):
       Model                                     text8            enwik8
       Transformer-XL (Dai et al., 2019)         1.08             0.99
       Adaptive Span (Sukhbaatar et al., 2019)   1.07             0.98
       Compressive (Rae et al., 2020)            —                0.97
       Baseline (Adaptive Span; 5 runs)          1.0802 ± 0.0103  0.9752 ± 0.0008
       Sandwich^24_3                             1.076            —
       Sandwich^24_5                             —                0.968
     The adaptive span model learns to control each attention head's
     maximal attention span, freeing up memory in the bottom layers
     (which typically need very short spans) and applying it to the top
     layers, whose heads can then reach significantly longer distances;
     this efficient use of attention also yields a significant speed
     boost. The sandwich coefficient is tuned on the development set for
     k in {1, ..., 8} (the baseline has 24 transformer layers), with no
     other hyperparameters modified. On text8 the sandwich transformer
     performs within the baseline's random-seed variance; on enwik8 it
     gains about 0.007 bits-per-character, matching the state of the art
     obtained by the Transformer-XL-based Compressive Transformer of
     Rae et al. (2020).

     Translation (WMT 14 En-De; Table 6 from the paper, BLEU on
     newstest2014). Self-attention (s) and cross-attention (c)
     sublayers are treated as a single unit (sc) for reordering; for
     example, a three-layer decoder (scfscfscf) with a sandwich
     coefficient of 1 becomes scscfscff. The sandwich pattern is applied
     to either the encoder or the decoder separately, keeping the other
     stack in its original interleaved pattern; note that this naively
     groups self- and cross-attention sublayers together. The baseline
     is the large transformer (6 encoder/decoder layers, embedding size
     1024, feedforward inner dimension 4096, 16 attention heads) with
     the hyperparameters of Ott et al. (2018), trained on the WMT 2014
     En-De dataset, validated on newstest13, and trained 5 times with
     different random seeds.
       Sandwich       Encoder     Decoder
       coefficient    sandwich    sandwich
       0 (Baseline)      28.74 ± 0.15
       1               28.71       28.64
       2               28.71       28.56
       3               28.81       28.67
       4               28.48       28.66
       5               28.45       28.76
  10. Summary
      • Reorders the self-attention (s) and feedforward (f) sublayers
        of the Transformer
      • Finding: for language modeling, placing more s at the bottom and
        more f at the top performs better
        – Seems consistently good for language modeling
        – No effect on translation
      • Unclear whether this applies to arbitrary tasks
      • Go find the best ordering yourselves!
        – "we hope that future research continues this line of work by
          looking into optimal sublayer ordering for other tasks, such
          as translation, question answering, and classification."