[SNLP2022] ABC: Attention with Bounded-memory Control

ABC: Attention with Bounded-memory Control
Hao Peng, Jungo Kasai, Nikolaos Pappas, Dani Yogatama, Zhaofeng Wu, Lingpeng Kong, Roy Schwartz, Noah A. Smith
ACL 2022
Presenter: Sho Takase (Tokyo Institute of Technology), 2022/9/27

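Self-introduction
• 2008-2017: Tohoku University (B.S. through Ph.D.)
• 2017-2018: NTT Communication Science Laboratories (postdoc)
• 2018-2020: Tokyo Institute of Technology (researcher)
• 2020-2022: Tokyo Institute of Technology (assistant professor)
• Recent research interests
  – NLP tasks that involve generation (translation, summarization)
    • Transformer with controllable output length (NAACL 19)
    • Performance gains from pseudo-ensembles (ACL Findings 22)
  – Efficient neural models
    • Parameter-efficient embeddings (NeurIPS 20)
    • Regularization that is efficient with respect to training time (NAACL 21)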

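Representative neural-model research in NLP
• 2010: RNN language model [Mikolov+ 10]
• 2013: word2vec [Mikolov+ 13]
• 2014: LSTM language model [Zaremba+ 14], attention mechanism [Bahdanau+ 14]
• 2017: Transformer (self-attention) [Vaswani+ 17]
• A successor to the Transformer (self-attention-based methods) has not yet appeared, or at least is not yet clear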

Exploring next-generation models
• Alternatives to self-attention
  – MLP-based (MLP-Mixer) [Tolstikhin+ 21]
  – CNN [Gehring+ 17, Tay+ 21]
  – N-gram [Sun+ 21, Loem+ 22]
• Making the attention computation more efficient
  – Cost of self-attention: O(L^2)
    • L: input length
    • Strictly, O(dL^2) once the input vector dimension d is taken into account
  – Reducing the L^2 term
    • Routing Transformer: L^1.5 [Roy+ 21]
    • Strided attention: nL [Child+ 19]
    • Reformer: L log L [Kitaev+ 20]

Exploring next-generation models (continued)
• Target of this work: making the attention computation more efficient, i.e., reducing the O(L^2) cost of self-attention

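Summary of the paper
• Unifies methods for making attention efficient
  – Expresses existing efficiency methods in a single formulation
  – This is the most noteworthy part of the work
• Extends existing methods based on the unified formulation
  – Existing methods: what is attended to is context-independent
  – Proposed method: what is attended to is context-dependent
• Shows effectiveness on language modeling, machine translation, and other tasks
  – Higher performance than existing efficiency methods
  – Efficient while matching the standard Transformer
    • Everyone makes this kind of claim, so it is hard to judge...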

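Computing attention
• Attention between a query q and the key matrix K
• The cost is O(L)
  – These slides ignore d (strictly, the cost is O(dL))
• Self-attention has L queries, so the cost is O(L^2)
• Figure: the d-dimensional query q times the L × d key matrix K gives an attention vector of length L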
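
To make the cost argument concrete, here is a minimal NumPy sketch of the single-query case and the self-attention case. The function name `attention_weights`, the softmax normalization, and the toy sizes are illustrative assumptions, not code from the paper or the slides.

```python
import numpy as np

def attention_weights(q, K):
    """Attention vector of one query q (d,) over the key matrix K (L, d)."""
    scores = K @ q                    # L dot products: O(L) when d is ignored
    scores = scores - scores.max()    # subtract the max for numerical stability
    w = np.exp(scores)
    return w / w.sum()                # softmax over the L positions

rng = np.random.default_rng(0)
L, d = 6, 4                           # toy sizes (assumption)
K = rng.normal(size=(L, d))
Q = rng.normal(size=(L, d))           # self-attention: one query per position

one = attention_weights(Q[0], K)      # a single query costs O(L)
A = np.stack([attention_weights(q, K) for q in Q])  # L queries: O(L^2)
```

Each row of `A` is one attention vector of length L, so the full matrix has L × L entries, which is where the quadratic cost comes from.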

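[Important] Expressing the key matrix K as outer products
• K is the matrix that stacks the key vectors k_i of all positions
  – Let φ_i be the one-hot vector whose i-th element alone is 1; then K = Σ_{i=1}^{L} φ_i k_i^T
  – For example, when i = 4, the outer product of φ_4 = [0 0 0 1 0] (length L) and the transposed k_4 (length d) is an L × d matrix that is zero everywhere except the elements corresponding to k_4
  – Summing these outer products over i = 1 to L therefore yields K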
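
The identity above can be checked numerically. This is a small sketch with made-up sizes. The names `keys` and `Phi` are illustrative, and indices are 0-based in the code while the slide counts from 1.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 5, 3                            # toy sizes (assumption)
keys = rng.normal(size=(L, d))         # row i is the key vector k_i

Phi = np.eye(L)                        # row i is the one-hot vector phi_i

# Each outer product phi_i k_i^T is an L x d matrix that is zero except for row i
K = sum(np.outer(Phi[i], keys[i]) for i in range(L))

# The slide's example: i = 4 (index 3 here)
term4 = np.outer(Phi[3], keys[3])
```

With one-hot φ_i of length L, the sum simply places every k_i in its own row, so `K` equals the stacked key matrix.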

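[Important] φ controls K
• The dimensionality of φ controls the size of K
  – If φ_i has n dimensions, K = Σ φ_i k_i^T becomes an n × d matrix
• For example, let φ_i be a one-hot vector with n = 3
  – When i = 4: φ_4 = [0 1 0], and its outer product with the transposed k_4 is an n × d matrix
  – K (the sum of φ_i k_i^T) can then hold several k_i in one slot, e.g., k_4, k_1 + k_5, and k_2 + k_3
• The cost of attention for one query goes from O(L) to O(n)
  – For self-attention (L queries): O(nL)
• Existing efficient attention methods can be expressed in this form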
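
The following sketch illustrates the bounded memory with n = 3 slots and L = 5 tokens. The slot assignment `slot_of` is a hypothetical choice made to reproduce the contents shown on the slide (k_4 alone, k_1 + k_5, k_2 + k_3). The memory is stored here with one slot per row.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, n = 5, 4, 3                      # toy sizes (assumption)
keys = rng.normal(size=(L, d))         # row i is k_{i+1} in the slide's numbering

# Hypothetical slot assignment: tokens 1 and 5 -> slot 0, token 4 -> slot 1,
# tokens 2 and 3 -> slot 2, so phi_4 = [0, 1, 0] as on the slide.
slot_of = np.array([0, 2, 2, 1, 0])
Phi = np.eye(n)[slot_of]               # (L, n): row i is the n-dimensional one-hot phi_i

K_mem = Phi.T @ keys                   # equals sum_i phi_i k_i^T, shape (n, d)

q = rng.normal(size=d)
scores = K_mem @ q                     # n dot products per query instead of L
```

The query now scores n slots rather than L positions, which is the O(L) → O(n) reduction per query.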

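Example 1: Attend only to the most recent n tokens (e.g., strided attention [Child+ 19])
• Definition: the φ_i of the most recent n positions, placed side by side, form the identity matrix
  – All other φ_i are zero
  – For example, when i = 4 and n = 3: φ_1 = [0 0 0], φ_2 = [1 0 0], φ_3 = [0 1 0], φ_4 = [0 0 1]
  – Building K from these gives a matrix that contains only the most recent n keys (k_2, k_3, k_4)
• The paper states this in a more rigorous form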
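
A minimal sketch of this window construction, reproducing the slide's i = 4, n = 3 example. The helper `recent_phi` is a hypothetical name, and this is the simplified definition from the slide rather than the paper's more rigorous formulation.

```python
import numpy as np

def recent_phi(t, n, L):
    """Rows phi_1..phi_L when the current position is t (1-based).

    The last n positions up to t get the rows of the n x n identity;
    every other phi_i is the zero vector.
    """
    Phi = np.zeros((L, n))
    start = max(t - n, 0)              # 0-based index of the oldest position kept
    for slot, i in enumerate(range(start, t)):
        Phi[i, slot] = 1.0
    return Phi

rng = np.random.default_rng(0)
d = 4                                   # toy key dimension (assumption)
keys = rng.normal(size=(4, d))          # k_1 .. k_4

Phi = recent_phi(t=4, n=3, L=4)         # phi_1 = 0, phi_2..phi_4 = identity rows
K = Phi.T @ keys                        # rows are k_2, k_3, k_4
```

Because φ_1 is all zeros, k_1 contributes nothing to `K`, and only the three most recent keys remain.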

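Example 2: Keep the attention matrix at n × d (e.g., Linformer [Wang+ 20])
• Linformer: prepare a matrix W that maps L → n
  – Reduce the dimension with W before computing attention
  – The cost of attention for one query goes from O(L) to O(n)
• Learning W is the same as learning φ
• Figure: applying W to K (L × d) gives K' (n × d), and K' is what the attention computation uses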
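
The claim that learning W amounts to learning φ can be seen by taking φ_i to be the i-th column of W. The sketch below uses a random matrix as a stand-in for the learned projection. Shapes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, n = 8, 4, 3                      # toy sizes (assumption)
keys = rng.normal(size=(L, d))         # K: (L, d)
W = rng.normal(size=(n, L))            # stand-in for the learned L -> n projection

K_proj = W @ keys                      # Linformer-style compression: (n, d)

# The same matrix in the phi view: phi_i is the i-th column of W
K_abc = sum(np.outer(W[:, i], keys[i]) for i in range(L))
```

Both routes give the same n × d matrix, so the projection is one particular choice of (dense, input-independent) φ_i.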

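Recap so far
• Introduced φ in order to build the key matrix K from outer products
  – This can express existing efficient attention methods
    • The paper calls this framework ABC
    • This talk shows two examples; see the paper for more
• It also applies naturally to the decoder side
  – Some existing methods were considered applicable only to the encoder side
    • Strictly speaking, some methods were considered inefficient on the decoder side
    • Linformer [Wang+ 20]: reduces the L dimension of the matrix to n
      – The L → n computation has to be redone at every step
  – Here K is built as a sum of outer products, so it can be written as a recurrence
    • Efficiency does not drop when it is applied to the decoder side
    • The paper shows that any existing method can be applied to the decoder side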
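
The recurrence mentioned above can be sketched as follows, under the simplifying assumption that each φ_i is fixed once token i has been read. The random `Phi` is only a placeholder for whatever a particular ABC variant would supply.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, n = 6, 4, 3                      # toy sizes (assumption)
keys = rng.normal(size=(L, d))
Phi = rng.random(size=(L, n))          # placeholder phi_i (assumption)

K_mem = np.zeros((n, d))
history = []
for t in range(L):
    # K_t = K_{t-1} + phi_t k_t^T: an O(nd) update that does not grow with t
    K_mem = K_mem + np.outer(Phi[t], keys[t])
    history.append(K_mem.copy())
```

At every decoding step the memory stays n × d, and `history[t]` matches what recomputing the sum over the first t + 1 tokens from scratch would give.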

Extending existing methods
• In existing methods, φ is input-independent
  – Defined in advance: the most recent n tokens
  – Input-independent learned parameters: Linformer
• Make φ_i a vector that depends on the input
  – Because depending on the input seems likely to improve performance
  – With input x_i: α_i = exp(W_φ x_i), φ_i = α_i / Σ_{j=1}^{N} α_j (Eq. 7 in the paper)
    • N corresponds to L; when used in the decoder, the sum runs only up to the input position i
• This method is called ABC_MLP

From the paper (Section 4, Learned Memory Control): The ABC abstraction connects several existing approaches that would otherwise seem distinct. This inspires the design of new architectures. We hypothesize that learning a contextualized strategy can achieve better performance. This section introduces ABC_MLP. It parameterizes φ with a single-layer multi-layer perceptron (MLP) that takes as input the token's representation x_i, and determines which slots to write it into and how much. Matrix W_φ is learned. exp is an elementwise activation function. The motivation is to allow for storing a "fractional" (but never negative) amount of input into the memory.
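
A minimal sketch of the input-dependent control vectors, α_i = exp(W x_i) normalized over positions, followed by the resulting memory. The random `W_phi` stands in for the learned matrix. The decoder-side variant, which normalizes only over positions up to i, is one reading of the slide's note and should be treated as an assumption, not the paper's exact definition.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, n = 6, 4, 3                          # toy sizes (assumption)
X = rng.normal(size=(L, d))                # token representations x_i
keys = rng.normal(size=(L, d))             # key vectors k_i
W_phi = 0.1 * rng.normal(size=(n, d))      # stand-in for the learned matrix

alpha = np.exp(X @ W_phi.T)                # (L, n), elementwise exp: always positive

# Encoder-style: normalize each slot over all N = L positions
Phi = alpha / alpha.sum(axis=0, keepdims=True)
K_mem = Phi.T @ keys                       # (n, d) bounded-size memory

# Decoder-style (assumption): phi_i = alpha_i / sum_{j <= i} alpha_j
Phi_causal = alpha / np.cumsum(alpha, axis=0)
# Memory after each step, K_i = K_{i-1} + phi_i k_i^T, stacked as (L, n, d)
K_steps = np.cumsum(Phi_causal[:, :, None] * keys[:, None, :], axis=0)
```

Because exp is positive, every token writes a fractional, non-negative amount into each of the n slots, and the memory size does not depend on L.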

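Experiments
• Compare performance and efficiency against existing methods (especially Linformer)
  – The paper covers language modeling, machine translation, and pretraining
    • This talk covers machine translation and pretraining
  – Machine translation
    • Sentence-level: WMT 14 EnDe (a standard benchmark)
    • Document-level: IWSLT 14 EsEn
      – Translating 4 sentences into 4 sentences
  – Pretraining
    • Pretrain with masked LM → fine-tune
      – In detail: initialize with RoBERTa → pretrain → fine-tune
    • Evaluate on (part of) the GLUE benchmark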

Machine translation results
• Efficient attention is not applied on the encoder side
  – The encoder side is efficient even without it
  – Perhaps applying it on the encoder side would lower performance?
• Matches the performance of standard attention (Base)
  – Linformer performs worse on sentences, and its training fails on documents

Table 4: Machine translation test SacreBLEU. Left: sentence-level translation with WMT14 EN-DE; right: document-level translation with IWSLT14 ES-EN.

(a) Sentence-level translation. Bolded number outperforms BASE.
Model      Cross n  Causal n  BLEU
BASE       -        -         27.2
ABC_RD     32       32        25.7
ABC_RD     64       64        26.2
Linformer  32       32        26.6
Linformer  64       64        26.7
ABC_MLP    32       8         27.1
ABC_MLP    32       32        27.3

(b) Document-level translation. Linformer fails to converge even with multiple random seeds. Bold number performs the best among ABC models.
Model      Cross n  Causal n  BLEU
BASE       -        -         39.9
Linformer  128      64        -
ABC_RD     128      64        38.6
ABC_MLP    128      64        39.7

From the paper's discussion of Table 4 (partly cut off on the slide):
• On WMT14 EN-DE, ABC_MLP performs on par with BASE with either 32-32 or 32-8 cross-causal memory sizes
• Even with smaller memory sizes, it outperforms other ABC variants by more than 1.1 BLEU
• Unlike the trend in the language modeling experiment (§5.1), Linformer outperforms ABC_RD by more than 0.5 BLEU, which the authors attribute to the smaller sequence lengths of this dataset
• The trend is similar on document-level translation with IWSLT14 ES-EN (Table 4b), except that ABC_MLP slightly underperforms BASE
• This suggests that even with longer sequences, ABC_MLP is effective despite its bounded memory size
• Data: WMT14 EN-DE (Bojar et al., 2014), with preprocessing and data splits following Vaswani et al. (2017); IWSLT14 ES-EN (Cettolo et al., 2014)

Results for pretraining → fine-tuning
• Matches the performance of standard attention (Base)

Model      n    MNLI  QNLI  QQP   SST   Avg.
BASE       -    87.2  92.4  91.7  94.3  91.4
Linformer  64   85.3  91.8  90.8  92.4  90.1
Linformer  128  86.1  91.9  91.4  93.7  90.8
ABC_MLP    64   85.6  91.8  91.7  93.8  90.7
ABC_MLP    128  87.1  92.6  91.8  94.4  91.5

Table 5: Text classification development set accuracy. All models continue pretraining RoBERTa-base on our data with the MLM objective. Bold numbers perform the best among ABC models, and underlined ones per[...]

Efficiency
• Slightly less efficient than Linformer
• Taking performance into account, it looks better than Linformer
• The numbers are the cost of encoding 512 tokens at inference time

        BASE  Linformer     ABC_MLP
n       -     64     128    64     128
Speed   1.0×  1.7×   1.5×   1.5×   1.3×
Memory  1.0×  0.5×   0.6×   0.5×   0.6×

Table 6: Text encoding inference speed (higher is better) and memory (lower is better). Inputs are text segments with 512 tokens and batch size 16.

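Summary of the paper
• Unifies methods for making attention efficient
  – Expresses existing efficiency methods in a single formulation (ABC)
  – This is the most noteworthy part of the work
• Extends existing methods based on the unified formulation
  – Existing methods: what is attended to is context-independent
  – Proposed method: what is attended to is context-dependent
• Shows effectiveness on language modeling, machine translation, and other tasks
  – Higher performance than existing efficiency methods
  – Efficient while matching the standard Transformer
    • Everyone makes this kind of claim, so it is hard to judge...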

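Bonus: architectures other than self-attention
• CNN-based
  – In the pretraining → fine-tuning setting, CNNs perform on par with self-attention [Tay+ 21]
• N-gram-based
  – Replacing the first Transformer layer with a neural N-gram does not hurt performance [Sun+ 21]
  – Making the neural N-gram multi-head gives performance on par with self-attention [Loem+ 22]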

References
• Listed in the order they are mentioned in the slides
• Mikolov+ 10: Recurrent neural network based language model
• Mikolov+ 13: Distributed Representations of Words and Phrases and their Compositionality
• Zaremba+ 14: Recurrent Neural Network Regularization
• Bahdanau+ 14: Neural Machine Translation by Jointly Learning to Align and Translate
• Vaswani+ 17: Attention Is All You Need
• Tolstikhin+ 21: MLP-Mixer: An all-MLP Architecture for Vision
• Gehring+ 17: Convolutional Sequence to Sequence Learning
• Tay+ 21: Are Pre-trained Convolutions Better than Pre-trained Transformers?
• Sun+ 21: Revisiting Simple Neural Probabilistic Language Models
• Loem+ 22: Are Neighbors Enough? Multi-Head Neural n-gram can be Alternative to Self-attention
• Roy+ 21: Efficient Content-Based Sparse Attention with Routing Transformers
• Child+ 19: Generating Long Sequences with Sparse Transformers
• Kitaev+ 20: Reformer: The Efficient Transformer
• Wang+ 20: Linformer: Self-Attention with Linear Complexity
