Hyena Hierarchy: Towards Larger Convolutional Language Models
Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, Christopher Ré. ICML 2023.
Presenter: Hayato Tsukagoshi (… of Informatics, Nagoya University, Japan)
Prior work: Hungry Hungry Hippos (H3)
• On its own, H3 cannot surpass the Transformer
• Only a hybrid model that interleaves Attention layers reaches parity or better
• Inference speed of the hybrid model is dragged down by the Attention layers
Fu+: Hungry Hungry Hippos: Towards Language Modeling with State Space Models. ICLR 2023 spotlight.
QKV computation in Attention
• Attention: O(N²d) vs. Linear Attention: O(Nd²)
• Compute Q and (KV) instead of (QK) and V — the computation is much cheaper!
• Preserving causality is the hard part, so several variants exist
[Figure: matrix-multiplication diagrams contrasting Attention (N×N attention map) with Linear Attention (d×d key-value map)]
Shen+: Efficient Attention: Attention with Linear Complexities. WACV 2021. https://github.com/lucidrains/linear-attention-transformer
Katharopoulos+: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. ICML 2020.
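To make the reordering concrete, here is a minimal NumPy sketch (not the paper's or either cited repository's implementation) contrasting the two multiplication orders. It assumes the non-causal setting and the elu+1 feature map of Katharopoulos+ (2020); all function names and shapes are illustrative.

```python
# Sketch: standard attention builds an N x N map, linear attention a d x d map.
import numpy as np

def softmax_attention(Q, K, V):
    # (Q K^T) first: N x N scores -> O(N^2 d) time, O(N^2) memory
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, eps=1e-6):
    # Feature map phi(x) = elu(x) + 1 keeps entries positive (assumption
    # following Katharopoulos+ 2020).
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    Qf, Kf = phi(Q), phi(K)
    # (K^T V) first: d x d summary -> O(N d^2) time, O(d^2) memory
    KV = Kf.T @ V                     # (d, d)
    Z = Kf.sum(axis=0)                # (d,) normalizer
    return (Qf @ KV) / (Qf @ Z[:, None] + eps)

N, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = linear_attention(Q, K, V)       # (N, d), computed without an N x N map
```

The causal (autoregressive) case is what makes practical variants diverge: the running sums of K^T V must be accumulated prefix by prefix instead of in one matmul.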
Hyena: convolution filter
f = [h_0, h_1, h_2, …, h_N]
h_t = FFN(PositionalEncoding(t)) · Window(t)
(Annotation: Multi-scale Retention in RetNet follows a similar, RoPE-like idea.)
Sun+: Retentive Network: A Successor to Transformer for Large Language Models. arXiv 2023.
RoPE — Su+: RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv 2021.
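As a concrete illustration of the formula above, the sketch below builds such an implicitly parameterized filter in NumPy. The sinusoidal positional features, the tiny fixed-weight FFN, and the exponential window are stand-in assumptions for Hyena's learned components; the names pos_encoding, filter_ffn, and window are hypothetical.

```python
# Sketch: an implicit long-convolution filter whose taps h_t are generated
# by a small network, so parameter count does not grow with sequence length.
import numpy as np

rng = np.random.default_rng(0)
d_pe, d_hidden, seq_len = 16, 32, 256

def pos_encoding(t, d=d_pe):
    # Sinusoidal features of the filter position t (assumed frequency schedule)
    freqs = np.exp(-np.arange(d // 2))
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

# Tiny 2-layer FFN with fixed random weights, standing in for a learned MLP
W1 = rng.standard_normal((d_pe, d_hidden)) / np.sqrt(d_pe)
W2 = rng.standard_normal((d_hidden, 1)) / np.sqrt(d_hidden)
def filter_ffn(pe):
    return (np.tanh(pe @ W1) @ W2)[0]

def window(t, decay=0.02):
    # Exponential decay damps distant taps (assumed window shape)
    return np.exp(-decay * t)

# h_t = FFN(PositionalEncoding(t)) * Window(t); the filter spans the whole
# sequence, but only the FFN weights would be learned.
f = np.array([filter_ffn(pos_encoding(t)) * window(t) for t in range(seq_len)])
```

In Hyena this filter is then applied to the input with an FFT-based long convolution, replacing the attention mixing step.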
Summary
• A 1.3B-parameter model appears to have been trained in preliminary experiments (cf. Appendix A.2)
• I would like to see a serious evaluation on SuperGLUE together with a scaling-law analysis
• For comparison against S4 and H3, an evaluation on Long Range Arena (LRA) would have been welcome
LRA — Tay+: Long Range Arena: A Benchmark for Efficient Transformers. ICLR 2021.
Related materials
• Attention All You Need? Part 1: S4, a new model that (maybe) surpasses the Transformer
• HyenaDNA: a new application of LLMs to decoding the language of DNA
• [Journal club] Hyena Hierarchy: Towards Larger Convolutional Language Models
• The Annotated S4
• Hungry Hungry Hippos: Towards Language Modeling with State Space Models