Hayato Tsukagoshi
August 22, 2023

# Hyena Hierarchy: Towards Larger Convolutional Language Models

2023-08-29: 第15回 最先端NLP勉強会 (15th Cutting-Edge NLP Study Group)


## Transcript

1. ### Hyena Hierarchy: Towards Larger Convolutional Language Models

Presenter: Hayato Tsukagoshi, D1, Graduate School of Informatics, Nagoya University, Japan. Paper: Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, Christopher Ré. ICML 2023.

2. ### Overview

3. ### Outline / Disclaimer

- State space models: the concept; expressing them as convolutions
- Hyena: prior work, motivation, evaluation experiments

Disclaimer: figures in the slides are quoted from the papers referenced on each slide; some symbols differ from the notation used in the papers.
4.–10. ### State Space Models (SSMs)

- A model that produces the output and the next state from the input and the current state
  - An RNN-like model
- (The slides build the diagram step by step: input x_{i-1} → state s_{i-1} → output y_{i-1}, then input x_i → state s_i → output y_i → state s_{i+1})

s_{i+1} = A s_i + B x_i
y_i = C s_i + D x_i
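The recurrence on the slide can be sketched directly in code. This is a minimal illustration, not the paper's implementation; the state dimension and random A, B, C, D are arbitrary choices.

```python
import numpy as np

# SSM recurrence from the slides:
#   s_{i+1} = A s_i + B x_i
#   y_i     = C s_i + D x_i
rng = np.random.default_rng(0)
d_s = 4                                  # state dimension (illustrative)
A = rng.normal(size=(d_s, d_s)) * 0.1
B = rng.normal(size=(d_s, 1))
C = rng.normal(size=(1, d_s))
D = rng.normal(size=(1, 1))

def ssm_recurrence(x):
    """Run the recurrence over a 1-D input sequence x, starting from s_0 = 0."""
    s = np.zeros((d_s, 1))
    ys = []
    for xi in x:
        ys.append((C @ s + D * xi).item())  # y_i = C s_i + D x_i
        s = A @ s + B * xi                  # s_{i+1} = A s_i + B x_i
    return np.array(ys)

y = ssm_recurrence(np.array([1.0, 2.0, 3.0, 4.0]))
```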

11.–14. ### SSMs: Unrolling the Computation

s_{i+1} = A s_i + B x_i
y_i = C s_i + D x_i
y_i = C (A s_{i-1} + B x_{i-1}) + D x_i
y_i = C (A (A s_{i-2} + B x_{i-2}) + B x_{i-1}) + D x_i
y_i = C (A (A (A s_{i-3} + B x_{i-3}) + B x_{i-2}) + B x_{i-1}) + D x_i

15.–18. ### SSMs: A Concrete Example

With s_0 = 0:

y_0 = D x_0
y_1 = C A^0 B x_0 + D x_1
y_2 = C A^1 B x_0 + C A^0 B x_1 + D x_2
y_3 = C A^2 B x_0 + C A^1 B x_1 + C A^0 B x_2 + D x_3
…
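The closed form on the slide (y_t = Σ_{j<t} C A^{t-1-j} B x_j + D x_t, with s_0 = 0) can be checked numerically against the recurrence. A small sketch with arbitrary random matrices:

```python
import numpy as np

# Verify the unrolled closed form against the step-by-step recurrence.
rng = np.random.default_rng(1)
d_s = 3
A = rng.normal(size=(d_s, d_s)) * 0.2
B = rng.normal(size=(d_s, 1))
C = rng.normal(size=(1, d_s))
D = rng.normal()
x = rng.normal(size=8)

# Recurrence with s_0 = 0.
s = np.zeros((d_s, 1))
y_rec = []
for xi in x:
    y_rec.append((C @ s).item() + D * xi)
    s = A @ s + B * xi

# Closed form: y_t = sum_{j < t} C A^{t-1-j} B x_j + D x_t.
y_closed = [sum((C @ np.linalg.matrix_power(A, t - 1 - j) @ B).item() * x[j]
                for j in range(t)) + D * x[t]
            for t in range(len(x))]

assert np.allclose(y_rec, y_closed)
```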

19.–23. ### SSMs: t = 0 … 3

(These slides step through the same diagram: at each step t, the filter taps D, C A^0 B, C A^1 B, C A^2 B connect the inputs x_0 … x_t to the output y_t.)

24. ### SSMs: The Whole Picture

The outputs y_0 … y_3 do not depend on one another → they can be computed in parallel. The same property Attention has.
25.–27. ### Expressing SSMs as Convolutions

y_0 = D x_0
y_1 = C A^0 B x_0 + D x_1
y_2 = C A^1 B x_0 + C A^0 B x_1 + D x_2
y_3 = C A^2 B x_0 + C A^1 B x_1 + C A^0 B x_2 + D x_3

Many of these computations look alike. Can we somehow speed this up?

28.–36. ### Expressing SSMs as Convolutions

Line up the C A^k B terms as a filter:

f = [ C A^0 B, C A^1 B, C A^2 B, …, C A^{N-1} B ]
x = [ x_0, x_1, x_2, …, x_{N-1} ]

( f ∗ x ) = [ C A^0 B x_0,
              C A^1 B x_0 + C A^0 B x_1,
              C A^2 B x_0 + C A^1 B x_1 + C A^0 B x_2,
              … ]

These entries map to y_1, y_2, y_3, …: an output sequence of the same length as the input. Pick the outputs out of the convolution result as y_N = ( f ∗ x )_{N-1} + D x_N.
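The filter construction above can be sketched with numpy; this is an illustration with arbitrary random matrices, checking that the convolution view reproduces the recurrence:

```python
import numpy as np

# Build f_i = C A^i B, convolve with x, and recover y_t = (f * x)_{t-1} + D x_t.
rng = np.random.default_rng(2)
d_s, N = 3, 6
A = rng.normal(size=(d_s, d_s)) * 0.2
B = rng.normal(size=(d_s, 1))
C = rng.normal(size=(1, d_s))
D = rng.normal()
x = rng.normal(size=N)

f = np.array([(C @ np.linalg.matrix_power(A, i) @ B).item() for i in range(N)])
conv = np.convolve(f, x)[:N]            # causal convolution, truncated to length N
y = np.concatenate(([D * x[0]],         # y_0 = D x_0
                    conv[:N - 1] + D * x[1:]))

# Same result from the recurrence with s_0 = 0.
s = np.zeros((d_s, 1))
y_rec = []
for xi in x:
    y_rec.append((C @ s).item() + D * xi)
    s = A @ s + B * xi
assert np.allclose(y, y_rec)
```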
37.–39. ### Speeding Up the Convolution with the FFT

A convolution can be expressed as the elementwise product of the Fourier transforms of the two sequences.

Ordinary convolution:
- Number of operations: N * (N + 1) / 2 → O(N^2)

Convolution via the fast Fourier transform:
- FFT of f and of x: O(N log N)
- Elementwise product of FFT(f) and FFT(x): O(N)
- Inverse FFT of the product: O(N log N)

All N outputs can be computed in O(N log N)! (Applying this to state space models in practice requires various additional assumptions.)
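The three FFT steps above can be written in a few lines of numpy. One detail the slide leaves implicit: the FFT computes a *circular* convolution, so both sequences must be zero-padded to length ≥ 2N − 1 for the result to match the ordinary (linear) convolution.

```python
import numpy as np

# O(N log N) causal convolution via the FFT, checked against np.convolve.
rng = np.random.default_rng(3)
N = 1024
f = rng.normal(size=N)                   # filter taps (stand-in for [C A^i B])
x = rng.normal(size=N)

L = 2 * N                                # padded length; any L >= 2N - 1 works
conv_fft = np.fft.irfft(np.fft.rfft(f, L) * np.fft.rfft(x, L), L)[:N]
conv_direct = np.convolve(f, x)[:N]      # O(N^2) reference
assert np.allclose(conv_fft, conv_direct)
```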


42.–45. ### The Properties Behind Attention: The Hyena Paper's Claim

Data-controlled linear operator
- The operation can depend on the input sequence itself (context-dependency) → S4 fails this

Sublinear parameter scaling
- The number of parameters does not depend on the input sequence length → MLP-Mixer fails this

Unrestricted context
- Relations between arbitrary pairs of tokens can be captured; we want unbounded context width → CNNs / Local Attention fail this

Local Attention: https://github.com/lucidrains/local-attention
46.–47. ### Prior Work: Structured State Space Sequence (S4)

- A pioneering deep learning model built on state space models
  - High performance on long-sequence, long-range-dependency tasks such as image (bit-sequence) classification
- Has no linear operation that depends on the input sequence
  - Lacks a QKV-like mechanism as in Attention, so its expressive power is comparatively weak

Gu+: Efficiently Modeling Long Sequences with Structured State Spaces. ICLR 2022 outstanding paper.
48.–51. ### Prior Work: Hungry Hungry Hippos (H3)

- SSMs are weak at language tasks, so let's improve them
- Problem: SSMs, S4 included, are poor at remembering and comparing tokens
- Imitates Attention's QKV with state space models
  - Mixes with SSMs, then mixes again in the style of Linear Attention
  - A combination of Linear Attention and SSMs
- On its own it cannot surpass the Transformer
  - Only a hybrid model with interleaved Attention layers reaches parity or better
  - The hybrid model's inference is slowed down by the Attention layers

Fu+: Hungry Hungry Hippos: Towards Language Modeling with State Space Models. ICLR 2023 spotlight.
52.–56. ### Detour: the QKV Computation in Linear Attention

- Compute Q and (KV) rather than (QK) and V
- Attention: (QK)V costs O(N^2 d); Linear Attention: Q(KV) costs O(N d^2) → much cheaper!
- Guaranteeing causality is hard, so several variants exist

Shen+: Efficient Attention: Attention with Linear Complexities. WACV 2021. https://github.com/lucidrains/linear-attention-transformer
Katharopoulos+: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. ICML 2020.
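The reordering on the slide is pure associativity: without a softmax in between, (QKᵀ)V and Q(KᵀV) are the same matrix, but the second never materializes the N × N attention matrix. A minimal numpy illustration (causal masking, which the slide notes is the hard part, is omitted):

```python
import numpy as np

# (Q K^T) V vs. Q (K^T V): identical outputs, very different costs when N >> d.
rng = np.random.default_rng(4)
N, d = 512, 16
Q = rng.normal(size=(N, d))
K = rng.normal(size=(N, d))
V = rng.normal(size=(N, d))

out_quadratic = (Q @ K.T) @ V    # builds an N x N matrix: O(N^2 d)
out_linear = Q @ (K.T @ V)       # builds a  d x d matrix: O(N d^2)
assert np.allclose(out_quadratic, out_linear)
```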

58.–63. ### H3: The Picture

- Elementwise product of SSM-mixed K and V → elementwise product of the SSM-mixed KV and Q
- K, V, Q, and the output Y are all of shape (N x d)
- Because the products are elementwise, causality is preserved
- The SSM mixing corresponds to the kernel in Linear Attention
- A devilish architecture: Linear Attention + SSMs

64.–65. ### H3: The Actual Computation

- In the actual computation a position-wise outer product replaces the elementwise product, giving KV of shape (N x d x d): e.g. Q_1 ∈ R^{1 x d}, KV_1 ∈ R^{d x d}, while K, V, Q remain (N x d)
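A simplified sketch of the elementwise variant described above. A causal depthwise convolution stands in for the SSM mixing (the real H3 uses shift and diagonal SSMs), and the decaying filter taps are an arbitrary illustrative choice:

```python
import numpy as np

# H3 pattern: KV = SSM(K) ⊙ V, then Y = Q ⊙ SSM(KV).
rng = np.random.default_rng(5)
N, d = 32, 8
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))

def causal_mix(U, taps):
    """Mix each channel along the sequence with a causal filter (SSM stand-in)."""
    out = np.zeros_like(U)
    for c in range(U.shape[1]):
        out[:, c] = np.convolve(taps, U[:, c])[:len(U)]
    return out

taps = np.exp(-0.5 * np.arange(N))   # decaying filter taps (assumption)
KV = causal_mix(K, taps) * V         # elementwise product keeps causality
Y = Q * causal_mix(KV, taps)         # second SSM mixing, gated by Q
```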

66. ### Hyena

67.–76. ### Hyena: The Picture

- Elementwise product of SSM-mixed K and V → elementwise product of the SSM-mixed KV (= K') and V' → …
  - Mix with an SSM m times
  - When m = 2 this coincides with H3
  - (It feels slightly like brute force)
- The diagram builds up: X → projections K_1, …, K_m and V_1, …, V_m → m SSM mixings → Y; the final V_m plays the role of H3's Q
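The m-fold mixing can be sketched as a loop over gated causal convolutions; with m = 2 projections this collapses to the H3 pattern on the earlier slides. The random projections and decaying filter taps are illustrative stand-ins, not the paper's parameterization:

```python
import numpy as np

# Hyena-style order-m recurrence: z <- k_j ⊙ causal_mix(z), repeated m times.
rng = np.random.default_rng(6)
N, d, m = 32, 8, 3
projs = [rng.normal(size=(N, d)) for _ in range(m + 1)]   # v, k_1, ..., k_m

def causal_mix(U, taps):
    """Causal per-channel convolution, standing in for the SSM / long filter."""
    out = np.zeros_like(U)
    for c in range(U.shape[1]):
        out[:, c] = np.convolve(taps, U[:, c])[:len(U)]
    return out

taps = np.exp(-0.3 * np.arange(N))   # decaying filter taps (assumption)
z = projs[0]                         # v
for k in projs[1:]:                  # m gated mixing steps
    z = k * causal_mix(z, taps)
y = z
```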
77. ### Hyena: On the Missing Q

- What H3's Q does is not so different from what a V does
  - Viewing H3's Q as a V_2, H3 coincides with Hyena at m = 2 (Hyena-2)
- H3: QKV
- Hyena-3: QKVV
78.–80. ### Hyena: The Convolution Filter

- The convolution filter f is expressed as a positional embedding + FFN + exponential decay
  - Generated on the fly for each input sequence, matched to its length

f = [ h_0, h_1, h_2, …, h_N ]
h_t = FFN(PositionalEncoding(t)) · Window(t)

- Similar in spirit to Multi-scale Retention and RoPE

Multi-scale Retention — Sun+: Retentive Network: A Successor to Transformer for Large Language Models. arXiv 2023.
RoPE — Su+: RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv 2021.
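The implicit filter formula can be sketched concretely. The sinusoidal encoding, tanh FFN, and exponential window below are illustrative assumptions; the point is only that h_t comes from a small network over the position index, so the parameter count is independent of the sequence length:

```python
import numpy as np

# h_t = FFN(PositionalEncoding(t)) * Window(t), evaluated for t = 0..N-1.
rng = np.random.default_rng(7)
N, d_pe, d_hidden = 64, 8, 16
W1 = rng.normal(size=(d_pe, d_hidden))   # FFN weights (illustrative)
W2 = rng.normal(size=(d_hidden, 1))

def positional_encoding(t):
    freqs = np.arange(1, d_pe // 2 + 1)
    return np.concatenate([np.sin(freqs * t / N), np.cos(freqs * t / N)])

def window(t, decay=0.05):
    return np.exp(-decay * t)            # exponential decay (assumption)

f = np.array([(np.tanh(positional_encoding(t) @ W1) @ W2).item() * window(t)
              for t in range(N)])        # filter of the same length as the input
```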

81.–82. ### Evaluation: the Pile

(Only the word "Pile" survives from these slides.)

83. ### Evaluation: SuperGLUE / Image Classification

- Performance on par with a Transformer base model of the same scale (SuperGLUE, 4-shot learning; image classification; RWKV is v4)
- The evaluation feels somewhat slapdash

85. ### Evaluation: Computation Time

- Smaller inference cost, especially for long sequences (plot: sequence length vs. inference time)
- Even though Hyena is based on state space models, its inference ends up O(n) per step, seemingly because of the implicit filter
86. ### Summary

- Proposes Hyena, a new architecture based on state space models
- Less computation than Attention, with performance on par with or better than the Transformer
  - The multiple convolution kernels resemble MHA and Multi-scale Retention
- Note that inference-time cost is worse than S4/H3

Impressions
- The evaluated models are small (mostly ≤ 355M)
  - A 1.3B model was apparently trained in preliminary experiments (cf. Appendix A.2)
  - I would like to see it evaluated at scale on SuperGLUE and the like, together with its scaling law
  - An evaluation on Long Range Arena (LRA), as a comparison against S4 and H3, would also have been welcome

LRA — Tay+: Long Range Arena: A Benchmark for Efficient Transformers. ICLR 2020.
87. ### Related Materials

- Hyena: a new machine-learning model aiming beyond the Transformer toward next-generation LLMs — Is Attention All You Need? Part 3
- Is Attention All You Need? Part 1 — S4, a new model that surpasses(?) the Transformer
- HyenaDNA: a new application of LLMs that reads the language of DNA
- [Journal club] Hyena Hierarchy: Towards Larger Convolutional Language Models
- The Annotated S4
- Hungry Hungry Hippos: Towards Language Modeling with State Space Models