Hayato Tsukagoshi
August 22, 2023

# Hyena Hierarchy: Towards Larger Convolutional Language Models

2023-08-29: The 15th Advanced NLP Study Group (第15回 最先端NLP勉強会)

## Transcript

1. Hyena Hierarchy: Towards Larger Convolutional Language Models
Hayato Tsukagoshi
D1, Graduate School of Informatics, Nagoya University, Japan
Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, Christopher Ré
ICML 2023

2. Overview
• Proposes Hyena, an architecture based on state space models (SSMs)
• Lower computational cost than Attention: O(N log N)
• The first Attention-free model to match or exceed Attention's performance
• A "demon fusion" of state space models and the Linear Transformer

3. Outline / Disclaimer
• State space models
  • Concept
  • Expressing them as a convolution
• Hyena
  • Prior work
  • Intuition
  • Evaluation experiments

Disclaimer
• Figures on the slides are quoted from the papers cited on each slide
• Some symbols may differ from the notation used in the papers

4-10. State Space Models (SSMs)
• A model that produces an output and the next state from the input and the current state
  • An RNN-like model

s_{i+1} = A s_i + B x_i
y_i = C s_i + D x_i

(Diagram, built up across slides 4-10: input x_{i-1} and state s_{i-1} produce output y_{i-1}; input x_i and state s_i produce output y_i; the recurrence then yields state s_{i+1}.)
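The recurrence above can be sketched in a few lines of Python. This is a toy scalar SSM for illustration only (in practice A, B, C, D are matrices, and this is not the paper's implementation; the function name and parameter values are made up):

```python
# Toy scalar state space model: s_{i+1} = A*s_i + B*x_i, y_i = C*s_i + D*x_i.
def ssm_recurrent(x, A, B, C, D, s0=0.0):
    s, ys = s0, []
    for xi in x:
        ys.append(C * s + D * xi)  # output from the current state and input
        s = A * s + B * xi         # next state
    return ys

# An impulse input shows the exponentially decaying "memory" of the state.
print(ssm_recurrent([1.0, 0.0, 0.0, 0.0], A=0.5, B=1.0, C=1.0, D=0.0))
# → [0.0, 1.0, 0.5, 0.25]
```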

11-14. State Space Models: Unrolling the Computation
s_{i+1} = A s_i + B x_i
y_i = C s_i + D x_i

Substituting the recurrence into itself:
y_i = C (A s_{i-1} + B x_{i-1}) + D x_i
y_i = C (A (A s_{i-2} + B x_{i-2}) + B x_{i-1}) + D x_i
y_i = C (A (A (A s_{i-3} + B x_{i-3}) + B x_{i-2}) + B x_{i-1}) + D x_i

15-18. State Space Models: Concrete Example
y_0 = D x_0
y_1 = C A^0 B x_0 + D x_1
y_2 = C A^1 B x_0 + C A^0 B x_1 + D x_2
y_3 = C A^2 B x_0 + C A^1 B x_1 + C A^0 B x_2 + D x_3

19-22. State Space Models: t = 0 through t = 3
(Diagrams: each output y_t is formed from the inputs x_0 … x_t, with weight D on x_t and weights C A^{t-1-k} B on the earlier inputs x_k.)

23-24. State Space Models: Overall Picture
(Diagram: every output y_0 … y_3 is computed directly from the inputs x_0 … x_3.)
• No dependencies among the outputs → computable in parallel
• The same property as Attention

25-27. Expressing SSMs as a Convolution
y_0 = D x_0
y_1 = C A^0 B x_0 + D x_1
y_2 = C A^1 B x_0 + C A^0 B x_1 + D x_2
y_3 = C A^2 B x_0 + C A^1 B x_1 + C A^0 B x_2 + D x_3

Many of these computations look alike. Can we somehow speed them up?

28-36. Expressing SSMs as a Convolution
Line up the C A^k B terms into a filter:

f = [ C A^0 B, C A^1 B, C A^2 B, …, C A^{N-1} B ]
x = [ x_0, x_1, x_2, …, x_{N-1} ]

( f ∗ x ) = [ C A^0 B x_0,                              → y_1
              C A^1 B x_0 + C A^0 B x_1,                → y_2
              C A^2 B x_0 + C A^1 B x_1 + C A^0 B x_2,  → y_3
              … ]

• An output sequence of the same length as the input
• Each output picks up one entry of the convolution result:

y_N = ( f ∗ x )_{N-1} + D x_N
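As a sanity check, the filter form can be verified against the recurrence. This is a toy scalar sketch (not from the paper; function names and parameter values are illustrative):

```python
# Convolution form of the SSM: y_i = (f * x)_{i-1} + D*x_i
# with filter f = [CA^0B, CA^1B, ..., CA^{N-1}B]  (scalars for clarity).
def ssm_conv(x, A, B, C, D):
    N = len(x)
    f = [C * A**k * B for k in range(N)]
    y = []
    for i in range(N):
        conv = sum(f[k] * x[i - 1 - k] for k in range(i))  # (f * x)_{i-1}
        y.append(conv + D * x[i])
    return y

# Reference implementation via the step-by-step recurrence.
def ssm_recurrent(x, A, B, C, D, s=0.0):
    ys = []
    for xi in x:
        ys.append(C * s + D * xi)
        s = A * s + B * xi
    return ys

xs = [1.0, -2.0, 3.0, 0.5]
assert all(abs(a - b) < 1e-9
           for a, b in zip(ssm_conv(xs, 0.9, 1.0, 0.7, 0.1),
                           ssm_recurrent(xs, 0.9, 1.0, 0.7, 0.1)))
```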

37-39. Speeding Up the Convolution with the Fast Fourier Transform
• A convolution can be expressed as the elementwise product of the Fourier transforms of the two sequences

Naive convolution
• Number of multiply-adds: N (N + 1) / 2 → O(N^2)

FFT-based convolution
• FFT of f and of x: O(N log N)
• Elementwise product of FFT(f) and FFT(x): O(N)
• Inverse FFT of the product: O(N log N)

All N outputs can be computed in O(N log N)!
(In practice, applying this to state space models requires various additional assumptions.)
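The three FFT steps above can be sketched end to end in pure Python. This is a toy radix-2 implementation for illustration only (real SSM kernels need more care; all names here are made up):

```python
import cmath

def fft(a):  # radix-2 Cooley-Tukey FFT; len(a) must be a power of two
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2]), fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def ifft(a):  # inverse FFT via conjugation
    y = fft([z.conjugate() for z in a])
    return [z.conjugate() / len(a) for z in y]

def fft_conv(f, x):
    n = 1
    while n < len(f) + len(x) - 1:  # zero-pad so circular conv == linear conv
        n *= 2
    F = fft(f + [0.0] * (n - len(f)))
    X = fft(x + [0.0] * (n - len(x)))
    y = ifft([a * b for a, b in zip(F, X)])  # elementwise product: O(n)
    return [v.real for v in y[:len(x)]]      # keep the causal prefix

# Convolving with a unit impulse returns the filter itself.
out = fft_conv([1.0, 0.5, 0.25, 0.125], [1.0, 0.0, 0.0, 0.0])
```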

40. Summary: State Space Models (SSMs) and Deep Learning
• A model that produces an output and the next state from the input and the current state
  • In some cases the computation can be made cheaper
• A mechanism that mixes a sequence of vectors and outputs a sequence of vectors
  • Can do something similar to a Transformer

41. Hyena

42-45. Properties Behind Attention: the Hyena Paper's Claim
Data-controlled linear operator
• Enables operations that depend on the input sequence itself (context dependency)
  • S4 fails this

Sublinear parameter scaling
• The number of parameters does not depend on the input sequence length
  • MLP-Mixer fails this

Unrestricted context
• Can capture relations between arbitrary pairs of tokens
  • We want an effectively unlimited context width
  • CNNs / Local Attention fail this

Local Attention: https://github.com/lucidrains/local-attention

46-47. Prior Work: Structured State Space Sequence (S4)
• A pioneering deep learning model based on state space models
  • Strong on long-sequence, long-range-dependency tasks such as image (bit-sequence) classification
• Has no linear operation that depends on the input sequence
  • No QKV-like mechanism as in Attention, so its expressive power is comparatively weak

Gu+: Efficiently Modeling Long Sequences with Structured State Spaces. ICLR 2022 outstanding paper.

48-51. Prior Work: Hungry Hungry Hippos (H3)
• "SSMs are weak at language tasks, so let's fix that"
• Problem: SSMs, including S4, are poor at recalling and comparing tokens
• Mimics Attention's QKV with state space models
  • Mix with an SSM, then mix again Linear-Attention-style
  • A combination of Linear Attention and SSMs
• On its own it cannot beat the Transformer
  • Only a hybrid model with interleaved Attention layers matches or exceeds it
  • The hybrid model's inference is dragged down by Attention and is slow

Fu+: Hungry Hungry Hippos: Towards Language Modeling with State Space Models. ICLR 2023 spotlight.

52-56. Detour: the QKV Computation in Linear Attention
• Compute Q (Kᵀ V) instead of (Q Kᵀ) V
  • Attention: (Q Kᵀ) V builds an N × N intermediate → O(N² d)
  • Linear Attention: Q (Kᵀ V) builds only a d × d intermediate → O(N d²)
  • The computation is much lighter!
• Guaranteeing causality is tricky, so several variants exist

Shen+: Efficient Attention: Attention with Linear Complexities. WACV 2021.
Katharopoulos+: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. ICML 2020.
https://github.com/lucidrains/linear-attention-transformer
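The associativity trick behind this can be checked in a few lines. A toy sketch with plain nested lists (no softmax or feature map, which is exactly what plain Attention has and Linear Attention drops; all names are illustrative):

```python
import random

# (Q K^T) V == Q (K^T V): the left builds an N x N intermediate (O(N^2 d)),
# the right only a d x d intermediate (O(N d^2)).
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

random.seed(0)
N, d = 6, 3
Q, K, V = ([[random.random() for _ in range(d)] for _ in range(N)] for _ in range(3))

left = matmul(matmul(Q, transpose(K)), V)   # (Q K^T) V : N x N intermediate
right = matmul(Q, matmul(transpose(K), V))  # Q (K^T V) : d x d intermediate
assert all(abs(a - b) < 1e-9 for ra, rb in zip(left, right) for a, b in zip(ra, rb))
```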

57-63. H3: the Idea
• Elementwise product of SSM-mixed K with V → then elementwise product of the SSM-mixed KV with Q

(Diagram: X is projected to Q, K, V, each N × d; K is mixed by an SSM and multiplied elementwise with V to form KV; KV is mixed by another SSM and multiplied elementwise with Q to give Y.)

• Elementwise products preserve causality
• The SSMs play the role of Linear Attention's kernel
• A devilish architecture: Linear Attention + SSMs

64-65. H3: the Actual Computation
• A per-position outer product of K (N × d) and V (N × d) gives KV (N × d × d); the elementwise step then multiplies Q against it per position: Q_1 ∈ R^{1 × d}, KV_1 ∈ R^{d × d}

(Diagram: X → Q, K, V; K is SSM-mixed; the outer product with V forms KV; another SSM and the product with Q give Y.)

66. Hyena
• A truly Attention-free model built with SSMs
• Repeatedly performs SSM-based token mixing
  • Can be viewed as H3 generalized in the number of KV operations

67-76. Hyena: the Idea
• Elementwise product of SSM-mixed K and V → elementwise product of SSM-mixed KV (= K') with V' → …
  • Mix with an SSM m times
  • m = 2 coincides with H3

(Diagram: X is projected to K_1 … K_m and V_1; each stage applies an SSM and an elementwise gate, V_1 → V_2 → … → V_m → Y; V_m corresponds to H3's Q.)

• It feels slightly ad hoc
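The order-m recurrence sketched above can be written as a toy loop. This is a deliberate simplification (the names `causal_conv` / `hyena_operator`, the O(N²) convolution, and the scalar sequences are all illustrative; the real model uses FFT-based long convolutions and learned projections):

```python
# Hyena-style recurrence: v <- k_i (elementwise *) causal_conv(h_i, v),
# repeated for i = 1..m. m = 2 gives an H3-like QKV form.
def causal_conv(h, x):  # (h * x)_t = sum_{k<=t} h[k] * x[t-k], reference O(N^2)
    return [sum(h[k] * x[t - k] for k in range(t + 1)) for t in range(len(x))]

def hyena_operator(gates, filters, v):
    for k, h in zip(gates, filters):
        u = causal_conv(h, v)                   # token mixing via a long filter
        v = [ki * ui for ki, ui in zip(k, u)]   # elementwise gate keeps causality
    return v

# With an identity filter [1, 0, 0] the output is just the gated input.
y = hyena_operator([[1.0, 1.0, 1.0]], [[1.0, 0.0, 0.0]], [1.0, 2.0, 3.0])
# → [1.0, 2.0, 3.0]
```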

77. Hyena: About the Missing Q
• H3's Q does much the same job as V
  • Viewing H3's Q as V_2, H3 coincides with Hyena at m = 2 (Hyena-2)
• H3: QKV
• Hyena-3: QKVV

(Figure: H3 and Hyena block diagrams side by side.)

78-80. Hyena: the Convolution Filter
• The convolution filter f is represented by a positional embedding + an FFN + an exponential decay
  • Generated on-the-fly to match each input sequence length

f = [ h_0, h_1, h_2, …, h_N ]
h_t = FFN(PositionalEncoding(t)) · Window(t)

• Similar in spirit to Multi-scale Retention and RoPE

Multi-scale Retention: Sun+: Retentive Network: A Successor to Transformer for Large Language Models. arXiv 2023.
RoPE: Su+: RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv 2021.
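A minimal sketch of such an implicit filter, assuming a sinusoidal encoding, a one-layer stand-in for the FFN, and an exponential window (all of these are illustrative assumptions, not the paper's exact parameterization):

```python
import math

def positional_encoding(t, dim=8):  # illustrative sinusoidal encoding
    return [math.sin(t / 10000 ** (i / dim)) for i in range(dim)]

def tiny_ffn(z, w):  # a single linear layer standing in for the FFN
    return sum(wi * zi for wi, zi in zip(w, z))

def implicit_filter(N, w, decay=0.1):
    # h_t = FFN(PositionalEncoding(t)) * Window(t); the filter is *generated*
    # for any length N, so the parameter count (len(w)) stays fixed.
    # This is the sublinear parameter scaling property.
    return [tiny_ffn(positional_encoding(t), w) * math.exp(-decay * t)
            for t in range(N)]

w = [0.1] * 8  # fixed parameters, independent of sequence length
short, long = implicit_filter(8, w), implicit_filter(64, w)
```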

81. Evaluation Experiments
• Language modeling
• Downstream tasks (SuperGLUE)
• Image classification
• Runtime comparison

82. Evaluation: Language Modeling
• Hyena with m = 3 (QKVV) matches the Transformer in perplexity
  • Lower training cost than a comparably sized GPT-style model

(Tables: WikiText-103 and The Pile.)

83. Evaluation: SuperGLUE / Image Classification
• On par with a comparably sized Transformer base model
• The evaluation feels a bit slapdash

(Tables: SuperGLUE (4-shot learning) and image classification; RWKV is v4.)

84-85. Evaluation: Runtime Comparison
• Lower inference cost, especially for long sequences

(Plot: inference time versus sequence length.)

• Even though Hyena is based on state space models, its inference ends up O(N) per step, seemingly because of the implicit filter

86. Summary
• Proposed Hyena, a new architecture based on state space models
• Less computation than Attention, with performance on par with or above the Transformer
  • The multiple convolution kernels resemble MHA and Multi-scale Retention
• Note that inference cost is worse than for S4/H3

Impressions
• The evaluated models are small (mostly ≤ 355M parameters)
  • A 1.3B model was apparently trained in preliminary experiments (cf. Appendix A.2)
  • Would like to see evaluation on SuperGLUE etc. at scale, along with scaling laws
  • A comparison with S4 and H3 on Long Range Arena (LRA) would also have been welcome

LRA: Tay+: Long Range Arena: A Benchmark for Efficient Transformers. ICLR 2021.

87. Hyena: a new machine learning model that goes beyond the Transformer, toward next-generation LLMs. Is Attention All You Need? Part 3
• Is Attention All You Need? Part 1: S4, a new model that surpasses(?) the Transformer