Hayato Tsukagoshi
August 22, 2023

# Hyena Hierarchy: Towards Larger Convolutional Language Models

2023-08-29: The 15th Advanced NLP Study Group (第15回 最先端NLP勉強会)

## Transcript

1. Hyena Hierarchy: Towards Larger Convolutional Language Models
Hayato Tsukagoshi
D1, Graduate School of Informatics, Nagoya University, Japan
Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, Christopher Ré
ICML 2023

2. Overview
• Proposes Hyena, an architecture based on state space models (SSMs)
• Lower computational cost than Attention: O(N log N)
• The first Attention-free model to match or exceed Attention's performance
• A "demon fusion" of state space models and the Linear Transformer

3. Outline / Disclaimer
• State space models
  • Concept
  • Expressing them as a convolution
• Hyena
  • Prior work
  • Intuition
  • Evaluation experiments

Disclaimer
• Figures on the slides are quoted from the papers cited on each slide
• Some symbols may differ from the notation used in the papers

4-10. State Space Models (SSMs)
• A model that produces an output and the next state from the input and the current state
  • An RNN-like model

s_{i+1} = A s_i + B x_i
y_i = C s_i + D x_i

(Diagram, built up across slides 4-10: input x_{i-1} and state s_{i-1} produce output y_{i-1}; input x_i and state s_i produce output y_i; the recurrence then yields state s_{i+1}.)
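The recurrence above can be sketched in a few lines of Python. This is a toy scalar SSM for illustration only (in practice A, B, C, D are matrices, and this is not the paper's implementation; the function name and parameter values are made up):

```python
# Toy scalar state space model: s_{i+1} = A*s_i + B*x_i, y_i = C*s_i + D*x_i.
def ssm_recurrent(x, A, B, C, D, s0=0.0):
    s, ys = s0, []
    for xi in x:
        ys.append(C * s + D * xi)  # output from the current state and input
        s = A * s + B * xi         # next state
    return ys

# An impulse input shows the exponentially decaying "memory" of the state.
print(ssm_recurrent([1.0, 0.0, 0.0, 0.0], A=0.5, B=1.0, C=1.0, D=0.0))
# → [0.0, 1.0, 0.5, 0.25]
```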

11-14. State Space Models: Unrolling the Computation
s_{i+1} = A s_i + B x_i
y_i = C s_i + D x_i

Substituting the recurrence into itself:
y_i = C (A s_{i-1} + B x_{i-1}) + D x_i
y_i = C (A (A s_{i-2} + B x_{i-2}) + B x_{i-1}) + D x_i
y_i = C (A (A (A s_{i-3} + B x_{i-3}) + B x_{i-2}) + B x_{i-1}) + D x_i

15-18. State Space Models: Concrete Example
y_0 = D x_0
y_1 = C A^0 B x_0 + D x_1
y_2 = C A^1 B x_0 + C A^0 B x_1 + D x_2
y_3 = C A^2 B x_0 + C A^1 B x_1 + C A^0 B x_2 + D x_3

19-22. State Space Models: t = 0 through t = 3
(Diagrams: each output y_t is formed from the inputs x_0 … x_t, with weight D on x_t and weights C A^{t-1-k} B on the earlier inputs x_k.)

23-24. State Space Models: Overall Picture
(Diagram: every output y_0 … y_3 is computed directly from the inputs x_0 … x_3.)
• No dependencies among the outputs → computable in parallel
• The same property as Attention

25-27. Expressing SSMs as a Convolution
y_0 = D x_0
y_1 = C A^0 B x_0 + D x_1
y_2 = C A^1 B x_0 + C A^0 B x_1 + D x_2
y_3 = C A^2 B x_0 + C A^1 B x_1 + C A^0 B x_2 + D x_3

Many of these computations look alike. Can we somehow speed them up?

28-36. Expressing SSMs as a Convolution
Line up the C A^k B terms into a filter:

f = [ C A^0 B, C A^1 B, C A^2 B, …, C A^{N-1} B ]
x = [ x_0, x_1, x_2, …, x_{N-1} ]

( f ∗ x ) = [ C A^0 B x_0,                              → y_1
              C A^1 B x_0 + C A^0 B x_1,                → y_2
              C A^2 B x_0 + C A^1 B x_1 + C A^0 B x_2,  → y_3
              … ]

• An output sequence of the same length as the input
• Each output picks up one entry of the convolution result:

y_N = ( f ∗ x )_{N-1} + D x_N
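As a sanity check, the filter form can be verified against the recurrence. This is a toy scalar sketch (not from the paper; function names and parameter values are illustrative):

```python
# Convolution form of the SSM: y_i = (f * x)_{i-1} + D*x_i
# with filter f = [CA^0B, CA^1B, ..., CA^{N-1}B]  (scalars for clarity).
def ssm_conv(x, A, B, C, D):
    N = len(x)
    f = [C * A**k * B for k in range(N)]
    y = []
    for i in range(N):
        conv = sum(f[k] * x[i - 1 - k] for k in range(i))  # (f * x)_{i-1}
        y.append(conv + D * x[i])
    return y

# Reference implementation via the step-by-step recurrence.
def ssm_recurrent(x, A, B, C, D, s=0.0):
    ys = []
    for xi in x:
        ys.append(C * s + D * xi)
        s = A * s + B * xi
    return ys

xs = [1.0, -2.0, 3.0, 0.5]
assert all(abs(a - b) < 1e-9
           for a, b in zip(ssm_conv(xs, 0.9, 1.0, 0.7, 0.1),
                           ssm_recurrent(xs, 0.9, 1.0, 0.7, 0.1)))
```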

37-39. Speeding Up the Convolution with the Fast Fourier Transform
• A convolution can be expressed as the elementwise product of the Fourier transforms of the two sequences

Naive convolution
• Number of multiply-adds: N (N + 1) / 2 → O(N^2)

FFT-based convolution
• FFT of f and of x: O(N log N)
• Elementwise product of FFT(f) and FFT(x): O(N)
• Inverse FFT of the product: O(N log N)

All N outputs can be computed in O(N log N)!
(In practice, applying this to state space models requires various additional assumptions.)
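The three FFT steps above can be sketched end to end in pure Python. This is a toy radix-2 implementation for illustration only (real SSM kernels need more care; all names here are made up):

```python
import cmath

def fft(a):  # radix-2 Cooley-Tukey FFT; len(a) must be a power of two
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2]), fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def ifft(a):  # inverse FFT via conjugation
    y = fft([z.conjugate() for z in a])
    return [z.conjugate() / len(a) for z in y]

def fft_conv(f, x):
    n = 1
    while n < len(f) + len(x) - 1:  # zero-pad so circular conv == linear conv
        n *= 2
    F = fft(f + [0.0] * (n - len(f)))
    X = fft(x + [0.0] * (n - len(x)))
    y = ifft([a * b for a, b in zip(F, X)])  # elementwise product: O(n)
    return [v.real for v in y[:len(x)]]      # keep the causal prefix

# Convolving with a unit impulse returns the filter itself.
out = fft_conv([1.0, 0.5, 0.25, 0.125], [1.0, 0.0, 0.0, 0.0])
```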

40. Summary: State Space Models (SSMs) and Deep Learning
• A model that produces an output and the next state from the input and the current state
  • In some cases the computation can be made cheaper
• A mechanism that mixes a sequence of vectors and outputs a sequence of vectors
  • Can do something similar to a Transformer

41. Hyena

42-45. Properties Behind Attention: the Hyena Paper's Claim
Data-controlled linear operator
• Enables operations that depend on the input sequence itself (context dependency)
  • S4 fails this

Sublinear parameter scaling
• The number of parameters does not depend on the input sequence length
  • MLP-Mixer fails this

Unrestricted context
• Can capture relations between arbitrary pairs of tokens
  • We want an effectively unlimited context width
  • CNNs / Local Attention fail this

Local Attention: https://github.com/lucidrains/local-attention

46-47. Prior Work: Structured State Space Sequence (S4)
• A pioneering deep learning model based on state space models
  • Strong on long-sequence, long-range-dependency tasks such as image (bit-sequence) classification
• Has no linear operation that depends on the input sequence
  • No QKV-like mechanism as in Attention, so its expressive power is comparatively weak

Gu+: Efficiently Modeling Long Sequences with Structured State Spaces. ICLR 2022 outstanding paper.

48-51. Prior Work: Hungry Hungry Hippos (H3)
• "SSMs are weak at language tasks, so let's fix that"
• Problem: SSMs, including S4, are poor at recalling and comparing tokens
• Mimics Attention's QKV with state space models
  • Mix with an SSM, then mix again Linear-Attention-style
  • A combination of Linear Attention and SSMs
• On its own it cannot beat the Transformer
  • Only a hybrid model with interleaved Attention layers matches or exceeds it
  • The hybrid model's inference is dragged down by Attention and is slow

Fu+: Hungry Hungry Hippos: Towards Language Modeling with State Space Models. ICLR 2023 spotlight.

52-56. Detour: the QKV Computation in Linear Attention
• Compute Q (Kᵀ V) instead of (Q Kᵀ) V
  • Attention: (Q Kᵀ) V builds an N × N intermediate → O(N² d)
  • Linear Attention: Q (Kᵀ V) builds only a d × d intermediate → O(N d²)
  • The computation is much lighter!
• Guaranteeing causality is tricky, so several variants exist

Shen+: Efficient Attention: Attention with Linear Complexities. WACV 2021.
Katharopoulos+: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. ICML 2020.
https://github.com/lucidrains/linear-attention-transformer
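The associativity trick behind this can be checked in a few lines. A toy sketch with plain nested lists (no softmax or feature map, which is exactly what plain Attention has and Linear Attention drops; all names are illustrative):

```python
import random

# (Q K^T) V == Q (K^T V): the left builds an N x N intermediate (O(N^2 d)),
# the right only a d x d intermediate (O(N d^2)).
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

random.seed(0)
N, d = 6, 3
Q, K, V = ([[random.random() for _ in range(d)] for _ in range(N)] for _ in range(3))

left = matmul(matmul(Q, transpose(K)), V)   # (Q K^T) V : N x N intermediate
right = matmul(Q, matmul(transpose(K), V))  # Q (K^T V) : d x d intermediate
assert all(abs(a - b) < 1e-9 for ra, rb in zip(left, right) for a, b in zip(ra, rb))
```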

57-63. H3: the Idea
• Elementwise product of SSM-mixed K with V → then elementwise product of the SSM-mixed KV with Q

(Diagram: X is projected to Q, K, V, each N × d; K is mixed by an SSM and multiplied elementwise with V to form KV; KV is mixed by another SSM and multiplied elementwise with Q to give Y.)

• Elementwise products preserve causality
• The SSMs play the role of Linear Attention's kernel
• A devilish architecture: Linear Attention + SSMs

64-65. H3: the Actual Computation
• A per-position outer product of K (N × d) and V (N × d) gives KV (N × d × d); the elementwise step then multiplies Q against it per position: Q_1 ∈ R^{1 × d}, KV_1 ∈ R^{d × d}

(Diagram: X → Q, K, V; K is SSM-mixed; the outer product with V forms KV; another SSM and the product with Q give Y.)

66. Hyena
• A truly Attention-free model built with SSMs
• Repeatedly performs SSM-based token mixing
  • Can be viewed as H3 generalized in the number of KV operations

67-76. Hyena: the Idea
• Elementwise product of SSM-mixed K and V → elementwise product of SSM-mixed KV (= K') with V' → …
  • Mix with an SSM m times
  • m = 2 coincides with H3

(Diagram: X is projected to K_1 … K_m and V_1; each stage applies an SSM and an elementwise gate, V_1 → V_2 → … → V_m → Y; V_m corresponds to H3's Q.)

• It feels slightly ad hoc
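The order-m recurrence sketched above can be written as a toy loop. This is a deliberate simplification (the names `causal_conv` / `hyena_operator`, the O(N²) convolution, and the scalar sequences are all illustrative; the real model uses FFT-based long convolutions and learned projections):

```python
# Hyena-style recurrence: v <- k_i (elementwise *) causal_conv(h_i, v),
# repeated for i = 1..m. m = 2 gives an H3-like QKV form.
def causal_conv(h, x):  # (h * x)_t = sum_{k<=t} h[k] * x[t-k], reference O(N^2)
    return [sum(h[k] * x[t - k] for k in range(t + 1)) for t in range(len(x))]

def hyena_operator(gates, filters, v):
    for k, h in zip(gates, filters):
        u = causal_conv(h, v)                   # token mixing via a long filter
        v = [ki * ui for ki, ui in zip(k, u)]   # elementwise gate keeps causality
    return v

# With an identity filter [1, 0, 0] the output is just the gated input.
y = hyena_operator([[1.0, 1.0, 1.0]], [[1.0, 0.0, 0.0]], [1.0, 2.0, 3.0])
# → [1.0, 2.0, 3.0]
```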

77. Hyena: About the Missing Q
• H3's Q does much the same job as V
  • Viewing H3's Q as V_2, H3 coincides with Hyena at m = 2 (Hyena-2)
• H3: QKV
• Hyena-3: QKVV

(Figure: H3 and Hyena block diagrams side by side.)

78-80. Hyena: the Convolution Filter
• The convolution filter f is represented by a positional embedding + an FFN + an exponential decay
  • Generated on-the-fly to match each input sequence length

f = [ h_0, h_1, h_2, …, h_N ]
h_t = FFN(PositionalEncoding(t)) · Window(t)

• Similar in spirit to Multi-scale Retention and RoPE

Multi-scale Retention: Sun+: Retentive Network: A Successor to Transformer for Large Language Models. arXiv 2023.
RoPE: Su+: RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv 2021.
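A minimal sketch of such an implicit filter, assuming a sinusoidal encoding, a one-layer stand-in for the FFN, and an exponential window (all of these are illustrative assumptions, not the paper's exact parameterization):

```python
import math

def positional_encoding(t, dim=8):  # illustrative sinusoidal encoding
    return [math.sin(t / 10000 ** (i / dim)) for i in range(dim)]

def tiny_ffn(z, w):  # a single linear layer standing in for the FFN
    return sum(wi * zi for wi, zi in zip(w, z))

def implicit_filter(N, w, decay=0.1):
    # h_t = FFN(PositionalEncoding(t)) * Window(t); the filter is *generated*
    # for any length N, so the parameter count (len(w)) stays fixed.
    # This is the sublinear parameter scaling property.
    return [tiny_ffn(positional_encoding(t), w) * math.exp(-decay * t)
            for t in range(N)]

w = [0.1] * 8  # fixed parameters, independent of sequence length
short, long = implicit_filter(8, w), implicit_filter(64, w)
```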

81. Evaluation Experiments
• Language modeling
• Downstream tasks (SuperGLUE)
• Image classification
• Runtime comparison

82. Evaluation: Language Modeling
• Hyena with m = 3 (QKVV) matches the Transformer in perplexity
  • Lower training cost than a comparably sized GPT-style model

(Tables: WikiText-103 and The Pile.)

83. Evaluation: SuperGLUE / Image Classification
• On par with a comparably sized Transformer base model
• The evaluation feels a bit slapdash

(Tables: SuperGLUE (4-shot learning) and image classification; RWKV is v4.)

84-85. Evaluation: Runtime Comparison
• Lower inference cost, especially for long sequences

(Plot: inference time versus sequence length.)

• Even though Hyena is based on state space models, its inference ends up O(N) per step, seemingly because of the implicit filter

86. Summary
• Proposed Hyena, a new architecture based on state space models
• Less computation than Attention, with performance on par with or above the Transformer
  • The multiple convolution kernels resemble MHA and Multi-scale Retention
• Note that inference cost is worse than for S4/H3

Impressions
• The evaluated models are small (mostly ≤ 355M parameters)
  • A 1.3B model was apparently trained in preliminary experiments (cf. Appendix A.2)
  • Would like to see evaluation on SuperGLUE etc. at scale, along with scaling laws
  • A comparison with S4 and H3 on Long Range Arena (LRA) would also have been welcome

LRA: Tay+: Long Range Arena: A Benchmark for Efficient Transformers. ICLR 2021.

87. Hyena: a new machine learning model that goes beyond the Transformer, toward next-generation LLMs. Is Attention All You Need? Part 3
• Is Attention All You Need? Part 1: S4, a new model that surpasses(?) the Transformer