
Hyena Hierarchy: Towards Larger Convolutional Language Models


A deck explaining Hyena, a new architecture based on state space models.

2023-08-29: The 15th Advanced NLP Study Group (最先端NLP勉強会)
https://sites.google.com/view/snlp-jp/home/2023

Hayato Tsukagoshi

August 22, 2023


Transcript

  1. Hyena Hierarchy: Towards Larger
    Convolutional Language Models
    Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus,
    Yoshua Bengio, Stefano Ermon, Christopher Ré
    ICML 2023
    Presenter: Hayato Tsukagoshi, D1, Graduate School of Informatics, Nagoya University, Japan

  2. Overview
    • Proposes Hyena, an architecture based on state space models (SSMs)
    • Lower computational cost than Attention: O(N log N)
    • The first Attention-free model to match or exceed Attention's performance
    • An unholy fusion of state space models and the Linear Transformer

  3. Outline / Disclaimer
    • State space models
    • Concept
    • Representation as a convolution
    • Hyena
    • Prior work
    • Intuition
    • Evaluation experiments

    Disclaimer
    • Figures in these slides are quoted from the papers cited on each slide
    • Some notation differs from the equations in the original papers

  4. State Space Models (SSMs)
    • A model that produces an output and the next state from the input and the current state
    • An RNN-like model

    s_{i+1} = A s_i + B x_i
    y_i = C s_i + D x_i

  5.–10. State Space Models (SSMs)
    (Slides 5–10 repeat slide 4, building up the diagram one element at a time:
    input x_{i-1} → state s_{i-1} → output y_{i-1}; input x_i → state s_i → output y_i; state s_{i+1}.)

    s_{i+1} = A s_i + B x_i
    y_i = C s_i + D x_i
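    A minimal NumPy sketch of this recurrence (naming is mine, not the paper's; assumes a single scalar input/output channel and s_0 = 0):

    import numpy as np

    def ssm_step(A, B, C, D, s, x):
        # One step of the recurrence: emit y_i from (s_i, x_i), then update the state.
        y = C @ s + D * x          # y_i = C s_i + D x_i
        s_next = A @ s + B * x     # s_{i+1} = A s_i + B x_i
        return y, s_next

    def ssm_recurrence(A, B, C, D, xs):
        s = np.zeros(A.shape[0])   # s_0 = 0
        ys = []
        for x in xs:               # one step per input token, like an RNN
            y, s = ssm_step(A, B, C, D, s, x)
            ys.append(y)
        return np.array(ys)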

  11.–14. State space models: unrolling the computation
    s_{i+1} = A s_i + B x_i
    y_i = C s_i + D x_i
    y_i = C (A s_{i-1} + B x_{i-1}) + D x_i
    y_i = C (A (A s_{i-2} + B x_{i-2}) + B x_{i-1}) + D x_i
    y_i = C (A (A (A s_{i-3} + B x_{i-3}) + B x_{i-2}) + B x_{i-1}) + D x_i

  15.–18. State space models: a concrete example (with s_0 = 0)
    y_0 = D x_0
    y_1 = C A^0 B x_0 + D x_1
    y_2 = C A^1 B x_0 + C A^0 B x_1 + D x_2
    y_3 = C A^2 B x_0 + C A^1 B x_1 + C A^0 B x_2 + D x_3
    In general: y_i = D x_i + Σ_{j=0}^{i-1} C A^{i-1-j} B x_j

  19.–23. State space models: dataflow at t = 0 through t = 3, and the overall picture
    (Diagrams: each output y_t receives D x_t plus every earlier input x_j weighted by C A^{t-1-j} B.)

  24. State space models: the overall picture
    • Outputs do not depend on one another
    → they can be computed in parallel
    • The same property as Attention

  25.–27. Representation as a convolution
    y_0 = D x_0
    y_1 = C A^0 B x_0 + D x_1
    y_2 = C A^1 B x_0 + C A^0 B x_1 + D x_2
    y_3 = C A^2 B x_0 + C A^1 B x_1 + C A^0 B x_2 + D x_3
    • Many of these computations look alike
    • Can we somehow speed this up?

  28.–36. Representation as a convolution
    Line up the C A^k B terms as a filter:
    f = [ C A^0 B, C A^1 B, C A^2 B, …, C A^{N-1} B ]
    x = [ x_0, x_1, x_2, …, x_{N-1} ]
    ( f ∗ x ) = [ C A^0 B x_0,                                → y_1
                  C A^1 B x_0 + C A^0 B x_1,                  → y_2
                  C A^2 B x_0 + C A^1 B x_1 + C A^0 B x_2,    → y_3
                  … ]
    An output sequence of the same length as the input.
    y_N = ( f ∗ x )_{N-1} + D x_N
    • Pick up the results of the convolution to form each output
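    A sketch of this filter construction and the naive causal convolution it feeds (same scalar-channel, s_0 = 0 assumption as before; for any A, B, C, D it agrees with ssm_recurrence above):

    import numpy as np

    def ssm_filter(A, B, C, N):
        # f = [C A^0 B, C A^1 B, ..., C A^{N-1} B]
        f, A_pow = [], np.eye(A.shape[0])
        for _ in range(N):
            f.append(C @ A_pow @ B)
            A_pow = A @ A_pow
        return np.array(f)

    def causal_conv(f, x, D):
        # y_i = D x_i + sum_{j<i} f_{i-1-j} x_j, matching the expansion above
        N = len(x)
        y = np.zeros(N)
        for i in range(N):
            y[i] = D * x[i] + sum(f[i - 1 - j] * x[j] for j in range(i))
        return y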

  37.–39. Speeding up the convolution with the fast Fourier transform
    • A convolution can be expressed as the elementwise product of the Fourier transforms of the sequences

    Ordinary convolution
    • Number of operations: N (N+1) / 2 → O(N^2)

    Convolution via the fast Fourier transform
    • FFT of f and x: O(N log N)
    • Elementwise product of FFT(f) and FFT(x): O(N)
    • Inverse FFT of the product: O(N log N)
    → All N outputs can be computed in O(N log N)!
    (Applying this to state space models actually requires several further assumptions.)
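    A sketch of the FFT route with NumPy's real FFT (zero-padding to length 2N is needed so the FFT's circular convolution equals the linear, causal one):

    import numpy as np

    def fft_causal_conv(f, x, D):
        N = len(x)
        L = 2 * N                               # zero-pad to avoid circular wraparound
        conv = np.fft.irfft(np.fft.rfft(f, n=L) * np.fft.rfft(x, n=L), n=L)[:N]
        y = D * np.asarray(x, dtype=float)      # the D x_i skip term
        y[1:] += conv[:-1]                      # y_i also gets (f * x)_{i-1}
        return y

    This returns the same outputs as causal_conv above, in O(N log N) instead of O(N^2).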

  40. Recap: state space models (SSMs) and deep learning
    • A model that produces an output and the next state from the input and the current state
    • Depending on the setting, the computation can be made cheaper
    • A mechanism that mixes a sequence of vectors into another sequence of vectors
    • It can do something Transformer-like

  41. Hyena


  42.–45. The properties behind Attention, as the Hyena paper argues
    Data-controlled linear operator
    • The operation depends on the input sequence itself (context dependency)
    → S4 fails this

    Sublinear parameter scaling
    • The number of parameters must not depend on the input sequence length
    → MLP-Mixer fails this

    Unrestricted context
    • Must be able to relate arbitrary pairs of tokens
    • We want unlimited context width
    → CNN / Local Attention fail this
    Local Attention: https://github.com/lucidrains/local-attention

  46.–47. Prior work: Structured State Space Sequence model (S4)
    • The pioneering deep learning model built on state space models
    • Strong on long-sequence, long-range-dependency tasks such as image (bit-sequence) classification
    • Has no linear operator that depends on the input sequence
    • Lacking a QKV-like mechanism as in Attention, its expressive power is comparatively weak
    Gu+: Efficiently Modeling Long Sequences with Structured State Spaces. ICLR 2022 outstanding paper.

  48.–51. Prior work: Hungry Hungry Hippos (H3)
    • SSMs are weak at language tasks, so let's fix that
    • Problem: SSMs, S4 included, are poor at recalling and comparing tokens
    • Imitates Attention's QKV with state space models
    • Mixes with an SSM, then mixes again Linear-Attention-style
    • In effect a combination of Linear Attention and SSMs
    • On its own it cannot beat the Transformer
    • Only a hybrid model with interleaved Attention layers matches or exceeds it
    • And the hybrid's inference is dragged down to Attention speed
    Fu+: Hungry Hungry Hippos: Towards Language Modeling with State Space Models. ICLR 2023 spotlight.

  52.–56. Detour: the QKV computation in Linear Attention
    • Compute Q (KV) instead of (QK) V
    Attention: (QK) V → O(N^2 d)    Linear Attention: Q (KV) → O(N d^2)
    (Q, K, V are N × d; QK is N × N, KV is d × d)
    • Much cheaper to compute!
    • Preserving causality is the hard part, so several variants exist
    Shen+: Efficient Attention: Attention with Linear Complexities. WACV 2021.
    Katharopoulos+: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. ICML 2020.
    https://github.com/lucidrains/linear-attention-transformer
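    A quick demonstration of the associativity trick (softmax-free and non-causal, purely for the cost intuition; real Linear Attention adds a kernel feature map and normalization):

    import numpy as np

    N, d = 512, 64
    Q = np.random.randn(N, d)
    K = np.random.randn(N, d)
    V = np.random.randn(N, d)

    attn = (Q @ K.T) @ V       # N x N intermediate: O(N^2 d) time, O(N^2) memory
    lin  = Q @ (K.T @ V)       # d x d intermediate: O(N d^2) time, O(d^2) memory

    print(np.allclose(attn, lin))  # True: matrix products are associative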

  57.–63. H3: the picture
    • Elementwise product of the SSM-mixed K and V → elementwise product of the SSM-mixed KV and Q
    (Diagram: X is projected to Q, K, V, each N × d; SSM mixing and elementwise products yield KV and then Y.)
    • Because the interactions are elementwise products, causality is preserved
    • The SSMs correspond to the Linear Attention kernel
    • A devilish Linear Attention + SSMs architecture

  64.–65. H3: the actual computation
    (Diagram: as above, but KV is built with a per-position outer product, giving an N × d × d tensor,
    and the combination with Q is an elementwise product.)
    Q_1 ∈ R^{1×d}, KV_1 ∈ R^{d×d}
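    A rough single-head sketch, assuming the H3 form Q ⊙ SSM_diag(SSM_shift(K) ⊙ V) and using a plain causal depthwise convolution to stand in for each SSM (this is the simplified elementwise picture of slides 57–63, not the outer-product version):

    import numpy as np

    def ssm_mix(h, x):
        # Causal depthwise convolution, stand-in for an SSM's convolutional form.
        # h, x: (N, d) float arrays; y[i] = sum_{j<=i} h[j] * x[i-j], elementwise over d.
        N = x.shape[0]
        y = np.zeros_like(x)
        for i in range(N):
            for j in range(i + 1):
                y[i] += h[j] * x[i - j]
        return y

    def h3_block(q, k, v, h_shift, h_diag):
        kv = ssm_mix(h_diag, ssm_mix(h_shift, k) * v)  # mix K, gate with V, mix again
        return q * kv                                  # elementwise gate by Q -> Y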

  66. Hyena
    • A truly Attention-free model built on SSMs
    • Prepares multiple convolution filters, in the spirit of Multi-Head Attention (MHA)
    • Applies SSM-based token mixing repeatedly
    • Can be seen as H3 generalized in the number of KV operations

  67.–76. Hyena: the picture
    • Elementwise product of SSM-mixed K and V → elementwise product of the SSM-mixed KV (= K′) and V′ → …
    • Mix with an SSM m times in total
    • Coincides with H3 when m = 2
    (Diagram: X is projected to K_1 … K_m and V_1; each step forms V_{t+1} = SSM(K_t ⊙ V_t),
    and the last product gives Y. The final projection corresponds to H3's Q.)
    • Honestly feels a little desperate
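    A sketch of this loop, following the paper's recurrence z_{t+1} = k_t ⊙ (h_t ∗ z_t) and reusing ssm_mix from the H3 sketch above (m = 2 recovers H3's dataflow):

    def hyena_operator(gates, v, filters):
        # gates: the m projections K_1..K_m of X (each N x d)
        # v: the value projection V_1 (N x d)
        # filters: one implicit long-convolution filter per step (each N x d)
        z = v
        for k_t, h_t in zip(gates, filters):
            z = k_t * ssm_mix(h_t, z)   # long convolution, then elementwise gating
        return z                        # = Y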

  77. Hyena: about the missing Q
    • H3's Q does much the same job as V
    • If you read H3's Q as V_2, H3 coincides with Hyena at m = 2 (Hyena-2)
    • H3: QKV
    • Hyena-3: QKVV
    (Figure: H3 vs. Hyena block diagrams)

  78.–80. Hyena: the convolution filter
    • The convolution filter f is parameterized by positional embeddings + an FFN + an exponential decay
    • Generated on the fly to match each input sequence length

    f = [ h_0, h_1, h_2, …, h_N ]
    h_t = FFN(PositionalEncoding(t)) · Window(t)

    • Similar in spirit to Multi-scale Retention and RoPE
    Multi-scale Retention — Sun+: Retentive Network: A Successor to Transformer for Large Language Models. arXiv 2023.
    RoPE — Su+: RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv 2021.
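    A sketch of such an implicit filter; all sizes, the FFN shape, the sinusoidal encoding, and the decay rate here are illustrative guesses, not the paper's values:

    import numpy as np

    def implicit_filter(N, d_model, d_pos=16, decay=0.01, rng=np.random.default_rng(0)):
        t = np.arange(N)[:, None]                              # positions 0..N-1
        freqs = np.arange(1, d_pos // 2 + 1)[None, :]
        pe = np.concatenate([np.sin(t * freqs / N),
                             np.cos(t * freqs / N)], axis=1)   # PositionalEncoding(t)
        W1 = rng.standard_normal((d_pos, 64))                  # a tiny 2-layer FFN
        W2 = rng.standard_normal((64, d_model))
        h = np.maximum(pe @ W1, 0.0) @ W2                      # FFN(PositionalEncoding(t))
        window = np.exp(-decay * t)                            # exponential decay window
        return h * window                                      # filter of shape (N, d_model)

    Because the filter is a function of position rather than a lookup table, its parameter count is independent of N (the "sublinear parameter scaling" property above).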

  81. Evaluation experiments
    • Language modeling
    • Downstream tasks (SuperGLUE)
    • Image classification
    • Runtime comparison

  82. Evaluation: language modeling
    • Hyena with m = 3 (QKVV) matches the Transformer in perplexity
    • It also trains more cheaply than a GPT-style model of the same size
    (Tables: WikiText-103 and The Pile)

  83. Evaluation: SuperGLUE / image classification
    • On par with a Transformer base model of the same size
    • The evaluation feels a bit slapdash
    (Tables: SuperGLUE (4-shot learning) and image classification; the RWKV baseline is v4.)

  84.–85. Evaluation: runtime comparison
    • Lower inference cost, especially on long sequences
    (Plot: sequence length vs. inference time.)
    • Note that although Hyena is state-space-model-based, its inference ends up O(N) per step
    • Likely because of the implicit filter

  86. Summary
    • Proposed Hyena, a new architecture built on state space models
    • Lower computational cost than Attention, with performance on par with or above the Transformer
    • The multiple convolution kernels resemble MHA and Multi-scale Retention
    • Beware that inference cost is worse than S4/H3

    Impressions
    • The evaluated models are small (mostly ≤ 355M parameters)
    • A 1.3B model was apparently trained in preliminary experiments (cf. Appendix A.2)
    • Would love to see SuperGLUE-style evaluation pushed further, plus scaling laws
    • Would also have liked a Long Range Arena (LRA) comparison against S4 and H3
    LRA — Tay+: Long Range Arena: A Benchmark for Efficient Transformers. ICLR 2021.

  87. Related materials
    • Hyena: a new machine learning model beyond the Transformer for next-generation LLMs (Is Attention All You Need? Part 3)
    • Is Attention All You Need? Part 1: S4, a new model to surpass(?) the Transformer
    • HyenaDNA: a new application of LLMs that reads the language of DNA
    • [Journal club] Hyena Hierarchy: Towards Larger Convolutional Language Models
    • The Annotated S4
    • Hungry Hungry Hippos: Towards Language Modeling with State Space Models