
Deep State Space Models 101 / Mamba

Hiroto Kurita
January 20, 2024


Presentation slides used at the Kotoba Technologies Seminar Series on 2024/01/19.
Contact: https://twitter.com/hiroto_kurita


Transcript

  1. Deep State Space Models 101
    Hiroto Kurita
    2024/1/19 Kotoba Technologies Seminar Series

  2. About me
    • Hiroto Kurita (栗田 宙人)
    • Research / ML Engineer at Kotoba Technologies
    • Second-year master's student, Sakaguchi Lab (TohokuNLP), Tohoku University
    X: @hiroto_kurita / https://kurita.dev

  3. About this talk
    • Goal: get a rough understanding of Deep State Space Models, which are strong at long-sequence processing
      • Focus on Mamba [Gu&Dao+'23]
      • Overly technical topics are omitted
    • Outline
      • Introduction to State Space Models / S4 [Gu+'22]
      • Mamba
      • Kotoba Tech.'s work on Mamba
    Note: figures and tables that are not my own and have no explicit citation are taken from [Gu&Dao+'23] and [Gu+'22]

  4. State Space Models (SSMs) are sequence-to-sequence transducers
    • Input/output: (x_0, x_1, ⋯, x_{L-1}) → (y_0, y_1, ⋯, y_{L-1}), with x_t ∈ ℝ, y_t ∈ ℝ
    • In practice the inputs and outputs are vectors (the scalar case extends easily)
      • The same holds for both Attention and SSMs
    • Like Attention, an SSM can be plugged into a neural model as a layer
    [Figure: two identical stacks (Normalization → mixing layer → Linear) mapping x_0, x_1, ⋯, x_L to y_0, y_1, ⋯, y_L; on the left the mixing layer is Attention, on the right an SSM layer]
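A minimal sketch of the block diagram above: a pre-norm residual block whose sequence-mixing step is a per-feature linear recurrence instead of attention. This is my own toy illustration (names and initialization are not from the talk); real SSM layers such as S4 and Mamba structure the recurrence much more carefully.

```python
# Toy "SSM layer in a residual block" (illustrative only, not code from the talk)
import numpy as np

class TinySSMBlock:
    """Normalize -> mix along the sequence with a linear recurrence -> Linear, plus a residual."""
    def __init__(self, d_model, rng):
        self.a = rng.uniform(0.5, 0.99, d_model)       # per-feature decay (plays the role of A-bar)
        self.b = 0.1 * rng.standard_normal(d_model)    # per-feature input scale (plays B-bar)
        self.W_out = 0.1 * rng.standard_normal((d_model, d_model))

    def __call__(self, x):                             # x: (L, d_model)
        u = (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)
        h = np.zeros(x.shape[1])
        ys = []
        for u_t in u:                                  # recurrent sequence mixing (where attention would be)
            h = self.a * h + self.b * u_t
            ys.append(h)
        return x + np.stack(ys) @ self.W_out           # residual connection

rng = np.random.default_rng(0)
block = TinySSMBlock(d_model=16, rng=rng)
print(block(rng.standard_normal((32, 16))).shape)      # (32, 16)
```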

  5.–7. SSMs aim for the best of both Transformers and RNNs
    • Transformer (focusing on Attention)
      • Inference: 😫 O(L²)
      • Training: 😊 parallelizable
    • RNN
      • Inference: 😊 O(L)
      • Training: 😫 not parallelizable
    • SSM
      • Inference: 😊 O(L)
      • Training: 😊 parallelizable
    [Figure: Attention looks back at all past inputs x_0, x_1, ⋯, x_L to produce each y_t; an SSM (≈ linear RNN) looks only at the current input and the previous "state" h_t]

  8.–9. SSMs: formulation (continuous time)
    • Map an input x(t) ∈ ℝ to an output y(t) ∈ ℝ through a state h(t) ∈ ℝ^N
    • Long used in control engineering, time-series analysis, and related fields
    • A through D are parameters
      • They do not change over time and do not depend on x(t)
    • In what follows we set D = 0 (it can be regarded as a skip connection and omitted)
    • This is a mapping x ↦ y between functions defined in continuous time
      • To handle sequences (language, ...), it must be discretized

      h'(t) = A h(t) + B x(t)
      y(t)  = C h(t) + D x(t)

  10.–12. SSMs: discretization gives a recurrent form
    Continuous form:
      h'(t) = A h(t) + B x(t)
      y(t)  = C h(t)

    Discrete, recurrent form (after discretization):
      h_t = Ā h_{t-1} + B̄ x_t
      y_t = C̄ h_t

    Example: discretization with the Euler method
      From h'(t) = lim_{Δ→0} (h(t+Δ) − h(t)) / Δ, use h'(t) ≈ (h(t+Δ) − h(t)) / Δ to get
        h(t+Δ) = Δ A h(t) + h(t) + Δ B x(t)
               = (Δ A + I) h(t) + Δ B x(t)
               = Ā h(t) + B̄ x(t),   where Ā := Δ A + I, B̄ := Δ B
      Writing x_t, y_t, h_t := x(tΔ), y(tΔ), h(tΔ), this becomes
        h_t = Ā h_{t-1} + B̄ x_t,   y_t = C̄ h_t   (C̄ := C)

    [Figure: the SSM (≈ linear RNN) unrolled over x_0, x_1, ⋯, x_L with states h_0, h_1, ⋯, h_L; each step looks only at the current input and the previous "state"]
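A minimal NumPy sketch of the Euler discretization and the resulting recurrence described above. The function names and toy parameters are my own; this is illustrative, not the S4/Mamba implementation.

```python
# Euler discretization of h'(t) = A h(t) + B x(t), y(t) = C h(t),
# followed by the discrete recurrence h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t.
import numpy as np

def discretize_euler(A, B, delta):
    """A_bar = I + delta*A, B_bar = delta*B (Euler / forward difference)."""
    N = A.shape[0]
    return np.eye(N) + delta * A, delta * B

def ssm_recurrence(A_bar, B_bar, C, x):
    """Run the discrete SSM step by step (the 'RNN mode' used at inference)."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x_t in x:                      # one scalar input per step
        h = A_bar @ h + B_bar * x_t    # h_t = A_bar h_{t-1} + B_bar x_t
        ys.append(C @ h)               # y_t = C h_t
    return np.array(ys)

# Toy usage with random parameters (illustrative only)
rng = np.random.default_rng(0)
N = 4
A = -np.eye(N) + 0.1 * rng.standard_normal((N, N))
B = rng.standard_normal(N)
C = rng.standard_normal(N)
A_bar, B_bar = discretize_euler(A, B, delta=0.1)
y = ssm_recurrence(A_bar, B_bar, C, rng.standard_normal(16))
print(y.shape)  # (16,)
```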

  13.–22. SSMs at training time: linearity allows a convolution, so computation is fast and parallel
    • 😫 RNNs are in general not parallelizable
    • 😊 Thanks to linearity, SSMs can be computed in parallel as a convolution

    Recurrence:
      h_t = Ā h_{t-1} + B̄ x_t
      y_t = C̄ h_t

    Unrolling with h_{-1} = 0:
      h_0 = Ā h_{-1} + B̄ x_0 = B̄ x_0
      h_1 = Ā h_0 + B̄ x_1 = Ā B̄ x_0 + B̄ x_1
      h_2 = Ā h_1 + B̄ x_2 = Ā² B̄ x_0 + Ā B̄ x_1 + B̄ x_2

      y_0 = C̄ h_0 = C̄ B̄ x_0
      y_1 = C̄ h_1 = C̄ Ā B̄ x_0 + C̄ B̄ x_1
      y_2 = C̄ h_2 = C̄ Ā² B̄ x_0 + C̄ Ā B̄ x_1 + C̄ B̄ x_2
      y_k = C̄ h_k = C̄ Ā^k B̄ x_0 + C̄ Ā^{k-1} B̄ x_1 + ⋯ + C̄ B̄ x_k

    So the output can be expressed as a 1-D convolution:
      y = x ∗ K̄,   K̄ = (C̄ B̄, C̄ Ā B̄, ⋯, C̄ Ā^k B̄, ⋯)

    [Figure: the kernel K̄ = (⋯, C̄ Ā² B̄, C̄ Ā B̄, C̄ B̄) is slid over the zero-padded input x; the elementwise products are summed to produce y_0, y_1, ⋯]

    • At training time x is known in advance → K̄ can be precomputed
    • Once K̄ is available, the convolution is parallelizable
    • The convolution can be computed fast with the FFT / iFFT
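A minimal sketch (my own illustration) of the same SSM computed two ways: step-by-step recurrence versus the 1-D convolution with the kernel K̄ = (C̄B̄, C̄ĀB̄, ⋯), truncated to the sequence length. Names and toy values are assumptions for illustration.

```python
# Build the kernel K[k] = C A_bar^k B_bar and check that convolution == recurrence.
import numpy as np

def ssm_kernel(A_bar, B_bar, C, L):
    K = np.empty(L)
    A_pow_B = B_bar.copy()          # A_bar^0 B_bar
    for k in range(L):
        K[k] = C @ A_pow_B          # C A_bar^k B_bar
        A_pow_B = A_bar @ A_pow_B   # advance to A_bar^{k+1} B_bar
    return K

def ssm_conv(K, x):
    """Causal convolution y_t = sum_k K[k] * x_{t-k} (same result as the recurrence)."""
    L = len(x)
    return np.convolve(x, K)[:L]    # keep only the causal part

# Compare against the recurrence
rng = np.random.default_rng(0)
N, L = 4, 32
A_bar = np.eye(N) + 0.05 * rng.standard_normal((N, N))
B_bar = rng.standard_normal(N)
C = rng.standard_normal(N)
x = rng.standard_normal(L)

K = ssm_kernel(A_bar, B_bar, C, L)
y_conv = ssm_conv(K, x)

h = np.zeros(N)
y_rec = []
for x_t in x:
    h = A_bar @ h + B_bar * x_t
    y_rec.append(C @ h)
print(np.allclose(y_conv, np.array(y_rec)))  # True
```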

  23. SSMs, summarized: an RNN at inference time, a Transformer at training time
    • Recurrent form: h_t = Ā h_{t-1} + B̄ x_t,  y_t = C̄ h_t
      • At inference time: fast recurrent computation like an RNN (looks only at the current input and the previous state)
    • Convolutional form: y = x ∗ K̄,  K̄ = (C̄ B̄, C̄ Ā B̄, ⋯, C̄ Ā^k B̄, ⋯)
      • At training time: parallelizable like a Transformer (or a CNN); x is known in advance, so K̄ can be precomputed and the convolution accelerated with the FFT / iFFT

  24. Problems with SSMs (≈ linear RNNs)
    • 🥲 Training them as-is does not work well
      • The matrix Ā is critically important
      • An SSM "memorizes" all past information in the state h
      • Ā connects h_{t-1} to h_t → it governs how information is "remembered"
      • Starting from a random initialization does not work well
    • 🥲 Computing the convolution kernel K̄ is expensive
      • The repeated matrix powers Ā^k are costly to compute

      h_t = Ā h_{t-1} + B̄ x_t,   y_t = C̄ h_t
      y = x ∗ K̄,   K̄ = (C̄ B̄, C̄ Ā B̄, ⋯, C̄ Ā^k B̄, ⋯)

  25.–26. Structured State Space Models (S4) [Gu+'22]
    • 🥲 Training as-is does not work well
      • The matrix Ā is critically important → use the HiPPO matrix, which is good at remembering past inputs
    • 🥲 Computing the convolution kernel K̄ is expensive (the powers Ā^k)
      → derive a fast computation method
    • The theory and derivations behind both are omitted in this talk

    HiPPO matrix:
      Ā_{nk} = − ⎧ (2n+1)^{1/2} (2k+1)^{1/2}   (n > k)
                 ⎨ n + 1                        (n = k)
                 ⎩ 0                            (n < k)
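A minimal sketch constructing the HiPPO(-LegS) matrix from the formula above; in the HiPPO/S4 papers this matrix is used to initialize the state matrix. The helper name is mine.

```python
# HiPPO matrix: A[n, k] = -sqrt(2n+1)*sqrt(2k+1) if n > k, -(n+1) if n == k, 0 if n < k.
import numpy as np

def hippo_matrix(N):
    n = np.arange(N)[:, None]          # row index n
    k = np.arange(N)[None, :]          # column index k
    A = np.where(n > k, -np.sqrt(2 * n + 1) * np.sqrt(2 * k + 1), 0.0)
    A = A - np.diag(np.arange(N) + 1)  # diagonal: -(n + 1)
    return A

A4 = hippo_matrix(4)
print(A4[1, 0], A4[2, 1], A4.diagonal())   # ≈ -1.732, ≈ -3.873, [-1. -2. -3. -4.]
```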

  27.–32. Problem with S4: Ā, B̄, C̄ do not depend on the input x → no "dynamic" inference
      h_t = Ā h_{t-1} + B̄ x_t,   y_t = C̄ h_t

    [Figure: a copying task with fixed spacing, solved by sliding a fixed kernel K̄ = (1, 0, 0, ⋯) over the zero-padded input so that each input token reappears at the output after a fixed delay]
    • A task like this can be solved even by an ordinary SSM with fixed Ā, B̄, C̄
    • Tasks closer to In-Context Learning, however, require inference that adapts dynamically to the input
      • With Ā, B̄, C̄ fixed → 🥲 such tasks cannot be solved

  33. Mamba: ⼊⼒ 𝑥 に応じて動的に /
    𝑨, /
    𝑩, /
    𝑪 を計算
    Figure 2: (Left) The standard version of the Copying task involves constant spacing between input and output elements and is
    easily solved by time-invariant models such as linear recurrences and global convolutions. (Right Top) The Selective Copying task
    has random spacing in between inputs and requires time-varying models that can selectively remember or ignore inputs depending
    on their content. (Right Bottom) The Induction Heads task is an example of associative recall that requires retrieving an answer
    based on context, a key ability for LLMs.
    Algorithm 1 SSM (S4)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , )
    3: C ( , )
    4: ( ) ( )
    5: A, B ( , ) ( , A, B)
    6: (A, B, C)( )
    Time-invariant: recurrence or convolution
    7: return
    Algorithm 2 SSM + Selection (S6)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , , ) ( )
    3: C ( , , ) ( )
    4: ( , , ) ( + ( ))
    5: A, B ( , , , ) ( , A, B)
    6: (A, B, C)( )
    Time-varying: recurrence (scan) only
    7: return
    Algorithms 1 and 2 illustrates the main selection mechanism that we use. The main difference is simply making
    several parameters , B, C functions of the input, along with the associated changes to tensor shapes throughout.
    In particular, we highlight that these parameters now have a length dimension , meaning that the model has
    changed from time-invariant to time-varying. (Note that shape annotations were described in Section 2). This
    loses the equivalence to convolutions (3) with implications for its efficiency, discussed next.
    We specifically choose ( ) = ( ), ( ) = ( ), ( ) = ( 1
    ( )), and = ,
    Figure 2: (Left) The standard version of the Copying task involves constant spacing between input and output elements and is
    easily solved by time-invariant models such as linear recurrences and global convolutions. (Right Top) The Selective Copying task
    has random spacing in between inputs and requires time-varying models that can selectively remember or ignore inputs depending
    on their content. (Right Bottom) The Induction Heads task is an example of associative recall that requires retrieving an answer
    based on context, a key ability for LLMs.
    Algorithm 1 SSM (S4)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , )
    3: C ( , )
    4: ( ) ( )
    5: A, B ( , ) ( , A, B)
    6: (A, B, C)( )
    Time-invariant: recurrence or convolution
    7: return
    Algorithm 2 SSM + Selection (S6)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , , ) ( )
    3: C ( , , ) ( )
    4: ( , , ) ( + ( ))
    5: A, B ( , , , ) ( , A, B)
    6: (A, B, C)( )
    Time-varying: recurrence (scan) only
    7: return
    Algorithms 1 and 2 illustrates the main selection mechanism that we use. The main difference is simply making
    several parameters , B, C functions of the input, along with the associated changes to tensor shapes throughout.
    In particular, we highlight that these parameters now have a length dimension , meaning that the model has
    changed from time-invariant to time-varying. (Note that shape annotations were described in Section 2). This
    loses the equivalence to convolutions (3) with implications for its efficiency, discussed next.
    We specifically choose ( ) = ( ), ( ) = ( ), ( ) = ( 1
    ( )), and = ,
    .
    ℎ"
    = /
    𝑨ℎ"#$
    + /
    𝑩𝑥"
    𝑦"
    = /
    𝑪ℎ"
    33
    𝐵: バッチサイズ 𝐿: 系列長 𝐷: モデルの隠れ次元数 𝑁: SSM の状態 ℎ の次元数

    View full-size slide

  34. Mamba: ⼊⼒ 𝑥 に応じて動的に /
    𝑨, /
    𝑩, /
    𝑪 を計算
    Figure 2: (Left) The standard version of the Copying task involves constant spacing between input and output elements and is
    easily solved by time-invariant models such as linear recurrences and global convolutions. (Right Top) The Selective Copying task
    has random spacing in between inputs and requires time-varying models that can selectively remember or ignore inputs depending
    on their content. (Right Bottom) The Induction Heads task is an example of associative recall that requires retrieving an answer
    based on context, a key ability for LLMs.
    Algorithm 1 SSM (S4)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , )
    3: C ( , )
    4: ( ) ( )
    5: A, B ( , ) ( , A, B)
    6: (A, B, C)( )
    Time-invariant: recurrence or convolution
    7: return
    Algorithm 2 SSM + Selection (S6)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , , ) ( )
    3: C ( , , ) ( )
    4: ( , , ) ( + ( ))
    5: A, B ( , , , ) ( , A, B)
    6: (A, B, C)( )
    Time-varying: recurrence (scan) only
    7: return
    Algorithms 1 and 2 illustrates the main selection mechanism that we use. The main difference is simply making
    several parameters , B, C functions of the input, along with the associated changes to tensor shapes throughout.
    In particular, we highlight that these parameters now have a length dimension , meaning that the model has
    changed from time-invariant to time-varying. (Note that shape annotations were described in Section 2). This
    loses the equivalence to convolutions (3) with implications for its efficiency, discussed next.
    We specifically choose ( ) = ( ), ( ) = ( ), ( ) = ( 1
    ( )), and = ,
    Figure 2: (Left) The standard version of the Copying task involves constant spacing between input and output elements and is
    easily solved by time-invariant models such as linear recurrences and global convolutions. (Right Top) The Selective Copying task
    has random spacing in between inputs and requires time-varying models that can selectively remember or ignore inputs depending
    on their content. (Right Bottom) The Induction Heads task is an example of associative recall that requires retrieving an answer
    based on context, a key ability for LLMs.
    Algorithm 1 SSM (S4)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , )
    3: C ( , )
    4: ( ) ( )
    5: A, B ( , ) ( , A, B)
    6: (A, B, C)( )
    Time-invariant: recurrence or convolution
    7: return
    Algorithm 2 SSM + Selection (S6)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , , ) ( )
    3: C ( , , ) ( )
    4: ( , , ) ( + ( ))
    5: A, B ( , , , ) ( , A, B)
    6: (A, B, C)( )
    Time-varying: recurrence (scan) only
    7: return
    Algorithms 1 and 2 illustrates the main selection mechanism that we use. The main difference is simply making
    several parameters , B, C functions of the input, along with the associated changes to tensor shapes throughout.
    In particular, we highlight that these parameters now have a length dimension , meaning that the model has
    changed from time-invariant to time-varying. (Note that shape annotations were described in Section 2). This
    loses the equivalence to convolutions (3) with implications for its efficiency, discussed next.
    We specifically choose ( ) = ( ), ( ) = ( ), ( ) = ( 1
    ( )), and = ,
    .
    ℎ"
    = /
    𝑨ℎ"#$
    + /
    𝑩𝑥"
    𝑦"
    = /
    𝑪ℎ"
    34
    𝐵: バッチサイズ 𝐿: 系列長 𝐷: モデルの隠れ次元数 𝑁: SSM の状態 ℎ の次元数
    次元ごとに個別の SSM (ℝ → ℝ) を適⽤するイメージ

    View full-size slide

  35. Mamba: ⼊⼒ 𝑥 に応じて動的に /
    𝑨, /
    𝑩, /
    𝑪 を計算
    Figure 2: (Left) The standard version of the Copying task involves constant spacing between input and output elements and is
    easily solved by time-invariant models such as linear recurrences and global convolutions. (Right Top) The Selective Copying task
    has random spacing in between inputs and requires time-varying models that can selectively remember or ignore inputs depending
    on their content. (Right Bottom) The Induction Heads task is an example of associative recall that requires retrieving an answer
    based on context, a key ability for LLMs.
    Algorithm 1 SSM (S4)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , )
    3: C ( , )
    4: ( ) ( )
    5: A, B ( , ) ( , A, B)
    6: (A, B, C)( )
    Time-invariant: recurrence or convolution
    7: return
    Algorithm 2 SSM + Selection (S6)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , , ) ( )
    3: C ( , , ) ( )
    4: ( , , ) ( + ( ))
    5: A, B ( , , , ) ( , A, B)
    6: (A, B, C)( )
    Time-varying: recurrence (scan) only
    7: return
    Algorithms 1 and 2 illustrates the main selection mechanism that we use. The main difference is simply making
    several parameters , B, C functions of the input, along with the associated changes to tensor shapes throughout.
    In particular, we highlight that these parameters now have a length dimension , meaning that the model has
    changed from time-invariant to time-varying. (Note that shape annotations were described in Section 2). This
    loses the equivalence to convolutions (3) with implications for its efficiency, discussed next.
    We specifically choose ( ) = ( ), ( ) = ( ), ( ) = ( 1
    ( )), and = ,
    Figure 2: (Left) The standard version of the Copying task involves constant spacing between input and output elements and is
    easily solved by time-invariant models such as linear recurrences and global convolutions. (Right Top) The Selective Copying task
    has random spacing in between inputs and requires time-varying models that can selectively remember or ignore inputs depending
    on their content. (Right Bottom) The Induction Heads task is an example of associative recall that requires retrieving an answer
    based on context, a key ability for LLMs.
    Algorithm 1 SSM (S4)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , )
    3: C ( , )
    4: ( ) ( )
    5: A, B ( , ) ( , A, B)
    6: (A, B, C)( )
    Time-invariant: recurrence or convolution
    7: return
    Algorithm 2 SSM + Selection (S6)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , , ) ( )
    3: C ( , , ) ( )
    4: ( , , ) ( + ( ))
    5: A, B ( , , , ) ( , A, B)
    6: (A, B, C)( )
    Time-varying: recurrence (scan) only
    7: return
    Algorithms 1 and 2 illustrates the main selection mechanism that we use. The main difference is simply making
    several parameters , B, C functions of the input, along with the associated changes to tensor shapes throughout.
    In particular, we highlight that these parameters now have a length dimension , meaning that the model has
    changed from time-invariant to time-varying. (Note that shape annotations were described in Section 2). This
    loses the equivalence to convolutions (3) with implications for its efficiency, discussed next.
    We specifically choose ( ) = ( ), ( ) = ( ), ( ) = ( 1
    ( )), and = ,
    .
    ℎ"
    = /
    𝑨ℎ"#$
    + /
    𝑩𝑥"
    𝑦"
    = /
    𝑪ℎ"
    35
    𝐵: バッチサイズ 𝐿: 系列長 𝐷: モデルの隠れ次元数 𝑁: SSM の状態 ℎ の次元数
    次元ごとに個別の SSM (ℝ → ℝ) を適⽤するイメージ
    𝑥 に依存せず使いまわし

    View full-size slide

  36. Mamba: ⼊⼒ 𝑥 に応じて動的に /
    𝑨, /
    𝑩, /
    𝑪 を計算
    Figure 2: (Left) The standard version of the Copying task involves constant spacing between input and output elements and is
    easily solved by time-invariant models such as linear recurrences and global convolutions. (Right Top) The Selective Copying task
    has random spacing in between inputs and requires time-varying models that can selectively remember or ignore inputs depending
    on their content. (Right Bottom) The Induction Heads task is an example of associative recall that requires retrieving an answer
    based on context, a key ability for LLMs.
    Algorithm 1 SSM (S4)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , )
    3: C ( , )
    4: ( ) ( )
    5: A, B ( , ) ( , A, B)
    6: (A, B, C)( )
    Time-invariant: recurrence or convolution
    7: return
    Algorithm 2 SSM + Selection (S6)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , , ) ( )
    3: C ( , , ) ( )
    4: ( , , ) ( + ( ))
    5: A, B ( , , , ) ( , A, B)
    6: (A, B, C)( )
    Time-varying: recurrence (scan) only
    7: return
    Algorithms 1 and 2 illustrates the main selection mechanism that we use. The main difference is simply making
    several parameters , B, C functions of the input, along with the associated changes to tensor shapes throughout.
    In particular, we highlight that these parameters now have a length dimension , meaning that the model has
    changed from time-invariant to time-varying. (Note that shape annotations were described in Section 2). This
    loses the equivalence to convolutions (3) with implications for its efficiency, discussed next.
    We specifically choose ( ) = ( ), ( ) = ( ), ( ) = ( 1
    ( )), and = ,
    Figure 2: (Left) The standard version of the Copying task involves constant spacing between input and output elements and is
    easily solved by time-invariant models such as linear recurrences and global convolutions. (Right Top) The Selective Copying task
    has random spacing in between inputs and requires time-varying models that can selectively remember or ignore inputs depending
    on their content. (Right Bottom) The Induction Heads task is an example of associative recall that requires retrieving an answer
    based on context, a key ability for LLMs.
    Algorithm 1 SSM (S4)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , )
    3: C ( , )
    4: ( ) ( )
    5: A, B ( , ) ( , A, B)
    6: (A, B, C)( )
    Time-invariant: recurrence or convolution
    7: return
    Algorithm 2 SSM + Selection (S6)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , , ) ( )
    3: C ( , , ) ( )
    4: ( , , ) ( + ( ))
    5: A, B ( , , , ) ( , A, B)
    6: (A, B, C)( )
    Time-varying: recurrence (scan) only
    7: return
    Algorithms 1 and 2 illustrates the main selection mechanism that we use. The main difference is simply making
    several parameters , B, C functions of the input, along with the associated changes to tensor shapes throughout.
    In particular, we highlight that these parameters now have a length dimension , meaning that the model has
    changed from time-invariant to time-varying. (Note that shape annotations were described in Section 2). This
    loses the equivalence to convolutions (3) with implications for its efficiency, discussed next.
    We specifically choose ( ) = ( ), ( ) = ( ), ( ) = ( 1
    ( )), and = ,
    .
    ℎ"
    = /
    𝑨ℎ"#$
    + /
    𝑩𝑥"
    𝑦"
    = /
    𝑪ℎ"
    36
    𝐵: バッチサイズ 𝐿: 系列長 𝐷: モデルの隠れ次元数 𝑁: SSM の状態 ℎ の次元数
    次元ごとに個別の SSM (ℝ → ℝ) を適⽤するイメージ
    𝑥 に依存せず使いまわし
    ステップ幅 Δ も学習

    View full-size slide

  37. Mamba: ⼊⼒ 𝑥 に応じて動的に /
    𝑨, /
    𝑩, /
    𝑪 を計算
    Figure 2: (Left) The standard version of the Copying task involves constant spacing between input and output elements and is
    easily solved by time-invariant models such as linear recurrences and global convolutions. (Right Top) The Selective Copying task
    has random spacing in between inputs and requires time-varying models that can selectively remember or ignore inputs depending
    on their content. (Right Bottom) The Induction Heads task is an example of associative recall that requires retrieving an answer
    based on context, a key ability for LLMs.
    Algorithm 1 SSM (S4)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , )
    3: C ( , )
    4: ( ) ( )
    5: A, B ( , ) ( , A, B)
    6: (A, B, C)( )
    Time-invariant: recurrence or convolution
    7: return
    Algorithm 2 SSM + Selection (S6)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , , ) ( )
    3: C ( , , ) ( )
    4: ( , , ) ( + ( ))
    5: A, B ( , , , ) ( , A, B)
    6: (A, B, C)( )
    Time-varying: recurrence (scan) only
    7: return
    Algorithms 1 and 2 illustrates the main selection mechanism that we use. The main difference is simply making
    several parameters , B, C functions of the input, along with the associated changes to tensor shapes throughout.
    In particular, we highlight that these parameters now have a length dimension , meaning that the model has
    changed from time-invariant to time-varying. (Note that shape annotations were described in Section 2). This
    loses the equivalence to convolutions (3) with implications for its efficiency, discussed next.
    We specifically choose ( ) = ( ), ( ) = ( ), ( ) = ( 1
    ( )), and = ,
    Figure 2: (Left) The standard version of the Copying task involves constant spacing between input and output elements and is
    easily solved by time-invariant models such as linear recurrences and global convolutions. (Right Top) The Selective Copying task
    has random spacing in between inputs and requires time-varying models that can selectively remember or ignore inputs depending
    on their content. (Right Bottom) The Induction Heads task is an example of associative recall that requires retrieving an answer
    based on context, a key ability for LLMs.
    Algorithm 1 SSM (S4)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , )
    3: C ( , )
    4: ( ) ( )
    5: A, B ( , ) ( , A, B)
    6: (A, B, C)( )
    Time-invariant: recurrence or convolution
    7: return
    Algorithm 2 SSM + Selection (S6)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , , ) ( )
    3: C ( , , ) ( )
    4: ( , , ) ( + ( ))
    5: A, B ( , , , ) ( , A, B)
    6: (A, B, C)( )
    Time-varying: recurrence (scan) only
    7: return
    Algorithms 1 and 2 illustrates the main selection mechanism that we use. The main difference is simply making
    several parameters , B, C functions of the input, along with the associated changes to tensor shapes throughout.
    In particular, we highlight that these parameters now have a length dimension , meaning that the model has
    changed from time-invariant to time-varying. (Note that shape annotations were described in Section 2). This
    loses the equivalence to convolutions (3) with implications for its efficiency, discussed next.
    We specifically choose ( ) = ( ), ( ) = ( ), ( ) = ( 1
    ( )), and = ,
    .
    ℎ"
    = /
    𝑨ℎ"#$
    + /
    𝑩𝑥"
    𝑦"
    = /
    𝑪ℎ"
    37
    𝐵: バッチサイズ 𝐿: 系列長 𝐷: モデルの隠れ次元数 𝑁: SSM の状態 ℎ の次元数
    次元ごとに個別の SSM (ℝ → ℝ) を適⽤するイメージ
    𝑥 に依存せず使いまわし
    ステップ幅 Δ も学習
    時間依存しないので畳み込みも可能(並列化可)

    View full-size slide

  38. Mamba: ⼊⼒ 𝑥 に応じて動的に /
    𝑨, /
    𝑩, /
    𝑪 を計算
    Figure 2: (Left) The standard version of the Copying task involves constant spacing between input and output elements and is
    easily solved by time-invariant models such as linear recurrences and global convolutions. (Right Top) The Selective Copying task
    has random spacing in between inputs and requires time-varying models that can selectively remember or ignore inputs depending
    on their content. (Right Bottom) The Induction Heads task is an example of associative recall that requires retrieving an answer
    based on context, a key ability for LLMs.
    Algorithm 1 SSM (S4)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , )
    3: C ( , )
    4: ( ) ( )
    5: A, B ( , ) ( , A, B)
    6: (A, B, C)( )
    Time-invariant: recurrence or convolution
    7: return
    Algorithm 2 SSM + Selection (S6)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , , ) ( )
    3: C ( , , ) ( )
    4: ( , , ) ( + ( ))
    5: A, B ( , , , ) ( , A, B)
    6: (A, B, C)( )
    Time-varying: recurrence (scan) only
    7: return
    Algorithms 1 and 2 illustrates the main selection mechanism that we use. The main difference is simply making
    several parameters , B, C functions of the input, along with the associated changes to tensor shapes throughout.
    In particular, we highlight that these parameters now have a length dimension , meaning that the model has
    changed from time-invariant to time-varying. (Note that shape annotations were described in Section 2). This
    loses the equivalence to convolutions (3) with implications for its efficiency, discussed next.
    We specifically choose ( ) = ( ), ( ) = ( ), ( ) = ( 1
    ( )), and = ,
    Figure 2: (Left) The standard version of the Copying task involves constant spacing between input and output elements and is
    easily solved by time-invariant models such as linear recurrences and global convolutions. (Right Top) The Selective Copying task
    has random spacing in between inputs and requires time-varying models that can selectively remember or ignore inputs depending
    on their content. (Right Bottom) The Induction Heads task is an example of associative recall that requires retrieving an answer
    based on context, a key ability for LLMs.
    Algorithm 1 SSM (S4)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , )
    3: C ( , )
    4: ( ) ( )
    5: A, B ( , ) ( , A, B)
    6: (A, B, C)( )
    Time-invariant: recurrence or convolution
    7: return
    Algorithm 2 SSM + Selection (S6)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , , ) ( )
    3: C ( , , ) ( )
    4: ( , , ) ( + ( ))
    5: A, B ( , , , ) ( , A, B)
    6: (A, B, C)( )
    Time-varying: recurrence (scan) only
    7: return
    Algorithms 1 and 2 illustrates the main selection mechanism that we use. The main difference is simply making
    several parameters , B, C functions of the input, along with the associated changes to tensor shapes throughout.
    In particular, we highlight that these parameters now have a length dimension , meaning that the model has
    changed from time-invariant to time-varying. (Note that shape annotations were described in Section 2). This
    loses the equivalence to convolutions (3) with implications for its efficiency, discussed next.
    We specifically choose ( ) = ( ), ( ) = ( ), ( ) = ( 1
    ( )), and = ,
    .
    ℎ"
    = /
    𝑨ℎ"#$
    + /
    𝑩𝑥"
    𝑦"
    = /
    𝑪ℎ"
    38
    𝐵: バッチサイズ 𝐿: 系列長 𝐷: モデルの隠れ次元数 𝑁: SSM の状態 ℎ の次元数
    時間依存しないので畳み込みも可能(並列化可)

    View full-size slide

  39. Mamba: ⼊⼒ 𝑥 に応じて動的に /
    𝑨, /
    𝑩, /
    𝑪 を計算
    Figure 2: (Left) The standard version of the Copying task involves constant spacing between input and output elements and is
    easily solved by time-invariant models such as linear recurrences and global convolutions. (Right Top) The Selective Copying task
    has random spacing in between inputs and requires time-varying models that can selectively remember or ignore inputs depending
    on their content. (Right Bottom) The Induction Heads task is an example of associative recall that requires retrieving an answer
    based on context, a key ability for LLMs.
    Algorithm 1 SSM (S4)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , )
    3: C ( , )
    4: ( ) ( )
    5: A, B ( , ) ( , A, B)
    6: (A, B, C)( )
    Time-invariant: recurrence or convolution
    7: return
    Algorithm 2 SSM + Selection (S6)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , , ) ( )
    3: C ( , , ) ( )
    4: ( , , ) ( + ( ))
    5: A, B ( , , , ) ( , A, B)
    6: (A, B, C)( )
    Time-varying: recurrence (scan) only
    7: return
    Algorithms 1 and 2 illustrates the main selection mechanism that we use. The main difference is simply making
    several parameters , B, C functions of the input, along with the associated changes to tensor shapes throughout.
    In particular, we highlight that these parameters now have a length dimension , meaning that the model has
    changed from time-invariant to time-varying. (Note that shape annotations were described in Section 2). This
    loses the equivalence to convolutions (3) with implications for its efficiency, discussed next.
    We specifically choose ( ) = ( ), ( ) = ( ), ( ) = ( 1
    ( )), and = ,
    Figure 2: (Left) The standard version of the Copying task involves constant spacing between input and output elements and is
    easily solved by time-invariant models such as linear recurrences and global convolutions. (Right Top) The Selective Copying task
    has random spacing in between inputs and requires time-varying models that can selectively remember or ignore inputs depending
    on their content. (Right Bottom) The Induction Heads task is an example of associative recall that requires retrieving an answer
    based on context, a key ability for LLMs.
    Algorithm 1 SSM (S4)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , )
    3: C ( , )
    4: ( ) ( )
    5: A, B ( , ) ( , A, B)
    6: (A, B, C)( )
    Time-invariant: recurrence or convolution
    7: return
    Algorithm 2 SSM + Selection (S6)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , , ) ( )
    3: C ( , , ) ( )
    4: ( , , ) ( + ( ))
    5: A, B ( , , , ) ( , A, B)
    6: (A, B, C)( )
    Time-varying: recurrence (scan) only
    7: return
    Algorithms 1 and 2 illustrates the main selection mechanism that we use. The main difference is simply making
    several parameters , B, C functions of the input, along with the associated changes to tensor shapes throughout.
    In particular, we highlight that these parameters now have a length dimension , meaning that the model has
    changed from time-invariant to time-varying. (Note that shape annotations were described in Section 2). This
    loses the equivalence to convolutions (3) with implications for its efficiency, discussed next.
    We specifically choose ( ) = ( ), ( ) = ( ), ( ) = ( 1
    ( )), and = ,
    .
    ℎ"
    = /
    𝑨ℎ"#$
    + /
    𝑩𝑥"
    𝑦"
    = /
    𝑪ℎ"
    39
    𝐵: バッチサイズ 𝐿: 系列長 𝐷: モデルの隠れ次元数 𝑁: SSM の状態 ℎ の次元数
    時間依存しないので畳み込みも可能(並列化可)
    𝒙 に線形変換を噛ませる (𝐷 → 𝑁)

    View full-size slide

  40. Mamba: ⼊⼒ 𝑥 に応じて動的に /
    𝑨, /
    𝑩, /
    𝑪 を計算
    Figure 2: (Left) The standard version of the Copying task involves constant spacing between input and output elements and is
    easily solved by time-invariant models such as linear recurrences and global convolutions. (Right Top) The Selective Copying task
    has random spacing in between inputs and requires time-varying models that can selectively remember or ignore inputs depending
    on their content. (Right Bottom) The Induction Heads task is an example of associative recall that requires retrieving an answer
    based on context, a key ability for LLMs.
    Algorithm 1 SSM (S4)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , )
    3: C ( , )
    4: ( ) ( )
    5: A, B ( , ) ( , A, B)
    6: (A, B, C)( )
    Time-invariant: recurrence or convolution
    7: return
    Algorithm 2 SSM + Selection (S6)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , , ) ( )
    3: C ( , , ) ( )
    4: ( , , ) ( + ( ))
    5: A, B ( , , , ) ( , A, B)
    6: (A, B, C)( )
    Time-varying: recurrence (scan) only
    7: return
    Algorithms 1 and 2 illustrates the main selection mechanism that we use. The main difference is simply making
    several parameters , B, C functions of the input, along with the associated changes to tensor shapes throughout.
    In particular, we highlight that these parameters now have a length dimension , meaning that the model has
    changed from time-invariant to time-varying. (Note that shape annotations were described in Section 2). This
    loses the equivalence to convolutions (3) with implications for its efficiency, discussed next.
    We specifically choose ( ) = ( ), ( ) = ( ), ( ) = ( 1
    ( )), and = ,
    Figure 2: (Left) The standard version of the Copying task involves constant spacing between input and output elements and is
    easily solved by time-invariant models such as linear recurrences and global convolutions. (Right Top) The Selective Copying task
    has random spacing in between inputs and requires time-varying models that can selectively remember or ignore inputs depending
    on their content. (Right Bottom) The Induction Heads task is an example of associative recall that requires retrieving an answer
    based on context, a key ability for LLMs.
    Algorithm 1 SSM (S4)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , )
    3: C ( , )
    4: ( ) ( )
    5: A, B ( , ) ( , A, B)
    6: (A, B, C)( )
    Time-invariant: recurrence or convolution
    7: return
    Algorithm 2 SSM + Selection (S6)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , , ) ( )
    3: C ( , , ) ( )
    4: ( , , ) ( + ( ))
    5: A, B ( , , , ) ( , A, B)
    6: (A, B, C)( )
    Time-varying: recurrence (scan) only
    7: return
    Algorithms 1 and 2 illustrates the main selection mechanism that we use. The main difference is simply making
    several parameters , B, C functions of the input, along with the associated changes to tensor shapes throughout.
    In particular, we highlight that these parameters now have a length dimension , meaning that the model has
    changed from time-invariant to time-varying. (Note that shape annotations were described in Section 2). This
    loses the equivalence to convolutions (3) with implications for its efficiency, discussed next.
    We specifically choose ( ) = ( ), ( ) = ( ), ( ) = ( 1
    ( )), and = ,
    .
    ℎ"
    = /
    𝑨ℎ"#$
    + /
    𝑩𝑥"
    𝑦"
    = /
    𝑪ℎ"
    40
    𝐵: バッチサイズ 𝐿: 系列長 𝐷: モデルの隠れ次元数 𝑁: SSM の状態 ℎ の次元数
    時間依存しないので畳み込みも可能(並列化可)
    𝒙 に線形変換を噛ませる (𝐷 → 𝑁)
    Δ を経由して 2
    𝑨 も 𝒙 に依存

    View full-size slide

  41. Mamba: ⼊⼒ 𝑥 に応じて動的に /
    𝑨, /
    𝑩, /
    𝑪 を計算
    Figure 2: (Left) The standard version of the Copying task involves constant spacing between input and output elements and is
    easily solved by time-invariant models such as linear recurrences and global convolutions. (Right Top) The Selective Copying task
    has random spacing in between inputs and requires time-varying models that can selectively remember or ignore inputs depending
    on their content. (Right Bottom) The Induction Heads task is an example of associative recall that requires retrieving an answer
    based on context, a key ability for LLMs.
    Algorithm 1 SSM (S4)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , )
    3: C ( , )
    4: ( ) ( )
    5: A, B ( , ) ( , A, B)
    6: (A, B, C)( )
    Time-invariant: recurrence or convolution
    7: return
    Algorithm 2 SSM + Selection (S6)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , , ) ( )
    3: C ( , , ) ( )
    4: ( , , ) ( + ( ))
    5: A, B ( , , , ) ( , A, B)
    6: (A, B, C)( )
    Time-varying: recurrence (scan) only
    7: return
    Algorithms 1 and 2 illustrates the main selection mechanism that we use. The main difference is simply making
    several parameters , B, C functions of the input, along with the associated changes to tensor shapes throughout.
    In particular, we highlight that these parameters now have a length dimension , meaning that the model has
    changed from time-invariant to time-varying. (Note that shape annotations were described in Section 2). This
    loses the equivalence to convolutions (3) with implications for its efficiency, discussed next.
    We specifically choose ( ) = ( ), ( ) = ( ), ( ) = ( 1
    ( )), and = ,
    Figure 2: (Left) The standard version of the Copying task involves constant spacing between input and output elements and is
    easily solved by time-invariant models such as linear recurrences and global convolutions. (Right Top) The Selective Copying task
    has random spacing in between inputs and requires time-varying models that can selectively remember or ignore inputs depending
    on their content. (Right Bottom) The Induction Heads task is an example of associative recall that requires retrieving an answer
    based on context, a key ability for LLMs.
    Algorithm 1 SSM (S4)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , )
    3: C ( , )
    4: ( ) ( )
    5: A, B ( , ) ( , A, B)
    6: (A, B, C)( )
    Time-invariant: recurrence or convolution
    7: return
    Algorithm 2 SSM + Selection (S6)
    Input: ( , , )
    Output: ( , , )
    1: A ( , )
    Represents structured ◊ matrix
    2: B ( , , ) ( )
    3: C ( , , ) ( )
    4: ( , , ) ( + ( ))
    5: A, B ( , , , ) ( , A, B)
    6: (A, B, C)( )
    Time-varying: recurrence (scan) only
    7: return
    Algorithms 1 and 2 illustrates the main selection mechanism that we use. The main difference is simply making
    several parameters , B, C functions of the input, along with the associated changes to tensor shapes throughout.
    In particular, we highlight that these parameters now have a length dimension , meaning that the model has
    changed from time-invariant to time-varying. (Note that shape annotations were described in Section 2). This
    loses the equivalence to convolutions (3) with implications for its efficiency, discussed next.
    We specifically choose ( ) = ( ), ( ) = ( ), ( ) = ( 1
    ( )), and = ,
    .
    ℎ"
    = /
    𝑨ℎ"#$
    + /
    𝑩𝑥"
    𝑦"
    = /
    𝑪ℎ"
    41
    𝐵: バッチサイズ 𝐿: 系列長 𝐷: モデルの隠れ次元数 𝑁: SSM の状態 ℎ の次元数
    時間依存しないので畳み込みも可能(並列化可)
    𝒙 に線形変換を噛ませる (𝐷 → 𝑁)
    Δ を経由して 2
    𝑨 も 𝒙 に依存
    • ⼊⼒ 𝒙 と時間に依存に︕
    • 🥲 畳み込み不可(並列化不可)
    • どのように⾼速化すれば?

    View full-size slide

42. Mamba: compute Ā, B̄, C̄ dynamically from the input x
(Figure: side-by-side comparison of Algorithm 1 (SSM, S4) and Algorithm 2 (SSM + Selection, S6), as on the previous slide. S6 makes Δ, B, C functions of the input: s_B(x) = Linear_N(x), s_C(x) = Linear_N(x), s_Δ(x) = Broadcast_D(Linear_1(x)), τ_Δ = softplus; a naive reference sketch follows this slide.)
h_t = Ā h_{t-1} + B̄ x_t,   y_t = C̄ h_t
42
B: batch size, L: sequence length, D: model hidden dimension, N: dimension of the SSM state h
Time-invariant, so a convolution is also possible (parallelizable)
Apply a linear map to x (D → N)
Through Δ, Ā also depends on x
• Now depends on the input x and on time!
• 🥲 No convolution (no parallelization)
• How do we speed this up?
Mamba makes this fast with the following techniques:
• Parallel scan
• Kernel fusion
• Recomputation of activations
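To make Algorithm 2 concrete, here is a naive sequential reference sketch of the S6 recurrence in NumPy for a single batch element. It is a sketch under assumptions: the projection weights (W_B, W_C, W_dt) and the simplified discretization (zero-order hold for A, Δ·B for B̄) only follow the paper's high-level description, and none of the names correspond to the actual Mamba implementation.

```python
import numpy as np

def selective_ssm(x, A, W_B, W_C, W_dt, b_dt):
    """Naive sequential S6 reference for one batch element.
    x: (L, D) inputs, A: (D, N) state matrix (should be negative for stability),
    W_B, W_C: (D, N) input-dependent projections, W_dt: (D, D), b_dt: (D,)."""
    L, D = x.shape
    B = x @ W_B                              # (L, N): B_t depends on x_t
    C = x @ W_C                              # (L, N): C_t depends on x_t
    dt = np.logaddexp(0.0, x @ W_dt + b_dt)  # softplus -> Δ_t > 0, shape (L, D)
    h = np.zeros((D, A.shape[1]))
    y = np.empty((L, D))
    for t in range(L):
        A_bar = np.exp(dt[t][:, None] * A)                         # ZOH: exp(Δ_t A)
        B_bar_x = dt[t][:, None] * B[t][None, :] * x[t][:, None]   # simplified B̄_t x_t ≈ Δ_t B_t x_t
        h = A_bar * h + B_bar_x                                    # h_t = Ā_t h_{t-1} + B̄_t x_t
        y[t] = h @ C[t]                                            # y_t = C_t h_t (per channel)
    return y

rng = np.random.default_rng(0)
L, D, N = 16, 4, 8
out = selective_ssm(rng.normal(size=(L, D)), -np.abs(rng.normal(size=(D, N))),
                    rng.normal(size=(D, N)), rng.normal(size=(D, N)),
                    0.1 * rng.normal(size=(D, D)), np.zeros(D))
print(out.shape)  # (16, 4)
```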

43. The SSM recurrence can be viewed as a scan
• The parameters Ā, B̄, C̄ depend on the input x_t → the convolutional form is no longer available
• The SSM recurrence can be viewed as a scan
• A scan can be parallelized
Scan (cumulative sum):
  input:  1 2 3 4 5 6 7 8
  output: 1 3 6 10 15 21 28 36
43

44. The SSM recurrence can be viewed as a scan
• The parameters Ā, B̄, C̄ depend on the input x_t → the convolutional form is no longer available
• The SSM recurrence can be viewed as a scan
• A scan can be parallelized
Scan (cumulative sum):
  input:  1 2 3 4 5 6 7 8
  output: 1 3 6 10 15 21 28 36
SSM recurrence ≈ scan: compute the current state/output from the input and the previous state/output
  h_t = Ā h_{t-1} + B̄ x_t,   y_t = C̄ h_t
(Figure: inputs x_0 … x_7 feeding the chain of states h_0 … h_7; a minimal scalar sketch follows this slide)
44
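Why the recurrence is a scan: the update h ↦ a·h + b composes associatively, so the prefix "products" of the pairs (Ā_t, B̄_t x_t) yield every state h_t. A minimal scalar sketch (values and names are illustrative only):

```python
from itertools import accumulate

def combine(left, right):
    # Apply the 'left' step first, then 'right': h -> a_r * (a_l * h + b_l) + b_r
    a_l, b_l = left
    a_r, b_r = right
    return (a_r * a_l, a_r * b_l + b_r)

# h_t = a_t * h_{t-1} + b_t with h_0 = 0, expressed as a scan over (a_t, b_t)
a = [0.9, 0.5, 0.8, 1.0]
b = [1.0, 2.0, 3.0, 4.0]
h_scan = [hb for _, hb in accumulate(zip(a, b), combine)]

# Sequential reference for comparison
h, h_seq = 0.0, []
for a_t, b_t in zip(a, b):
    h = a_t * h + b_t
    h_seq.append(h)

assert all(abs(p - q) < 1e-12 for p, q in zip(h_scan, h_seq))
print(h_scan)  # [1.0, 2.5, 5.0, 9.0]
```

Because combine is associative, the same result can be computed on a binary tree, which is exactly the parallel scan described next.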

45.–49. Parallelizing the scan: parallel scan
David Walker, "Parallel Scans & Prefix Sums." COS 326, Princeton University
(Figure, built up over slides 45–49: the inputs 1 2 3 4 5 6 7 8 are the leaves of a binary tree; each node stores a range r, a sum s, and a left value l. The up-sweep fills the sums level by level: 3 7 11 15, then 10 26, then the root sum 36, in log₂ L steps.)
• Split the input into pairs → compute the sums on a binary tree
• Pair sums at the same depth can be computed in parallel

50.–55. Parallelizing the scan: parallel scan
David Walker, "Parallel Scans & Prefix Sums." COS 326, Princeton University
(Figure, built up over slides 50–55: starting from the root with l = 0, the down-sweep fills each node's left value, i.e. the sum of everything to its left, level by level: 0 10, then 0 3 10 21, then 0 1 3 6 10 15 21 28 at the leaves. Adding each leaf's own value gives the outputs 1 3 6 10 15 21 28 36.)
How the left values are computed:
• left child … copy the parent's left
• right child … parent's left + left sibling's sum
→ parallelizable in the same way (another log₂ L levels; a small Python sketch of the up-sweep / down-sweep follows below)
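As a concrete reference, here is a small sequential Python sketch of this up-sweep / down-sweep (Blelloch-style) prefix sum; on a GPU the inner loops over i at each depth are the parts that run in parallel. The names are ours, not from any particular library.

```python
import math
from itertools import accumulate

def blelloch_inclusive_scan(xs):
    """Work-efficient scan on a length-2^k list.
    The up-sweep builds the pairwise sums of the slides ("sum");
    the down-sweep distributes the left-sibling totals ("left");
    the inclusive result is left + own value."""
    n = len(xs)
    assert n and n & (n - 1) == 0, "length must be a power of two for this sketch"
    a = list(xs)
    # Up-sweep: at each depth d, all pair sums are independent -> parallelizable
    for d in range(int(math.log2(n))):
        step = 2 ** (d + 1)
        for i in range(0, n, step):               # these iterations could run in parallel
            a[i + step - 1] += a[i + 2 ** d - 1]
    # Down-sweep: propagate "left" (sum of everything to the left) back to the leaves
    a[n - 1] = 0
    for d in reversed(range(int(math.log2(n)))):
        step = 2 ** (d + 1)
        for i in range(0, n, step):               # also parallel within a depth
            t = a[i + 2 ** d - 1]
            a[i + 2 ** d - 1] = a[i + step - 1]   # left child copies parent's "left"
            a[i + step - 1] += t                  # right child adds left sibling's sum
    return [left + x for left, x in zip(a, xs)]   # exclusive -> inclusive

print(blelloch_inclusive_scan([1, 2, 3, 4, 5, 6, 7, 8]))
# [1, 3, 6, 10, 15, 21, 28, 36] -- matches the slide
assert blelloch_inclusive_scan([1, 2, 3, 4, 5, 6, 7, 8]) == list(accumulate([1, 2, 3, 4, 5, 6, 7, 8]))
```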

56.–61. Kernel fusion / recomputation of activations: reduce HBM ↔ SRAM traffic
• The GPU memory hierarchy
  • HBM … large capacity, slow
  • SRAM … small capacity, fast
• Moving data back and forth between HBM and SRAM is a major bottleneck
• Kernel fusion is used to reduce HBM ↔ SRAM traffic
  • Usual approach: assemble the scan inputs in HBM → compute in SRAM → write the outputs back to HBM
  • Mamba: uses SRAM already from the construction of the scan inputs
• Recomputation of activations
  • Activations are normally stored in HBM for the backward pass
  • During backward this again causes HBM ↔ SRAM traffic
  • Recomputing the activations in SRAM is faster
(Figure 1 excerpt: the selective SSM only materializes the expanded state h_t in the more efficient levels of the GPU memory hierarchy; discretization of Δ_t and the scan run in GPU SRAM while inputs and outputs live in GPU HBM. A PyTorch-level analogy for recomputation follows below.)
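Mamba implements kernel fusion and recomputation inside a custom CUDA kernel, which cannot be reproduced briefly here. As a loose PyTorch-level analogy for the recomputation idea only, torch.utils.checkpoint recomputes a block's activations during backward instead of storing them; the module below is a made-up stand-in, not Mamba's scan.

```python
import torch
from torch.utils.checkpoint import checkpoint

class ScanBlock(torch.nn.Module):
    """Hypothetical block: without checkpointing, its intermediate activations
    would be kept in (HBM) memory for backward; with checkpointing they are
    recomputed during the backward pass instead."""
    def __init__(self, d):
        super().__init__()
        self.proj_in = torch.nn.Linear(d, d)
        self.proj_out = torch.nn.Linear(d, d)

    def _inner(self, x):
        h = torch.nn.functional.silu(self.proj_in(x))
        return self.proj_out(h)

    def forward(self, x):
        # use_reentrant=False: discard _inner's activations and recompute them in backward
        return checkpoint(self._inner, x, use_reentrant=False)

x = torch.randn(2, 16, 64, requires_grad=True)
loss = ScanBlock(64)(x).sum()
loss.backward()
```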

62. The overall Mamba architecture
62
(Figure 3: block diagrams of H3, Gated MLP, and the Mamba block: linear projections, a causal convolution, the SSM, and a multiplicative gate with σ nonlinearities. A rough PyTorch sketch follows below.)
Figure 3 (Architecture): Our simplified block design combines the H3 block, which is the basis of most SSM architectures, with the ubiquitous MLP block of modern neural networks. Instead of interleaving these two blocks, we simply repeat the Mamba block homogenously. Compared to the H3 block, Mamba replaces the first multiplicative gate with an activation function. Compared to the MLP block, Mamba adds an SSM to the main branch. For σ we use the SiLU / Swish activation (Hendrycks and Gimpel 2016; Ramachandran, Zoph, and Quoc V Le 2017).
This block is repeated, interleaved with standard normalization and residual connections, to form the Mamba architecture; the expansion factor is fixed to E = 2 so that two stacked blocks match the 12D² parameters of a Transformer's interleaved MHA (multi-head attention) and MLP blocks.
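A rough PyTorch sketch of this block wiring (a sketch only: the ssm method is an identity placeholder where the selective scan / fused S6 kernel would go, and the hyperparameter names are our own, not the official implementation's):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlockSketch(nn.Module):
    """Rough sketch of the block wiring in Figure 3 (expansion factor E = 2)."""
    def __init__(self, d_model, d_conv=4, expand=2):
        super().__init__()
        d_inner = expand * d_model
        self.in_proj = nn.Linear(d_model, 2 * d_inner)       # main branch x and gate branch z
        self.conv = nn.Conv1d(d_inner, d_inner, d_conv,
                              groups=d_inner, padding=d_conv - 1)  # causal depthwise conv
        self.out_proj = nn.Linear(d_inner, d_model)

    def ssm(self, x):
        return x  # placeholder: a real block applies the selective scan here

    def forward(self, u):                 # u: (batch, length, d_model)
        x, z = self.in_proj(u).chunk(2, dim=-1)
        x = self.conv(x.transpose(1, 2))[..., : u.shape[1]].transpose(1, 2)
        x = F.silu(x)
        y = self.ssm(x)
        y = y * F.silu(z)                 # multiplicative gate with the z branch
        return self.out_proj(y)

out = MambaBlockSketch(d_model=64)(torch.randn(2, 32, 64))
print(out.shape)  # torch.Size([2, 32, 64])
```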

63. Mamba: Selective Copying / Induction Heads
63
Table 1 (Selective Copying): accuracy for combinations of architectures and inner sequence layers.
Model   Arch.    Layer   Acc.
S4      No gate  S4      18.3
-       No gate  S6      97.0
H3      H3       S4      57.0
Hyena   H3       Hyena   30.1
-       H3       S6      99.7
-       Mamba    S4      56.4
-       Mamba    Hyena   28.4
Mamba   Mamba    S6      99.8
Table 2 (Induction Heads): models are trained on sequence length 2^8 = 256 and tested on increasing sequence lengths from 2^6 = 64 up to 2^20 = 1,048,576. Full numbers in Table 11.
4.1.2 Induction Heads
Induction heads (Olsson et al. 2022) is a simple task from the mechanistic interpretability lens (Elhage et al. 2021) that is surprisingly predictive of the in-context learning ability of LLMs. It requires models to perform associative recall.
Figure 2: (Left) The standard version of the Copying task involves constant spacing between input and output elements and is easily solved by time-invariant models such as linear recurrences and global convolutions; it is perfectly solved by LTI (e.g. convolutional) models that do not need to look at the actual inputs. (Right Top) The Selective Copying task has random spacing in between inputs and requires time-varying models that can selectively remember or ignore inputs depending on their content. (Right Bottom) The Induction Heads task is an example of associative recall that requires retrieving an answer based on context, a key ability for LLMs.

64. Mamba: the selective SSM enables input-dependent inference
64
(Same Table 1, Table 2, and Figure 2 as on the previous slide: the selective (S6) layer solves Selective Copying and Induction Heads, which LTI layers such as S4 and Hyena fail at, because its parameters can depend on the content of the inputs.)

65. Mamba: scaling behavior comparable to Transformers
65
Figure 4 (Scaling Laws): models of size ≈125M to 1.3B parameters, trained on the Pile. Mamba scales better than all other attention-free models and is the first to match the performance of a very strong "Transformer++" recipe that has now become standard, particularly as the sequence length grows.
The Transformer++ recipe includes improvements such as rotary embeddings, SwiGLU MLP, RMSNorm instead of LayerNorm, no linear bias, and higher learning rates; other recent subquadratic architectures are also compared (Figure 4). All model details are in Appendix E.2.

66. Mamba: scaling behavior comparable to Transformers
66
(Figure 4 again, as on the previous slide.)
The x-axis is FLOPs, so Mamba's hardware-level optimizations do not put the other models at a disadvantage (as they could if the x-axis were wall-clock time).

67. Mamba: outperforms Transformers at half the model size
67
Table 3 (Zero-shot Evaluations): best results for each size in bold. We compare against open-source LMs with various tokenizers, trained for up to 300B tokens. "Pile" refers to the validation split, comparing only against models trained on the same dataset and tokenizer (GPT-NeoX-20B). For each model size, Mamba is best-in-class on every single evaluation result, and generally matches baselines at twice the model size.
Model / Tokenizer / Pile ppl / LAMBADA ppl / LAMBADA acc / HellaSwag acc / PIQA acc / Arc-E acc / Arc-C acc / WinoGrande acc / Average acc
    Hybrid H3-130M GPT2 — 89.48 25.77 31.7 64.2 44.4 24.2 50.6 40.1
    Pythia-160M NeoX 29.64 38.10 33.0 30.2 61.4 43.2 24.1 51.9 40.6
    Mamba-130M NeoX 10.56 16.07 44.3 35.3 64.5 48.0 24.3 51.9 44.7
    Hybrid H3-360M GPT2 — 12.58 48.0 41.5 68.1 51.4 24.7 54.1 48.0
    Pythia-410M NeoX 9.95 10.84 51.4 40.6 66.9 52.1 24.6 53.8 48.2
    Mamba-370M NeoX 8.28 8.14 55.6 46.5 69.5 55.1 28.0 55.3 50.0
    Pythia-1B NeoX 7.82 7.92 56.1 47.2 70.7 57.0 27.1 53.5 51.9
    Mamba-790M NeoX 7.33 6.02 62.7 55.1 72.1 61.2 29.5 56.1 57.1
    GPT-Neo 1.3B GPT2 — 7.50 57.2 48.9 71.1 56.2 25.9 54.9 52.4
    Hybrid H3-1.3B GPT2 — 11.25 49.6 52.6 71.3 59.2 28.1 56.9 53.0
    OPT-1.3B OPT — 6.64 58.0 53.7 72.4 56.7 29.6 59.5 55.0
    Pythia-1.4B NeoX 7.51 6.08 61.7 52.1 71.0 60.5 28.5 57.2 55.2
    RWKV-1.5B NeoX 7.70 7.04 56.4 52.5 72.4 60.5 29.4 54.6 54.3
    Mamba-1.4B NeoX 6.80 5.04 64.9 59.1 74.2 65.5 32.8 61.5 59.7
    GPT-Neo 2.7B GPT2 — 5.63 62.2 55.8 72.1 61.1 30.2 57.6 56.5
    Hybrid H3-2.7B GPT2 — 7.92 55.7 59.7 73.3 65.6 32.3 61.4 58.0
    OPT-2.7B OPT — 5.12 63.6 60.6 74.8 60.8 31.3 61.0 58.7
    Pythia-2.8B NeoX 6.73 5.04 64.7 59.3 74.0 64.1 32.9 59.7 59.1
    RWKV-3B NeoX 7.00 5.24 63.9 59.6 73.7 67.8 33.1 59.6 59.6
    Mamba-2.8B NeoX 6.22 4.23 69.2 66.1 75.2 69.7 36.3 63.5 63.3
    GPT-J-6B GPT2 – 4.10 68.3 66.3 75.4 67.0 36.6 64.1 63.0
    OPT-6.7B OPT – 4.25 67.7 67.2 76.3 65.6 34.9 65.5 62.9
    Pythia-6.9B NeoX 6.51 4.45 67.1 64.0 75.2 67.3 35.5 61.3 61.7
    RWKV-7.4B NeoX 6.31 4.38 67.2 65.5 76.1 67.8 37.5 61.0 62.5

68. Mamba: summary
• The selective SSM realizes input-dependent inference on x, which was a limitation of S4
• Instead of parallelizing via convolution, it achieves speed with GPU-aware techniques:
  • Parallel scan
  • Kernel fusion
  • Recomputation of activations
• Records performance approaching that of Transformers
• Many open questions remain:
  • Does it scale beyond 2.8B?
  • How about training instability?
  • What do Chinchilla-style scaling laws look like for Mamba?
  • How does hyperparameter selection differ from Transformers?
68
(Figure 1, Selective State Space Model with Hardware-aware State Expansion: structured SSMs independently map each channel of an input x to an output y through a higher-dimensional latent state h. Prior SSMs avoid materializing this large effective state through alternate computation paths that require time-invariance; the selection mechanism adds back input-dependent dynamics, together with a hardware-aware algorithm that only materializes the expanded states in the more efficient levels of the GPU memory hierarchy.)
(Figure 3, Architecture: the H3 / Gated MLP / Mamba block comparison shown earlier.)

69. Mamba @ Kotoba Tech: developing a library for large-scale distributed training
• kotomamba
• https://github.com/kotoba-tech/kotomamba
• Based on kotoba-recipes (based on llama-recipes (Meta)) and the Mamba implementation by Tri Dao & Albert Gu
• Large-scale distributed training with FSDP (see the sketch after this slide)
• Supports both pretraining and continued training (fine-tuning)
• Can train any Mamba model distributed in 🤗 Transformers format
• A tech blog post is coming soon!
69
Fujii-san and I are leading the project
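A minimal sketch of the general FSDP wrapping pattern, assumed to be launched with torchrun; it illustrates PyTorch's FSDP API with an arbitrary stand-in model and is not kotomamba's actual code.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")                    # assumes torchrun sets rank/world-size env vars
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8).cuda()  # stand-in model
model = FSDP(model)                                # parameters are sharded across ranks
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 128, 512, device="cuda")
loss = model(x).float().pow(2).mean()              # dummy loss for illustration
loss.backward()
optim.step()
```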


70. Mamba @ Kotoba Tech: ABCI Grand Challenge
• A joint team of the Sakaguchi Lab (Tohoku Univ.), the Yokota Lab and Okazaki Lab (Tokyo Tech), and Kotoba Tech plans to take part in the ABCI Grand Challenge
• V-week: exclusive use of 128 V-nodes (NVIDIA V100 GPU x 512) for one week
• Plan to train a 2.8B Japanese Mamba language model
• Large-scale parallel training with kotomamba
70

71. Mamba @ Kotoba Tech: preliminary experiments for the ABCI Grand Challenge
71
• Hyperparameters follow the Mamba paper
• 130M: confirmed that training works well
  • 150B tokens for en-Pile
  • 220B tokens for a mixed Japanese-English corpus
  • Tried doubling the learning rate etc., but no loss spikes
• 1.4B
  • Trained for about 80B tokens, stable with no spikes
• 2.8B
  • No spikes so far

72. Mamba @ Kotoba Tech: preliminary experiments for the ABCI Grand Challenge
• 3.0B → training becomes somewhat unstable
  • A setting not covered in the Mamba paper (2.8B is the largest there)
  • The number of layers was increased from 64 (2.8B) to 70 (3.0B)
  • Hypothesis: the "depth vs. width" balance matters? (Do the dimensions etc. need to grow as well?)
72

73. Mamba @ Kotoba Tech: preliminary experiments for the ABCI Grand Challenge
73
🥲 Unstable FLOPs throughput on ABCI
• An ABCI problem? Caused by a library? By the Mamba implementation? → most likely caused by our in-house dataloader
• GPU temperatures are fine
• A similar issue occurred in the Grand Challenge that Kojima, Kasai, and Kurita took part in → resolved by a reboot

74. Further reading / references
    SSM / S4
    • The Annotated S4
    • Do we need Attention? - Linear RNNs and State Space Models (SSMs) for NLP
    • MedAI #41: Efficiently Modeling Long Sequences with Structured State Spaces | Albert Gu
    • HiPPO/S4解説 - Morpho Tech Blog - モルフォ
    • [解説資料] Hyena Hierarchy: Towards Larger Convolutional Language Models
    Mamba
    • Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Paper Explained)
    • Mamba and S4 Explained: Architecture, Parallel Scan, Kernel Fusion, Recurrent, Convolution, Math
    Parallel scan
    • GPU Gems 3 Chapter 39. Parallel Prefix Sum (Scan) with CUDA
    • Parallel Scans & Prefix Sums
    Implementation
    • https://github.com/state-spaces/mamba
    • https://github.com/alxndrTL/mamba.py
    • https://github.com/johnma2006/mamba-minimal
    74
