
Deep State Space Models 101 / Mamba

Hiroto Kurita
January 20, 2024


Slide deck used at the Kotoba Technologies Seminar Series on 2024/01/19.
Contact: https://twitter.com/hiroto_kurita


Transcript

  1. Self-introduction
     • Hiroto Kurita (栗田 宙人)
     • Research / ML Engineer at Kotoba Technologies
     • M2 student, Sakaguchi Lab (TohokuNLP), Tohoku University
     • X: @hiroto_kurita / https://kurita.dev

  2. About this talk
     • Goal: get a rough understanding of Deep State Space Models, which are strong at long-sequence processing
       • Focus on Mamba [Gu & Dao '23]
       • Overly technical topics are omitted
     • Outline
       • Introduction to State Space Models / S4 [Gu+ '22]
       • Mamba
       • Kotoba Tech.'s work on Mamba
     • Note: unless another source is cited, all figures that are not my own are taken from [Gu & Dao '23] and [Gu+ '22]

  3. State Space Models (SSMs) are sequence-to-sequence transformers
     • Input/output: $(x_0, x_1, \dots, x_{L-1}) \mapsto (y_0, y_1, \dots, y_{L-1})$ with $x_t \in \mathbb{R}$, $y_t \in \mathbb{R}$
       • In practice the inputs and outputs are vectors (the scalar case extends easily)
       • The same holds for Attention and for SSMs
     • An SSM layer can be dropped into a neural model just like Attention
       (figure: a stack of Normalization → Linear → SSM layers shown side by side with an Attention stack)

  4-6. SSMs aim for the best of both Transformers and RNNs
     • Transformer (focusing on Attention)
       • Inference: 😫 $O(L^2)$ (every step looks back at all past inputs)
       • Training: 😊 parallelizable
     • RNN
       • Inference: 😊 $O(L)$
       • Training: 😫 not parallelizable
     • SSM (≈ linear RNN)
       • Inference: 😊 $O(L)$ (each step looks only at the current input and the previous "state" $h_t$)
       • Training: 😊 parallelizable

  7-8. SSMs: formulation (continuous time)
     $$h'(t) = \mathbf{A}h(t) + \mathbf{B}x(t), \qquad y(t) = \mathbf{C}h(t) + \mathbf{D}x(t)$$
     • The input $x(t) \in \mathbb{R}$ is mapped to the output $y(t) \in \mathbb{R}$ through the state $h(t) \in \mathbb{R}^N$
     • Long used in control engineering, time-series analysis, and related fields
     • $\mathbf{A}$ through $\mathbf{D}$ are parameters
       • They do not change over time and do not depend on $x(t)$
       • From here on we take $\mathbf{D} = \mathbf{0}$: the $\mathbf{D}x(t)$ term acts as a skip connection and can be omitted
     • Since everything is defined in continuous time, this is a mapping between functions $x \mapsto y$
       • To handle sequences (language, ...) it must be discretized

  9-11. SSMs: discretization yields a recurrent form
     • Continuous form:
       $$h'(t) = \mathbf{A}h(t) + \mathbf{B}x(t), \qquad y(t) = \mathbf{C}h(t)$$
     • Discrete, recurrent form:
       $$h_t = \bar{\mathbf{A}}h_{t-1} + \bar{\mathbf{B}}x_t, \qquad y_t = \bar{\mathbf{C}}h_t$$
       • As in the linear-RNN picture above, each step looks only at the current input and the previous "state"
     • Example: discretization by the Euler method
       • From $h'(t) = \lim_{\Delta \to 0} \frac{h(t+\Delta) - h(t)}{\Delta}$ we use the approximation $h'(t) \approx \frac{h(t+\Delta) - h(t)}{\Delta}$, which gives
         $$h(t+\Delta) = \mathbf{A}\Delta\, h(t) + h(t) + \mathbf{B}\Delta\, x(t) = (\mathbf{A}\Delta + \mathbf{I})\,h(t) + \mathbf{B}\Delta\, x(t) = \bar{\mathbf{A}}h(t) + \bar{\mathbf{B}}x(t), \qquad \bar{\mathbf{A}} := \mathbf{A}\Delta + \mathbf{I},\ \ \bar{\mathbf{B}} := \mathbf{B}\Delta$$
       • Writing $x_t, y_t, h_t := x(t\Delta), y(t\Delta), h(t\Delta)$ gives $h_t = \bar{\mathbf{A}}h_{t-1} + \bar{\mathbf{B}}x_t$, $y_t = \bar{\mathbf{C}}h_t$ (with $\bar{\mathbf{C}} := \mathbf{C}$)
     • (A small NumPy sketch of this discretization and the resulting recurrence follows below)

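A minimal NumPy sketch of the Euler discretization and recurrent evaluation described above; it is not the author's or the papers' code, and the sizes, random parameters, and the helper name `ssm_recurrent` are made up for illustration.

```python
import numpy as np

# Continuous SSM: h'(t) = A h(t) + B x(t),  y(t) = C h(t),  with state size N.
N = 4
rng = np.random.default_rng(0)
A = -np.eye(N) + 0.1 * rng.standard_normal((N, N))   # continuous-time state matrix
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
delta = 0.1                                          # step size Δ

# Euler discretization from the slide: A_bar = A Δ + I,  B_bar = B Δ
A_bar = np.eye(N) + delta * A
B_bar = delta * B

def ssm_recurrent(x):
    """h_t = A_bar h_{t-1} + B_bar x_t,  y_t = C h_t  (recurrent / inference mode)."""
    h = np.zeros((N, 1))
    ys = []
    for x_t in x:
        h = A_bar @ h + B_bar * x_t
        ys.append((C @ h).item())
    return np.array(ys)

x = rng.standard_normal(16)   # a toy scalar input sequence of length 16
y = ssm_recurrent(x)
print(y.shape)                # (16,)
```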
  12-21. SSMs at training time: linearity allows a fast, parallelizable convolution
     • 😫 RNNs in general cannot be parallelized over time
     • 😊 Because the SSM is linear, it can be rewritten as a convolution and parallelized
     • Unrolling the recurrence $h_t = \bar{\mathbf{A}}h_{t-1} + \bar{\mathbf{B}}x_t$, $y_t = \bar{\mathbf{C}}h_t$ with $h_{-1} = \mathbf{0}$:
       $$h_0 = \bar{\mathbf{B}}x_0, \quad h_1 = \bar{\mathbf{A}}\bar{\mathbf{B}}x_0 + \bar{\mathbf{B}}x_1, \quad h_2 = \bar{\mathbf{A}}^2\bar{\mathbf{B}}x_0 + \bar{\mathbf{A}}\bar{\mathbf{B}}x_1 + \bar{\mathbf{B}}x_2, \ \dots$$
       $$y_k = \bar{\mathbf{C}}h_k = \bar{\mathbf{C}}\bar{\mathbf{A}}^{k}\bar{\mathbf{B}}x_0 + \bar{\mathbf{C}}\bar{\mathbf{A}}^{k-1}\bar{\mathbf{B}}x_1 + \dots + \bar{\mathbf{C}}\bar{\mathbf{B}}x_k$$
     • So the output is a 1-D convolution, $y = x * \bar{K}$ with kernel
       $$\bar{K} = (\bar{\mathbf{C}}\bar{\mathbf{B}},\ \bar{\mathbf{C}}\bar{\mathbf{A}}\bar{\mathbf{B}},\ \dots,\ \bar{\mathbf{C}}\bar{\mathbf{A}}^{k}\bar{\mathbf{B}},\ \dots)$$
       • (Figure: the kernel $\bar{K}$ slides over the zero-padded input $x$; each output $y_t$ is an elementwise product followed by a sum)
     • At training time the whole input $x$ is known in advance, so $\bar{K}$ can be precomputed
       • Once $\bar{K}$ is available, the convolution itself is parallelizable
       • The convolution can also be computed quickly via FFT / iFFT
     • (A small equivalence check between the recurrent and convolutional forms is sketched below)

  22. SSMs in summary: behave like an RNN at inference time and like a Transformer at training time
     • Inference: fast recurrent computation, $h_t = \bar{\mathbf{A}}h_{t-1} + \bar{\mathbf{B}}x_t$, $y_t = \bar{\mathbf{C}}h_t$, looking only at the current input and the previous state
     • Training: parallelizable like a Transformer (or a CNN) via $y = x * \bar{K}$, $\bar{K} = (\bar{\mathbf{C}}\bar{\mathbf{B}}, \bar{\mathbf{C}}\bar{\mathbf{A}}\bar{\mathbf{B}}, \dots, \bar{\mathbf{C}}\bar{\mathbf{A}}^{k}\bar{\mathbf{B}}, \dots)$, where $\bar{K}$ can be precomputed and the convolution accelerated with FFT / iFFT

  23. Problems with SSMs (≈ linear RNNs)
     • 🥲 Training them as-is does not work well
       • The matrix $\bar{\mathbf{A}}$ is critical: the state $h$ has to "remember" all past information, and $\bar{\mathbf{A}}$ is what links $h_{t-1}$ to $h_t$, i.e. it governs that memory
       • Starting from a random initialization does not work well
     • 🥲 Computing the convolution kernel $\bar{K}$ is expensive, because of the matrix powers $\bar{\mathbf{A}}^{k}$

  24-25. Structured State Space Models (S4) [Gu+ '22]
     • 🥲 Training as-is does not work well; the matrix $\bar{\mathbf{A}}$ is critical
       → use the HiPPO matrix (to initialize $\mathbf{A}$), which is designed to memorize past inputs well:
       $$\mathbf{A}_{nk} = -\begin{cases} (2n+1)^{1/2}(2k+1)^{1/2} & (n > k) \\ n+1 & (n = k) \\ 0 & (n < k) \end{cases}$$
     • 🥲 Computing the convolution kernel $\bar{K}$ is expensive (powers $\bar{\mathbf{A}}^{k}$)
       → S4 derives a fast algorithm for computing it
     • The theory and derivations behind both are omitted in this talk
     • (A direct construction of the HiPPO matrix above is sketched below)

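A small NumPy sketch (ours) that just transcribes the HiPPO matrix formula from the slide into code; the function name and size are made up for illustration.

```python
import numpy as np

# HiPPO matrix from the slide:
# A_{nk} = -(2n+1)^{1/2}(2k+1)^{1/2} if n > k,  -(n+1) if n = k,  0 if n < k.
def hippo_matrix(N):
    A = np.zeros((N, N))
    for n in range(N):
        for k in range(N):
            if n > k:
                A[n, k] = -np.sqrt(2 * n + 1) * np.sqrt(2 * k + 1)
            elif n == k:
                A[n, k] = -(n + 1)
    return A

print(hippo_matrix(4))   # lower-triangular, with -(n+1) on the diagonal
```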
  26-30. The problem with S4: $\bar{\mathbf{A}}, \bar{\mathbf{B}}, \bar{\mathbf{C}}$ do not depend on the input $x$, so no "dynamic" inference
     • The plain Copying task (reproduce the input after a fixed delay) can be solved even by an ordinary SSM with fixed $\bar{\mathbf{A}}, \bar{\mathbf{B}}, \bar{\mathbf{C}}$: a kernel $\bar{K}$ with a single spike at the right lag simply delays the input
     • The Selective Copying task is different:
       • It is close in spirit to in-context learning
       • It requires inference that adapts dynamically to the input content
       • With $\bar{\mathbf{A}}, \bar{\mathbf{B}}, \bar{\mathbf{C}}$ fixed → 🥲 it cannot be solved
     • (A small illustration of the fixed-kernel limitation follows below)

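A minimal illustration (ours, not from the paper) of why a time-invariant SSM handles the plain Copying task but not the selective variant: a fixed kernel is just a fixed delay, and the required delay in Selective Copying depends on the content of each input.

```python
import numpy as np

# A time-invariant SSM reduces to a fixed convolution kernel, so a one-hot kernel at
# lag d copies the input from d steps earlier. That suffices for the plain Copying
# task, where the spacing between prompt and answer is constant.
L, d = 12, 4
x = np.arange(1, L + 1, dtype=float)        # toy "tokens"
K = np.zeros(L)
K[d] = 1.0                                  # kernel K_bar with a single spike at lag d
y = np.convolve(x, K)[:L]
print(y)                                    # x delayed by d steps: [0. 0. 0. 0. 1. 2. ...]

# Selective Copying randomizes the spacing per example, so no single fixed K works:
# the lag the model needs depends on the content of x, which is why A_bar, B_bar, C_bar
# must become functions of the input (Mamba's selection mechanism).
```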
  31-40. Mamba: compute $\bar{\mathbf{A}}, \bar{\mathbf{B}}, \bar{\mathbf{C}}$ dynamically from the input $x$
     • From the paper, Figure 2: (Left) The standard version of the Copying task involves constant spacing between input and output elements and is easily solved by time-invariant models such as linear recurrences and global convolutions. (Right Top) The Selective Copying task has random spacing in between inputs and requires time-varying models that can selectively remember or ignore inputs depending on their content. (Right Bottom) The Induction Heads task is an example of associative recall that requires retrieving an answer based on context, a key ability for LLMs.
     • Shape annotations: $B$ = batch size, $L$ = sequence length, $D$ = model hidden dimension, $N$ = dimension of the SSM state $h$
       • Conceptually, a separate scalar SSM ($\mathbb{R} \to \mathbb{R}$) is applied to each of the $D$ channels
     • Algorithm 1, SSM (S4). Input $x$: $(B, L, D)$, output $y$: $(B, L, D)$
       1: $\mathbf{A}$: $(D, N)$ ← Parameter  ▷ represents a structured $N \times N$ matrix; shared, reused independently of $x$
       2: $\mathbf{B}$: $(D, N)$ ← Parameter
       3: $\mathbf{C}$: $(D, N)$ ← Parameter
       4: $\Delta$: $(D)$ ← $\tau_\Delta$(Parameter)  ▷ the step size $\Delta$ is also learned
       5: $\bar{\mathbf{A}}, \bar{\mathbf{B}}$: $(D, N)$ ← discretize($\Delta, \mathbf{A}, \mathbf{B}$)
       6: $y$ ← SSM($\bar{\mathbf{A}}, \bar{\mathbf{B}}, \mathbf{C}$)($x$)  ▷ time-invariant: recurrence or convolution, so it stays parallelizable
       7: return $y$
     • Algorithm 2, SSM + Selection (S6). Input $x$: $(B, L, D)$, output $y$: $(B, L, D)$
       1: $\mathbf{A}$: $(D, N)$ ← Parameter  ▷ still shared, not a function of $x$
       2: $\mathbf{B}$: $(B, L, N)$ ← $s_B(x)$  ▷ a linear map applied to $x$ ($D \to N$)
       3: $\mathbf{C}$: $(B, L, N)$ ← $s_C(x)$
       4: $\Delta$: $(B, L, D)$ ← $\tau_\Delta$(Parameter + $s_\Delta(x)$)  ▷ via $\Delta$, $\bar{\mathbf{A}}$ also depends on $x$
       5: $\bar{\mathbf{A}}, \bar{\mathbf{B}}$: $(B, L, D, N)$ ← discretize($\Delta, \mathbf{A}, \mathbf{B}$)
       6: $y$ ← SSM($\bar{\mathbf{A}}, \bar{\mathbf{B}}, \mathbf{C}$)($x$)  ▷ time-varying: recurrence (scan) only
       7: return $y$
     • The paper's specific choices: $s_B(x) = \mathrm{Linear}_N(x)$, $s_C(x) = \mathrm{Linear}_N(x)$, $s_\Delta(x) = \mathrm{Broadcast}_D(\mathrm{Linear}_1(x))$, and $\tau_\Delta = \mathrm{softplus}$
     • The main difference is simply making the parameters $\Delta, \mathbf{B}, \mathbf{C}$ functions of the input; they gain a length dimension $L$, so the model changes from time-invariant to time-varying
       • 🥲 This loses the equivalence to convolutions, and with it the easy parallelization. How can it be made fast?
       • Mamba's answer: parallel scan, kernel fusion, and recomputation of activations
     • (A toy version of the selection step is sketched below)

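A rough PyTorch sketch (ours, not the official Mamba code) of the selection mechanism above: B, C, and Δ are produced from the input by linear maps, the system is discretized per time step, and the time-varying recurrence is run as a plain sequential scan. The layer names (`W_B`, `W_C`, `W_dt`) and the toy sizes are made up; the real implementation fuses and parallelizes this scan on the GPU.

```python
import torch
import torch.nn.functional as F

# Shapes follow the slide: Bsz = batch, L = length, D = channels, N = SSM state size.
Bsz, L, D, N = 2, 16, 8, 4
x = torch.randn(Bsz, L, D)

A = -torch.rand(D, N)                         # shared, input-independent state matrix
W_B = torch.nn.Linear(D, N, bias=False)       # s_B(x): D -> N
W_C = torch.nn.Linear(D, N, bias=False)       # s_C(x): D -> N
W_dt = torch.nn.Linear(D, 1, bias=True)       # s_Δ(x): D -> 1, broadcast over D

Bmat = W_B(x)                                         # (Bsz, L, N)
Cmat = W_C(x)                                         # (Bsz, L, N)
delta = F.softplus(W_dt(x)).expand(-1, -1, D)         # (Bsz, L, D), positive step sizes

# Per-step discretization (exponential for A, Euler-style for B, as a simplification)
A_bar = torch.exp(delta.unsqueeze(-1) * A)            # (Bsz, L, D, N)
B_bar = delta.unsqueeze(-1) * Bmat.unsqueeze(2)       # (Bsz, L, D, N)

# Time-varying recurrence written as a sequential scan
h = torch.zeros(Bsz, D, N)
ys = []
for t in range(L):
    h = A_bar[:, t] * h + B_bar[:, t] * x[:, t].unsqueeze(-1)   # per-channel scalar SSM
    ys.append(torch.einsum("bdn,bn->bd", h, Cmat[:, t]))
y = torch.stack(ys, dim=1)                    # (Bsz, L, D)
print(y.shape)
```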
  41-42. The SSM recurrence can be viewed as a scan
     • With the parameters $\bar{\mathbf{A}}, \bar{\mathbf{B}}, \bar{\mathbf{C}}$ depending on the input $x_t$, the convolution form is no longer available
     • But the SSM recurrence $h_t = \bar{\mathbf{A}}h_{t-1} + \bar{\mathbf{B}}x_t$, $y_t = \bar{\mathbf{C}}h_t$ is a scan: each output (state) is computed from the current input and the previous output (state)
     • Scans can be parallelized
     • Example, the prefix-sum scan: input 1 2 3 4 5 6 7 8 → output 1 3 6 10 15 21 28 36
     • (The associative-operator view of the SSM recurrence as a scan is sketched below)

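A minimal sketch (ours) of why the linear recurrence is a scan: the per-step update can be packed into pairs combined by an associative operator, which is exactly what a parallel scan needs. The operator and identity element are standard for linear recurrences; the toy values are arbitrary, and in practice something like `jax.lax.associative_scan` or a custom CUDA scan would apply the same operator in parallel.

```python
import numpy as np

# h_t = a_t * h_{t-1} + b_t is a scan under the associative operator
# (a1, b1) ∘ (a2, b2) = (a2*a1, a2*b1 + b2), with identity (1, 0).
def combine(e1, e2):
    a1, b1 = e1
    a2, b2 = e2
    return (a2 * a1, a2 * b1 + b2)

rng = np.random.default_rng(0)
L = 8
a = rng.uniform(0.5, 1.0, L)    # plays the role of A_bar (scalar per step here)
b = rng.standard_normal(L)      # plays the role of B_bar * x_t

# Sequential reference
h, ref = 0.0, []
for t in range(L):
    h = a[t] * h + b[t]
    ref.append(h)

# Scan with the associative operator (written sequentially here; associativity is what
# lets a tree-structured parallel scan compute the same prefixes)
acc, scanned = (1.0, 0.0), []
for t in range(L):
    acc = combine(acc, (a[t], b[t]))
    scanned.append(acc[1])

print(np.allclose(ref, scanned))   # True
```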
  43-53. Parallelizing the scan: Parallel Scan
     (Figure adapted from David Walker, "Parallel Scans & Prefix Sums," COS 326, Princeton University; each tree node stores a range, its sum, and a "left" value, i.e. the sum of everything to its left)
     • Up-sweep: split the input into pairs and compute sums up a binary tree
       • Sums of pairs at the same depth are independent, so they can be computed in parallel
       • Depth: $\log_2 L$
     • Down-sweep: compute each node's "left" value
       • Left child: copy the parent's left value
       • Right child: parent's left value + left sibling's sum
       • This pass is parallelizable in the same way, again with depth $\log_2 L$
     • For input 1 2 3 4 5 6 7 8 the leaves end up with left values 0 1 3 6 10 15 21 28, giving the inclusive prefix sums 1 3 6 10 15 21 28 36
     • (A runnable sketch of this two-pass scan follows below)

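A minimal Python sketch (ours) of the up-sweep / down-sweep scan the slide describes. It runs sequentially here, but the operations inside each tree level are independent, which is what gives the $O(\log L)$ depth on parallel hardware; the function name and the power-of-two assumption are ours.

```python
def parallel_prefix_sum(x):
    x = list(x)
    L = len(x)                       # assume L is a power of two for simplicity
    tree = x[:]                      # up-sweep: accumulate subtree sums in place
    step = 1
    while step < L:
        for i in range(2 * step - 1, L, 2 * step):   # one level; iterations are independent
            tree[i] += tree[i - step]
        step *= 2
    tree[L - 1] = 0                  # down-sweep: the root's "left" value is 0
    step //= 2
    while step >= 1:
        for i in range(2 * step - 1, L, 2 * step):   # again independent within a level
            left = tree[i - step]
            tree[i - step] = tree[i]                 # left child copies the parent's left
            tree[i] += left                          # right child = parent's left + left sum
        step //= 2
    exclusive = tree                 # exclusive prefix sums (the "left" values on the slide)
    return [e + v for e, v in zip(exclusive, x)]     # inclusive scan = exclusive + input

print(parallel_prefix_sum([1, 2, 3, 4, 5, 6, 7, 8]))   # [1, 3, 6, 10, 15, 21, 28, 36]
```

The same tree structure works for the SSM recurrence: replace addition with the associative operator from the previous sketch.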
  54-59. Kernel fusion / recomputation of activations: reduce HBM ↔ SRAM traffic
     • GPU memory hierarchy
       • HBM: large capacity, slow
       • SRAM: small capacity, fast
       • Round trips between HBM and SRAM are a major bottleneck
     • Kernel fusion → fewer HBM ↔ SRAM transfers
       • Usual approach: materialize the scan inputs in HBM → compute in SRAM → write the outputs back to HBM
       • Mamba: builds the scan inputs (the discretization, etc.) directly in SRAM
     • Recomputation of activations
       • Activations are normally stored in HBM for the backward pass
       • During backward, this again forces HBM ↔ SRAM round trips
       • Recomputing the activations in SRAM is faster
     • (Figure: the paper's selective-state-space diagram, showing the discretization of $\Delta_t, \mathbf{A}, \mathbf{B}$ and which tensors live in GPU SRAM vs. GPU HBM)
     • (A generic PyTorch illustration of the recomputation idea follows below)

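A generic PyTorch illustration (ours) of the "recompute instead of store" trade-off. This is ordinary gradient checkpointing, not Mamba's fused CUDA kernel: Mamba applies the same idea inside the kernel so intermediates stay in SRAM rather than being written to and re-read from HBM. The toy layer and sizes are made up.

```python
import torch
from torch.utils.checkpoint import checkpoint

# A small block whose intermediate activations we choose not to keep.
layer = torch.nn.Sequential(
    torch.nn.Linear(256, 1024), torch.nn.SiLU(), torch.nn.Linear(1024, 256)
)
x = torch.randn(8, 256, requires_grad=True)

y = checkpoint(layer, x, use_reentrant=False)   # forward: activations inside `layer` are not stored
y.sum().backward()                              # backward: they are recomputed on the fly
print(x.grad.shape)                             # torch.Size([8, 256])
```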
  60. The overall Mamba architecture
     • From the paper, Figure 3: (Architecture.) Our simplified block design combines the H3 block, which is the basis of most SSM architectures, with the ubiquitous MLP block of modern neural networks. Instead of interleaving these two blocks, we simply repeat the Mamba block homogenously. Compared to the H3 block, Mamba replaces the first multiplicative gate with an activation function. Compared to the MLP block, Mamba adds an SSM to the main branch. For $\sigma$ we use the SiLU / Swish activation (Hendrycks and Gimpel 2016; Ramachandran, Zoph, and Quoc V. Le 2017).
     • The block is repeated, interleaved with standard normalization and residual connections, to form the Mamba architecture; the expansion factor is fixed to $E = 2$, so two stacked Mamba blocks roughly match the $12D^2$ parameters of a Transformer's interleaved MHA (multi-head attention) and MLP blocks
     • (A rough PyTorch sketch of the block structure follows below)

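A rough PyTorch sketch (ours) of the block structure in Figure 3, heavily simplified: an input projection splits into a main branch (causal depthwise convolution → SiLU → SSM) and a gating branch, followed by an output projection. `selective_ssm` is only a placeholder for the real S6 kernel, and the hyperparameter names (`d_state`, `expand`, `d_conv`) are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlockSketch(nn.Module):
    def __init__(self, d_model, d_state=16, expand=2, d_conv=4):
        super().__init__()
        d_inner = expand * d_model
        self.in_proj = nn.Linear(d_model, 2 * d_inner)               # main branch and gate branch
        self.conv1d = nn.Conv1d(d_inner, d_inner, d_conv,
                                groups=d_inner, padding=d_conv - 1)  # causal depthwise conv
        self.out_proj = nn.Linear(d_inner, d_model)
        self.d_state = d_state

    def selective_ssm(self, x):
        # Placeholder for the selective scan (S6); a real implementation computes
        # Δ, B, C from x and runs the (parallel) scan sketched earlier.
        return x

    def forward(self, x):                                            # x: (batch, length, d_model)
        xz = self.in_proj(x)
        x_branch, z = xz.chunk(2, dim=-1)
        x_branch = self.conv1d(x_branch.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        x_branch = F.silu(x_branch)
        y = self.selective_ssm(x_branch)
        y = y * F.silu(z)                                            # multiplicative gate
        return self.out_proj(y)

block = MambaBlockSketch(d_model=64)
print(block(torch.randn(2, 32, 64)).shape)                           # torch.Size([2, 32, 64])
```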
  61. Mamba: Selective Copying / Induction Heads 63
[Slide reproduces Table 1, Table 2, Figure 2 and Algorithms 1-2 of [Gu&Dao+'23].]
Table 1: (Selective Copying.) Accuracy for combinations of architectures and inner sequence layers.

| Model | Arch. | Layer | Acc. |
|---|---|---|---|
| S4 | No gate | S4 | 18.3 |
| - | No gate | S6 | 97.0 |
| H3 | H3 | S4 | 57.0 |
| Hyena | H3 | Hyena | 30.1 |
| - | H3 | S6 | 99.7 |
| - | Mamba | S4 | 56.4 |
| - | Mamba | Hyena | 28.4 |
| Mamba | Mamba | S6 | 99.8 |

Table 2: (Induction Heads.) Models are trained on sequence length 2^8 = 256 and tested on increasing sequence lengths from 2^6 = 64 up to 2^20 = 1,048,576 (full numbers in Table 11 of the paper).
Figure 2: the standard Copying task has constant spacing between input and output elements and is solved perfectly by time-invariant (LTI) models such as linear recurrences and global convolutions; the Selective Copying task has random spacing and requires time-varying models that can selectively remember or ignore inputs depending on their content; the Induction Heads task is an associative-recall task that is surprisingly predictive of the in-context learning ability of LLMs (Olsson et al. 2022).
Algorithm 1 (S4) vs. Algorithm 2 (S6): the selection mechanism simply makes Δ, B, C functions of the input, with the corresponding changes to tensor shapes; S4 is time-invariant (recurrence or convolution), while S6 is time-varying (recurrence / scan only).
  62. Mamba: the Selective SSM enables input-dependent (dynamic) inference 64 [Same material as the previous page: Table 1, Figure 2 and Algorithms 1-2 of [Gu&Dao+'23]. Because Δ, B, C now depend on the input, the model can no longer be computed as a global convolution; only the time-varying recurrence (scan) path remains, which is what the hardware-aware scan implementation addresses.]
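    To make the difference from S4 concrete, here is a naive sequential sketch of the S6 recurrence in the spirit of Algorithm 2, assuming a simplified discretization (Ā = exp(ΔA), B̄ = Δ·B) and skipping the hardware-aware parallel scan entirely. The class and projection names (SelectiveSSMSketch, B_proj, ...) are invented for this illustration; the reference kernel is the selective scan in state-spaces/mamba.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSMSketch(nn.Module):
    """Naive sequential sketch of an S6-style layer: Delta, B, C are functions of
    the input, then the recurrence h_t = Abar_t * h_{t-1} + Bbar_t * x_t is run."""

    def __init__(self, d_model, d_state=16):
        super().__init__()
        self.d_state = d_state
        # A is input-independent; stored as a log so -exp(.) stays negative (illustrative init).
        self.A_log = nn.Parameter(
            torch.log(torch.arange(1, d_state + 1).float()).repeat(d_model, 1))  # (D, N)
        self.B_proj = nn.Linear(d_model, d_state)    # s_B(x)
        self.C_proj = nn.Linear(d_model, d_state)    # s_C(x)
        self.dt_proj = nn.Linear(d_model, d_model)   # s_Delta(x)

    def forward(self, x):                            # x: (B, L, D)
        Bsz, L, D = x.shape
        A = -torch.exp(self.A_log)                   # (D, N)
        B = self.B_proj(x)                           # (B, L, N), input-dependent
        C = self.C_proj(x)                           # (B, L, N), input-dependent
        delta = F.softplus(self.dt_proj(x))          # (B, L, D), input-dependent step size

        h = x.new_zeros(Bsz, D, self.d_state)
        ys = []
        for t in range(L):                           # time-varying: recurrence (scan) only
            dA = torch.exp(delta[:, t, :, None] * A)             # (B, D, N)
            dB = delta[:, t, :, None] * B[:, t, None, :]          # (B, D, N)
            h = dA * h + dB * x[:, t, :, None]
            ys.append((h * C[:, t, None, :]).sum(-1))             # y_t = C_t h_t
        return torch.stack(ys, dim=1)                             # (B, L, D)
```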
  63. Mamba: scaling on par with the Transformer 65 [Figure 4 of [Gu&Dao+'23] (Scaling Laws): models of roughly 125M to 1.3B parameters trained on the Pile. Mamba scales better than all other attention-free models and is the first to match a very strong "Transformer++" recipe (rotary embeddings, SwiGLU MLP, RMSNorm instead of LayerNorm, no linear bias, higher learning rates), particularly as the sequence length grows; other recent subquadratic architectures are also compared, with model details in Appendix E.2 of the paper.]
  64. Mamba: scaling on par with the Transformer 66 [Figure 4 again, as on the previous page.] • The horizontal axis is FLOPs → Mamba's hardware optimizations do not put the other models at a disadvantage (which could be a concern if the horizontal axis were wall-clock time)
  65. Mamba: surpasses the Transformer at half the model size 67
Table 3: (Zero-shot Evaluations.) Best results for each size in bold. We compare against open source LMs with various tokenizers, trained for up to 300B tokens. Pile refers to the validation split, comparing only against models trained on the same dataset and tokenizer (GPT-NeoX-20B). For each model size, Mamba is best-in-class on every single evaluation result, and generally matches baselines at twice the model size.

| Model | Token. | Pile ppl | LAMBADA ppl | LAMBADA acc | HellaSwag acc | PIQA acc | Arc-E acc | Arc-C acc | WinoGrande acc | Average acc |
|---|---|---|---|---|---|---|---|---|---|---|
| Hybrid H3-130M | GPT2 | - | 89.48 | 25.77 | 31.7 | 64.2 | 44.4 | 24.2 | 50.6 | 40.1 |
| Pythia-160M | NeoX | 29.64 | 38.10 | 33.0 | 30.2 | 61.4 | 43.2 | 24.1 | 51.9 | 40.6 |
| Mamba-130M | NeoX | 10.56 | 16.07 | 44.3 | 35.3 | 64.5 | 48.0 | 24.3 | 51.9 | 44.7 |
| Hybrid H3-360M | GPT2 | - | 12.58 | 48.0 | 41.5 | 68.1 | 51.4 | 24.7 | 54.1 | 48.0 |
| Pythia-410M | NeoX | 9.95 | 10.84 | 51.4 | 40.6 | 66.9 | 52.1 | 24.6 | 53.8 | 48.2 |
| Mamba-370M | NeoX | 8.28 | 8.14 | 55.6 | 46.5 | 69.5 | 55.1 | 28.0 | 55.3 | 50.0 |
| Pythia-1B | NeoX | 7.82 | 7.92 | 56.1 | 47.2 | 70.7 | 57.0 | 27.1 | 53.5 | 51.9 |
| Mamba-790M | NeoX | 7.33 | 6.02 | 62.7 | 55.1 | 72.1 | 61.2 | 29.5 | 56.1 | 57.1 |
| GPT-Neo 1.3B | GPT2 | - | 7.50 | 57.2 | 48.9 | 71.1 | 56.2 | 25.9 | 54.9 | 52.4 |
| Hybrid H3-1.3B | GPT2 | - | 11.25 | 49.6 | 52.6 | 71.3 | 59.2 | 28.1 | 56.9 | 53.0 |
| OPT-1.3B | OPT | - | 6.64 | 58.0 | 53.7 | 72.4 | 56.7 | 29.6 | 59.5 | 55.0 |
| Pythia-1.4B | NeoX | 7.51 | 6.08 | 61.7 | 52.1 | 71.0 | 60.5 | 28.5 | 57.2 | 55.2 |
| RWKV-1.5B | NeoX | 7.70 | 7.04 | 56.4 | 52.5 | 72.4 | 60.5 | 29.4 | 54.6 | 54.3 |
| Mamba-1.4B | NeoX | 6.80 | 5.04 | 64.9 | 59.1 | 74.2 | 65.5 | 32.8 | 61.5 | 59.7 |
| GPT-Neo 2.7B | GPT2 | - | 5.63 | 62.2 | 55.8 | 72.1 | 61.1 | 30.2 | 57.6 | 56.5 |
| Hybrid H3-2.7B | GPT2 | - | 7.92 | 55.7 | 59.7 | 73.3 | 65.6 | 32.3 | 61.4 | 58.0 |
| OPT-2.7B | OPT | - | 5.12 | 63.6 | 60.6 | 74.8 | 60.8 | 31.3 | 61.0 | 58.7 |
| Pythia-2.8B | NeoX | 6.73 | 5.04 | 64.7 | 59.3 | 74.0 | 64.1 | 32.9 | 59.7 | 59.1 |
| RWKV-3B | NeoX | 7.00 | 5.24 | 63.9 | 59.6 | 73.7 | 67.8 | 33.1 | 59.6 | 59.6 |
| Mamba-2.8B | NeoX | 6.22 | 4.23 | 69.2 | 66.1 | 75.2 | 69.7 | 36.3 | 63.5 | 63.3 |
| GPT-J-6B | GPT2 | - | 4.10 | 68.3 | 66.3 | 75.4 | 67.0 | 36.6 | 64.1 | 63.0 |
| OPT-6.7B | OPT | - | 4.25 | 67.7 | 67.2 | 76.3 | 65.6 | 34.9 | 65.5 | 62.9 |
| Pythia-6.9B | NeoX | 6.51 | 4.45 | 67.1 | 64.0 | 75.2 | 67.3 | 35.5 | 61.3 | 61.7 |
| RWKV-7.4B | NeoX | 6.31 | 4.38 | 67.2 | 65.5 | 76.1 | 67.8 | 37.5 | 61.0 | 62.5 |
  66. Mamba: summary • The Selective SSM realizes input-dependent inference on x, which was the weakness of S4 • Instead of parallelization via convolution, speed comes from GPU-aware optimizations • Parallel scan (see the sketch after this slide) • Kernel fusion • Activation recomputation • Records performance approaching the Transformer • Many things are still unknown • Does it scale beyond 2.8B? • What about training instability? • What do Chinchilla-style scaling laws look like for Mamba? • How does hyperparameter tuning differ from the Transformer? 68 [Figure 1 and Figure 3 of [Gu&Dao+'23] are shown again as illustrations.]
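    On the parallel-scan bullet: the linear recurrence h_t = Ā_t h_{t-1} + B̄_t x_t is associative, so prefixes can be combined pairwise in a tree and computed in O(log L) depth. A minimal sketch with scalar state (pure PyTorch, plain recursion instead of a real GPU scan kernel; the function names are made up for this illustration):

```python
import torch

def combine(e1, e2):
    """Associative operator for h_t = a_t * h_{t-1} + b_t.
    Composing step (a1, b1) then (a2, b2) gives (a2*a1, a2*b1 + b2)."""
    a1, b1 = e1
    a2, b2 = e2
    return a2 * a1, a2 * b1 + b2

def parallel_scan(a, b):
    """Inclusive scan over (a_t, b_t); the two recursive calls are independent,
    which is what a GPU scan kernel exploits."""
    L = a.shape[0]
    if L == 1:
        return a, b
    a_lo, b_lo = parallel_scan(a[: L // 2], b[: L // 2])
    a_hi, b_hi = parallel_scan(a[L // 2:], b[L // 2:])
    # fold the last prefix of the low half into every prefix of the high half
    a_hi, b_hi = combine((a_lo[-1], b_lo[-1]), (a_hi, b_hi))
    return torch.cat([a_lo, a_hi]), torch.cat([b_lo, b_hi])

# check against the sequential recurrence (h_0 = 0)
L = 8
a, b = torch.rand(L), torch.randn(L)
_, h_par = parallel_scan(a, b)
h = torch.zeros(())
for t in range(L):
    h = a[t] * h + b[t]
print(torch.allclose(h, h_par[-1]))   # True (up to float error)
```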
  67. Mamba @ Kotoba Tech: developing a library for large-scale distributed training • kotomamba • https://github.com/kotoba-tech/kotomamba • Based on kotoba-recipes (based on llama-recipes (Meta)) & the Mamba implementation by Tri Dao & Albert Gu • Large-scale distributed training with FSDP • Supports both pretraining and continual training (fine-tuning) • Can train any Mamba model distributed in 🤗 Transformers format • A tech blog post will be published soon! 69 Fujii-san and I are leading the project
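    For readers unfamiliar with FSDP: the gist is that parameters, gradients and optimizer state are sharded across ranks instead of replicated. A generic minimal sketch (not kotomamba code; the toy model and hyperparameters here are placeholders):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Stand-in model; in kotomamba this would be the Mamba language model.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.SiLU(), torch.nn.Linear(4096, 1024)
).cuda()
model = FSDP(model)   # shards parameters / grads / optimizer state across ranks

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 1024, device="cuda")
loss = model(x).pow(2).mean()   # dummy loss for illustration
loss.backward()
optimizer.step()
```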
  68. Mamba @ Kotoba Tech: ABCI Grand Challenge • A joint team of Tohoku Univ. Sakaguchi Lab, Tokyo Tech Yokota Lab and Okazaki Lab, and Kotoba Tech. plans to take part in the ABCI Grand Challenge • V-week: exclusive use of 128 V-nodes (NVIDIA V100 GPU x 512) for one week • Plan to train a 2.8B Japanese Mamba language model • Large-scale parallel training with kotomamba 70
  69. Mamba @ Kotoba Tech: preliminary experiments for the ABCI Grand Challenge 71 • Hyperparameters follow the Mamba paper • 130M: confirmed that training goes well • 150B tokens for en-Pile • 220B tokens for a mixed Japanese-English corpus • Tried variations such as doubling the lr, but no loss spikes • 1.4B • Trained for about 80B tokens, stable with no spikes • 2.8B • No spikes so far
  70. Mamba @ Kotoba Tech: preliminary experiments for the ABCI Grand Challenge • 3.0B → training becomes somewhat unstable • A setting not covered in the Mamba paper (the paper tops out at 2.8B) • Increased the number of layers from 64 (2.8B) to 70 (3.0B) • Hypothesis: the "depth vs. width" balance may matter (the hidden dimensions etc. may need to be scaled up as well?) 72
  71. Mamba @ Kotoba Tech: preliminary experiments for the ABCI Grand Challenge 73 🥲 Throughput (FLOPs) is unstable on ABCI • An ABCI issue? Caused by the library? By the Mamba implementation? → Most likely caused by our custom dataloader • GPU temperatures are fine • A similar problem occurred in the Grand Challenge that Kojima, Kasai, and Kurita took part in → it was resolved by a reboot
  72. To learn more / references 74
SSM / S4
• The Annotated S4
• Do we need Attention? - Linear RNNs and State Space Models (SSMs) for NLP
• MedAI #41: Efficiently Modeling Long Sequences with Structured State Spaces | Albert Gu
• HiPPO/S4解説 - Morpho Tech Blog - モルフォ
• [解説資料] Hyena Hierarchy: Towards Larger Convolutional Language Models
Mamba
• Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Paper Explained)
• Mamba and S4 Explained: Architecture, Parallel Scan, Kernel Fusion, Recurrent, Convolution, Math
Parallel scan
• GPU Gems 3 Chapter 39. Parallel Prefix Sum (Scan) with CUDA
• Parallel Scans & Prefix Sums Implementation
• https://github.com/state-spaces/mamba
• https://github.com/alxndrTL/mamba.py
• https://github.com/johnma2006/mamba-minimal