[Paper Introduction] Mamba: Linear-Time Sequence Modeling with Selective State Spaces

2025/6/12
Paper Introduction @Tanichu-lab.
https://sites.google.com/view/tanichu-lab-ku/home-jp

Rei Ando

June 12, 2025

Transcript

  1. Paper Information

     Title: "Mamba: Linear-Time Sequence Modeling with Selective State Spaces"
     Authors: Albert Gu, Tri Dao
     Publication date: Dec 1, 2023
     Link: https://arxiv.org/abs/2312.00752
  2. Contents

     1. Background
     2. Problem
     3. Research Goal
     4. State Space Model
     5. Mamba
        1. Overview
        2. Selection Mechanism
        3. Detail Setting
        4. Hardware Aware Algorithm
        5. Block
     6. Experiment
        1. Selective Copying task
        2. Induction Head task
        3. Language Modeling
        4. Speed & Memory Performance
        5. Ablation
     7. Conclusion
  3. Background

     Sequential modeling targets sequential data such as video, text, and audio.
     The mainstream approach in this field is the Transformer [Vaswani+ 2017]:
     • one of the deep learning models
     • its key component, "Attention", is critical for capturing features in sequential data
     It achieves remarkably high performance and is used as a foundation model in various fields.
  4. Problem

     The Transformer is often chosen as a foundation model, but it also has drawbacks.
     The Transformer's limitations:
     • It cannot consider data outside of the context window.
     • The computational cost grows quadratically with the window size: $\mathcal{O}(n^2)$.
     A very large computational cost is required to achieve high performance.
  5. Research Goal

     Main goal: realize a new SSM method with
     • low computational cost
     • high performance
     To overcome the Transformer's drawbacks, the authors turn to the State Space Model (SSM).
     Comparing them:
       Model       | Performance | Efficiency
       SSM         |      ◯      |     ◯
       Transformer |      ◎      |     △
  6. State Space Model (SSM)

     A framework for analyzing and modeling sequential dynamics
     ($x$: input, $y$: output, $h$: intermediate state).

     Basic formulation:
     $h'(t) = A h(t) + B x(t)$
     $y(t) = C h(t)$

     Time-scale discretized formulation:
     $h_t = \bar{A} h_{t-1} + \bar{B} x_t$
     $y_t = C h_t$
     where $\bar{A} = f_A(\Delta, A)$, $\bar{B} = f_B(\Delta, A, B)$.
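To make the discretized formulation concrete, here is a minimal NumPy sketch of the sequential recurrence $h_t = \bar{A} h_{t-1} + \bar{B} x_t$, $y_t = C h_t$. The function name, shapes, and the single-channel input are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ssm_scan(A_bar, B_bar, C, x):
    """Sequentially evaluate the discretized SSM recurrence.

    A_bar: (N, N) discretized state matrix
    B_bar: (N,)   discretized input matrix (one input channel, for simplicity)
    C:     (N,)   output matrix
    x:     (L,)   input sequence
    Returns y: (L,) output sequence.
    """
    N = A_bar.shape[0]
    h = np.zeros(N)                      # h_0 = 0
    y = np.zeros(len(x))
    for t in range(len(x)):
        h = A_bar @ h + B_bar * x[t]     # h_t = A_bar h_{t-1} + B_bar x_t
        y[t] = C @ h                     # y_t = C h_t
    return y

# Toy example with hand-picked (already discretized) matrices:
A_bar = np.diag([0.9, 0.8])
B_bar = np.array([1.0, 0.5])
C = np.array([1.0, -1.0])
print(ssm_scan(A_bar, B_bar, C, np.ones(5)))
```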
  7. Mamba - Overview

     To achieve higher performance and better efficiency than the Transformer, a new style of SSM is proposed: Mamba.
     Key methods:
     • Selection Mechanism
     • Hardware-aware Algorithm
  8. Mamba - Selection Mechanism

     A fundamental problem of sequence modeling is compressing context into a smaller state.
     → Distinguish the key elements of the data from noise when compressing = Selection Mechanism.
     The static parameters $B$, $C$, $\Delta$ become dynamic, input-dependent parameters (≈ Attention):
     $B = s_B(x)$, $C = s_C(x)$, $\Delta = s_\Delta(x)$
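A minimal sketch of how such input-dependent parameters can be produced with linear projections (PyTorch; the class and argument names are illustrative). In the paper, $s_\Delta$ is a low-rank projection broadcast over channels; here it is simplified to one step size per token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectionProjections(nn.Module):
    """B, C and Delta become functions of the input x (selection mechanism)."""
    def __init__(self, d_model: int, d_state: int):
        super().__init__()
        self.s_B = nn.Linear(d_model, d_state)   # B = s_B(x)
        self.s_C = nn.Linear(d_model, d_state)   # C = s_C(x)
        self.s_Delta = nn.Linear(d_model, 1)     # Delta = s_Delta(x)

    def forward(self, x):                        # x: (batch, length, d_model)
        B = self.s_B(x)                          # (batch, length, d_state)
        C = self.s_C(x)                          # (batch, length, d_state)
        Delta = F.softplus(self.s_Delta(x))      # positive step size per token
        return B, C, Delta
```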
  9. Mamba - Detail Setting

     Discretization method: Zero-Order Hold (ZOH)
     $\bar{A} = \exp(\Delta A)$, $\quad \bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I)\,\Delta B$

     Interpretation of $\Delta$:
     $\Delta \to 0 \Rightarrow \bar{A} \to I,\ \bar{B} \to 0 \Rightarrow h_t = h_{t-1}$ (ignore the input)
     $\Delta \to \infty \Rightarrow \bar{A} \to 0,\ \bar{B} \to -A^{-1}B \Rightarrow h_t = -A^{-1}B x_t$ (forget the past states)
     ・・・a gated architecture realized by $\Delta$

     Definition of $A$: S4D-Real
     $a_{ij} = -(i+1)$ if $i = j$, $\ 0$ if $i \neq j$
     ※ $B$, $\Delta$ are dynamic
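The ZOH formulas and the S4D-Real definition of $A$ can be written down directly; the following is a small NumPy/SciPy sketch (function names are illustrative), which can also be used to check the two limits of $\Delta$ described above.

```python
import numpy as np
from scipy.linalg import expm

def s4d_real_A(N):
    """S4D-Real: A is diagonal with a_ii = -(i + 1) and zeros elsewhere."""
    return np.diag(-(np.arange(N) + 1.0))

def discretize_zoh(A, B, delta):
    """Zero-Order Hold discretization:
       A_bar = exp(delta A)
       B_bar = (delta A)^{-1} (exp(delta A) - I) delta B
    """
    dA = delta * A
    A_bar = expm(dA)
    B_bar = np.linalg.inv(dA) @ (A_bar - np.eye(A.shape[0])) @ (delta * B)
    return A_bar, B_bar

A, B = s4d_real_A(4), np.ones(4)
# delta -> 0:   A_bar ~ I, B_bar ~ 0            (state kept, input ignored)
# delta -> inf: A_bar ~ 0, B_bar ~ -inv(A) @ B  (past state forgotten)
for delta in (1e-6, 1e6):
    A_bar, B_bar = discretize_zoh(A, B, delta)
    print(delta, np.round(np.diag(A_bar), 3), np.round(B_bar, 3))
```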
  10. Mamba - Hardware Aware Algorithm

     A GPU has two types of memory: HBM (large / slow) and SRAM (small / fast).
     Using them efficiently is important.

     Adopted method:
     ① Load $x, \Delta, A, B, C$ into SRAM
     ② Compute $\bar{A}, \bar{B}$
     ③ Compute $h, y$ with the scan algorithm
     ④ Write $y$ back to HBM

     Data loading is $\mathcal{O}(BLD + DN)$ instead of $\mathcal{O}(BLDN)$ → speed up & memory saving.
     Note: $h$ is not written back to HBM; it is recomputed during the backward pass.
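A back-of-the-envelope check of the claimed reduction in data loading, with illustrative (not the paper's) sizes; here B is the batch size, L the sequence length, D the model dimension, and N the state dimension.

```python
# Element counts loaded from HBM for one selective-SSM layer (illustrative sizes).
B, L, D, N = 8, 2048, 1024, 16

naive = B * L * D * N        # materializing A_bar, B_bar in HBM: O(BLDN)
fused = B * L * D + D * N    # loading x, Delta, B, C plus A, per the slide: O(BLD + DN)
print(f"naive: {naive:,}  fused: {fused:,}  ratio: {naive / fused:.1f}x")
# -> roughly a 16x reduction in HBM traffic for these sizes
```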
  11. Mamba - Block

     Build the "Mamba Block" with the Selective SSM.
     Components (Input → … → Output):
     • 1d Conv: one convolution layer along the sequence-length dimension
     • Selective SSM
     • σ: activation function (SiLU / Swish)
     • Out Proj: one linear layer
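A heavily simplified PyTorch sketch wiring these components: a depthwise 1d convolution, a SiLU activation, a selective-SSM module, and an output projection. The real Mamba block also contains an input projection with channel expansion and a multiplicative gating branch, which are omitted here; `selective_ssm` is a placeholder module and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlockSketch(nn.Module):
    """Simplified block: 1d conv over the length dim -> SiLU -> selective SSM -> out proj."""
    def __init__(self, d_model: int, selective_ssm: nn.Module, kernel_size: int = 4):
        super().__init__()
        self.conv1d = nn.Conv1d(d_model, d_model, kernel_size,
                                padding=kernel_size - 1, groups=d_model)
        self.ssm = selective_ssm                 # placeholder for the selective SSM
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                        # x: (batch, length, d_model)
        L = x.shape[1]
        u = self.conv1d(x.transpose(1, 2))[..., :L].transpose(1, 2)  # causal conv
        u = F.silu(u)                            # sigma: SiLU / Swish activation
        y = self.ssm(u)                          # (batch, length, d_model)
        return self.out_proj(y)
```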
  12. Experiment

     Verify Mamba's ability for sequential modeling.
     Tasks:
     • Selective Copying
     • Induction Head
     • Language Modeling
     • DNA Modeling (pass)
     • Audio Modeling and Generation (pass)
     Also verify:
     ✓ speed and memory performance
     ✓ the key methods' effectiveness with ablation studies
  13. Experiment - Selective Copying task

     Selective Copying task: valid tokens are placed at random positions in a sequence of invalid (noise) tokens, and the model must copy out the valid tokens.
     (The figure marks the sequence length and the number of kinds of valid tokens.)
     → verifies the model's ability to remember relevant tokens and ignore irrelevant ones.
     Result: table shown on the slide; "S6" denotes the SSM including the Selection Mechanism.
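To illustrate the task setup, here is a small NumPy generator for one Selective Copying example; the sizes and token conventions are made up for illustration, not the paper's configuration.

```python
import numpy as np

def make_selective_copying_example(seq_len=64, n_valid=4, vocab=16, rng=None):
    """One example: a few valid tokens at random positions among noise tokens.

    Input : sequence of length seq_len (token 0 is noise, 1..vocab-1 are valid).
    Target: the valid tokens in their original order.
    """
    rng = rng or np.random.default_rng()
    valid = rng.integers(1, vocab, size=n_valid)
    seq = np.zeros(seq_len, dtype=int)                        # all noise tokens
    positions = np.sort(rng.choice(seq_len, size=n_valid, replace=False))
    seq[positions] = valid
    return seq, valid

seq, target = make_selective_copying_example()
print(seq, "->", target)
```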
  14. Experiment - Induction Head task

     Induction Head task: the sequence contains a repeated pattern → predict the next token (≈ in-context learning in LLMs).
     Settings:
     • Vocab size: 16
     • Sequence length in training: $2^8$; in testing: $2^6, \ldots, 2^{20}$
     Result: high accuracy even at long sequence lengths ★
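A similar toy generator for the Induction Head task, again with illustrative sizes: a special token is followed by a value token once, the special token reappears at the end, and the model should recall the value.

```python
import numpy as np

def make_induction_head_example(seq_len=64, vocab=16, rng=None):
    """One example: recall which token followed the special token earlier on."""
    rng = rng or np.random.default_rng()
    special = vocab - 1
    seq = rng.integers(0, vocab - 1, size=seq_len)       # background tokens (special excluded)
    value = int(rng.integers(0, vocab - 1))
    pos = int(rng.integers(0, seq_len - 2))               # leave room for the (special, value) pair
    seq[pos], seq[pos + 1] = special, value
    seq[-1] = special                                     # query: special token appears again
    return seq, value                                     # target: the value token

seq, target = make_induction_head_example()
```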
  15. Experiment - Language Modeling

     Verify the performance of Mamba on popular downstream zero-shot evaluation tasks.
     ※ Only the case of one parameter size is shown; dataset: the Pile.
     The scaling law is also confirmed.
  16. Experiment - Speed & Memory

     (Benchmark plots for training and inference shown on the slide.)
     • High speed in both training and inference: $\mathcal{O}(n)$
     • Reduced memory consumption
     OOM: Out of Memory
  17. Experiment - Ablation

     • Selection Mechanism
     • Dynamic Parameters
     In the ablation tables, each parameter is marked as either static (blank) or dynamic (✔).
     → Dynamic $\Delta$ is the most critical.
     → The Selection Mechanism (S6) is effective in both models.
  18. Conclusion

     A new sequential modeling method, Mamba, is proposed.
     → High performance in modeling quality, speed, and memory.
     Key elements:
     • Selection Mechanism
     • Hardware-aware Algorithm
  19. Appendix - Scan Algorithm

     The state equation can be written in an affine (matrix) form:
     $\begin{pmatrix} h_{t+1} \\ 1 \end{pmatrix} = \begin{pmatrix} \bar{A} & \bar{B} x_t \\ 0 & 1 \end{pmatrix} \begin{pmatrix} h_t \\ 1 \end{pmatrix} = M_t \begin{pmatrix} h_t \\ 1 \end{pmatrix}$
     so computing $h_t$ requires the cumulative product of $M_1, \ldots, M_{t-1}$ applied to $\begin{pmatrix} h_1 \\ 1 \end{pmatrix}$.
     These cumulative products can be computed with a parallel scan; e.g. for $t = 5$:
       $M_1,\; M_2,\; M_3,\; M_4$
       $M_1,\; M_1 M_2,\; M_2 M_3,\; M_3 M_4$                 (step 1: multiply by the matrix 1 step to the left)
       $M_1,\; M_1 M_2,\; M_1 M_2 M_3,\; M_1 M_2 M_3 M_4$     (step 2: multiply by the matrix 2 steps to the left)
     Note: this parallel computing method is not indicated in the original paper.
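As the slide notes, the parallel scan is not spelled out in the paper; the following NumPy sketch implements the prefix-product pattern drawn above (a Hillis-Steele-style scan, written here with a sequential inner loop that could run in parallel). Whether the combine should be `left @ right` or `right @ left` depends on the product convention chosen for the $M_i$.

```python
import numpy as np

def prefix_matrix_scan(Ms):
    """Inclusive prefix products P_t = M_1 M_2 ... M_t via log2(len(Ms)) scan steps.

    At step k, every position multiplies in the partial product 2^(k-1) positions
    to its left, exactly as in the slide's example. Matrix multiplication is
    associative, so the result equals the sequential cumulative product.
    """
    P = [M.copy() for M in Ms]
    shift = 1
    while shift < len(P):
        new_P = [p.copy() for p in P]
        for i in range(shift, len(P)):           # each i is independent -> parallelizable
            new_P[i] = P[i - shift] @ P[i]       # flip to P[i] @ P[i - shift] for the other convention
        P = new_P
        shift *= 2
    return P

# Check against the sequential product for 4 random 2x2 matrices:
rng = np.random.default_rng(0)
Ms = [rng.standard_normal((2, 2)) for _ in range(4)]
P = prefix_matrix_scan(Ms)
assert np.allclose(P[-1], Ms[0] @ Ms[1] @ Ms[2] @ Ms[3])
```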