[Paper Introduction] Mamba: Linear-Time Sequence Modeling with Selective State Spaces

2025/6/12
Paper Introduction @Tanichu-lab.
https://sites.google.com/view/tanichu-lab-ku/home-jp

Rei Ando

June 12, 2025

Transcript

  1. Paper Information

     Title: "Mamba: Linear-Time Sequence Modeling with Selective State Spaces"
     Authors: Albert Gu, Tri Dao
     Publication date: Dec 1, 2023
     Link: https://arxiv.org/abs/2312.00752
  2. Contents

     1. Background
     2. Problem
     3. Research Goal
     4. State Space Model
     5. Mamba
        1. Overview
        2. Selection Mechanism
        3. Detail Setting
        4. Hardware Aware Algorithm
        5. Block
     6. Experiment
        1. Selective Copying task
        2. Induction Head task
        3. Language Modeling
        4. Speed & Memory Performance
        5. Ablation
     7. Conclusion
  3. Background

     Sequential modeling targets sequential data such as video, text, and audio.
     The mainstream approach in this field is the Transformer [Vaswani+ 2017]:
     • one of the deep learning models
     • its key component, "Attention", is critical for capturing features in sequential data
     It achieves remarkably high performance and is used as a foundation model in various fields.
  4. Problem

     The Transformer is often chosen as a foundation model, but it also has drawbacks.
     The Transformer's limitations:
     • It cannot consider data outside of the context window.
     • The computational cost grows quadratically with the window size: $\mathcal{O}(n^2)$.
     A very large computational cost is required to achieve high performance.
  5. Research Goal

     Main goal: realize a new SSM method with
     • low computational cost
     • high performance
     To overcome the Transformer's drawbacks, the authors turn to the State Space Model (SSM).
     Comparing them:
       Model       | Performance | Efficiency
       SSM         |      ◯      |     ◯
       Transformer |      ◎      |     △
  6. State Space Model (SSM)

     A framework for analyzing and modeling sequential dynamics
     ($x$: input, $y$: output, $h$: intermediate state).

     Basic formulation:
     $h'(t) = A h(t) + B x(t)$
     $y(t) = C h(t)$

     Time-scale discretized formulation:
     $h_t = \bar{A} h_{t-1} + \bar{B} x_t$
     $y_t = C h_t$
     where $\bar{A} = f_A(\Delta, A)$, $\bar{B} = f_B(\Delta, A, B)$.
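To make the discretized formulation concrete, here is a minimal NumPy sketch of the sequential recurrence $h_t = \bar{A} h_{t-1} + \bar{B} x_t$, $y_t = C h_t$. The function name, shapes, and the single-channel input are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ssm_scan(A_bar, B_bar, C, x):
    """Sequentially evaluate the discretized SSM recurrence.

    A_bar: (N, N) discretized state matrix
    B_bar: (N,)   discretized input matrix (one input channel, for simplicity)
    C:     (N,)   output matrix
    x:     (L,)   input sequence
    Returns y: (L,) output sequence.
    """
    N = A_bar.shape[0]
    h = np.zeros(N)                      # h_0 = 0
    y = np.zeros(len(x))
    for t in range(len(x)):
        h = A_bar @ h + B_bar * x[t]     # h_t = A_bar h_{t-1} + B_bar x_t
        y[t] = C @ h                     # y_t = C h_t
    return y

# Toy example with hand-picked (already discretized) matrices:
A_bar = np.diag([0.9, 0.8])
B_bar = np.array([1.0, 0.5])
C = np.array([1.0, -1.0])
print(ssm_scan(A_bar, B_bar, C, np.ones(5)))
```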
  7. Mamba - Overview

     To achieve higher performance and better efficiency than the Transformer, a new style of SSM is proposed: Mamba.
     Key methods:
     • Selection Mechanism
     • Hardware-aware Algorithm
  8. Mamba - Selection Mechanism

     A fundamental problem of sequence modeling is compressing context into a smaller state.
     → Distinguish the key elements of the data from noise when compressing = Selection Mechanism.
     The static parameters $B$, $C$, $\Delta$ become dynamic, input-dependent parameters (≈ Attention):
     $B = s_B(x)$, $C = s_C(x)$, $\Delta = s_\Delta(x)$
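A minimal sketch of how such input-dependent parameters can be produced with linear projections (PyTorch; the class and argument names are illustrative). In the paper, $s_\Delta$ is a low-rank projection broadcast over channels; here it is simplified to one step size per token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectionProjections(nn.Module):
    """B, C and Delta become functions of the input x (selection mechanism)."""
    def __init__(self, d_model: int, d_state: int):
        super().__init__()
        self.s_B = nn.Linear(d_model, d_state)   # B = s_B(x)
        self.s_C = nn.Linear(d_model, d_state)   # C = s_C(x)
        self.s_Delta = nn.Linear(d_model, 1)     # Delta = s_Delta(x)

    def forward(self, x):                        # x: (batch, length, d_model)
        B = self.s_B(x)                          # (batch, length, d_state)
        C = self.s_C(x)                          # (batch, length, d_state)
        Delta = F.softplus(self.s_Delta(x))      # positive step size per token
        return B, C, Delta
```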
  9. Mamba - Detail Setting

     Discretization method: Zero-Order Hold (ZOH)
     $\bar{A} = \exp(\Delta A)$, $\quad \bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I)\,\Delta B$

     Interpretation of $\Delta$:
     $\Delta \to 0 \Rightarrow \bar{A} \to I,\ \bar{B} \to 0 \Rightarrow h_t = h_{t-1}$ (ignore the input)
     $\Delta \to \infty \Rightarrow \bar{A} \to 0,\ \bar{B} \to -A^{-1}B \Rightarrow h_t = -A^{-1}B x_t$ (forget the past states)
     ・・・a gated architecture realized by $\Delta$

     Definition of $A$: S4D-Real
     $a_{ij} = -(i+1)$ if $i = j$, $\ 0$ if $i \neq j$
     ※ $B$, $\Delta$ are dynamic
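The ZOH formulas and the S4D-Real definition of $A$ can be written down directly; the following is a small NumPy/SciPy sketch (function names are illustrative), which can also be used to check the two limits of $\Delta$ described above.

```python
import numpy as np
from scipy.linalg import expm

def s4d_real_A(N):
    """S4D-Real: A is diagonal with a_ii = -(i + 1) and zeros elsewhere."""
    return np.diag(-(np.arange(N) + 1.0))

def discretize_zoh(A, B, delta):
    """Zero-Order Hold discretization:
       A_bar = exp(delta A)
       B_bar = (delta A)^{-1} (exp(delta A) - I) delta B
    """
    dA = delta * A
    A_bar = expm(dA)
    B_bar = np.linalg.inv(dA) @ (A_bar - np.eye(A.shape[0])) @ (delta * B)
    return A_bar, B_bar

A, B = s4d_real_A(4), np.ones(4)
# delta -> 0:   A_bar ~ I, B_bar ~ 0            (state kept, input ignored)
# delta -> inf: A_bar ~ 0, B_bar ~ -inv(A) @ B  (past state forgotten)
for delta in (1e-6, 1e6):
    A_bar, B_bar = discretize_zoh(A, B, delta)
    print(delta, np.round(np.diag(A_bar), 3), np.round(B_bar, 3))
```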
  10. Mamba - Hardware Aware Algorithm

     A GPU has two types of memory: HBM (large / slow) and SRAM (small / fast).
     Using them efficiently is important.

     Adopted method:
     ① Load $x, \Delta, A, B, C$ into SRAM
     ② Compute $\bar{A}, \bar{B}$
     ③ Compute $h, y$ with the scan algorithm
     ④ Write $y$ back to HBM

     Data loading is $\mathcal{O}(BLD + DN)$ instead of $\mathcal{O}(BLDN)$ → speed up & memory saving.
     Note: $h$ is not written back to HBM; it is recomputed during the backward pass.
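A back-of-the-envelope check of the claimed reduction in data loading, with illustrative (not the paper's) sizes; here B is the batch size, L the sequence length, D the model dimension, and N the state dimension.

```python
# Element counts loaded from HBM for one selective-SSM layer (illustrative sizes).
B, L, D, N = 8, 2048, 1024, 16

naive = B * L * D * N        # materializing A_bar, B_bar in HBM: O(BLDN)
fused = B * L * D + D * N    # loading x, Delta, B, C plus A, per the slide: O(BLD + DN)
print(f"naive: {naive:,}  fused: {fused:,}  ratio: {naive / fused:.1f}x")
# -> roughly a 16x reduction in HBM traffic for these sizes
```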
  11. Mamba - Block

     Build the "Mamba Block" with the Selective SSM.
     Components (Input → … → Output):
     • 1d Conv: one convolution layer along the sequence-length dimension
     • Selective SSM
     • σ: activation function (SiLU / Swish)
     • Out Proj: one linear layer
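A heavily simplified PyTorch sketch wiring these components: a depthwise 1d convolution, a SiLU activation, a selective-SSM module, and an output projection. The real Mamba block also contains an input projection with channel expansion and a multiplicative gating branch, which are omitted here; `selective_ssm` is a placeholder module and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlockSketch(nn.Module):
    """Simplified block: 1d conv over the length dim -> SiLU -> selective SSM -> out proj."""
    def __init__(self, d_model: int, selective_ssm: nn.Module, kernel_size: int = 4):
        super().__init__()
        self.conv1d = nn.Conv1d(d_model, d_model, kernel_size,
                                padding=kernel_size - 1, groups=d_model)
        self.ssm = selective_ssm                 # placeholder for the selective SSM
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                        # x: (batch, length, d_model)
        L = x.shape[1]
        u = self.conv1d(x.transpose(1, 2))[..., :L].transpose(1, 2)  # causal conv
        u = F.silu(u)                            # sigma: SiLU / Swish activation
        y = self.ssm(u)                          # (batch, length, d_model)
        return self.out_proj(y)
```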
  12. Experiment

     Verify Mamba's ability for sequential modeling.
     Tasks:
     • Selective Copying
     • Induction Head
     • Language Modeling
     • DNA Modeling (pass)
     • Audio Modeling and Generation (pass)
     Also verify:
     ✓ speed and memory performance
     ✓ the key methods' effectiveness with ablation studies
  13. Experiment - Selective Copying task

     Selective Copying task: valid tokens are placed at random positions in a sequence of invalid (noise) tokens, and the model must copy out the valid tokens.
     (The figure marks the sequence length and the number of kinds of valid tokens.)
     → verifies the model's ability to remember relevant tokens and ignore irrelevant ones.
     Result: table shown on the slide; "S6" denotes the SSM including the Selection Mechanism.
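To illustrate the task setup, here is a small NumPy generator for one Selective Copying example; the sizes and token conventions are made up for illustration, not the paper's configuration.

```python
import numpy as np

def make_selective_copying_example(seq_len=64, n_valid=4, vocab=16, rng=None):
    """One example: a few valid tokens at random positions among noise tokens.

    Input : sequence of length seq_len (token 0 is noise, 1..vocab-1 are valid).
    Target: the valid tokens in their original order.
    """
    rng = rng or np.random.default_rng()
    valid = rng.integers(1, vocab, size=n_valid)
    seq = np.zeros(seq_len, dtype=int)                        # all noise tokens
    positions = np.sort(rng.choice(seq_len, size=n_valid, replace=False))
    seq[positions] = valid
    return seq, valid

seq, target = make_selective_copying_example()
print(seq, "->", target)
```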
  14. Experiment - Induction Head task

     Induction Head task: the sequence contains a repeated pattern → predict the next token (≈ in-context learning in LLMs).
     Settings:
     • Vocab size: 16
     • Sequence length in training: $2^8$; in testing: $2^6, \ldots, 2^{20}$
     Result: high accuracy even at long sequence lengths ★
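A similar toy generator for the Induction Head task, again with illustrative sizes: a special token is followed by a value token once, the special token reappears at the end, and the model should recall the value.

```python
import numpy as np

def make_induction_head_example(seq_len=64, vocab=16, rng=None):
    """One example: recall which token followed the special token earlier on."""
    rng = rng or np.random.default_rng()
    special = vocab - 1
    seq = rng.integers(0, vocab - 1, size=seq_len)       # background tokens (special excluded)
    value = int(rng.integers(0, vocab - 1))
    pos = int(rng.integers(0, seq_len - 2))               # leave room for the (special, value) pair
    seq[pos], seq[pos + 1] = special, value
    seq[-1] = special                                     # query: special token appears again
    return seq, value                                     # target: the value token

seq, target = make_induction_head_example()
```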
  15. Experiment - Language Modeling

     Verify the performance of Mamba on popular downstream zero-shot evaluation tasks.
     ※ Only the case of one parameter size is shown; dataset: the Pile.
     The scaling law is also confirmed.
  16. Experiment - Speed & Memory

     (Benchmark plots for training and inference shown on the slide.)
     • High speed in both training and inference: $\mathcal{O}(n)$
     • Reduced memory consumption
     OOM: Out of Memory
  17. Experiment - Ablation

     • Selection Mechanism
     • Dynamic Parameters
     In the ablation tables, each parameter is marked as either static (blank) or dynamic (✔).
     → Dynamic $\Delta$ is the most critical.
     → The Selection Mechanism (S6) is effective in both models.
  18. Conclusion

     A new sequential modeling method, Mamba, is proposed.
     → High performance in modeling quality, speed, and memory.
     Key elements:
     • Selection Mechanism
     • Hardware-aware Algorithm
  19. Appendix - Scan Algorithm

     The state equation can be written in an affine (matrix) form:
     $\begin{pmatrix} h_{t+1} \\ 1 \end{pmatrix} = \begin{pmatrix} \bar{A} & \bar{B} x_t \\ 0 & 1 \end{pmatrix} \begin{pmatrix} h_t \\ 1 \end{pmatrix} = M_t \begin{pmatrix} h_t \\ 1 \end{pmatrix}$
     so computing $h_t$ requires the cumulative product of $M_1, \ldots, M_{t-1}$ applied to $\begin{pmatrix} h_1 \\ 1 \end{pmatrix}$.
     These cumulative products can be computed with a parallel scan; e.g. for $t = 5$:
       $M_1,\; M_2,\; M_3,\; M_4$
       $M_1,\; M_1 M_2,\; M_2 M_3,\; M_3 M_4$                 (step 1: multiply by the matrix 1 step to the left)
       $M_1,\; M_1 M_2,\; M_1 M_2 M_3,\; M_1 M_2 M_3 M_4$     (step 2: multiply by the matrix 2 steps to the left)
     Note: this parallel computing method is not indicated in the original paper.
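As the slide notes, the parallel scan is not spelled out in the paper; the following NumPy sketch implements the prefix-product pattern drawn above (a Hillis-Steele-style scan, written here with a sequential inner loop that could run in parallel). Whether the combine should be `left @ right` or `right @ left` depends on the product convention chosen for the $M_i$.

```python
import numpy as np

def prefix_matrix_scan(Ms):
    """Inclusive prefix products P_t = M_1 M_2 ... M_t via log2(len(Ms)) scan steps.

    At step k, every position multiplies in the partial product 2^(k-1) positions
    to its left, exactly as in the slide's example. Matrix multiplication is
    associative, so the result equals the sequential cumulative product.
    """
    P = [M.copy() for M in Ms]
    shift = 1
    while shift < len(P):
        new_P = [p.copy() for p in P]
        for i in range(shift, len(P)):           # each i is independent -> parallelizable
            new_P[i] = P[i - shift] @ P[i]       # flip to P[i] @ P[i - shift] for the other convention
        P = new_P
        shift *= 2
    return P

# Check against the sequential product for 4 random 2x2 matrices:
rng = np.random.default_rng(0)
Ms = [rng.standard_normal((2, 2)) for _ in range(4)]
P = prefix_matrix_scan(Ms)
assert np.allclose(P[-1], Ms[0] @ Ms[1] @ Ms[2] @ Ms[3])
```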