Slide 1

Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Symbol Emergence Systems Lab, Rei Ando

Slide 2

Paper Information
Title: "Mamba: Linear-Time Sequence Modeling with Selective State Spaces"
Authors: Albert Gu, Tri Dao
Pub. date: Dec 1, 2023
Link: https://arxiv.org/abs/2312.00752

Slide 3

Contents
1. Background
2. Problem
3. Research Goal
4. State Space Model
5. Mamba
   1. Overview
   2. Selection Mechanism
   3. Detail Setting
   4. Hardware-Aware Algorithm
   5. Block
6. Experiments
   1. Selective Copying task
   2. Induction Head task
   3. Language Modeling
   4. Speed & Memory Performance
   5. Ablation
7. Conclusion

Slide 4

Background
Sequence modeling targets sequential data such as video, text, and audio.
The mainstream model in this field is the Transformer [Vaswani+ 2017]:
• a deep learning model
• its key component, Attention, is critical for capturing features in sequential data
It achieves remarkably high modeling performance → it is used as a foundation model in various fields.

Slide 5

Problem
The Transformer is often chosen as the foundation model, but it also has drawbacks.
Transformer's limitations:
• It cannot consider data outside of its context window
• Its computational cost grows quadratically with the window size: O(n²)
→ Very large computational cost is required to reach high performance.

Slide 6

Research Goal
To overcome the drawbacks of the Transformer, the authors focus on the State Space Model (SSM).
Main goal: realize a new SSM method with
• low computational cost
• high performance
Comparing the two:
              Performance   Efficiency
  SSM         ◯            ◯
  Transformer ◎            △

Slide 7

State Space Model (SSM)
A framework for analyzing and modeling sequential dynamics.
Notation: x — input, y — output, h — intermediate (hidden) state.
Basic (continuous-time) formulation:
  h'(t) = A h(t) + B x(t)
  y(t) = C h(t)
Time-discretized formulation (with step size Δ):
  h_t = Ā h_{t-1} + B̄ x_t
  y_t = C h_t
  where Ā = f_A(Δ, A), B̄ = f_B(Δ, A, B)
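To make the discretized recurrence concrete, here is a minimal NumPy sketch of the recurrent view h_t = Ā h_{t-1} + B̄ x_t, y_t = C h_t, assuming a single scalar input channel and already-discretized matrices; the matrix values are made up for illustration.

```python
# Minimal sketch of a discretized SSM: a plain recurrence over the sequence.
import numpy as np

def ssm_scan(A_bar, B_bar, C, x):
    """Run h_t = A_bar @ h_{t-1} + B_bar * x_t and y_t = C @ h_t over a sequence."""
    N = A_bar.shape[0]
    h = np.zeros(N)
    ys = []
    for x_t in x:                      # sequential (recurrent) evaluation
        h = A_bar @ h + B_bar * x_t    # update the hidden state
        ys.append(C @ h)               # read out the output
    return np.array(ys)

# Toy usage: state size N = 4, length-10 scalar input
rng = np.random.default_rng(0)
A_bar = 0.9 * np.eye(4)               # stable diagonal transition (illustrative)
B_bar = rng.standard_normal(4)
C = rng.standard_normal(4)
y = ssm_scan(A_bar, B_bar, C, rng.standard_normal(10))
print(y.shape)  # (10,)
```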

Slide 8

Mamba - Overview
To achieve higher performance and better efficiency than the Transformer, a new style of SSM is proposed: Mamba.
Key methods used:
• Selection Mechanism
• Hardware-aware Algorithm

Slide 9

Mamba - Selection Mechanism
A fundamental problem of sequence modeling is compressing context into a smaller state.
→ To compress well, the model must distinguish key elements of the data from noise = Selection Mechanism.
The formerly static parameters B, C, Δ become dynamic (input-dependent) parameters, which is similar in spirit to Attention:
  B = s_B(x), C = s_C(x), Δ = s_Δ(x)
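As a rough illustration of the selection idea (not the exact Mamba parameterization), B, C, and Δ can be produced by per-time-step projections of the input; the layer sizes and the softplus on Δ below are assumptions of this sketch.

```python
# Sketch: B, C, Delta computed from the input x instead of being fixed parameters.
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

D, N, L = 8, 4, 16                     # channels, state size, sequence length (illustrative)
rng = np.random.default_rng(0)
W_B = rng.standard_normal((N, D))      # made-up projection weights
W_C = rng.standard_normal((N, D))
W_dt = rng.standard_normal((1, D))

x = rng.standard_normal((L, D))        # input sequence
B = x @ W_B.T                          # (L, N): B_t = s_B(x_t)
C = x @ W_C.T                          # (L, N): C_t = s_C(x_t)
dt = softplus(x @ W_dt.T)              # (L, 1): Delta_t = s_Delta(x_t), kept positive
print(B.shape, C.shape, dt.shape)
```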

Slide 10

Mamba - Detail Setting
Discretization method: Zero-Order Hold (ZOH)
  Ā = exp(ΔA),  B̄ = (ΔA)^{-1} (exp(ΔA) − I) ΔB
Interpretation of Δ:
  Δ → 0  ⇒ Ā → I,  B̄ → 0        ⟹ h_t = h_{t-1}       (ignore the current input)
  Δ → ∞ ⇒ Ā → 0,  B̄ → −A^{-1}B ⟹ h_t = −A^{-1}B x_t  (forget the past state)
→ a gated architecture is realized by Δ.
Definition of A: S4D-Real
  A_ij = −(i + 1) if i = j, 0 if i ≠ j (diagonal)
※ B and Δ are dynamic (input-dependent).
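A minimal sketch of the ZOH formulas with the diagonal S4D-Real initialization; the scalar Δ and unit B below are illustrative only (in Mamba they are input-dependent).

```python
# ZOH discretization with a diagonal A (S4D-Real: a_ii = -(i+1)).
import numpy as np

N = 4
A = -(np.arange(N) + 1.0)              # diag(A) = -1, -2, ..., -N
B = np.ones(N)                          # illustrative B
dt = 0.1                                # Delta (a fixed scalar here, dynamic in Mamba)

A_bar = np.exp(dt * A)                  # exp(Delta A), elementwise since A is diagonal
B_bar = (A_bar - 1.0) / A * B           # (Delta A)^-1 (exp(Delta A) - I) Delta B, diagonal case

# Interpretation check: Delta -> 0 keeps the state, Delta -> inf resets it
print(np.exp(1e-6 * A))                 # ~1  => A_bar -> I (ignore the input)
print(np.exp(1e6 * A))                  # ~0  => A_bar -> 0 (forget the past state)
```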

Slide 11

Mamba - Hardware-Aware Algorithm
A GPU has two kinds of memory:
• HBM: large / slow
• SRAM: small / fast
Using them efficiently is important.
Adopted method:
① Load x, Δ, A, B, C from HBM into SRAM
② Compute Ā, B̄ (in SRAM)
③ Compute h, y with the Scan Algorithm (in SRAM)
④ Write only y back to HBM
Data loading of O(BLD + DN) instead of O(BLDN) → speed up & save memory.
Note: h is not written back to HBM; it is recomputed during the backward pass.
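For reference, a naive (purely sequential, NumPy) version of the computation corresponding to steps ② and ③; the actual implementation is a fused GPU kernel that keeps everything in SRAM, and the simplified B̄ ≈ ΔB used below is an assumption of this sketch, not the exact kernel.

```python
# Naive reference selective scan: discretize, recur over time, emit only y.
import numpy as np

def selective_scan_ref(x, dt, A, B, C):
    """x: (L, D), dt: (L, D), A: (D, N), B: (L, N), C: (L, N) -> y: (L, D)."""
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))
    y = np.zeros((L, D))
    for t in range(L):
        A_bar = np.exp(dt[t][:, None] * A)        # step (2): discretize A per time step
        B_bar = dt[t][:, None] * B[t][None, :]    # simplified Delta*B (assumption of this sketch)
        h = A_bar * h + B_bar * x[t][:, None]     # step (3): recurrent state update
        y[t] = h @ C[t]                           # read out with C_t
    return y                                      # step (4): only y leaves the kernel

# Toy usage
L, D, N = 16, 2, 4
rng = np.random.default_rng(0)
y = selective_scan_ref(rng.standard_normal((L, D)), np.full((L, D), 0.1),
                       -np.ones((D, N)), rng.standard_normal((L, N)),
                       rng.standard_normal((L, N)))
print(y.shape)  # (16, 2)
```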

Slide 12

Mamba - Block
The "Mamba Block" is built around the Selective SSM.
Components (Input → … → Output):
• 1d Conv: a 1-layer convolution along the sequence-length dimension
• Selective SSM
• σ: activation function (SiLU / Swish)
• Out Proj: a 1-layer linear projection
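A rough PyTorch sketch wiring up only the components named on this slide; the real Mamba block also has an input projection and a gated second branch, and SelectiveSSM is just a placeholder here.

```python
# Sketch of the block: causal 1d conv over length, (placeholder) selective SSM,
# SiLU activation, and a linear output projection.
import torch
import torch.nn as nn

class MambaBlockSketch(nn.Module):
    def __init__(self, d_model, d_conv=4):
        super().__init__()
        self.conv1d = nn.Conv1d(d_model, d_model, d_conv,
                                groups=d_model, padding=d_conv - 1)  # depthwise, causal
        self.act = nn.SiLU()
        self.ssm = nn.Identity()            # placeholder for the Selective SSM
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                   # x: (batch, length, d_model)
        L = x.shape[1]
        z = self.conv1d(x.transpose(1, 2))[..., :L].transpose(1, 2)  # conv along length
        z = self.act(z)
        z = self.ssm(z)
        return self.out_proj(z)

y = MambaBlockSketch(16)(torch.randn(2, 32, 16))
print(y.shape)  # torch.Size([2, 32, 16])
```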

Slide 13

Experiments
Verify Mamba's ability for sequence modeling.
Tasks:
• Selective Copying
• Induction Heads
• Language Modeling
• DNA Modeling
• Audio Modeling and Generation
Also verify:
✓ speed and memory performance
✓ the effectiveness of the key methods with ablation studies

Slide 14

Experiments - Selective Copying task
Selective Copying task: valid tokens are placed at random positions within a sequence of invalid tokens (sequence length: ?, kinds of valid tokens: ?).
→ Verifies the model's ability to remember the relevant tokens and ignore the rest.
Result: (accuracy table; S6 denotes the SSM including the Selection Mechanism)
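A hedged sketch of how one selective-copying training example might be generated; the lengths, vocabulary size, and noise-token id below are made-up values, since the slide leaves them unspecified.

```python
# Generate one toy selective-copying example: scatter a few "valid" tokens in a
# noise sequence; the target is the valid tokens in order of appearance.
import numpy as np

def make_selective_copy_example(seq_len=64, n_valid=8, vocab=16, noise_token=0, rng=None):
    rng = rng or np.random.default_rng()
    seq = np.full(seq_len, noise_token)                      # all noise to start
    positions = np.sort(rng.choice(seq_len, size=n_valid, replace=False))
    valid = rng.integers(1, vocab, size=n_valid)             # valid (non-noise) token ids
    seq[positions] = valid
    return seq, valid                                        # input sequence, copy target

x, y = make_selective_copy_example()
print(x, y)
```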

Slide 15

Experiments - Induction Head task
Induction Head task: the sequence contains a repeated pattern → predict the next token (≈ in-context learning in LLMs).
Settings:
• Vocab size: 16
• Sequence length in training: 2^8; in testing: 2^6, …, 2^20
Result: high accuracy even at long sequence lengths.
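A hedged sketch of generating one induction-head style example; the trigger-token id, vocabulary, and lengths are illustrative assumptions, not the paper's exact setup.

```python
# One toy induction-head example: a trigger token is followed by an answer token
# somewhere in the sequence; the sequence ends with the trigger again, and the
# model should predict the answer.
import numpy as np

def make_induction_example(seq_len=32, vocab=16, trigger=0, rng=None):
    rng = rng or np.random.default_rng()
    seq = rng.integers(1, vocab, size=seq_len)   # random non-trigger tokens
    pos = rng.integers(0, seq_len - 2)           # where the trigger/answer pair goes
    answer = seq[pos + 1]
    seq[pos] = trigger
    seq[-1] = trigger                            # query: the trigger repeated at the end
    return seq, answer                           # the model should output `answer` next

x, y = make_induction_example()
print(x, y)
```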

Slide 16

Experiments - Language Modeling
Verify the performance of Mamba on popular downstream zero-shot evaluation tasks.
Dataset: The Pile.  ※ Only one parameter size is shown here.
The scaling law is also confirmed.

Slide 17

Experiments - Speed & Memory
(Benchmark figures: Training and Inference)
• High speed in both training and inference: O(n)
• Reduced memory consumption
OOM: Out of Memory

Slide 18

Experiments - Ablation
Ablation targets:
• Selection Mechanism
• Dynamic parameters (each parameter is marked either static (blank) or dynamic (✓) in the table)
→ A dynamic Δ is the most critical.
→ The Selection Mechanism (S6) is effective in both models.

Slide 19

Conclusion
A new sequence modeling method, Mamba, is proposed.
→ High performance in modeling quality, speed, and memory.
Key elements:
• Selection Mechanism
• Hardware-aware Algorithm

Slide 20

Appendix - Scan Algorithm (parallel computing method; note: not explicitly described in the original paper)
The state equation can be written as a single matrix recurrence:
  [h_{t+1}; 1] = [[Ā, B̄ x_t]; [0, 1]] [h_t; 1] = M_t [h_t; 1]
so [h_t; 1] = (∏_{k=1}^{t-1} M_k) [h_1; 1], i.e. computing h_t requires the prefix product of the M_k.
Example (t = 5, factors M_1, M_2, M_3, M_4):
  Start:                                                M_1, M_2, M_3, M_4
  Step 1 (multiply by the matrix 1 step to the left):   M_1, M_1 M_2, M_2 M_3, M_3 M_4
  Step 2 (multiply by the matrix 2 steps to the left):  M_1, M_1 M_2, M_1 M_2 M_3, M_1 M_2 M_3 M_4
All positions can be updated in parallel at each step.
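A plain NumPy sketch of this doubling-offset scan over matrices; positions are updated with a Python loop here, whereas on a GPU each step's updates run in parallel (and the real kernel exploits the structure of the M_t rather than dense matrices).

```python
# Doubling-offset (Hillis-Steele style) scan: after each pass, every position holds
# the product of a larger prefix of the M_k.
import numpy as np

def parallel_prefix_matmul(Ms):
    """Ms: list of square matrices. Returns prefix products [M1, M1 M2, M1 M2 M3, ...]."""
    out = list(Ms)
    offset = 1
    while offset < len(out):
        new = list(out)
        for i in range(offset, len(out)):        # each i is independent -> parallel on a GPU
            new[i] = out[i - offset] @ out[i]    # combine with the partial product `offset` steps left
        out = new
        offset *= 2
    return out

Ms = [np.array([[1.0, float(t)], [0.0, 1.0]]) for t in range(1, 5)]   # M_1 .. M_4
prefix = parallel_prefix_matmul(Ms)
print(np.allclose(prefix[-1], Ms[0] @ Ms[1] @ Ms[2] @ Ms[3]))          # True
```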