Slide 33
Slide 33 text
Mamba: Ā, B̄, C are computed dynamically depending on the input x
Figure 2: (Left) The standard version of the Copying task involves constant spacing between input and output elements and is
easily solved by time-invariant models such as linear recurrences and global convolutions. (Right Top) The Selective Copying task
has random spacing in between inputs and requires time-varying models that can selectively remember or ignore inputs depending
on their content. (Right Bottom) The Induction Heads task is an example of associative recall that requires retrieving an answer
based on context, a key ability for LLMs.
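As a rough illustration of the two copying variants (not code from the paper; names and sizes are made up), the sketch below builds one toy instance of each: the standard task places the tokens in a fixed block, so the input-to-output spacing is constant, while the selective task scatters the same tokens among noise at random positions.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, NOISE, SEQ_LEN, N_TOKENS = 8, 0, 20, 4  # hypothetical sizes

# Standard copying: tokens to copy sit in a fixed block, so the
# input-to-output offset is constant and a time-invariant model suffices.
tokens = rng.integers(1, VOCAB, size=N_TOKENS)
standard_input = np.concatenate([tokens, np.full(SEQ_LEN - N_TOKENS, NOISE)])

# Selective copying: the same tokens are scattered at random positions,
# so the model must decide per position whether to remember or ignore it.
positions = np.sort(rng.choice(SEQ_LEN, size=N_TOKENS, replace=False))
selective_input = np.full(SEQ_LEN, NOISE)
selective_input[positions] = tokens

target = tokens  # both variants must reproduce the tokens in order
print(standard_input, selective_input, target, sep="\n")
```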
Algorithm 1 SSM (S4)
Input: x : (B, L, D)
Output: y : (B, L, D)
1: A : (D, N) ← Parameter    ▷ Represents structured N×N matrix
2: B : (D, N) ← Parameter
3: C : (D, N) ← Parameter
4: Δ : (D) ← τ_Δ(Parameter)
5: Ā, B̄ : (D, N) ← discretize(Δ, A, B)
6: y ← SSM(Ā, B̄, C)(x)    ▷ Time-invariant: recurrence or convolution
7: return y
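To make line 6 concrete for the time-invariant case, here is a minimal single-channel sketch (my own NumPy illustration with a diagonal A, not the paper's implementation): after zero-order-hold discretization, the same output can be produced either by the recurrence or by one global convolution with kernel K = (CB̄, CĀB̄, CĀ²B̄, ...).

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 16, 4                      # sequence length, state size (single channel)
A = -np.abs(rng.normal(size=N))   # diagonal of A (negative -> stable)
B = rng.normal(size=N)
C = rng.normal(size=N)
delta = 0.1                       # step size

# Zero-order-hold discretization (diagonal case)
A_bar = np.exp(delta * A)
B_bar = (A_bar - 1.0) / A * B

x = rng.normal(size=L)

# (a) recurrence: h_t = A_bar * h_{t-1} + B_bar * x_t,  y_t = C . h_t
h = np.zeros(N)
y_rec = np.empty(L)
for t in range(L):
    h = A_bar * h + B_bar * x[t]
    y_rec[t] = C @ h

# (b) global convolution with kernel K_k = C . A_bar^k . B_bar
K = np.array([C @ (A_bar ** k * B_bar) for k in range(L)])
y_conv = np.array([K[: t + 1][::-1] @ x[: t + 1] for t in range(L)])

assert np.allclose(y_rec, y_conv)  # time-invariance makes both views equivalent
```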
Algorithm 2 SSM + Selection (S6)
Input: x : (B, L, D)
Output: y : (B, L, D)
1: A : (D, N) ← Parameter    ▷ Represents structured N×N matrix
2: B : (B, L, N) ← s_B(x)
3: C : (B, L, N) ← s_C(x)
4: Δ : (B, L, D) ← τ_Δ(Parameter + s_Δ(x))
5: Ā, B̄ : (B, L, D, N) ← discretize(Δ, A, B)
6: y ← SSM(Ā, B̄, C)(x)    ▷ Time-varying: recurrence (scan) only
7: return y
Algorithms 1 and 2 illustrate the main selection mechanism that we use. The main difference is simply making
several parameters Δ, B, C functions of the input, along with the associated changes to tensor shapes throughout.
In particular, we highlight that these parameters now have a length dimension L, meaning that the model has
changed from time-invariant to time-varying. (Note that shape annotations were described in Section 2.) This
loses the equivalence to convolutions (3), with implications for its efficiency, discussed next.
We specifically choose s_B(x) = Linear_N(x), s_C(x) = Linear_N(x), s_Δ(x) = Broadcast_D(Linear_1(x)), and
τ_Δ = softplus, where Linear_d is a parameterized projection to dimension d.
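A rough NumPy sketch of these choices (illustrative names and a simplified B̄ ≈ ΔB discretization, not the paper's code): B, C, Δ come from per-step linear projections of x, Ā and B̄ are then formed per step, and the output must be computed with a sequential scan because the parameters now vary along the length dimension L.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D, N = 8, 4, 3          # sequence length, model dim, state dim (batch omitted)

def softplus(z):
    return np.log1p(np.exp(z))

x = rng.normal(size=(L, D))

# Input-dependent projections, as in s_B, s_C, s_Delta of Algorithm 2
W_B = rng.normal(size=(D, N)) / np.sqrt(D)   # s_B(x)     = Linear_N(x)
W_C = rng.normal(size=(D, N)) / np.sqrt(D)   # s_C(x)     = Linear_N(x)
W_d = rng.normal(size=(D, 1)) / np.sqrt(D)   # s_Delta(x) = Broadcast_D(Linear_1(x))
delta_bias = rng.normal(size=D)              # the "Parameter" added before tau_Delta

B = x @ W_B                                  # (L, N)
C = x @ W_C                                  # (L, N)
delta = softplus(x @ W_d + delta_bias)       # (L, D), tau_Delta = softplus

# Per-step discretization (diagonal A), giving time-varying A_bar, B_bar
A = -np.abs(rng.normal(size=(D, N)))
A_bar = np.exp(delta[:, :, None] * A)        # (L, D, N)
B_bar = delta[:, :, None] * B[:, None, :]    # (L, D, N), simplified B_bar ~ delta*B

# Time-varying parameters break the convolution view, so scan sequentially
h = np.zeros((D, N))
y = np.empty((L, D))
for t in range(L):
    h = A_bar[t] * h + B_bar[t] * x[t][:, None]
    y[t] = h @ C[t]                          # y_t = C_t h_t, per channel
```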
h_t = Ā h_{t-1} + B̄ x_t
y_t = C h_t
B: batch size   L: sequence length   D: model hidden dimension   N: dimension of the SSM state h
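With these shape conventions, one step of the recurrence above could be written roughly as follows (illustrative only): the hidden state h keeps shape (B, D, N), while the time-varying parameters are sliced per step out of the (B, L, ...) tensors from Algorithm 2.

```python
import numpy as np

rng = np.random.default_rng(0)
Bb, L, D, N = 2, 5, 4, 3             # batch, length, model dim, state dim

x     = rng.normal(size=(Bb, L, D))
A_bar = rng.uniform(0.5, 1.0, size=(Bb, L, D, N))   # time-varying, per Algorithm 2
B_bar = rng.normal(size=(Bb, L, D, N))
C     = rng.normal(size=(Bb, L, N))

h = np.zeros((Bb, D, N))             # hidden state h_t
y = np.empty((Bb, L, D))
for t in range(L):
    # h_t = A_bar_t * h_{t-1} + B_bar_t * x_t   (elementwise over D, N)
    h = A_bar[:, t] * h + B_bar[:, t] * x[:, t, :, None]
    # y_t = C_t h_t : contract over the state dimension N
    y[:, t] = np.einsum("bdn,bn->bd", h, C[:, t])
```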