Figure 2: (Left) The standard version of the Copying task involves constant spacing between input and output elements and is easily solved by time-invariant models such as linear recurrences and global convolutions. (Right Top) The Selective Copying task has random spacing in between inputs and requires time-varying models that can selectively remember or ignore inputs depending on their content. (Right Bottom) The Induction Heads task is an example of associative recall that requires retrieving an answer based on context, a key ability for LLMs.

Algorithm 1 SSM (S4)
Input: x : (B, L, D)
Output: y : (B, L, D)
1: A : (D, N) ← Parameter    ▷ Represents structured N × N matrix
2: B : (D, N) ← Parameter
3: C : (D, N) ← Parameter
4: Δ : (D) ← τ_Δ(Parameter)
5: Ā, B̄ : (D, N) ← discretize(Δ, A, B)
6: y ← SSM(Ā, B̄, C)(x)    ▷ Time-invariant: recurrence or convolution
7: return y

Algorithm 2 SSM + Selection (S6)
Input: x : (B, L, D)
Output: y : (B, L, D)
1: A : (D, N) ← Parameter    ▷ Represents structured N × N matrix
2: B : (B, L, N) ← s_B(x)
3: C : (B, L, N) ← s_C(x)
4: Δ : (B, L, D) ← τ_Δ(Parameter + s_Δ(x))
5: Ā, B̄ : (B, L, D, N) ← discretize(Δ, A, B)
6: y ← SSM(Ā, B̄, C)(x)    ▷ Time-varying: recurrence (scan) only
7: return y

Algorithms 1 and 2 illustrate the main selection mechanism that we use. The main difference is simply making several parameters Δ, B, C functions of the input, along with the associated changes to tensor shapes throughout. In particular, we highlight that these parameters now have a length dimension L, meaning that the model has changed from time-invariant to time-varying. (Note that shape annotations were described in Section 2.) This loses the equivalence to convolutions (3) with implications for its efficiency, discussed next. We specifically choose s_B(x) = Linear_N(x), s_C(x) = Linear_N(x), s_Δ(x) = Broadcast_D(Linear_1(x)), and τ_Δ = softplus.

h_t = Ā h_{t−1} + B̄ x_t
y_t = C h_t

• Shape annotations: B is the batch size, L the sequence length, D the model's hidden dimension, and N the dimension of the SSM state h.
• Algorithm 1 (S4) is time-invariant, so the output can also be computed as a convolution (parallelizable).
• Algorithm 2 (S6) passes x through linear maps (D → N) to produce B and C, and Ā also depends on x through Δ.
• The parameters now depend on the input x and on time!
• 🥲 The convolutional form is no longer available (no straightforward parallelization). How can this be made fast?
• Mamba speeds this up with the following techniques (sketched below):
  • Parallel scan
  • Kernel fusion
  • Recomputation of activations
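As a rough, non-authoritative sketch of Algorithm 2 (S6), the JAX snippet below computes the selective SSM with a plain sequential scan. The projection weights W_B, W_C, W_dt and the bias dt_bias are hypothetical names introduced here for illustration, and the discretization (exponential for Ā, Euler step for B̄) follows the simplified form described in the paper; this shows the recurrence itself, not the fused hardware-aware implementation.

```python
import jax
import jax.numpy as jnp

def selective_ssm(x, A, W_B, W_C, W_dt, dt_bias):
    """Sequential sketch of the selective SSM (Algorithm 2, S6).

    Assumed shapes (hypothetical parameter names):
      x:        (B, L, D)  input sequence
      A:        (D, N)     structured state matrix
      W_B, W_C: (D, N)     weights of s_B(x) = Linear_N(x), s_C(x) = Linear_N(x)
      W_dt:     (D, 1)     weight of s_Delta(x) = Broadcast_D(Linear_1(x))
      dt_bias:  scalar     the "Parameter" added before tau_Delta = softplus
    """
    Bsz, L, D = x.shape
    N = A.shape[1]

    # Input-dependent parameters: this is what makes the model time-varying.
    B_t = x @ W_B                                   # (B, L, N)
    C_t = x @ W_C                                   # (B, L, N)
    dt = jax.nn.softplus(x @ W_dt + dt_bias)        # (B, L, 1)
    dt = jnp.broadcast_to(dt, (Bsz, L, D))          # (B, L, D)

    # Discretization: A_bar = exp(Delta * A), B_bar = Delta * B (Euler step).
    A_bar = jnp.exp(dt[..., None] * A)              # (B, L, D, N)
    B_bar = dt[..., None] * B_t[:, :, None, :]      # (B, L, D, N)

    def step(h, inputs):
        A_b, B_b, C_b, x_t = inputs
        h = A_b * h + B_b * x_t[..., None]          # h_t = A_bar h_{t-1} + B_bar x_t
        y = jnp.einsum('bdn,bn->bd', h, C_b)        # y_t = C h_t
        return h, y

    h0 = jnp.zeros((Bsz, D, N))
    # Move the length axis to the front so jax.lax.scan iterates over time.
    xs = (A_bar.transpose(1, 0, 2, 3), B_bar.transpose(1, 0, 2, 3),
          C_t.transpose(1, 0, 2), x.transpose(1, 0, 2))
    _, ys = jax.lax.scan(step, h0, xs)              # ys: (L, B, D)
    return ys.transpose(1, 0, 2)                    # (B, L, D)
```

Note that this naive loop materializes the full (B, L, D, N) tensors Ā and B̄ in memory; kernel fusion and recomputation of activations are exactly what the Mamba implementation uses to avoid that cost.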
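The sequential jax.lax.scan above is the part that can no longer be expressed as a convolution. However, because the recurrence is still linear in h, composing two steps h → a·h + b yields another map of the same form, so the whole sequence can be evaluated with an associative (parallel) scan in O(log L) depth. The snippet below is only an illustration of that idea using jax.lax.associative_scan, not Mamba's hardware-aware CUDA kernel.

```python
def parallel_linear_scan(A_bar, Bx_bar):
    """Parallel-scan sketch for h_t = A_bar_t * h_{t-1} + Bx_bar_t.

    A_bar, Bx_bar: arrays whose leading axis is time (e.g. (L, B, D, N)),
    where Bx_bar already contains the product B_bar_t * x_t.
    """
    def combine(left, right):
        # Composing two affine maps h -> a*h + b is again affine; this
        # operator is associative, which is what a parallel scan needs.
        a_l, b_l = left
        a_r, b_r = right
        return a_r * a_l, a_r * b_l + b_r

    _, h_all = jax.lax.associative_scan(combine, (A_bar, Bx_bar), axis=0)
    return h_all  # h_all[t] is the state h_t, unrolled from h_0 = 0
```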