Slide 17
• The selection mechanism alone
• Depending on the input, selects whether to weight the previous state or the current input

Mamba: Selective SSMs

$$h_t = A h_{t-1} + B x_t$$
$$h_t = (1 - g_t)\,h_{t-1} + g_t\,x_t$$
$$g_t = \sigma(\mathrm{Linear}(x_t))$$
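The gated recurrence above can be written as a simple sequential scan. The sketch below uses hypothetical scalar weights `w` and `b` standing in for the `Linear` projection; it is an illustration of the gating equations, not an implementation from the source.

```python
import numpy as np

def gated_ssm_scan(x, w, b):
    """Scan the scalar gated recurrence from the slide:
        g_t = sigmoid(w * x_t + b)        # sigma(Linear(x_t))
        h_t = (1 - g_t) h_{t-1} + g_t x_t # blend of previous state and input
    `w` and `b` are hypothetical weights of a 1-D Linear projection."""
    h = 0.0
    hs = []
    for xt in x:
        g = 1.0 / (1.0 + np.exp(-(w * xt + b)))  # gate in (0, 1)
        h = (1.0 - g) * h + g * xt               # convex combination
        hs.append(h)
    return np.array(hs)

print(gated_ssm_scan(np.array([1.0, -1.0, 2.0]), w=1.0, b=0.0))
```

Because $g_t \in (0, 1)$, each step is a convex combination: the state drifts toward inputs the gate deems important and retains its value otherwise.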
• Recurrent Memory Transformer (Bulatov, Kuratov, and Burtsev 2023), a lightweight wrapper around a Transformer backbone. It showed the ability to generalize up to sequences of length 1M, but only on synthetic memorization tasks; their main result is similar to our Induction Heads extrapolation experiment (Table 2).
• LongNet (Ding et al. 2023), which claimed to scale to 1B length but only evaluated on length < 100 for actual tasks.
• Hyena and HyenaDNA (Nguyen, Poli, et al. 2023; Poli et al. 2023), which claimed to leverage up to 1M context. How-
ever, their experiments trained on proportionally more data at longer contexts, making it hard to conclude if quality
improvements at 1M context are due to context length or due to more data and computation.
• Sparse Transformer (Child et al. 2019) showed a proof-of-concept of using a strided sparse attention Transformer to model audio waveforms of length $2^{20} = 1048576$, although it did not discuss performance tradeoffs when controlling for computation and model size.
In contrast, we believe this work presents one of the first approaches to meaningfully demonstrate increasing performance
with longer context.
C Mechanics of Selective SSMs
Proof of Theorem 1. Consider a selective SSM (Algorithm 2) with $N = 1$, $A = -1$, $B = 1$, $s_\Delta = \mathrm{Linear}(x)$, $\tau_\Delta = \mathrm{softplus}$. The corresponding continuous-time SSM (1) is
$$h'(t) = -h(t) + x(t)$$
which is also called a leaky integrator.

The discretization step size is
$$\Delta_t = \tau_\Delta(\mathrm{Parameter} + s_\Delta(x_t)) = \mathrm{softplus}(\mathrm{Parameter} + \mathrm{Linear}(x_t)) = \mathrm{softplus}(\mathrm{Linear}(x_t))$$
where we observe that the parameter can be viewed as a learnable bias and folded into the linear projection.

Now applying the zero-order hold (ZOH) discretization formulas:
$$\bar{A}_t = \exp(\Delta A) = \frac{1}{1 + \exp(\mathrm{Linear}(x_t))} = \sigma(-\mathrm{Linear}(x_t)) = 1 - \sigma(\mathrm{Linear}(x_t))$$
$$\bar{B}_t = (\Delta A)^{-1}(\exp(\Delta A) - I) \cdot \Delta B = -(\exp(\Delta A) - I) = 1 - \bar{A}_t = \sigma(\mathrm{Linear}(x_t)).$$

Thus the final discrete recurrence (2a) is
$$g_t = \sigma(\mathrm{Linear}(x_t))$$
$$h_t = (1 - g_t)\,h_{t-1} + g_t\,x_t$$
as desired. ∎
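The chain of identities above hinges on the fact that $\exp(-\mathrm{softplus}(z)) = 1/(1 + e^z) = 1 - \sigma(z)$. A quick numerical check (a sketch only; `z` stands in for $\mathrm{Linear}(x_t)$):

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# With A = -1, B = 1, the ZOH discretization A_bar = exp(Delta * A)
# collapses to 1 - sigmoid(z) when Delta = softplus(z):
#   exp(-softplus(z)) = exp(-log(1 + e^z)) = 1 / (1 + e^z) = 1 - sigmoid(z)
z = np.linspace(-5.0, 5.0, 101)       # z plays the role of Linear(x_t)
A_bar = np.exp(-softplus(z))          # ZOH state transition with A = -1
assert np.allclose(A_bar, 1.0 - sigmoid(z))
# and B_bar = 1 - A_bar = sigmoid(z), i.e. exactly the gate g_t
assert np.allclose(1.0 - A_bar, sigmoid(z))
print("ZOH identity holds")
```

This confirms that the ZOH-discretized leaky integrator is term-by-term the gated recurrence of the theorem.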
D Hardware-aware Algorithm For Selective SSMs
Interpreting the recurrence:

• When $g_t \to 0$: $h_t = h_{t-1}$, i.e. the input is ignored. Filler interjections such as "uh" and "um" are effectively filtered out.
• When $g_t \to 1$: $h_t = x_t$, i.e. the state is reset. By resetting in step with topic shifts such as "by the way", the model can understand the context more deeply.
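The two limiting cases can be checked directly on a single recurrence step. A minimal sketch; `step` is a hypothetical helper introduced here for illustration:

```python
import numpy as np

def step(h_prev, x_t, g_t):
    """One step of the gated recurrence h_t = (1 - g_t) h_{t-1} + g_t x_t."""
    return (1.0 - g_t) * h_prev + g_t * x_t

h_prev, x_t = 0.5, 3.0
# g_t -> 0: the input is ignored and the state carries over (filler words)
assert np.isclose(step(h_prev, x_t, g_t=0.0), h_prev)
# g_t -> 1: the state is reset to the current input (topic shift)
assert np.isclose(step(h_prev, x_t, g_t=1.0), x_t)
print("gate extremes behave as described")
```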