Discretization of SSMs

The discretized SSM becomes a recurrent representation with a structure similar to an RNN → it enables more efficient inference than Transformer-based models, which compute attention over all inputs.

For discrete time steps $k$ with sampling interval $\Delta_k = [t_{k-1}, t_k]$:

State equation:  $h_k = \bar{A} h_{k-1} + \bar{B} x_k$
Output equation: $y_k = \bar{C} h_k$

Zero-order hold (ZOH) discretization:
$\bar{A} = \exp(\Delta A)$
$\bar{B} = (\Delta A)^{-1} (\exp(\Delta A) - I) \cdot \Delta B$

[Figure: the recurrence unrolled over time steps $t-1, t, t+1$ — each input $x_t$ updates the hidden state $h_t$ through $\bar{A}$ and $\bar{B}$, and the output $y_t$ is produced through $C$.]
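A minimal NumPy sketch of this discretization and the resulting RNN-style recurrence (the state size, the stable diagonal $A$, and the scalar step size are illustrative assumptions, not the paper's exact parameterization):

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential for exp(ΔA)

def discretize_zoh(A, B, delta):
    """Zero-order hold: A_bar = exp(ΔA), B_bar = (ΔA)^{-1}(exp(ΔA) - I) ΔB."""
    dA = delta * A
    A_bar = expm(dA)
    B_bar = np.linalg.inv(dA) @ (A_bar - np.eye(A.shape[0])) @ (delta * B)
    return A_bar, B_bar

def ssm_recurrence(A_bar, B_bar, C, xs):
    """Run the discrete SSM as a scan: h_k = A_bar h_{k-1} + B_bar x_k, y_k = C h_k."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x in xs:                      # one fixed-cost state update per token
        h = A_bar @ h + B_bar.flatten() * x
        ys.append((C @ h).item())
    return np.array(ys)

# Toy example: state size N = 4, one input/output channel, 10 time steps.
N = 4
A = -np.diag(np.arange(1.0, N + 1))   # a stable diagonal A (assumption for the demo)
B = np.ones((N, 1))
C = np.ones((1, N)) / N
A_bar, B_bar = discretize_zoh(A, B, delta=0.1)
y = ssm_recurrence(A_bar, B_bar, C, xs=np.sin(np.linspace(0, 3, 10)))
print(y)
```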
SSMs are time-invariant, meaning that their A, B, C, and Δ are independent of the model input x. This limits context-aware modeling, which leads to inferior performance of SSMs in certain tasks such as selective copying [55].

Table 1. Pros and cons of three primary architectures (RNNs, Transformers, and SSMs) in auto-regressive sequential modeling tasks.

  Comparison            | RNNs             | Transformers          | SSMs
  Training Speed        | Slow (Recurrent) | Fast (Parallel)       | Fast (Convolutional)
  Inference Speed       | Fast (Recurrent) | Slow (Quadratic-Time) | Fast (Recurrent)
  Complexity            | O(LD^2)          | O(L^2 D)              | O(LD^2)
  Modeling Capabilities | ✗ (Hidden State) | ✓ (Attention)         | ✗ (Time-Invariance)

[Figure: characteristics of each model in auto-regressive tasks — Attention is computed against all past inputs, whereas an SSM output is computed from the current input and the previous state only (the $\bar{A}, \bar{B}, C$ recurrence).]
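To make the inference-speed row concrete, here is a rough sketch (my own illustration, not from the survey) contrasting the per-token generation cost of attention, which must touch the whole prefix, with the constant-size state update of an SSM:

```python
import numpy as np

D, N = 64, 16          # model width D, SSM state size N (illustrative)

def attention_step(q_t, K_prefix, V_prefix):
    """One generated token with attention touches the entire prefix: O(L*D) per step, O(L^2*D) overall."""
    scores = K_prefix @ q_t / np.sqrt(D)          # (L,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_prefix                     # (D,)

def ssm_step(h_prev, x_t, A_bar, B_bar, C):
    """One generated token with a diagonal SSM only touches the fixed-size state: O(N*D) per step."""
    h = A_bar * h_prev + B_bar * x_t[None, :]     # (N, D) state, element-wise update
    return h, C @ h                               # next state, output (1, D)

# The attention step's cost grows with the prefix length L; the SSM step's cost does not.
L = 1000
q, K, V = np.random.randn(D), np.random.randn(L, D), np.random.randn(L, D)
_ = attention_step(q, K, V)
h0 = np.zeros((N, D))
A_bar, B_bar, C = np.full((N, 1), 0.9), np.ones((N, 1)), np.ones((1, N)) / N
_ = ssm_step(h0, np.random.randn(D), A_bar, B_bar, C)
```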
A simple network architecture: the Mamba block
Mamba [Gu and Dao, arXiv 2023]

[Figure (from the Mamba paper): the Mamba block combines the H3 block and the Gated MLP block — linear projections, a convolution, an SSM, and nonlinearities (activation or multiplication).]
[Figure (from the Mamba paper): "Selective State Space Model with Hardware-aware State Expansion" — the input is projected to Δ_t, B_t, C_t (selection mechanism) and discretized; the expanded state h_t is materialized in GPU SRAM while parameters are loaded from GPU HBM.]

Key components: Selective SSMs, the Mamba block, and a hardware-aware algorithm.
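A minimal sketch of the selection mechanism, assuming the usual formulation in which Δ, B, and C become functions of the input; the projection sizes, initializations, and the simplified discretization of B are illustrative assumptions, not the reference implementation:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_ssm(x, A, W_delta, W_B, W_C):
    """Selective SSM scan: Δ_t, B_t, C_t depend on the input x_t (selection mechanism).
    x: (L, D) sequence, A: (N,) diagonal state matrix, W_*: input-dependent projections."""
    L, D = x.shape
    N = A.shape[0]
    h = np.zeros((N, D))
    ys = np.zeros((L, D))
    for t in range(L):
        delta_t = softplus(x[t] @ W_delta)            # (D,) per-channel step size
        B_t = x[t] @ W_B                              # (N,) input-dependent B
        C_t = x[t] @ W_C                              # (N,) input-dependent C
        A_bar = np.exp(delta_t[None, :] * A[:, None]) # (N, D) ZOH for diagonal A
        B_bar = delta_t[None, :] * B_t[:, None]       # (N, D) simplified (Euler) discretization of B
        h = A_bar * h + B_bar * x[t][None, :]
        ys[t] = C_t @ h
    return ys

# Toy usage with random projections (illustrative only).
L, D, N = 8, 16, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((L, D))
A = -np.arange(1.0, N + 1)
y = selective_ssm(x, A, rng.standard_normal((D, D)) * 0.1,
                  rng.standard_normal((D, N)) * 0.1, rng.standard_normal((D, N)) * 0.1)
print(y.shape)  # (8, 16)
```

Because Δ_t, B_t, and C_t vary with the input, the model can amplify or suppress individual tokens, which is exactly what the time-invariant SSM of the previous slide cannot do.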
Derivation: interpreting the selection mechanism as a gate

$h_t = (1 - g_t)\,h_{t-1} + g_t\,x_t, \qquad g_t = \sigma(\mathrm{Linear}(x_t))$

• Recurrent Memory Transformer (Bulatov, Kuratov, and Burtsev 2023), a lightweight wrapper around a Transformer backbone. It showed the ability to generalize up to 1M sequences, but only on synthetic memorization tasks; their main result is similar to our Induction Heads extrapolation experiment (Table 2).
• LongNet (Ding et al. 2023), which claimed to scale to 1B length but only evaluated on lengths < 100 for actual tasks.
• Hyena and HyenaDNA (Nguyen, Poli, et al. 2023; Poli et al. 2023), which claimed to leverage up to 1M context. However, their experiments trained on proportionally more data at longer contexts, making it hard to conclude whether quality improvements at 1M context are due to context length or due to more data and computation.
• Sparse Transformer (Child et al. 2019) showed a proof-of-concept of using a strided sparse attention Transformer to model audio waveforms of length 2^20 = 1048576, although it did not discuss performance tradeoffs when controlling for computation and model size.
In contrast, we believe this work presents one of the first approaches to meaningfully demonstrate increasing performance with longer context.

C Mechanics of Selective SSMs

Proof of Theorem 1. Consider a selective SSM (Algorithm 2) with $N = 1$, $A = -1$, $B = 1$, $s_\Delta = \mathrm{Linear}(x)$, $\tau_\Delta = \mathrm{softplus}$. The corresponding continuous-time SSM (1) is
$h'(t) = -h(t) + x(t)$,
which is also called a leaky integrator. The discretization step size is
$\Delta_t = \tau_\Delta(\mathrm{Parameter} + s_\Delta(x_t)) = \mathrm{softplus}(\mathrm{Parameter} + \mathrm{Linear}(x_t)) = \mathrm{softplus}(\mathrm{Linear}(x_t))$,
where we observe that the parameter can be viewed as a learnable bias and folded into the linear projection. Now applying the zero-order hold (ZOH) discretization formulas:
$\bar{A}_t = \exp(\Delta_t A) = \dfrac{1}{1 + \exp(\mathrm{Linear}(x_t))} = \sigma(-\mathrm{Linear}(x_t)) = 1 - \sigma(\mathrm{Linear}(x_t))$
$\bar{B}_t = (\Delta_t A)^{-1}(\exp(\Delta_t A) - I) \cdot \Delta_t B = -(\exp(\Delta_t A) - I) = 1 - \bar{A}_t = \sigma(\mathrm{Linear}(x_t))$.
Thus the final discrete recurrence (2a) is
$g_t = \sigma(\mathrm{Linear}(x_t))$
$h_t = (1 - g_t)\,h_{t-1} + g_t\,x_t$
as desired. ∎

Interpretation of the gate:
When $g_t \to 0$: $h_t = h_{t-1}$, the input is ignored — filler interjections such as "uh" and "um" are efficiently filtered out.
When $g_t \to 1$: $h_t = x_t$, the state is reset — resetting in step with topic shifts such as "by the way" lets the model understand the context more deeply.
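A small numeric sketch (my own check, not from the paper) confirming that with $N = 1$, $A = -1$, $B = 1$ the ZOH-discretized selective SSM step reduces to the gated recurrence above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softplus(z):
    return np.log1p(np.exp(z))

rng = np.random.default_rng(0)
w, b = rng.standard_normal(), rng.standard_normal()   # a toy "Linear(x_t)" = w*x_t + b
h_zoh = h_gate = 0.0

for x_t in rng.standard_normal(5):
    z = w * x_t + b                    # Linear(x_t)
    delta = softplus(z)                # Δ_t = softplus(Linear(x_t)), with A = -1, B = 1
    A_bar = np.exp(-delta)             # exp(ΔA) = exp(-Δ) = 1 / (1 + exp(z)) = 1 - σ(z)
    B_bar = 1.0 - A_bar                # (ΔA)^{-1}(exp(ΔA) - I)·ΔB = 1 - exp(-Δ) = σ(z)
    h_zoh = A_bar * h_zoh + B_bar * x_t

    g = sigmoid(z)                     # gate form: g_t = σ(Linear(x_t))
    h_gate = (1.0 - g) * h_gate + g * x_t

    assert np.isclose(h_zoh, h_gate)   # both recurrences produce the same state

print("ZOH step and gated recurrence agree:", h_zoh, h_gate)
```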
Figure 3: (Architecture.) Our simplified block design combines the H3 block, which is the basis of most SSM architectures, with the ubiquitous MLP block of modern neural networks. Instead of interleaving these two blocks, we simply repeat the Mamba block homogeneously. Compared to the H3 block, Mamba replaces the first multiplicative gate with an activation function. Compared to the MLP block, Mamba adds an SSM to the main branch. For σ we use the SiLU / Swish activation (Hendrycks and Gimpel 2016; Ramachandran, Zoph, and Quoc V. Le 2017).

Gating mechanism: uses the SiLU (Sigmoid Linear Unit) activation σ(·)
→ selects which input information to keep and which to discard
Sped up further through engineering techniques (see the sketch below)
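A minimal outline of the gated branch structure described in the caption, under stated assumptions: the projection widths, the causal depthwise convolution, and the inner SSM stand-in are placeholders, not the reference implementation.

```python
import numpy as np

def silu(z):
    """SiLU / Swish activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def depthwise_conv1d(x, kernel):
    """Causal depthwise convolution over the sequence axis. x: (L, D), kernel: (K, D)."""
    K = kernel.shape[0]
    x_pad = np.concatenate([np.zeros((K - 1, x.shape[1])), x], axis=0)
    return np.stack([(x_pad[t:t + K] * kernel).sum(axis=0) for t in range(x.shape[0])])

def mamba_block(x, W_in, W_gate, W_out, conv_kernel, ssm_fn):
    """Gated Mamba-style block: (conv + SSM branch) * SiLU(gate branch) -> output projection."""
    u = x @ W_in                          # main branch projection, (L, E)
    g = x @ W_gate                        # gate branch projection, (L, E)
    u = silu(depthwise_conv1d(u, conv_kernel))
    y = ssm_fn(u)                         # sequence transformation (e.g. a selective SSM)
    y = y * silu(g)                       # multiplicative gate with SiLU, as in the figure
    return y @ W_out                      # project back to model width, (L, D)

# Toy usage with an identity "SSM" stand-in (illustrative only).
L, D, E, K = 8, 16, 32, 4
rng = np.random.default_rng(0)
out = mamba_block(rng.standard_normal((L, D)),
                  rng.standard_normal((D, E)) * 0.1, rng.standard_normal((D, E)) * 0.1,
                  rng.standard_normal((E, D)) * 0.1, rng.standard_normal((K, E)) * 0.1,
                  ssm_fn=lambda u: u)
print(out.shape)  # (8, 16)
```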
Selective Copying (tokens: content letters, blank, trigger)
Input sequence:  [h e l l o (blank) (blank) (blank) (trigger)]
Expected output: [(blank) (blank) (blank) (blank) (blank) (blank) (blank) h e l l o]

Table 1: (Selective Copying.) Accuracy for combinations of architectures and inner sequence layers.

  Model | Arch.   | Layer | Acc.
  S4    | No gate | S4    | 18.3
  -     | No gate | S6    | 97.0
  H3    | H3      | S4    | 57.0
  Hyena | H3      | Hyena | 30.1
  -     | H3      | S6    | 99.7
  -     | Mamba   | S4    | 56.4
  -     | Mamba   | Hyena | 28.4
  Mamba | Mamba   | S6    | 99.8

Table 2: (Induction Heads.) Models are trained on sequence length 2^8 = 256, and tested on increasing sequence lengths from 2^6 = 64 up to 2^20 = 1048576. Full numbers in Table 11.

→ Mamba (Selective SSMs) makes dynamic, input-dependent inference possible.

Induction Heads: a synthetic task for evaluating in-context-learning ability.
Input sequence: A B A B A …
Pattern discovery: the sequence "AB" is repeated.
Expected model output: the next token is "B".
The model's task: recognize this pattern and output "B".
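A small sketch (my own illustration of the two synthetic tasks above; vocabulary, lengths, and token symbols are chosen arbitrarily) that generates examples of the selective-copying and induction-heads setups:

```python
import random

BLANK, TRIGGER = "_", "|"
VOCAB = list("abcdefghij")

def selective_copying_example(n_tokens=5, seq_len=16):
    """Content tokens are scattered among blanks; the model must emit them, in order, after the trigger."""
    content = random.choices(VOCAB, k=n_tokens)
    positions = sorted(random.sample(range(seq_len), n_tokens))
    inputs = [BLANK] * seq_len
    for pos, tok in zip(positions, content):
        inputs[pos] = tok
    inputs.append(TRIGGER)
    targets = content                      # expected output after the trigger
    return inputs, targets

def induction_heads_example(seq_len=12):
    """A repeating 'A B' pattern; after seeing 'A' again the model should predict 'B'."""
    a, b = random.sample(VOCAB, 2)
    inputs = [a, b] * (seq_len // 2) + [a]
    target = b                             # the token the model should output next
    return inputs, target

random.seed(0)
print(selective_copying_example())   # e.g. (['_', 'g', ..., '|'], ['g', ...])
print(induction_heads_example())     # e.g. (['a', 'e', 'a', 'e', ..., 'a'], 'e')
```

Solving the selective variant requires ignoring the blanks, which is exactly the content-dependent behavior a time-invariant SSM lacks.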
Mamba2 [Dao and Gu, ICML 2024]

• Further extends the functionality of the Mamba block
  - Redefines the SSM computation as a matrix-multiplication algorithm, improving computation on GPUs (see the sketch below)
  - Expresses the Transformer's attention mechanism and Selective SSMs in a unified way through structured matrices
• Attention and Selective SSMs are mathematically equivalent → Transformer techniques can be brought into SSMs

Block-design changes:
  - Mamba-1: a linear projection that produces X, followed by another linear projection that produces A, B, C
  - Mamba-2: a single linear projection produces A, X, B, C in parallel
  - Normalization is introduced, based on the NormFormer architecture

[Figure (from the Mamba-2 paper): sequential Mamba-1 block vs. parallel Mamba-2 block — linear projections, Conv, SSM, nonlinearity (activation, normalization, multiplication).]
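A rough sketch (my own illustration, using a scalar-state recurrence and tiny sizes) of the idea of rewriting the SSM scan as a single matrix multiply $y = M x$, where $M$ is a lower-triangular structured matrix built from the per-step decay factors:

```python
import numpy as np

def ssm_scan(a, b, c, x):
    """Reference recurrence: h_t = a_t * h_{t-1} + b_t * x_t,  y_t = c_t * h_t (scalar state)."""
    h, ys = 0.0, []
    for t in range(len(x)):
        h = a[t] * h + b[t] * x[t]
        ys.append(c[t] * h)
    return np.array(ys)

def ssm_as_matmul(a, b, c, x):
    """Same computation as one matrix multiply with M[t, s] = c_t * (a_{s+1} * ... * a_t) * b_s, s <= t."""
    L = len(x)
    M = np.zeros((L, L))
    for t in range(L):
        for s in range(t + 1):
            decay = np.prod(a[s + 1:t + 1])   # product of decay factors between step s and step t
            M[t, s] = c[t] * decay * b[s]
    return M @ x

rng = np.random.default_rng(0)
L = 6
a, b, c, x = rng.uniform(0.5, 1.0, L), rng.standard_normal(L), rng.standard_normal(L), rng.standard_normal(L)
assert np.allclose(ssm_scan(a, b, c, x), ssm_as_matmul(a, b, c, x))
print("recurrence and matrix form agree")
```

Blocks of this structured matrix can be computed with dense matrix multiplications, which is what lets the computation map well onto GPU hardware instead of running as a purely sequential scan.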
Vim: output design for the image-classification task

Head class token: concatenate the class token at the head of the token sequence.
Double class token: concatenate class tokens at both the head and the tail of the token sequence.
Middle class token: insert the class token at the middle of the token sequence.

Table 5. Ablation study on the classification design. The default setting for Vim is marked in blue.

  Classification strategy | ImageNet top-1 acc.
  Mean pool               | 73.9
  Max pool                | 73.4
  Head class token        | 75.2
  Double class token      | 74.3
  Middle class token      | 76.1

→ The middle class token achieves the highest accuracy.
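A tiny sketch (my own illustration; token counts, embedding size, and the reuse of a single class-token vector are arbitrary) of how the three class-token placements differ when building the input sequence:

```python
import numpy as np

def add_class_token(patch_tokens, cls_token, strategy="middle"):
    """Insert a learnable class token into the patch-token sequence.
    patch_tokens: (L, D), cls_token: (1, D)."""
    L = patch_tokens.shape[0]
    if strategy == "head":
        return np.concatenate([cls_token, patch_tokens], axis=0)
    if strategy == "double":                      # one class token at the head and one at the tail
        return np.concatenate([cls_token, patch_tokens, cls_token], axis=0)
    if strategy == "middle":                      # insert at the middle of the sequence
        return np.concatenate([patch_tokens[: L // 2], cls_token, patch_tokens[L // 2 :]], axis=0)
    raise ValueError(strategy)

tokens = np.zeros((196, 192))                     # e.g. 14x14 patches, embedding dim 192
cls = np.ones((1, 192))
print(add_class_token(tokens, cls, "head").shape)    # (197, 192)
print(add_class_token(tokens, cls, "double").shape)  # (198, 192)
print(add_class_token(tokens, cls, "middle").shape)  # (197, 192)
```

Placing the token in the middle keeps it close to every patch under Vim's bidirectional scan, which is consistent with it scoring best in the ablation above.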
Qian Zhang², Xinlong Wang³, Wenyu Liu¹, Xinggang Wang¹
¹Huazhong University of Science and Technology  ²Horizon Robotics  ³Beijing Academy of Artificial Intelligence
Code & Models: hustvl/Vim

[Figure 1 panels: (a) Accuracy Comparison (Classification Top-1 Acc., Sem. Seg. mIoU, Detection mAP, Ins. Seg. mAP); (b) Speed Comparison (FPS, log scale, vs. resolution); (c) GPU Memory Comparison (GB vs. resolution); DeiT-Ti vs. Vim-Ti, "2.8× faster", "-86.8% memory".]

Figure 1. Performance and efficiency comparisons between DeiT [59] and our Vim model. For the accuracy comparison, we first pretrain DeiT and Vim on the IN1K classification dataset [9], then we finetune the generic backbones on different downstream dense prediction tasks, i.e., semantic segmentation, object detection, instance segmentation. Results show that the proposed Vim outperforms DeiT on both pretraining and finetuning tasks. Vim is also more computation and memory efficient than DeiT in dealing with high-resolution images. For example, Vim is 2.8× faster than DeiT and saves 86.8% GPU memory when performing batch inference to extract features on images with a resolution of 1248×1248, i.e., 6084 tokens per image.
Summary: Mamba / Vision Mamba (Vim)

• Vim extends Mamba to computer-vision tasks
• Introduces bidirectional sequence modeling

[Survey figure: Fig. 1 — the statistics of Mamba-based papers released to date on vision tasks, spanning different modalities including Image, Video, Point Cloud, and Multi-Modal.]

Note: selecting hardware on which Mamba's benefits can actually be realized is also important.
→ Mamba-related papers for CV tasks continue to increase.