Learning to (Learn at Test Time): RNNs with Expressive Hidden States


Hiroto Kurita

August 22, 2025

Transcript

  1. Presenter: Hiroto Kurita (Tohoku University) @ the 17th Advanced NLP Study Group

    Learning to (Learn at Test Time): RNNs with Expressive Hidden States
    Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin
    ICML 2025 (Spotlight poster) https://arxiv.org/abs/2407.04620
    ※ Unless otherwise noted, figures and data are taken from the paper being presented.
  2. Overview

    • A model-architecture proposal paper: continuing to improve RNNs.
    • Uses Test-Time Training (TTT) to compress the input tokens "well" into the RNN's hidden state.
    • Why this paper was chosen:
    • Modern RNN / SSM implementations (including hybrid architectures) are maturing. In Japan, PFN's PLaMo 2; Cartesia's TTS model is also rumored to be SSM-based.
    • Using TTT is an approach somewhat different from previous SSMs.
  3. Review: an RNN updates its state from the previous state and the current input.

    State s_0, s_1, ..., s_{t-1}, s_t; inputs x_1, ..., x_t; outputs z_1, ..., z_t.
    s_t = σ(θ_ss s_{t-1} + θ_sx x_t),  z_t = θ_zs s_t + θ_zx x_t
    🟦: model weights, updated by backpropagation.
  4. Review: an RNN updates its state from the previous state and the current input.

    State s_0, s_1, ..., s_{t-1}, s_t; inputs x_1, ..., x_t; outputs z_1, ..., z_t.
    s_t = σ(θ_ss s_{t-1} + θ_sx x_t),  z_t = θ_zs s_t + θ_zx x_t
    🟦: model weights, updated by backpropagation.
    Compute per step: O(1). Memory: O(1). 🥰
  5. Review: an RNN updates its state from the previous state and the current input.

    State s_0, s_1, ..., s_{t-1}, s_t; inputs x_1, ..., x_t; outputs z_1, ..., z_t.
    s_t = σ(θ_ss s_{t-1} + θ_sx x_t),  z_t = θ_zs s_t + θ_zx x_t
    🟦: model weights, updated by backpropagation.
    Compute per step: O(1). Memory: O(1). 🥰
    Self-attention, by contrast: compute per step O(t), memory O(t). 😖
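To make the recurrence above concrete, here is a minimal NumPy sketch of one such RNN step; the dimensions, initialization, and the choice of sigmoid for σ are illustrative assumptions, not details from the paper.

```python
import numpy as np

def rnn_step(s_prev, x_t, theta):
    """One RNN step: s_t = sigma(theta_ss s_{t-1} + theta_sx x_t), z_t = theta_zs s_t + theta_zx x_t."""
    s_t = 1.0 / (1.0 + np.exp(-(theta["ss"] @ s_prev + theta["sx"] @ x_t)))  # sigma = sigmoid here
    z_t = theta["zs"] @ s_t + theta["zx"] @ x_t
    return s_t, z_t

rng = np.random.default_rng(0)
d_state, d_in = 8, 4                                    # illustrative sizes
theta = {"ss": 0.1 * rng.normal(size=(d_state, d_state)),
         "sx": 0.1 * rng.normal(size=(d_state, d_in)),
         "zs": 0.1 * rng.normal(size=(d_in, d_state)),
         "zx": 0.1 * rng.normal(size=(d_in, d_in))}
s = np.zeros(d_state)
for x in rng.normal(size=(16, d_in)):                   # a toy sequence of 16 tokens
    s, z = rnn_step(s, x, theta)                         # per-token compute and memory stay O(1)
```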
  6. 😢 (Modern) RNNs are weak on long sequences.

    🟥 Mamba: looks fine at first, but saturates in the later part of the context.
    Long sequences are exactly where RNNs are supposed to shine...
  7. 😢 (Modern) RNNs are weak on long sequences.

    🟥 Mamba: looks fine at first, but saturates in the later part of the context. Long sequences are exactly where RNNs are supposed to shine...
    🟦🍊: proposed method. 🤔 How should the large number of past-context tokens be compressed?
  8. 💡 Self-supervised learning ≈ compressing the training data.

    A training dataset (e.g., Wikipedia) is turned by (self-supervised) learning, e.g. next-word prediction, into a model (weights) that can handle some task (Q: What is the capital of Japan? A: Tokyo); the model is, in effect, a compressed version of the training data.
    Likewise, take the tokens in the context, x_1 x_2 x_3 ..., as the training data and run (self-supervised) learning on them.
    💡 The resulting weights should be a good compression of the tokens in the context.
    💡 So view the RNN's state as the weights of a model, and train it on the tokens in the context.
  9. Proposed method: view the state as a small ML model and update its weights by gradient descent.

    State W_0, W_1, ..., W_{t-1}, W_t; inputs x_1, ..., x_t; outputs z_1, ..., z_t.
    W_t = W_{t-1} − η ∇l(x_t; W_{t-1}),  z_t = f(x_t; W_t)
    🟦: weights updated by actual backpropagation (outer loop).
    🟥: the weights of f, updated by gradient descent during the forward pass (inner loop).
    f: a small model with weights W, e.g., a linear layer, an MLP, ...
    One step of state update = one step of gradient descent.
  10. Proposed method: view the state as a small ML model and update its weights by gradient descent.

    State W_0, W_1, ..., W_{t-1}, W_t; inputs x_1, ..., x_t; outputs z_1, ..., z_t.
    W_t = W_{t-1} − η ∇l(x_t; W_{t-1}),  z_t = f(x_t; W_t)
    🟦: weights updated by actual backpropagation (outer loop).
    🟥: the weights of f, updated by gradient descent during the forward pass (inner loop).
    f: a small model with weights W, e.g., a linear layer, an MLP, ...
    One step of state update = one step of gradient descent.
    This is the "learning = compression" idea from the previous page.
  11. Proposed method: view the state as a small ML model and update its weights by gradient descent.

    State W_0, W_1, ..., W_{t-1}, W_t; inputs x_1, ..., x_t; outputs z_1, ..., z_t.
    W_t = W_{t-1} − η ∇l(x_t; W_{t-1}),  z_t = f(x_t; W_t)
    🟦: weights updated by actual backpropagation (outer loop).
    🟥: the weights of f, updated by gradient descent during the forward pass (inner loop). Note that this gradient descent happens inside the forward computation.
    f: a small model with weights W, e.g., a linear layer, an MLP, ...
    One step of state update = one step of gradient descent.
    This is the "learning = compression" idea from the previous page.
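A minimal sketch of this inner loop for a linear f (my own illustration): one token triggers one gradient-descent step on the hidden state W. The concrete self-supervised loss l is designed on the following slides, so a plain reconstruction loss stands in for it here.

```python
import numpy as np

def inner_step(W_prev, x_t, eta=0.1):
    """One TTT state update: W_t = W_{t-1} - eta * grad_W l(x_t; W_{t-1}); output z_t = f(x_t; W_t).

    Stand-in choices: f(x; W) = W x and l(x; W) = ||W x - x||^2 (reconstruction),
    whose gradient w.r.t. W is 2 (W x - x) x^T, so no autodiff is needed.
    """
    grad = 2.0 * np.outer(W_prev @ x_t - x_t, x_t)
    W_t = W_prev - eta * grad                  # inner loop: gradient descent inside the forward pass
    z_t = W_t @ x_t                            # the outer loop would backpropagate through all of this
    return W_t, z_t

d = 4
W = np.zeros((d, d))                           # the hidden "state" is the weights of f
for x in np.random.default_rng(0).normal(size=(16, d)):
    W, z = inner_step(W, x)                    # one state update = one gradient-descent step
```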
  12. Designing the concrete self-supervised task l

    State W_{t-1} → W_t; input x_t; output z_t.
    W_t = W_{t-1} − η ∇l(x_t; W_{t-1}),  z_t = f(x_t; W_t)
    🟦: weights updated by actual backpropagation (outer loop). 🟥: the weights of f, updated by gradient descent during the forward pass (inner loop). f: a small model with weights W, e.g., a linear layer or MLP.
  13. Designing the concrete self-supervised task l

    State W_{t-1} → W_t; input x_t; output z_t.
    W_t = W_{t-1} − η ∇l(x_t; W_{t-1}),  z_t = f(x_t; W_t)
    l(x_t; W_{t-1}) = ‖f(x̃_t; W_{t-1}) − x_t‖²   ... the task of reconstructing the input x from a corrupted version x̃.
    🟦: weights updated by actual backpropagation (outer loop). 🟥: the weights of f, updated by gradient descent during the forward pass (inner loop). f: a small model with weights W, e.g., a linear layer or MLP.
  14. Designing the concrete self-supervised task l

    State W_{t-1} → W_t; input x_t; output z_t.
    W_t = W_{t-1} − η ∇l(x_t; W_{t-1}),  z_t = f(x_t; W_t)
    l(x_t; W_{t-1}) = ‖f(θ_K x_t; W_{t-1}) − x_t‖²   ... the task of reconstructing the input x; θ_K "corrupts" it by projecting to a lower dimension.
    🟦: weights updated by actual backpropagation (outer loop). 🟥: the weights of f, updated by gradient descent during the forward pass (inner loop). f: a small model with weights W, e.g., a linear layer or MLP.
  15. Designing the concrete self-supervised task l

    State W_{t-1} → W_t; input x_t; output z_t.
    W_t = W_{t-1} − η ∇l(x_t; W_{t-1}),  z_t = f(x_t; W_t)
    l(x_t; W_{t-1}) = ‖f(θ_K x_t; W_{t-1}) − x_t‖²   ... the task of reconstructing the input x; θ_K projects to a lower dimension to corrupt it, i.e., it learns which features of the input matter.
    🟦: weights updated by actual backpropagation (outer loop). 🟥: the weights of f, updated by gradient descent during the forward pass (inner loop). f: a small model with weights W, e.g., a linear layer or MLP.
  16. Designing the concrete self-supervised task l

    State W_{t-1} → W_t; input x_t; output z_t.
    W_t = W_{t-1} − η ∇l(x_t; W_{t-1}),  z_t = f(x_t; W_t)
    l(x_t; W_{t-1}) = ‖f(θ_K x_t; W_{t-1}) − θ_V x_t‖²   ... θ_K: which features of the input matter; θ_V: what label should be constructed.
    🟦: weights updated by actual backpropagation (outer loop). 🟥: the weights of f, updated by gradient descent during the forward pass (inner loop). f: a small model with weights W, e.g., a linear layer or MLP.
  17. Designing the concrete self-supervised task l

    State W_{t-1} → W_t; input x_t; output z_t.
    W_t = W_{t-1} − η ∇l(x_t; W_{t-1})
    l(x_t; W_{t-1}) = ‖f(θ_K x_t; W_{t-1}) − θ_V x_t‖²   ... θ_K: which features of the input matter; θ_V: what label should be constructed.
    z_t = f(θ_Q x_t; W_t)   ... θ_Q extracts the features of x_t that help next-word prediction.
    🟦: weights updated by actual backpropagation (outer loop). 🟥: the weights of f, updated by gradient descent during the forward pass (inner loop). f: a small model with weights W, e.g., a linear layer or MLP.
  18. Designing the concrete self-supervised task l

    State W_{t-1} → W_t; input x_t; output z_t.
    W_t = W_{t-1} − η θ_lr(x_t) ∇l(x_t; W_{t-1})   ... θ_lr adjusts the learning rate according to the input.
    l(x_t; W_{t-1}) = ‖f(θ_K x_t; W_{t-1}) − θ_V x_t‖²   ... θ_K: which features of the input matter; θ_V: what label should be constructed.
    z_t = f(θ_Q x_t; W_t)   ... θ_Q extracts the features of x_t that help next-word prediction.
    🟦: weights updated by actual backpropagation (outer loop). 🟥: the weights of f, updated by gradient descent during the forward pass (inner loop). f: a small model with weights W, e.g., a linear layer or MLP.
  19. Designing the concrete self-supervised task l

    State W_{t-1} → W_t; input x_t; output z_t.
    W_t = W_{t-1} − η θ_lr(x_t) ∇l(x_t; W_{t-1})   ... θ_lr adjusts the learning rate according to the input.
    l(x_t; W_{t-1}) = ‖f(θ_K x_t; W_{t-1}) − θ_V x_t‖²   ... θ_K: which features of the input matter; θ_V: what label should be constructed.
    z_t = f(θ_Q x_t; W_t)   ... θ_Q extracts the features of x_t that help next-word prediction.
    🟦: the outer loop learns what kind of self-supervised learning (= context compression), performed in the forward pass, is effective for next-word prediction (it learns both the choice of the task itself and how to learn).
    🟥: the weights of f, updated by gradient descent during the forward pass (inner loop). f: a small model with weights W, e.g., a linear layer or MLP.
  20. Designing the concrete self-supervised task l

    State W_{t-1} → W_t; input x_t; output z_t.
    W_t = W_{t-1} − η θ_lr(x_t) ∇l(x_t; W_{t-1})   ... θ_lr adjusts the learning rate according to the input.
    l(x_t; W_{t-1}) = ‖f(θ_K x_t; W_{t-1}) − θ_V x_t‖²   ... θ_K: which features of the input matter; θ_V: what label should be constructed.
    z_t = f(θ_Q x_t; W_t)   ... θ_Q extracts the features of x_t that help next-word prediction.
    🟦: the outer loop learns what kind of self-supervised learning (= context compression), performed in the forward pass, is effective for next-word prediction (it learns both the choice of the task itself and how to learn).
    🟥: the weights of f, updated by gradient descent during the forward pass (inner loop). f: a small model with weights W, e.g., a linear layer or MLP.
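Putting the designed pieces together, here is a hedged sketch of one TTT-Linear-style forward step with the learned projections θ_K, θ_V, θ_Q and the input-dependent learning rate θ_lr. The dimensions, the sigmoid gate on the learning rate, and the initialization are my illustrative choices, not necessarily the paper's exact parameterization; θ_K, θ_V, θ_Q, θ_lr are outer-loop parameters trained by backpropagation, while W is the inner-loop state.

```python
import numpy as np

def ttt_linear_step(W_prev, x_t, theta, base_lr=1.0):
    """One forward-pass step of a TTT layer with a linear inner model f(u; W) = W u.

    Inner loss:  l(x_t; W) = || f(theta_K x_t; W) - theta_V x_t ||^2
    Update:      W_t = W_{t-1} - base_lr * theta_lr(x_t) * grad_W l(x_t; W_{t-1})
    Output:      z_t = f(theta_Q x_t; W_t)
    """
    k = theta["K"] @ x_t                                   # which input features matter
    v = theta["V"] @ x_t                                   # the constructed label
    lr_t = base_lr / (1.0 + np.exp(-(theta["lr"] @ x_t)))  # input-dependent step size (illustrative sigmoid gate)
    grad = 2.0 * np.outer(W_prev @ k - v, k)               # grad of ||W k - v||^2 w.r.t. W, in closed form
    W_t = W_prev - lr_t * grad
    z_t = W_t @ (theta["Q"] @ x_t)                         # features useful for next-token prediction
    return W_t, z_t

rng = np.random.default_rng(0)
d_model, d_head = 8, 4                                     # illustrative sizes
theta = {"K": 0.1 * rng.normal(size=(d_head, d_model)),
         "V": 0.1 * rng.normal(size=(d_head, d_model)),
         "Q": 0.1 * rng.normal(size=(d_head, d_model)),
         "lr": 0.1 * rng.normal(size=d_model)}
W = np.zeros((d_head, d_head))                             # inner-loop state, reset for each sequence
for x in rng.normal(size=(32, d_model)):
    W, z = ttt_linear_step(W, x, theta)
```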
  21. Overall model architecture: the Test-Time Training (TTT) layer

    [Figure 13 from the paper: a residual block, the basic building block of Transformers. The sequence-modeling block is instantiated in two variants, a Transformer backbone and a Mamba backbone; the TTT layer replaces self-attention in the former and sits after a Conv, with a Gate, in the latter. The LN before the output projection follows NormFormer; the gate activation is GELU, following Mamba and Griffin.]
    f: a small model with weights W, e.g., a linear layer or MLP.
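As a rough structural sketch of where such a TTT layer sits (assuming a per-token update like the ttt_linear_step above, and simplifying away the Conv, gating, and exact LN placement of the two backbones in the figure), the TTT layer simply replaces self-attention inside an otherwise ordinary pre-norm residual block:

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    """Per-token layer normalization (no learned scale/shift, for brevity)."""
    return (h - h.mean(-1, keepdims=True)) / (h.std(-1, keepdims=True) + eps)

def residual_block(X, ttt_step, W0, O, mlp_w1, mlp_w2):
    """Simplified Transformer-backbone block: a TTT layer where self-attention would be, then an MLP block.

    X: (seq_len, d_model) token embeddings; ttt_step: a per-token inner-loop update
    (e.g. the ttt_linear_step sketch); W0: initial inner state; O: output projection.
    """
    # Sequence-modeling block: run the inner loop left-to-right over the normalized tokens.
    W, Z = W0, []
    for x in layer_norm(X):
        W, z = ttt_step(W, x)
        Z.append(z)
    H = X + np.stack(Z) @ O                                  # project back to d_model, residual connection
    # MLP block with its own residual connection (ReLU stands in for the actual activation).
    return H + np.maximum(layer_norm(H) @ mlp_w1, 0.0) @ mlp_w2
```

With the earlier sketch, this could be invoked as, e.g., residual_block(X, lambda W, x: ttt_linear_step(W, x, theta), np.zeros((d_head, d_head)), O, mlp_w1, mlp_w2) for suitably shaped O, mlp_w1, mlp_w2 (all hypothetical names).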
  22. Experiments: scaling on short sequences

    Figure 10. Evaluations for context lengths 2k and 8k on the Pile. Details in Subsection 3.1. TTT-Linear has comparable performance to Mamba at 2k context, and better performance at 8k.
    (M): Mamba backbone, (T): Transformer backbone; Linear/MLP: the model used for f.
  23. Experiments: scaling on short sequences

    Figure 10. Evaluations for context lengths 2k and 8k on the Pile. Details in Subsection 3.1. TTT-Linear has comparable performance to Mamba at 2k context, and better performance at 8k.
    (M): Mamba backbone, (T): Transformer backbone; Linear/MLP: the model used for f.
    At 2k: TTT-Linear/MLP (T) is slightly worse; the others are about the same.
  24. Experiments: scaling on short sequences

    Figure 10. Evaluations for context lengths 2k and 8k on the Pile. Details in Subsection 3.1. TTT-Linear has comparable performance to Mamba at 2k context, and better performance at 8k.
    (M): Mamba backbone, (T): Transformer backbone; Linear/MLP: the model used for f.
    At 2k: TTT-Linear/MLP (T) is slightly worse; the others are about the same.
    At 8k: TTT-Linear/MLP (M) does well, and tends to improve at longer context; TTT-Linear/MLP (T) does not quite reach Mamba.
  25. Experiments: scaling on long sequences

    [Figure 11. Evaluations for long context lengths on Books. Details in Subsection 3.2; complete results, including Transformer finetuning, are in Figure 15 (in the Appendix).]
    TTT-MLP (T) looks good: on long sequences the Transformer backbone may have an advantage? 😅 Around here they are almost the same, though...
  26. Experiments: scaling on long sequences

    [Figure 11 on Books, as on the previous slide.]
    TTT-MLP (T) looks good: on long sequences the Transformer backbone may have an advantage? 😅 Around here they are almost the same, though...
    [Figure 13. Left: a residual block, the basic building block for Transformers. The sequence-modeling block is instantiated into two variants: the Transformer backbone and the Mamba backbone. Middle: TTT layer in the Transformer backbone; the LN before the output comes from NormFormer. Right: TTT layer in the backbone inspired by Mamba and Griffin; following these two architectures, the gate activation is GELU, and to accommodate the extra gate parameters without changing the embedding dimension, θ_K and θ_Q are combined into a single projection.]
    That TTT-MLP (T) > TTT-Linear (T) always holds, while TTT-MLP (M) ≈ TTT-Linear (M), may mean Mamba's 1D conv is doing good work here: it plays the role of capturing n-gram-like features.
  27. Experiments: prefill / decode speed

    Figure 12. Latency on an NVIDIA A100 GPU with 80G HBM and PCIe connections.
    Both prefill and decode latency stay constant even as the sequence length grows.
  28. Impressions

    • How to compress past context has already been explored in Modern RNN / SSM work, but tying it to Test-Time Training felt novel.
    • Modern RNN / SSM research has mostly focused on the structure of the RNN's state-transition matrix itself? E.g., S4, Mamba.
    • Beyond RNNs, TTT also seems like a good fit for test-time compute scaling.
    • A study showing that using TTT sharply improves accuracy on the ARC Challenge was accepted at ICML 2025 [Akyürek+'25].
    • The remaining challenge may be how to use GPUs efficiently (during training).