AI infrastructure's primary bottleneck has shifted from compute (FLOPs) to memory (HBM capacity). Modern LLMs generate massive KV caches (state) that quickly overflow the ~80GB of HBM on a GPU, forcing systems to crash (OOM), evict context (triggering expensive recompute), or refuse long contexts. WEKA solves this by creating a new architectural tier, "G1.5", between GPU HBM (G1) and CPU DRAM (G2), effectively turning fast NVMe storage into an extension of GPU memory.
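To make the overflow concrete, here is a back-of-the-envelope sizing sketch; the model shape (80 layers, 8 KV heads, head_dim 128, fp16) is an assumed 70B-class configuration with grouped-query attention, not a figure from the source.

```python
# Back-of-the-envelope KV-cache sizing to show how quickly long contexts
# exhaust an 80 GB HBM budget. The model shape below is an illustrative
# assumption, not a WEKA-published number.

def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed for the K and V tensors of one sequence."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem

# One 128k-token session: 80 layers, 8 KV heads, head_dim 128, fp16 (2 bytes).
per_session = kv_cache_bytes(tokens=128_000, layers=80, kv_heads=8, head_dim=128)
print(f"KV cache per 128k-token session: {per_session / 1e9:.1f} GB")  # ~41.9 GB

hbm_budget_gb = 80  # single-GPU HBM, before even subtracting model weights
print(f"Sessions that fit: {int(hbm_budget_gb * 1e9 // per_session)}")  # 1
```

At roughly 42 GB per 128k-token session, a single long-context user nearly fills the card before the model weights are even accounted for.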
Traditional inference "pins" user sessions to specific GPUs because moving the KV cache is too expensive. WEKA's Augmented Memory Grid (AMG) externalizes that state into a persistent, shared pool accessible via RDMA/GPUDirect, allowing any GPU to serve any request. This transforms GPUs from "stateful pets" into "stateless cattle".
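A minimal sketch of that "stateless cattle" pattern, assuming a hypothetical ExternalKVStore with get/put methods and a toy DummyModel; AMG's real pool is reached over RDMA/GPUDirect rather than a Python API, so this only illustrates how content-addressed KV caches let any worker serve any request.

```python
# Stateless serving sketch: workers keep no session state and hydrate KV caches
# from a shared external store on demand. `ExternalKVStore` and `DummyModel`
# are hypothetical placeholders, not WEKA's AMG interface.

import hashlib

class ExternalKVStore:
    """Stand-in for a shared, persistent KV-cache pool."""
    def __init__(self):
        self._pool = {}

    def get(self, key: str):
        return self._pool.get(key)

    def put(self, key: str, kv_cache) -> None:
        self._pool[key] = kv_cache

class DummyModel:
    """Toy inference engine, just enough to make the sketch runnable."""
    def prefill(self, prompt_tokens):
        return {"tokens": len(prompt_tokens)}       # pretend this is a KV cache

    def decode(self, prompt_tokens, kv):
        return f"<decoded with {kv['tokens']} cached tokens>"

def prefix_key(prompt_tokens) -> str:
    """Content-addressed key so every worker resolves the same cache entry."""
    return hashlib.sha256(str(prompt_tokens).encode("utf-8")).hexdigest()

def serve(prompt_tokens, store: ExternalKVStore, model) -> str:
    key = prefix_key(prompt_tokens)
    kv = store.get(key)                    # hit: skip prefill entirely
    if kv is None:
        kv = model.prefill(prompt_tokens)  # miss: pay the compute once...
        store.put(key, kv)                 # ...then publish it for every GPU
    return model.decode(prompt_tokens, kv)

store, model = ExternalKVStore(), DummyModel()
print(serve([101, 7592, 2088], store, model))  # first worker: cache miss
print(serve([101, 7592, 2088], store, model))  # any other worker: cache hit
```

Because the cache key is derived from the content rather than from which GPU produced it, the session is no longer pinned to a device.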
WEKA lets operators substitute cheap "Storage Dollars" (~$0.10/GB) for expensive "Compute Dollars" (~$110/GB for HBM). By avoiding the "recompute tax" of regenerating KV caches from scratch, the system recovers GPU cycles that would otherwise burn electricity producing heat rather than tokens.
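A worked version of that trade using the per-GB figures above; the prefill throughput and GPU hourly rate on the recompute side are added assumptions for illustration, not WEKA numbers.

```python
# "Storage Dollars" vs "Compute Dollars" for one ~42 GB KV cache (the 128k-token
# session sized earlier), plus the recompute tax paid whenever it is evicted.

KV_CACHE_GB = 42
STORAGE_PER_GB = 0.10     # $/GB, NVMe-backed tier (figure quoted above)
HBM_PER_GB = 110.00       # $/GB, HBM capacity (figure quoted above)

print(f"Holding the cache: ${KV_CACHE_GB * STORAGE_PER_GB:.2f} of NVMe vs "
      f"${KV_CACHE_GB * HBM_PER_GB:,.0f} of HBM")

# Recompute tax: regenerate the 128k-token prefill after eviction.
# Assumed (hypothetical): 10k prefill tokens/s per GPU, $3.00 per GPU-hour.
PREFILL_TOK_PER_S = 10_000
GPU_PER_HOUR = 3.00
seconds = 128_000 / PREFILL_TOK_PER_S
print(f"Recompute tax: {seconds:.1f} GPU-seconds "
      f"(~${seconds / 3600 * GPU_PER_HOUR:.3f}) on every cache miss")
```

The dollar figure per miss looks small, but it is pure overhead: those GPU-seconds produce no new tokens, and they recur on every eviction across every session.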
WEKA reframes its solution not as "faster storage" but as a "Token Warehouse": a memory-class buffer for "Work in Progress" (active KV caches) rather than "Finished Goods" (datasets/checkpoints). This shifts the paradigm from "storage as archive" to "storage as working memory".
Traditional inference economics punish long context (it consumes scarce HBM and triggers recompute). AMG inverts this: long context becomes a recoverable asset stored in cheap NVMe, while the cost of holding it approaches zero. This converts expensive "Salmon Tokens" (new input requiring compute) into cheap "Grey Tokens" (cached input requiring only bandwidth).
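A small illustration of the Salmon-to-Grey conversion, assuming hypothetical per-million-token prices for fresh versus cached input (the source quotes no rates): the blended cost of a long prompt falls as more of it is served from the cache.

```python
# Blended input cost as the cached ("Grey") share of a 128k-token context grows.
# Prices are hypothetical placeholders: fresh ("Salmon") tokens pay for prefill
# compute, cached tokens pay only for the bandwidth to replay the KV cache.

SALMON_PER_MTOK = 3.00   # $/million tokens, uncached input (compute-bound)
GREY_PER_MTOK = 0.30     # $/million tokens, cached input (bandwidth-bound)

def blended_input_cost(total_tokens: int, cached_fraction: float) -> float:
    """Dollar cost of one request's input given its cache-resident share."""
    cached = total_tokens * cached_fraction
    fresh = total_tokens - cached
    return (fresh * SALMON_PER_MTOK + cached * GREY_PER_MTOK) / 1e6

for hit in (0.0, 0.5, 0.9):
    print(f"128k-token context, {hit:.0%} cached: "
          f"${blended_input_cost(128_000, hit):.3f} per request")
```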