act1-costs.pdf - Speaker Deck

Slide 1

Slide 1 text

CLAUDE CODE COSTS Act I — How the billing actually works Why every Claude Code turn re-sends the whole conversation — and what the prompt cache does about it. Grounded in measurement: every number captured on the wire from Claude Code 2.1.150.

Slide 2

Slide 2 text

The roadmap: from one HTTP request to your bill
1 One stateless request: the server remembers nothing
2 What's in a request & response: tools, system, messages, and the usage block
3 What it costs: the four numbers and their multipliers
4 The fix, caching: pay the write once, read warm after
5 How Claude Code caches: 3 breakpoints + the sliding window
6 What breaks it: tool churn, dynamic tools, big tool bursts
7 Wrap: the mental models + a cheat sheet
COLOR KEY (highlight colors used on every log slide ahead): structure / byte 0 · warm, cheap (0.1×) · write (2×) · cold, expensive

Slide 3

Slide 3 text

What you'll be able to do after this
▸ Explain in one sentence why a Claude Code conversation gets more expensive every turn, and why that isn't a bug.
▸ Predict whether a change (new MCP server, model switch, a “compression” proxy) helps or hurts, before you ship it.
▸ Read a usage block and tell a warm session from one silently rebuilding its cache every turn.
▸ Pick the right cost-reduction tool for the cost line you actually want to attack.
▸ Recognize the most expensive mistakes and apply the fix for each.

Slide 4

Slide 4 text

Key terms: the vocabulary we'll use throughout
token: ≈ ¾ of a word; the unit everything is counted and billed in.
prefix: the leading bytes of the request, from position 0 onward.
prompt cache: stores the model's computed state for a prefix so it isn't recomputed.
cache breakpoint: a marker meaning “cache everything up to here.”
cache_read: tokens served warm from the cache, billed 0.1×.
cache_creation: tokens written to the cache this turn, billed 2× (the 1-hour-TTL write rate that Claude Code uses).
Those multipliers (0.1×, 2×) are relative to the normal, uncached input-token price: the base rate, “1×”. The full rate card comes in Part 3.

Slide 5

Slide 5 text

How this was measured: a logging proxy on the wire
harness · bodies-only reverse proxy · usage from SSE
    $ export ANTHROPIC_BASE_URL=http://127.0.0.1:8799
    $ claude --strict-mcp-config --model claude-sonnet-4-6 \
        --input-format stream-json -p

    # proxy.py: Claude Code ──▶ 127.0.0.1:8799 ──▶ api.anthropic.com
    #   logs request+response BODIES only, never headers / OAuth token
    #   strips Accept-Encoding so the SSE is plain text

    # usage read from each response SSE: message_start / message_delta
ANTHROPIC_BASE_URL points Claude Code at a local HTTP proxy: no TLS interception, the OAuth token passes through untouched.
Bodies only: every number in this deck is real per-call usage from the response SSE.
WHY The API exposes no field that says “your cache was busted.” Every conclusion here is inferred from token-count deltas between near-identical requests.
Note: Claude Code's displayed /cost under-reports: it prices 1h writes at the 5-min 1.25× rate; Anthropic bills 2×.
[capture] CC 2.1.150

Slide 6

Slide 6 text

PART 1 One stateless request What does Claude Code actually send — and why does it forget?

Slide 7

Slide 7 text

PART 1 · THE ONE FACT Claude Code is a stateless API client. The request is the only state. Every turn re-sends the entire context — tools, system prompt, and the whole conversation so far. WHY The server keeps no session and no memory: it reads the request, replies, and forgets everything. The “memory” you feel in a chat is the client re-sending history each turn — which is exactly why cost grows with the conversation. [docs + measured]

Slide 8

Slide 8 text

Mental model 1: the amnesiac contractor
You hand over the FULL dossier: tools allowed + standing instructions + the entire project history so far.
Brilliant advice: the contractor reads everything and answers.
…then forgets the entire engagement: no notes, no memory. Next time you bring the same dossier plus today's page.
WHY Everything expensive about Claude Code follows from “the dossier gets re-read, in full, every single time.” There is no server-side conversation to amend, only the next, longer request.
[docs + measured]

Slide 9

Slide 9 text

Proof: the server recalls nothing
SECRET CARRIED IN PAYLOAD
    user: "My secret number is 42."
    asst: "Acknowledged."
    user: "What is my secret number?"
    ────────────────────
    reply: 42
vs
PRIOR TURN OMITTED
    user: "What is my secret number?"   (no prior turns in the payload)
    ────────────────────
    reply: I don't know.
The model knows only what the current request carries. Omit the prior turn and the secret is gone: there is no session to recall it from.
WHY Statelessness made visible: memory lives only in the request's message array, never on the server.
[capture] CC 2.1.150
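The experiment above can be mimicked with a toy, purely local stand-in for the server. This is a minimal sketch, not the real API: `stateless_server` is a hypothetical function that, like the real endpoint, sees nothing but the messages in the current request.

```python
from typing import Dict, List

Message = Dict[str, str]


def stateless_server(messages: List[Message]) -> str:
    """Toy stand-in for the API: it can only use what this request carries."""
    secret = None
    for msg in messages:
        if msg["role"] == "user" and "secret number is" in msg["content"]:
            secret = msg["content"].split("secret number is")[1].strip(" .")
    if "What is my secret number" in messages[-1]["content"]:
        return secret if secret is not None else "I don't know."
    return "Acknowledged."


# The client owns the history and re-sends all of it on every turn.
history: List[Message] = [{"role": "user", "content": "My secret number is 42."}]
history.append({"role": "assistant", "content": stateless_server(history)})
history.append({"role": "user", "content": "What is my secret number?"})
with_history = stateless_server(history)

# Same question, but the prior turns are left out of the payload.
without_history = stateless_server(
    [{"role": "user", "content": "What is my secret number?"}]
)
print(with_history, "|", without_history)
```

The only thing that differs between the two calls is the list passed in, which is the whole point: the "memory" is the client's `history` list.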

Slide 10

Slide 10 text

Thinking is physically resent, too
captured continuation · assistant turn [17]
    messages[17]   // assistant turn
      content[0]: thinking   (signature=yes, redacted)
      content[1]: text
      content[2]: tool_use → Grep
    messages[18]   // user turn
      content[0]: tool_result
The thinking block is in the request body: the client sends the model's own prior reasoning back to it.
WHY It isn't just visible messages: the dossier includes the model's prior reasoning, and parts of the dossier cost money. We introduce thinking blocks in Act II: what they contain, how they're billed, and what happens to them on a model switch.
[capture] CC 2.1.150

Slide 11

Slide 11 text

PART 2 Anatomy of a request & response What's actually in the bytes on the wire?

Slide 12

Slide 12 text

On the wire: one request, four parts
request body · POST /v1/messages
    {
      "model": "claude-sonnet-4-6", "max_tokens": 32000,
      "tools": [
        { "name":"Bash", "description":"Executes a bash command...",
          "input_schema":{"type":"object","properties":{…}} },
        { "name":"Read", … },   // + Edit, Write, Grep, Glob, …
        … 30 definitions total …
      ],
      "system": [
        {"text":"x-anthropic-billing-header: … cch=b2984;"},   // len 85
        {"text":"You are a Claude agent…", "cache_control":{"ttl":"1h"}},   ◀ BP1
        {"text":"You are an interactive CLI… [26,934]", "cache_control":{…"1h"}}   ◀ BP2
      ],
      "messages": [ {"role":"user","content":[
        {"text":" agents…"},   // content[0] 1766
        {"text":" skills…"},   // content[1] 3808
        {"text":"# claudeMd … date 2026-06-22"},   // content[2] 1861
        {"text":"Turn one. Reply with…"} ] } ]   ◀ BP3 slides
    }
tools: the allowed-tool definitions
system: billing header + identity + the system prompt itself (len 26,934, i.e. ~27K characters by the same "len" measure as the other blocks)
messages: system-reminders + the whole conversation
WHY The server is stateless, so all four parts re-ship in full every turn. They're ordered tools → system → messages (stable → volatile) so the unchanging front can be cached and only the tail re-keyed.
[capture] CC 2.1.150

Slide 13

Slide 13 text

The render order: stable content first, volatile content last
tools: 30 definitions, changes least   ◀ byte 0
system: billing header + identity + the ~27K system prompt
messages: system-reminders (agents/skills/CLAUDE.md/date) + the conversation, changes every turn
(reading down the list: stable ↓ volatile)
WHY The cache is a prefix match, so the things that change least must come first or they keep getting invalidated by the things that change most. This ordering is why caching works at all.
[measured]

Slide 14

Slide 14 text

The full block map (real capture)
request body · turn 1 · claude-sonnet-4-6
    {
      "model": "claude-sonnet-4-6", "max_tokens": 32000,
      "tools": [
        { "name":"Bash", "description":"Executes a bash command...",
          "input_schema":{"type":"object","properties":{…}} },
        { "name":"Read", … },   // + Edit, Write, Grep, Glob, …
        … 30 definitions total …
      ],
      "system": [
        {"text":"x-anthropic-billing-header: … cch=b2984;"},   // len 85
        {"text":"You are a Claude agent…", "cache_control":{"ttl":"1h"}},   ◀ BP1
        {"text":"You are an interactive CLI… [26,934]", "cache_control":{…"1h"}}   ◀ BP2
      ],
      "messages": [ {"role":"user","content":[
        {"text":" agents…"},   // content[0] 1766
        {"text":" skills…"},   // content[1] 3808
        {"text":"# claudeMd … date 2026-06-22"},   // content[2] 1861
        {"text":"Turn one. Reply with…"} ] } ]   ◀ BP3 slides
    }
tools @ byte 0: 30 definitions, the biggest, most stable block. Change it and the whole prefix re-keys.
2 system breakpoints (ttl:1h) freeze the ~27K front: written once, read warm after.
messages[0] = ONE message, 4 typed blocks. The cache counts blocks, not messages.
WHY Three cache_control markers, all ttl=1h: BP1 on system[1], BP2 on system[2], BP3 on the message tail. Zero markers in the tools array; tools fold into BP1. CLAUDE.md, skills, and the date all sit in messages[0], after the system breakpoints.
[capture] CC 2.1.150

Slide 15

Slide 15 text

The tools array @ byte 0: built-ins AND MCP server tools
request body · tools[] · one array, every tool
    "tools": [   // ONE array: built-ins + every connected MCP server's tools
      { "name":"Bash",   ◀ built-in
        "description":"Executes a bash command…",
        "input_schema":{"type":"object",
          "properties":{"command":{"type":"string"}}} },
      { "name":"Read", … }, { "name":"Grep", … },
      …28 built-ins…
      { "name":"mcp__github__create_issue",   ◀ MCP server
        "description":"Create a GitHub issue…",
        "input_schema":{ "title":{…}, "body":{…} } },
      { "name":"mcp__slack__send_message", … }   ◀ MCP server
    ]
Built-in tools are Claude Code's own (Bash, Read, Edit, …): name + description + JSON input_schema.
MCP server tools sit here too: connect a server and its tools are appended to this SAME array as mcp__<server>__<tool>, at byte 0.
WHY Whether built-in or from an MCP server, every tool definition shares this one array at byte 0, so connecting, disconnecting, or mutating any server changes byte 0 and cold-rewrites the whole prefix.
[capture] CC 2.1.150 · mcp__ entries illustrative

Slide 16

Slide 16 text

Messages vs. blocks: one message, many blocks ①
request body · messages[0] · a single user message
    messages[0] { "role":"user", "content": [
      content[0]   text   agents…                 1766
      content[1]   text   skills…                 3808
      content[2]   text   # claudeMd … date …     1861
      content[3]   text   the user prompt         33
    ] }
One message, four typed blocks. A message has a role; its content is an ordered list of typed content blocks.
WHY Hold onto the message-vs-block distinction. CLAUDE.md, skills, and the date are separate content blocks inside the first user message, after the system breakpoints, which is why editing CLAUDE.md never disturbs the expensive tools+system prefix.
[capture] CC 2.1.150

Slide 17

Slide 17 text

Messages vs. blocks: a tool turn ②
captured round-trip · run `ls -1` via the Bash tool
    messages[1]   // assistant
      tool_use      Bash {"command":"ls -1"}     stop_reason=tool_use
    messages[2]   // user (Claude Code ran it locally)
      tool_result   "CLAUDE.md\nhello.txt"       ◀ BP3
    messages[3]   // assistant
      text          "There are two files…"
tool_use and tool_result are two blocks. Each parallel tool call = 2 blocks (the call + its result).
WHY An assistant turn that fires ten parallel tools is ONE message but twenty-plus blocks. The sliding cache breakpoint lands on the newest block (here, the tool_result). This block-counting is exactly what a big tool burst trips.
[capture] CC 2.1.150

Slide 18

Slide 18 text

On the wire: what you get back, a streamed (SSE) response
response · text/event-stream
    event: message_start   ◀ usage starts here
      {"usage":{"input_tokens":3,
        "cache_creation_input_tokens":30168,
        "cache_read_input_tokens":0, "output_tokens":1}}
    event: content_block_start   {type:"text"}
    event: content_block_delta   {text:"ONE"}
    event: content_block_stop
    event: message_delta   ◀ final usage + stop
      {"stop_reason":"end_turn", "usage":{"output_tokens":4}}
    event: message_stop
message_start carries the input/cache usage up front; tokens then stream as deltas.
message_delta carries the final output_tokens and stop_reason.
WHY SSE = Server-Sent Events: the server streams typed events over one open HTTP response as it generates, not one JSON blob. Usage rides in two events: cache/input counts in message_start (up front), final output in message_delta. The body arrives as content-block deltas (text here; also thinking / tool_use).
[capture] CC 2.1.150
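A minimal sketch of pulling the usage counters out of such a stream, in Python. The sample lines are hand-written for illustration (values copied from the capture above) and assume the Messages API streaming layout, where `message_start` nests usage under `message` and `message_delta` carries it at the top level. `usage_from_sse` is a hypothetical helper name, not part of any SDK.

```python
import json
from typing import Any, Dict, Iterable


def usage_from_sse(lines: Iterable[str]) -> Dict[str, Any]:
    """Merge the usage counters carried by message_start and message_delta."""
    usage: Dict[str, Any] = {}
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip "event:" lines and blank separators
        payload = json.loads(line[len("data:"):])
        if payload.get("type") == "message_start":
            usage.update(payload["message"]["usage"])  # input + cache counts
        elif payload.get("type") == "message_delta":
            usage.update(payload.get("usage", {}))  # final output_tokens
    return usage


# Illustrative stream, shaped like the capture above.
sample = [
    "event: message_start",
    'data: {"type":"message_start","message":{"usage":{"input_tokens":3,'
    '"cache_creation_input_tokens":30168,"cache_read_input_tokens":0,'
    '"output_tokens":1}}}',
    "",
    "event: message_delta",
    'data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},'
    '"usage":{"output_tokens":4}}',
]
usage = usage_from_sse(sample)
print(usage)
```

Because `message_delta` is merged last, its `output_tokens` overwrites the placeholder count sent in `message_start`.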

Slide 19

Slide 19 text

The content blocks you receive become next turn's input
response content → next request's messages[N]
    assistant reply = an ordered list of content blocks:
      [thinking]   the model's reasoning (sealed, see Act II)
      [text]       the prose you read
      [tool_use]   a call your client executes locally

    → this list is appended as messages[N] (role: assistant)
      and RE-SENT in full on every later turn (it is now history)
It becomes history: the reply you just received is resent as input next turn (and cache-written once).
WHY What the model writes doesn't vanish: next turn it rides back in the message history as input. That is why output is paid once as output, then once more as a single cache write on the following turn (the next slide proves it).
[capture] CC 2.1.150

Slide 20

Slide 20 text

The usage block: this is your bill, per turn
response · message_delta · usage
    "usage": {
      "input_tokens": 3,
      "cache_read_input_tokens": 0,
      "cache_creation_input_tokens": 30168,
      "output_tokens": 4
    }
cache_read: served from an existing cache entry, billed 0.1×.
cache_creation: written to cache this turn, billed 2× (1h TTL).
WHY Four numbers decide every turn's cost: input_tokens (1×), cache_read (0.1×), cache_creation (2×), output (5×). Learn to read these and you can tell a warm session from one rebuilding its cache every turn.
[capture] CC 2.1.150 · turn-1 cold write

Slide 21

Slide 21 text

PART 3 What it costs Which numbers in the usage block cost what?

Slide 22

Slide 22 text

The three things you pay for (plus output)
usage · each bucket's price vs. base input
    "usage": {
      "input_tokens": 3,                      × 1     uncached
      "cache_read_input_tokens": 30168,       × 0.1   the prize
      "cache_creation_input_tokens": 16,      × 2     1h write
      "output_tokens": 5                      × 5     not cached at gen
    }
A hit is ~10× cheaper than processing the same tokens cold (0.1× vs 1×). This is the prize.
Output is the priciest class, at 5× input; never cached as it's generated (re-paid once as input next turn).
WHY That ratio is the whole game: re-processing history is the cost the cache attacks; output is the cost it can't lower as you generate it. Tokens the model writes dominate generation-heavy turns.
[capture] CC 2.1.150
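To make the multipliers concrete, here is a small sketch that prices a usage block in "base input token" units. The multipliers are the ones on this slide; `turn_cost_units` and `looks_warm` are hypothetical helper names, and the cold and warm blocks are the deck's turn-1 and turn-2 captures.

```python
from typing import Dict

# Price of each usage bucket relative to one uncached input token (1x).
MULTIPLIERS = {
    "input_tokens": 1.0,
    "cache_read_input_tokens": 0.1,
    "cache_creation_input_tokens": 2.0,  # 1-hour-TTL write rate
    "output_tokens": 5.0,
}


def turn_cost_units(usage: Dict[str, int]) -> float:
    """Cost of one turn, in units of one uncached input token."""
    return sum(usage.get(field, 0) * mult for field, mult in MULTIPLIERS.items())


def looks_warm(usage: Dict[str, int]) -> bool:
    """A non-zero cache_read means at least part of the prefix was served warm."""
    return usage.get("cache_read_input_tokens", 0) > 0


cold = {"input_tokens": 3, "cache_read_input_tokens": 0,
        "cache_creation_input_tokens": 30168, "output_tokens": 4}
warm = {"input_tokens": 3, "cache_read_input_tokens": 30168,
        "cache_creation_input_tokens": 16, "output_tokens": 5}

for name, usage in (("cold", cold), ("warm", warm)):
    print(name, round(turn_cost_units(usage), 1), looks_warm(usage))
```

Almost the entire gap between the two turns comes from the same ~30K tokens moving from the 2× bucket to the 0.1× bucket.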

Slide 23

Slide 23 text

PART 3 · TWO FACTS TO BURN IN A cache hit is ~10× cheaper. Output is 5× — never cached at generation. Cost = re-processing history + the output you generate. Caching attacks the first term; the output you generate is never cached as it's produced — you re-pay it once as input on the next turn, then it rides warm.

Slide 24

Slide 24 text

The rate card (Anthropic list prices, per 1M tokens)
Model | Input (1×) | Cache read (0.1×) | Cache write 1h (2×) | Output (5×)
Fable 5 | $10.00 | $1.00 | $20.00 | $50.00
Opus 4.8 | $5.00 | $0.50 | $10.00 | $25.00
Sonnet 4.6 | $3.00 | $0.30 | $6.00 | $15.00
Haiku 4.5 | $1.00 | $0.10 | $2.00 | $5.00
Re-verify for your models and client version: prices are version-specific.
WHY Cache read = 0.1× input · 1-hour cache write = 2× input · output = 5× input, on every model. Ratios worth memorizing: Sonnet = 0.6× Opus, Haiku = 0.2× Opus, Sonnet = 3× Haiku.
[docs] claude.com/pricing (2026-06-15)
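As a worked example, the rate card can be turned into a dollar figure per turn. This is a sketch under the assumption that the prices are exactly those printed above; `RATE_CARD`, `card_is_consistent`, and `turn_cost_usd` are hypothetical names, and the two usage blocks are the deck's Sonnet 4.6 captures.

```python
import math
from typing import Dict

# USD per 1M tokens, copied from the rate card above.
RATE_CARD = {
    "Fable 5":    {"input": 10.00, "cache_read": 1.00, "cache_write_1h": 20.00, "output": 50.00},
    "Opus 4.8":   {"input": 5.00,  "cache_read": 0.50, "cache_write_1h": 10.00, "output": 25.00},
    "Sonnet 4.6": {"input": 3.00,  "cache_read": 0.30, "cache_write_1h": 6.00,  "output": 15.00},
    "Haiku 4.5":  {"input": 1.00,  "cache_read": 0.10, "cache_write_1h": 2.00,  "output": 5.00},
}
MULTIPLIERS = {"input": 1.0, "cache_read": 0.1, "cache_write_1h": 2.0, "output": 5.0}
USAGE_FIELD = {
    "input": "input_tokens",
    "cache_read": "cache_read_input_tokens",
    "cache_write_1h": "cache_creation_input_tokens",
    "output": "output_tokens",
}


def card_is_consistent() -> bool:
    """Check that every column equals its multiplier times the input price."""
    return all(
        math.isclose(prices[col], MULTIPLIERS[col] * prices["input"])
        for prices in RATE_CARD.values()
        for col in MULTIPLIERS
    )


def turn_cost_usd(model: str, usage: Dict[str, int]) -> float:
    """Dollar cost of one turn at list price."""
    prices = RATE_CARD[model]
    total = sum(usage.get(field, 0) * prices[col] for col, field in USAGE_FIELD.items())
    return total / 1_000_000


cold = {"input_tokens": 3, "cache_read_input_tokens": 0,
        "cache_creation_input_tokens": 30168, "output_tokens": 4}
warm = {"input_tokens": 3, "cache_read_input_tokens": 30168,
        "cache_creation_input_tokens": 16, "output_tokens": 5}
print(card_is_consistent())
print(f"cold: ${turn_cost_usd('Sonnet 4.6', cold):.6f}")
print(f"warm: ${turn_cost_usd('Sonnet 4.6', warm):.6f}")
```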

Slide 25

Slide 25 text

PART 4 The fix: caching How do you avoid re-processing the whole history every turn?

Slide 26

Slide 26 text

Caching's payoff: the same prefix, ~20× cheaper one turn later
TURN 1 · cold   "cache_read": 0,       "cache_creation": 30168
TURN 2 · warm   "cache_read": 30168,   "cache_creation": 16
That gap, a cold write vs a warm read, is what caching buys. The next slides explain how it works.
WHY The same ~30K prefix that cost a 2× write on turn 1 is read back at 0.1× on turn 2, about 20× cheaper.
[capture] CC 2.1.150 · claude-sonnet-4-6

Slide 27

Slide 27 text

PART 4 · MENTAL MODEL 2 Caching is a strict byte-prefix match. A cache entry is keyed on the exact tokens from position 0 up to a cut point. It is content-addressed, not position-addressed: identical leading bytes → hit. WHY Once you have this model, cache behavior is obvious instead of mysterious. It starts one level down, in how the model reads your prompt at all.

Slide 28

Slide 28 text

Under the hood: the model reads one token at a time
The cat sat on the mat
to understand each token, it looks back at all the earlier ones · reads in order, left → right
A token ≈ ¾ of a word. The model reads your prompt in order; to build each token's meaning it looks back over every token before it. That look-back is the compute that “processing the input” pays for.
WHY Because the client re-sends the entire conversation every turn (Part 1), a naive server would redo this whole look-back each turn, which is exactly what the cache lets it skip.
[docs]

Slide 29

Slide 29 text

For each token, the model builds three vectors: Q, K, V token “glass” Q — Query “what am I looking for in the earlier tokens?” K — Key “what do I offer, so other tokens can find me?” V — Value “the information I contribute when I'm attended to.” WHY Queries and Keys decide WHO attends to whom; Values are the content that gets mixed in. Hold onto K and V — those two vectors, computed for every token, are exactly what the prompt cache stores. [docs]

Slide 30

Slide 30 text

Attention: match my Query against earlier Keys, blend their Values
I poured water into the glass until it was full
“it”'s Query matches the Keys of “glass” and “water” most strongly → their Values dominate what “it” means. Every token runs this against all the tokens before it, in parallel.
WHY Attention is causal: a token attends only to tokens before it, so its Key/Value never depend on anything that comes later. That left-to-right property is exactly what lets a fixed prefix be cached.
[docs]

Slide 31

Slide 31 text

Why the prefix is the expensive part a ~30,000-token prefix = tools + system prompt + the whole conversation for EACH of those tokens: compute Q, K, V + attend to every earlier token That compute is what you pay for as input_tokens (1×) — or, when written to cache, cache_creation (2×). WHY The longer the prefix, the more K/V to compute and the more attention to run. And the client re-sends the whole conversation every turn — so a naive server redoes all of it, every turn. The cache is what stops that. [docs]

Slide 32

Slide 32 text

The prompt cache: store the K/V, then load instead of recompute COLD — recompute Compute every token's key/value vectors and its attention over all earlier tokens, from scratch. billed 1× (uncached input) WARM — load the KV cache Load the saved key/value vectors for the prefix and start real work at the first uncached token. billed 0.1× (cache read) Loading precomputed K/V (0.1×) instead of recomputing it (1×) is why a cache read is ~10× cheaper. You pay the 2× write once, then read cheap. WHY The cache stores the computed key/value vectors (the “KV cache”) for the prefix tokens, keyed to the exact tokens that produced them. Only a contiguous prefix from token 0 can be cached — state at token N depends on every token before it. [docs]

Slide 33

Slide 33 text

Proof: output is never cached at generation
TURN 1 · generate
    user: "print 1..200"
    output_tokens: 403
    cache_creation: (the 403 is NOT in this turn)
TURN 2 · the very next turn
    cache_creation: 413 ≈ the 403 you just generated, written once
    then: read warm forever
Output is paid once as output, then cache-written once next turn (W2 413 ≈ O1 403), then rides as cheap reads. R3 (30,583) = R2 (30,170) + W2 (413), exact.
WHY Each output token's KV is computed once while decoding and discarded: there's no cached output-KV to reuse, only the re-encoded text.
[capture] CC 2.1.150 · reproduces logs/2026-06-19

Slide 34

Slide 34 text

PART 4 · THE HEADLINE INVARIANT Change one token at position N → every cached state at position ≥ N is invalid. Each later key/value vector was computed attending to the token you changed. Cacheable spans grow from the start, never from the middle. WHY This is why render order is load-bearing (stable bytes first) and why a byte change inside the prefix forces a cold rewrite from that point forward, at 2×.
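The invariant can be illustrated with a toy prefix cache. This is a sketch of the idea only, not how the real service stores entries: it hashes whole blocks rather than tokens, keeps a key at every cut point instead of only at breakpoints, and `prefix_keys` / `warm_blocks` are hypothetical names.

```python
import hashlib
from typing import List, Set


def prefix_keys(blocks: List[str]) -> List[str]:
    """One key per cut point; key i covers blocks[0..i], in order."""
    keys = []
    running = hashlib.sha256()
    for block in blocks:
        data = block.encode("utf-8")
        running.update(len(data).to_bytes(8, "big"))  # length-prefix each block
        running.update(data)
        keys.append(running.copy().hexdigest())
    return keys


def warm_blocks(cache: Set[str], blocks: List[str]) -> int:
    """How many leading blocks can be served from the cache."""
    warm = 0
    for count, key in enumerate(prefix_keys(blocks), start=1):
        if key in cache:
            warm = count
    return warm


turn1 = ["tools", "system", "msg-1"]
cache = set(prefix_keys(turn1))  # turn 1 writes the whole prefix

print(warm_blocks(cache, turn1 + ["msg-2"]))                     # tail appended
print(warm_blocks(cache, ["tools", "system", "msg-1 (edited)"]))  # late change
print(warm_blocks(cache, ["tools + new MCP", "system", "msg-1"]))  # byte-0 change
```

Appending keeps all three cached blocks warm, a late edit keeps the two blocks before it, and a change in the first block leaves nothing to reuse, because every later key was derived from it.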

Slide 35

Slide 35 text

Break-even: when does writing to the cache pay off?
(costs in multiples of one uncached pass over the prefix)
Requests N | No cache | With cache (2 + 0.1·(N−1)) | Winner
1 | 1.0× | 2.0× | no cache
2 | 2.0× | 2.1× | ≈ tie
3 | 3.0× | 2.2× | cache ✓
10 | 10.0× | 2.9× | cache ✓
With the 5-min TTL (1.25× write), caching breaks even on the 2nd request; Claude Code forces the 1-hour TTL (2× write), which pushes break-even to the 3rd.
WHY Caching is a trade: pay one expensive write (2×) so later requests get cheap reads (0.1×). It wins from the 3rd request on. A short interaction that ends after one or two turns can be cheaper WITHOUT caching.
[docs] derived from the rate card
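The break-even points follow directly from the formula in the table header. A short sketch, assuming a fixed prefix that is written once and then read on every later request; the function names are hypothetical.

```python
def cost_without_cache(n_requests: int) -> float:
    """Every request reprocesses the prefix at 1x."""
    return float(n_requests)


def cost_with_cache(n_requests: int, write_mult: float) -> float:
    """One write at write_mult, then (n - 1) reads at 0.1x."""
    return write_mult + 0.1 * (n_requests - 1)


def break_even_request(write_mult: float) -> int:
    """First request count at which caching is strictly cheaper."""
    n = 1
    while cost_with_cache(n, write_mult) >= cost_without_cache(n):
        n += 1
    return n


print(break_even_request(1.25))  # 5-minute TTL write rate
print(break_even_request(2.0))   # 1-hour TTL write rate
for n in (1, 2, 3, 10):
    print(n, cost_without_cache(n), round(cost_with_cache(n, 2.0), 1))
```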

Slide 36

Slide 36 text

Mistake: paying for a cache you never read back
⚠ MISTAKE A wasted write (content cached but never read again) costs 2×, double what you'd pay had you never cached it. Caching is a bet that the prefix will be sent at least three times: one write plus at least two warm reads.
FIX ✓ Let Claude Code's defaults stand for interactive sessions (they're re-read many times). Only worry about this in custom harnesses that cache aggressively but terminate early.
WHY Every read also refreshes the TTL, so a continuously-reused prefix never expires, and the win compounds the longer a session runs.
[docs]

Slide 37

Slide 37 text

PART 5 How Claude Code caches Where are the cache boundaries, and how do they move?

Slide 38

Slide 38 text

Claude Code's actual caching: the 3-breakpoint dump
captured request · cache_control markers
    TOOLS     30 definitions                                     ◀ byte 0, no marker
    SYSTEM
      system[0]   len 85      billing header (cch=b2984)
      system[1]   len 62      "You are a Claude agent…"          ◀ BP1 (1h)
      system[2]   len 26934   full system prompt                 ◀ BP2 (1h)
    MESSAGES
      messages[0].content[0]   agents
      messages[0].content[1]   skills
      messages[0].content[2]   CLAUDE.md + rules + date
      messages[0].content[3]   user prompt                       ◀ BP3 (1h, slides)
tools fold into BP1: there is no dedicated tool breakpoint.
2 frozen system breakpoints (BP1, BP2) + 1 sliding tail (BP3) = 3 markers, all ttl=1h.
WHY Captured via raw request logging: 3 cache_control breakpoints, all 1-hour TTL, with 2 on the system tier and 1 on the sliding message tail. The tools array carries zero markers; it folds into the first system breakpoint.
[capture] CC 2.1.150

Slide 39

Slide 39 text

Where each breakpoint sits, and what its prefix covers
captured request · cache_control placement
    TOOLS     30 definitions                                     ◀ byte 0, no marker
    SYSTEM
      system[0]   len 85      billing header (cch=b2984)
      system[1]   len 62      "You are a Claude agent…"          ◀ BP1 (1h)
      system[2]   len 26934   full system prompt                 ◀ BP2 (1h)
    MESSAGES
      messages[0].content[0]   agents
      messages[0].content[1]   skills
      messages[0].content[2]   CLAUDE.md + rules + date
      messages[0].content[3]   user prompt                       ◀ BP3 (1h, slides)
BP1's cached prefix covers tools + system[0] + system[1], the whole stable front. (system[2] below is BP2's.)
BP3 slides onto the newest message tail each turn; everything in messages[] is the message tier.
WHY BP1 folds tools + system[0] + system[1]; BP2 covers the full system prompt (system[2]); BP3 is the sliding tail. CLAUDE.md, skills, and the date all sit in messages[0], after the system breakpoints, so editing them never disturbs the front.
[capture] CC 2.1.150

Slide 40

Slide 40 text

All three breakpoints request the 1-hour TTL
captured request · the cache_control markers
    system[1]   "cache_control": {"type":"ephemeral","ttl":"1h"}   ◀ BP1
    system[2]   "cache_control": {"type":"ephemeral","ttl":"1h"}   ◀ BP2
    msg tail    "cache_control": {"type":"ephemeral","ttl":"1h"}   ◀ BP3
ttl = 1h on all three: Claude Code overrides the API's 5-minute default.
WHY The 1-hour TTL costs 2× to write (vs 1.25× for 5-min), which is why break-even is the 3rd request, not the 2nd. Claude Code chooses this because interactive sessions reuse the prefix for far longer than 5 minutes.
[capture] CC 2.1.150

Slide 41

Slide 41 text

The sliding window: the tail breakpoint moves forward each turn
    turn 1 (cold)   [ entire prefix written cold (2×) ] ◀ BP3
    turn 2          [ cached prefix (read warm · 0.1×) ][ new (2×) ] ◀ BP3
    turn 3          [ cached prefix (read warm · 0.1×) ][ new (2×) ] ◀ BP3
WHY After turn 1's cold write, each later turn slides BP3 to the newest message: everything earlier is read warm (0.1×) and only the new slice since the last boundary is written (2×, the yellow tail). The big system prompt is written exactly once.
[measured]

Slide 42

Slide 42 text

The sliding window, measured: only the tail moves
Turn | BP3 sits on | cache_read | cache_creation
1 | messages[0].content[3] | 0 | 30,168
2 | messages[2].content[0] | 30,168 | 16
3 | messages[4].content[0] | 30,184 | 16
WHY BP1 and BP2 (the system tier) hold the same bytes every turn: written once on turn 1, read warm after. Only BP3 slides to the newest tail, writing just the ~16-token delta. And R3 = R2 + W2 = 30,184, exact.
[capture] CC 2.1.150

Slide 43

Slide 43 text

Delta-only writes: the cost is just the new tokens TURN 1 · cold cache_read: 0 cache_creation: 30168 (whole prefix written) TURNS 2–3 · warm cache_read: 30168 cache_creation: 16 (only the tail written) write cost per turn ≈ (new tokens since the last boundary) × 2×. The big system prompt is written exactly once. WHY After the cold turn-1 write, every later turn reads the whole prefix warm and writes only the ~16-token sliding tail. [capture] CC 2.1.150
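A rough simulation of what delta-only writes buy over a whole session. It is a sketch under simplifying assumptions: a fixed 30,168-token prefix, a constant 16-token tail per turn (as in the capture), and no accounting for `input_tokens` or output; `session_cost_units` is a hypothetical name.

```python
def session_cost_units(prefix: int, delta: int, turns: int, cached: bool) -> float:
    """Total input-side cost of a session, in uncached-input-token units."""
    total = 0.0
    context = prefix  # tokens sent on turn 1
    for turn in range(1, turns + 1):
        if not cached:
            total += context * 1.0  # reprocess everything, every turn
        elif turn == 1:
            total += context * 2.0  # cold write of the whole prefix
        else:
            # read everything up to the old boundary, write only the new tail
            total += (context - delta) * 0.1 + delta * 2.0
        context += delta  # the conversation grows by one tail per turn
    return total


for turns in (1, 3, 20):
    with_cache = session_cost_units(30168, 16, turns, cached=True)
    no_cache = session_cost_units(30168, 16, turns, cached=False)
    print(turns, round(with_cache, 1), round(no_cache, 1))
```

On turns 2 and 3 the simulated reads are 30,168 and 30,184 tokens with a 16-token write each, matching the captured usage blocks.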

Slide 44

Slide 44 text

The prefix grows by exactly last turn's write
the identity R(N+1) = R(N) + W(N)
    turn 2:   read = 30168   write = 16
    turn 3:   read = 30184

    30184 = 30168 + 16
    R3 = R2 + W2   ✓ exact
Exact, to the token. Next turn's warm read = this turn's read + this turn's write. Cache entries are immutable; nothing is purged.
WHY The cached prefix grows each turn by exactly what the previous turn wrote. A big generated block (e.g. 1..200) is written once, then sits inside the warm read on every subsequent turn: never regenerated, never re-written.
[capture] CC 2.1.150
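The identity doubles as a cheap health check on logged usage. A minimal sketch: `first_broken_turn` is a hypothetical helper, the `captured` series is the one on this slide, and the `busted` series is an invented example of a session whose third turn went cold.

```python
from typing import List, Optional, Tuple


def first_broken_turn(turns: List[Tuple[int, int]]) -> Optional[int]:
    """turns[i] is (cache_read, cache_creation) for turn i + 1.

    Returns the 1-based turn whose read is not the previous read plus the
    previous write, or None if R(N+1) = R(N) + W(N) holds throughout.
    """
    for i in range(1, len(turns)):
        prev_read, prev_write = turns[i - 1]
        if turns[i][0] != prev_read + prev_write:
            return i + 1
    return None


captured = [(0, 30168), (30168, 16), (30184, 16)]  # from the slide
busted = [(0, 30168), (30168, 16), (0, 30200)]      # illustrative only

print(first_broken_turn(captured))
print(first_broken_turn(busted))
```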

Slide 45

Slide 45 text

PART 5 · THE COROLLARY YOU'LL LEAN ON Every read refreshes the TTL. A continuously-reused prefix never expires. The longer a session runs, the more the one-time write cost amortizes into cheap, TTL-refreshing reads.

Slide 46

Slide 46 text

PART 5 · A LIMIT YOU CAN'T CHANGE FROM INSIDE Claude Code uses 3 of the 4 available cache breakpoints. The 4th slot sits unused, and you can't place your own. A request to expose breakpoint placement in settings.json was closed as “not planned” (#58103). WHY So from inside the client you can't add a breakpoint where you might want one — say, to survive a big burst of parallel tool calls (Part 6 shows how that can blow the cache). A request-rewriting proxy could use the 4th slot, but it then owns your cache correctness and rarely pays off.

Slide 47

Slide 47 text

The billing-header gotcha: system[0] is outside the cache key
Turn-2 mutation | cache_read | cache_creation | result
none (control) | 30,178 | 21 | HIT
system[0] version digit | 30,206 | 25 | HIT
system[1] one byte | 0 | 30,212 | MISS
WHY system[0]'s cch= token changes every request, yet warm reads still HIT: the ENTIRE first system block is special-cased out of the cache key. A byte flip in system[1] misses hard. Don't chase system[0] as a phantom buster.
[capture] CC 2.1.150 · reproduces logs/2026-06-17

Slide 48

Slide 48 text

PART 6 What breaks it What quietly throws your warm cache away?

Slide 49

Slide 49 text

PART 6 · WHAT BUSTS THE CACHE Anything that changes the prefix bytes re-keys the cache. The next request rewrites at 2× instead of reading at 0.1×. Below: the ways it happens in a real session — tool churn, injected context, runtime tool mutation, and oversized tool bursts.

Slide 50

Slide 50 text

The invalidation hierarchy: what survives (✅) vs. re-keys (❌) each tier
Change | Tools | System | Messages
Tool definitions (add / remove / reorder) | ❌ | ❌ | ❌
Model switch | ❌ | ❌ | ❌
speed / web-search / citations toggle | ✅ | ❌ | ❌
System-prompt content | ✅ | ❌ | ❌
tool_choice / images / thinking toggle | ✅ | ✅ | ❌
Message content | ✅ | ✅ | ❌
The columns run in byte order (Tools → System → Messages), so it's always a staircase: break a tier and every tier after it re-keys too; you never see warm (✅) after cold (❌). That cascade is what makes it a hierarchy.
WHY ✅ = that tier still reads warm; ❌ = it re-keys (cold-writes at 2×). The two rows that re-key ALL THREE tiers (tool-definition changes and model switches) are the expensive ones: they sit at or before byte 0, so they rebuild everything.
[docs] Anthropic, Prompt caching (cache-tiers / what-invalidates)
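The staircase can be written down as a lookup: each kind of change names the first tier it breaks, and every later tier goes cold with it. This sketch only encodes the table above; the change keys and the `tiers_status` name are hypothetical labels.

```python
from typing import Dict

TIERS = ("tools", "system", "messages")  # byte order of the request

# First tier each kind of change invalidates, read off the table above.
FIRST_INVALIDATED = {
    "tool_definitions": "tools",
    "model_switch": "tools",
    "speed_web_search_citations_toggle": "system",
    "system_prompt_content": "system",
    "tool_choice_images_thinking_toggle": "messages",
    "message_content": "messages",
}


def tiers_status(change: str) -> Dict[str, bool]:
    """Map each tier to True (still warm) or False (re-keys) for a change."""
    first_cold = TIERS.index(FIRST_INVALIDATED[change])
    return {tier: index < first_cold for index, tier in enumerate(TIERS)}


for change in FIRST_INVALIDATED:
    print(change, tiers_status(change))
```

Because the status is derived from a single index, a warm tier can never appear after a cold one, which is the cascade the slide describes.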

Slide 51

Slide 51 text

Dynamic MCP tools: only one of these lives at byte 0
Thing | Lives in | Cache impact
Tool definitions | tools param (byte 0) | changing these is catastrophic
Tool calls (tool_use) | assistant message | late, cheap
Tool results (tool_result) | user message | late, cheap
WHY MCP servers can change their advertised tools at runtime via notifications/tools/list_changed. When they do, the new definition lands at byte 0 → the entire prefix re-keys → a full cold rebuild at 2×. Calls and results are late and cheap.
[docs + measured]

Slide 52

Slide 52 text

Dynamic registration on the wire: the new tool lands at byte 0
request body · tools[] after a tools/list_changed
    # mid-session, the MCP server announces a new tool
    server: {"method":"notifications/tools/list_changed"}
    TURN N     "tools": [ …30 defs… ]   ◀ byte 0
    ────────────────────────────────────────────────
    TURN N+1   "tools": [   // SAME array, re-shipped in full
      { "name":"Bash", … }, …29 unchanged…
      { "name":"mcp__github__merge_pr",   ◀ NEW def
        "description":"Merge a pull request…",
        "input_schema":{ "pull_number":{…} } }
    ]   byte 0 changed → whole prefix re-keys
    # …further down the same request body…
    messages[…] content: tool_use → mcp__github__merge_pr   ◀ the CALL
    messages[…] content: tool_result   ◀ the RESULT
The trigger is notifications/tools/list_changed: an MCP server changes its advertised tools at runtime.
New definition at byte 0: the next turn re-ships the tools array with the new def near the front of the request. byte 0 changes → the whole prefix cold-rewrites at 2×.
Call & result land late: tool_use and tool_result append to the message tail, so they are read warm and billed cheap. Only the definition is catastrophic.
WHY An MCP server can add or drop tools mid-session via notifications/tools/list_changed. The next turn re-ships the entire tools array, including the new definition, at byte 0, so the cached prefix is thrown away and cold-rewritten at 2×. The matching tool_use / tool_result land late in the message tail and stay cheap. That's why a “dynamic toolsets” feature is a cache hazard, while a tool set fixed at startup is safe.
[docs] MCP tools/list_changed · schema illustrative

Slide 53

Slide 53 text

Proof: the async-connector blow-up
NO --strict-mcp-config (connectors register mid-session)
    turn 1   tools = 30   (cold)
    turn 2   tools = 55   MISS
    turn 3   tools = 85   MISS
--strict-mcp-config (byte 0 frozen from turn 1)
    turn 1   tools = 30
    turn 2   tools = 30   HIT
    turn 3   tools = 30   HIT
Four claude.ai connectors registered asynchronously after startup, so byte 0 kept growing and busting the cache every turn. Pinning the config froze it at 30.
WHY Stabilize the tool/MCP surface FIRST. At a 2× write rate every cold turn is doubly expensive, which makes this the highest-leverage fix in the guide.
[capture] 2026-06-15
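Tool churn of this kind is easy to spot in logged request bodies by fingerprinting the tools array on each turn. A sketch, assuming the bodies are already parsed into dicts; note that it hashes a re-serialized copy rather than the raw wire bytes, and `tools_fingerprint`, `tool_churn`, and `fake_body` are hypothetical helpers with made-up tool names.

```python
import hashlib
import json
from typing import Dict, List, Tuple


def tools_fingerprint(body: Dict) -> str:
    """Hash the tools array in order; a reorder changes the fingerprint too."""
    canonical = json.dumps(body.get("tools", []), separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def tool_churn(bodies: List[Dict]) -> List[Tuple[int, int]]:
    """Return (turn, tool_count) for each turn whose tools array changed."""
    churn = []
    previous = None
    for turn, body in enumerate(bodies, start=1):
        current = tools_fingerprint(body)
        if previous is not None and current != previous:
            churn.append((turn, len(body.get("tools", []))))
        previous = current
    return churn


def fake_body(n_tools: int) -> Dict:
    """Illustrative request body with n placeholder tool definitions."""
    return {"tools": [{"name": f"tool_{i}"} for i in range(n_tools)]}


unpinned = [fake_body(30), fake_body(55), fake_body(85)]
pinned = [fake_body(30), fake_body(30), fake_body(30)]
print(tool_churn(unpinned))
print(tool_churn(pinned))
```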

Slide 54

Slide 54 text

Injected context: the date barely costs anything
TURN before midnight
    messages[0] date: 2026-06-16
    cache_read: 21812   cache_creation: 8305
TURN after midnight
    messages[0] date: still 06-16
    cache_read: 30117   cache_creation: 58
The date lives in a system-reminder AFTER the system breakpoints. A midnight rollover doesn't rewrite messages[0]; it APPENDS a “date has changed” note to the newest turn: 58 tokens, not a rebuild.
WHY Git status is snapshotted once at session start (frozen). Re-running it every turn and injecting it early would bust constantly, a cautionary design lesson.
[capture] 2026-06-17

Slide 55

Slide 55 text

Case study: GitHub removed it; Docker ships it default-on
Mode | Tools | Mutates byte 0 at runtime? | Cache-safe?
default / --toolsets | fixed at startup (~50) | No | ✅
GitHub --dynamic-toolsets (removed v1.1.0) | 3 meta-tools, grow on demand | Yes: enable_toolset → AddTool → list_changed | ❌
Docker dynamic-tools (default-ON) | mcp-find / mcp-add / mcp-remove … | Yes: same MCP Go SDK sink | ❌
GitHub removed its dynamic path in v1.1.0 (tech-debt, not cache; PR #2512). Docker frames the cost as token volume; the prompt-cache consequence is our analysis.
WHY When the model enables a toolset, the handler registers tools against the live server and fires tools/list_changed → new defs at byte 0 → full cold rebuild, paid again on every enable call.
[source] github-mcp-server v1.0.5 · docker/mcp-gateway v0.43.0

Slide 56

Slide 56 text

PART 6 · WHY BLOCKS MATTER Re-linking the cache looks back only ~20 blocks. Each turn the sliding breakpoint sits on the newest message. To reuse the cache, the API walks back at most ~20 content blocks from that breakpoint, hunting for last turn's entry — and gives up if it finds none in that window, writing the whole prefix cold as if it had never been cached. (It counts blocks, not messages: ~10 parallel tools ≈ 20 blocks.) WHY Why 20 and not more? A fixed cap keeps the lookup cheap — the API never rescans thousands of blocks per request to find a match. (Anthropic documents the 20-block limit, but not the reason; the compute rationale is our inference.) [docs] 20-block lookback · rationale inferred

Slide 57

Slide 57 text

Proof: agentic tool bursts overflow the 20-block lookback
28 parallel tools = 57 blocks
    one turn fans out 28 tools · blocks added: 57 · next turn: MISS · read 0 · write 28149
vs
5 parallel tools = 11 blocks
    one turn fans out 5 tools · blocks added: 11 · next turn: HIT · read 25672 · write 452
The sliding breakpoint sits on the newest message; the API walks back only ~20 blocks from it. A burst that adds >20 blocks (≈10 parallel tools) pushes last turn's entry out of reach → the API can't re-link and cold-rewrites the whole prefix.
WHY You can't place your own breakpoints inside Claude Code, so the lever is to bound the burst: fewer parallel calls per turn.
[measured] claude-opus-4-8
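A back-of-the-envelope check for whether a planned burst fits inside the lookback window. This is a sketch resting on two assumptions that the deck does not state outright: that a burst of n parallel tools adds 2n + 1 blocks (inferred from the two captures, 5 → 11 and 28 → 57), and that a burst re-links as long as it adds no more than 20 blocks. The function names are hypothetical.

```python
LOOKBACK_BLOCKS = 20  # documented lookback window, in content blocks


def blocks_added(parallel_tools: int) -> int:
    """Blocks one burst adds: a call and a result per tool, plus one extra.

    The "+ 1" is inferred from the captures above, not documented.
    """
    return 2 * parallel_tools + 1


def relinks(parallel_tools: int, lookback: int = LOOKBACK_BLOCKS) -> bool:
    """True if last turn's cache entry should still be within reach."""
    return blocks_added(parallel_tools) <= lookback


def max_safe_burst(lookback: int = LOOKBACK_BLOCKS) -> int:
    """Largest parallel-tool count that still fits the window."""
    n = 0
    while relinks(n + 1, lookback):
        n += 1
    return n


for tools in (5, 28):
    print(tools, blocks_added(tools), "HIT" if relinks(tools) else "MISS")
print("max safe burst:", max_safe_burst())
```

Under these assumptions the ceiling comes out just under ten parallel calls per turn, in line with the deck's "≈10 parallel tools" rule of thumb.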

Slide 58

Slide 58 text

Recap: the four mental models
1 · The amnesiac contractor: the server remembers nothing; the request is the only state.
2 · The prefix-match cache: caching is a strict byte-prefix match; one early byte changed invalidates everything after it.
3 · The model-scoped key: a cache entry belongs to exactly one model; another model can't read it. (Act II)
4 · The encrypted envelope: the model's reasoning rides back sealed; you carry it but can't open it. (Act II)
You used the first two all through Act I. The last two (model-scoped key, encrypted envelope) are what Act II is about.

Slide 59

Slide 59 text

What to do: the end of Act I
✓ Freeze the tool/MCP surface from turn 1 (--strict-mcp-config). Byte-0 churn is the most expensive mistake.
✓ Keep volatile tokens (dates, IDs) after the last breakpoint, exactly where Claude Code puts them.
✓ Verify caching with the usage block, not faith: a non-zero cache_read across turns means warm.
✓ Don't trust the displayed cost as your invoice: it prices 1h writes at the 5-min rate. Console is truth.
✓ Bound parallel tool bursts (< ~10/turn) so you don't overflow the 20-block lookback.
Steady-state, a single-model session is mostly cheap reads + the output you generate. The expensive surprises are Act II.