KV Cache — Aftomathia

The hook

Watch ChatGPT or Claude answer you: the words appear one at a time. Here's the part nobody tells you — the model doesn't write a sentence in one go. It writes one word, then starts over to pick the next word, then starts over again… hundreds of times per answer.

Now imagine a person who, every time they add a word to their essay, re-reads the entire essay from the very beginning — including your question and the whole conversation. Word 500 would require re-reading 499 words of work they already did. Insane, right?

The KV cache is how models avoid being that person: it's the model's notebook. Work done once on previous words gets written down and reused, so each new word only requires the new thinking.

Intuition @practitioner

Recall how Attention works: when the model processes a new token, that token's query is compared against the key of every previous token, and the resulting scores decide how much of each previous token's value gets mixed into the output.

Here's the crucial observation: because generation is causal (tokens only look backward), the keys and values of past tokens never change. Token 17's key/value vectors are identical whether the model is currently writing token 18 or token 800. So why recompute them?

The KV cache is exactly that: after computing each token's K and V vectors (at every layer), store them. For the next token, compute only its query/key/value, attend against the stored K/V, and append the new K/V to the cache. You've traded memory for compute — the classic caching bargain — and turned a quadratic re-computation into a linear lookup.
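The store-and-append loop described above can be sketched for a single attention head in one layer. This is a toy illustration in plain Python, not any framework's implementation: the names `KVCache`, `decode_step`, and `recompute_step`, the random weights, and the width `d = 4` are all assumptions made for the example. It checks that attending over cached K/V gives the same output as re-projecting every token from scratch.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(q, keys, values):
    """Softmax attention of one query over the given keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for i, vi in enumerate(v):
            out[i] += (w / total) * vi
    return out


class KVCache:
    """Hypothetical container: the K and V vector of every token seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)


def decode_step(x, Wq, Wk, Wv, cache):
    """With the cache: project only the new token, then attend over stored K/V."""
    q, k, v = matvec(Wq, x), matvec(Wk, x), matvec(Wv, x)
    cache.append(k, v)
    return attend(q, cache.keys, cache.values)


def recompute_step(xs, Wq, Wk, Wv):
    """Without the cache: re-project every token just to produce one output."""
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values)


def rand_matrix(n):
    return [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]


random.seed(0)
d = 4  # toy vector width (assumption)
Wq, Wk, Wv = rand_matrix(d), rand_matrix(d), rand_matrix(d)
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(6)]

cache = KVCache()
cached_outputs = [decode_step(x, Wq, Wk, Wv, cache) for x in tokens]
full_outputs = [recompute_step(tokens[: t + 1], Wq, Wk, Wv) for t in range(len(tokens))]

max_diff = max(
    abs(a - b)
    for cached, full in zip(cached_outputs, full_outputs)
    for a, b in zip(cached, full)
)
print("largest difference between cached and recomputed outputs:", max_diff)
```

The cached path does three projections per new token, while the uncached path does them for every token so far. A real model repeats this at every layer and head, but the bookkeeping is the same.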

The catch: the cache grows with every token, at every layer, for every sequence being served. At scale, KV cache memory — not model weights — becomes the thing that limits how many users a GPU can serve at once.

In practice

Where the KV cache shows up in real life:

Why the first word takes longest. Before generating, the model must build the cache for your entire prompt in one pass — called prefill. That's the pause before the first token. After that, each token is cheap (decode), which is why text then streams smoothly.
Why long conversations get slow and expensive. The cache grows linearly with context length: more memory per user, more data to read per token. Per-token API pricing and rate limits partly reflect this.
Prompt caching discounts. When providers offer cheap "cached input tokens," they're literally saving and reusing the KV cache for a prompt prefix you send repeatedly (like a long system prompt). Same trick, one level up.
Serving tricks you'll read about: PagedAttention (vLLM) manages cache memory in small fixed-size pages, like an operating system's virtual memory, which nearly eliminates fragmentation; grouped-query attention (GQA) and multi-head latent attention (MLA) redesign attention itself to shrink the cache by roughly 4–30×; KV quantization stores the cache in 8-bit or 4-bit numbers.
A rule of thumb: generation speed is usually limited by how fast the GPU can read memory (weights + KV cache), not by math. That's why "memory bandwidth" is the spec that matters for inference — a preview of the hardware story.
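The memory-bandwidth rule of thumb can be turned into a back-of-the-envelope estimate. The sketch below assumes every decoded token requires reading all the weights and the whole KV cache from memory once, and it ignores batching, compute time, and overlap. The function name and the weight and cache sizes are hypothetical; the 3 TB/s figure simply echoes the HBM number quoted elsewhere in this article.

```python
def decode_tokens_per_second(weight_bytes, kv_cache_bytes, bandwidth_bytes_per_s):
    """Rough upper bound on single-sequence decode speed when memory reads dominate."""
    bytes_read_per_token = weight_bytes + kv_cache_bytes
    return bandwidth_bytes_per_s / bytes_read_per_token


GB = 1e9
TB = 1e12

# Hypothetical setup: 14 GB of weights, 3 TB/s of memory bandwidth.
short_context = decode_tokens_per_second(14 * GB, 2 * GB, 3 * TB)   # small KV cache
long_context = decode_tokens_per_second(14 * GB, 16 * GB, 3 * TB)   # large KV cache

print(f"short context: ~{short_context:.1f} tokens/s at best")
print(f"long context:  ~{long_context:.1f} tokens/s at best")
```

Under these assumptions, growing the cache from 2 GB to 16 GB nearly halves the best-case speed, with no change in the amount of math per token.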

Theory

Generation runs in two phases with very different characters:

Prefill: the whole prompt is processed in one parallel pass. Every token's K and V vectors are computed at every layer and written into the cache. This phase is compute-heavy, parallel, and GPU-friendly; its cost grows roughly O(n²) with prompt length, since it is full Attention over the prompt.
Decode: one token at a time. At each layer, the new token's hidden state produces its q, k, v; q attends over all cached K/V; the new k, v are appended. Per-token cost is O(n) in context length, but it's nearly all memory reads, not math, which is why decode is memory-bandwidth-bound.

Without the cache, generating token n would redo attention for all previous tokens: O(n²) work per token, O(n³) for a whole response. With the cache: O(n) per token, O(n²) total. For long responses, that is the difference between unusably slow and practical.
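The totals above follow from a simple count. As a simplifying assumption, take one "unit" of work to be one query-key comparison and ignore constant factors, layers, and heads. Then the per-token and whole-response costs work out as:

```latex
\begin{align*}
\text{With cache, token } t:\quad & t \text{ comparisons} \\
\text{With cache, } n \text{ tokens}:\quad & \sum_{t=1}^{n} t = \frac{n(n+1)}{2} = O(n^2) \\[4pt]
\text{Without cache, token } t:\quad & \sum_{s=1}^{t} s = \frac{t(t+1)}{2} = O(t^2) \\
\text{Without cache, } n \text{ tokens}:\quad & \sum_{t=1}^{n} \frac{t(t+1)}{2} = \frac{n(n+1)(n+2)}{6} = O(n^3)
\end{align*}
```

The ratio of the two totals is (n+2)/3. For n = 1000 that is 500,500 comparisons with the cache against 167,167,000 without it, a factor of 334.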

What's stored: for each layer and each attention head, the K and V vectors of every token seen so far. Nothing about the model's weights is cached — the cache is per-conversation state, which is why serving 1,000 simultaneous chats needs 1,000 KV caches resident in GPU memory.
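The cache's size follows directly from that description: two vectors (K and V) per token, per layer, per key/value head. A minimal sizing sketch, where the function name and the model shape (32 layers, head dimension 128, 16-bit values) are hypothetical choices for illustration and not any particular model's published configuration:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, num_sequences=1):
    """Bytes needed to hold K and V for every token, layer, and KV head."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * seq_len * num_sequences


GiB = 1024 ** 3

# Hypothetical model: one KV head per query head, 4096-token context.
one_chat = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)

# Same shape, but grouped-query attention with 8 shared KV heads.
one_chat_gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)

# 1,000 simultaneous chats, each with its own cache.
many_chats = kv_cache_bytes(32, 32, 128, 4096, num_sequences=1000)

print(f"one chat:       {one_chat / GiB:.2f} GiB")
print(f"one chat (GQA): {one_chat_gqa / GiB:.2f} GiB")
print(f"1,000 chats:    {many_chats / GiB:.0f} GiB")
```

With these numbers, each token costs 0.5 MiB of cache, so a single 4096-token chat already needs 2 GiB. Cutting the number of KV heads by 4× cuts the cache by the same factor.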

Used today

Every chat product you use — ChatGPT, Claude, Gemini — streams tokens through a KV cache; the prefill/decode split is why you see a pause, then smooth streaming.
Prompt caching is a billed product feature at Anthropic, OpenAI, and Google: reusing stored KV state for repeated prefixes at a fraction of the price.
vLLM, built around PagedAttention, is a de facto standard for open-source serving; its design centers on cache memory management.
GQA is in nearly every open model (Llama, Mistral, Qwen); MLA powers DeepSeek's models; both exist purely to shrink this cache.
Long-context agents (coding assistants that hold whole repos in context) live and die by KV-cache economics — cache size is the real cost of "just put everything in context."
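To make the paging idea behind PagedAttention concrete, here is a toy block-table allocator. It is only a sketch of the concept (fixed-size pages handed out on demand, plus a per-sequence table of page ids) and is not vLLM's code or API; the class name, method names, and sizes are all hypothetical.

```python
class PagedKVAllocator:
    """Toy allocator: each sequence's cache is a list of fixed-size pages."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.block_tables = {}  # sequence id -> list of page ids
        self.lengths = {}       # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; returns (page_id, offset_in_page)."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.page_size == 0:  # no page yet, or the last page is full
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.page_size

    def free(self, seq_id):
        """Return all of a finished sequence's pages to the free pool."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = PagedKVAllocator(num_pages=4, page_size=16)
for _ in range(20):
    alloc.append_token("chat-a")  # 20 tokens need 2 pages
for _ in range(5):
    alloc.append_token("chat-b")  # 5 tokens need 1 page

pages_for_a = len(alloc.block_tables["chat-a"])
free_before = len(alloc.free_pages)
alloc.free("chat-a")
free_after = len(alloc.free_pages)
print(pages_for_a, free_before, free_after)
```

Because pages are allocated only when a sequence actually grows into them, the wasted space per sequence is at most one partly filled page, instead of a whole pre-reserved maximum-length buffer.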

Where it’s going

Cache compression & eviction — research on dropping or merging unimportant cached tokens (H2O, SnapKV, TOVA): keep the notebook, tear out pages that stopped mattering.
Architectures that shrink or kill the cache — state-space models (Mamba) carry a fixed-size state instead of a growing cache; hybrid models mix a few attention layers with many SSM layers to get most of the quality with a fraction of the cache.
Cache as infrastructure — sharing KV state across requests, persisting it between sessions ("resume this conversation without re-prefilling"), even shipping it between datacenters. Expect "KV cache" to appear in more pricing pages and system-design docs.
Hardware co-design — chips like Groq's LPU keep everything in on-chip SRAM (~80 TB/s vs HBM's ~3 TB/s) precisely because decode is a memory-read problem; the KV cache is now a first-class input to chip design.

Big picture

The KV cache is where Attention's elegant math collides with physical reality — gigabytes and dollars. The O(n²) in the attention formula is abstract until you multiply it by 80 layers and a million users; the KV cache is the engineering answer, and its size dictates what a "context window" costs, why long chats are slow, and how many users one GPU can serve.

It's also the cleanest example of the field's most reliable pattern: trade memory for compute. You'll see the same bargain in prompt caching and in nearly every serving system; speculative decoding strikes a related one, spending spare compute to cut latency.

Mental anchor: attention decides what to look at; the KV cache is where what-was-seen lives. And since writing each new token is mostly reading that cache from memory, the speed of AI in practice is less about math and more about how fast memory can be read — which is exactly where the hardware story (GPUs, HBM, Groq, Cerebras) begins.
