Speeding up generation with the KV cache

Watch the decode loop closely and a glaring waste appears. To generate token 1,001, the new token’s query must be compared against the keys of all 1,000 previous tokens, and their values blended. But those tokens have not changed since the last step. Recomputing their keys and values from scratch, in every block, for every new token, would mean redoing almost the entire history’s work at every single step. No production system does this.

Store every token’s keys and values the first time they are computed, and each new step only computes one new token’s worth of work.

The fix

The KV cache is exactly what the name says: a memory buffer holding the key and value vectors of every token processed so far, for every attention head in every block. The loop becomes:

During prefill, compute the keys and values of all prompt tokens once, in parallel, and store them.
To generate a new token, run only that token through the network. Compute its query, key, and value; compare its query against all cached keys; blend the cached values; append the new key and value to the cache.
Repeat. Each step adds one entry and recomputes nothing.

Note what is cached and what is not. Queries are not stored: a token’s query is used the moment the token is processed and never needed again, since future tokens compare their own queries against past keys. Keys and values are the reusable half of attention, which is why the cache is named after them.

What it buys

Without the cache, generating each token costs work proportional to the whole sequence reprocessed through every block. With it, each step costs one token’s pass plus one lookup across the cache. The speedup grows with sequence length, and for long contexts it is the difference between usable and unusable: this single optimization is why chatbots stream tokens at a steady rate instead of slowing to a crawl as the conversation grows.

What it costs

The trade is memory. The cache holds two vectors per token, per head, per block, and it grows linearly with every token in the context. For a large model serving a 100k-token context, the cache reaches tens of gigabytes, often rivaling the model weights themselves, and it must live in fast GPU memory to be useful. This is the quiet reason behind several things you see as a user:

Long-context requests are priced higher and limited in concurrency: each one pins a giant cache in GPU memory.
Serving research obsesses over shrinking the cache, with techniques like grouped-query attention, where several heads share one set of keys and values.
Providers offer prompt caching discounts: if many requests share a long common prefix, such as a system prompt, its KV cache can be computed once and reused across requests, skipping that part of prefill entirely.

Chapter summary

The journey of this chapter, end to end: token vectors enter a stack of identical blocks, where attention moves information between tokens using queries, keys, and values, and feed-forward networks process each token alone. The LM head turns the final vector into logits, softmax makes probabilities, and a decoding strategy picks one token. Prompts are prefetched in one parallel pass, generation decodes one token at a time, attention’s quadratic appetite sets the context window, and the KV cache makes the loop fast by never repeating finished work.

In the next chapter we ask where the model’s weights come from in the first place: training, from raw text prediction to instruction following.