6  Request Compute, Memory, and Latency Bottlenecks

The previous chapter tackled scheduling — how to keep the GPU busy by managing how requests flow through the system. This chapter examines how efficiently each individual request is handled. The first two sections address memory-bandwidth bottlenecks at the kernel level – optimizations that apply to both prefill and decode. The remaining sections target the sequential decoding constraint, exploring techniques that generate multiple tokens per step or amortize the cost of each one.

6.1 Memory-Aware Attention Kernels

Why standard attention wastes bandwidth

To understand why attention is expensive in practice, you need to think about where data lives during computation. Recall from Section 3.4 that the GPU has small, fast on-chip registers and SRAM, and large, slow off-chip HBM. The standard implementation of self-attention is depicted in Figure 6.1 and works roughly like this:

  1. Read the \(Q\) and \(K\) tensors, each shape (H, S, \(d_K\)) — reminder, this would be (B, H, S, \(d_K\)) with batched processing, but we’re avoiding the clutter of repeatedly listing the batch dimension.
  2. Compute the score matrix \(\text{Scores} = QK^T\) and write \(\text{Scores}\), shaped (H, S, S), to HBM.
  3. Read \(\text{Scores}\) shaped (H, S, S) back from HBM.
  4. Apply SoftMax to get the attention weights and write \(\text{Attn}\), also shaped (H, S, S), to HBM.
  5. Read \(\text{Attn}\) shaped (H, S, S) back from HBM, compute the weighted values \(\text{Weighted Values} = \text{Attn}V\).
  6. Write the weighted values, shaped (H, S, \(d_K\)), back to HBM.

The read of \(Q\) and \(K\) at the beginning needs to happen in some form, and the writing of the weighted values also needs to happen at the end. But the writing and reading back of the (H, S, S) \(\text{Scores}\) and \(\text{Attn}\) tensors in the middle wastes bandwidth, especially as S gets large. For a sequence of length 4,096, the score matrix alone is \(4096 \times 4096 = 16.7\) million entries per head. The actual computations of the matrix multiplications and softmax are fast. What’s slow is the round-trip to HBM for these intermediate results.
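To make those round trips concrete, here is the standard algorithm for a single head as a NumPy sketch. Each named intermediate stands in for a tensor that, on a GPU, would be written to HBM by one kernel and read back by the next (the usual \(1/\sqrt{d_K}\) scaling, which the step list above also omits, is left out):

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard attention for one head, materializing every intermediate.

    Q, K, V: arrays of shape (S, d_k). On a GPU, `scores` and `attn`
    (both S x S) are the intermediates that make the HBM round trips.
    """
    scores = Q @ K.T                                       # (S, S), written to HBM
    scores = scores - scores.max(axis=-1, keepdims=True)   # read back for a stable softmax
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)         # (S, S), written again
    return attn @ V                                        # (S, d_k) output
```

For S = 4,096 the two (S, S) intermediates hold 16.7 million elements each per head, matching the arithmetic above.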

This is a textbook example of a memory-bandwidth-bound operation. The GPU spends most of its time waiting for data to move to and from HBM, not doing arithmetic. And the problem scales quadratically with sequence length — double the sequence length and you quadruple the intermediate memory traffic.

It’s worth noting that even a single matmul like \(\text{Scores} = QK^T\) doesn’t happen as one monolithic operation. The \(Q\) and \(K\) matrices are far too large to fit in SRAM, so the GPU kernel tiles the computation. Consider \(Q\) with shape (S, \(d_K\)) and \(K^T\) with shape (\(d_K\), S). A standard tiled matmul loops over row-blocks of \(Q\) (groups of query positions) and column-blocks of \(K^T\) (groups of key positions). For each pair, the kernel loads a block of \(Q\) rows and a block of \(K^T\) columns into SRAM, computes their partial product, and writes the result into the corresponding tile of \(\text{Scores}\) in HBM. Since \(d_K\) is small (typically 64 or 128), the inner dimension often fits in a single pass. But the outer dimensions don’t: each block of \(Q\) rows is re-read from HBM once for every column-block of \(K^T\), and each column-block of \(K^T\) is re-read for every row-block of \(Q\). This means the total HBM reads exceed the theoretical minimum of reading each matrix exactly once — the GPU re-reads portions of \(Q\) and \(K\) multiple times to tile through the computation. But that extra read traffic is modest and scales with S, not \(S^2\). The real waste in standard attention isn’t within any single matmul — it’s the round-trips between operations: writing the (H, S, S) \(\text{Scores}\) matrix to HBM just so softmax can read it back, then writing the attention weights just so the next matmul can read them.

FlashAttention

FlashAttention (Dao et al. 2022) solves this by never materializing the full attention matrix in HBM. If SRAM were large enough to hold the attention matrices, the round trip to HBM would disappear entirely, but it is far too small. The core idea of FlashAttention is tiling: break the \(Q\), \(K\), and \(V\) matrices into small blocks that fit in SRAM, then compute the attention output block by block, keeping all intermediate results on-chip.

The algorithm works in two nested loops. The outer loop iterates over blocks of \(K\) and \(V\). The inner loop iterates over blocks of \(Q\). For each pair of blocks, the kernel computes the local attention scores, applies a local softmax, and accumulates the weighted sum – all in SRAM. The tricky part is that softmax requires the maximum and sum over the full row of scores, not just the current block. FlashAttention handles this with an online softmax technique that maintains running statistics and corrects the accumulated output as new blocks are processed.
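The online softmax is easiest to see for a single row of scores. The sketch below (plain NumPy, one query position, a hypothetical block size) keeps only a running maximum, a running normalizer, and the accumulated output, yet reproduces the exact softmax-weighted sum:

```python
import numpy as np

def online_softmax_attention(scores, V, block=4):
    """Blockwise softmax-weighted sum for one query row.

    scores: (S,) attention scores for one query against all keys.
    V: (S, d) value rows. Processes `block` keys at a time, never
    materializing the full softmax row, in the spirit of
    FlashAttention's online softmax.
    """
    S, d = V.shape
    m = -np.inf            # running max of scores seen so far
    denom = 0.0            # running sum of exp(score - m)
    acc = np.zeros(d)      # running weighted sum of value rows
    for start in range(0, S, block):
        s_blk = scores[start:start + block]
        v_blk = V[start:start + block]
        m_new = max(m, s_blk.max())
        scale = np.exp(m - m_new)   # rescale earlier blocks to the new max
        w = np.exp(s_blk - m_new)
        denom = denom * scale + w.sum()
        acc = acc * scale + w @ v_blk
        m = m_new
    return acc / denom
```

The `scale` factor is the correction step mentioned above: whenever a new block raises the running maximum, the previously accumulated output and normalizer are rescaled so the final result matches a softmax computed over the full row.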

The result: HBM traffic drops from \(O(S^2)\) to \(O(S \cdot d)\) for the attention matrix, where \(d\) is the head dimension (typically 64 or 128). For a 4K sequence with \(d=128\), that’s a reduction from 16.7 million elements to about 524K elements – roughly a 32x reduction in memory traffic for the attention intermediates.

Here’s the counterintuitive part: FlashAttention actually performs more FLOPs than standard attention. The online softmax rescaling adds extra multiply-accumulate operations. But wall-clock time drops substantially because the bottleneck was never the arithmetic – it was the HBM traffic. By trading a small increase in compute for a large decrease in memory movement, FlashAttention lands in a much better spot on the roofline.

FlashAttention applies to both prefill and decode. During prefill, the sequence length \(S\) is large (the full prompt), so the \(O(S^2)\) savings are dramatic. During decode, \(S\) is just 1 for the query (the new token), but the key and value sequences span the entire context, so the tiled approach still avoids unnecessary intermediate writes.

FlashAttention v1, v2, and v3

The original FlashAttention (Dao et al. 2022) established the tiling approach. Subsequent versions improved the implementation substantially:

FlashAttention-2 restructured the algorithm to improve GPU parallelism. The key change was swapping the inner and outer loops – iterating over \(Q\) blocks in the outer loop and \(K/V\) blocks in the inner loop. This reduced the amount of shared memory communication between warps (groups of 32 threads that execute together on the GPU) and improved occupancy on modern GPUs. The result was roughly a 2x speedup over v1.

FlashAttention-3 targeted the H100 specifically, exploiting its new hardware features. It introduced warp specialization – dedicating some warps to data loading while others perform computation, overlapping memory access and arithmetic within the same kernel. It also leveraged the H100’s FP8 tensor cores and hardware-assisted asynchronous memory operations. These changes pushed FlashAttention closer to the theoretical peak throughput on H100 hardware.

The progression across versions illustrates a broader principle: writing a correct tiled attention kernel is one thing, but extracting maximum performance from a specific GPU architecture requires deep hardware-aware engineering. This is why FlashAttention has become a de facto standard rather than something every team reimplements.

Note

FlashAttention handles the tiling of attention computation within a single GPU. When the KV sequence is very long, you may also want to parallelize the attention computation across the sequence dimension using multiple GPU cores or even multiple GPUs. This is the role of FlashDecoding, which we’ll cover in Section 7.6.

6.2 Kernel Fusion and Compute Graph Optimization

FlashAttention is a specific instance of a more general optimization strategy: reducing the number of times data makes a round trip to HBM by combining operations into fewer GPU kernels. This section covers the general techniques.

Kernel fusion

Every time the GPU finishes one kernel and starts the next, there’s an implicit data handoff through HBM. The output of the first kernel gets written to HBM, and the input of the second kernel gets read from HBM. If those two kernels operate on the same data, this round trip is pure waste.

Kernel fusion combines multiple operations into a single GPU kernel so that intermediate results stay in registers or SRAM. Common fusion opportunities in transformer inference include:

  • Fused QKV projection: instead of three separate matrix multiplications for \(Q\), \(K\), and \(V\), fuse them into a single kernel that reads the input activations once and produces all three outputs
  • Fused softmax + mask: apply the causal mask and softmax in one pass, avoiding a separate write-then-read for the masked scores
  • Fused add + LayerNorm: combine the residual addition and layer normalization, keeping the intermediate sum in registers
  • Fused RoPE + attention: apply rotary position embeddings to \(Q\) and \(K\) within the attention kernel itself

Each fusion eliminates one HBM round trip. For a single transformer layer, there might be 10 or more such opportunities. Across 80+ layers in a large model, these small savings add up to a meaningful reduction in total memory traffic and kernel launch overhead.
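The fused QKV projection is simple enough to illustrate with plain matrix algebra. NumPy here stands in for the GPU kernels; the point is that one matmul over concatenated weights reads the activations once yet yields the same three outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, S = 16, 8
x = rng.standard_normal((S, d_model))            # input activations
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))

# Unfused: three kernels, so x is read from memory three times.
q, k, v = x @ Wq, x @ Wk, x @ Wv

# Fused: one kernel over the concatenated weight, x is read once.
W_qkv = np.concatenate([Wq, Wk, Wv], axis=1)     # (d_model, 3 * d_model)
q2, k2, v2 = np.split(x @ W_qkv, 3, axis=1)

assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)
```

Real implementations fuse at the kernel level rather than by concatenating weight matrices, but the memory-traffic argument is the same: one read of the activations instead of three.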

CUDA graphs

We introduced CUDA graphs briefly in Section 3.4. Here’s the full picture.

During decode, the GPU executes the same sequence of kernels at every step – the same transformer layers in the same order with the same shapes (for a given batch size). But the CPU doesn’t know that. At each step, it launches each kernel individually, waits for the GPU to be ready, sends the launch command, and moves on to the next kernel. Each launch takes roughly 5–10 microseconds, which doesn’t sound like much, but a single decode step through a large model might involve hundreds of kernel launches. At 5 \(\mu\)s each, that’s hundreds of microseconds of pure overhead – time the GPU spends idle waiting for the CPU to tell it what to do next.

A CUDA graph captures a fixed sequence of kernel launches into a single replayable object. You run the decode step once in “capture mode,” which records every kernel launch, memory copy, and synchronization point. Then on subsequent steps, you replay the entire graph with a single command. The CPU issues one launch instead of hundreds, and the GPU executes the full sequence without waiting.

The main limitation is that CUDA graphs are static. The captured graph assumes fixed tensor shapes and memory addresses. If the batch size changes, or a request finishes and leaves the batch, the graph is invalidated and must be recaptured. Serving frameworks handle this with a small pool of pre-captured graphs for common batch sizes and sequence lengths, falling back to eager execution for uncommon shapes.

Computation graph compilation

Rather than manually fusing kernels, you can let a compiler do it. Computation graph compilers analyze the full model graph, identify fusion opportunities, optimize memory layouts, and emit optimized GPU code.

torch.compile in PyTorch traces the model’s computation graph and applies a series of optimizations including operator fusion, memory planning, and code generation through backends like Triton. For inference workloads, it can deliver significant speedups with minimal code changes.

TensorRT-LLM takes a more aggressive approach. It converts the model into NVIDIA’s TensorRT format, which applies extensive graph-level optimizations – layer fusion, precision calibration (auto-selecting FP16 or FP8 for each operation), kernel auto-tuning, and memory-efficient execution planning. The compilation step takes longer, but the resulting engine is heavily optimized for the specific model and GPU.

The tradeoff is flexibility versus performance. Eager execution (no compilation) is the most flexible but the slowest. torch.compile offers a middle ground. TensorRT-LLM produces the fastest code but requires a more involved build step and is harder to modify. Most production serving frameworks use some form of graph compilation, often TensorRT-LLM for NVIDIA hardware.

6.3 KV Cache Engineering

The KV cache is one of the biggest memory consumers during inference, and managing it well has a direct impact on how many requests you can serve concurrently and how long sequences can be. This section covers the core techniques for reducing KV cache memory pressure.

KV cache memory formula

For a model with \(L\) layers, \(n_{kv}\) KV heads per layer, and head dimension \(d_h\), each token adds the following to the KV cache:

\[\text{bytes per token} = 2 \times L \times n_{kv} \times d_h \times b_{\text{dtype}}\]

The factor of 2 accounts for both keys and values. \(b_{\text{dtype}}\) is the bytes per element (2 for FP16/BF16, 1 for FP8/INT8).

For a concrete example, consider a 70B parameter model with \(L = 80\) layers, \(n_{kv} = 8\) GQA heads (see Section 4.4), \(d_h = 128\), stored in FP16:

\[\text{bytes per token} = 2 \times 80 \times 8 \times 128 \times 2 = 327{,}680 \text{ bytes} \approx 320 \text{ KB}\]

For a sequence of 4,096 tokens, that’s about 1.28 GB per request. If you want to serve 32 concurrent requests, you need roughly 41 GB just for KV caches – and that’s before model weights, activations, and framework overhead. On an 80 GB GPU that’s already holding 35 GB of model weights (a 70B model in INT4), you’re over budget.
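The arithmetic is mechanical enough to script. A small calculator for the formula above, with the example model’s numbers plugged in at the bottom:

```python
def kv_bytes_per_token(layers, n_kv_heads, head_dim, dtype_bytes):
    """KV cache bytes added per token; the factor of 2 is keys + values."""
    return 2 * layers * n_kv_heads * head_dim * dtype_bytes

# The 70B example: 80 layers, 8 GQA KV heads, head dim 128, FP16 (2 bytes).
per_token = kv_bytes_per_token(80, 8, 128, 2)
per_request = per_token * 4096                 # a 4,096-token sequence
print(per_token)                               # 327680 bytes (320 KB)
print(round(per_request / 2**30, 2))           # 1.25 GiB per request
```

Swapping `dtype_bytes=2` for `dtype_bytes=1` shows the FP8/INT8 savings discussed later in this section: the per-token cost halves to 160 KB.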

This arithmetic is why KV cache management is critical. Every technique in this section – paging, quantization, compression, and eviction – is aimed at fitting more requests into the same memory budget.

PagedAttention

In a naive implementation, you allocate a contiguous block of GPU memory for each request’s KV cache sized for the maximum possible sequence length. This wastes memory in three ways, illustrated in Figure 6.3: reserved slots held for tokens that haven’t been generated yet, internal fragmentation (allocated slots that will never be used because the sequence won’t reach the maximum length), and external fragmentation (gaps between allocations that are too small to be useful).

PagedAttention (Kwon et al. 2023), introduced in vLLM, borrows the virtual memory model from operating systems. Instead of one big contiguous allocation per request, the KV cache is divided into fixed-size pages (also called blocks), typically holding 16 or 32 tokens each. A request’s KV cache is a linked list of pages that can be scattered anywhere in GPU memory.

The benefits are significant:

  • Near-zero fragmentation: the only wasted space is the partial last page of each request. With 16-token pages, average waste is 8 tokens per request – negligible compared to the old approach of reserving memory for the maximum sequence length.
  • Dynamic allocation: pages are allocated on demand as the sequence grows, so a short request doesn’t tie up memory it will never use.
  • Sharing: two requests that share a common prefix (same system prompt, for instance) can point to the same physical pages for the shared portion, storing the data once. This is the foundation for prefix caching, which we’ll cover in Section 6.4.
  • Fine-grained preemption: if the scheduler needs to free memory (Section 5.4), it can evict individual pages rather than killing an entire request. Pages can be swapped to CPU memory and swapped back later.

PagedAttention has become the standard approach for KV cache memory management. Nearly every major serving framework – vLLM, SGLang, TensorRT-LLM – uses some variant of it.
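A toy allocator conveys the mechanics without any attention math. The class and method names below are hypothetical; the sketch tracks a free list, per-request page tables, and reference counts, and the `fork` method is the prefix-sharing hook that Section 6.4 builds on:

```python
class PagedKVAllocator:
    """Toy page-table allocator in the spirit of PagedAttention."""

    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free = list(range(num_pages))   # physical page ids
        self.refcount = {}                   # page id -> number of users
        self.tables = {}                     # request id -> its page table

    def alloc(self, req_id, num_tokens):
        """Allocate just enough pages to hold num_tokens."""
        n = -(-num_tokens // self.page_size)          # ceiling division
        pages = [self.free.pop() for _ in range(n)]
        for p in pages:
            self.refcount[p] = 1
        self.tables[req_id] = pages

    def fork(self, parent_id, child_id):
        """Share the parent's pages with a new request (shared prefix)."""
        self.tables[child_id] = list(self.tables[parent_id])
        for p in self.tables[child_id]:
            self.refcount[p] += 1

    def release(self, req_id):
        """Drop a request; free each page once nobody references it."""
        for p in self.tables.pop(req_id):
            self.refcount[p] -= 1
            if self.refcount[p] == 0:
                del self.refcount[p]
                self.free.append(p)
```

A 100-token request needs ⌈100/16⌉ = 7 pages, and the only waste is the 12 unused slots in its last page – the near-zero fragmentation property from the list above.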

KV cache quantization

The KV cache formula above assumed FP16 storage. If you store keys and values in INT8 or FP8 instead, you immediately halve the memory per token. For our 70B model example, that drops per-request KV cache memory from 1.28 GB to 640 MB at 4K sequence length.

KV cache quantization is separate from weight quantization (Section 4.1). The model weights can be in one precision while the KV cache is stored in another. In practice, many deployments run model weights in INT4 or FP8 while keeping the KV cache in FP8 or INT8 – different components have different sensitivity to precision loss.

The quality impact is generally small. Keys and values are intermediate activations, not learned parameters, and small quantization errors in individual KV entries tend to average out across the many entries involved in an attention computation. The tradeoff is straightforward: half the memory for a small and often negligible reduction in output quality.
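A minimal sketch of symmetric per-tensor INT8 quantization makes the tradeoff tangible. Real systems typically use finer-grained scales (per-head or per-channel); this simplification is for illustration only:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8: one byte per element plus one scale."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate values for use in attention."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.standard_normal(4096).astype(np.float16)   # a slice of FP16 KV cache
q, scale = quantize_int8(kv)
assert q.nbytes * 2 == kv.nbytes                    # half the memory
```

The worst-case reconstruction error per element is about half the scale, which is the "small quantization errors in individual KV entries" that the attention-weighted sum then averages out.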

One research paper garnering a lot of interest is TurboQuant (Zandieh et al. 2025). Rather than simply choosing a new datatype, it uses techniques like rotations and vector quantization to compress entries. The authors report success in squeezing entries down to 3.5 bits without hurting LLM performance. Given that decoding is bandwidth-bound, reductions in KV cache size provide close to linear speedups.

KV cache compression

Rather than quantizing individual elements, you can compress the KV cache using a learned low-rank representation. The idea is to store a compressed version of the keys and values that uses fewer dimensions, then reconstruct the full representation when needed for attention.

Multi-Head Latent Attention (MLA), introduced in DeepSeek-V2 (Liu et al. 2024), is the most prominent example. Instead of caching separate key and value tensors for each head, MLA compresses them into a single low-rank latent representation. This can reduce KV cache size by 5–10x compared to standard multi-head attention, at the cost of additional compute to decompress during attention. The key difference from the attention architecture changes covered in Section 4.4 (like GQA and MQA) is that MLA achieves compression without reducing the number of effective attention heads – it maintains full representational capacity while storing less data.

Selective KV cache and token eviction

Not all tokens in the context are equally important for generating the next token. In many attention patterns, a small fraction of tokens receive the vast majority of attention weight, while most tokens contribute very little. This observation motivates selective KV cache strategies that drop or compress low-importance entries.

Keyformer (Adnan et al. 2024) identifies important tokens based on accumulated attention scores. Tokens that consistently receive high attention weight across layers and heads are retained, while low-scoring tokens are evicted. The retained set is refreshed periodically as the model generates more tokens and attention patterns shift.
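A toy eviction policy shows the shape of these heuristics. The sketch keeps a sliding window of recent tokens plus the highest accumulated-attention tokens among the rest – a simplified, Keyformer-flavored score (the actual method involves additional machinery such as score regularization and per-head statistics):

```python
def select_kv_tokens(acc_attention, keep, window):
    """Return the sorted positions of cached tokens to retain.

    acc_attention: accumulated attention weight per cached token.
    keep: how many high-scoring older tokens to retain.
    window: how many most-recent tokens are always retained.
    """
    S = len(acc_attention)
    recent = set(range(max(0, S - window), S))     # always keep the tail
    older = [i for i in range(S) if i not in recent]
    older.sort(key=lambda i: acc_attention[i], reverse=True)
    return sorted(recent | set(older[:keep]))
```

Tuning `keep` and `window` is exactly the conservative-versus-aggressive tradeoff discussed below: larger values are safer but save less memory.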

Native Sparse Attention (NSA) (Yuan et al. 2025) from DeepSeek takes a different approach, designing the sparsity pattern into the model architecture itself. Rather than using a learned or heuristic eviction policy at inference time, NSA defines structured sparse attention patterns that combine compressed coarse-grained tokens, selected high-relevance tokens, and a sliding window of recent tokens. This approach avoids the quality risk of post-hoc eviction because the model is trained to work with the sparse pattern.

The general tradeoff with token eviction is that it reduces memory but introduces risk. If the eviction policy drops a token that turns out to be important for a later generation step, the output quality degrades. Conservative eviction (keeping more tokens) is safer but saves less memory. The most robust approach – as NSA demonstrates – is to build the sparsity into the model during training rather than bolting it on at inference time.

6.4 Prompt and Prefix Caching

The shared prefix opportunity

In many real-world deployments, a large fraction of requests share the same prefix. A chatbot application might prepend a 2,000-token system prompt to every user message. A RAG pipeline might prefix retrieved context that is identical across multiple queries. A coding assistant might include the same repository-level context for every completion request.

Without caching, the system runs prefill on those 2,000 shared tokens for every single request – computing the same key-value pairs over and over. If you’re serving 100 requests per second, that’s 200,000 tokens of redundant prefill per second. The wasted compute shows up directly as higher TTFT and lower throughput.

RadixAttention

RadixAttention (Zheng et al. 2023), introduced in the SGLang framework, stores KV cache entries in a radix tree keyed by token sequences. A radix tree is a prefix-compressed trie: shared prefixes are stored once, and branches diverge only where token sequences differ.

When a new request arrives, the system walks the radix tree matching its token sequence. If the first 2,048 tokens match a cached prefix, those KV entries are reused directly – no prefill needed for them. The system only runs prefill on the remaining tokens that diverge from the cached prefix. For a request with a 2,000-token system prompt and a 200-token user message, this cuts prefill work by roughly 90%.

The radix tree structure enables fine-grained prefix matching that goes beyond simple exact-prefix caching. Two requests that share a system prompt but have different few-shot examples will still reuse the system prompt portion. A multi-turn conversation reuses all KV entries from previous turns.
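The matching logic is easy to sketch with a plain trie over token ids (a radix tree is the same structure with single-child chains collapsed, and real implementations add eviction; the class and method names here are hypothetical):

```python
class _Node:
    __slots__ = ("children", "cached")

    def __init__(self):
        self.children = {}    # token id -> child node
        self.cached = False   # KV entries exist for the prefix ending here

class PrefixCache:
    """Toy prefix cache: insert served sequences, match new requests."""

    def __init__(self):
        self.root = _Node()

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, _Node())
            node.cached = True

    def longest_cached_prefix(self, tokens):
        """Number of leading tokens whose KV entries can be reused."""
        node, matched = self.root, 0
        for t in tokens:
            child = node.children.get(t)
            if child is None or not child.cached:
                break
            node, matched = child, matched + 1
        return matched
```

Inserting a served request’s token sequence and then matching a new request that shares its system prompt returns exactly the length of the shared portion; only the remaining tokens need prefill.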

Prompt caching in production APIs

The radix tree approach is what happens inside the inference engine. At the API level, prompt caching is often exposed as an explicit feature. When prefix caching is enabled, a provider hashes the prefix tokens and looks up cached KV entries. If found, the cached entries are reused and the customer pays reduced cost for the cached portion.

This is a win for both sides: the user gets faster TTFT and lower cost, and the provider avoids redundant compute. The effectiveness depends on how much prefix overlap exists in the workload – applications with long, stable system prompts benefit enormously, while applications with unique prompts for every request see little benefit.

Implementation dependencies

Prefix caching depends heavily on the paged KV cache from Section 6.3. In the PagedAttention model, the shared prefix pages are simply referenced by multiple requests through their page tables. The pages themselves are stored once in GPU memory, and a reference count tracks how many requests are using each page. When the last request using a shared prefix completes, the pages can be freed or kept for future reuse based on an eviction policy (typically LRU).

Without paged memory, sharing would require contiguous memory layouts to be identical across requests, which is impractical. PagedAttention makes prefix sharing a natural consequence of the memory model.

6.5 Speculative Decoding and Multi-Token Prediction

Everything up to this point has optimized the cost of individual prefill or decode steps. But there’s a more fundamental constraint we haven’t attacked yet: the autoregressive loop itself. Standard decoding produces one token per forward pass, and each token depends on the previous one. For a 500-token response, that’s 500 sequential forward passes through the entire model. Even if each pass is perfectly optimized, you’re still bound by the serial dependency chain.

This section covers techniques that break or amortize that sequential constraint – generating multiple tokens per model invocation.

The core idea of speculative decoding

Speculative decoding (Leviathan et al. 2022; Chen et al. 2023) is based on a simple but powerful observation: verifying a sequence of tokens is cheaper than generating them one at a time.

Here’s the setup. You have a large target model – the model you actually want to sample from – and a small, fast draft model. The draft model generates \(k\) candidate tokens autoregressively (cheap, because the draft model is small). Then the target model processes all \(k\) candidates in a single forward pass, checking whether it would have generated the same tokens. Accepted tokens are kept. The first rejected token is resampled from the target model’s distribution. Then the cycle repeats.

The reason this works is the asymmetry between generation and verification. Generating \(k\) tokens with the target model requires \(k\) sequential forward passes. But verifying \(k\) tokens requires just one forward pass, because you can feed all \(k\) tokens as input and check them all in parallel – the same way prefill processes multiple tokens at once. The early related work on this idea is Blockwise Parallel Decoding (Stern et al. 2018).
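A toy version with greedy (temperature-0) models captures the draft-then-verify cycle. Here `draft` and `target` are stand-in functions from a token prefix to the next token, and the verification loop stands in for the single batched forward pass:

```python
def speculative_step(target, draft, prefix, k):
    """One draft/verify cycle; returns the newly generated tokens.

    With greedy models, acceptance means the draft token equals the
    target's prediction; on the first mismatch the target's own token
    is kept instead, so every cycle emits at least one correct token.
    """
    # Draft phase: k cheap sequential calls to the small model.
    ctx = list(prefix)
    proposal = []
    for _ in range(k):
        t = draft(ctx)
        proposal.append(t)
        ctx.append(t)

    # Verify phase: the target checks all k positions at once.
    ctx = list(prefix)
    out = []
    for t in proposal:
        expected = target(ctx)
        if expected != t:
            out.append(expected)      # reject: take the target's token
            return out
        out.append(t)                 # accept
        ctx.append(t)
    out.append(target(ctx))           # all k accepted: one bonus token
    return out
```

With a perfect draft each cycle yields k + 1 tokens for one target "pass"; with a useless draft it still yields one, and in both cases the greedy output matches plain autoregressive decoding of the target exactly.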

Acceptance rate and expected speedup

The acceptance rate \(\alpha\) is the probability that the target model accepts a draft token. It depends on how well the draft model approximates the target model’s distribution for the current context and task. Typical acceptance rates range from 0.6 to 0.9, depending on the draft model quality and the generation temperature.

The expected number of tokens generated per verification cycle is:

\[E[\text{tokens per cycle}] = \frac{1 - \alpha^{k+1}}{1 - \alpha}\]

where \(k\) is the number of draft tokens. For \(\alpha = 0.8\) and \(k = 4\), this gives about 3.4 tokens per cycle on average.

The speedup depends on the relative cost of the draft and target models. If the draft model runs at \(c\) times the speed of the target model (so a draft forward pass takes \(1/c\) the time of a target forward pass), the wall-clock speedup is approximately:

\[\text{speedup} \approx \frac{E[\text{tokens per cycle}]}{1 + k/c}\]

The denominator accounts for the cost of the \(k\) draft passes (each costing \(1/c\) of a target pass) plus the one verification pass. For \(c = 10\) (the draft model is 10x faster), \(k = 4\), and \(\alpha = 0.8\), the speedup is roughly \(3.4 / 1.4 \approx 2.4\text{x}\).
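Both formulas are a few lines of Python, and plugging in the numbers from the text reproduces them:

```python
def expected_tokens_per_cycle(alpha, k):
    """Expected accepted tokens per verification cycle (geometric sum)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def wallclock_speedup(alpha, k, c):
    """k draft passes at 1/c the cost each, plus one verification pass."""
    return expected_tokens_per_cycle(alpha, k) / (1 + k / c)

print(round(expected_tokens_per_cycle(0.8, 4), 2))   # 3.36 tokens per cycle
print(round(wallclock_speedup(0.8, 4, 10), 2))       # 2.4x
```

Sweeping `alpha` and `k` with these functions shows the diminishing returns of longer drafts: past the point where \(\alpha^k\) is small, extra draft tokens add cost in the denominator without adding much to the numerator.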

A crucial property of speculative decoding is that it produces exactly the same distribution as the target model. The acceptance/rejection mechanism is designed so that the output distribution is identical to what you’d get from running the target model autoregressively. This is not an approximation – it’s a mathematically exact speedup.

Tree-based speculative decoding

Instead of generating a single chain of \(k\) draft tokens, the draft model can produce a tree of candidates. At each position, the draft model proposes multiple alternatives, branching into different possible continuations.

The target model verifies the entire tree in a single forward pass using a carefully constructed attention mask. The longest accepted path through the tree becomes the output. Because the tree explores multiple branches, it has a higher chance of finding a long accepted sequence than a single chain does, especially when individual token acceptance rates are moderate.

The cost is that the tree contains more total tokens than a single chain, so the verification forward pass does more work. But since the verification step is typically memory-bandwidth-bound (it’s essentially a prefill over the tree tokens), the marginal cost of additional tree tokens is often small.

Inference with reference

Inference with reference (Yang et al. 2023) is a specialized form of speculative decoding where the “draft” comes from an existing text – a retrieved document, a cached previous response, or template text. If the target model’s likely output overlaps significantly with the reference text, you can propose long spans of reference tokens as candidates and verify them in a single forward pass.

This works especially well for tasks like text editing, summarization (where the output may quote the input), or regenerating a cached response with minor modifications. The acceptance rate can be very high – often above 0.95 for close matches – making this one of the most efficient speculative decoding variants when it applies.

Speculative speculative decoding

Standard speculative decoding has a subtle inefficiency: the draft model must wait for the target model to complete verification before it can start drafting the next batch of tokens. This is because the draft model needs to know which tokens were accepted so it can condition on the correct prefix.

Speculative speculative decoding (Kumar et al. 2026) eliminates this dependency. The draft model speculatively starts generating the next batch of candidates before the target model finishes verifying the current batch. It does this by assuming all current draft tokens will be accepted and conditioning on that optimistic prefix. If some tokens are rejected, the speculative draft work is discarded and redone – but when acceptance rates are high, the speculation pays off, and the draft and target models can operate in a pipelined fashion with better hardware utilization.

When speculative decoding helps vs. hurts

Speculative decoding is not a universal win. Several factors determine whether it helps:

Temperature: at low temperatures (greedy or near-greedy sampling), the draft model’s predictions are more likely to match the target model, giving higher acceptance rates. At high temperatures, the target model’s distribution is more spread out, making it harder for the draft to guess correctly. Speculative decoding works best for deterministic tasks like code generation and factual Q&A.

Draft model quality: a better draft model means higher acceptance rates and more tokens per cycle. But better draft models are also larger and slower, which cuts into the speedup. There’s a sweet spot where the draft model is good enough to maintain high acceptance rates but small enough to be much faster than the target.

Batch size: this is the most important practical consideration. Speculative decoding’s benefit comes from replacing sequential target-model forward passes with one parallel verification pass. But in high-throughput serving with large batch sizes, the target model’s forward passes are already efficient – the GPU is well-utilized because it’s processing many requests at once. Adding draft model overhead on top of an already-efficient pipeline can actually hurt throughput. Speculative decoding shines in low-batch-size, latency-sensitive scenarios.

Multi-token prediction

Multi-token prediction (MTP) (Gloeckle et al. 2024) modifies the model architecture to predict multiple future tokens at each position. In addition to the standard next-token prediction head, the model has auxiliary heads that predict the token 2, 3, or \(k\) steps ahead. During training, all heads are trained simultaneously with a shared loss.

At inference time, the auxiliary heads generate candidate tokens for positions beyond the next token. These candidates are then verified – either through additional forward passes or by treating them as draft tokens in a speculative decoding framework. When the candidates are accepted, the model has effectively generated multiple tokens from a single forward pass through the main body of the network.

DeepSeek-V3 uses MTP with two prediction heads, and Meta’s Llama architecture has explored similar approaches. The key advantage of MTP over external draft models (covered above) is that the draft candidates come from the model itself, so they tend to have high acceptance rates. The cost is a modest increase in model size and training complexity from the auxiliary heads.

6.6 Parallel Decoding

Speculative decoding uses a separate draft model (or separate heads) to propose candidates. Parallel decoding methods take a different approach: they generate multiple future tokens using the base model itself, without any auxiliary model.

Lookahead decoding

Lookahead decoding (Fu et al. 2024) reformulates autoregressive generation as a fixed-point problem. Instead of generating tokens one at a time, it maintains a window of \(W\) future token positions, initialized with guesses. Then it iteratively refines all positions in parallel using Jacobi iteration: at each step, every position computes what its token should be given the current values of all other positions. When the sequence converges – that is, when a contiguous run of positions doesn’t change between iterations – those tokens are accepted.

The intuition is that many tokens in a sequence are relatively predictable given their context. Common phrases, syntactic patterns, and function words often converge quickly. The tokens that take many iterations to converge are the “surprising” ones – the genuinely creative or informative tokens.

Lookahead decoding requires no draft model, no extra parameters, and no changes to the base model. The tradeoff is that each Jacobi iteration involves a forward pass over \(W\) token positions, which is more expensive than a single-token decode step. The method wins when enough tokens converge quickly to offset the cost of processing the wider window.

Medusa

Medusa (Cai et al. 2024) adds multiple lightweight decoding heads to the base model. The standard model has one head that predicts the next token. Medusa adds \(k\) additional heads, where head \(i\) predicts the token \(i+1\) positions ahead. These heads are small (typically just a single linear layer or a small MLP) and are trained on top of the frozen base model.

At inference time, all \(k+1\) heads produce predictions simultaneously from the same hidden state. The candidates from all heads are organized into a tree structure and verified using the tree-based attention approach described in Section 6.5. The longest accepted path through the tree gives the output.
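
The candidate-tree construction can be sketched as follows. The token IDs and top-\(s\) widths are made up for illustration, and the full Cartesian product shown here is a simplification: real Medusa prunes to a tuned sparse tree rather than enumerating every combination.

```python
from itertools import product

# Illustrative top-s candidates from each head, all computed from the same
# hidden state in a single forward pass of the frozen base model.
top_candidates = [
    [12, 7],   # standard head: top-2 next-token candidates
    [3, 44],   # Medusa head 1: top-2 candidates two positions ahead
    [9],       # Medusa head 2: top-1 candidate three positions ahead
]

# Every root-to-leaf path is one possible continuation. All paths are packed
# into one sequence and verified together under a tree-structured causal mask.
paths = [list(p) for p in product(*top_candidates)]
print(len(paths), paths)
```

Even this tiny tree shows the combinatorics: two heads with top-2 and one with top-1 already yield four candidate continuations to verify in one pass, which is why pruning the tree matters as \(k\) and \(s\) grow.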

Medusa has a practical advantage over speculative decoding with a separate draft model: since the heads share the base model’s forward pass, there’s no need to run a separate model. The additional compute for the lightweight heads is small compared to the main model’s forward pass. The disadvantage is that the heads need to be trained for each specific base model, adding a fine-tuning step.

Tree-based verification as a unifying framework

Looking across the techniques in these last two sections, a common pattern emerges. Multi-token prediction, speculative decoding with a draft tree, Medusa, and even lookahead decoding all follow the same high-level structure:

  1. Propose: generate multiple candidate tokens using some cheap mechanism (draft model, auxiliary heads, Jacobi iteration, or reference text)
  2. Organize: arrange candidates into a tree (or chain) of possible continuations
  3. Verify: run the target model on the full tree in a single forward pass, using the causal mask to check all candidates in parallel
  4. Accept: take the longest prefix (or longest path through the tree) where the target model agrees with the proposals

The differences between techniques lie entirely in step 1 – how candidates are generated. The verification and acceptance steps are the same across all methods. This unifying view makes it easier to reason about the tradeoffs: the quality and cost of the proposal mechanism determine the acceptance rate and overhead, while the verification step is always a single forward pass whose cost depends on the total number of candidate tokens.
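
The verify-and-accept steps shared by all of these methods can be sketched for a single candidate chain. Both `target_next` (a toy deterministic "target model") and the greedy-agreement rule are illustrative assumptions; this shows exact-match acceptance for greedy decoding, not the rejection-sampling rule that preserves the target's sampling distribution.

```python
def target_next(prefix):
    # Toy greedy "target model": a deterministic function of the prefix.
    return (sum(prefix) * 5 + 3) % 11

def verify_chain(prefix, proposed):
    """Steps 3-4 for one chain of candidates: compare each proposal with the
    target's greedy choice and keep the longest agreeing prefix. In a real
    system all positions are scored in one batched forward pass under the
    causal mask; the Python loop here only mimics the acceptance rule."""
    accepted, ctx = [], list(prefix)
    for tok in proposed:
        want = target_next(ctx)
        if tok != want:
            accepted.append(want)  # first disagreement: keep the target's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # all accepted: one free bonus token
    return accepted

prefix = [1, 2]
print(verify_chain(prefix, [7, 9]))  # proposals the target agrees with
print(verify_chain(prefix, [7, 0]))  # second proposal rejected
```

Note that even a fully rejected chain still makes progress: the first disagreement point yields the target model's own token, so each verification pass emits at least one token.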

This also explains why these techniques interact with batch size the same way. The verification step is essentially a small prefill – it processes multiple tokens in parallel. At large batch sizes, the GPU is already busy with many requests, so the marginal cost of the extra verification tokens is higher relative to the benefit. At small batch sizes, the GPU has spare capacity, and the extra tokens come nearly for free.


The techniques in this chapter operate at the level of individual requests – optimizing attention kernels, managing KV cache memory, and breaking the sequential decoding constraint. In Chapter 7, we’ll move up a level and consider what happens when a single GPU isn’t enough: how to distribute the model across multiple devices while keeping communication overhead manageable.

6.7 Further Reading

Attention kernels. The FlashAttention series of papers (Dao et al. 2022; Dao 2023; Shah et al. 2024) is essential reading for understanding how IO-aware algorithm design can outperform naive implementations even when the naive version does fewer FLOPs. The original FlashAttention paper is notable for its clear presentation of the tiling strategy and the online softmax trick. FlashAttention-2 (Dao 2023) improves parallelism across the sequence dimension and reduces non-matmul FLOPs. FlashAttention-3 (Shah et al. 2024) targets Hopper-generation GPUs specifically, exploiting asynchronous memory transfers and FP8 computation. Tri Dao’s blog posts accompanying these papers are also worth reading for accessible explanations of the algorithms.

KV cache management. The PagedAttention paper (Kwon et al. 2023) is the key reference for understanding KV cache fragmentation and the virtual-memory solution. The paper’s insight – that contiguous pre-allocation wastes 60-80% of KV cache memory in practice – motivates the entire paged approach. For KV cache compression through token eviction, Keyformer (Adnan et al. 2024) introduces score-based key token selection, and NSA (Yuan et al. 2025) combines sparse attention with compressed inter-block summaries.

Prefix caching. SGLang’s RadixAttention (Zheng et al. 2023) generalized prefix caching from simple exact-match lookups to a radix-tree-based system that automatically identifies and shares common prefixes across requests. The paper is worth reading both for the caching mechanism and for its treatment of multi-call LLM programs, which create particularly large prefix-sharing opportunities.

Speculative decoding. The two foundational papers – Leviathan et al. (2022) and Chen et al. (2023) – were developed independently and published almost simultaneously. Both prove that speculative decoding preserves the target model’s output distribution exactly, which is a surprising and important property. Leviathan et al. provide a cleaner theoretical treatment, while Chen et al. (from DeepMind) include more practical implementation details. For tree-based speculative decoding, Cai et al. (2024) extend the idea with multiple lightweight decoding heads and tree-structured verification. Stern et al. (2018) is an earlier precursor that explored the same draft-then-verify idea in the context of machine translation. For lookahead decoding as an alternative that requires no draft model, see Fu et al. (2024).