5 Scheduling Bottlenecks
The previous chapter focused on making the model itself smaller and faster – fewer parameters, lower precision, more efficient attention mechanisms. Those techniques change what gets computed. This chapter drills into how requests flow through the system. Even with a well-optimized model, a naive scheduler can leave the GPU sitting idle while requests wait in line, waste compute on padding tokens that contribute nothing, or let a single long prefill block dozens of shorter requests from making progress.
The techniques here — batching strategies, prefill management, and request scheduling — act on the serving system layer. They’re about keeping the GPU busy doing useful work as much of the time as possible, regardless of what model is being used.
5.1 Batching Strategies
Why batching matters
In Section 3.4, we established that decode is deeply memory-bandwidth-bound. Each decode step reads the entire set of model weights from HBM to produce just one token per request. On an H100 SXM with a 70B parameter model in FP16, that’s roughly 140 GB of weight data read through 3.35 TB/s of bandwidth – about 42 ms just to stream the weights, and all of that work produces a single token.
Batching changes this equation. When you process B requests in a single decode step, you still read the weights once, but you apply them to B token vectors instead of one. The compute scales linearly with B, but the data movement stays roughly constant (dominated by the weight read). This directly increases the arithmetic intensity – the ratio of FLOPs to bytes moved – pushing decode toward the compute-bound side of the roofline.
Let’s put concrete numbers on this. For a single linear layer with weight matrix of shape (D, D) in FP16:
- Data moved: \(2D^2\) bytes (reading the weight matrix)
- Compute: \(2BD^2\) FLOPs (\(B\) matrix–vector multiplications)
- Arithmetic intensity: \(2BD^2 / 2D^2 = B\) FLOP/byte
Recall Figure 3.7.
With batch size 1, the arithmetic intensity is about 1 FLOP/byte – far below the H100’s FP16 ridge point of ~295 FLOP/byte. The GPU’s tensor cores are almost entirely idle, waiting for data to arrive from HBM. With batch size 64, the intensity rises to 64 FLOP/byte. Still below the ridge point, but now the GPU is doing 64x more useful work per byte read. With batch size 256, we’re approaching the ridge point and the GPU’s compute units are finally getting busy.
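The roofline arithmetic above can be checked with a few lines of code. This is a sketch using the chapter's H100 SXM figures (3.35 TB/s HBM bandwidth and a ~295 FLOP/byte FP16 ridge point, which together imply roughly 989 TFLOP/s of peak FP16 compute):

```python
# Arithmetic intensity of a (D, D) FP16 linear layer at batch size B, and
# the resulting roofline-limited throughput. Hardware numbers are the H100
# SXM figures used in this chapter's examples.

PEAK_FLOPS = 989e12           # FP16 tensor-core peak, FLOP/s
HBM_BW = 3.35e12              # HBM bandwidth, bytes/s
RIDGE = PEAK_FLOPS / HBM_BW   # ridge point, ~295 FLOP/byte

def intensity(B: int) -> float:
    """FLOPs per byte for B matrix-vector products against one (D, D) weight."""
    # (2 * B * D^2 FLOPs) / (2 * D^2 bytes) = B, independent of D.
    return float(B)

def attainable_tflops(B: int) -> float:
    """Roofline model: min(peak compute, bandwidth * arithmetic intensity)."""
    return min(PEAK_FLOPS, HBM_BW * intensity(B)) / 1e12

for B in (1, 64, 256):
    print(B, intensity(B), round(attainable_tflops(B), 1))
```

At batch size 1 the attainable throughput is only ~3.35 TFLOP/s (0.3% of peak); at 256 it climbs to ~858 TFLOP/s, approaching the ridge point.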
The throughput gain is dramatic. A system decoding one request at a time on an H100 might produce ~30 tokens/s. The same system decoding a batch of 64 requests might produce ~1500 tokens/s total – roughly the same latency per step, but 50x the throughput. This is why batching is the single most important lever for inference cost efficiency.
The memory cost of batching
There’s a catch, and it’s a big one. Each request in the batch needs its own KV cache – the stored key and value tensors from every layer, accumulated across every token processed so far. As we’ll cover in detail in Section 6.3, the KV cache for a single request can consume several gigabytes for long sequences with large models.
If your model weights occupy 30 GB on an 80 GB GPU, you have roughly 50 GB left for KV caches (minus activation memory and framework overhead). How many requests fit depends on the model architecture, the KV cache precision, and the sequence lengths involved. A batch of 64 requests, each with 2,048 tokens of context, might consume far more than 50 GB of KV cache – at which point you simply can’t fit that many requests.
This is the central batching tradeoff: larger batches improve GPU utilization and throughput, but each additional request in the batch consumes KV cache memory. Push the batch size too high and you’ll run out of GPU memory. Keep it too low and you’re leaving performance on the table. The memory management techniques in Section 6.3 and the preemption policies we’ll discuss at the end of this chapter exist to navigate this tradeoff.
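To see how quickly KV cache can blow through the budget, here is a back-of-the-envelope sizing sketch. The architecture numbers (80 layers, 64 attention heads of dimension 128, full multi-head attention with no KV-head sharing, FP16 cache) are illustrative assumptions for a 70B-class model, not any specific checkpoint:

```python
# Back-of-the-envelope KV cache sizing. The architecture numbers are
# illustrative assumptions (70B-class model, full multi-head attention,
# FP16 cache), not any specific model.

def kv_bytes_per_token(n_layers=80, n_kv_heads=64, head_dim=128, dtype_bytes=2):
    # 2 tensors per layer (K and V), each with n_kv_heads * head_dim values.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def batch_kv_gb(batch_size, seq_len, **kw):
    return batch_size * seq_len * kv_bytes_per_token(**kw) / 1e9

per_tok = kv_bytes_per_token()     # ~2.6 MB of cache per token
total = batch_kv_gb(64, 2048)      # 64 requests x 2,048 tokens of context
print(per_tok, round(total, 1))    # far more than a ~50 GB budget
```

With these assumptions, 64 requests at 2,048 tokens each need roughly 344 GB of KV cache. Grouped-query attention shrinks this proportionally to the number of KV heads (8 KV heads instead of 64 brings it down to ~43 GB), which is why the achievable batch size depends so strongly on architecture.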
Example workload
To compare the batching strategies that follow, we’ll use the same small workload throughout: five requests arriving at staggered times, each needing 2 prefill steps and a varying number of decode steps.
| Request | Arrival time | Decode steps |
|---|---|---|
| A | 0 | 4 |
| B | 1 | 6 |
| C | 1 | 3 |
| D | 4 | 4 |
| E | 5 | 3 |
This is a toy example. In practice, prefill time depends on input length and can vary enormously across requests, queuing delays can stretch to seconds under heavy load, and decode runs for hundreds or thousands of steps. We’ve flattened all of that to keep the diagrams readable. The point is to show how each strategy handles staggered arrivals and variable-length outputs, not to depict realistic timescales.
Static batching
The simplest batching strategy is static batching: collect a fixed number of requests, process them all together, and don’t start the next batch until every request in the current batch has finished generating.
Figure 5.1 shows our workload with a static batch size of 4 requests. Requests A through D start work at time step 4 and finish at time step 11. At time step 12, Request E is still waiting for three more requests to arrive, and could finish much later. Several problems are visible:
- The GPU sits idle for the first four time steps even though we have requests in hand, because we don't yet have a full batch.
- Requests A, C, and D finish early, but their batch slots sit empty. The GPU does no useful work in those slots while it waits for Request B to complete. This problem, where a single slow request holds up the entire batch, is known as head-of-line blocking. Imagine how much worse it would be if B were a very long request generating thousands of tokens.
- Request E is stuck in the queue even though batch slots are free starting at time step 9.
- There is also a less obvious prefill problem. If requests with different input lengths are prefilled together (as A, B, C, and D are here), the shorter inputs must be padded to match the longest one, wasting compute on dummy tokens during prefill.
Static batching works fine for offline workloads where all inputs and outputs have uniform length – say, scoring a dataset of fixed-length classification examples. Having the entire dataset available at the beginning, instead of dealing with unknown arrival times, makes scheduling much easier. For anything with variable-length inputs, variable-length outputs, and stochastic arrival times – the common case in online LLM generation – it wastes too many GPU cycles.
Dynamic batching
Dynamic batching improves on static batching by allowing the batch size to vary. Instead of fixing the batch size up front, the scheduler collects requests from a queue and forms a batch based on what’s available, subject to a maximum batch size and a timeout. If the batch size is 32 and only 5 requests are waiting, once the timeout elapses, a batch with 5 requests will start processing, rather than waiting for a full batch of 32. This reduces queuing delay and adapts to variable arrival rates.
Figure 5.2 shows the results of dynamic batching on the sample workload in Table 5.1 for a maximum batch size of 4 and a timeout of 2 time steps. Now, the first batch starts at time step 2 and finishes requests A, B, and C by time step 9. Requests D and E start immediately after Batch 1 finishes. These two requests start at time step 10 and finish at time step 15. Compared to static batching, we have reduced the amount of waiting before Batch 1 starts. But it still has the same fundamental problem as static batching during decode: the batch doesn’t complete until the last request finishes generating. Short-output requests leave batch slots that sit idle while they wait for long-output requests to wrap up, so a really long Request B still wastes a lot of GPU capacity. In our example, requests D and E stay queued until time step 10 even though the GPU has capacity starting at time step 7.
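The completion times quoted for Figures 5.1 and 5.2 can be reproduced with a toy simulator. This is a sketch under the chapter's simplifications: prefill occupies exactly 2 time steps, batches run one at a time, a batch finishes when its longest request does, and time steps are inclusive (a batch starting at step s with d max decode steps ends at step s + 2 + d - 1):

```python
# Toy simulator for static and dynamic batching on the Table 5.1 workload.
# Simplifications from the text: prefill takes 2 steps, one batch runs at a
# time, and a batch ends when its longest request finishes. Time steps are
# inclusive, so a batch starting at step s ends at s + 2 + max_decode - 1.

PREFILL = 2
workload = {  # name: (arrival, decode_steps), from Table 5.1
    "A": (0, 4), "B": (1, 6), "C": (1, 3), "D": (4, 4), "E": (5, 3),
}

def run_batches(batches):
    """Run batches back-to-back; each starts when formed and the GPU is free."""
    finish, gpu_free = {}, 0
    for names, formed_at in batches:
        start = max(formed_at, gpu_free)
        end = start + PREFILL + max(workload[n][1] for n in names) - 1
        for n in names:
            finish[n] = end
        gpu_free = end + 1
    return finish

# Static (batch size 4): wait until 4 requests have arrived (D at step 4).
static = run_batches([(["A", "B", "C", "D"], 4)])

# Dynamic (max 4, timeout 2): the timeout fires at step 2 with {A, B, C};
# D and E form the next batch, which must wait for the GPU to free up.
dynamic = run_batches([(["A", "B", "C"], 2), (["D", "E"], 7)])
print(static, dynamic)
```

The output matches the figures: the static batch finishes at step 11, dynamic Batch 1 finishes A, B, and C at step 9, and D and E run from step 10 to step 15.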
Continuous batching
The breakthrough that modern serving systems are built on is continuous batching, also called in-flight batching, introduced by Orca (Yu et al. 2022). The key insight is to operate at the granularity of individual decode iterations rather than complete requests.
In continuous batching, the scheduler makes a decision at every decode step: which requests should be in this step’s batch? When a request finishes generating (by hitting a stop token or a length limit), its slot is immediately freed and can be filled by a new request from the queue, without waiting for the rest of the batch to finish. This eliminates head-of-line blocking.
On our workload from Table 5.1, we see in Figure 5.3 that continuous batching completes all five requests by time step 10! Requests A, B, C, and D each start as soon as they arrive. Request E does have to wait, but it starts processing as soon as the first batch slot is available, at time step 6. Continuous batching has eliminated unnecessary GPU idle time waiting for batches to be full as well as waiting for every request in a batch to complete.
The throughput improvement from continuous batching can be substantial. Under workloads with high variance in output length, continuous batching can improve throughput by 2-3x or more compared to static batching, simply by keeping batch slots occupied with useful work. This is why continuous batching is the foundational scheduling technique in serving frameworks like vLLM (Kwon et al. 2023) and SGLang (Zheng et al. 2023).
Continuous batching requires the serving system to manage per-request state (KV cache, position counters, stop conditions) independently for each slot, and to handle the bookkeeping of inserting and removing requests mid-generation. This is more complex than static batching but is well worth the implementation effort for any production system.
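That bookkeeping can be sketched as a per-iteration loop: retire finished requests, admit queued ones into freed slots, then run one decode step for the whole batch. All names here (`Request`, `serve`, `max_batch`) are illustrative, not any framework's real API; production schedulers add KV cache accounting, chunked prefill, and much more:

```python
# Minimal continuous-batching loop. At every iteration: retire finished
# requests (freeing their slots and, in a real system, their KV cache),
# admit queued requests mid-generation, then decode one step per request.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    rid: str
    max_new_tokens: int
    tokens: list = field(default_factory=list)

    def finished(self) -> bool:
        # Real systems also check stop tokens; we only check length.
        return len(self.tokens) >= self.max_new_tokens

def serve(queue: deque, max_batch: int = 4):
    batch, completed = [], []
    while queue or batch:
        # 1. Retire finished requests, immediately freeing their slots.
        still_running = []
        for r in batch:
            (completed if r.finished() else still_running).append(r)
        batch = still_running
        # 2. Admit new requests into any free slots, mid-generation.
        while queue and len(batch) < max_batch:
            batch.append(queue.popleft())
        # 3. One decode step for every request currently in the batch.
        for r in batch:
            r.tokens.append("<tok>")  # stand-in for a model forward pass
    return completed

done = serve(deque([Request("A", 4), Request("B", 6), Request("C", 3)]))
print([r.rid for r in done])  # requests complete in their own time: C, A, B
```

Because admission happens inside the per-step loop, a short request leaving the batch frees its slot on the very next iteration, which is exactly what eliminates head-of-line blocking.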
Continuous batching has eliminated most of the GPU idle time, but it creates a new problem: a long prefill can stall requests that are mid-decode. In a simple implementation where each input's prefill is processed all at once, we can't interleave decode steps into the middle of a prefill. And it's worth remembering that the prefill of a long input can take many times longer than a decode step (not just double, as we've depicted).
This is less a GPU utilization problem, which would show up in MFU, than a latency problem: it inflates TPOT and overall request completion time. In the next section, we will unpack the tension between prefill and decode, and ultimately resolve these stalls in decoding.
5.2 Prefill Strategies
The prefill–decode tension
In Section 3.5, we characterized prefill as compute-bound and decode as memory-bandwidth-bound. This difference creates a scheduling headache when the system has to do both at the same time.
Consider a serving system running continuous batching with 30 requests in the decode phase. A new request arrives with a 4,096-token prompt. The system needs to prefill this request before it can start decoding. But the prefill for a 4,096-token prompt is a large, compute-intensive operation that can take hundreds of milliseconds – during which the 30 decode requests are blocked from making progress.
From those 30 requests’ perspective, this is terrible. Each one is generating text for a user, and they all just stalled for the duration of someone else’s prefill. Their TPOT spikes, the streaming output freezes, and the user experience degrades. This is the prefill–decode tension: serving a new request’s prefill comes at the cost of stalling existing requests’ decode steps.
The tension gets worse as prompt lengths increase. With the trend toward longer contexts – 32K, 128K, or even longer – a single prefill can take multiple seconds, creating an unacceptable stall for all active decode requests.
Chunked prefill
Chunked prefill, introduced by Sarathi (Agrawal et al. 2023), resolves this tension by splitting a long prefill into smaller chunks and interleaving them with decode steps.
Instead of processing all 4,096 input tokens in one large forward pass, the system might break the prefill into 8 chunks of 512 tokens each. At each iteration, the scheduler runs one prefill chunk alongside the batch of decode tokens. The decode requests make progress every iteration, and the new request’s prefill completes over 8 iterations instead of blocking everything for one long pass.
The chunk size is a tuning parameter. Smaller chunks mean smoother decode progress but increase the number of iterations needed to complete the prefill, adding overhead from additional kernel launches and attention computations across chunk boundaries. Larger chunks reduce this overhead but create longer pauses in decode progress. In practice, chunk sizes of 256–2,048 tokens are common, depending on the model size, the capability of the GPU (or other hardware accelerator), and the latency sensitivity of the workload.
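One common way to implement the chunking is with a per-iteration token budget: each step schedules the active decode tokens first, then fills the remainder of the budget with the next slice of a pending prefill. A sketch, with all names and numbers hypothetical:

```python
# Chunked prefill via a per-iteration token budget: decode tokens are
# scheduled first, and the leftover budget is filled with the next chunk
# of the pending prefill. Hypothetical sketch, not any framework's API.

def plan_iterations(prefill_len: int, n_decode: int, budget: int = 512):
    """Yield (decode_tokens, prefill_chunk) pairs until the prefill is done."""
    plans, done = [], 0
    while done < prefill_len:
        chunk = min(budget - n_decode, prefill_len - done)  # leftover budget
        plans.append((n_decode, chunk))
        done += chunk
    return plans

# A 4,096-token prompt arriving alongside 30 active decode requests
# (each contributing 1 token per step):
plans = plan_iterations(prefill_len=4096, n_decode=30, budget=512)
print(len(plans))  # prefill spread across 9 iterations of <= 482 tokens each
```

The 30 decode requests make progress on every one of those iterations instead of stalling for the duration of a monolithic 4,096-token prefill.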
For our sample workload from Table 5.1, we can now improve upon continuous batching by breaking up our prefills (which we simplified to all be exactly twice as long as a decode step) into separate chunks. Figure 5.4 shows Requests A, B, and D all complete one time step sooner than with vanilla continuous batching. (In reality, if prefill was dozens of times slower than decode, the savings would be even greater.)
Chunked prefill has a nice bonus: it improves GPU utilization during decode iterations. The prefill chunk is compute-heavy and the decode tokens are bandwidth-heavy, so combining them in the same iteration gives the GPU a mix of both types of work. The prefill chunk keeps the compute units busy while the decode tokens mostly consume memory bandwidth, so completing a prefill chunk alongside decode steps takes the GPU barely longer than running the prefill chunk alone. This piggybacking effect can be as significant as the reduced stalling.
Chunked prefill interacts well with continuous batching. The scheduler treats each prefill chunk as a partial request that occupies a “prefill slot” for one iteration. When the last chunk completes, the request transitions to the decode phase and joins the regular decode batch. The system never has to choose between serving new requests and making progress on existing ones – it does both simultaneously.
Disaggregated prefill
Chunked prefill is effective, but it’s still a compromise – prefill and decode share the same GPU and compete for the same resources. Disaggregated prefill takes a more radical approach: run prefill and decode on entirely separate hardware instances (Zhong et al. 2024).
The idea is straightforward. Prefill is compute-bound, so prefill instances should be optimized for compute throughput. Decode is memory-bandwidth-bound, so decode instances should be optimized for memory bandwidth and capacity. By disaggregating the two phases onto separate hardware, each can be independently optimized and scaled without interfering with the other.
The workflow looks like this:
- A request arrives and is routed to a prefill instance
- The prefill instance processes the full input and generates the KV cache
- The KV cache is transferred over a fast interconnect (such as NVLink or InfiniBand) to a decode instance
- The decode instance generates tokens autoregressively using the transferred KV cache
The main cost is the KV cache transfer. For a large model with a long prompt, the KV cache can be several gigabytes, and this data needs to move between machines before decoding can begin. The transfer latency adds directly to TTFT, so disaggregated prefill needs a fast interconnect to be practical. With NVLink or high-speed InfiniBand, this transfer can be completed in tens of milliseconds, which is often acceptable.
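A quick sanity check on that transfer cost; the bandwidth figures here are rough effective rates assumed for illustration, not measured numbers:

```python
# Ballpark KV cache transfer times for disaggregated prefill. Bandwidths
# are rough effective figures, assumed for illustration only.

def transfer_ms(kv_gb: float, bw_gbps: float) -> float:
    return kv_gb / bw_gbps * 1000

KV_GB = 4.0  # e.g. a long prompt's KV cache on a large model
print(round(transfer_ms(KV_GB, 450), 1))  # NVLink-class link, ~450 GB/s
print(round(transfer_ms(KV_GB, 50), 1))   # 400 Gb/s InfiniBand, ~50 GB/s
```

At NVLink-class bandwidth the transfer is under 10 ms; over fast InfiniBand it is closer to 80 ms. Either way it adds directly to TTFT, which is why slower networks make disaggregation impractical.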
The benefits go beyond eliminating interference. Disaggregated prefill lets you scale prefill and decode capacity independently. If your workload has long prompts but short outputs, you might need more prefill capacity and less decode capacity. If you’re serving a chatbot with short prompts but long responses, the ratio flips. With disaggregated serving, you allocate hardware to match the workload rather than being stuck with a fixed ratio.
Disaggregated prefill is a more complex deployment architecture, and the KV cache transfer adds latency and requires fast networking. For many deployments, chunked prefill provides enough isolation between the phases. But at large scale, where the workload is diverse enough and the infrastructure supports fast interconnects, disaggregation can provide significant efficiency gains.
5.3 Ragged Batching
Even with continuous batching, there’s still a source of waste: when requests in a batch have different sequence lengths, the shorter sequences need padding to match the longest one in the batch for the attention computation. Ragged batching eliminates this by concatenating all the tokens from different requests into a single flat sequence, with appropriate attention masks to prevent cross-request attention.
Instead of a padded batch tensor of shape \([B, S_{\text{max}}]\) where shorter sequences waste compute on pad tokens, ragged batching creates a packed tensor of shape \([T_{\text{total}}]\) where \(T_{\text{total}}\) is the sum of all actual token counts. An index array tracks where each request’s tokens begin and end, and the attention mask ensures that tokens from one request can’t attend to tokens from another.
This matters most during prefill, where the input lengths across requests can vary widely. A batch with one 2,048-token prompt and three 128-token prompts would waste enormous compute on padding without ragged batching. With ragged batching, the compute is proportional to the actual number of tokens: \(2{,}048 + 3 \times 128 = 2{,}432\) tokens instead of \(4 \times 2{,}048 = 8{,}192\) tokens.
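A minimal sketch of the packed layout, using cumulative sequence lengths (often called `cu_seqlens`) as the index array. This mirrors the variable-length layout that attention kernels consume, but the code here is plain Python for illustration only:

```python
# Pack variable-length requests into one flat token sequence, with a
# cumulative-lengths index (cu_seqlens) marking request boundaries.
# Plain-Python sketch of the layout, not a kernel implementation.
from itertools import accumulate

def pack(requests):
    """requests: list of per-request token lists -> (flat tokens, cu_seqlens)."""
    flat = [t for req in requests for t in req]
    cu_seqlens = [0] + list(accumulate(len(req) for req in requests))
    return flat, cu_seqlens

# One 2,048-token prompt and three 128-token prompts (dummy token ids):
reqs = [[0] * 2048, [0] * 128, [0] * 128, [0] * 128]
flat, cu = pack(reqs)
print(len(flat), cu)  # 2432 tokens, vs 4 * 2048 = 8192 with padding
# Request i owns flat[cu[i]:cu[i+1]]; the attention mask is built from
# these bounds so tokens never attend across request boundaries.
```

The compute saving is exactly the ratio from the text: 2,432 packed tokens instead of 8,192 padded ones.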
The benefits of ragged batching combine really well with prefix caching, which we will discuss in Section 6.4. Here is a quick diagram showing how sharing the prefix in ragged batching can save huge amounts of compute with long shared prefixes, such as a long system prompt.
5.4 Request Scheduling and Prioritization
Granularity of scheduling
A recurring theme in this chapter is that scheduling at the level of individual iterations, rather than whole requests or whole batches, gives the scheduler far more control. Continuous batching admits and retires requests at every decode step; chunked prefill breaks one long operation into step-sized pieces that can be interleaved. The prioritization and preemption policies below build on this same step-level granularity: because the scheduler revisits its decisions every iteration, it can react quickly when priorities shift or memory runs short.
Priority-based scheduling
Not all requests are equally urgent. A real-time chatbot response that a user is waiting on matters more – at least from a latency perspective – than a background batch job scoring a dataset. Priority-based scheduling lets the serving system express this difference.
In a priority-aware scheduler, each incoming request is assigned a priority level. High-priority requests are placed at the front of the scheduling queue, while lower-priority requests wait until there’s spare capacity. The scheduler decides at each iteration which requests to include in the batch, and high-priority requests get scheduled first.
This enables a common production pattern: a system serves both real-time and batch traffic on the same hardware. Real-time requests get low TTFT because they jump the queue. Batch requests fill in the gaps when real-time traffic is light, keeping the GPU busy during off-peak periods. The result is better overall utilization without sacrificing the latency SLA for interactive users.
SLA-aware scheduling takes this a step further. Instead of simple priority levels, the scheduler tracks each request’s deadline – for example, “this request must produce its first token within 500 ms.” The scheduler then makes admission decisions based on whether it can meet the deadline. If admitting a new request would cause an existing request to miss its deadline, the scheduler may defer the new request or take other action.
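A sketch of priority-based admission at each iteration. The class and method names are illustrative; a real scheduler would also track deadlines and KV cache memory before admitting anything:

```python
# Priority-based admission for an iteration-level scheduler. Hypothetical
# sketch: a production scheduler also checks deadlines and KV memory.
import heapq
import itertools

class PriorityScheduler:
    def __init__(self, max_batch: int):
        self.max_batch = max_batch
        self.heap = []                   # entries: (priority, seq, request)
        self.seq = itertools.count()     # FIFO tie-break within a priority

    def submit(self, request, priority: int):
        # Lower number = more urgent (e.g. interactive=0, batch jobs=10).
        heapq.heappush(self.heap, (priority, next(self.seq), request))

    def next_batch(self, running: list) -> list:
        """Called every iteration: fill free slots highest-priority first."""
        batch = list(running)
        while self.heap and len(batch) < self.max_batch:
            _, _, req = heapq.heappop(self.heap)
            batch.append(req)
        return batch

sched = PriorityScheduler(max_batch=2)
sched.submit("batch-job-1", priority=10)
sched.submit("chat-1", priority=0)
sched.submit("chat-2", priority=0)
print(sched.next_batch(running=[]))  # the chat requests jump the queue
```

The batch job only runs once the interactive traffic is drained, which is exactly the off-peak backfill pattern described above.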
The details of how serving frameworks like vLLM and SGLang implement their iteration-level schedulers – deciding which mix of prefill chunks and decode tokens to process each step – are covered in Section 8.1. The scheduler is the beating heart of a serving system, and getting it right has a large impact on goodput.
Preemption
What happens when the system runs out of KV cache memory? With continuous batching, the scheduler is constantly admitting new requests into the batch. Each new request needs KV cache memory, and the existing requests’ KV caches grow with each decode step. Eventually, something has to give.
Preemption is the mechanism for handling this. When KV cache memory is exhausted and a new high-priority request needs to be served, the scheduler can preempt one or more existing requests – suspending them and freeing their KV cache memory to make room. When resources become available later, the preempted requests resume.
There are two strategies for handling the KV cache of a preempted request:
Swap evicts the preempted request’s KV cache blocks from GPU memory to CPU memory. When the request resumes, the blocks are swapped back. This preserves the work already done – the request picks up exactly where it left off without any recomputation. The cost is the time to transfer data over the PCIe bus. For a KV cache of several hundred megabytes, a swap to CPU memory at PCIe Gen5 speeds (~64 GB/s) takes a few milliseconds. This is usually acceptable, but the CPU needs enough free memory to hold the evicted blocks, and the PCIe bandwidth is shared with other transfers.
Recomputation takes a simpler but more expensive approach: discard the preempted request’s KV cache entirely. When the request resumes, the system reruns prefill from scratch to rebuild the KV cache. This frees GPU memory immediately and requires no CPU memory, but it wastes the compute that went into the original prefill plus all the KV cache entries accumulated during decode. For a request with a long prompt, recomputation can be very costly.
| Strategy | GPU memory freed | CPU memory needed | Resume cost | Best when |
|---|---|---|---|---|
| Swap | Yes | Yes | Transfer latency | Plenty of CPU memory, short pauses |
| Recomputation | Yes | No | Full prefill re-run | Limited CPU memory, rare preemption |
The choice between swap and recomputation depends on the situation. If the system has ample CPU memory and preemptions are brief, swapping is clearly better – you avoid redoing work. If CPU memory is tight or the preempted request would sit idle for a long time (making the cached data stale anyway), recomputation may be simpler and more predictable. Some systems use a hybrid approach: swap when possible, fall back to recomputation when CPU memory is full.
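The tradeoff can be framed as a simple cost comparison for resuming a preempted request. The PCIe figure is the chapter's; the prefill throughput is an illustrative assumption:

```python
# Rough cost model for resuming a preempted request: swap-in over PCIe
# versus rerunning prefill. Rates are illustrative assumptions.

PCIE_GBPS = 64.0            # PCIe Gen5 effective bandwidth, GB/s
PREFILL_TOK_PER_S = 20_000  # assumed prefill throughput for this model

def swap_in_ms(kv_gb: float) -> float:
    return kv_gb / PCIE_GBPS * 1000

def recompute_ms(prompt_tokens: int) -> float:
    return prompt_tokens / PREFILL_TOK_PER_S * 1000

# Resuming a request with 0.5 GB of KV cache from a 4,096-token prompt:
print(round(swap_in_ms(0.5), 1), round(recompute_ms(4096), 1))
```

Under these assumptions, swap-in costs ~8 ms while recomputation costs ~200 ms, so swapping wins whenever CPU memory is available; the numbers flip in recomputation's favor only when the swap would contend for scarce PCIe bandwidth or CPU memory.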
OOM handling and graceful degradation
Preemption is ideally a rare event – it happens only when the system is under heavy load and the scheduler can’t fit all active requests into memory. But in adversarial conditions – a sudden burst of long-context requests, for example – preemption alone may not be enough. The system can run out of both GPU and CPU memory for KV caches.
At this point, the system needs a last-resort strategy. Options include:
- Rejecting new requests: return an error and let the client retry later. This protects existing requests but reduces throughput.
- CPU offloading: proactively keep some fraction of KV cache blocks on CPU memory, streaming them to the GPU as needed. This trades PCIe bandwidth for GPU memory capacity.
- Reducing batch size: admit fewer concurrent requests, prioritizing the ones already in progress. New requests queue up and wait.
The details of how production systems handle these failure modes – and how to configure policies that balance throughput, latency, and reliability – are covered in Section 8.6.
Putting scheduling together
The scheduling techniques in this chapter form a stack. At the bottom, the batching strategy determines how requests are grouped for each forward pass. Continuous batching provides the foundation, keeping batch slots full at all times. Ragged batching eliminates padding waste within each batch. On top of that, chunked or disaggregated prefill manages the tension between serving new requests and making progress on existing ones. And the priority and preemption policies govern which requests get resources when demand exceeds supply.
None of these techniques change the model or its weights. They don’t reduce the number of FLOPs needed to process a token or shrink the KV cache per token. What they do is ensure that the FLOPs the GPU performs are useful FLOPs – not wasted on padding, not blocked by long prefills, not sitting idle while a slow request holds up the batch. In the next chapter, we’ll turn to techniques that operate on individual requests: optimizing the attention computation, managing the KV cache efficiently, and breaking the one-token-at-a-time constraint of autoregressive decoding.
5.5 Further Reading
The Orca paper (Yu et al. 2022) is the foundational reference for continuous batching. It introduced iteration-level scheduling – the idea that the scheduler should make decisions at every decode step rather than waiting for an entire batch to complete. If you read one paper from this chapter, make it this one. The system design is clean, and the evaluation clearly shows the throughput gains over static batching.
Sarathi (Agrawal et al. 2023) introduced chunked prefill and the idea of piggybacking decode tokens onto prefill chunks to keep the GPU busy during long prefills. The paper includes a clear analysis of how unchunked prefills create latency bubbles for co-scheduled decode requests, and how chunk sizing trades off between prefill throughput and decode latency stability.
DistServe (Zhong et al. 2024) takes the opposite approach to managing the prefill-decode tension: rather than interleaving them on the same GPU, it disaggregates them onto separate hardware. The paper’s key contribution is showing that the optimal hardware configuration for prefill (maximize compute) is fundamentally different from decode (maximize memory bandwidth), and that separating them improves goodput. Splitwise (Patel et al. 2024) independently explored a similar disaggregation idea, with additional analysis of how to balance capacity between prefill and decode pools as workload mix changes.
For a broader treatment of how request scheduling interacts with memory management and SLA compliance, the vLLM paper (Kwon et al. 2023) covers not just PagedAttention but also the preemption and swap policies that sit on top of the memory allocator. The scheduling policy discussion in sections 4-5 of that paper complements the treatment here.