8  Production LLM Serving Systems

Now that we understand LLM inference and many of the optimization techniques used, we can discuss some of the popular model serving frameworks and the implementation choices they have made for serving LLMs in the real world. Some of these systems are in production across many continents, scaling to thousands of concurrent users or more. This is not a usage tutorial for any particular framework. The goal is to understand which optimization combinations each system makes, why those choices matter, and what operational concerns arise once you move from benchmarking a single model on a single GPU to serving real traffic.

8.1 Serving Frameworks

The open-source LLM serving landscape has converged around a handful of frameworks, each with a different philosophy about what to optimize for. All of them implement continuous batching (Section 5.1), support tensor parallelism (Section 7.3), and integrate FlashAttention (Section 6.1). Where they differ is in their core innovation and the tradeoffs they make.

vLLM

vLLM is built around PagedAttention (Kwon et al. 2023), the virtual-memory-inspired KV cache management system we covered in Section 6.3. By breaking the KV cache into fixed-size pages that don’t need to be physically contiguous, vLLM nearly eliminates the memory fragmentation that plagued earlier systems. This is what allows it to pack more concurrent requests into the same GPU memory budget.

Beyond PagedAttention, vLLM implements continuous batching, chunked prefill (Section 5.2), prefix caching (Section 6.4), speculative decoding (Section 6.5), and multi-LoRA serving. It supports NVIDIA, AMD, and other hardware backends, making it one of the most broadly portable options. vLLM has become the default choice for many deployments, largely because of its active open-source community and broad model support.

SGLang

SGLang (Zheng et al. 2023) innovates in two directions. First, its RadixAttention system extends prefix caching beyond simple exact-match lookups. It organizes the KV cache as a radix tree keyed by token sequences, enabling automatic, fine-grained prefix sharing across requests (Section 6.4). When many requests share system prompts, few-shot examples, or RAG context, this can dramatically reduce redundant prefill computation.

Second, SGLang provides a programming model for multi-call LLM programs — workflows where a single user interaction involves multiple LLM calls with branching logic, structured output constraints, or iterative refinement. By co-designing the serving engine with this programming model, SGLang can optimize across calls in ways that a stateless API cannot.
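To make the prefix-sharing idea behind RadixAttention concrete, here is a minimal sketch of a token-keyed prefix tree. It is not SGLang's implementation: the class names `PrefixNode` and `PrefixCache` are hypothetical, the tree stores one token per node rather than compressing chains as a real radix tree would, and it tracks only whether KV entries exist, not the entries themselves.

```python
class PrefixNode:
    """One node per token; a real radix tree would compress single-child chains."""

    def __init__(self):
        self.children = {}   # token id -> PrefixNode
        self.has_kv = False  # True if KV for the prefix ending here is cached


class PrefixCache:
    """Hypothetical prefix index: answers 'how much of this prompt is already cached?'"""

    def __init__(self):
        self.root = PrefixNode()

    def insert(self, tokens):
        """Record that KV entries exist for every prefix of `tokens`."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixNode())
            node.has_kv = True

    def match_len(self, tokens):
        """Return how many leading tokens already have cached KV."""
        node, matched = self.root, 0
        for tok in tokens:
            node = node.children.get(tok)
            if node is None or not node.has_kv:
                break
            matched += 1
        return matched


cache = PrefixCache()
system_prompt = [101, 7, 7, 42]
cache.insert(system_prompt + [5, 6])            # first request populates the tree
reused = cache.match_len(system_prompt + [9])   # second request shares the system prompt
```

In this sketch the second request would only need to prefill the tokens after the first `reused` positions; everything before that point is served from cache.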

TensorRT-LLM

TensorRT-LLM focuses on squeezing maximum throughput from NVIDIA hardware through aggressive kernel fusion and compilation (Section 6.2). Models are compiled into optimized execution plans that fuse operations across layers, eliminate unnecessary memory traffic, and exploit NVIDIA-specific hardware features like FP8 on Hopper GPUs.

The tradeoff is narrower hardware support — TensorRT-LLM runs only on NVIDIA GPUs — and a heavier compilation step before serving begins. But on NVIDIA hardware, it often achieves the highest raw throughput, particularly for large-batch, throughput-oriented workloads. It implements the same core techniques (continuous batching, paged KV cache, tensor parallelism) but through a compiled execution path rather than an interpreted one.

Framework comparison

The following table summarizes the key differences across these three leading frameworks. All three are actively developed, so specific feature gaps tend to close over time.

Table 8.1: Framework comparison across major open-source LLM serving systems.
| Feature | vLLM | TensorRT-LLM | SGLang |
|---|---|---|---|
| Core innovation | PagedAttention | Kernel fusion / compilation | RadixAttention + structured gen |
| KV cache management | Paged (virtual memory) | Paged | Radix tree |
| Continuous batching | Yes | Yes | Yes |
| Speculative decoding | Yes | Yes | Yes |
| Multi-LoRA | Yes | Yes | Yes |
| Hardware support | NVIDIA, AMD, others | NVIDIA only | NVIDIA, AMD |
| Best suited for | General-purpose serving | Max throughput on NVIDIA | Prefix-heavy / multi-call |

Other frameworks

Several other serving frameworks are worth knowing about, even though vLLM, SGLang, and TensorRT-LLM have emerged as the dominant choices.

Text Generation Inference (TGI) is Hugging Face’s serving engine. It saw wide adoption in 2023–2024 thanks to tight integration with the Hugging Face model ecosystem, and it implements continuous batching, paged KV cache, and speculative decoding. However, TGI has entered maintenance mode, and Hugging Face now offers vLLM and TensorRT-LLM as alternative backends within TGI — a signal that even its maintainers view the original engine as no longer competitive on raw performance.

LMDeploy, developed by the InternLM team at the Shanghai AI Lab, focuses on high throughput with quantized models. Its TurboMind backend delivers strong tokens-per-second numbers, particularly for 4-bit quantized models from the Llama family. It is less well-known outside of China but appears regularly in inference benchmarks.

DeepSpeed-Inference (Aminabadi et al. 2022) is the inference-focused sibling of the widely used DeepSpeed training library. Its ZeRO-Inference system distributes model weights across GPUs and even to NVMe storage, enabling inference on models that don’t fit in aggregate GPU memory. DeepSpeed-Inference also implements custom CUDA kernels for fused operations, but its primary strength is multi-GPU and multi-node orchestration rather than single-device kernel optimization. As GPU memory capacities have grown and tensor parallelism support has improved across all frameworks, the niche for NVMe offloading has narrowed, and DeepSpeed-Inference sees less adoption as a standalone serving system than the three leading frameworks.

Note

This chapter focuses on GPU-based serving at scale. Closed-source serving systems from cloud providers are out of scope, and so is local and edge deployment, which operates under fundamentally different constraints — limited memory bandwidth, no HBM, CPU-only or unified-memory architectures. While we won’t discuss them here, the top inference engines for local deployment are:

  • llama.cpp — the C/C++ inference engine that pioneered running LLMs on consumer hardware via aggressive quantization. Many of the tools below use llama.cpp under the hood.
  • Ollama — wraps llama.cpp in a Docker-like interface for pulling and running models with a single command. The fastest path from zero to a locally running LLM.
  • Apple MLX — Apple’s framework optimized for the unified-memory architecture of Apple Silicon, achieving strong throughput on MacBook and Mac hardware.
  • LM Studio — a desktop application built on llama.cpp that provides a GUI for downloading, configuring, and chatting with local models.

8.2 Multi-LoRA Serving

In many production deployments, you don’t just serve one model; you serve many fine-tuned variants of the same base model. LoRA (Hu et al. 2021) is a popular fine-tuning technique that adds small, trainable matrices to a frozen base model. Loading a separate full copy of the model for each variant is impractical: loading full weights on demand inflates TTFT, and keeping several full copies resident consumes GPU memory that would otherwise hold KV cache, which reduces concurrency. Multi-LoRA serving solves this by keeping a single copy of the base model weights in GPU memory and hot-swapping lightweight LoRA adapters at request time.

A LoRA adapter modifies the base model by adding low-rank update matrices to specific layers — typically the attention projections. These adapters are small, often just tens of megabytes compared to the base model’s tens of gigabytes. This size difference is what makes the whole approach work: you can keep the base model loaded and swap adapters in and out without the memory cost of maintaining multiple full model copies.
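A quick back-of-the-envelope calculation shows where the "tens of megabytes" figure comes from. For a weight matrix of shape d_out x d_in, a rank-r LoRA update adds two matrices with r x (d_in + d_out) parameters in total. The model dimensions below (32 layers, hidden size 4096, four square attention projections, rank 16, FP16 storage) are assumptions chosen for illustration, not the configuration of any specific model.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters added by one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def adapter_bytes(num_layers, proj_shapes, rank, bytes_per_param=2):
    """Total adapter size when every listed projection in every layer is adapted."""
    per_layer = sum(lora_param_count(d_in, d_out, rank) for d_in, d_out in proj_shapes)
    return num_layers * per_layer * bytes_per_param


# Assumed shapes: q, k, v, o projections, each 4096 x 4096 (no grouped-query attention).
attention_projections = [(4096, 4096)] * 4
size_bytes = adapter_bytes(num_layers=32, proj_shapes=attention_projections, rank=16)
size_mib = size_bytes / (1024 * 1024)
```

Under these assumptions the adapter comes to 32 MiB, roughly three orders of magnitude smaller than a base model that occupies tens of gigabytes.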

The serving framework needs to handle three things for multi-LoRA to work well:

  1. Adapter routing: Each incoming request specifies which adapter to use. The system must route the request to the correct adapter and apply it during the forward pass.
  2. Adapter memory management: Popular adapters can be kept resident in GPU memory. Less frequently used adapters are loaded on demand from CPU memory or storage. The system needs an eviction policy — typically LRU — to manage which adapters are resident.
  3. Batching across adapters: Requests using different adapters can still be batched together for the base model computation. The adapter-specific computation (the low-rank matrix multiplies) is applied per-request. This means you get some of the MFU benefits of shared base weights while still serving personalized models.
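The adapter memory management step can be sketched as a small LRU cache. This is an illustration only: `AdapterCache` and `load_fn` are hypothetical names, capacity is counted in adapters rather than bytes, and the "weights" are stand-in strings rather than tensors.

```python
from collections import OrderedDict


class AdapterCache:
    """Keeps at most `capacity` adapters resident, evicting the least recently used."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # hypothetical loader: adapter id -> weights
        self.resident = OrderedDict()   # insertion order doubles as recency order
        self.evicted = []               # kept only so the example is observable

    def get(self, adapter_id):
        if adapter_id in self.resident:
            self.resident.move_to_end(adapter_id)  # mark as most recently used
            return self.resident[adapter_id]
        if len(self.resident) >= self.capacity:
            victim, _ = self.resident.popitem(last=False)  # drop the LRU entry
            self.evicted.append(victim)
        weights = self.load_fn(adapter_id)  # slow path: load from CPU memory or storage
        self.resident[adapter_id] = weights
        return weights


cache = AdapterCache(capacity=2, load_fn=lambda adapter_id: f"weights:{adapter_id}")
for requested in ["a", "b", "a", "c"]:
    cache.get(requested)
```

After this request sequence, adapter "b" has been evicted because "a" was touched more recently. A production system would likely size the cache in bytes and pin adapters that are in use by running requests.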

Both vLLM and SGLang support multi-LoRA serving, with adapter management integrated into their scheduling and memory systems.

8.3 Scheduling and Orchestration

The scheduling strategies from Chapter 5 operate at the level of a single serving instance — one model on one set of GPUs. In production, you often need to coordinate across a fleet of instances, potentially running different models on different hardware.

Multi-model routing

Many applications route requests to models of different sizes based on the task complexity. Simple classification tasks go to a smaller, cheaper model. Complex reasoning tasks go to a larger, more capable model. This quality-tier routing can significantly reduce cost while maintaining quality where it matters, but it requires a routing layer that can classify incoming requests and direct them appropriately.

Load balancing across replicas

When you scale out with data parallelism — running multiple identical copies of the same model — you need a load balancer that understands LLM workload characteristics. Naive round-robin doesn’t work well because requests vary enormously in cost. A request with a 10,000-token prompt and 2,000-token output takes orders of magnitude more resources than a 100-token prompt with a 50-token output. Load balancers that account for estimated request cost (based on input length and expected output length) distribute work more evenly. They can be tuned not only to improve median metrics but also to reduce tail latencies, thereby maximizing goodput.
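As a rough illustration of cost-aware balancing, the sketch below routes each request to the replica with the lowest outstanding estimated cost. The cost formula and the `decode_weight` knob are assumptions for illustration; a real balancer would calibrate the estimate against measured latencies and would also need a prediction for output length.

```python
def estimate_cost(prompt_tokens, expected_output_tokens, decode_weight=4.0):
    """Hypothetical cost model: decode tokens are weighted more heavily than prompt tokens."""
    return prompt_tokens + decode_weight * expected_output_tokens


class CostAwareBalancer:
    def __init__(self, replica_names):
        self.load = {name: 0.0 for name in replica_names}  # outstanding estimated cost

    def route(self, prompt_tokens, expected_output_tokens):
        cost = estimate_cost(prompt_tokens, expected_output_tokens)
        target = min(self.load, key=self.load.get)  # least-loaded replica wins
        self.load[target] += cost
        return target, cost

    def complete(self, replica, cost):
        """Call when a request finishes so its cost stops counting against the replica."""
        self.load[replica] -= cost


balancer = CostAwareBalancer(["r0", "r1"])
assignments = [
    balancer.route(10_000, 2_000)[0],  # one heavy request
    balancer.route(100, 50)[0],        # two light requests
    balancer.route(100, 50)[0],
]
```

Round-robin would have sent the third request to the replica already holding the heavy one; here both light requests land on the other replica.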

Disaggregated prefill at fleet level

In Section 5.2, we discussed disaggregating prefill and decode to separate GPU pools (Zhong et al. 2024). At fleet scale, this becomes an orchestration problem: how many prefill instances versus decode instances should you run, and how do you transfer KV caches between them? The optimal ratio depends on your workload’s input/output length distribution (Section 3.3). Prefill-heavy workloads (long prompts, short outputs) need more prefill capacity. Decode-heavy workloads need more decode capacity. Production systems typically monitor queue depths on both pools and auto-scale the ratio to optimize the TTFT and TPOT tradeoff.

Heterogeneous hardware

Large deployments often mix GPU generations — perhaps H100s for the most latency-sensitive traffic and slower A100s or older GPUs for batch workloads. The orchestration layer must route requests based on both the model’s requirements and each hardware pool’s capabilities and current load.

8.4 Memory Management and Preemption

We covered the metrics around preemption in Section 3.2, and the paged KV cache in Section 6.3. In production, memory management becomes one of the most critical operational concerns, because running out of KV cache space under load can cascade into system-wide degradation.

Multi-tenant memory pressure

When many concurrent requests share the same GPU, the KV cache dominates memory usage. Each active request’s cache grows with every decode step, and different requests may have very different sequence lengths. If we have new requests arriving, we’d like to admit them to increase concurrency, MFU, and throughput. The serving framework must track per-request memory consumption and make admission decisions: can we accept this new request, or will its KV cache push us over our memory budget?

PagedAttention helps by eliminating fragmentation — memory is allocated in pages and freed when requests complete. But even with perfect paging, the total KV cache demand can exceed available memory when the system is under heavy load or processing long-context requests.
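The admission decision can be made concrete with a small calculation. Each token stores one key and one value vector per layer per KV head, so the per-token footprint is 2 x layers x KV heads x head dimension x bytes per element. The model shape (32 layers, 8 KV heads, head dimension 128, FP16) and the memory budget below are assumed values, and the check reserves the worst case up front, which is more conservative than a paged allocator that grows a request block by block.

```python
GIB = 2**30


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Keys and values (factor of 2) for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def can_admit(used_bytes, budget_bytes, prompt_tokens, max_new_tokens, per_token):
    """Conservative check: assume the request runs to its maximum length."""
    worst_case = (prompt_tokens + max_new_tokens) * per_token
    return used_bytes + worst_case <= budget_bytes


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

# Assumed state: 20 GiB KV budget with 19 GiB already in use.
fits = can_admit(19 * GIB, 20 * GIB, prompt_tokens=4096, max_new_tokens=1024, per_token=per_token)
too_big = can_admit(19 * GIB, 20 * GIB, prompt_tokens=8192, max_new_tokens=1024, per_token=per_token)
```

With these numbers each token costs 128 KiB, so the first request needs 640 MiB and fits in the remaining 1 GiB, while the second needs 1152 MiB and does not.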

Preemption policies

When memory runs out, something has to give. The system preempts one or more active requests, freeing their KV cache to make room for others. The preempted request either has its KV cache swapped to CPU memory (to be restored later) or is simply evicted and must recompute its KV cache from scratch when it resumes.

The choice between swap and recompute is a tradeoff between two costs, each of which adds latency for the preempted request and consumes GPU-side resources:

  • Swap preserves the computation already done but requires CPU memory and PCIe bandwidth to move the KV cache. For long sequences with large KV caches, the transfer time can be significant.
  • Recompute avoids the transfer cost but wastes the prefill work. For short sequences, recompute is often cheaper than swap.
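A simple cost model makes the tradeoff above easier to reason about. Every number here is a placeholder assumption (per-token KV size, PCIe bandwidth, a fixed per-swap overhead, and prefill throughput), and the model ignores the quadratic growth of attention cost with sequence length, so treat it as a sketch of the comparison rather than a predictor.

```python
def swap_seconds(num_tokens, kv_bytes_per_token, pcie_bytes_per_s, fixed_overhead_s):
    """Time to copy the KV cache out to CPU memory and back in later."""
    transfer = 2 * num_tokens * kv_bytes_per_token / pcie_bytes_per_s
    return fixed_overhead_s + transfer


def recompute_seconds(num_tokens, prefill_tokens_per_s):
    """Time to rebuild the KV cache by re-running prefill over the sequence."""
    return num_tokens / prefill_tokens_per_s


def cheaper_policy(
    num_tokens,
    kv_bytes_per_token=131_072,    # assumed: 128 KiB per token
    pcie_bytes_per_s=25e9,         # assumed effective PCIe bandwidth
    fixed_overhead_s=0.005,        # assumed per-swap scheduling and launch overhead
    prefill_tokens_per_s=10_000,   # assumed prefill throughput
):
    swap = swap_seconds(num_tokens, kv_bytes_per_token, pcie_bytes_per_s, fixed_overhead_s)
    recompute = recompute_seconds(num_tokens, prefill_tokens_per_s)
    return "swap" if swap < recompute else "recompute"


short_choice = cheaper_policy(32)
long_choice = cheaper_policy(8192)
```

Under these placeholder values the short sequence favors recompute and the long one favors swap; where the crossover falls on real hardware depends entirely on measured bandwidth and prefill speed.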

Production systems typically use priority-based preemption — lower-priority requests are evicted first, and SLA-aware policies protect requests that are close to completing or that belong to higher-priority tiers.

Graceful request degradation

Under sustained overload, the goal shifts from maximizing throughput to avoiding cascading failures and preserving as much goodput as possible. Request shedding — rejecting new requests at the admission layer when the system is at capacity — is blunt but effective. More sophisticated approaches include dynamically reducing maximum sequence lengths, lowering batch sizes, or routing overflow traffic to a degraded service tier that uses a smaller, faster model.

8.5 Monitoring and Benchmarking

You can’t optimize what you don’t measure. Production serving requires continuous monitoring of the metrics we introduced in Section 3.2, with particular attention to tail latencies and resource utilization.

Key production metrics

The metrics that matter most in production go beyond averages:

Table 8.2: Key production metrics and their diagnostic value.
| Metric | What it tells you | Warning threshold |
|---|---|---|
| P99 TTFT | Worst-case user wait for first token | Exceeds SLA |
| P99 TPOT | Worst-case streaming speed | Users perceive lag |
| GPU utilization (MFU/MBU) | Hardware efficiency | Below 50% suggests misconfiguration |
| Queue depth | Request backlog | Sustained growth means under-provisioned |
| Preemption rate | Memory pressure frequency | Any sustained preemption suggests need for more capacity |
| KV cache hit rate | Prefix caching effectiveness | Low rate with shared prefixes means caching misconfigured |
| Goodput | Useful throughput within SLA | Gap between throughput and goodput means SLA violations |

Percentile latencies (P50, P90, P99) are essential because averages hide tail behavior. A system can have a perfectly acceptable P50 TTFT of 200ms while its P99 is 5 seconds, meaning one in a hundred requests waits at least 25x longer than the median request. Production SLAs are typically defined at the P99 level.
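The gap between the mean and the tail is easy to see with a small synthetic sample. The sketch below uses the nearest-rank definition of a percentile; the sample itself (98 requests at 200 ms and 2 at 5 seconds) is made up to mirror the example above.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic TTFT sample in seconds: mostly fast, with a small slow tail.
ttft = [0.2] * 98 + [5.0] * 2

mean_ttft = sum(ttft) / len(ttft)
p50 = percentile(ttft, 50)
p99 = percentile(ttft, 99)
```

The mean stays under 300 ms and looks healthy, while the P99 exposes the 5-second tail. Monitoring systems usually estimate percentiles from histogram buckets rather than raw samples, so expect some approximation error in practice.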

Instrumentation

Most production deployments use Prometheus (Prometheus Authors 2024) for metrics collection and Grafana (Grafana Labs 2024) for dashboarding. vLLM, SGLang, and TensorRT-LLM all expose Prometheus-compatible metrics endpoints. For distributed tracing across the full request lifecycle — from the load balancer through the serving engine to response delivery — OpenTelemetry (OpenTelemetry Authors 2024) provides a standard instrumentation layer.

Benchmarking

Before deploying a new configuration, you need reproducible benchmarks that reflect your actual workload. Tools like LLMPerf generate synthetic traffic with configurable input/output length distributions and concurrency levels (Kadous et al. 2023). The critical mistake in benchmarking is using a uniform workload (e.g., all requests with 512 input tokens and 128 output tokens) when your production traffic has high variance. Benchmark with a distribution that matches production, or you’ll be surprised by how differently your system behaves under real load.
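One way to avoid the uniform-workload trap is to draw request lengths from a skewed distribution. The sketch below samples prompt and output lengths from log-normal distributions; it does not use LLMPerf's API, and the distribution parameters and caps are placeholder assumptions that you would replace with values fitted to your own production logs.

```python
import random


def sample_request_lengths(
    n,
    seed=0,
    in_mu=6.0, in_sigma=1.0,     # assumed log-normal parameters for prompt length
    out_mu=4.5, out_sigma=0.8,   # assumed log-normal parameters for output length
    max_in=8192, max_out=2048,   # assumed model limits
):
    """Return n (prompt_tokens, output_tokens) pairs with a heavy right tail."""
    rng = random.Random(seed)  # seeded so the benchmark is reproducible
    requests = []
    for _ in range(n):
        prompt = min(max_in, max(1, int(rng.lognormvariate(in_mu, in_sigma))))
        output = min(max_out, max(1, int(rng.lognormvariate(out_mu, out_sigma))))
        requests.append((prompt, output))
    return requests


workload = sample_request_lengths(1000)
longest_prompt = max(prompt for prompt, _ in workload)
shortest_prompt = min(prompt for prompt, _ in workload)
```

Fixing the seed keeps runs comparable across configurations, which matters as much as the shape of the distribution itself.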

8.6 Production Failure Modes

Understanding how systems fail is just as important as understanding how they work. Several failure modes are specific to LLM serving, and recognizing them early can prevent cascading outages.

OOM cascades

This is the most dangerous failure pattern. A single unusually long request consumes a large portion of the KV cache, triggering preemption of other requests. Those preempted requests eventually resume and recompute their KV caches, creating a burst of prefill work that competes with ongoing decode steps. This increases latency for all active requests, causing SLA violations, which may trigger retries from the client, which adds more load. The feedback loop can take down an entire serving instance.

Mitigation: Set hard limits on maximum sequence length per request. Use admission control to reject requests that would push memory usage above a safe threshold. Monitor preemption rate as an early warning signal.

Head-of-line blocking

A request with a very long prompt ties up GPU compute during its prefill phase. If the system uses naive scheduling, shorter requests queue behind it and experience inflated TTFT. This is particularly damaging in mixed-workload deployments where most requests are short but occasional long-context requests arrive.

Mitigation: Chunked prefill (Section 5.2) breaks long prefills into smaller pieces that are interleaved with decode steps, preventing any single prefill from monopolizing the GPU. Priority scheduling (Section 5.4) can also help by ensuring short requests aren’t starved.

Thundering herd

A burst of simultaneous requests — perhaps triggered by a traffic spike, a retry storm, or a batch job — overwhelms the KV cache allocator and scheduler. The system attempts to admit all requests at once, runs out of memory, and begins preempting aggressively, which compounds the problem.

Mitigation: Admission control with request queuing and rate limiting. Gradually ramp admitted requests rather than accepting the full burst. Backpressure signals to upstream load balancers allow the fleet to absorb traffic spikes across replicas.
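A token bucket is one common way to implement the gradual ramp described above. In this minimal sketch the caller passes the current time explicitly so the behavior is deterministic; the rate and burst values are arbitrary, and a production version would typically read the clock from `time.monotonic()` and queue rejected requests rather than drop them.

```python
class TokenBucket:
    """Admits at most `burst` requests at once, then refills at `rate_per_s`."""

    def __init__(self, rate_per_s, burst):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.last = 0.0  # timestamp of the previous call, in seconds

    def try_admit(self, now):
        elapsed = now - self.last
        self.last = now
        # Refill proportionally to elapsed time, never beyond the burst capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate_per_s=10, burst=5)
admitted_in_burst = sum(bucket.try_admit(now=0.0) for _ in range(20))  # 20 arrive together
admitted_later = bucket.try_admit(now=1.0)                             # after a refill
```

Of the twenty simultaneous arrivals only the burst allowance is admitted immediately; the rest have to wait for the bucket to refill, which spreads the load on the KV cache allocator over time.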

Graceful degradation strategies

We have already touched on several strategies for mitigating overload, but they are worth collecting in one place. When a system is under sustained pressure, there are several levers that can be pulled before things break:

  1. Request shedding: reject low-priority requests at the edge, returning a “service unavailable” rather than degrading quality for everyone
  2. Dynamic batch policy adjustment: temporarily reduce maximum batch sizes to lower memory pressure, trading throughput for stability
  3. Quality-tier routing: redirect overflow traffic to smaller, cheaper models that can handle the load, accepting lower quality to maintain availability
  4. Speculative decoding toggle: disable speculative decoding under memory pressure, since draft tokens consume KV cache space

The common thread across all of these strategies is that controlled, deliberate degradation is vastly preferable to uncontrolled cascading failure. A system that gracefully sheds 10% of requests under a load spike is far better than one that falls over entirely and drops 100%.

8.7 vLLM

vLLM has become the most widely deployed open-source LLM serving framework, and it’s worth understanding why. The short answer is breadth: vLLM implements nearly every optimization technique we’ve covered in this book, from PagedAttention and chunked prefill through speculative decoding and expert parallelism. But breadth alone doesn’t explain adoption — the design decisions behind how these techniques fit together matter just as much. This section walks through those decisions, focusing on what’s distinctive about vLLM’s implementation rather than re-explaining the underlying techniques.

The V1 architecture

vLLM’s original architecture (now called V0) grew organically as new features were added. By the time it supported prefix caching, speculative decoding, chunked prefill, and multi-modal inputs, the scheduler and worker code had accumulated enough special cases that adding new features meant touching many interacting code paths. The V1 redesign (vLLM Team 2025e) addressed this by rethinking the core abstractions.

The most visible change is the persistent batch. In V0, each scheduling step constructed a new set of input tensors from scratch — assembling token IDs, position indices, and attention metadata for every request in the batch. V1 caches these tensors and applies only incremental diffs each step. If a batch of 200 requests produces one token each, V1 updates 200 entries rather than rebuilding the entire input. This sounds like a minor optimization, but the CPU overhead of tensor construction was a real bottleneck at high batch sizes, and eliminating it delivers up to 1.7x higher throughput compared to V0 (vLLM Team 2025e).
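The difference between rebuilding and patching the batch can be shown with a toy model. This is not vLLM's data structure: `PersistentBatch` is a hypothetical class that uses Python dictionaries where the real engine uses preallocated tensors, and it exists only to show that a decode step touches one entry per request.

```python
class PersistentBatch:
    """Toy persistent batch: state survives across steps and is updated incrementally."""

    def __init__(self):
        self.token_ids = {}  # request id -> tokens seen so far
        self.positions = {}  # request id -> position of the next token

    def add_request(self, req_id, prompt_tokens):
        self.token_ids[req_id] = list(prompt_tokens)
        self.positions[req_id] = len(prompt_tokens)

    def remove_request(self, req_id):
        del self.token_ids[req_id]
        del self.positions[req_id]

    def apply_step(self, new_tokens):
        """Apply one decode step. `new_tokens` maps request id -> sampled token."""
        for req_id, token in new_tokens.items():
            self.token_ids[req_id].append(token)
            self.positions[req_id] += 1
        return len(new_tokens)  # number of entries touched this step


batch = PersistentBatch()
for req_id in range(200):
    batch.add_request(req_id, [1, 2, 3])
entries_updated = batch.apply_step({req_id: 7 for req_id in range(200)})
```

A from-scratch rebuild would instead copy every token of every request on every step, which is the CPU cost the persistent batch is designed to avoid.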

The second major change is symmetric tensor parallelism. In V0, the scheduler ran on rank 0 and broadcast full request state to all other TP ranks every step. V1 caches request state on every worker and transmits only diffs — new requests entering the batch and completed requests leaving it. This removes the asymmetry where rank 0 was doing substantially more work than the other ranks, and it reduces the communication volume between the scheduler and the workers.

V1 also embraces torch.compile as a first-class optimization path. Rather than relying entirely on hand-written custom CUDA kernels, V1 passes models through PyTorch’s Inductor compiler and then applies vLLM-specific fusion passes on top (vLLM Team 2024c). The compiler handles the boilerplate optimizations — operator fusion, memory planning, kernel selection — while custom passes add inference-specific fusions like RMSNorm + quantization and SiLU + quantized linear. This is a deliberate bet on maintainability: as new GPU architectures arrive, torch.compile can retarget them without rewriting kernel code.

Finally, V1 uses piecewise CUDA graphs aligned with its compilation strategy. The computation graph is split at attention operations — which need dynamic shapes because the batch composition changes every step — and CUDA graphs capture everything between them. This gives most of the kernel launch overhead reduction of full CUDA graphs while preserving the flexibility that attention requires (vLLM Team 2025e).

Scheduling and memory management

vLLM’s scheduler operates at the iteration level (Section 5.1): after every forward pass, it decides which requests continue, which new requests enter, and which running requests should be preempted. This is the same continuous batching approach used by other frameworks, but vLLM’s implementation has several distinctive features.

Chunked prefill is the default scheduling mode. Long prompts are split into chunks by capping the number of new tokens processed per step. The scheduler fills each step by first allocating budget to decode requests — since each one needs only a single token — and then filling the remaining budget with pending prefill chunks. This prioritization is what keeps inter-token latency stable for requests that are already generating: decode steps are never blocked behind a large prefill.
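The decode-first budgeting described above can be sketched in a few lines. The function name, argument shapes, and the 2048-token budget are assumptions for illustration; the real scheduler also has to account for KV block availability, which this sketch ignores.

```python
def schedule_step(decoding, prefilling, token_budget):
    """Plan one step: decode requests first, then prefill chunks with what is left.

    decoding:   list of request ids currently generating (one token each)
    prefilling: dict of request id -> remaining prompt tokens, in queue order
    Returns a dict of request id -> number of tokens to process this step.
    """
    plan = {}
    budget = token_budget
    for req_id in decoding:
        if budget == 0:
            break
        plan[req_id] = 1
        budget -= 1
    for req_id, remaining in prefilling.items():
        if budget == 0:
            break
        chunk = min(remaining, budget)  # cap the chunk at the leftover budget
        plan[req_id] = chunk
        budget -= chunk
    return plan


plan = schedule_step(
    decoding=["d0", "d1", "d2"],
    prefilling={"p0": 5000, "p1": 300},
    token_budget=2048,
)
```

Here the three decode requests each get their single token, the 5000-token prompt is cut down to the remaining budget, and the second prompt waits for a later step.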

Priority-based scheduling lets you assign integer priorities to requests, and the scheduler dynamically reorders both the waiting and running queues. If a running request has lower priority than a waiting request, the scheduler can forcefully preempt it — swapping or freeing its KV cache blocks — to make room for the higher-priority request immediately. This goes beyond the simple memory-pressure preemption described in Section 8.4: it’s policy-driven preemption in service of request-level SLAs.

When preemption is necessary, vLLM uses a hybrid strategy. It attempts to swap KV cache blocks to CPU memory first, preserving the computed state. If CPU memory is insufficient, it falls back to recomputation — freeing the blocks and requeuing the request to recompute its KV cache when it’s rescheduled. The choice between swap and recompute is made automatically based on available resources, not configured manually.

SLA-aware scheduling is at the RFC stage as of early 2026, but worth noting because it reflects where the project is heading. The proposal introduces SLA tiers — interactive, batch, and background — as the primary sort key for scheduling, with integer priority as a tie-breaker within a tier. SLA tier would also influence preemption order under memory pressure: background requests are evicted first, then batch, then interactive. This is a natural extension of the priority-based scheduling already in place, moving from raw integer priorities toward semantically meaningful service classes.

OOM handling in vLLM follows a graceful degradation strategy rather than crashing. When the KV cache manager cannot allocate slots for a new step, the scheduler stops admitting new requests and may preempt the lowest-priority running requests to free blocks. If the waiting queue is full or no blocks can be freed, the engine rejects incoming requests outright rather than risking a system-wide failure. KV cache blocks can also be swapped to CPU memory during preemption, effectively extending available cache capacity at the cost of PCIe transfer latency (vLLM Team 2025b).

Prefix caching is enabled by default in V1 with near-zero overhead (vLLM Team 2025e). vLLM’s approach hashes each KV block using the tokens it contains plus all preceding tokens, then stores these hashes in a global table. When a new request shares a prefix with a cached request, the matching blocks are reused without recomputation. The eviction policy is LRU among blocks with a reference count of zero, with ties broken by evicting the block at the end of the longest cached prefix — an approach that produces equivalent behavior to SGLang’s RadixAttention for standard full-attention models. The V1 implementation achieves less than 1% throughput overhead even at a 0% cache hit rate, thanks to constant-time eviction data structures (vLLM Team 2025e).
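The block hashing scheme can be illustrated with a chained hash, where each block's hash covers its own tokens plus the hash of the previous block. Chaining is one way to make a hash depend on all preceding tokens; the hash function, the serialization, and the tiny block size here are assumptions for the example and not necessarily what vLLM uses internally.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; deliberately small for illustration


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block together with the hash of the block before it."""
    hashes, parent = [], b""
    full_length = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for start in range(0, full_length, block_size):
        block = tokens[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes


def cached_prefix_blocks(tokens, table):
    """Count leading blocks whose hashes are already in the global table."""
    count = 0
    for block_hash in block_hashes(tokens):
        if block_hash not in table:
            break
        count += 1
    return count


table = set(block_hashes(list(range(1, 11))))  # a finished request with 10 tokens
hits_same_prefix = cached_prefix_blocks([1, 2, 3, 4, 5, 6, 7, 8, 99, 100, 101, 102], table)
hits_changed_first_token = cached_prefix_blocks([0, 2, 3, 4, 5, 6, 7, 8], table)
```

The second lookup finds no reusable blocks even though its second block contains the same tokens as a cached one, because the chained hash also encodes everything that came before it.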

KV cache quantization provides an additional memory lever on top of paged allocation and prefix caching (Section 6.3). vLLM supports FP8 KV cache quantization with two granularity strategies: per-tensor quantization (a single scale factor for the entire Q, K, or V tensor) and per-attention-head quantization (a separate scale factor for each attention head) (vLLM Team 2024d). Three calibration methods are available: no calibration (scale factors default to 1.0), random token calibration performed on-the-fly during warmup, and dataset-based calibration through the companion llm-compressor library, which also supports the finer-grained per-head quantization (vLLM Team 2024d). FP8 KV cache roughly halves the memory footprint compared to FP16, enabling either more concurrent requests or longer context windows within the same GPU memory budget.
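The difference between the two granularities comes down to how many scale factors are computed. The sketch below derives scales from the maximum absolute value, a common calibration rule that we assume here rather than take from vLLM's code, and divides by 448, the largest finite value in the FP8 E4M3 format. It computes scales only and does not perform the cast to FP8.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def per_tensor_scale(heads):
    """One scale for the whole tensor, set by the largest magnitude anywhere."""
    amax = max(abs(x) for head in heads for x in head)
    return amax / FP8_E4M3_MAX


def per_head_scales(heads):
    """One scale per attention head, set by that head's own largest magnitude."""
    return [max(abs(x) for x in head) / FP8_E4M3_MAX for head in heads]


# Two toy heads with very different value ranges.
key_heads = [[0.5, -1.0], [56.0, -112.0]]
tensor_scale = per_tensor_scale(key_heads)
head_scales = per_head_scales(key_heads)
```

With a single per-tensor scale, the small-valued head is quantized using a step size chosen for the large-valued head and loses resolution; per-head scales avoid this at the cost of storing more scale factors.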

Attention backends and kernel optimization

vLLM doesn’t commit to a single attention kernel. Instead, it auto-selects from a menu of backends based on the GPU architecture, and you can override the choice if needed.

The selection logic follows GPU compute capability (vLLM Team 2024a):

  • FlashAttention-2 is the default on Ada Lovelace (SM89) and earlier CUDA GPUs.
  • FlashAttention-3 is the default on Hopper (SM90) GPUs. It exploits Hopper-specific features — warp specialization, FP8 tensor cores, asynchronous memory operations — and is essential for V1’s mixed prefill/decode batching where batch composition is highly dynamic.
  • FlashAttention-4 is the default on Blackwell (SM100+) GPUs.
  • FlashInfer provides a unified API that abstracts over multiple kernel implementations and JIT-compiles CUDA kernels for the target architecture. It serves as the primary backend on NVIDIA HGX B200 systems (vLLM Team 2024a).
  • Triton attention is implemented entirely in Triton (Tillet et al. 2019) and carries no external kernel dependencies (vLLM Team 2026). The same source code runs on NVIDIA, AMD, and Intel GPUs, making it the cross-platform fallback when FlashAttention or FlashInfer are unavailable.

Beyond attention kernel selection, vLLM applies kernel fusion through custom torch.compile Inductor passes that rewrite the computation graph rather than modifying model code (vLLM Team 2024c). The specific fusions include:

  • Attention output quantization: eliminates a full-precision memory round-trip by quantizing the attention output in-place
  • Activation fusion: fuses SiLU with a quantized linear layer, yielding up to 8% throughput improvement for quantized MLP blocks
  • QK norm + RoPE fusion: combines split QKV, reshape, Q/K RMSNorm, and rotary embedding into a single kernel
  • RMSNorm + quantization: eliminates an intermediate read/write of the full-precision activation tensor

These fusions are applied automatically when torch.compile is enabled — you don’t need to select or configure them individually.

Parallelism

vLLM supports all four major parallelism strategies — data parallelism (DP), tensor parallelism (TP), pipeline parallelism (PP), and expert parallelism (EP) — but the way they compose is worth understanding, especially for Mixture-of-Experts models.

The standard pattern for dense models is TP within a node and PP across nodes (Section 7.3, Section 7.4). TP shards individual layers across GPUs on the same node and synchronizes via AllReduce, while PP distributes layers sequentially across nodes. vLLM optimizes pipeline bubbles by processing multiple requests concurrently through the pipeline.

For MoE models, the picture gets more interesting. Expert parallelism is not a standalone strategy — it’s a modifier flag (--enable-expert-parallel) that changes how experts are distributed within an existing TP or DP configuration (AMD ROCm Team 2025).

Without the EP flag, all experts are present on every GPU with their weights sharded across devices, and synchronization uses AllReduce. With EP enabled alongside TP, experts are distributed across the TP GPUs — different experts on different devices — using AllToAll communication to route tokens to the correct expert. With EP enabled alongside DP, experts are distributed across DP replicas, with AllToAll routing tokens across replicas.

This distinction matters because the two combinations have very different performance profiles. TP+EP excels at low concurrency — roughly 52% higher throughput than DP+EP at 64 concurrent requests for DeepSeek-R1. But DP+EP dominates at high concurrency, with 47% higher throughput at 1024 concurrent requests. The crossover occurs around 256–512 concurrent requests (AMD ROCm Team 2025). The reason is that DP+EP partitions the KV cache across replicas, so each replica handles a smaller share of the total batch. At high concurrency, this partitioning matters more than the communication overhead.

There’s a subtlety for models with Multi-Head Latent Attention (MLA), like DeepSeek. MLA models have a single KV head, which means the KV cache can’t be sharded along the head dimension. Under TP, the full KV cache must be duplicated on every TP rank, wasting memory. DP+EP avoids this duplication, making it the preferred configuration for MLA models at scale.

One more wrinkle: for ultra-sparse models with less than 1% activation density — models like Llama-4-Maverick where very few experts fire per token — enabling the EP flag actually hurts throughput by 7–12% (AMD ROCm Team 2025). AllReduce is more efficient than AllToAll when the number of active experts is tiny, so the communication pattern that EP introduces costs more than it saves.

Disaggregated prefill is available as an experimental feature (vLLM Team 2025a). The idea (Section 5.2) is to run two separate vLLM instances — one dedicated to prefill (compute-bound) and one to decode (memory-bandwidth-bound) — with a connector that transfers prefill KV caches from the prefill instance to the decode instance. The motivation is that co-locating prefill and decode on the same GPUs forces suboptimal resource allocation, since the two phases have fundamentally different performance profiles. The vLLM Router (Section 8.7.6) supports this configuration natively, routing new requests to prefill workers and completed prefills to decode workers. This is still experimental, but it mirrors the more mature disaggregated serving support in SGLang (Section 8.8.8).

vLLM integrates specialized communication kernels for large-scale expert parallelism: DeepEP kernels (high-throughput for prefill, low-latency for decode) and PPLX kernels (CUDA graph compatible and flexible for chunked prefill) (vLLM Team 2025c). For load balancing across experts, vLLM supports EPLB (Expert Parallel Load Balancer), which implements hierarchical and global load balancing strategies from DeepSeek (vLLM Team 2025c).

Speculative decoding

vLLM supports a broad menu of speculative decoding methods (Section 6.5), and the practical question is which one to use for a given workload.

Draft model decoding is the classic approach: a smaller model proposes candidate tokens and the full target model verifies them in a single forward pass. The draft and target models must share the same tokenizer and vocabulary. vLLM reports up to 1.5x speedup with this method (JarvisLabs 2025).
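
The verification step can be sketched for the simplest case, greedy decoding. This is an illustrative toy, not vLLM's implementation: the function name is hypothetical, and production systems that sample with temperature use a rejection-sampling rule rather than exact token matching.

```python
def verify_greedy(draft_tokens, target_argmax):
    """Return the tokens emitted by one speculative step under greedy decoding.

    draft_tokens:  k tokens proposed by the draft model.
    target_argmax: k + 1 tokens; entry i is the target model's greedy choice
                   at position i, given the prefix plus draft_tokens[:i].
    """
    if len(target_argmax) != len(draft_tokens) + 1:
        raise ValueError("target must score one more position than the draft")
    emitted = []
    for drafted, wanted in zip(draft_tokens, target_argmax):
        if drafted != wanted:
            # First mismatch: emit the target's own token and stop.
            emitted.append(wanted)
            return emitted
        emitted.append(drafted)
    # Every draft token matched, so the extra target position is a bonus token.
    emitted.append(target_argmax[-1])
    return emitted
```

Each step always emits at least one token, so the output is identical to plain greedy decoding with the target model; the draft only changes how many tokens each target forward pass yields.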

N-gram matching (prompt lookup decoding) is the simplest option. It searches the prompt and previously generated tokens for n-gram matches and proposes the tokens that followed the match. There’s no separate model, so the VRAM overhead is zero. On summarization tasks where output heavily overlaps with the input, n-gram matching achieves up to 2.8x speedup — but for general-purpose chat, gains are modest at 1.10–1.17x (JarvisLabs 2025).
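
A minimal sketch of the proposal step, assuming token IDs are plain integers. The function name and the default window sizes are hypothetical; real implementations typically try several n-gram lengths and use faster search structures.

```python
def propose_ngram(tokens, n=3, k=4):
    """Propose up to k draft tokens by finding the last n tokens earlier in the context."""
    if len(tokens) <= n:
        return []
    suffix = tokens[-n:]
    # Scan backwards from the most recent candidate, skipping the suffix itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            # Propose whatever followed the earlier occurrence.
            return tokens[start + n:start + n + k]
    return []
```

When the context contains no earlier occurrence of the suffix, the function returns an empty list and the step degrades to ordinary decoding, which is why the method costs nothing on workloads where it does not help.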

Suffix decoding builds dual suffix tree data structures from the current request and previous request outputs. It’s CPU-based, model-free, and ideal for repetitive workloads like code generation or agentic loops where outputs follow predictable patterns (JarvisLabs 2025).

MLP speculators attach lightweight multi-headed MLPs to the target model, with separate prediction heads for 1, 2, and 3 steps ahead. They add roughly one-tenth the parameters of a full draft model (JarvisLabs 2025).

EAGLE is where the state of the art currently sits for general-purpose inference. EAGLE-3, the latest version, uses multi-layer fusion that extracts features from low, mid, and high layers of the target model, plus a training-time technique for resolving distribution mismatch between draft and target. It uses tree attention verification to check multiple candidate sequences simultaneously. On 70B models, EAGLE-3 achieves 1.57–1.60x speedup and is consistently the best method for large models where memory bandwidth is the bottleneck (JarvisLabs 2025).

The practical guidance breaks down by model size and workload:

  • Smaller models (8B): suffix decoding excels for code generation; EAGLE for unpredictable chat
  • Larger models (70B+): EAGLE-3 consistently wins due to the memory bandwidth bottleneck
  • High prompt/output overlap: n-gram matching is hard to beat and costs nothing

vLLM supports CUDA graphs for EAGLE-1 and EAGLE-3, and exposes speculative decoding metrics — draft acceptance rate, per-position acceptance rates, and mean acceptance length — for monitoring effectiveness in production (vLLM Team 2025c).
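
To make these three metrics concrete, the sketch below computes them from a log of how many draft tokens were accepted at each verification step. The definitions are one plausible reading and are assumptions, not vLLM's exact formulas: acceptance rate is accepted over proposed draft tokens, the per-position rate is the fraction of steps in which that draft position was accepted, and mean acceptance length counts the one token the target model always contributes.

```python
def spec_decode_metrics(accepted_per_step, num_draft_tokens):
    """Summarize speculative decoding effectiveness.

    accepted_per_step: number of accepted draft tokens (0..num_draft_tokens) per step.
    Returns (acceptance_rate, per_position_rates, mean_acceptance_length).
    """
    steps = len(accepted_per_step)
    if steps == 0:
        raise ValueError("need at least one verification step")
    total_accepted = sum(accepted_per_step)
    acceptance_rate = total_accepted / (steps * num_draft_tokens)
    # Position `pos` is accepted in a step only if more than `pos` tokens were accepted.
    per_position = [
        sum(1 for a in accepted_per_step if a > pos) / steps
        for pos in range(num_draft_tokens)
    ]
    # Assumed convention: +1 for the token the target emits on every step.
    mean_length = 1 + total_accepted / steps
    return acceptance_rate, per_position, mean_length
```

A per-position rate that falls off sharply after the first one or two positions is a hint that a shorter draft length would waste less compute.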

Operational features

Several features round out vLLM’s production story beyond raw inference performance.

Multi-LoRA serving in vLLM processes requests using different LoRA adapters concurrently in the same batch — base model, LoRA-A, and LoRA-B requests all run in parallel. Adapters can be added and removed at runtime via API endpoints without restarting the server. The key tuning parameters are max_loras (how many adapters can be active in a single batch), max_cpu_loras (the LRU cache size for adapters in CPU memory), and max_lora_rank (set as low as your adapters allow to save memory) (vLLM Team 2024e).

The vLLM Router is a FastAPI-based HTTP proxy for distributing requests across a fleet of vLLM engines (vLLM Team 2025d). It implements four load balancing algorithms:

  • Consistent hashing: routes requests with the same key (session or user ID) to the same worker, maximizing KV cache reuse across requests from the same conversation. This is the recommended policy for most deployments.
  • Power of Two (PoT): randomly samples two workers and routes to the less loaded one. Low overhead with good load distribution.
  • Round robin and random: standard stateless policies.
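
The first two policies can be sketched in a few lines. This is a generic illustration of consistent hashing and power-of-two-choices, not the vLLM Router's code; the class name, the virtual-node count, and the load dictionary are all hypothetical.

```python
import bisect
import hashlib
import random


def _hash(key):
    # Stable 64-bit hash, so routing survives process restarts.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class ConsistentHashRing:
    def __init__(self, workers, vnodes=64):
        # Each worker owns several points on the ring to smooth the distribution.
        self._ring = sorted(
            (_hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def route(self, session_id):
        # Walk clockwise to the first point past the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(session_id)) % len(self._ring)
        return self._ring[idx][1]


def power_of_two(loads, rng=random):
    """loads: dict mapping worker name to current load. Needs at least two workers."""
    a, b = rng.sample(list(loads), 2)
    return a if loads[a] <= loads[b] else b
```

Consistent hashing trades load balance for cache locality: the same session always lands on the same worker, and adding or removing a worker only remaps the keys adjacent to its ring points. Power of two makes the opposite trade.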

The router also supports prefill/decode disaggregated routing, sending new requests to prefill workers and completed prefills to decode workers. It integrates with Kubernetes for automatic pod discovery and exposes its own Prometheus metrics endpoint for request volume, latency, and per-worker health (vLLM Team 2025d).

Monitoring uses the standard Prometheus + Grafana stack (Section 8.5). vLLM exposes two categories of metrics through its /metrics endpoint: server-level gauges and counters (engine state, KV cache utilization, token throughput) and request-level histograms (TTFT, TPOT, ITL, prompt and generation lengths). For distributed tracing, vLLM supports OpenTelemetry, sending traces via OTLP to backends like Jaeger for end-to-end visibility across the serving pipeline (vLLM Team 2024e).

Hardware support is unusually broad. Beyond NVIDIA and AMD GPUs, vLLM runs on Intel GPUs and CPUs, Arm CPUs, Google TPUs, and through plugins on Intel Gaudi, IBM Spyre, and Huawei Ascend accelerators (Red Hat 2025). The Triton attention backend and torch.compile integration are what make this portability practical — the same model code and attention implementation can target different hardware without per-platform kernel rewrites. CPU inference supports DP, TP, and PP across multiple CPU sockets, and vLLM even supports heterogeneous speculative decoding where the draft model runs on CPU while the target model runs on GPU (Red Hat 2025).

Weight quantization is well-supported through multiple methods. FP8 W8A8 (8-bit weights and activations) runs on Hopper and Ada Lovelace GPUs, with a weight-only W8A16 variant available on Turing and Ampere via Marlin kernels — providing up to 2x memory reduction and up to 1.6x throughput improvement (vLLM Team 2024e). INT8 W8A8 is supported on compute capability 7.5 and above (Turing through Hopper), though not on Blackwell. For 4-bit weight-only quantization, vLLM supports both AWQ (using the official AWQ kernel) and GPTQ (using the ExLlamaV2 kernel by default). The llm-compressor companion library handles the calibration and quantization workflow for all of these methods (vLLM Team 2024d).

Guided decoding constrains model output to conform to a grammar-based finite-state machine, supporting both regular and context-free grammars through backends like xgrammar (vLLM Team 2025b). This enables structured output generation — producing valid JSON matching a schema, for instance — without post-hoc validation and retry.

Multimodal and VLM support received significant attention in V1. Input processing for vision-language models is offloaded to separate processes to avoid blocking the main inference loop. V1 implements encoder caching to avoid redundant vision encoder forward passes, chunked-prefill scheduling adapted for VLM inputs, and image-hash-based prefix caching that extends the prefix caching system to multimodal inputs (vLLM Team 2025e).

Benchmarking tools ship with vLLM for evaluating serving performance across different scenarios (vLLM Team 2024b). Three built-in benchmark modes cover the main use cases: a latency benchmark (short inputs, fixed output length) for measuring per-request latency, a throughput benchmark (1000+ samples submitted at once) for measuring maximum offline batch throughput, and a serving benchmark with Poisson-distributed request arrivals for measuring online serving under realistic load. The serving benchmark reports TTFT, TPOT, and ITL at P50, P90, and P99 — the same percentile latencies that production SLAs are typically defined against (Section 8.5). vLLM also runs continuous benchmarks on every labeled commit and merged PR, publishing results to a public performance dashboard for tracking regressions over time (vLLM Team 2024b).
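
The two ingredients of the serving benchmark, Poisson arrivals and percentile latencies, are easy to reproduce in isolation. The sketch below is not vLLM's benchmark code; the function names are hypothetical and the percentile uses the simple nearest-rank definition, which may differ slightly from the interpolation a given tool applies.

```python
import math
import random


def poisson_arrival_times(rate_per_s, n_requests, seed=0):
    """Arrival timestamps for a Poisson process: exponential inter-arrival gaps."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n_requests):
        t += rng.expovariate(rate_per_s)
        times.append(t)
    return times


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

A load generator would sleep until each timestamp before sending a request, then feed the measured TTFT or ITL samples to `percentile` at 50, 90, and 99.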

8.8 SGLang

SGLang’s identity is built around two ideas: RadixAttention for fine-grained prefix caching, and a co-designed programming model for multi-call LLM workflows. But the framework has grown well beyond those roots. SGLang now implements a comprehensive set of serving optimizations — from a zero-overhead batch scheduler and hierarchical KV caching to chunked pipeline parallelism and an extensive menu of attention backends. This section walks through those implementations, focusing on what SGLang does differently and where its design choices lead to distinct tradeoffs.

RadixAttention and prefix caching

RadixAttention is SGLang’s signature contribution to KV cache management (Zheng et al. 2023). Where vLLM hashes KV blocks and stores them in a flat lookup table, SGLang organizes the entire KV cache as a radix tree — a compressed trie where edges are labeled with token sequences of varying lengths and nodes point to the corresponding KV cache tensors.

When a new request arrives, the scheduler walks the tree to find the longest prefix match. If the request’s prompt shares a prefix with any cached sequence, the matching KV cache is reused without recomputation. The tree structure makes this automatic and fine-grained: it handles multi-level sharing across few-shot prompts, branching reasoning trees, multi-turn chat histories, and self-consistency sampling without any explicit configuration.
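
The lookup can be illustrated with a deliberately simplified structure. The sketch below is an uncompressed token trie with hypothetical names; a real radix tree merges single-child chains into multi-token edges, and each node would also reference the KV cache tensors for its tokens.

```python
class PrefixTrie:
    """Uncompressed token trie used only to illustrate longest-prefix matching."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def match_prefix_len(self, tokens):
        """Number of leading tokens whose KV cache could be reused."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched
```

Inserting two sequences that share a system prompt stores the shared tokens once, and any later request beginning with that prompt matches it regardless of which earlier request created it.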

The key architectural difference from vLLM’s block-level hashing is granularity. vLLM matches at block boundaries using O(1) hash lookups — fast, but it can only reuse cache at the granularity of fixed-size blocks. SGLang’s token-level matching in a radix tree enables finer-grained reuse at the cost of a tree traversal. For workloads with many shared prefixes — the exact scenario RadixAttention was designed for — the finer granularity pays off.
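
The granularity difference is easy to see with a toy version of block-level hashing. The sketch below is an assumption-laden illustration, not vLLM's implementation: it chains a SHA-256 hash over each full block of 16 tokens and counts how many leading blocks of a new prompt are already cached.

```python
import hashlib


def block_hashes(tokens, block_size=16):
    """Chain-hash each full block so a block's hash depends on everything before it."""
    hashes, prev = [], b""
    full = len(tokens) - len(tokens) % block_size   # ignore the trailing partial block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        prev = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(prev)
    return hashes


def reusable_tokens_block_level(cached, new, block_size=16):
    cached_set = set(block_hashes(cached, block_size))
    reused = 0
    for h in block_hashes(new, block_size):
        if h not in cached_set:
            break
        reused += block_size
    return reused
```

With a 100-token shared prefix and 16-token blocks, block-level matching reuses 96 tokens, while token-level matching would reuse all 100. The gap is at most one block per request, so it matters most when shared prefixes are short or numerous.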

RadixAttention is enabled by default and uses LRU eviction that recursively frees leaf nodes when GPU memory fills up. The overhead when there is no cache hit is negligible — the tree traversal cost is dwarfed by the GPU computation (Zheng et al. 2023). It can be disabled with --disable-radix-cache for workloads where prefix sharing is rare and the tree maintenance isn’t worth it.

Scheduling and memory management

SGLang’s scheduler operates at the iteration level (Section 5.1), forming a new batch from the waiting queue after every forward pass. The main event loop receives requests, processes input, calls get_next_batch_to_run(), executes a GPU forward pass, then calls process_batch_result(). Unfinished requests loop back for further processing. This is the same continuous batching approach used by other frameworks, but SGLang layers several distinctive features on top.

Zero-overhead batch scheduling. Introduced in v0.4 (LMSYS 2024), this optimization overlaps CPU batch preparation with GPU execution. While the GPU runs the current batch’s forward pass, the CPU concurrently prepares the next batch, creating “future tokens” and carefully placing CUDA events and synchronization points so that the GPU never idles between consecutive decoding batches, which lowers TPOT. The SGLang team verified this via Nsight profiling: the GPU stays fully occupied across batch boundaries. The reported result is a 1.1x throughput improvement over SGLang v0.3 and a 1.3x speedup over the other frameworks benchmarked at the time of release (LMSYS 2024).

Cache-aware scheduling policies. The default scheduling policy is Longest Prefix Match (lpm), which prioritizes requests whose prompts share the longest prefix already in the radix cache. This is a natural pairing with RadixAttention — the scheduler actively steers toward cache hits rather than leaving them to chance. SGLang also supports dfs-weight (balances cache hits with tree traversal efficiency), fcfs (first come first serve), lof (longest output first), and random, configured via --schedule-policy (SGLang Team 2024).
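
The lpm ordering can be sketched as a sort over the waiting queue. This is a toy under stated assumptions: cached sequences are plain token lists rather than a radix tree, the helper names are hypothetical, and ties fall back to arrival order.

```python
def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def order_by_longest_prefix(waiting, cached_sequences):
    """Sort waiting prompts so those with the longest cached prefix run first."""
    def best_match(prompt):
        return max(
            (common_prefix_len(prompt, c) for c in cached_sequences), default=0
        )
    # sorted() is stable even with reverse=True, so ties keep arrival order.
    return sorted(waiting, key=best_match, reverse=True)
```

Running cache-hitting requests first also keeps their shared prefix hot, which makes it less likely to be evicted before the remaining requests that need it are scheduled.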

Ragged batching. Rather than using a traditional prefill mode, SGLang uses what it calls extend mode: it incrementally updates existing KV cache using ragged tensors. This is SGLang’s version of ragged batching, where sequences of different lengths in the same batch are packed together without padding. The extend mode determines whether a ragged forward pass is needed based on token count and wrapper count.

Chunked prefill is supported and enabled by default, configured via --chunked-prefill-size (default 8192, favoring throughput). If a prompt exceeds this value, it’s split into smaller chunks processed sequentially. Setting it to -1 disables chunking. SGLang also supports mixed chunked prefill (--enable-mixed-chunk, disabled by default), where prefill and decode operations are mixed within the same batch in a single forward pass. The current limitation is that mixed mode uses a single prefill attention kernel for the whole batch rather than separate optimal kernels for prefill versus decode requests (SGLang Team 2024).
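
The splitting rule itself is simple enough to state in code. The helper below is a hypothetical sketch that mirrors the behavior described above (the 8192 default and the -1 sentinel come from the text); it says nothing about how the scheduler interleaves the resulting chunks with decode work.

```python
def split_prefill(prompt_len, chunk_size=8192):
    """Return (start, end) token ranges for a prompt; chunk_size=-1 disables chunking."""
    if chunk_size == -1 or prompt_len <= chunk_size:
        return [(0, prompt_len)]
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive or -1")
    return [
        (start, min(start + chunk_size, prompt_len))
        for start in range(0, prompt_len, chunk_size)
    ]
```

A 20,000-token prompt therefore becomes three sequential forward passes at the default setting, with the last chunk shorter than the others.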

Priority-based scheduling is opt-in via --enable-priority-scheduling. When enabled, requests can carry explicit priorities that override the scheduling policy ordering. The scheduler can preempt running requests to make room for higher-priority arrivals, evicting lowest-priority running requests first. The preemption threshold is controlled by priority_scheduling_preemption_threshold (default: 10). This feature is still maturing — a known issue is that excessive preemption can occur where low-priority requests keep getting preempted even after the minimum token removal threshold is satisfied (SGLang Team 2024).

Preemption in SGLang uses recomputation only. There is no swap-to-CPU preemption strategy. When a request is preempted, its KV cache is discarded, and when the request is rescheduled, its KV cache must be recomputed from scratch. This is a deliberate simplification compared to vLLM’s hybrid swap/recompute approach: it avoids the complexity and PCIe bandwidth cost of CPU swapping, but it makes preemption more expensive for long sequences whose KV cache took significant prefill work to build.

OOM handling is adaptive rather than static. The new_token_ratio parameter controls how aggressively the scheduler estimates future token consumption. When a batch executes successfully, the ratio decreases (more aggressive batching). When OOM occurs, the ratio increases (smaller batches). The --max-running-requests flag provides an additional hard cap on concurrent decoding sequences. The --mem-fraction-static parameter controls the ratio of GPU memory allocated to model weights plus the KV cache pool; the rule of thumb is to reserve 5–8 GB for activations, reducing the fraction to 0.8 or 0.7 if OOM occurs (SGLang Team 2024).
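
The feedback loop can be sketched as a small controller. Everything numeric here is an assumption chosen for illustration (initial value, floor, decay step, and bump size), and the class is hypothetical rather than SGLang's code; only the direction of the adjustments comes from the description above.

```python
class NewTokenRatio:
    """Toy controller: lower ratio means more optimistic, denser batching."""

    def __init__(self, init=0.7, floor=0.1, decay=0.02, bump=0.3):
        self.ratio, self.floor, self.decay, self.bump = init, floor, decay, bump

    def on_success(self):
        # The batch fit in memory, so estimate future usage more aggressively.
        self.ratio = max(self.floor, self.ratio - self.decay)

    def on_oom(self):
        # The estimate was too optimistic, so back off sharply.
        self.ratio = min(1.0, self.ratio + self.bump)

    def reserved_tokens(self, remaining_new_tokens):
        # Budget only a fraction of each running request's remaining output allowance.
        return sum(int(n * self.ratio) for n in remaining_new_tokens)
```

The asymmetry (small decay, large bump) is a common choice for this kind of controller because an OOM is far more disruptive than a slightly underfilled batch.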

SGLang does not implement explicit request rejection under memory pressure at the engine level. The Model Gateway layer does support token-bucket rate limiting with FIFO queuing, returning 429 or 408 status codes, but this is admission control at the routing layer rather than backpressure from the GPU engine.
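
For readers unfamiliar with token-bucket rate limiting, here is a generic sketch of the admission check. It is not the Model Gateway's implementation: the class is hypothetical, time is passed in explicitly to keep the example deterministic, and the FIFO queue and HTTP status mapping are left out.

```python
class TokenBucket:
    def __init__(self, rate_per_s, capacity):
        self.rate, self.capacity = rate_per_s, capacity
        self.tokens, self.last = float(capacity), 0.0

    def try_acquire(self, now, cost=1.0):
        """Refill for the elapsed time, then admit the request if tokens remain."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The capacity sets how large a burst is tolerated and the rate sets the sustained throughput. A request that fails the check would be queued or rejected by the surrounding gateway logic.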

HiCache extends RadixAttention with a hierarchical KV caching system across three memory tiers: GPU memory (L1), host/CPU memory (L2), and distributed storage (L3) (LMSYS 2025c). A HiRadixTree acts as a page table for KV caches across these tiers. GPU-assisted I/O kernels provide up to 3x higher throughput for CPU-GPU transfers compared to naive copy. HiCache supports multiple storage backends including 3FS, Mooncake, NIXL, AIBrix KVCache, and local file, with configurable write policies: write-through, write-through-selective (which uses hit-count tracking to identify hot spots), and write-back. The performance impact is substantial: up to 6x throughput improvement and up to 80% reduction in TTFT for cache-heavy workloads (LMSYS 2025c).

Attention backends

SGLang doesn’t commit to a single attention kernel either, and its backend selection matrix is unusually broad, spanning not just GPU generations but entire hardware families.

The auto-selection logic follows GPU compute capability (SGLang Team 2024):

  • Triton is the default on Turing (SM75) GPUs.
  • FlashInfer is the default on Ampere (SM80, SM86) and Ada Lovelace (SM89) GPUs. FlashInfer provides a unified API that abstracts over multiple kernel implementations and serves as the primary backend for older NVIDIA architectures. NVIDIA releases its most performant inference kernels — including TensorRT-LLM kernels — through FlashInfer for integration into SGLang.
  • FlashAttention-3 is the default on Hopper (SM90) for standard multi-head attention models. SGLang skipped FA1 and FA2 as explicit backends, jumping directly to FA3 with FlashInfer covering earlier-generation workloads. FA3 became the default attention backend as of v0.4.6 for mainstream MHA models on Hopper, chosen for its native paged KV cache support via the flash_attn_with_kvcache API. It consistently delivers the highest throughput across tested scenarios, outperforming FlashInfer and Triton especially as input/output size increases (SGLang Team 2024).
  • FlashInfer remains the default on Hopper for non-MHA models (MLA, GQA variants).
  • FlashAttention-4 is supported for Blackwell (SM100+) GPUs with --page-size 128 for MHA and --page-size 1 for MLA. FA4 supports FP4 KV cache, but currently has a limitation: decode speed degrades as sequence length grows due to lack of SplitKV support on Hopper, making it primarily useful for prefill in some configurations.

For Multi-Head Latent Attention models like DeepSeek, SGLang provides a dedicated backend matrix: FlashInfer MLA, FlashMLA (from DeepSeek), Cutlass MLA, TRTLLM MLA, FA3, and FA4. FlashInfer MLA operates with page_size=1 and supports FP4 KV cache and prefix caching. On Blackwell, TRTLLM prefill/decode DSA kernels are the default (SGLang Team 2024).

SGLang also supports hybrid prefill/decode backend configuration via --prefill-attention-backend and --decode-attention-backend, allowing different kernels for each phase — a recognition that the optimal kernel for compute-bound prefill is often different from the optimal kernel for memory-bound decode.

Beyond the attention backends, hardware support extends to AMD (ROCm via AITER and Wave backends), Huawei Ascend (NPU), and Intel (XPU), though the NVIDIA backends are the most mature (SGLang Team 2024).

Kernel optimization

SGLang’s approach to kernel optimization centers on sgl-kernel, a standalone high-performance CUDA kernel library that provides optimized primitives for quantization, MoE, attention, and GEMM operations. sgl-kernel supports multiple architectures (SM80, SM89, SM90, SM90a, SM100a+) and integrates third-party libraries including CUTLASS, FlashInfer, and DeepGEMM.

Piecewise CUDA graphs are enabled by default. SGLang splits model computation into separate pieces captured as individual CUDA graphs for predefined token lengths. Inputs are padded to the nearest captured size at runtime, eliminating kernel launch overhead. Memory is optimized via a global shared memory pool across runners, with capture in reverse order (largest-first). The approach auto-disables for speculative decoding, pipeline parallelism, LoRA, VLM models, non-CUDA hardware, and deterministic inference — cases where the static capture assumption breaks down (SGLang Team 2024).
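
The padding step amounts to a lookup over the captured sizes. The sketch below assumes that "nearest" means the smallest captured size that fits the batch, and that anything larger than the biggest capture falls back to eager execution; both are assumptions, and the function name is hypothetical.

```python
import bisect


def padded_size(num_tokens, captured_sizes):
    """Smallest captured size >= num_tokens, or None when no captured graph fits."""
    sizes = sorted(captured_sizes)
    idx = bisect.bisect_left(sizes, num_tokens)
    return sizes[idx] if idx < len(sizes) else None
```

The choice of captured sizes is a tradeoff: a denser set wastes less compute on padding but costs more capture time and graph memory.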

torch.compile integration is experimental. SGLang implements a custom SGLangBackend that traces model forward passes as FX graphs, splits at registered split points (such as MoE dispatch operations), compiles each piece separately for dynamic shapes, and dispatches at runtime through eager split ops and per-piece replay. Custom CUDA kernels must be registered via register_custom_op for compatibility. Currently, torch.compile is only supported in combination with CUDA graphs, not standalone (SGLang Team 2024).

KV cache engineering

Beyond RadixAttention’s tree-based management, SGLang supports several KV cache optimization techniques.

Paged KV cache is supported across backends, though with different page size constraints. FA3 provides native paged KV cache support via the flash_attn_with_kvcache API, which accepts the entire page table directly. Different backends impose different page size requirements: FA4 requires page_size=128 for MHA, FlashInfer MLA uses page_size=1, and TRTLLM uses fixed sizes of 16, 32, or 64 (SGLang Team 2024).

KV cache quantization supports FP8 E5M2 (larger dynamic range), FP8 E4M3 (higher precision, recommended), and experimental FP4 E2M1 (MXFP4 with 16-element blocks), enabled via --kv-cache-dtype. FP8 provides roughly 2x memory savings over BF16; FP4 provides roughly 3.56x savings accounting for scaling overhead. FP8 requires per-tensor scaling factors from checkpoints or a JSON file (defaulting to 1.0 with an accuracy warning), while FP4 handles scaling automatically. INT8 KV cache is not supported — only FP8 and FP4. A critical caveat: not all backends support fused dequantization with the attention kernel, and using quantized KV with an unsupported backend can be “extremely slow” (SGLang Team 2024).
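
The savings figures follow from simple bit accounting. The sketch below assumes one 8-bit scale shared by each 16-element FP4 block, which is an assumption about the format that happens to reproduce the 3.56x figure; the function name is hypothetical.

```python
def kv_savings_vs_bf16(elem_bits, block_size=None, scale_bits=0):
    """Compression ratio relative to 16-bit BF16, amortizing one scale per block."""
    overhead = scale_bits / block_size if block_size else 0
    return 16 / (elem_bits + overhead)
```

With per-tensor scaling, the FP8 overhead is negligible and the ratio is 2. For FP4, the per-block scale adds half a bit per element, giving 16 / 4.5, or about 3.56, rather than the nominal 4.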

MLA support is first-class. MLA stores latent representations rather than full K/V heads, significantly reducing KV cache memory. SGLang provides dedicated MLA backends and supports FP4 KV cache quantization for MLA models. Under data parallelism, MLA’s latent state caches are stored on different GPUs, with requests split across them (SGLang Team 2024).

Selective KV cache and token eviction are supported through multiple mechanisms. DoubleSparsity provides approximated attention with token selection for KV cache compression. Token eviction methods (SnapKV, PyramidKV) permanently drop non-important tokens for long-context and streaming scenarios. FlashInfer supports vector-sparsity (page_size=1), enabling efficient KV cache token pruning at token granularity. Native Sparse Attention (NSA) is also supported, with a fuse store indexer for K cache and a configurable KV length threshold for sparse MLA attention at prefill — improving throughput for DeepSeek V3.2 and GLM-5 (SGLang Team 2024).

Speculative decoding

SGLang supports several speculative decoding methods (Section 6.5), with a particular emphasis on the EAGLE family.

EAGLE-2 and EAGLE-3 are the primary methods, using tree-structured speculative decoding with branching controlled by --speculative-eagle-topk. EAGLE-2 uses a draft model to evaluate branch probability and dynamically stops expansion of unlikely branches; after expansion, reranking selects the top --speculative-num-draft-tokens final nodes for verification. EAGLE-3 removes the feature prediction objective, incorporates low and mid-layer features, and is trained on-policy. The performance difference is significant: on LLaMA-3.1-8B with an H100, EAGLE-2 achieves 244 tokens/s (54% improvement over baseline), while EAGLE-3 reaches 373 tokens/s (136% improvement over baseline of 158 tokens/s). SGLang also supports EAGLE-2 combined with FR-Spec, which uses a truncated high-frequency vocabulary to reduce lm_head overhead, configured via --speculative-token-map (SGLang Team 2024).

SGLang provides SpecForge, its own training framework for EAGLE draft models — making it one of the few serving frameworks that includes tooling for the draft model training side, not just inference.

Standalone draft model decoding is supported (--speculative-algorithm STANDALONE), using a separate smaller model for token-level drafting. The limitation is that it cannot be combined with --enable-dp-attention (SGLang Team 2024).

N-gram speculative decoding (--speculative-algorithm NGRAM) drafts tokens from the previous context cache with no separate model required. It’s configurable via --speculative-ngram-max-bfs-breadth (1–10, default 10), --speculative-ngram-match-type (“BFS” or “PROB”), --speculative-ngram-max-trie-depth (max 18), and --speculative-ngram-capacity (up to 10M entries). Like standalone decoding, it cannot be combined with data-parallel attention; it is also incompatible with overlap scheduling and mixed chunked prefill (SGLang Team 2024).

Multi-token prediction is supported for models with built-in MTP heads, specifically DeepSeek V3/V3.1/V3.2. MTP operates in two stages: lightweight heads generate \(n\) candidates in a single pass, then the full model validates all drafts in parallel, accepting the longest matching prefix. At moderate scale (16 H200 GPUs), MTP delivers +59.8% throughput with 3-token prediction and +60.8% with 4-token prediction. At larger scale (128 H200 GPUs), the gains are more modest at +14.2% output throughput. Average acceptance length is 2.4 tokens, with a recommended starting draft_token_num of 2 (LMSYS 2025a).

Lookahead decoding (Jacobi iteration) has partial support, contributed via a community PR. An experimental Speculative Decoding V2 with an overlap scheduler is available via SGLANG_ENABLE_SPEC_V2=True, though it requires --speculative-eagle-topk 1 (SGLang Team 2024).

Parallelism

SGLang supports data parallelism, tensor parallelism, pipeline parallelism, expert parallelism, and context parallelism, with total world size computed as \(\text{TP} \times \text{PP} \times \text{EP} \times \text{DP}\).

Tensor parallelism (--tp <size>) shards model weights across GPUs within a node, synchronized via AllReduce — the standard approach (Section 7.3).

Pipeline parallelism is where SGLang has made a distinctive contribution. Chunked Pipeline Parallelism, introduced in January 2026 (LMSYS 2026), splits long sequences into chunks and processes different chunks in parallel across pipeline stages. It uses asynchronous P2P communication with non-blocking sends and receives, and a micro-batching event loop that overlaps GPU computation with CPU metadata processing and PP communication. Dynamic chunking adapts chunk sizes at runtime rather than using a fixed split.
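
Why chunking helps pipeline utilization can be seen from an idealized model. The sketch below assumes equal-cost chunks and zero communication cost, neither of which holds in practice (later chunks attend over more cached tokens, which is part of the motivation for dynamic chunk sizes). The function name is hypothetical.

```python
def pipeline_efficiency(num_chunks, num_stages):
    """Fraction of ideal speedup achieved by an idealized chunked pipeline."""
    # Fill, steady state, and drain together take this many chunk-steps.
    steps = num_chunks + num_stages - 1
    speedup = num_chunks * num_stages / steps
    return speedup / num_stages
```

With a single chunk, four stages run strictly one after another and efficiency is 25%. Splitting the same prompt into 16 chunks raises the idealized figure to 16/19, about 84%, because stages work on different chunks at the same time.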

The performance results are notable: PP4 TP8 multi-node yields a 3.31x prefill throughput improvement for DeepSeek-V3.1 on an H20 cluster compared to TP8 with 12K chunked prefill. Scaling efficiency stays above 80% at PP4. For ultra-long prompts (Qwen3-235B-A22B-FP8 on H20 with PP8), chunked pipeline parallelism delivers up to 81% reduction in TTFT (LMSYS 2026).

Expert parallelism (--ep <size> plus --enable-ep-moe) distributes expert weights across multiple devices in MoE models using optimized all-to-all communication and grouped GEMMs. The all-to-all backend is configurable via --moe-a2a-backend with options including DeepEP, Mooncake (DeepEP extension with RDMA), NIXL (NVIDIA NIXL with RDMA and NVLink), MORI (AMD ROCm native), FlashInfer, and Ascend. A key constraint: DeepEP, Mooncake, NIXL-EP, Ascend, and MORI only support ep_size = tp_size. For ep_size < tp_size, only the default AllReduce/AllGather backend works (SGLang Team 2024).

The MoE computation backend is separately configurable via --moe-runner-backend, with options including Triton, DeepGEMM, CUTLASS, and FlashInfer variants for Blackwell (TRT-LLM kernels), FP4/FP8 (CUTLASS), MXFP4, and cuDSL. DeepEP dispatch mode (--deepep-mode) can be set to auto (runtime switching), normal (optimized for prefill), or low_latency (optimized for decode, CUDA graph compatible) (SGLang Team 2024).

For expert load balancing, SGLang supports EPLB (Expert Parallelism Load Balancer from DeepSeek) via --enable-eplb, plus --enable-two-batch-overlap (up to 2x throughput) and --enable-single-batch-overlap for overlapping communication with computation (SGLang Team 2024).

Data parallelism (--dp <size>) replicates the model across multiple scheduler instances. SGLang also supports --enable-dp-attention for data-parallel attention, which is distinct from standard DP: latent state caches for different requests are stored on different GPUs, which is particularly useful for MLA models where the KV cache can’t be sharded along the head dimension (SGLang Team 2024).

Context parallelism is supported for the prefill phase, distributing long sequences across GPUs for MHA models. This is distinct from pipeline parallelism — it splits a single sequence’s tokens across GPUs rather than distributing layers (LMSYS 2026).

Combined parallelism patterns are well-supported. PP4 TP8 has been demonstrated for DeepSeek-V3.1 on multi-node H20 clusters. TP within a node combined with EP across nodes has been demonstrated at scale with 96 H100 GPUs for DeepSeek deployment. Different parallelism configurations can be used per phase in disaggregated serving: for example, PP8 TP8 for prefill and PP1 DP16 EP16 for decode (LMSYS 2025b).

Multi-node deployment is configured via --nnodes, --node-rank, and --dist-init-addr, with support for Kubernetes (via LWS_LEADER_ADDRESS and LWS_GROUP_SIZE environment variables) and SLURM clusters (SGLang Team 2024).

Disaggregated prefill

SGLang supports disaggregated prefill (Section 5.2) where prefill and decode run on separate node pools connected via high-performance communication backends. KV cache transfer between pools uses RDMA-based transports including the Mooncake Transfer Engine and NIXL. Send and receive operations are non-blocking, running in background threads, and use scatter-gather elements (SGE) in RDMA to transfer non-contiguous memory chunks efficiently.

At scale, this has been demonstrated on 12 nodes of 8 H100 GPUs, achieving 52.3k input tokens/s and 22.3k output tokens/s per node for 2000-token inputs. The Model Gateway supports PD disaggregation routing with separate prefill and decode worker pools (LMSYS 2025b).

Quantization

SGLang supports a broad range of weight quantization methods: fp8, mxfp4, blockwise_int8, awq, gptq, compressed-tensors, quark, auto-round, awq_marlin, gptq_marlin, gguf, modelopt/modelopt_fp8 (Hopper SM90+), modelopt_fp4 (Blackwell SM100+), bitsandbytes, and torchao. Both offline quantization (pre-quantized weights) and online quantization (dynamic scaling at runtime) are supported, though offline is recommended for better performance (SGLang Team 2024).

For weight-activation quantization, SGLang supports W8A8 with --quantization w8a8_int8 or w8a8_fp8, using optimized CUTLASS int8 or fp8 kernels from sgl-kernel for per-channel W8A8 with per-token dynamic activation quantization (SGLang Team 2024).

GPTQ and AWQ are supported both in their standard forms and with Marlin kernel optimization (gptq_marlin, awq_marlin) (SGLang Team 2024).

The GEMM backend is configurable for both FP8 (--fp8-gemm-backend with options including DeepGEMM, FlashInfer TRT-LLM, FlashInfer CUTLASS, CUTLASS, Triton, and AITER) and FP4 (--fp4-gemm-backend with CUTLASS, FlashInfer CUTLASS, FlashInfer cuDNN, and FlashInfer TRT-LLM). AMD GPUs have their own supported quantization methods including fp8, mxfp4, w8a8_int8, w8a8_fp8, and petit_nvfp4 (NVFP4-on-ROCm) (SGLang Team 2024).

Torchao online quantization options include int8dq, int8wo, fp8wo, fp8dq (per_tensor/per_row), and int4wo (group sizes 32/64/128/256). Note that int8dq has CUDA graph capture bugs requiring --disable-cuda-graph (SGLang Team 2024).

There are several practical limitations to be aware of. Mixed-bit quantization is incompatible with layer fusion (QKV fusion issues). Quantized MoE models may hit kernel limitations with mlp.gate layers. Quantized VLM support is limited, with some models showing near-zero accuracy. ModelOpt online quantization causes high startup overhead and increased VRAM usage. Pre-quantized models (such as DeepSeek V3/R1 with native FP8) should not have additional --quantization flags applied (SGLang Team 2024).

Operational features

Structured output and guided decoding. SGLang supports JSON schema (via Pydantic BaseModel or raw schema), regular expressions, and EBNF grammar (GGML BNF format for XGrammar). The grammar backend is configurable via --grammar-backend: XGrammar (default, best performance, supports JSON/regex/EBNF), Outlines (JSON/regex only), or Llguidance (JSON/regex/EBNF). The key optimization is that per-step mask generation is overlapped with LLM inference, hiding grammar processing latency and achieving constrained decoding with near-zero overhead at production scale. SGLang also supports structural tags for multiple constrained regions within a single response — useful for tool calling where some parts of the response are free-form and others must conform to a schema (SGLang Team 2024).

Multi-LoRA serving integrates S-LoRA and Punica for efficient multi-adapter batching. Key configuration parameters include --max-loras-per-batch (default 8), --max-loaded-loras (CPU memory limit), and --lora-backend (triton or csgmv; csgmv is the default, providing 20–80% latency improvement at high concurrency). SGLang supports GPU pinning for frequently-used adapters, LRU/FIFO eviction policies, and async overlap loading (--enable-lora-overlap-loading, yielding roughly 35% TTFT reduction). Adapters can be loaded and unloaded dynamically via REST endpoints. The API supports both a native lora_path parameter and OpenAI-compatible model:adapter-name syntax. Piecewise CUDA graphs auto-disable when LoRA is active (SGLang Team 2024).

The Model Gateway is a Rust-based router that sits in front of one or more SGLang engine instances. It implements five load balancing policies: random, round_robin, power_of_two (sample 2 workers, pick the lighter one), cache_aware (default, balancing cache locality with load distribution), and bucket (dynamic load buckets). The cache-aware policy is particularly interesting: it routes requests to the instance most likely to have relevant prefix cache entries in its radix tree, achieving up to 1.9x throughput increase with a 3.8x higher cache hit rate compared to cache-oblivious routing (LMSYS 2024). Cache-aware tuning parameters include --cache-threshold, --balance-abs-threshold, --balance-rel-threshold, and --eviction-interval-secs (SGLang Team 2024).
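To make the power_of_two policy concrete, here is a minimal Python sketch of the "sample two, pick the lighter one" idea. It is not the Model Gateway's Rust implementation: the function name, the load dictionary, and the use of in-flight request counts as the load signal are all assumptions made for illustration.

```python
import random
from typing import Dict, Optional


def pick_worker_power_of_two(loads: Dict[str, int],
                             rng: Optional[random.Random] = None) -> str:
    """Sample two distinct workers at random and return the less loaded one.

    `loads` maps a (hypothetical) worker name to its current load, for
    example the number of in-flight requests.
    """
    rng = rng or random.Random()
    workers = list(loads)
    if not workers:
        raise ValueError("no workers available")
    if len(workers) == 1:
        return workers[0]
    first, second = rng.sample(workers, 2)
    return first if loads[first] <= loads[second] else second


if __name__ == "__main__":
    in_flight = {"worker-0": 12, "worker-1": 3, "worker-2": 7}
    print(pick_worker_power_of_two(in_flight))
```

Comparing only two random candidates keeps each routing decision cheap while still steering traffic away from the most heavily loaded worker, which can never win a comparison.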

For reliability, the Model Gateway provides circuit breakers (three states: closed, open, half-open), exponential backoff retries with jitter (retryable status codes: 408, 429, 500, 502, 503, 504), API key authentication, and TLS/mTLS (SGLang Team 2024).
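The two reliability mechanisms can be sketched in a few lines of Python. This is an illustrative model, not the gateway's code: the class and function names, the failure threshold, the cooldown, and the choice of "full jitter" backoff are assumptions. Only the three breaker states and the list of retryable status codes come from the description above.

```python
import random
from typing import Optional

# Retryable status codes as listed in the text above.
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    rng = rng or random.Random()
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


class CircuitBreaker:
    """Three-state breaker: closed -> open -> half-open -> closed or open."""

    def __init__(self, failure_threshold: int = 3, cooldown_secs: float = 10.0):
        self.failure_threshold = failure_threshold  # assumed tuning knob
        self.cooldown_secs = cooldown_secs          # assumed tuning knob
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self, now: float) -> bool:
        # After the cooldown, move to half-open so probe traffic can flow.
        if self.state == "open" and now - self.opened_at >= self.cooldown_secs:
            self.state = "half-open"
        return self.state != "open"

    def record(self, status: int, now: float) -> None:
        if status in RETRYABLE_STATUS:
            self.failures += 1
            # A failed probe, or too many failures, (re)opens the breaker.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = now
        else:
            # Simplification: any non-retryable status counts as success.
            self.state = "closed"
            self.failures = 0
```

The jitter matters as much as the exponent: without it, clients that failed together retry together and hit the recovering worker in synchronized waves.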

Monitoring. The Model Gateway exposes over 40 Prometheus metrics across HTTP (requests_total, request_duration_seconds, rate_limit_total), router (ttft_seconds, tpot_seconds, generation_duration_seconds), worker (pool_size, connections_active, health_checks_total), and circuit breaker categories, with duration histogram buckets from 1ms to 240s. OpenTelemetry tracing exports to OTLP/gRPC with W3C Trace Context propagation for distributed tracing across the full request lifecycle (SGLang Team 2024).

Multimodal support covers 30+ model architectures including Llama 3.2 Vision, LLaVA 1.5/NeXT, Qwen3-VL, NVILA, and DeepSeek-VL2. FA4 support has been added for multimodal encoders. VLMs auto-disable piecewise CUDA graphs because vision token lengths vary from request to request. SGLang Diffusion extends the framework to video and image generation workloads (SGLang Team 2024).

Deployment modes include co-launch (single process), separate HTTP workers, gRPC workers, PD disaggregation, OpenAI-compatible proxy, and a multi-model gateway (IGW) with per-model policies. WASM middleware supports custom request/response processing, and MCP integration enables tool execution. Reasoning parser integration supports DeepSeek-R1, Qwen-3, and GLM-4.5 thinking formats (SGLang Team 2024).

8.9 TensorRT-LLM

TensorRT-LLM takes a fundamentally different design path from vLLM and SGLang. Where those frameworks operate primarily at the Python level — scheduling, memory management, and model execution all live in Python with selective CUDA kernel calls — TensorRT-LLM compiles the entire model into an optimized execution plan before serving begins. The result is a system that trades flexibility and hardware breadth for raw throughput on NVIDIA hardware.

Architecture and compilation

TensorRT-LLM has gone through a significant architectural shift. The legacy TensorRT backend compiled the model graph through NVIDIA’s TensorRT engine, which performed kernel selection, operator fusion, and memory optimization at build time. This produced highly optimized execution plans but required a heavyweight compilation step — rebuilding the engine for different batch sizes, sequence lengths, or quantization configurations.

The PyTorch workflow, which became the default in v1.0, takes a different approach (NVIDIA 2024d). Instead of compiling through TensorRT’s graph optimizer, it runs the model through PyTorch and applies optimizations via torch.compile and piecewise CUDA graphs — the same strategy vLLM V1 adopted (Section 8.7.1). The PyTorch workflow offers more flexibility: you can modify model code without recompiling, swap attention backends, and iterate faster during development. The tradeoff is that the legacy backend can still squeeze out a few extra percent of throughput for well-characterized workloads where the compilation cost is amortized across many requests.

An AutoTuner benchmarks and selects optimal kernel implementations for each specific configuration, trading compile-time for runtime efficiency (NVIDIA 2024d, 2024i). This is the compilation philosophy in a nutshell: spend time upfront to make every kernel call as fast as possible.

Scheduling and memory management

TensorRT-LLM’s scheduler uses the same continuous batching approach as the other frameworks (Section 5.1), which NVIDIA calls in-flight batching: the batch composition changes at every iteration, with finished sequences immediately evicted and new requests inserted (NVIDIA 2024d, 2024f). One constraint is that context-phase sequences must precede generation-phase sequences in the packed input tensor, which affects how the scheduler orders work within each step.

Chunked prefill splits context processing into chunks that are batched with generation tokens, just as in vLLM and SGLang (Section 5.2). It requires paged KV cache and fused multi-head attention (FMHA) to be enabled, and chunk sizes (except the final one) must be integer multiples of the KV cache block size (NVIDIA 2024f, 2024d).
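The block-size constraint on chunk boundaries is easy to see in a small sketch. The helper below is hypothetical (it is not a TensorRT-LLM API) and simply splits a prompt into token ranges, rejecting chunk sizes that are not a multiple of the KV cache block size.

```python
from typing import List, Tuple


def split_prefill(prompt_len: int, chunk_size: int,
                  block_size: int) -> List[Tuple[int, int]]:
    """Return [start, end) token ranges for each prefill chunk.

    Every chunk except possibly the last holds exactly `chunk_size` tokens,
    and `chunk_size` must be a multiple of the KV cache block size so that
    chunk boundaries always coincide with block boundaries.
    """
    if chunk_size <= 0 or chunk_size % block_size != 0:
        raise ValueError("chunk_size must be a positive multiple of block_size")
    return [(start, min(start + chunk_size, prompt_len))
            for start in range(0, prompt_len, chunk_size)]


if __name__ == "__main__":
    # A 300-token prompt with 128-token chunks and 64-token blocks.
    print(split_prefill(300, 128, 64))
```

Only the final chunk is allowed to end mid-block, because no later chunk of the same prompt needs to start where it leaves off.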

Preemption is where TensorRT-LLM’s design choices diverge most from the other frameworks. It offers two separate preemption strategies under the MAX_UTILIZATION allocation policy, rather than the hybrid swap-with-recompute-fallback that vLLM uses (NVIDIA 2024g; SqueezeBits 2025):

  • Swap: the preempted request’s KV cache is copied to host (CPU) memory and reloaded when the request resumes. This preserves computed state but introduces “significant overhead” from PCIe transfers, particularly for long sequences.
  • Drop (recomputation): the KV cache is discarded entirely. On resumption, the original context plus any tokens generated so far are concatenated and run through a single prefill pass to reconstruct the cache. This is preferred over swapping because it only requires one forward pass rather than a round-trip through host memory.

The default policy is GUARANTEED_NO_EVICT, which takes a more conservative approach than either preemption strategy: it preallocates memory for the maximum output length of each request at admission time. If there isn’t enough KV cache memory to guarantee a request can run to completion, the request is simply not admitted. This eliminates preemption entirely at the cost of smaller batch sizes, since reserved-but-unused memory can’t serve other requests. It’s the safe choice for latency-sensitive deployments where preemption would cause unacceptable tail latency spikes.
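A minimal sketch of this admission rule, assuming a simple block-counting model of KV cache memory. The class and field names are hypothetical and the real scheduler tracks considerably more state; the point is only that the reservation covers the prompt plus the maximum output length, not the tokens generated so far.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    prompt_len: int
    max_new_tokens: int


class NoEvictAdmission:
    """Admit a request only if its worst-case KV footprint fits right now."""

    def __init__(self, total_blocks: int, block_size: int):
        self.free_blocks = total_blocks
        self.block_size = block_size

    def blocks_needed(self, req: Request) -> int:
        # Reserve for the prompt plus the maximum possible output.
        worst_case_tokens = req.prompt_len + req.max_new_tokens
        return math.ceil(worst_case_tokens / self.block_size)

    def try_admit(self, req: Request) -> bool:
        need = self.blocks_needed(req)
        if need > self.free_blocks:
            return False  # stays queued; nothing already running is preempted
        self.free_blocks -= need
        return True

    def release(self, req: Request) -> None:
        # Called when the request finishes; its full reservation is returned.
        self.free_blocks += self.blocks_needed(req)
```

A request that stops after 20 tokens still held blocks for its full max_new_tokens while it ran, which is exactly the reserved-but-unused memory that shrinks the achievable batch size.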

Priority-based KV cache eviction adds a finer-grained control layer on top of these policies. You can assign priority levels (0–100) to specific token ranges within a request, plus duration values that control how long the priority applies (NVIDIA 2024c). The practical use case is system prompts: assign them maximum priority so they remain cached longer, improving reuse for requests that share the same system prompt.
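One way to picture how priority and duration interact with LRU is the sketch below. It is a toy model under stated assumptions: the Block fields, the tie-breaking rule, and the fallback to a default priority of 35 once a duration expires are illustrative choices, not details confirmed by the text above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

DEFAULT_PRIORITY = 35  # assumed fallback value once a duration has lapsed


@dataclass
class Block:
    block_id: int
    priority: int                       # 0 (evict first) to 100 (evict last)
    last_used: float                    # timestamp of the most recent access
    expires_at: Optional[float] = None  # when the assigned priority lapses


def effective_priority(block: Block, now: float) -> int:
    """Assigned priority while the duration is active, else the default."""
    if block.expires_at is not None and now >= block.expires_at:
        return DEFAULT_PRIORITY
    return block.priority


def pick_victim(blocks: Iterable[Block], now: float) -> int:
    """Evict the lowest effective priority first; break ties by LRU."""
    victim = min(blocks, key=lambda b: (effective_priority(b, now), b.last_used))
    return victim.block_id
```

With a system-prompt block pinned at priority 100 for 60 seconds, ordinary blocks are evicted first during that window even if the system prompt was touched less recently; after the window it competes under plain LRU again.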

OOM handling is less adaptive than what vLLM and SGLang provide. There is no automatic dynamic batch size reduction at runtime — max_batch_size is configured at build or launch time. NVIDIA’s guidance is to profile throughput across batch sizes, set max_batch_size 20–30% above the throughput knee, and reduce it if KV cache utilization consistently exceeds 80% (NVIDIA 2024d). This is a manual tuning process rather than the self-adjusting ratios that SGLang uses (Section 8.8.2).

There is no built-in SLA-aware scheduling — no mechanism to automatically adjust scheduling based on per-request latency targets. The priority-based KV cache eviction and request admission policies provide building blocks, but tying them to SLA tiers is left to the orchestration layer.

Attention backends and kernel optimization

This is where TensorRT-LLM’s compilation heritage shows most clearly. Rather than selecting from a menu of third-party attention backends the way vLLM and SGLang do, TensorRT-LLM primarily uses its own custom attention kernels, with FlashInfer available as a pluggable alternative (NVIDIA 2024d, 2024f).

The context phase (prefill) uses NVIDIA’s fused multi-head attention (FMHA) kernels, which implement FlashAttention v1 and v2 algorithms for larger sequences. For short sequences, vanilla MHA/MQA runs instead. Two FMHA modes are available: the standard mode, and a variant that forces FP32 accumulation in the first batched matrix multiplication for improved numerical accuracy. FP8 context FMHA accelerates attention on Ada Lovelace and Hopper GPUs and can be combined with paged KV cache (NVIDIA 2024f).

The generation phase uses XQA (eXtended Query Attention), a specialized kernel for MQA and GQA models that is distinct from TensorRT-LLM’s other attention paths (NVIDIA 2024f). XQA supports FP16 and BF16 compute with FP16, BF16, FP8, and INT8 KV cache, and paged KV cache with block sizes of 8, 16, 32, 64, or 128 tokens. It’s enabled by default and can be disabled with --disable_xqa.

Multi-block mode is TensorRT-LLM’s version of FlashDecoding (Section 6.1) — it distributes attention computation across multiple CUDA thread blocks during the generation phase to improve GPU occupancy when a single query attends to a long KV cache. It’s enabled by default since v0.13 (NVIDIA 2024f).
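The arithmetic behind splitting one query's attention across several thread blocks can be shown in plain Python. This is a sketch of the general split-KV technique, not NVIDIA's kernel: each chunk computes a partial softmax with its own running maximum, and the partial results are merged by rescaling to a common maximum. The chunks are processed sequentially here, whereas on a GPU each would run on a different thread block. All function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def partial_attention(q: Vector, keys: Sequence[Vector],
                      values: Sequence[Vector]) -> Tuple[float, float, List[float]]:
    """Attention over one KV chunk: (max score, sum of exps, unnormalized output)."""
    scores = [dot(q, k) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, sum(weights), out


def split_kv_attention(q: Vector, keys: Sequence[Vector],
                       values: Sequence[Vector], num_splits: int) -> List[float]:
    """Single-query attention computed chunk by chunk, then merged."""
    chunk = math.ceil(len(keys) / num_splits)
    parts = [partial_attention(q, keys[i:i + chunk], values[i:i + chunk])
             for i in range(0, len(keys), chunk)]
    # Rescale every partial result to the global maximum before combining.
    m_all = max(m for m, _, _ in parts)
    denom = sum(s * math.exp(m - m_all) for m, s, _ in parts)
    dim = len(values[0])
    return [sum(o[d] * math.exp(m - m_all) for m, _, o in parts) / denom
            for d in range(dim)]
```

Because the merge is exact, the number of splits changes only how the work is distributed, not the result (up to floating-point rounding).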

The kernel fusion story goes well beyond attention. The masked multi-head attention kernel fuses QKV bias addition, RoPE application, and dequantization/quantization into a single kernel launch. A fused GEMM-SwiGLU kernel is available on Hopper (SM90). Fused MoE finalize and AllReduce operations reduce overhead for Mixture-of-Experts models, and one-sided AlltoAll over NVLink further optimizes MoE communication (NVIDIA 2024f, 2024i).

Piecewise CUDA graphs capture operation sequences into optimized graphs for reduced CPU launch overhead, combined with torch.compile in the PyTorch workflow (NVIDIA 2024d). This is the same approach vLLM V1 takes (Section 8.7.1), but TensorRT-LLM was an earlier adopter.

One notable absence: FlexAttention is not supported. FlexAttention is a PyTorch-native API (torch.nn.attention.flex_attention) for user-defined attention patterns via compilation. TensorRT-LLM uses its own custom attention kernels and plugin system rather than PyTorch’s attention APIs, so FlexAttention’s composability advantage doesn’t apply here.

KV cache engineering

TensorRT-LLM implements paged KV cache with configurable block sizes of 8, 16, 32, 64, or 128 tokens (default 128, must be a power of 2) (NVIDIA 2024f). The contrast with the legacy contiguous KV cache is stark: the old approach allocated a monolithic tensor of shape [max_batch_size * max_beam_width, 2, num_heads, max_seqlen, hidden_dim_per_head], wasting memory whenever sequences were shorter than the maximum — which was almost always.
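A quick calculation makes the waste visible. The helpers below count KV cache elements per layer for the contiguous layout, using the tensor shape quoted above, and for a paged layout, where each sequence rounds up to whole blocks. The function names and the example workload are hypothetical.

```python
import math
from typing import Sequence


def contiguous_kv_elements(max_batch_size: int, max_beam_width: int,
                           num_heads: int, max_seqlen: int, head_dim: int) -> int:
    """Elements in [max_batch_size * max_beam_width, 2, num_heads, max_seqlen, head_dim]."""
    return max_batch_size * max_beam_width * 2 * num_heads * max_seqlen * head_dim


def paged_kv_elements(seq_lens: Sequence[int], num_heads: int,
                      head_dim: int, block_size: int) -> int:
    """Elements actually allocated when each sequence rounds up to whole blocks."""
    blocks = sum(math.ceil(n / block_size) for n in seq_lens)
    return blocks * block_size * 2 * num_heads * head_dim


if __name__ == "__main__":
    # Hypothetical workload: four sequences, only one near the 4096-token maximum.
    seq_lens = [100, 500, 1000, 4096]
    contiguous = contiguous_kv_elements(4, 1, 8, 4096, 128)
    paged = paged_kv_elements(seq_lens, 8, 128, 128)
    print(f"contiguous: {contiguous:,} elements per layer")
    print(f"paged:      {paged:,} elements per layer")
    print(f"paged uses {paged / contiguous:.0%} of the contiguous allocation")
```

In this example the paged layout allocates 45 blocks where the contiguous layout reserves the equivalent of 128, and the gap widens as the spread between typical and maximum sequence length grows.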

KV cache quantization supports INT8 and FP8 modes with per-tensor scaling factors. The XQA kernel handles FP16, BF16, FP8, and INT8 KV cache combinations, so quantization integrates cleanly with the generation-phase attention path (NVIDIA 2024f). There is no support for TurboQuant, the rotation-plus-vector-quantization technique for KV cache compression that has community integrations in vLLM and SGLang.

Prefix caching (KV cache reuse) uses block hashing — only full blocks can be shared across requests, similar to vLLM’s approach and unlike SGLang’s token-level radix tree (Section 8.8.1). It’s enabled by default when built with paged context FMHA, with the default block size of 128 tokens (NVIDIA 2024c). There’s an important practical limitation: reuse requires the first request to complete its block before subsequent requests can access it. At high batch sizes, requests may launch before prior ones finish filling their blocks, preventing reuse.
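The full-block restriction follows directly from how block hashing works. The sketch below is a generic illustration rather than TensorRT-LLM's implementation: the SHA-256 chaining scheme and both function names are assumptions. Each full block's hash covers its own tokens plus the hash of everything before it, so a hit on block i implies the entire prefix matches.

```python
import hashlib
from typing import List, Sequence, Set


def block_hashes(tokens: Sequence[int], block_size: int) -> List[str]:
    """Chained hashes of the full blocks only; a trailing partial block is skipped."""
    hashes: List[str] = []
    prev = ""
    full_len = len(tokens) - len(tokens) % block_size
    for start in range(0, full_len, block_size):
        block = tokens[start:start + block_size]
        payload = prev + ":" + ",".join(map(str, block))
        prev = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(prev)
    return hashes


def reusable_tokens(cached: Set[str], tokens: Sequence[int], block_size: int) -> int:
    """Number of leading tokens whose KV blocks are already in the cache."""
    reused = 0
    for h in block_hashes(tokens, block_size):
        if h not in cached:
            break
        reused += block_size
    return reused
```

With a large block size, two prompts that share a prefix shorter than one block reuse nothing at all, which is the granularity cost relative to a token-level radix tree.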

A KV Cache Event API provides real-time tracking of cache state changes — block creation, storage, removal, and updates — enabling downstream routing decisions (NVIDIA 2024c). This is particularly useful for multi-instance deployments where a load balancer needs to know which instances already have specific token sequences cached.

Host memory offloading extends KV cache reuse by preserving evicted blocks in pinned host memory with LRU eviction. A secondary_offload_min_priority threshold prevents low-priority blocks from being offloaded at all, evicting them directly to reduce GPU-CPU traffic (NVIDIA 2024g, 2024c).

MLA (Multi-Head Latent Attention) support is first-class. FP8 MLA on Hopper and Blackwell and FlashMLA for SM90 both arrived in v0.19.0, followed by chunked prefill for MLA in v1.0 and KV cache reuse for MLA in v1.1 (NVIDIA 2024i).

Sliding window attention is handled through a cyclic KV cache that treats storage as a circular buffer retaining the last \(N\) tokens. Per-layer window sizes are supported, and a StreamingLLM mode maintains a set of “sink tokens” permanently while cycling other positions, adjusting position embeddings to use in-cache positions (NVIDIA 2024f). The limitation is that it’s incompatible with beam search.
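The slot arithmetic of a cyclic cache with sink tokens can be sketched as follows. This toy class is hypothetical: it only models which slot each token position lands in, and omits the position-embedding adjustment and the per-layer window sizes mentioned above.

```python
from typing import Any, List, Optional


class CyclicKVCache:
    """`num_sink` permanent slots followed by a circular window of recent tokens."""

    def __init__(self, window: int, num_sink: int = 0):
        if not 0 <= num_sink < window:
            raise ValueError("need 0 <= num_sink < window")
        self.window = window
        self.num_sink = num_sink
        self.slots: List[Optional[Any]] = [None] * window
        self.count = 0  # total number of tokens appended so far

    def slot_for(self, position: int) -> int:
        # Sink tokens keep their slot forever.
        if position < self.num_sink:
            return position
        # All later positions wrap around the remaining slots.
        ring = self.window - self.num_sink
        return self.num_sink + (position - self.num_sink) % ring

    def append(self, kv: Any) -> int:
        """Store the KV entry for the next token and return the slot used."""
        slot = self.slot_for(self.count)
        self.slots[slot] = kv
        self.count += 1
        return slot
```

With window=4 and num_sink=1, the fifth token overwrites the slot of the second token while the first token's entry is never touched.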

Keyformer-style token-level eviction based on attention scores is not supported. TensorRT-LLM’s selective eviction operates at the block level via prioritized LRU, not at the individual token level.

Speculative decoding

TensorRT-LLM supports a broad set of speculative decoding methods (Section 6.5), with particularly strong performance numbers on large models (NVIDIA 2024j, 2024c).

Draft model decoding uses two independently trained models that share the same vocabulary. The draft model generates up to \(K\) candidate tokens, and the target model validates them in a single forward pass. The performance on H200 GPUs tells the story: Llama 405B with a 3B draft model achieves 120.75 tokens/sec, a 3.61x speedup over the 33.46 tokens/sec baseline. Llama 70B with a 1B draft achieves 146.05 tokens/sec (2.86x speedup). These numbers are notably higher than what vLLM reports for draft model decoding (Section 8.7.5), reflecting TensorRT-LLM’s kernel-level optimizations. Draft model decoding works with FP8 quantization and is compatible with Triton Inference Server (NVIDIA 2024c).
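The draft-then-verify loop can be sketched with toy models. The version below assumes greedy decoding, where a draft token is accepted only if it equals the target model's own greedy choice; sampling-based acceptance is more involved. The two toy "models" and all function names are made up for illustration, and the target is called once per position here, whereas a real engine scores all candidate positions in one forward pass.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]  # greedy next-token function


def speculative_step(context: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """One draft-then-verify step under greedy decoding; returns new tokens."""
    # Draft phase: the cheap model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # Verify phase: accept proposals until the first disagreement.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # the target's token replaces the miss
            return accepted
        accepted.append(tok)
    # Every proposal matched, so the target contributes one extra token.
    accepted.append(target(context + accepted))
    return accepted


def toy_target(context: Sequence[int]) -> int:
    """Toy target model: count upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft(context: Sequence[int]) -> int:
    """Toy draft model: agrees with the target except right after a 3."""
    return 0 if context[-1] == 3 else (context[-1] + 1) % 10
```

Each step yields between 1 and k + 1 tokens for roughly one target-model pass, and the output is identical to what the target model alone would have produced.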

EAGLE is well-supported across versions. EAGLE-1 uses predefined decoding trees, while EAGLE-2 assembles trees dynamically via beam search. EAGLE-3 was added in v0.19.0 with disaggregated serving support following in v0.21.0 (NVIDIA 2024i, 2024j).

Medusa adds multiple lightweight LM heads predicting future tokens, with configurable tree structures at runtime via medusa_choices. Inflight batching support has been available since v0.9.0. The current limitations are notable: it only supports Vicuna (fine-tuned LLaMA), requires medusa_temperature=0, and is incompatible with beam search (NVIDIA 2024j).

ReDrafter takes a different approach — recurrent prediction where each draft token depends on the previous one. The engine performs logits prediction, beam search, and acceptance internally, supporting both inflight fused batching and static batching (NVIDIA 2024j).

N-gram speculative decoding copies input prompt and previously generated output as draft tokens, requiring only the target model. It performs best on tasks with high n-gram overlap — summarization, question answering, code editing — where the output draws heavily from the input. NGram v2 was added in v1.0 (NVIDIA 2024j, 2024i).
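A minimal sketch of the lookup, assuming the simplest matching rule: find the most recent earlier occurrence of the last few tokens and propose whatever followed it. The function name and parameters are hypothetical, and production implementations typically maintain an index rather than scanning backwards.

```python
from typing import List, Sequence


def ngram_draft(tokens: Sequence[int], ngram_size: int, max_draft: int) -> List[int]:
    """Propose draft tokens by copying what followed an earlier matching n-gram.

    `tokens` is the prompt plus everything generated so far.
    """
    if len(tokens) <= ngram_size:
        return []
    tail = list(tokens[-ngram_size:])
    # Scan backwards over earlier positions, skipping the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if list(tokens[start:start + ngram_size]) == tail:
            follow = start + ngram_size
            return list(tokens[follow:follow + max_draft])
    return []  # no match: fall back to ordinary decoding for this step
```

The proposed tokens are then verified by the target model like any other draft, so a bad guess costs only the wasted verification slots.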

Multi-token prediction is supported for models with built-in MTP heads, notably DeepSeek V3/R1. When combined with disaggregated serving, MTP adds another 1.6x–2.5x speedup on top of the disaggregation benefits (NVIDIA 2024i, 2024b).

Lookahead decoding (Jacobi iteration) runs two parallel computation branches — a lookahead branch and a verification branch — within the same model, requiring no additional training or fine-tuning. It has been experimental since v0.13.0, with inflight batching support since v0.16.0 (NVIDIA 2024j, 2024i).

An overlap scheduler between draft forwards was added in v0.21.0, implementing the pipelined draft-verify idea from speculative speculative decoding (Section 6.5) (NVIDIA 2024i).

Inference with reference — reusing a reference output to accelerate decoding — is not supported.

Parallelism

TensorRT-LLM supports data parallelism, tensor parallelism, pipeline parallelism, expert parallelism, and context parallelism, with the constraint that tensor_parallel_size * pipeline_parallel_size must equal the total GPU count (NVIDIA 2024h, 2024a).

Tensor parallelism splits individual weight matrices across devices with AllReduce synchronization via NCCL. It’s the preferred strategy within a single node where fast NVLink connections keep the AllReduce cost low.

Pipeline parallelism divides the model into sets of contiguous layers, with each GPU housing one set. The recommended default for multi-node deployment is TP within a node, PP across nodes, with one exception: NVIDIA’s NVL36/NVL72 Blackwell systems have multi-node NVLink, so TP can span their full GPU sets without the usual cross-node bandwidth penalty (NVIDIA 2024a).

Expert parallelism distributes complete expert weights across GPUs, so each GPU holds the full weights of its assigned experts. Configuration uses --moe_tp_size and --moe_ep_size, with the constraint that their product must equal tp_size. A hybrid approach combining EP with TP enables load balancing across experts (NVIDIA 2024e).
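The two sizing constraints mentioned in this section can be checked before launching anything. The validator below is a hypothetical helper, not part of TensorRT-LLM; it only encodes the two product rules stated in the text.

```python
from typing import List, Optional


def validate_parallel_config(world_size: int, tp_size: int, pp_size: int,
                             moe_tp_size: Optional[int] = None,
                             moe_ep_size: Optional[int] = None) -> List[str]:
    """Return a list of constraint violations; an empty list means the layout is valid."""
    errors: List[str] = []
    if tp_size * pp_size != world_size:
        errors.append(
            f"tp_size * pp_size = {tp_size * pp_size}, expected {world_size} GPUs")
    if moe_tp_size is not None and moe_ep_size is not None:
        if moe_tp_size * moe_ep_size != tp_size:
            errors.append(
                f"moe_tp_size * moe_ep_size = {moe_tp_size * moe_ep_size}, "
                f"expected tp_size = {tp_size}")
    return errors


if __name__ == "__main__":
    # 8 GPUs: TP=4 within a node, PP=2 across nodes, experts split 2 x 2.
    print(validate_parallel_config(8, 4, 2, moe_tp_size=2, moe_ep_size=2))
```

Setting moe_ep_size equal to tp_size with moe_tp_size of 1 gives pure expert parallelism, while intermediate factorizations correspond to the hybrid EP-plus-TP layout.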

Wide Expert Parallel introduces expert slots that are decoupled from specific experts, enabling replication of high-demand experts for load balancing (NVIDIA 2024h). The Expert Parallelism Load Balancer (EPLB) handles both offline and online load balancing, with optimized communication kernels for GB200 multi-node NVLink. Large-scale EP was extended in v0.21.0, and DeepEP integration provides optimized MoE communication kernels (NVIDIA 2024i, 2024d). Token dropping for load balancing is not implemented — load balancing relies on expert replication and EPLB instead.

Attention Data Parallel (ADP) is a distinctive feature that doesn’t appear in the other frameworks under this name. ADP replicates GEMM weights on every GPU but partitions the KV cache across devices, eliminating cross-GPU communication during the attention phase. This is conceptually similar to what vLLM calls DP+EP for MLA models (Section 8.7.4) and what SGLang calls --enable-dp-attention (Section 8.8.7), but TensorRT-LLM provides it as an explicit mode enabled via enable_attention_dp: true (NVIDIA 2024h).

Context parallelism distributes long sequences across GPUs during the prefill phase, including Ulysses-style context parallel support added in v0.16.0 (NVIDIA 2024i, 2024d).

Disaggregated prefill is more mature in TensorRT-LLM than in the other frameworks. It fully separates context (prefill) and generation (decode) phases onto different GPU pools, eliminating interference between phases and enabling independent optimization of TTFT and TPOT. Three deployment approaches are available (NVIDIA 2024b):

  1. trtllm-serve: a REST-based orchestrator with round-robin or KV-cache-aware routing
  2. Dynamo: datacenter-scale deployment with a smart router and Kubernetes support
  3. Triton Inference Server: an ensemble model with a BLS (Business Logic Scripting) orchestrator

KV cache transfer between prefill and decode GPUs uses MPI, UCX, or NIXL backends over RDMA or NVLink. UCX and NIXL are recommended for deployments that need dynamic scaling. A particularly useful capability is support for different parallelism strategies between phases — for example, context with TP2 paired with generation using PP2, with orchestrated block mapping for layout transformation between the two (NVIDIA 2024b).

The performance gains are substantial: DeepSeek R1 shows 1.4x–2.5x speedup, and Qwen 3 shows 1.7x–6.11x speedup on GB200 GPUs. The current limitation is that the number of context and generation instances is fixed at deployment time; dynamic scaling is under development (NVIDIA 2024b).

Quantization

TensorRT-LLM’s quantization support reflects its NVIDIA-first philosophy — it provides deep integration with NVIDIA’s quantization formats and hardware features rather than broad third-party method support.

Weight quantization covers FP4/NVFP4 (Blackwell-only, with native support and optimized kernels), FP8 (Hopper and Blackwell, with automatic conversion via Transformer Engine), INT4, and INT8 (NVIDIA 2024d, 2024i). Mixed precision quantization is supported, with AutoQ for automated mixed-precision selection since v0.15.0. Block scaling (per-block scale factors) is available for improved accuracy at low precision (NVIDIA 2024d).

Weight-activation quantization includes W4A8 (INT4 weights, FP8 activations) with CUTLASS kernels on Ada Lovelace, and MXFP8xMXFP4 added in v1.0 (NVIDIA 2024i).

GPTQ (INT8, added in v0.15.0), AWQ (INT4), and SmoothQuant (TensorRT-native INT8, added in v0.15.0) are all supported (NVIDIA 2024i, 2024d).

Operational features

Guided decoding uses the XGrammar backend for grammar-based constrained generation, supporting JSON schema, regular expressions, EBNF grammar, and structural tags. Integration with the overlap scheduler arrived in v1.0 and with speculative decoding in v1.1 (NVIDIA 2024d, 2024i).

Multi-LoRA serving supports dynamic LoRA loading per request with Hugging Face and NeMo format compatibility. FP8 base model with FP16/BF16 LoRA adapters has been supported since v0.11.0, MoE model LoRA since v0.12.0, and PyTorch backend LoRA with adapter eviction since v1.0 (NVIDIA 2024i, 2024d).

Multimodal support covers a broad range of vision-language models including LLaVA-NeXT, Qwen2-VL, Llama 3.2 Vision, and Mistral Small 3.1 VLM, plus visual generation models like FLUX and Wan 2.1/2.2 for image and video (NVIDIA 2024d, 2024i).

Sparse attention support is available for structured sparsity patterns including Native Sparse Attention (NSA) and Skip Softmax Attention, which approximates attention for long-context inference acceleration (NVIDIA 2024d).

Monitoring is less standardized than vLLM’s Prometheus endpoint or SGLang’s 40+ Prometheus metrics. The KV Cache Event API provides real-time tracking of cache state changes for monitoring and routing decisions, and per-request stats are available in the PyTorch workflow since v0.20.0. The trtllm-bench benchmarking tool supports streaming with TTFT and ITL metrics since v0.10.0 (NVIDIA 2024i, 2024c).

Hardware support is NVIDIA-only. This is the fundamental tradeoff: TensorRT-LLM doesn’t run on AMD, Intel, or any other hardware, but it can exploit NVIDIA-specific features — Hopper’s FP8 tensor cores, Blackwell’s FP4 support, NVLink topology-aware communication — more deeply than frameworks that target multiple backends.


Production LLM serving sits at the intersection of systems engineering and machine learning optimization. The techniques from the preceding chapters provide the building blocks, but assembling them into a reliable, efficient serving system requires understanding the operational realities — memory pressure, failure modes, monitoring, and orchestration — that don’t show up in benchmark results. The frameworks surveyed here each represent a different set of answers to these challenges, and the right choice depends on your specific workload, hardware, and reliability requirements.

8.10 Further Reading

Multi-LoRA serving. The original LoRA paper (Hu et al. 2021) explains the low-rank adaptation technique and why the adapter weights are small enough to swap efficiently. For serving multiple adapters simultaneously, the S-LoRA paper (Sheng et al. 2023) introduces the idea of a unified base model with dynamically loaded adapters, including a custom CUDA kernel for batched LoRA computation across requests using different adapters. Punica (Chen et al. 2023) takes a similar approach with a focus on the GPU kernel design for multi-adapter batching. For a practical walkthrough of deploying one base model with many adapters, the Hugging Face blog post “TGI Multi-LoRA: Deploy Once, Serve 30 Models” (Thomas et al. 2024) shows the end-to-end process with real adapter switching.

Disaggregated serving. The DistServe paper (Zhong et al. 2024) introduced the case for separating prefill and decode onto different GPU pools, but the idea has evolved rapidly since then. “Disaggregated Inference: 18 Months Later” (Chen et al. 2025), a retrospective from the DistServe authors, documents how disaggregation went from a research prototype to an industry-standard pattern adopted by NVIDIA Dynamo, vLLM, SGLang, and others, and explores emerging directions like attention-FFN disaggregation.

Multi-model routing. Quality-tier routing — sending easy requests to a small model and hard ones to a large model — can dramatically reduce cost. The RouteLLM blog post (Ong et al. 2024) describes a framework of trained routers that achieve up to 85% cost reduction while maintaining 95% of GPT-4 quality, with open-source router weights and evaluation code.

Benchmarking. If you’re evaluating frameworks for a specific deployment, LLMPerf (Anyscale 2024) provides standardized benchmarking scripts that measure the metrics from Section 3.2 across different frameworks and hardware configurations. For understanding how to interpret benchmark results and avoid common pitfalls, the Anyscale metrics blog post (Kadous et al. 2023) is a practical companion.

System design. For the broader context of how LLM serving fits into production ML systems, Miao et al. (2023) surveys the full stack from model optimization through serving infrastructure. For fleet-level scheduling across multiple model replicas, Wu et al. (2023) covers how to route requests across heterogeneous hardware to maximize goodput under latency constraints.

Framework documentation. Each of the major frameworks has comprehensive documentation that goes beyond what we’ve covered here. The vLLM docs (vLLM Team 2024e) include guides on configuring PagedAttention parameters, enabling speculative decoding, and tuning scheduler policies. The TensorRT-LLM docs (NVIDIA 2024d) cover the compilation pipeline and how to exploit FP8 and custom kernels on NVIDIA hardware. SGLang’s docs (SGLang Team 2024) are particularly useful for understanding the RadixAttention cache and the structured generation programming model.