4  Model Design Choices

Now that we have an understanding of the inference pipeline and where the bottlenecks live, starting with this chapter, we shift to doing something about them.

The optimizations in this chapter are all about the model itself, separate from request handling. They change what the hardware has to do per token by reducing the model’s compute and memory footprint at the architecture and weight level. Some of these are choices made by model designers prior to training. Others are applied post-training to an existing model. Either way, the result is the same: fewer bytes to move, fewer FLOPs to compute, or both.

We’ll keep the coverage here lighter than in subsequent chapters. Much of the time, you don’t get to choose the model you serve, which is why most of this book focuses on what you can do when you’re handed a model. But these techniques are still worth understanding, whether you have any say in the model or not.

4.1 Quantization

One can make a strong argument that quantization is the single most impactful model-level optimization for LLM inference. The reason is straightforward, and it connects directly to the bottleneck framework from Section 3.4.

During decode, the bottleneck is memory bandwidth. As we’ve discussed, each decode step reads the entire model’s weights from HBM just to produce one token. If your model has 70 billion parameters stored in FP16 (2 bytes each), that’s 140 GB of weight data read per token. On an H100 with 3.35 TB/s of bandwidth, reading 140 GB takes about 42 milliseconds – and that’s a hard floor on your per-token latency, ignoring everything else. If you quantize those same weights to INT4 (0.5 bytes each), you’re reading 35 GB instead, and the floor drops to about 10 milliseconds. That’s a 4x improvement in your theoretical best-case decode speed, and you didn’t change a single line of serving code.
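This bandwidth floor is simple enough to compute directly. A quick sketch (the function name is ours, not any standard API):

```python
# Per-token decode latency floor: the time to read every weight once from
# HBM at peak bandwidth. Uses the chapter's example of a 70B model on an
# H100; substitute your own model and GPU figures.

def decode_floor_ms(n_params: float, bytes_per_param: float,
                    bandwidth_bytes_per_s: float) -> float:
    """Minimum per-token decode latency in milliseconds."""
    return n_params * bytes_per_param / bandwidth_bytes_per_s * 1000

H100_BW = 3.35e12  # bytes/s

fp16_floor = decode_floor_ms(70e9, 2.0, H100_BW)  # ~41.8 ms
int4_floor = decode_floor_ms(70e9, 0.5, H100_BW)  # ~10.4 ms
```

The 4x ratio between the two floors falls straight out of the 4x reduction in bytes per parameter.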

The key insight is that quantization doesn’t just reduce the amount of GPU memory needed to store the model weights (which is a pretty big benefit). Quantization also reduces the bandwidth demand that dominates decode latency, and it can speed up computation too. Let’s walk through the main approaches to quantization. We will assume 16-bit weights in FP16 or BF16 are our baseline.

Lower precision datatypes

GPUs support a growing number of datatypes, especially in the latest generations. While higher precision datatypes, such as FP32, are needed for training, most models can run inference in FP16 with no noticeable difference. Datatypes that need fewer than 16 bits of storage can save even more memory. Table 4.1 lists commonly supported datatypes.

Table 4.1: Common datatype precisions supported. The Layout column shows the number of bits allocated for the sign (“s”), exponent (“e”), and mantissa (“m”).
Type Bits Layout Key Properties
FP32 32 1s + 8e + 23m Standard training precision; baseline
FP16 16 1s + 5e + 10m 2x compression; narrower dynamic range (max ~65,504)
BF16 16 1s + 8e + 7m Same dynamic range as FP32, less mantissa precision; preferred for training stability
FP8 E4M3 8 1s + 4e + 3m More precision; used for weights/activations in forward pass
FP8 E5M2 8 1s + 5e + 2m More dynamic range; used for gradients
INT8 8 signed integer Range [-128, 127]; 4x compression vs FP32
INT4 4 signed integer Range [-8, 7]; 8x compression; requires sophisticated algorithms
NF4 4 non-uniform 16 levels at normal distribution quantiles; information-theoretically optimal for normally-distributed weights (QLoRA)
FP4 4 1s + 2e + 1m Less common than NF4 for LLM weights

Older GPUs support 32-bit and 16-bit floating point, as well as 8-bit integer matrix multiplication. FP8 is a newer format supported on the H100 and later GPUs. As quantization has grown in popularity, each new GPU generation has added support for more low-precision datatypes.

Weight-only quantization

The simplest form of quantization leaves the activations in their original precision (typically FP16 or BF16) and only quantizes the weights. This is called weight-only quantization, and it’s the most common starting point. In weight-only quantization, the weights are stored in a smaller datatype, then dequantized back to FP16 on the fly before each matrix multiplication, so the compute still happens at FP16 precision. The benefit is purely about reducing how much data needs to be read from HBM.

When quantizing below 16-bit precision, the storage space and memory read savings are usually slightly reduced because we also need to store conversion parameters. If we have both an offset and a scaling parameter, then when \(x\) is the higher precision value and \(\tilde{x}\) is the quantized value, the conversion is \(x = (\tilde{x} - \text{offset}) * \text{scaling}\). This dequantization step also adds a small amount of compute overhead to model inference.
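A minimal sketch of this offset-and-scale scheme, quantizing a weight vector to INT8 and dequantizing it with the conversion above. The names and range choices are illustrative, not any particular library’s API:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Map x onto int8 [-128, 127] using a scale and an offset (zero point)."""
    scale = (x.max() - x.min()) / 255.0
    offset = np.round(-x.min() / scale) - 128.0
    x_q = np.clip(np.round(x / scale) + offset, -128, 127).astype(np.int8)
    return x_q, scale, offset

def dequantize(x_q: np.ndarray, scale: float, offset: float) -> np.ndarray:
    # The conversion from the text: x = (x_q - offset) * scale
    return (x_q.astype(np.float32) - offset) * scale

w = np.random.default_rng(0).normal(size=256).astype(np.float32)
w_q, scale, offset = quantize_int8(w)
w_hat = dequantize(w_q, scale, offset)
max_err = float(np.abs(w - w_hat).max())  # bounded by about one quantization step
```

The scale and offset are the conversion parameters that must be stored alongside the INT8 weights, and the `dequantize` call is the small per-matmul overhead mentioned above.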

LLMs tolerate quantization well. While the lower precision datatypes could cause cascading rounding errors, in practice the errors tend to cancel out. Quantizing from FP16 to INT8, known as W8, cuts the weight memory in half compared to FP16. The quality impact is usually negligible for large models — a 70B model at W8 will behave almost identically to its FP16 counterpart on most benchmarks. If we drop to W4 quantization, cutting weight memory by 4x compared to FP16, this is where quality tradeoffs start to become more visible, especially for smaller models. A 7B model at W4 may show degradation on some tasks, while a 70B model at W4 often remains remarkably capable. The general pattern is that larger models are more robust to aggressive quantization.

Weight-activation quantization

Weight-only quantization reduces memory bandwidth demand but doesn’t speed up the actual matrix multiplication – the compute still runs at FP16. Weight-activation quantization takes the next step: quantize both weights and activations so that the matrix multiplications themselves can run on lower-precision hardware units.

For example, weight-activation quantization using INT8, known as W8A8, quantizes both weights and activations to INT8, allowing the use of INT8 tensor cores. On the H100, INT8 tensor cores deliver roughly 2x the throughput of FP16 tensor cores. This means W8A8 doesn’t just reduce memory traffic – it also increases the compute ceiling. As with weight-only quantization, there is some additional storage and compute overhead for the conversion parameters.

Granularity

Early attempts to quantize below 16-bit floating point took a simple approach: a single conversion factor for all values in each weight matrix (per-tensor quantization). In this scenario, one very large value distorts the conversion scale. This increases the rounding errors caused by quantization and can round small weights to zero, which hurts quality even more. To mitigate the risk of outlier values, quantization can be performed at finer granularities:

  • Per channel(s)
  • Per block, potentially in a hierarchy of super-blocks and sub-blocks
  • Per token

Fine-grained quantization seeks to group similar values together, and is highly effective. It does, however, come at a cost of additional memory and compute overhead. For example, dividing a 2,048 by 2,048 weight matrix into groups of 32 values will require 131,072 scaling parameters and dequantization calculations. To reduce the storage of the scaling factors, they can themselves be quantized at a cost of additional dequantization overhead.
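As a sketch, group-wise quantization with 32-value groups over a 2,048 by 2,048 matrix looks like the following. This uses symmetric INT4 with illustrative names; real kernels also pack two 4-bit values per byte, which we skip here:

```python
import numpy as np

def group_quantize_int4(w: np.ndarray, group_size: int = 32):
    """Symmetric INT4 quantization with one scale per group of weights."""
    groups = w.reshape(-1, group_size)  # each row is one group
    # Per-group scale: an outlier only distorts its own 32-value group.
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    w_q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return w_q, scales

w = np.random.default_rng(0).normal(size=(2048, 2048)).astype(np.float32)
w_q, scales = group_quantize_int4(w)
n_scales = scales.size  # 2,048 * 2,048 / 32 = 131,072, as in the text
err = float(np.abs(w.reshape(-1, 32) - w_q * scales).max())
```

Each of those 131,072 scales is extra storage and an extra multiply at dequantization time – the overhead the paragraph above describes.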

When quantization is applied

Post-Training Quantization (PTQ) is performed after the model is fully trained. This is the dominant approach, since it can be applied to any model from any source without special training. The simplest PTQ methods analyze only the weights; more advanced techniques run a small calibration set through the model so activation statistics can be analyzed as well. Quantization-Aware Training (QAT) modifies the training procedure to minimize the rounding errors quantization will later introduce. While effective, the extra training work required has limited its adoption.

Techniques

One of the practical challenges of quantization – especially weight-activation quantization – is activation outliers. Large LLMs tend to develop a small number of activation channels with values that are orders of magnitude larger than the rest. If you try to quantize activations using a single scale factor per tensor, these outliers force a wide quantization range that wastes precision on the many small values. Here are a few post-training quantization methods and how each contends with quantization error and outliers.

GPTQ (Frantar et al. 2022) uses approximate second-order information to quantize weights to 4 bits (or even 3 bits) with minimal accuracy loss. It works by quantizing weights one layer at a time and adjusting the remaining weights to compensate for the error introduced by each quantization step. GPTQ can be applied to a pre-trained model without any additional training data – just a small calibration set.

AWQ (Activation-aware Weight Quantization) (Lin et al. 2023) takes a different approach. It observes that a small fraction of weights are disproportionately important – specifically, the weights that correspond to large activation magnitudes. Rather than treating all weights equally, AWQ protects these salient weights by scaling them up before quantization and scaling the activations down correspondingly. This keeps the important weights at higher effective precision while allowing aggressive quantization elsewhere.

SmoothQuant (Xiao et al. 2023) addresses activation outliers by migrating the quantization difficulty from activations to weights. The idea is to apply a per-channel scaling transformation that divides the activation by a smoothing factor and multiplies the corresponding weight by the same factor. This makes the activation distribution more uniform – and therefore easier to quantize – at the cost of making the weight distribution slightly less uniform. Since weights are fixed and can be quantized offline with more care, this is a favorable tradeoff.

Connecting quantization to MBU

In Section 3.2, we introduced Model Bandwidth Utilization (MBU) as a key diagnostic for decode performance. Quantization improves the theoretical decode speed by reducing the bytes that need to be read per token. But it doesn’t automatically improve MBU – if your serving system has overhead that prevents it from saturating HBM bandwidth, that overhead remains after quantization, so the realized speedup matches the byte reduction only if MBU holds constant.

In practice, quantized models often achieve higher MBU because the smaller data footprint means fewer cache misses, better memory access patterns, and the possibility of fitting the entire model in a single GPU’s memory rather than splitting across devices. These secondary effects can make the real-world speedup even better than the raw byte reduction would suggest.

Note

Quantization affects the roofline model in an important way. When you use FP8 or INT8 tensor cores, you raise the compute ceiling (from ~990 TFLOPS to ~1,979 TFLOPS on the H100) while the bandwidth ceiling stays constant. This shifts the ridge point higher, which means some workloads that were compute-bound at FP16 can become memory-bandwidth-bound at FP8. Keep this in mind when choosing quantization strategies for prefill-heavy workloads.
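The ridge-point shift is a one-line calculation: peak FLOPS divided by peak bandwidth gives the arithmetic intensity at which the compute and bandwidth ceilings meet.

```python
# Roofline ridge point: the arithmetic intensity (FLOPs per byte) above
# which a workload is compute-bound rather than bandwidth-bound.
# H100 figures from the note above; bandwidth is unchanged by quantization.

def ridge_point_flops_per_byte(peak_flops: float, bandwidth: float) -> float:
    return peak_flops / bandwidth

H100_BW = 3.35e12  # bytes/s
fp16_ridge = ridge_point_flops_per_byte(990e12, H100_BW)   # ~295 FLOPs/byte
int8_ridge = ridge_point_flops_per_byte(1979e12, H100_BW)  # ~591 FLOPs/byte
# A workload with arithmetic intensity between these two values is
# compute-bound at FP16 but bandwidth-bound at INT8/FP8.
```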

4.2 Knowledge Distillation

Knowledge distillation is a training technique where a smaller student model is trained to mimic the behavior of a larger teacher model. In the original distillation formulation (Hinton et al. 2015), the student learns not just from the ground-truth labels, but from the teacher’s full output distribution – the probability the teacher assigns to every token in the vocabulary, not just the top-1 prediction. These soft targets carry richer information than hard labels, which is why a distilled student can often outperform a model of the same size trained from scratch. Newer techniques vary the approach, training on sampled outputs from the teacher or using reinforcement learning to impart some of the teacher’s capability to the student.

From an inference perspective, the result is simple: you get a smaller model. Fewer parameters means less data to read from HBM per decode step, a smaller KV cache (if the model also has fewer layers or a smaller hidden dimension), and less memory consumed overall. The inference system doesn’t need to know or care that the model was distilled — it just sees a model with fewer parameters that performs better than you’d expect for its size.

The tradeoff is that distillation requires significant training compute, and you need the larger, more capable teacher model to begin with. That rules it out for frontier models, which have no larger teacher to learn from, but distillation is popular in research and widely used to produce smaller models.

4.3 Pruning

Pruning removes weights or structures from a model that contribute little to its output. The motivation is similar to quantization: fewer parameters means less data to move, which speeds up bandwidth-bound decode. But while quantization keeps all the parameters and reduces their precision, pruning eliminates parameters entirely.

Unstructured pruning

Unstructured pruning zeroes out individual weights based on some importance criterion (typically magnitude). A pruned model might have 50% of its weights set to zero, theoretically cutting compute and memory in half. The problem is that these zeros are scattered throughout the weight matrices in an irregular pattern. Standard GPU hardware and matrix multiplication kernels still process the full dense matrices, so they can’t exploit this sparsity. Sparse matrix multiplication kernels do exist, but they are much slower per element than their dense counterparts, so matrices must be extremely sparse before they come out ahead. (CPU inference is more forgiving: if enough weights can be pruned, good speedups are achievable.) To get real GPU speedups from unstructured pruning, you need specialized sparse compute support, such as NVIDIA’s 2:4 structured sparsity format on Ampere and later GPUs – though pruning weights into that pattern is harder.

Structured pruning

Structured pruning takes a coarser approach: instead of zeroing individual weights, it removes entire structural units – attention heads, entire layers, or neurons in MLP blocks. The result is a smaller dense model that runs on standard hardware with no special sparse computation support required.

Removing an attention head reduces the KV cache proportionally. Removing an entire layer reduces both parameters and the depth of the forward pass. These are straightforward wins for inference, and the cost model from Section 3.5 applies directly — fewer parameters means proportionally less time reading weights during decode.

The challenge with structured pruning is identifying which structures can be removed with minimal quality impact. Attention heads in the later layers of a model often contribute less than early or middle layers, but this varies by model and task. Structured pruning typically requires some retraining or fine-tuning after removal to recover quality, which limits its use as a purely post-training optimization.

In practice, pruning is less widely adopted for LLM inference than quantization. Quantization offers a more predictable quality-performance tradeoff and is easier to apply post-training.

4.4 Efficient Attention Architectures

The sections above are post-training techniques you can apply to an existing model. This section covers architecture choices – decisions made during model design and training that affect inference efficiency from the start.

MLP layers account for roughly two-thirds of a typical transformer’s parameters, but they operate independently on each token. Their cost scales linearly with sequence length, and they are straightforward to parallelize. Attention is the more interesting bottleneck. Its KV cache grows with sequence length, creating a per-request memory cost that directly limits concurrency. For long sequences, the quadratic cost of full attention also becomes significant during prefill. This makes attention the primary target for architectural optimization.

Reducing KV cache size

The KV cache is one of the largest consumers of GPU memory during inference – and unlike model weights, it scales with the number of active requests and the sequence length of each. Reducing the KV cache per request lets you serve more concurrent requests in the same memory budget. We’ll cover KV cache management techniques in Section 6.3; here we focus on the architectural choices that determine how large the cache is in the first place.

Multi-Head Attention (MHA) is the original design from the transformer paper (Vaswani et al. 2017). Each attention head has its own key and value projections, so the KV cache stores separate K and V vectors for every head, every layer, and every token in the sequence. For a model using FP16 with 64 heads and a head dimension of 128, each token adds \(2 \times 64 \times 128 \times 2 = 32{,}768\) bytes to the KV cache per layer. This is the most expensive option. The top row of Figure 4.1 shows how queries align with keys and values in MHA.

Multi-Query Attention (MQA) (Shazeer 2019) goes to the opposite extreme: all attention heads share a single set of key and value projections, as shown on the third row of Figure 4.1. Each head still has its own query projection, so the heads can attend to different aspects of the input, but the KV cache only stores one K and one V vector per layer per token. This reduces the KV cache by a factor equal to the number of heads – from 64 sets of K/V down to 1 in the example above. The quality impact is generally small for large models, and the inference speedup – especially for long sequences with many concurrent requests – can be substantial.

Grouped-Query Attention (GQA) (Ainslie et al. 2023) is a middle ground between MHA and MQA. Instead of one shared KV set for all heads, GQA uses \(G\) groups, where each group of heads shares one set of keys and values. With the 64 heads above and \(G = 8\) groups, the KV cache is reduced by 8x compared to MHA. GQA has become the dominant choice in recent models – LLaMA 3, Mistral, and Gemma all use it. It offers most of MQA’s memory savings with less quality risk. A GQA example with 8 query heads and 4 groups is shown on the second row of Figure 4.1.

Multi-head Latent Attention (MLA) (Liu et al. 2024) is DeepSeek’s approach to KV cache compression. Rather than storing full key and value vectors, MLA compresses them into a low-rank latent vector at each layer. At attention time, the full keys and values are reconstructed from the latent representation. The compression ratio depends on the latent dimension, but MLA can achieve significant cache reductions – comparable to or better than GQA – while preserving model quality. The tradeoff is additional compute to decompress the latent vectors during attention, but this compute is typically small relative to the bandwidth savings, and there are tricks to reduce the compute. MLA is depicted on the bottom row of Figure 4.1.

Cross-Layer Attention (CLA) (Brandon et al. 2024) takes a different angle: rather than sharing KV across heads within a layer, it shares KV across adjacent layers. If every pair of consecutive layers shares the same KV cache, the total cache size is halved. CLA can be combined with GQA or MQA for even greater reductions.
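To make the per-token costs concrete, here is the chapter’s KV cache arithmetic for each variant, using FP16, a head dimension of 128, and 64 query heads. (MLA and CLA depend on the latent size and sharing pattern, so they are omitted.)

```python
# Per-token KV cache bytes per layer, as a function of how many K/V head
# sets are actually stored: 64 for MHA (one per head), 8 for GQA with 8
# groups, 1 for MQA.

def kv_bytes_per_token_per_layer(kv_head_sets: int, head_dim: int = 128,
                                 bytes_per_elem: int = 2) -> int:
    # Factor of 2: one K vector and one V vector per stored head set.
    return 2 * kv_head_sets * head_dim * bytes_per_elem

mha = kv_bytes_per_token_per_layer(64)  # 32,768 bytes
gqa = kv_bytes_per_token_per_layer(8)   #  4,096 bytes
mqa = kv_bytes_per_token_per_layer(1)   #    512 bytes
```

Multiply by the number of layers and the sequence length to get a request’s total cache footprint – the quantity that limits concurrency.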

Reducing attention compute

Beyond the KV cache, the attention computation itself can be a bottleneck, particularly during prefill for long sequences where the cost is quadratic in sequence length.

Sliding window attention (Beltagy et al. 2020) restricts each token to attend only to a fixed-size window of nearby tokens rather than the full sequence. This reduces the attention computation from \(O(n^2)\) to \(O(n \times w)\), where \(w\) is the window size. For inference, sliding window attention also bounds the KV cache size per layer to the window size, regardless of the total sequence length. Several recent models – including Mistral – use sliding window attention in some or all layers, often combined with a few full-attention layers to preserve the model’s ability to attend to distant tokens.
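A quick way to see the savings is to count attention score computations under causal masking for a long sequence (plain Python, proportional counts only):

```python
# Attention score counts under causal masking: full attention is quadratic
# in sequence length, sliding window attention is linear in n for fixed w.

def full_attention_scores(n: int) -> int:
    """Causal full attention: token i attends to i + 1 positions."""
    return sum(i + 1 for i in range(n))

def sliding_window_scores(n: int, w: int) -> int:
    """Each token attends to at most the last w positions."""
    return sum(min(i + 1, w) for i in range(n))

n, w = 32_768, 4_096
full = full_attention_scores(n)         # ~n^2 / 2 score computations
windowed = sliding_window_scores(n, w)  # ~n * w, over 4x fewer here
```

Doubling `n` doubles `windowed` but quadruples `full`, which is exactly the \(O(n^2)\) versus \(O(n \times w)\) distinction above.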

Parallel attention and MLP blocks is a simpler architectural trick used in some models, including PaLM (Chowdhery et al. 2022). Instead of computing attention and MLP sequentially within each transformer block, they are computed in parallel and their outputs are summed. This doesn’t reduce total FLOPs, but it reduces the serial dependency within each layer, which can improve GPU utilization and reduce latency – especially when the attention and MLP computations can be overlapped on different hardware units.

Note

This book focuses on the dominant transformer architecture with softmax attention. Alternative approaches like Mamba and other state-space models, as well as linear attention variants, replace the attention mechanism entirely with recurrent-style computations that have constant-size state rather than a growing KV cache. These are an active area of research, but the vast majority of deployed LLMs today use softmax attention, so we’ll focus our attention there.

4.5 Tokenization and Vocabulary Effects

One model-level lever that’s easy to overlook is the tokenizer and its vocabulary size. The tokenizer determines how many tokens a given piece of text is broken into, and that token count directly drives the number of decode steps – each of which requires a full forward pass through the model.

A larger vocabulary means each token covers more text on average – common words or subwords that would otherwise be split into multiple tokens get their own single token. Fewer tokens means fewer decode steps, which means faster end-to-end generation. For latency-sensitive applications, this is a meaningful win. Going from a 32K vocabulary to a 128K vocabulary might reduce the average token count for a given text by 10-20%, which translates directly to 10-20% fewer decode steps.

The tradeoff is that a larger vocabulary increases the size of two model components: the embedding table (which maps token IDs to vectors at the input) and the LM head (the output projection that produces logits over the vocabulary). For a model with a hidden dimension of 4,096 and a vocabulary of 128K tokens, the LM head alone has 4,096 \(\times\) 128,000 \(\approx\) 500 million parameters. At FP16, that’s about 1 GB – not negligible, but typically a small fraction of the total model size for large LLMs. The compute cost of the final softmax over a larger vocabulary also increases, though this is usually a minor contributor to overall latency.
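The arithmetic from the example above, spelled out:

```python
# Vocabulary-dependent parameter cost for the chapter's example:
# hidden dimension 4,096, vocabulary 128K tokens.
hidden, vocab = 4096, 128_000

lm_head_params = hidden * vocab          # 524,288,000 (~500M) parameters
lm_head_bytes_fp16 = lm_head_params * 2  # ~1.05 GB at 2 bytes/param
# The input embedding table has the same shape, doubling this cost unless
# the model ties its input and output embeddings.
```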

For most practical scenarios, the efficiency gains from fewer decode steps outweigh the increased embedding and LM head size. This is why the trend in recent LLMs has been toward larger vocabularies.

4.6 Summary

The techniques in this chapter all reduce what the hardware has to do for each token. Quantization reduces the bytes-per-parameter, directly attacking the bandwidth bottleneck that dominates decode. Distillation and pruning produce smaller models with fewer total parameters. Efficient attention architectures reduce the KV cache footprint and, in some cases, the attention compute itself. Even vocabulary design affects inference cost by changing how many decode steps are needed.

These are the model-level knobs. Once they’re set – once you have a quantized, architecturally efficient model ready to serve – the next set of optimizations operates at the system level: how to schedule requests, batch them together, and manage memory. That’s where we turn next in Chapter 5.

4.7 Further Reading

Quantization is a deep field with a large body of work. The GPTQ (Frantar et al. 2022) and AWQ (Lin et al. 2023) papers are essential reading if you want to understand the mechanics of post-training weight quantization beyond the overview given here. GPTQ’s use of approximate second-order information to minimize quantization error is elegant, and AWQ’s insight that activation-aware scaling protects the most important weights is a useful design pattern that shows up in other contexts. SmoothQuant (Xiao et al. 2023) is the key reference for weight-activation quantization, particularly its technique of migrating quantization difficulty from activations to weights. For an earlier take on INT8 inference that pioneered mixed-precision decomposition, see Dettmers et al. (2022). The most popular quantization library is bitsandbytes (Dettmers et al. 2021), also by Tim Dettmers.

For a comprehensive survey of quantization methods applied to LLMs, Zhu et al. (2024) covers the full taxonomy – from uniform to non-uniform quantization, weight-only to weight-activation, and post-training to quantization-aware training – with empirical comparisons across model sizes.

On the attention architecture side, the progression from standard multi-head attention to MQA (Shazeer 2019), GQA (Ainslie et al. 2023), and MLA (Liu et al. 2024) is well documented in the original papers. Shazeer’s MQA paper is particularly worth reading because it clearly explains the inference motivation – reducing KV cache bandwidth – rather than just reporting accuracy numbers. The Llama 3 paper (Grattafiori et al. 2024) provides a good case study of GQA adopted in a production-grade model, with ablations showing the quality-efficiency tradeoff.

For knowledge distillation applied to LLMs specifically, Gu et al. (2024) covers the challenges of distilling autoregressive language models, including the mismatch between training-time and inference-time token distributions that makes naive distillation less effective for generative models.