References

Adnan, Muhammad, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, and Purushotham Kamath. 2024. “Keyformer: KV Cache Reduction Through Key Tokens Selection for Efficient Generative Inference.” Proceedings of Machine Learning and Systems. https://proceedings.mlsys.org/paper_files/paper/2024/hash/48fecef47b19fe501d27d338b6d52582-Abstract-Conference.html.

Agrawal, Amey, Nitin Kedia, Ashish Panwar, et al. 2024. “Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve.” Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation. https://arxiv.org/abs/2403.02310.

Agrawal, Amey, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramjee. 2023. “Sarathi: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills.” arXiv Preprint arXiv:2308.16369. https://arxiv.org/abs/2308.16369.

Ainslie, Joshua, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.” arXiv Preprint arXiv:2305.13245. https://arxiv.org/abs/2305.13245.

Alammar, Jay. 2018. “The Illustrated Transformer.” https://jalammar.github.io/illustrated-transformer/.

AMD ROCm Team. 2025. “The vLLM MoE Playbook: A Practical Guide to TP, DP, PP and Expert Parallelism.” https://rocm.blogs.amd.com/software-tools-optimization/vllm-moe-guide/README.html.

Aminabadi, Reza Yazdani, Samyam Rajbhandari, Ammar Ahmad Awan, et al. 2022. “DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale.” arXiv Preprint arXiv:2207.00032. https://arxiv.org/abs/2207.00032.

Anyscale. 2024a. “How Continuous Batching Enables 23x Throughput in LLM Inference While Reducing P50 Latency.” https://www.anyscale.com/blog/continuous-batching-llm-inference.

Anyscale. 2024b. “LLMPerf: A Tool for Benchmarking LLM Inference.” https://github.com/ray-project/llmperf.

Austin, Jacob, Sholto Douglas, Roy Frostig, et al. 2025. “How to Scale Your Model.” https://jax-ml.github.io/scaling-book/.

Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. “Layer Normalization.” arXiv Preprint arXiv:1607.06450. https://arxiv.org/abs/1607.06450.

Banatt, Eryk. 2025. “Understanding Multi-Head Latent Attention.” https://planetbanatt.net/articles/mla.html.

Beltagy, Iz, Matthew E. Peters, and Arman Cohan. 2020. “Longformer: The Long-Document Transformer.” arXiv Preprint arXiv:2004.05150. https://arxiv.org/abs/2004.05150.

Bloem, Peter. 2019. “Transformers from Scratch.” https://peterbloem.nl/blog/transformers.

Brandon, William, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, and Jonathan Ragan-Kelley. 2024. “Reducing Transformer Key-Value Cache Size with Cross-Layer Attention.” arXiv Preprint arXiv:2405.12981. https://arxiv.org/abs/2405.12981.

Bycroft, Brendan. 2023. “LLM Visualization.” https://bbycroft.net/llm.

Cai, Tianle, Yuhong Li, Zhengyang Geng, et al. 2024. “Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads.” arXiv Preprint arXiv:2401.10774. https://arxiv.org/abs/2401.10774.

Cai, Zefan. 2024. “Awesome-LLM-KV-Cache.” https://github.com/Zefan-Cai/Awesome-LLM-KV-Cache.

Casey, Matt. 2024. “LLM Distillation Demystified: A Complete Guide.” https://snorkel.ai/blog/llm-distillation-demystified-a-complete-guide/.

Chen, Carol. 2022. “Transformer Inference Arithmetic.” https://kipp.ly/transformer-inference-arithmetic/.

Chen, Charlie, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. 2023. “Accelerating Large Language Model Decoding with Speculative Sampling.” arXiv Preprint arXiv:2302.01318. https://arxiv.org/abs/2302.01318.

Chen, Junda, Yonghao Zhuang, and Hao Zhang. 2025. “Disaggregated Inference: 18 Months Later.” https://haoailab.com/blogs/distserve-retro/.

Chen, Lequn, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Krishnamurthy. 2023. “Punica: Multi-Tenant LoRA Serving.” arXiv Preprint arXiv:2310.18547. https://arxiv.org/abs/2310.18547.

Chips and Cheese. 2023. “Nvidia’s H100: Funny L2, and Tons of Bandwidth.” https://chipsandcheese.com/2023/07/02/nvidias-h100-funny-l2-and-tons-of-bandwidth/.

Cho, Aeree, Grace C. Kim, Alexander Karpekov, et al. 2024. “Transformer Explainer: Interactive Learning of Text-Generative Models.” https://poloclub.github.io/transformer-explainer/.

Chowdhery, Aakanksha, Sharan Narang, Jacob Devlin, et al. 2022. “PaLM: Scaling Language Modeling with Pathways.” arXiv Preprint arXiv:2204.02311. https://arxiv.org/abs/2204.02311.

Dao, Tri. 2023. “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.” arXiv Preprint arXiv:2307.08691. https://arxiv.org/abs/2307.08691.

Dao, Tri. 2024. “FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision.” https://tridao.me/blog/2024/flash3/.

Dao, Tri, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” arXiv Preprint arXiv:2205.14135. https://arxiv.org/abs/2205.14135.

Dao, Tri, Daniel Haziza, Francisco Massa, and Grigory Sizov. 2023. “Flash-Decoding for Long-Context Inference.” https://crfm.stanford.edu/2023/10/12/flashdecoding.html.

Dao, Tri, Jay Shah, Ganesh Bikshandi, et al. 2025. “FlashAttention-4: Hardware-Friendly Attention on Hopper and Blackwell GPUs.” https://tridao.me/blog/2025/flash4/.

DeepSeek-AI. 2025. “FlashMLA: Efficient Multi-Head Latent Attention Decoding Kernels.” https://github.com/deepseek-ai/FlashMLA.

DeepSpeed Team. 2023. “DeepSpeed-FastGen: High-Throughput Text Generation for LLMs via MII and DeepSpeed-Inference.” https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-fastgen/README.md.

Dettmers, Tim et al. 2021. “Bitsandbytes: Accessible Large Language Models via k-Bit Quantization for PyTorch.” https://github.com/bitsandbytes-foundation/bitsandbytes.

Dettmers, Tim. 2022. “LLM.int8() and Emergent Features.” https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/.

Dong, Juechu, Boyuan Feng, Driss Guessous, Yanbo Liang, and Horace He. 2024. “FlexAttention: A Programming Model for Generating Optimized Attention Kernels.” arXiv Preprint arXiv:2412.05496. https://arxiv.org/abs/2412.05496.

Frantar, Elias, and Dan Alistarh. 2023. “SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot.” Proceedings of the 40th International Conference on Machine Learning. https://arxiv.org/abs/2301.00774.

Frantar, Elias, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. “GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers.” arXiv Preprint arXiv:2210.17323. https://arxiv.org/abs/2210.17323.

Fu, Yichao, Peter Bailis, Ion Stoica, and Hao Zhang. 2024. “Break the Sequential Dependency of LLM Inference Using Lookahead Decoding.” arXiv Preprint arXiv:2402.02057. https://arxiv.org/abs/2402.02057.

Gloeckle, Fabian, Badr Youbi Idrissi, Baptiste Roziere, David Lopez-Paz, and Gabriel Synnaeve. 2024. “Better & Faster Large Language Models via Multi-Token Prediction.” arXiv Preprint arXiv:2404.19737. https://arxiv.org/abs/2404.19737.

Gordić, Aleksa. 2023. “ELI5: FlashAttention.” https://gordicaleksa.medium.com/eli5-flash-attention-5c44017022ad.

Grafana Labs. 2024. “Grafana Documentation.” https://grafana.com/docs/grafana/latest/.

Grattafiori, Aaron, Abhimanyu Dubey, Abhinav Jauhri, et al. 2024. “The Llama 3 Herd of Models.” arXiv Preprint arXiv:2407.21783. https://arxiv.org/abs/2407.21783.

Grootendorst, Maarten. 2024. “A Visual Guide to Quantization.” https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization.

Gu, Albert, and Tri Dao. 2023. “Mamba: Linear-Time Sequence Modeling with Selective State Spaces.” arXiv Preprint arXiv:2312.00752. https://arxiv.org/abs/2312.00752.

Gu, Albert, and Tri Dao. 2024. “GPUs Go Brrr.” https://hazyresearch.stanford.edu/blog/2024-05-12-tk.

Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. 2015. “Distilling the Knowledge in a Neural Network.” arXiv Preprint arXiv:1503.02531. https://arxiv.org/abs/1503.02531.

Holmes, Connor, Masahiro Tanaka, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, and Yuxiong He. 2024. “DeepSpeed-FastGen: High-Throughput Text Generation for LLMs via MII and DeepSpeed-Inference.” arXiv Preprint arXiv:2401.08671. https://arxiv.org/abs/2401.08671.

Hong, Ke, Guohao Dai, Jiaming Xu, et al. 2023. “FlashDecoding++: Faster Large Language Model Inference on GPUs.” arXiv Preprint arXiv:2311.01282. https://arxiv.org/abs/2311.01282.

Hu, Edward J., Yelong Shen, Phillip Wallis, et al. 2021. “LoRA: Low-Rank Adaptation of Large Language Models.” arXiv Preprint arXiv:2106.09685. https://arxiv.org/abs/2106.09685.

Hugging Face. 2024a. “Methods and Tools for Efficient Training on a Single GPU.” https://huggingface.co/docs/transformers/en/perf_train_gpu_many.

Hugging Face. 2024b. “Quantization Concept Guide.” https://huggingface.co/docs/transformers/en/quantization/concept_guide.

Jacobs, Sam Ade, Masahiro Tanaka, Chengming Zhang, et al. 2023. “DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models.” arXiv Preprint arXiv:2309.14509. https://arxiv.org/abs/2309.14509.

JarvisLabs. 2025. “Speculative Decoding in vLLM: Complete Guide to Faster LLM Inference.” https://docs.jarvislabs.ai/blog/speculative-decoding-vllm-faster-llm-inference.

Jiang, Albert Q., Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, et al. 2024. “Mixtral of Experts.” arXiv Preprint arXiv:2401.04088. https://arxiv.org/abs/2401.04088.

Kadous, Waleed, Kyle Huang, Wendi Ding, Liguang Xie, Avnish Narayan, and Ricky Xu. 2023. “Reproducible Performance Metrics for LLM Inference.” https://www.anyscale.com/blog/reproducible-performance-metrics-for-llm-inference.

Karpathy, Andrej. 2022. “nanoGPT.” https://github.com/karpathy/nanoGPT.

Karpathy, Andrej. 2023. “Let’s Build GPT: From Scratch, in Code, Spelled Out.” https://www.youtube.com/watch?v=kCc8FmEb1nY.

Kazemnejad, Amirhossein, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. 2023. “The Impact of Positional Encoding on Length Generalization in Transformers.” arXiv Preprint arXiv:2305.19466. https://arxiv.org/abs/2305.19466.

Kimi Team, Yu Zhang, Zongyu Lin, et al. 2025. “Kimi Linear: An Expressive, Efficient Attention Architecture.” arXiv Preprint arXiv:2510.26692. https://arxiv.org/abs/2510.26692.

Kumar, Tanishq, Tri Dao, and Avner May. 2026. “Speculative Speculative Decoding.” arXiv Preprint arXiv:2603.03251. https://arxiv.org/abs/2603.03251.

Kwon, Woosuk, Zhuohan Li, Siyuan Zhuang, et al. 2023a. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” arXiv Preprint arXiv:2309.06180. https://arxiv.org/abs/2309.06180.

Kwon, Woosuk, Zhuohan Li, Siyuan Zhuang, et al. 2023b. “vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention.” https://vllm.ai/blog/vllm.

Leviathan, Yaniv, Matan Kalman, and Yossi Matias. 2022. “Fast Inference from Transformers via Speculative Decoding.” arXiv Preprint arXiv:2211.17192. https://arxiv.org/abs/2211.17192.

Leviathan, Yaniv, Matan Kalman, and Yossi Matias. 2024. “Looking Back at Speculative Decoding.” https://research.google/blog/looking-back-at-speculative-decoding/.

Li, Dacheng, Rulin Shao, Anze Xie, et al. 2023. “DistFlashAttn: Distributed Memory-Efficient Attention for Long-Context LLMs Training.” arXiv Preprint arXiv:2310.03294. https://arxiv.org/abs/2310.03294.

Li, Shenggui, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You. 2021. “Sequence Parallelism: Long Sequence Training from System Perspective.” arXiv Preprint arXiv:2105.13120. https://arxiv.org/abs/2105.13120.

Li, Yuhong, Yingbing Huang, Bowen Yang, et al. 2024. “SnapKV: LLM Knows What You Are Looking for Before Generation.” Advances in Neural Information Processing Systems. https://arxiv.org/abs/2404.14469.

Li, Zhuohan, Lianmin Zheng, Yinmin Zhong, et al. 2023. “AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving.” arXiv Preprint arXiv:2302.11665. https://arxiv.org/abs/2302.11665.

Lin, Ji, Jiaming Tang, Haotian Tang, et al. 2023. “AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration.” arXiv Preprint arXiv:2306.00978. https://arxiv.org/abs/2306.00978.

Liu, Aixin et al. 2024a. “DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model.” arXiv Preprint arXiv:2405.04434. https://arxiv.org/abs/2405.04434.

Liu, Aixin et al. 2024b. “DeepSeek-V3 Technical Report.” arXiv Preprint arXiv:2412.19437. https://arxiv.org/abs/2412.19437.

Liu, Zirui, Jiayi Yuan, Hongye Jin, et al. 2024. “KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache.” Proceedings of the 41st International Conference on Machine Learning. https://arxiv.org/abs/2402.02750.

LMSYS. 2024a. “Fast and Expressive LLM Inference with RadixAttention and SGLang.” https://lmsys.org/blog/2024-01-17-sglang/.

LMSYS. 2024b. “SGLang V0.4: Zero-Overhead Batch Scheduler, Cache-Aware Load Balancer, and Faster Structured Outputs.” https://www.lmsys.org/blog/2024-12-04-sglang-v0-4/.

LMSYS. 2025a. “Accelerating SGLang with Multiple Token Prediction.” https://lmsys.org/blog/2025-07-17-mtp/.

LMSYS. 2025b. “Deploying DeepSeek with PD Disaggregation and Large-Scale Expert Parallelism.” https://www.lmsys.org/blog/2025-05-05-large-scale-ep/.

LMSYS. 2025c. “HiCache: Hierarchical KV Caching for SGLang.” https://lmsys.org/blog/2025-09-10-sglang-hicache/.

LMSYS. 2026. “Chunked Pipeline Parallelism in SGLang.” https://www.lmsys.org/blog/2026-01-15-chunked-pipeline/.

Miao, Xupeng, Gabriele Oliaro, Zhihao Zhang, et al. 2023. “Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems.” arXiv Preprint arXiv:2312.15234. https://arxiv.org/abs/2312.15234.

Microsoft Research. 2026. “Memento: Teaching LLMs to Manage Their Own Context.” https://www.microsoft.com/en-us/research/articles/memento-teaching-llms-to-manage-their-own-context/.

Modal. 2024. “GPU Glossary.” https://modal.com/gpu-glossary.

NVIDIA. 2021. “Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT.” https://developer.nvidia.com/blog/accelerating-inference-with-sparsity-using-ampere-and-tensorrt/.

NVIDIA. 2022. “NVIDIA Hopper Architecture in-Depth.” https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/.

NVIDIA. 2023. “NVIDIA H100 Tensor Core GPU.” https://www.nvidia.com/en-us/data-center/h100/.

NVIDIA. 2024a. “CUDA Programming Guide.” https://docs.nvidia.com/cuda/cuda-programming-guide/.

NVIDIA. 2024b. “Deciding Model Sharding Strategy.” https://nvidia.github.io/TensorRT-LLM/performance/performance-tuning-guide/deciding-model-sharding-strategy.html.

NVIDIA. 2024c. “Disaggregated Serving in TensorRT-LLM.” https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.html.

NVIDIA. 2024d. “KV Cache Reuse Optimizations in TensorRT-LLM.” https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/.

NVIDIA. 2024e. “Optimizing llama.cpp AI Inference with CUDA Graphs.” https://developer.nvidia.com/blog/optimizing-llama-cpp-ai-inference-with-cuda-graphs/.

NVIDIA. 2024f. “Streamlining AI Inference Performance and Deployment with NVIDIA TensorRT-LLM Chunked Prefill.” https://developer.nvidia.com/blog/streamlining-ai-inference-performance-and-deployment-with-nvidia-tensorrt-llm-chunked-prefill/.

NVIDIA. 2024g. “TensorRT-LLM Documentation.” https://nvidia.github.io/TensorRT-LLM/.

NVIDIA. 2024h. “TensorRT-LLM Expert Parallelism Documentation.” https://nvidia.github.io/TensorRT-LLM/advanced/expert-parallelism.html.

NVIDIA. 2024i. “TensorRT-LLM GPT Attention Documentation.” https://nvidia.github.io/TensorRT-LLM/advanced/gpt-attention.html.

NVIDIA. 2024j. “TensorRT-LLM KV Cache System Documentation.” https://nvidia.github.io/TensorRT-LLM/latest/features/kvcache.html.

NVIDIA. 2024k. “TensorRT-LLM Parallelism Strategies Documentation.” https://nvidia.github.io/TensorRT-LLM/features/parallelism.html.

NVIDIA. 2024l. “TensorRT-LLM Release Notes.” https://nvidia.github.io/TensorRT-LLM/release-notes.html.

NVIDIA. 2024m. “TensorRT-LLM Speculative Decoding Documentation.” https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html.

Ong, Isaac, Amjad Almahairi, Vincent Wu, et al. 2024. “RouteLLM: An Open-Source Framework for Cost-Effective LLM Routing.” https://www.lmsys.org/blog/2024-07-01-routellm/.

OpenTelemetry Authors. 2024. “OpenTelemetry Documentation.” https://opentelemetry.io/docs/.

Patel, Pratyush, Esha Choukse, Chaojie Zhang, et al. 2024. “Splitwise: Efficient Generative LLM Inference Using Phase Splitting.” arXiv Preprint arXiv:2311.18677. https://arxiv.org/abs/2311.18677.

Pope, Reiner, Sholto Douglas, Aakanksha Chowdhery, et al. 2022. “Efficiently Scaling Transformer Inference.” arXiv Preprint arXiv:2211.05102. https://arxiv.org/abs/2211.05102.

Press, Ofir, Noah A. Smith, and Mike Lewis. 2021. “Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation.” arXiv Preprint arXiv:2108.12409. https://arxiv.org/abs/2108.12409.

Prometheus Authors. 2024. “Prometheus Documentation.” https://prometheus.io/docs/.

PyTorch Team. 2024. “A Hitchhiker’s Guide to Speculative Decoding.” https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/.

Qin, Ruoyu, Zheming Li, Weiran He, et al. 2024. “Mooncake: A KVCache-Centric Disaggregated Architecture for LLM Serving.” arXiv Preprint arXiv:2407.00079. https://arxiv.org/abs/2407.00079.

Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. “Language Models Are Unsupervised Multitask Learners.” OpenAI Blog. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.

Raschka, Sebastian. 2025. “LLM Architecture Gallery.” https://sebastianraschka.com/llm-architecture-gallery/.

Raschka, Sebastian. 2026. “A Visual Guide to Attention Variants in Modern LLMs.” https://magazine.sebastianraschka.com/p/visual-attention-variants.

Red Hat. 2025. “Why vLLM Is the Best Choice for AI Inference Today.” https://developers.redhat.com/articles/2025/10/30/why-vllm-best-choice-ai-inference-today.

Rush, Alexander. 2018. “The Annotated Transformer.” https://nlp.seas.harvard.edu/annotated-transformer/.

Sanderson, Grant. 2024. “Attention in Transformers, Visually Explained.” https://www.3blue1brown.com/lessons/attention.

Sanseviero, Omar, Lewis Tunstall, Philipp Schmid, Sourab Mangrulkar, Younes Belkada, and Pedro Cuenca. 2023. “Mixture of Experts Explained.” https://huggingface.co/blog/moe.

SGLang Team. 2024. “SGLang Documentation.” https://docs.sglang.ai/.

Shah, Jay, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. 2024. “FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision.” arXiv Preprint arXiv:2407.08608. https://arxiv.org/abs/2407.08608.

Shaw, Robert, and Michael Goin. 2023. “SparseGPT: Remove 100 Billion Parameters for Free.” https://developers.redhat.com/articles/2023/03/21/sparsegpt-remove-100-billion-parameters-free.

Shazeer, Noam. 2019. “Fast Transformer Decoding: One Write-Head Is All You Need.” arXiv Preprint arXiv:1911.02150. https://arxiv.org/abs/1911.02150.

Shazeer, Noam. 2020. “GLU Variants Improve Transformer.” arXiv Preprint arXiv:2002.05202. https://arxiv.org/abs/2002.05202.

Sheng, Ying, Shiyi Cao, Dacheng Li, et al. 2023. “S-LoRA: Serving Thousands of Concurrent LoRA Adapters.” arXiv Preprint arXiv:2311.03285. https://arxiv.org/abs/2311.03285.

Spheron. 2025. “vLLM Vs TensorRT-LLM Vs SGLang: Benchmarks on H100.” https://www.spheron.network/blog/vllm-vs-tensorrt-llm-vs-sglang-benchmarks/.

SqueezeBits. 2025. “vLLM Vs TensorRT-LLM #4: Which Scheduler Wins?” https://blog.squeezebits.com/vllm-vs-tensorrtllm-4-which-scheduler-wins--33083.

Stern, Mitchell, Noam Shazeer, and Jakob Uszkoreit. 2018. “Blockwise Parallel Decoding for Deep Autoregressive Models.” arXiv Preprint arXiv:1811.03115. https://arxiv.org/abs/1811.03115.

Su, Jianlin, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2021. “RoFormer: Enhanced Transformer with Rotary Position Embedding.” arXiv Preprint arXiv:2104.09864. https://arxiv.org/abs/2104.09864.

Sun, Biao, Ziming Huang, Hanyu Chen, et al. 2024. “Llumnix: Dynamic Scheduling for Large Language Model Serving.” Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation. https://arxiv.org/abs/2406.03243.

Sun, Mingjie, Zhuang Liu, Anna Bair, and J. Zico Kolter. 2024. “A Simple and Effective Pruning Approach for Large Language Models.” International Conference on Learning Representations. https://arxiv.org/abs/2306.11695.

Tao, Chaofan, Qian Jia, Longxu Dou, et al. 2024. “Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies.” Advances in Neural Information Processing Systems. https://arxiv.org/abs/2407.13623.

Thomas, Derek, Diego Maniloff, and David Holtz. 2024. “TGI Multi-LoRA: Deploy Once, Serve 30 Models.” https://huggingface.co/blog/multi-lora-serving.

Tillet, Philippe, H. T. Kung, and David Cox. 2019. “Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations.” Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL). https://doi.org/10.1145/3315508.3329973.

Touvron, Hugo, Thibaut Lavril, Gautier Izacard, et al. 2023. “LLaMA: Open and Efficient Foundation Language Models.” arXiv Preprint arXiv:2302.13971. https://arxiv.org/abs/2302.13971.

Touvron, Hugo, Louis Martin, Kevin Stone, et al. 2023. “Llama 2: Open Foundation and Fine-Tuned Chat Models.” arXiv Preprint arXiv:2307.09288. https://arxiv.org/abs/2307.09288.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. 2017. “Attention Is All You Need.” arXiv Preprint arXiv:1706.03762. https://arxiv.org/abs/1706.03762.

Verma, Shashank, and Neal Vaidya. 2023. “Mastering LLM Techniques: Inference Optimization.” https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/.

vLLM Team. 2024a. “Attention Backend Feature Support.” https://docs.vllm.ai/en/latest/design/attention_backends/.

vLLM Team. 2024b. “Benchmark CLI.” https://docs.vllm.ai/en/latest/benchmarking/cli/.

vLLM Team. 2024c. “Fusion Torch.compile Passes.” https://docs.vllm.ai/en/latest/design/fusions/.

vLLM Team. 2024d. “Quantized KV Cache.” https://docs.vllm.ai/en/latest/features/quantization/quantized_kvcache/.

vLLM Team. 2024e. “vLLM Documentation.” https://docs.vllm.ai/.

vLLM Team. 2025a. “Disaggregated Prefilling (Experimental).” https://docs.vllm.ai/en/latest/features/disagg_prefill/.

vLLM Team. 2025b. “Inside vLLM: Anatomy of a High-Throughput LLM Inference System.” https://vllm.ai/blog/anatomy-of-vllm.

vLLM Team. 2025c. “vLLM Large Scale Serving: DeepSeek at 2.2k Tok/s/H200 with Wide-EP.” https://vllm.ai/blog/large-scale-serving.

vLLM Team. 2025d. “vLLM Router: A High-Performance and Prefill/Decode Aware Load Balancer for Large-Scale Serving.” https://vllm.ai/blog/vllm-router-release.

vLLM Team. 2025e. “vLLM V1: A Major Upgrade to vLLM’s Core Architecture.” https://vllm.ai/blog/v1-alpha-release.

vLLM Team. 2026. “vLLM Triton Attention Backend Deep Dive.” https://vllm.ai/blog/vllm-triton-backend-deep-dive.

Wan, Zhongwei, Xin Wang, Che Liu, et al. 2024. “Efficient Large Language Models: A Survey.” Transactions on Machine Learning Research. https://arxiv.org/abs/2312.03863.

Weng, Lilian. 2021. “How to Train Really Large Models on Many GPUs?” https://lilianweng.github.io/posts/2021-09-25-train-large/.

Weng, Lilian. 2023a. “Large Transformer Model Inference Optimization.” https://lilianweng.github.io/posts/2023-01-10-inference-optimization/.

Weng, Lilian. 2023b. “The Transformer Family Version 2.0.” https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/.

Williams, Samuel, Andrew Waterman, and David Patterson. 2009. “Roofline: An Insightful Visual Performance Model for Multicore Architectures.” Communications of the ACM 52 (4): 65–76. https://dl.acm.org/doi/pdf/10.1145/1498765.1498785.

Wu, Bingyang, Yinmin Zhong, Zili Zhang, et al. 2023. “Fast Distributed Inference Serving for Large Language Models.” arXiv Preprint arXiv:2305.05920. https://arxiv.org/abs/2305.05920.

Xia, Heming et al. 2024. “Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding.” Findings of the Association for Computational Linguistics: ACL. https://arxiv.org/abs/2401.07851.

Xiao, Guangxuan, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. “SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models.” arXiv Preprint arXiv:2211.10438. https://arxiv.org/abs/2211.10438.

Xu, Xiaohan, Ming Li, Chongyang Tao, et al. 2024. “A Survey on Knowledge Distillation of Large Language Models.” arXiv Preprint arXiv:2402.13116. https://arxiv.org/abs/2402.13116.

Yang, An, Anfeng Li, Baosong Yang, et al. 2025. “Qwen3 Technical Report.” arXiv Preprint arXiv:2505.09388. https://arxiv.org/abs/2505.09388.

Yang, Nan, Tao Ge, Liang Wang, et al. 2023. “Inference with Reference: Lossless Acceleration of Large Language Models.” arXiv Preprint arXiv:2304.04487. https://arxiv.org/abs/2304.04487.

Ye, Zihao, Lianmin Zheng, Lequn Chen, et al. 2025. “FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving.” arXiv Preprint arXiv:2501.01005. https://arxiv.org/abs/2501.01005.

Yu, Gyeong-In, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. “Orca: A Distributed Serving System for Transformer-Based Generative Models.” Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation. https://www.usenix.org/conference/osdi22/presentation/yu.

Yuan, Jingyang et al. 2025. “Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention.” arXiv Preprint arXiv:2502.11089. https://arxiv.org/abs/2502.11089.

Yuan, Zhihang, Yuzhang Shang, Yang Zhou, et al. 2024. “LLM Inference Unveiled: Survey and Roofline Model Insights.” arXiv Preprint arXiv:2402.16363. https://arxiv.org/abs/2402.16363.

Zandieh, Amir, Majid Daliri, Majid Hadian, and Vahab Mirrokni. 2025. “TurboQuant: Online Vector Quantization with Near-Optimal Distortion Rate.” International Conference on Learning Representations. https://arxiv.org/abs/2504.19874.

Zhang, Biao, and Rico Sennrich. 2019. “Root Mean Square Layer Normalization.” Advances in Neural Information Processing Systems. https://arxiv.org/abs/1910.07467.

Zhang, Zhenyu, Ying Sheng, Tianyi Zhou, et al. 2023. “H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.” Advances in Neural Information Processing Systems. https://arxiv.org/abs/2306.14048.

Zhao, Cen, Xiaodong Wang, and Jianyu Huang. 2025. “Scaling LLM Inference: Innovations in Tensor Parallelism, Context Parallelism, and Expert Parallelism.” https://engineering.fb.com/2025/10/17/ai-research/scaling-llm-inference-innovations-tensor-parallelism-context-parallelism-expert-parallelism/.

Zheng, Lianmin, Liangsheng Yin, Zhiqiang Xie, et al. 2023. “SGLang: Efficient Execution of Structured Language Model Programs.” arXiv Preprint arXiv:2312.07104. https://arxiv.org/abs/2312.07104.

Zhong, Yinmin, Shengyu Liu, Junda Chen, et al. 2024. “DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving.” arXiv Preprint arXiv:2401.09670. https://arxiv.org/abs/2401.09670.