References
Adnan, Muhammad, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya
Soloveychik, and Purushotham Kamath. 2024. “Keyformer:
KV Cache Reduction Through Key Tokens Selection for
Efficient Generative Inference.” MLSys. https://proceedings.mlsys.org/paper_files/paper/2024/hash/48fecef47b19fe501d27d338b6d52582-Abstract-Conference.html.
Agrawal, Amey, Nitin Kedia, Ashish Panwar, et al. 2024. “Taming
Throughput-Latency Tradeoff in LLM Inference with
Sarathi-Serve.” Proceedings of the 18th USENIX
Symposium on Operating Systems Design and Implementation. https://arxiv.org/abs/2403.02310.
Agrawal, Amey, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S.
Gulavani, and Ramachandran Ramjee. 2023. “Sarathi: Efficient
LLM Inference by Piggybacking Decodes with Chunked
Prefills.” arXiv Preprint arXiv:2308.16369. https://arxiv.org/abs/2308.16369.
Ainslie, Joshua, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy,
Federico Lebrón, and Sumit Sanghai. 2023. “GQA:
Training Generalized Multi-Query Transformer Models from Multi-Head
Checkpoints.” arXiv Preprint arXiv:2305.13245. https://arxiv.org/abs/2305.13245.
Alammar, Jay. 2018. “The Illustrated Transformer.” https://jalammar.github.io/illustrated-transformer/.
AMD ROCm Team. 2025. “The vLLM
MoE Playbook: A Practical Guide to TP,
DP, PP and Expert Parallelism.” https://rocm.blogs.amd.com/software-tools-optimization/vllm-moe-guide/README.html.
Aminabadi, Reza Yazdani, Samyam Rajbhandari, Minjia Zhang, et al.
2022. “DeepSpeed-Inference: Enabling Efficient
Inference of Transformer Models at Unprecedented Scale.”
arXiv Preprint arXiv:2207.00032. https://arxiv.org/abs/2207.00032.
Anyscale. 2024a. “How Continuous Batching Enables 23x Throughput
in LLM Inference While Reducing P50 Latency.” https://www.anyscale.com/blog/continuous-batching-llm-inference.
Anyscale. 2024b. “LLMPerf: A Tool for Benchmarking
LLM Inference.” https://github.com/ray-project/llmperf.
Austin, Jacob, Sholto Douglas, Roy Frostig, et al. 2025. “How to
Scale Your Model.” https://jax-ml.github.io/scaling-book/.
Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016.
“Layer Normalization.” arXiv Preprint
arXiv:1607.06450. https://arxiv.org/abs/1607.06450.
Banatt, Eryk. 2025. “Understanding Multi-Head Latent
Attention.” https://planetbanatt.net/articles/mla.html.
Beltagy, Iz, Matthew E. Peters, and Arman Cohan. 2020.
“Longformer: The Long-Document Transformer.” arXiv
Preprint arXiv:2004.05150. https://arxiv.org/abs/2004.05150.
Bloem, Peter. 2019. “Transformers from Scratch.” https://peterbloem.nl/blog/transformers.
Brandon, William, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, and
Jonathan Ragan-Kelley. 2024. “Reducing Transformer Key-Value Cache
Size with Cross-Layer Attention.” arXiv Preprint
arXiv:2405.12981. https://arxiv.org/abs/2405.12981.
Bycroft, Brendan. 2023. “LLM Visualization.”
https://bbycroft.net/llm.
Cai, Tianle, Yuhong Li, Zhengyang Geng, et al. 2024. “Medusa:
Simple LLM Inference Acceleration Framework with Multiple
Decoding Heads.” arXiv Preprint arXiv:2401.10774. https://arxiv.org/abs/2401.10774.
Cai, Zefan. 2024.
“Awesome-LLM-KV-Cache.” https://github.com/Zefan-Cai/Awesome-LLM-KV-Cache.
Casey, Matt. 2024. “LLM Distillation Demystified: A
Complete Guide.” https://snorkel.ai/blog/llm-distillation-demystified-a-complete-guide/.
Chen, Carol. 2022. “Transformer Inference Arithmetic.” https://kipp.ly/transformer-inference-arithmetic/.
Chen, Charlie, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste
Lespiau, Laurent Sifre, and John Jumper. 2023. “Accelerating Large
Language Model Decoding with Speculative Sampling.” arXiv
Preprint arXiv:2302.01318. https://arxiv.org/abs/2302.01318.
Chen, Junda, Yonghao Zhuang, and Hao Zhang. 2025. “Disaggregated
Inference: 18 Months Later.” https://haoailab.com/blogs/distserve-retro/.
Chen, Lequn, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind
Krishnamurthy. 2023. “Punica: Multi-Tenant LoRA
Serving.” arXiv Preprint arXiv:2310.18547. https://arxiv.org/abs/2310.18547.
Chips and Cheese. 2023. “Nvidia’s H100: Funny
L2, and Tons of Bandwidth.” https://chipsandcheese.com/2023/07/02/nvidias-h100-funny-l2-and-tons-of-bandwidth/.
Cho, Aeree, Grace C. Kim, Alexander Karpekov, et al. 2024.
“Transformer Explainer: Interactive Learning of Text-Generative
Models.” https://poloclub.github.io/transformer-explainer/.
Chowdhery, Aakanksha, Sharan Narang, Jacob Devlin,
et al. 2022. “PaLM: Scaling Language Modeling
with Pathways.” arXiv Preprint arXiv:2204.02311. https://arxiv.org/abs/2204.02311.
Dao, Tri. 2023. “FlashAttention-2: Faster Attention
with Better Parallelism and Work Partitioning.” arXiv
Preprint arXiv:2307.08691. https://arxiv.org/abs/2307.08691.
Dao, Tri. 2024. “FlashAttention-3: Fast and Accurate
Attention with Asynchrony and Low-Precision.” https://tridao.me/blog/2024/flash3/.
Dao, Tri, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré.
2022. “FlashAttention: Fast and Memory-Efficient
Exact Attention with IO-Awareness.” arXiv
Preprint arXiv:2205.14135. https://arxiv.org/abs/2205.14135.
Dao, Tri, Daniel Haziza, Francisco Massa, and Grigory Sizov. 2023.
“Flash-Decoding for Long-Context Inference.” https://crfm.stanford.edu/2023/10/12/flashdecoding.html.
Dao, Tri, Jay Shah, Ganesh Bikshandi, et al. 2025.
“FlashAttention-4: Hardware-Friendly Attention on
Hopper and Blackwell GPUs.” https://tridao.me/blog/2025/flash4/.
DeepSeek-AI. 2025. FlashMLA: Efficient Multi-Head
Latent Attention Decoding Kernels. https://github.com/deepseek-ai/FlashMLA.
DeepSpeed Team. 2023. “DeepSpeed-FastGen:
High-Throughput Text Generation for LLMs via
MII and DeepSpeed-Inference.” https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-fastgen/README.md.
Dettmers, Tim, et al. 2021.
“Bitsandbytes: Accessible Large Language Models via k-Bit
Quantization for PyTorch.” https://github.com/bitsandbytes-foundation/bitsandbytes.
Dettmers, Tim. 2022. “LLM.int8() and
Emergent Features.” https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/.
Dong, Juechu, Boyuan Feng, Driss Guessous, Yanbo Liang, and Horace He.
2024. “FlexAttention: A Programming Model for
Generating Optimized Attention Kernels.” arXiv Preprint
arXiv:2412.05496. https://arxiv.org/abs/2412.05496.
Frantar, Elias, and Dan Alistarh. 2023. “SparseGPT:
Massive Language Models Can Be Accurately Pruned in One-Shot.”
Proceedings of the 40th International Conference on Machine
Learning. https://arxiv.org/abs/2301.00774.
Frantar, Elias, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022.
“GPTQ: Accurate Post-Training Quantization for
Generative Pre-Trained Transformers.” arXiv Preprint
arXiv:2210.17323. https://arxiv.org/abs/2210.17323.
Fu, Yichao, Peter Bailis, Ion Stoica, and Hao Zhang. 2024. “Break
the Sequential Dependency of LLM Inference Using Lookahead
Decoding.” arXiv Preprint arXiv:2402.02057. https://arxiv.org/abs/2402.02057.
Gloeckle, Fabian, Badr Youbi Idrissi, Baptiste Roziere, David Lopez-Paz,
and Gabriel Synnaeve. 2024. “Better & Faster Large Language
Models via Multi-Token Prediction.” arXiv Preprint
arXiv:2404.19737. https://arxiv.org/abs/2404.19737.
Gordić, Aleksa. 2023. “ELI5:
FlashAttention.” https://gordicaleksa.medium.com/eli5-flash-attention-5c44017022ad.
Grafana Labs. 2024. “Grafana Documentation.”
https://grafana.com/docs/grafana/latest/.
Grattafiori, Aaron, Abhimanyu Dubey, Abhinav
Jauhri, et al. 2024. “The Llama 3 Herd of
Models.” arXiv Preprint arXiv:2407.21783. https://arxiv.org/abs/2407.21783.
Grootendorst, Maarten. 2024. “A Visual Guide to
Quantization.” https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization.
Gu, Albert, and Tri Dao. 2023. “Mamba: Linear-Time
Sequence Modeling with Selective State Spaces.” arXiv
Preprint arXiv:2312.00752. https://arxiv.org/abs/2312.00752.
Spector, Benjamin, Aaryan Singhal, Simran Arora, and Christopher Ré. 2024. “GPUs Go Brrr.”
https://hazyresearch.stanford.edu/blog/2024-05-12-tk.
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. 2015. “Distilling
the Knowledge in a Neural Network.” arXiv Preprint
arXiv:1503.02531. https://arxiv.org/abs/1503.02531.
Holmes, Connor, Masahiro Tanaka, Jeff Rasley, Samyam Rajbhandari, Reza
Yazdani Aminabadi, and Yuxiong He. 2024.
“DeepSpeed-FastGen: High-Throughput Text Generation
for LLMs via MII and
DeepSpeed-Inference.” arXiv Preprint
arXiv:2401.08671. https://arxiv.org/abs/2401.08671.
Hong, Ke, Guohao Dai, Jiaming Xu, et al. 2023.
“FlashDecoding++: Faster Large Language Model
Inference on GPUs.” arXiv Preprint
arXiv:2311.01282. https://arxiv.org/abs/2311.01282.
Hu, Edward J., Yelong Shen, Phillip Wallis, et al. 2021.
“LoRA: Low-Rank Adaptation of Large Language
Models.” arXiv Preprint arXiv:2106.09685. https://arxiv.org/abs/2106.09685.
Hugging Face. 2024a. “Methods and Tools for Efficient Training on
a Single GPU.” https://huggingface.co/docs/transformers/en/perf_train_gpu_many.
Hugging Face. 2024b. “Quantization Concept Guide.” https://huggingface.co/docs/transformers/en/quantization/concept_guide.
Jacobs, Sam Ade, Masahiro Tanaka, Chengming Zhang, et al. 2023.
“DeepSpeed Ulysses: System Optimizations for Enabling
Training of Extreme Long Sequence Transformer Models.” arXiv
Preprint arXiv:2309.14509. https://arxiv.org/abs/2309.14509.
JarvisLabs. 2025. “Speculative Decoding in vLLM: Complete Guide to Faster LLM
Inference.” https://docs.jarvislabs.ai/blog/speculative-decoding-vllm-faster-llm-inference.
Jiang, Albert Q., Alexandre Sablayrolles, Antoine
Roux, Arthur Mensch, et al. 2024. “Mixtral of
Experts.” arXiv Preprint arXiv:2401.04088. https://arxiv.org/abs/2401.04088.
Kadous, Waleed, Kyle Huang, Wendi Ding, Liguang Xie, Avnish Narayan, and
Ricky Xu. 2023. “Reproducible Performance Metrics for
LLM Inference.” https://www.anyscale.com/blog/reproducible-performance-metrics-for-llm-inference.
Karpathy, Andrej. 2022. nanoGPT. https://github.com/karpathy/nanoGPT.
Karpathy, Andrej. 2023. “Let’s Build GPT: From
Scratch, in Code, Spelled Out.” https://www.youtube.com/watch?v=kCc8FmEb1nY.
Kazemnejad, Amirhossein, Inkit Padhi, Karthikeyan Natesan Ramamurthy,
Payel Das, and Siva Reddy. 2023. “The Impact of Positional
Encoding on Length Generalization in Transformers.” arXiv
Preprint arXiv:2305.19466. https://arxiv.org/abs/2305.19466.
Kimi Team, Yu Zhang, Zongyu Lin, et al. 2025. “Kimi
Linear: An Expressive, Efficient Attention Architecture.”
arXiv Preprint arXiv:2510.26692. https://arxiv.org/abs/2510.26692.
Kumar, Tanishq, Tri Dao, and Avner May. 2026. “Speculative
Speculative Decoding.” arXiv Preprint arXiv:2603.03251.
https://arxiv.org/abs/2603.03251.
Kwon, Woosuk, Zhuohan Li, Siyuan Zhuang, et al. 2023a. “Efficient
Memory Management for Large Language Model Serving with
PagedAttention.” arXiv Preprint
arXiv:2309.06180. https://arxiv.org/abs/2309.06180.
Kwon, Woosuk, Zhuohan Li, Siyuan Zhuang, et al. 2023b. “vLLM: Easy, Fast, and Cheap LLM
Serving with PagedAttention.” https://vllm.ai/blog/vllm.
Leviathan, Yaniv, Matan Kalman, and Yossi Matias. 2022. “Fast
Inference from Transformers via Speculative Decoding.” arXiv
Preprint arXiv:2211.17192. https://arxiv.org/abs/2211.17192.
Leviathan, Yaniv, Matan Kalman, and Yossi Matias. 2024. “Looking
Back at Speculative Decoding.” https://research.google/blog/looking-back-at-speculative-decoding/.
Li, Dacheng, Rulin Shao, Anze Xie, et al.
2023. “DistFlashAttn: Distributed Memory-Efficient
Attention for Long-Context LLMs Training.” arXiv
Preprint arXiv:2310.03294. https://arxiv.org/abs/2310.03294.
Li, Shenggui, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You.
2021. “Sequence Parallelism: Long Sequence Training from System
Perspective.” arXiv Preprint arXiv:2105.13120. https://arxiv.org/abs/2105.13120.
Li, Yuhong, Yingbing Huang, Bowen Yang, et al. 2024.
“SnapKV: LLM Knows What You Are Looking
for Before Generation.” Advances in Neural Information
Processing Systems. https://arxiv.org/abs/2404.14469.
Li, Zhuohan, Lianmin Zheng, Yinmin Zhong, et al. 2023.
“AlpaServe: Statistical Multiplexing with Model
Parallelism for Deep Learning Serving.” arXiv Preprint
arXiv:2302.11665. https://arxiv.org/abs/2302.11665.
Lin, Ji, Jiaming Tang, Haotian Tang, et al. 2023.
“AWQ: Activation-Aware Weight Quantization for
LLM Compression and Acceleration.” arXiv
Preprint arXiv:2306.00978. https://arxiv.org/abs/2306.00978.
Liu, Aixin, et al. 2024a.
“DeepSeek-V2: A Strong, Economical, and Efficient
Mixture-of-Experts Language Model.” arXiv Preprint
arXiv:2405.04434. https://arxiv.org/abs/2405.04434.
Liu, Aixin, et al. 2024b.
“DeepSeek-V3 Technical Report.” arXiv
Preprint arXiv:2412.19437. https://arxiv.org/abs/2412.19437.
Liu, Zirui, Jiayi Yuan, Hongye Jin, et al. 2024.
“KIVI: A Tuning-Free Asymmetric 2bit Quantization for
KV Cache.” Proceedings of the 41st International
Conference on Machine Learning. https://arxiv.org/abs/2402.02750.
LMSYS. 2024a. “Fast and Expressive LLM Inference with
RadixAttention and SGLang.” https://lmsys.org/blog/2024-01-17-sglang/.
LMSYS. 2024b. “SGLang V0.4: Zero-Overhead Batch
Scheduler, Cache-Aware Load Balancer, and Faster Structured
Outputs.” https://www.lmsys.org/blog/2024-12-04-sglang-v0-4/.
LMSYS. 2025a. “Accelerating SGLang with Multiple
Token Prediction.” https://lmsys.org/blog/2025-07-17-mtp/.
LMSYS. 2025b. “Deploying DeepSeek with
PD Disaggregation and Large-Scale Expert
Parallelism.” https://www.lmsys.org/blog/2025-05-05-large-scale-ep/.
LMSYS. 2025c. “HiCache: Hierarchical KV
Caching for SGLang.” https://lmsys.org/blog/2025-09-10-sglang-hicache/.
LMSYS. 2026. “Chunked Pipeline Parallelism in
SGLang.” https://www.lmsys.org/blog/2026-01-15-chunked-pipeline/.
Miao, Xupeng, Gabriele Oliaro, Zhihao Zhang, et al. 2023. “Towards
Efficient Generative Large Language Model Serving: A Survey from
Algorithms to Systems.” arXiv Preprint arXiv:2312.15234.
https://arxiv.org/abs/2312.15234.
Microsoft Research. 2026. “Memento: Teaching LLMs to
Manage Their Own Context.” https://www.microsoft.com/en-us/research/articles/memento-teaching-llms-to-manage-their-own-context/.
Modal. 2024. “GPU Glossary.” https://modal.com/gpu-glossary.
NVIDIA. 2021. “Accelerating Inference with Sparsity Using the
NVIDIA Ampere Architecture and
NVIDIA TensorRT.” https://developer.nvidia.com/blog/accelerating-inference-with-sparsity-using-ampere-and-tensorrt/.
NVIDIA. 2022. “NVIDIA Hopper Architecture
in-Depth.” https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/.
NVIDIA. 2023. “NVIDIA H100 Tensor Core GPU.”
https://www.nvidia.com/en-us/data-center/h100/.
NVIDIA. 2024a. “CUDA Programming Guide.” https://docs.nvidia.com/cuda/cuda-programming-guide/.
NVIDIA. 2024b. “Deciding Model Sharding Strategy.” https://nvidia.github.io/TensorRT-LLM/performance/performance-tuning-guide/deciding-model-sharding-strategy.html.
NVIDIA. 2024c. “Disaggregated Serving in
TensorRT-LLM.” https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.html.
NVIDIA. 2024d. “KV Cache Reuse Optimizations in
TensorRT-LLM.” https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/.
NVIDIA. 2024e. “Optimizing llama.cpp
AI Inference with CUDA Graphs.” https://developer.nvidia.com/blog/optimizing-llama-cpp-ai-inference-with-cuda-graphs/.
NVIDIA. 2024f. “Streamlining AI Inference Performance
and Deployment with NVIDIA TensorRT-LLM
Chunked Prefill.” https://developer.nvidia.com/blog/streamlining-ai-inference-performance-and-deployment-with-nvidia-tensorrt-llm-chunked-prefill/.
NVIDIA. 2024g. “TensorRT-LLM Documentation.”
https://nvidia.github.io/TensorRT-LLM/.
NVIDIA. 2024h. “TensorRT-LLM Expert Parallelism
Documentation.” https://nvidia.github.io/TensorRT-LLM/advanced/expert-parallelism.html.
NVIDIA. 2024i. “TensorRT-LLM GPT
Attention Documentation.” https://nvidia.github.io/TensorRT-LLM/advanced/gpt-attention.html.
NVIDIA. 2024j. “TensorRT-LLM KV Cache
System Documentation.” https://nvidia.github.io/TensorRT-LLM/latest/features/kvcache.html.
NVIDIA. 2024k. “TensorRT-LLM Parallelism Strategies
Documentation.” https://nvidia.github.io/TensorRT-LLM/features/parallelism.html.
NVIDIA. 2024l. “TensorRT-LLM Release Notes.”
https://nvidia.github.io/TensorRT-LLM/release-notes.html.
NVIDIA. 2024m. “TensorRT-LLM Speculative Decoding
Documentation.” https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html.
Ong, Isaac, Amjad Almahairi, Vincent Wu, et al. 2024.
“RouteLLM: An Open-Source Framework for
Cost-Effective LLM Routing.” https://www.lmsys.org/blog/2024-07-01-routellm/.
OpenTelemetry Authors. 2024. “OpenTelemetry
Documentation.” https://opentelemetry.io/docs/.
Patel, Pratyush, Esha Choukse, Chaojie Zhang, et al. 2024.
“Splitwise: Efficient Generative LLM Inference Using
Phase Splitting.” arXiv Preprint arXiv:2311.18677. https://arxiv.org/abs/2311.18677.
Pope, Reiner, Sholto Douglas, Aakanksha Chowdhery, et al. 2022.
“Efficiently Scaling Transformer Inference.” arXiv
Preprint arXiv:2211.05102. https://arxiv.org/abs/2211.05102.
Press, Ofir, Noah A. Smith, and Mike Lewis. 2021. “Train Short,
Test Long: Attention with Linear Biases Enables Input Length
Extrapolation.” arXiv Preprint arXiv:2108.12409. https://arxiv.org/abs/2108.12409.
Prometheus Authors. 2024. “Prometheus
Documentation.” https://prometheus.io/docs/.
PyTorch Team. 2024. “A Hitchhiker’s Guide to Speculative
Decoding.” https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/.
Qin, Ruoyu, Zheming Li, Weiran He, et al. 2024. “Mooncake: A
KVCache-Centric Disaggregated Architecture for
LLM Serving.” arXiv Preprint
arXiv:2407.00079. https://arxiv.org/abs/2407.00079.
Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and
Ilya Sutskever. 2019. “Language Models Are Unsupervised Multitask
Learners.” OpenAI Blog. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
Raschka, Sebastian. 2025. “LLM Architecture
Gallery.” https://sebastianraschka.com/llm-architecture-gallery/.
Raschka, Sebastian. 2026. “A Visual Guide to Attention Variants in
Modern LLMs.” https://magazine.sebastianraschka.com/p/visual-attention-variants.
Red Hat. 2025. “Why vLLM Is the Best
Choice for AI Inference Today.” https://developers.redhat.com/articles/2025/10/30/why-vllm-best-choice-ai-inference-today.
Rush, Alexander. 2018. “The Annotated Transformer.” https://nlp.seas.harvard.edu/annotated-transformer/.
Sanderson, Grant. 2024. “Attention in Transformers, Visually
Explained.” https://www.3blue1brown.com/lessons/attention.
Sanseviero, Omar, Lewis Tunstall, Philipp Schmid, Sourab Mangrulkar,
Younes Belkada, and Pedro Cuenca. 2023. “Mixture of Experts
Explained.” https://huggingface.co/blog/moe.
SGLang Team. 2024. “SGLang Documentation.” https://docs.sglang.ai/.
Shah, Jay, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani,
and Tri Dao. 2024. “FlashAttention-3: Fast and
Accurate Attention with Asynchrony and Low-Precision.” arXiv
Preprint arXiv:2407.08608. https://arxiv.org/abs/2407.08608.
Shaw, Robert, and Michael Goin. 2023. “SparseGPT:
Remove 100 Billion Parameters for Free.” https://developers.redhat.com/articles/2023/03/21/sparsegpt-remove-100-billion-parameters-free.
Shazeer, Noam. 2019. “Fast Transformer Decoding: One Write-Head Is
All You Need.” arXiv Preprint arXiv:1911.02150. https://arxiv.org/abs/1911.02150.
Shazeer, Noam. 2020. “GLU Variants Improve
Transformer.” arXiv Preprint arXiv:2002.05202. https://arxiv.org/abs/2002.05202.
Sheng, Ying, Shiyi Cao, Dacheng Li, et al. 2023.
“S-LoRA: Serving Thousands of Concurrent
LoRA Adapters.” arXiv Preprint
arXiv:2311.03285. https://arxiv.org/abs/2311.03285.
Spheron. 2025. “vLLM Vs
TensorRT-LLM Vs SGLang: Benchmarks on
H100.” https://www.spheron.network/blog/vllm-vs-tensorrt-llm-vs-sglang-benchmarks/.
SqueezeBits. 2025. “vLLM Vs
TensorRT-LLM #4: Which Scheduler Wins?” https://blog.squeezebits.com/vllm-vs-tensorrtllm-4-which-scheduler-wins--33083.
Stern, Mitchell, Noam Shazeer, and Jakob Uszkoreit. 2018.
“Blockwise Parallel Decoding for Deep Autoregressive
Models.” arXiv Preprint arXiv:1811.03115. https://arxiv.org/abs/1811.03115.
Su, Jianlin, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng
Liu. 2021. “RoFormer: Enhanced Transformer with
Rotary Position Embedding.” arXiv Preprint
arXiv:2104.09864. https://arxiv.org/abs/2104.09864.
Sun, Biao, Ziming Huang, Hanyu Zhao, et al. 2024. “Llumnix:
Dynamic Scheduling for Large Language Model Serving.”
Proceedings of the 18th USENIX Symposium on Operating Systems Design
and Implementation. https://arxiv.org/abs/2406.03243.
Sun, Mingjie, Zhuang Liu, Anna Bair, and J. Zico Kolter. 2024. “A
Simple and Effective Pruning Approach for Large Language Models.”
International Conference on Learning Representations. https://arxiv.org/abs/2306.11695.
Tao, Chaofan, Qian Liu, Longxu Dou, et al. 2024. “Scaling Laws
with Vocabulary: Larger Models Deserve Larger Vocabularies.”
Advances in Neural Information Processing Systems. https://arxiv.org/abs/2407.13623.
Thomas, Derek, Diego Maniloff, and David Holtz. 2024.
“TGI Multi-LoRA: Deploy Once, Serve 30
Models.” https://huggingface.co/blog/multi-lora-serving.
Tillet, Philippe, H. T. Kung, and David Cox. 2019. “Triton: An
Intermediate Language and Compiler for Tiled Neural Network
Computations.” Proceedings of the 3rd ACM SIGPLAN
International Workshop on Machine Learning and Programming Languages
(MAPL). https://doi.org/10.1145/3315508.3329973.
Touvron, Hugo, Thibaut Lavril, Gautier Izacard, et
al. 2023. “LLaMA: Open and Efficient
Foundation Language Models.” arXiv Preprint
arXiv:2302.13971. https://arxiv.org/abs/2302.13971.
Touvron, Hugo, Louis Martin, Kevin Stone, et
al. 2023. “Llama 2: Open Foundation and
Fine-Tuned Chat Models.” arXiv Preprint
arXiv:2307.09288. https://arxiv.org/abs/2307.09288.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. 2017.
“Attention Is All You Need.” arXiv Preprint
arXiv:1706.03762. https://arxiv.org/abs/1706.03762.
Verma, Shashank, and Neal Vaidya. 2023. “Mastering
LLM Techniques: Inference Optimization.” https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/.
vLLM Team. 2024a. “Attention Backend
Feature Support.” https://docs.vllm.ai/en/latest/design/attention_backends/.
vLLM Team. 2024d. “Quantized
KV Cache.” https://docs.vllm.ai/en/latest/features/quantization/quantized_kvcache/.
vLLM Team. 2025a. “Disaggregated
Prefilling (Experimental).” https://docs.vllm.ai/en/latest/features/disagg_prefill/.
vLLM Team. 2025b. “Inside vLLM: Anatomy of a High-Throughput
LLM Inference System.” https://vllm.ai/blog/anatomy-of-vllm.
vLLM Team. 2025c. “vLLM Large Scale Serving: DeepSeek at
2.2k Tok/s/H200 with Wide-EP.” https://vllm.ai/blog/large-scale-serving.
vLLM Team. 2025d. “vLLM Router: A High-Performance and Prefill/Decode
Aware Load Balancer for Large-Scale Serving.” https://vllm.ai/blog/vllm-router-release.
vLLM Team. 2025e. “vLLM V1: A Major Upgrade to vLLM’s Core Architecture.” https://vllm.ai/blog/v1-alpha-release.
vLLM Team. 2026. “vLLM Triton Attention Backend Deep Dive.”
https://vllm.ai/blog/vllm-triton-backend-deep-dive.
Wan, Zhongwei, Xin Wang, Che Liu, et al. 2024. “Efficient Large
Language Models: A Survey.” Transactions on Machine Learning
Research. https://arxiv.org/abs/2312.03863.
Weng, Lilian. 2021. “How to Train Really Large Models on Many
GPUs?” https://lilianweng.github.io/posts/2021-09-25-train-large/.
Weng, Lilian. 2023a. “Large Transformer Model Inference
Optimization.” https://lilianweng.github.io/posts/2023-01-10-inference-optimization/.
Weng, Lilian. 2023b. “The Transformer Family Version 2.0.”
https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/.
Williams, Samuel, Andrew Waterman, and David Patterson. 2009.
“Roofline: An Insightful Visual Performance Model for Multicore
Architectures.” Communications of the ACM 52 (4): 65–76.
https://dl.acm.org/doi/pdf/10.1145/1498765.1498785.
Wu, Bingyang, Yinmin Zhong, Zili Zhang, et al. 2023. “Fast
Distributed Inference Serving for Large Language Models.”
arXiv Preprint arXiv:2305.05920. https://arxiv.org/abs/2305.05920.
Xia, Heming, et al. 2024. “Unlocking
Efficiency in Large Language Model Inference: A Comprehensive Survey of
Speculative Decoding.” Findings of the Association for
Computational Linguistics: ACL. https://arxiv.org/abs/2401.07851.
Xiao, Guangxuan, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and
Song Han. 2023. “SmoothQuant: Accurate and Efficient
Post-Training Quantization for Large Language Models.” arXiv
Preprint arXiv:2211.10438. https://arxiv.org/abs/2211.10438.
Xu, Xiaohan, Ming Li, Chongyang Tao, et al. 2024. “A Survey on
Knowledge Distillation of Large Language Models.” arXiv
Preprint arXiv:2402.13116. https://arxiv.org/abs/2402.13116.
Yang, An, Anfeng Li, Baosong Yang, et al.
2025. “Qwen3 Technical Report.” arXiv Preprint
arXiv:2505.09388. https://arxiv.org/abs/2505.09388.
Yang, Nan, Tao Ge, Liang Wang, et al. 2023. “Inference with
Reference: Lossless Acceleration of Large Language Models.”
arXiv Preprint arXiv:2304.04487. https://arxiv.org/abs/2304.04487.
Ye, Zihao, Lequn Chen, Ruihang Lai, et al. 2025.
“FlashInfer: Efficient and Customizable Attention
Engine for LLM Inference Serving.” arXiv
Preprint arXiv:2501.01005. https://arxiv.org/abs/2501.01005.
Yu, Gyeong-In, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and
Byung-Gon Chun. 2022. “Orca: A Distributed Serving System for
Transformer-Based Generative Models.” Proceedings of the 16th USENIX
Symposium on Operating Systems Design and Implementation. https://www.usenix.org/conference/osdi22/presentation/yu.
Yuan, Jingyang, et al. 2025. “Native
Sparse Attention: Hardware-Aligned and Natively Trainable Sparse
Attention.” arXiv Preprint arXiv:2502.11089. https://arxiv.org/abs/2502.11089.
Yuan, Zhihang, Yuzhang Shang, Yang Zhou, et al. 2024.
“LLM Inference Unveiled: Survey and Roofline Model
Insights.” arXiv Preprint arXiv:2402.16363. https://arxiv.org/abs/2402.16363.
Zandieh, Amir, Majid Daliri, Majid Hadian, and Vahab Mirrokni. 2025.
“TurboQuant: Online Vector Quantization with
Near-Optimal Distortion Rate.” ICLR. https://arxiv.org/abs/2504.19874.
Zhang, Biao, and Rico Sennrich. 2019. “Root Mean Square Layer
Normalization.” Advances in Neural Information Processing
Systems. https://arxiv.org/abs/1910.07467.
Zhang, Zhenyu, Ying Sheng, Tianyi Zhou, et al. 2023.
“H2O: Heavy-Hitter Oracle for Efficient Generative
Inference of Large Language Models.” Advances in Neural
Information Processing Systems. https://arxiv.org/abs/2306.14048.
Zhao, Cen, Xiaodong Wang, and Jianyu Huang. 2025. “Scaling
LLM Inference: Innovations in Tensor Parallelism, Context
Parallelism, and Expert Parallelism.” https://engineering.fb.com/2025/10/17/ai-research/scaling-llm-inference-innovations-tensor-parallelism-context-parallelism-expert-parallelism/.
Zheng, Lianmin, Liangsheng Yin, Zhiqiang Xie, et
al. 2023. “SGLang: Efficient Execution of
Structured Language Model Programs.” arXiv Preprint
arXiv:2312.07104. https://arxiv.org/abs/2312.07104.
Zhong, Yinmin, Shengyu Liu, Junda Chen, et
al. 2024. “DistServe: Disaggregating Prefill
and Decoding for Goodput-Optimized Large Language Model Serving.”
arXiv Preprint arXiv:2401.09670. https://arxiv.org/abs/2401.09670.