References

Adnan, Muhammad, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, and Purushotham Kamath. 2024. “Keyformer: KV Cache Reduction Through Key Tokens Selection for Efficient Generative Inference.” MLSys. https://proceedings.mlsys.org/paper_files/paper/2024/hash/48fecef47b19fe501d27d338b6d52582-Abstract-Conference.html.
Agrawal, Amey, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramjee. 2023. “Sarathi: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills.” arXiv Preprint arXiv:2308.16369. https://arxiv.org/abs/2308.16369.
Ainslie, Joshua, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.” arXiv Preprint arXiv:2305.13245. https://arxiv.org/abs/2305.13245.
Alammar, Jay. 2018. “The Illustrated Transformer.” https://jalammar.github.io/illustrated-transformer/.
Aminabadi, Reza Yazdani, Samyam Rajbhandari, Ammar Ahmad Awan, et al. 2022. “DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale.” arXiv Preprint arXiv:2207.00032. https://arxiv.org/abs/2207.00032.
Anyscale. 2024a. “How Continuous Batching Enables 23x Throughput in LLM Inference While Reducing P50 Latency.” https://www.anyscale.com/blog/continuous-batching-llm-inference.
Anyscale. 2024b. “LLMPerf: A Tool for Benchmarking LLM Inference.” https://github.com/ray-project/llmperf.
Austin, Jacob, Sholto Douglas, Roy Frostig, et al. 2025. “How to Scale Your Model.” https://jax-ml.github.io/scaling-book/.
Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. “Layer Normalization.” arXiv Preprint arXiv:1607.06450. https://arxiv.org/abs/1607.06450.
Beltagy, Iz, Matthew E. Peters, and Arman Cohan. 2020. “Longformer: The Long-Document Transformer.” arXiv Preprint arXiv:2004.05150. https://arxiv.org/abs/2004.05150.
Bloem, Peter. 2019. “Transformers from Scratch.” https://peterbloem.nl/blog/transformers.
Brandon, William, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, and Jonathan Ragan Kelly. 2024. “Reducing Transformer Key-Value Cache Size with Cross-Layer Attention.” arXiv Preprint arXiv:2405.12981. https://arxiv.org/abs/2405.12981.
Cai, Tianle, Yuhong Li, Zhengyang Geng, et al. 2024. “Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads.” arXiv Preprint arXiv:2401.10774. https://arxiv.org/abs/2401.10774.
Chen, Carol. 2022. “Transformer Inference Arithmetic.” https://kipp.ly/transformer-inference-arithmetic/.
Chen, Charlie, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. 2023. “Accelerating Large Language Model Decoding with Speculative Sampling.” arXiv Preprint arXiv:2302.01318. https://arxiv.org/abs/2302.01318.
Chen, Lequn, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Krishnamurthy. 2023. “Punica: Multi-Tenant LoRA Serving.” arXiv Preprint arXiv:2310.18547. https://arxiv.org/abs/2310.18547.
Chips and Cheese. 2023. “Nvidia’s H100: Funny L2, and Tons of Bandwidth.” https://chipsandcheese.com/2023/07/02/nvidias-h100-funny-l2-and-tons-of-bandwidth/.
Chowdhery, Aakanksha, Sharan Narang, Jacob Devlin, et al. 2022. “PaLM: Scaling Language Modeling with Pathways.” arXiv Preprint arXiv:2204.02311. https://arxiv.org/abs/2204.02311.
Dao, Tri. 2023. “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.” arXiv Preprint arXiv:2307.08691. https://arxiv.org/abs/2307.08691.
Dao, Tri, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” arXiv Preprint arXiv:2205.14135. https://arxiv.org/abs/2205.14135.
Dettmers, Tim, et al. 2021. “Bitsandbytes: Accessible Large Language Models via k-Bit Quantization for PyTorch.” https://github.com/bitsandbytes-foundation/bitsandbytes.
Dettmers, Tim, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. “LLM.int8(): 8-Bit Matrix Multiplication for Transformers at Scale.” arXiv Preprint arXiv:2208.07339. https://arxiv.org/abs/2208.07339.
Frantar, Elias, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. “GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers.” arXiv Preprint arXiv:2210.17323. https://arxiv.org/abs/2210.17323.
Fu, Yichao, Peter Bailis, Ion Stoica, and Hao Zhang. 2024. “Break the Sequential Dependency of LLM Inference Using Lookahead Decoding.” arXiv Preprint arXiv:2402.02057. https://arxiv.org/abs/2402.02057.
Gloeckle, Fabian, Badr Youbi Idrissi, Baptiste Roziere, David Lopez-Paz, and Gabriel Synnaeve. 2024. “Better & Faster Large Language Models via Multi-Token Prediction.” arXiv Preprint arXiv:2404.19737. https://arxiv.org/abs/2404.19737.
Grafana Labs. 2024. “Grafana Documentation.” https://grafana.com/docs/grafana/latest/.
Grattafiori, Aaron, Abhimanyu Dubey, Abhinav Jauhri, et al. 2024. “The Llama 3 Herd of Models.” arXiv Preprint arXiv:2407.21783. https://arxiv.org/abs/2407.21783.
Gu, Albert, and Tri Dao. 2023. “Mamba: Linear-Time Sequence Modeling with Selective State Spaces.” arXiv Preprint arXiv:2312.00752. https://arxiv.org/abs/2312.00752.
Gu, Albert, and Tri Dao. 2024. “GPUs Go Brrr.” https://hazyresearch.stanford.edu/blog/2024-05-12-tk.
Gu, Yuxian, Li Dong, Furu Wei, and Minlie Huang. 2024. “MiniLLM: Knowledge Distillation of Large Language Models.” arXiv Preprint arXiv:2306.08543. https://arxiv.org/abs/2306.08543.
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. 2015. “Distilling the Knowledge in a Neural Network.” arXiv Preprint arXiv:1503.02531. https://arxiv.org/abs/1503.02531.
Hu, Edward J., Yelong Shen, Phillip Wallis, et al. 2021. “LoRA: Low-Rank Adaptation of Large Language Models.” arXiv Preprint arXiv:2106.09685. https://arxiv.org/abs/2106.09685.
Huang, Yanping, Youlong Cheng, Ankur Bapna, et al. 2019. “GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism.” arXiv Preprint arXiv:1811.06965. https://arxiv.org/abs/1811.06965.
Jacobs, Sam Ade, Masahiro Tanaka, Chengming Zhang, et al. 2023. “DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models.” arXiv Preprint arXiv:2309.14509. https://arxiv.org/abs/2309.14509.
Jiang, Albert Q., Alexandre Sablayrolles, Antoine Roux, et al. 2024. “Mixtral of Experts.” arXiv Preprint arXiv:2401.04088. https://arxiv.org/abs/2401.04088.
Kadous, Waleed, Kyle Huang, Wendi Ding, Liguang Xie, Avnish Narayan, and Ricky Xu. 2023. “Reproducible Performance Metrics for LLM Inference.” https://www.anyscale.com/blog/reproducible-performance-metrics-for-llm-inference.
Karpathy, Andrej. 2023. “Let’s Build GPT: From Scratch, in Code, Spelled Out.” https://www.youtube.com/watch?v=kCc8FmEb1nY.
Kazemnejad, Amirhossein, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. 2023. “The Impact of Positional Encoding on Length Generalization in Transformers.” arXiv Preprint arXiv:2305.19466. https://arxiv.org/abs/2305.19466.
Kimi Team, Yu Zhang, Zongyu Lin, et al. 2025. “Kimi Linear: An Expressive, Efficient Attention Architecture.” arXiv Preprint arXiv:2510.26692. https://arxiv.org/abs/2510.26692.
Korthikanti, Vijay Anand, Jared Casper, Sangkug Lym, et al. 2022. “Reducing Activation Recomputation in Large Transformer Models.” arXiv Preprint arXiv:2205.05198. https://arxiv.org/abs/2205.05198.
Kumar, Tanishq, Tri Dao, and Avner May. 2026. “Speculative Speculative Decoding.” arXiv Preprint arXiv:2603.03251. https://arxiv.org/abs/2603.03251.
Kwon, Woosuk, Zhuohan Li, Siyuan Zhuang, et al. 2023. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” arXiv Preprint arXiv:2309.06180. https://arxiv.org/abs/2309.06180.
Leviathan, Yaniv, Matan Kalman, and Yossi Matias. 2022. “Fast Inference from Transformers via Speculative Decoding.” arXiv Preprint arXiv:2211.17192. https://arxiv.org/abs/2211.17192.
Li, Dacheng, Rulin Shao, Anze Xie, et al. 2023. “DistFlashAttn: Distributed Memory-Efficient Attention for Long-Context LLMs Training.” arXiv Preprint arXiv:2310.03294. https://arxiv.org/abs/2310.03294.
Li, Shenggui, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You. 2021. “Sequence Parallelism: Long Sequence Training from System Perspective.” arXiv Preprint arXiv:2105.13120. https://arxiv.org/abs/2105.13120.
Li, Zhuohan, Lianmin Zheng, Yinmin Zhong, et al. 2023. “AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving.” arXiv Preprint arXiv:2302.11665. https://arxiv.org/abs/2302.11665.
Lin, Ji, Jiaming Tang, Haotian Tang, et al. 2023. “AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration.” arXiv Preprint arXiv:2306.00978. https://arxiv.org/abs/2306.00978.
Liu, Aixin, et al. 2024. “DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model.” arXiv Preprint arXiv:2405.04434. https://arxiv.org/abs/2405.04434.
Miao, Xupeng, Gabriele Oliaro, Zhihao Zhang, et al. 2023. “Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems.” arXiv Preprint arXiv:2312.15234. https://arxiv.org/abs/2312.15234.
Microsoft DeepSpeed Team. 2024. “DeepSpeed-Inference Documentation.” https://www.deepspeed.ai/inference/.
Narayanan, Deepak, Mohammad Shoeybi, Jared Casper, et al. 2021. “Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM.” arXiv Preprint arXiv:2104.04473. https://arxiv.org/abs/2104.04473.
NVIDIA. 2022. “NVIDIA Hopper Architecture In-Depth.” https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/.
NVIDIA. 2023a. “DGX Platform Solution Overview.” https://resources.nvidia.com/en-us-dgx-systems.
NVIDIA. 2023b. “NVIDIA H100 Tensor Core GPU.” https://www.nvidia.com/en-us/data-center/h100/.
NVIDIA. 2024a. “CUDA Programming Guide.” https://docs.nvidia.com/cuda/cuda-programming-guide/.
NVIDIA. 2024b. “TensorRT-LLM Documentation.” https://nvidia.github.io/TensorRT-LLM/.
OpenTelemetry Authors. 2024. “OpenTelemetry Documentation.” https://opentelemetry.io/docs/.
Patel, Pratyush, Esha Choukse, Chaojie Zhang, et al. 2024. “Splitwise: Efficient Generative LLM Inference Using Phase Splitting.” arXiv Preprint arXiv:2311.18677. https://arxiv.org/abs/2311.18677.
Pope, Reiner, Sholto Douglas, Aakanksha Chowdhery, et al. 2022. “Efficiently Scaling Transformer Inference.” arXiv Preprint arXiv:2211.05102. https://arxiv.org/abs/2211.05102.
Press, Ofir, Noah A. Smith, and Mike Lewis. 2021. “Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation.” arXiv Preprint arXiv:2108.12409. https://arxiv.org/abs/2108.12409.
Prometheus Authors. 2024. “Prometheus Documentation.” https://prometheus.io/docs/.
Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. “Language Models Are Unsupervised Multitask Learners.” OpenAI Blog. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
Raschka, Sebastian. 2025. “LLM Architecture Gallery.” https://sebastianraschka.com/llm-architecture-gallery/.
Rush, Alexander. 2018. “The Annotated Transformer.” https://nlp.seas.harvard.edu/annotated-transformer/.
SGLang Team. 2024. “SGLang Documentation.” https://docs.sglang.ai/.
Shah, Jay, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. 2024. “FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision.” arXiv Preprint arXiv:2407.08608. https://arxiv.org/abs/2407.08608.
Shazeer, Noam. 2019. “Fast Transformer Decoding: One Write-Head Is All You Need.” arXiv Preprint arXiv:1911.02150. https://arxiv.org/abs/1911.02150.
Shazeer, Noam. 2020. “GLU Variants Improve Transformer.” arXiv Preprint arXiv:2002.05202. https://arxiv.org/abs/2002.05202.
Sheng, Ying, Shiyi Cao, Dacheng Li, et al. 2023. “S-LoRA: Serving Thousands of Concurrent LoRA Adapters.” arXiv Preprint arXiv:2311.03285. https://arxiv.org/abs/2311.03285.
Shoeybi, Mohammad, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.” arXiv Preprint arXiv:1909.08053. https://arxiv.org/abs/1909.08053.
Spheron. 2025. “vLLM vs. TensorRT-LLM vs. SGLang: Benchmarks on H100.” https://www.spheron.network/blog/vllm-vs-tensorrt-llm-vs-sglang-benchmarks/.
Stern, Mitchell, Noam Shazeer, and Jakob Uszkoreit. 2018. “Blockwise Parallel Decoding for Deep Autoregressive Models.” arXiv Preprint arXiv:1811.03115. https://arxiv.org/abs/1811.03115.
Su, Jianlin, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2021. “RoFormer: Enhanced Transformer with Rotary Position Embedding.” arXiv Preprint arXiv:2104.09864. https://arxiv.org/abs/2104.09864.
Touvron, Hugo, Thibaut Lavril, Gautier Izacard, et al. 2023. “LLaMA: Open and Efficient Foundation Language Models.” arXiv Preprint arXiv:2302.13971. https://arxiv.org/abs/2302.13971.
Touvron, Hugo, Louis Martin, Kevin Stone, et al. 2023. “Llama 2: Open Foundation and Fine-Tuned Chat Models.” arXiv Preprint arXiv:2307.09288. https://arxiv.org/abs/2307.09288.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. 2017. “Attention Is All You Need.” arXiv Preprint arXiv:1706.03762. https://arxiv.org/abs/1706.03762.
vLLM Team. 2024. “vLLM Documentation.” https://docs.vllm.ai/.
Weng, Lilian. 2023. “The Transformer Family Version 2.0.” https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/.
Williams, Samuel, Andrew Waterman, and David Patterson. 2009. “Roofline: An Insightful Visual Performance Model for Multicore Architectures.” Communications of the ACM 52 (4): 65–76. https://dl.acm.org/doi/pdf/10.1145/1498765.1498785.
Wu, Bingyang, Yinmin Zhong, Zili Zhang, et al. 2023. “Fast Distributed Inference Serving for Large Language Models.” arXiv Preprint arXiv:2305.05920. https://arxiv.org/abs/2305.05920.
Xiao, Guangxuan, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. “SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models.” arXiv Preprint arXiv:2211.10438. https://arxiv.org/abs/2211.10438.
Yang, An, Anfeng Li, Baosong Yang, et al. 2025. “Qwen3 Technical Report.” arXiv Preprint arXiv:2505.09388. https://arxiv.org/abs/2505.09388.
Yang, Nan, Tao Ge, Liang Wang, et al. 2023. “Inference with Reference: Lossless Acceleration of Large Language Models.” arXiv Preprint arXiv:2304.04487. https://arxiv.org/abs/2304.04487.
Yu, Gyeong-In, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. “Orca: A Distributed Serving System for Transformer-Based Generative Models.” OSDI. https://www.usenix.org/conference/osdi22/presentation/yu.
Yuan, Jingyang, et al. 2025. “Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention.” arXiv Preprint arXiv:2502.11089. https://arxiv.org/abs/2502.11089.
Zandieh, Amir, Majid Daliri, Majid Hadian, and Vahab Mirrokni. 2025. “TurboQuant: Online Vector Quantization with Near-Optimal Distortion Rate.” ICLR. https://arxiv.org/abs/2504.19874.
Zhang, Biao, and Rico Sennrich. 2019. “Root Mean Square Layer Normalization.” Advances in Neural Information Processing Systems. https://arxiv.org/abs/1910.07467.
Zheng, Lianmin, Liangsheng Yin, Zhiqiang Xie, et al. 2023. “SGLang: Efficient Execution of Structured Language Model Programs.” arXiv Preprint arXiv:2312.07104. https://arxiv.org/abs/2312.07104.
Zhong, Yinmin, Shengyu Liu, Junda Chen, et al. 2024. “DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving.” arXiv Preprint arXiv:2401.09670. https://arxiv.org/abs/2401.09670.
Zhu, Xunyu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. 2024. “A Survey on Model Compression for Large Language Models.” arXiv Preprint arXiv:2308.07633. https://arxiv.org/abs/2308.07633.