References
Adnan, Muhammad, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya
Soloveychik, and Purushotham Kamath. 2024. “Keyformer:
KV Cache Reduction Through Key Tokens Selection for
Efficient Generative Inference.” MLSys. https://proceedings.mlsys.org/paper_files/paper/2024/hash/48fecef47b19fe501d27d338b6d52582-Abstract-Conference.html.
Agrawal, Amey, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S.
Gulavani, and Ramachandran Ramjee. 2023. “Sarathi: Efficient
LLM Inference by Piggybacking Decodes with Chunked
Prefills.” arXiv Preprint arXiv:2308.16369. https://arxiv.org/abs/2308.16369.
Ainslie, Joshua, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy,
Federico Lebrón, and Sumit Sanghai. 2023. “GQA:
Training Generalized Multi-Query Transformer Models from Multi-Head
Checkpoints.” arXiv Preprint arXiv:2305.13245. https://arxiv.org/abs/2305.13245.
Alammar, Jay. 2018. “The Illustrated Transformer.” https://jalammar.github.io/illustrated-transformer/.
Aminabadi, Reza Yazdani, Samyam Rajbhandari, Ammar Ahmad Awan, et al.
2022. “DeepSpeed-Inference: Enabling Efficient
Inference of Transformer Models at Unprecedented Scale.”
arXiv Preprint arXiv:2207.00032. https://arxiv.org/abs/2207.00032.
Anyscale. 2024a. “How Continuous Batching Enables 23x Throughput
in LLM Inference While Reducing P50 Latency.” https://www.anyscale.com/blog/continuous-batching-llm-inference.
Anyscale. 2024b. “LLMPerf: A Tool for Benchmarking
LLM Inference.” https://github.com/ray-project/llmperf.
Austin, Jacob, Sholto Douglas, Roy Frostig, et al. 2025. “How to
Scale Your Model.” https://jax-ml.github.io/scaling-book/.
Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016.
“Layer Normalization.” arXiv Preprint
arXiv:1607.06450. https://arxiv.org/abs/1607.06450.
Beltagy, Iz, Matthew E. Peters, and Arman Cohan. 2020.
“Longformer: The Long-Document Transformer.” arXiv
Preprint arXiv:2004.05150. https://arxiv.org/abs/2004.05150.
Bloem, Peter. 2019. “Transformers from Scratch.” https://peterbloem.nl/blog/transformers.
Brandon, William, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, and
Jonathan Ragan-Kelley. 2024. “Reducing Transformer Key-Value Cache
Size with Cross-Layer Attention.” arXiv Preprint
arXiv:2405.12981. https://arxiv.org/abs/2405.12981.
Cai, Tianle, Yuhong Li, Zhengyang Geng, et al. 2024. “Medusa:
Simple LLM Inference Acceleration Framework with Multiple
Decoding Heads.” arXiv Preprint arXiv:2401.10774. https://arxiv.org/abs/2401.10774.
Chen, Carol. 2022. “Transformer Inference Arithmetic.” https://kipp.ly/transformer-inference-arithmetic/.
Chen, Charlie, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste
Lespiau, Laurent Sifre, and John Jumper. 2023. “Accelerating Large
Language Model Decoding with Speculative Sampling.” arXiv
Preprint arXiv:2302.01318. https://arxiv.org/abs/2302.01318.
Chen, Lequn, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind
Krishnamurthy. 2023. “Punica: Multi-Tenant LoRA
Serving.” arXiv Preprint arXiv:2310.18547. https://arxiv.org/abs/2310.18547.
Chips and Cheese. 2023. “Nvidia’s H100: Funny
L2, and Tons of Bandwidth.” https://chipsandcheese.com/2023/07/02/nvidias-h100-funny-l2-and-tons-of-bandwidth/.
Chowdhery, Aakanksha, Sharan Narang, Jacob Devlin,
et al. 2022. “PaLM: Scaling Language Modeling
with Pathways.” arXiv Preprint arXiv:2204.02311. https://arxiv.org/abs/2204.02311.
Dao, Tri. 2023. “FlashAttention-2: Faster Attention
with Better Parallelism and Work Partitioning.” arXiv
Preprint arXiv:2307.08691. https://arxiv.org/abs/2307.08691.
Dao, Tri, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré.
2022. “FlashAttention: Fast and Memory-Efficient
Exact Attention with IO-Awareness.” arXiv
Preprint arXiv:2205.14135. https://arxiv.org/abs/2205.14135.
Dettmers, Tim, et al. 2021.
“Bitsandbytes: Accessible Large Language Models via k-Bit
Quantization for PyTorch.” https://github.com/bitsandbytes-foundation/bitsandbytes.
Dettmers, Tim, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022.
“LLM.int8(): 8-Bit Matrix
Multiplication for Transformers at Scale.” arXiv Preprint
arXiv:2208.07339. https://arxiv.org/abs/2208.07339.
Frantar, Elias, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022.
“GPTQ: Accurate Post-Training Quantization for
Generative Pre-Trained Transformers.” arXiv Preprint
arXiv:2210.17323. https://arxiv.org/abs/2210.17323.
Fu, Yichao, Peter Bailis, Ion Stoica, and Hao Zhang. 2024. “Break
the Sequential Dependency of LLM Inference Using Lookahead
Decoding.” arXiv Preprint arXiv:2402.02057. https://arxiv.org/abs/2402.02057.
Gloeckle, Fabian, Badr Youbi Idrissi, Baptiste Roziere, David Lopez-Paz,
and Gabriel Synnaeve. 2024. “Better & Faster Large Language
Models via Multi-Token Prediction.” arXiv Preprint
arXiv:2404.19737. https://arxiv.org/abs/2404.19737.
Grafana Labs. 2024. “Grafana Documentation.”
https://grafana.com/docs/grafana/latest/.
Grattafiori, Aaron, Abhimanyu Dubey, Abhinav
Jauhri, et al. 2024. “The Llama 3 Herd of
Models.” arXiv Preprint arXiv:2407.21783. https://arxiv.org/abs/2407.21783.
Gu, Albert, and Tri Dao. 2023. “Mamba: Linear-Time
Sequence Modeling with Selective State Spaces.” arXiv
Preprint arXiv:2312.00752. https://arxiv.org/abs/2312.00752.
Gu, Albert, and Tri Dao. 2024. “GPUs Go Brrr.”
https://hazyresearch.stanford.edu/blog/2024-05-12-tk.
Gu, Yuxian, Li Dong, Furu Wei, and Minlie Huang. 2024.
“MiniLLM: On-Policy Distillation of Large Language
Models.” arXiv Preprint arXiv:2306.08543. https://arxiv.org/abs/2306.08543.
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. 2015. “Distilling
the Knowledge in a Neural Network.” arXiv Preprint
arXiv:1503.02531. https://arxiv.org/abs/1503.02531.
Hu, Edward J., Yelong Shen, Phillip Wallis, et al. 2021.
“LoRA: Low-Rank Adaptation of Large Language
Models.” arXiv Preprint arXiv:2106.09685. https://arxiv.org/abs/2106.09685.
Huang, Yanping, Youlong Cheng, Ankur Bapna, et al. 2019.
“GPipe: Efficient Training of Giant Neural Networks
Using Pipeline Parallelism.” arXiv Preprint
arXiv:1811.06965. https://arxiv.org/abs/1811.06965.
Jacobs, Sam Ade, Masahiro Tanaka, Chengming Zhang, et al. 2023.
“DeepSpeed Ulysses: System Optimizations for Enabling
Training of Extreme Long Sequence Transformer Models.” arXiv
Preprint arXiv:2309.14509. https://arxiv.org/abs/2309.14509.
Jiang, Albert Q., Alexandre Sablayrolles, Antoine
Roux, Arthur Mensch, et al. 2024. “Mixtral of
Experts.” arXiv Preprint arXiv:2401.04088. https://arxiv.org/abs/2401.04088.
Kadous, Waleed, Kyle Huang, Wendi Ding, Liguang Xie, Avnish Narayan, and
Ricky Xu. 2023. “Reproducible Performance Metrics for
LLM Inference.” https://www.anyscale.com/blog/reproducible-performance-metrics-for-llm-inference.
Karpathy, Andrej. 2023. “Let’s Build GPT: From
Scratch, in Code, Spelled Out.” https://www.youtube.com/watch?v=kCc8FmEb1nY.
Kazemnejad, Amirhossein, Inkit Padhi, Karthikeyan Natesan Ramamurthy,
Payel Das, and Siva Reddy. 2023. “The Impact of Positional
Encoding on Length Generalization in Transformers.” arXiv
Preprint arXiv:2305.19466. https://arxiv.org/abs/2305.19466.
Kimi Team, Yu Zhang, Zongyu Lin, et al. 2025. “Kimi
Linear: An Expressive, Efficient Attention Architecture.”
arXiv Preprint arXiv:2510.26692. https://arxiv.org/abs/2510.26692.
Korthikanti, Vijay Anand, Jared Casper, Sangkug Lym, et al. 2022.
“Reducing Activation Recomputation in Large Transformer
Models.” arXiv Preprint arXiv:2205.05198. https://arxiv.org/abs/2205.05198.
Kumar, Tanishq, Tri Dao, and Avner May. 2026. “Speculative
Speculative Decoding.” arXiv Preprint arXiv:2603.03251.
https://arxiv.org/abs/2603.03251.
Kwon, Woosuk, Zhuohan Li, Siyuan Zhuang, et al. 2023. “Efficient
Memory Management for Large Language Model Serving with
PagedAttention.” arXiv Preprint
arXiv:2309.06180. https://arxiv.org/abs/2309.06180.
Leviathan, Yaniv, Matan Kalman, and Yossi Matias. 2022. “Fast
Inference from Transformers via Speculative Decoding.” arXiv
Preprint arXiv:2211.17192. https://arxiv.org/abs/2211.17192.
Li, Dacheng, Rulin Shao, Anze Xie, et al.
2023. “DistFlashAttn: Distributed Memory-Efficient
Attention for Long-Context LLMs Training.” arXiv
Preprint arXiv:2310.03294. https://arxiv.org/abs/2310.03294.
Li, Shenggui, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You.
2021. “Sequence Parallelism: Long Sequence Training from System
Perspective.” arXiv Preprint arXiv:2105.13120. https://arxiv.org/abs/2105.13120.
Li, Zhuohan, Lianmin Zheng, Yinmin Zhong, et al. 2023.
“AlpaServe: Statistical Multiplexing with Model
Parallelism for Deep Learning Serving.” arXiv Preprint
arXiv:2302.11665. https://arxiv.org/abs/2302.11665.
Lin, Ji, Jiaming Tang, Haotian Tang, et al. 2023.
“AWQ: Activation-Aware Weight Quantization for
LLM Compression and Acceleration.” arXiv
Preprint arXiv:2306.00978. https://arxiv.org/abs/2306.00978.
Liu, Aixin, et al. 2024.
“DeepSeek-V2: A Strong, Economical, and Efficient
Mixture-of-Experts Language Model.” arXiv Preprint
arXiv:2405.04434. https://arxiv.org/abs/2405.04434.
Miao, Xupeng, Gabriele Oliaro, Zhihao Zhang, et al. 2023. “Towards
Efficient Generative Large Language Model Serving: A Survey from
Algorithms to Systems.” arXiv Preprint arXiv:2312.15234.
https://arxiv.org/abs/2312.15234.
Microsoft DeepSpeed Team. 2024. “DeepSpeed-Inference
Documentation.” https://www.deepspeed.ai/inference/.
Narayanan, Deepak, Mohammad Shoeybi, Jared Casper, et al. 2021.
“Efficient Large-Scale Language Model Training on GPU
Clusters Using Megatron-LM.” arXiv Preprint
arXiv:2104.04473. https://arxiv.org/abs/2104.04473.
NVIDIA. 2022. “NVIDIA Hopper Architecture
in-Depth.” https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/.
NVIDIA. 2023a. “DGX Platform Solution
Overview.” https://resources.nvidia.com/en-us-dgx-systems.
NVIDIA. 2023b. “NVIDIA H100 Tensor Core GPU.”
https://www.nvidia.com/en-us/data-center/h100/.
NVIDIA. 2024a. “CUDA Programming Guide.” https://docs.nvidia.com/cuda/cuda-programming-guide/.
NVIDIA. 2024b. “TensorRT-LLM Documentation.”
https://nvidia.github.io/TensorRT-LLM/.
OpenTelemetry Authors. 2024. “OpenTelemetry
Documentation.” https://opentelemetry.io/docs/.
Patel, Pratyush, Esha Choukse, Chaojie Zhang, et al. 2024.
“Splitwise: Efficient Generative LLM Inference Using
Phase Splitting.” arXiv Preprint arXiv:2311.18677. https://arxiv.org/abs/2311.18677.
Pope, Reiner, Sholto Douglas, Aakanksha Chowdhery, et al. 2022.
“Efficiently Scaling Transformer Inference.” arXiv
Preprint arXiv:2211.05102. https://arxiv.org/abs/2211.05102.
Press, Ofir, Noah A. Smith, and Mike Lewis. 2021. “Train Short,
Test Long: Attention with Linear Biases Enables Input Length
Extrapolation.” arXiv Preprint arXiv:2108.12409. https://arxiv.org/abs/2108.12409.
Prometheus Authors. 2024. “Prometheus
Documentation.” https://prometheus.io/docs/.
Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and
Ilya Sutskever. 2019. “Language Models Are Unsupervised Multitask
Learners.” OpenAI Blog. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
Raschka, Sebastian. 2025. “LLM Architecture
Gallery.” https://sebastianraschka.com/llm-architecture-gallery/.
Rush, Alexander. 2018. “The Annotated Transformer.” https://nlp.seas.harvard.edu/annotated-transformer/.
SGLang Team. 2024. “SGLang Documentation.” https://docs.sglang.ai/.
Shah, Jay, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani,
and Tri Dao. 2024. “FlashAttention-3: Fast and
Accurate Attention with Asynchrony and Low-Precision.” arXiv
Preprint arXiv:2407.08608. https://arxiv.org/abs/2407.08608.
Shazeer, Noam. 2019. “Fast Transformer Decoding: One Write-Head Is
All You Need.” arXiv Preprint arXiv:1911.02150. https://arxiv.org/abs/1911.02150.
Shazeer, Noam. 2020. “GLU Variants Improve
Transformer.” arXiv Preprint arXiv:2002.05202. https://arxiv.org/abs/2002.05202.
Sheng, Ying, Shiyi Cao, Dacheng Li, et al. 2023.
“S-LoRA: Serving Thousands of Concurrent
LoRA Adapters.” arXiv Preprint
arXiv:2311.03285. https://arxiv.org/abs/2311.03285.
Shoeybi, Mohammad, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared
Casper, and Bryan Catanzaro. 2019. “Megatron-LM:
Training Multi-Billion Parameter Language Models Using Model
Parallelism.” arXiv Preprint arXiv:1909.08053. https://arxiv.org/abs/1909.08053.
Spheron. 2025. “vLLM vs. TensorRT-LLM vs.
SGLang: Benchmarks on
H100.” https://www.spheron.network/blog/vllm-vs-tensorrt-llm-vs-sglang-benchmarks/.
Stern, Mitchell, Noam Shazeer, and Jakob Uszkoreit. 2018.
“Blockwise Parallel Decoding for Deep Autoregressive
Models.” arXiv Preprint arXiv:1811.03115. https://arxiv.org/abs/1811.03115.
Su, Jianlin, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng
Liu. 2021. “RoFormer: Enhanced Transformer with
Rotary Position Embedding.” arXiv Preprint
arXiv:2104.09864. https://arxiv.org/abs/2104.09864.
Touvron, Hugo, Thibaut Lavril, Gautier Izacard, et
al. 2023a. “LLaMA: Open and Efficient
Foundation Language Models.” arXiv Preprint
arXiv:2302.13971. https://arxiv.org/abs/2302.13971.
Touvron, Hugo, Louis Martin, Kevin Stone, et
al. 2023b. “Llama 2: Open Foundation and
Fine-Tuned Chat Models.” arXiv Preprint
arXiv:2307.09288. https://arxiv.org/abs/2307.09288.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. 2017.
“Attention Is All You Need.” arXiv Preprint
arXiv:1706.03762. https://arxiv.org/abs/1706.03762.
Weng, Lilian. 2023. “The Transformer Family Version 2.0.”
https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/.
Williams, Samuel, Andrew Waterman, and David Patterson. 2009.
“Roofline: An Insightful Visual Performance Model for Multicore
Architectures.” Communications of the ACM 52 (4): 65–76.
https://dl.acm.org/doi/pdf/10.1145/1498765.1498785.
Wu, Bingyang, Yinmin Zhong, Zili Zhang, et al. 2023. “Fast
Distributed Inference Serving for Large Language Models.”
arXiv Preprint arXiv:2305.05920. https://arxiv.org/abs/2305.05920.
Xiao, Guangxuan, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and
Song Han. 2023. “SmoothQuant: Accurate and Efficient
Post-Training Quantization for Large Language Models.” arXiv
Preprint arXiv:2211.10438. https://arxiv.org/abs/2211.10438.
Yang, An, Anfeng Li, Baosong Yang, et al.
2025. “Qwen3 Technical Report.” arXiv Preprint
arXiv:2505.09388. https://arxiv.org/abs/2505.09388.
Yang, Nan, Tao Ge, Liang Wang, et al. 2023. “Inference with
Reference: Lossless Acceleration of Large Language Models.”
arXiv Preprint arXiv:2304.04487. https://arxiv.org/abs/2304.04487.
Yu, Gyeong-In, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and
Byung-Gon Chun. 2022. “Orca: A Distributed Serving System for
Transformer-Based Generative Models.” OSDI. https://www.usenix.org/conference/osdi22/presentation/yu.
Yuan, Jingyang, et al. 2025. “Native
Sparse Attention: Hardware-Aligned and Natively Trainable Sparse
Attention.” arXiv Preprint arXiv:2502.11089. https://arxiv.org/abs/2502.11089.
Zandieh, Amir, Majid Daliri, Majid Hadian, and Vahab Mirrokni. 2025.
“TurboQuant: Online Vector Quantization with
Near-Optimal Distortion Rate.” ICLR. https://arxiv.org/abs/2504.19874.
Zhang, Biao, and Rico Sennrich. 2019. “Root Mean Square Layer
Normalization.” Advances in Neural Information Processing
Systems. https://arxiv.org/abs/1910.07467.
Zheng, Lianmin, Liangsheng Yin, Zhiqiang Xie, et
al. 2023. “SGLang: Efficient Execution of
Structured Language Model Programs.” arXiv Preprint
arXiv:2312.07104. https://arxiv.org/abs/2312.07104.
Zhong, Yinmin, Shengyu Liu, Junda Chen, et
al. 2024. “DistServe: Disaggregating Prefill
and Decoding for Goodput-Optimized Large Language Model Serving.”
arXiv Preprint arXiv:2401.09670. https://arxiv.org/abs/2401.09670.
Zhu, Xunyu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. 2024. “A
Survey on Model Compression for Large Language Models.” arXiv
Preprint arXiv:2308.07633. https://arxiv.org/abs/2308.07633.