Preface
A short book on autoregressive LLM inference and the techniques that make it fast, efficient, and scalable.
Who this book is for
This book is an intermediate guide for ML engineers (and people with a similar background) who are already familiar with neural networks, the transformer architecture, self-attention, and autoregressive decoder-only LLMs. It evolved from research into efficient ways to do LLM inference for special cases such as RAG, Writing in the Margins, and beam search. After spending time learning about attention kernels, prefix caching, and systems such as vLLM and SGLang, I decided to share the knowledge I had gathered in one convenient place.
This book contains many detailed diagrams and is best viewed on a high resolution screen.
What this book covers
This book discusses how LLM inference works, then reviews techniques for optimizing and scaling it. It should not be your first resource for learning about the transformer, nor will it teach you how to code LLMs. It is also not a comprehensive MLOps or systems implementation book. I hope it provides good intuition for, and the necessary details about, concepts like chunked prefill, ragged batching, tensor parallelism, and speculative decoding.
How this book was written
This book started as a detailed list of topics, techniques, and research papers. The idea to use data flow diagrams and the design of most diagrams were entirely my own.
Full disclosure: I used LLMs to suggest ways to organize the material and to write initial text based on my notes. I used Claude Code to speed up the implementation of the diagrams and to debug places where text, formulas, or figures did not render correctly. If you are not interested in reading work that used AI assistance, that is your prerogative. For what it's worth, AI assistance shortened the process of producing the first release of this book from months to weeks. I probably would not have taken the time to write this book if it had required several months.
Acknowledgements
Thank you to everyone who provided feedback, starting with RC and RS.
I would be grateful if you report any errata or suggest ways this book could be improved. The easiest way to do so is to open a GitHub issue from within the electronic book.