Conventions
Figures throughout this book use a consistent visual language for tensor shapes, dimensions, and components. This page collects those conventions in one place so you can decode any diagram at a glance.
Dimension names and abbreviations
Tensors in figures are labeled with their shapes using the single-letter abbreviations below. The full names and abbreviations match those used in the text.
| Symbol | Full name | Example values (Llama 3 70B) |
|---|---|---|
| B | Batch size | 1–64 |
| D | Model dimension (\(d_\text{model}\)) | 8192 |
| \(\textcolor{#2563EB}{d_K}\) | Head dimension (\(D / H\)) | 128 |
| F | FFN intermediate dimension (\(d_\text{ff}\)) | 28672 (SwiGLU) |
| H | Number of attention heads (\(n_\text{heads}\)) | 64 |
| \(\textcolor{#C084FC}{H_\text{KV}}\) | Number of KV heads | 8 (with GQA) |
| L | Number of layers | 80 |
| S | Sequence length | Prefill: prompt length (e.g., 2048); Decode: 1 |
| V | Vocabulary size | 128256 |
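The relationships between these symbols can be checked with a few lines of Python. The sketch below is illustrative only: the class name `ModelDims` and the helper `kv_group_size` are hypothetical names, not from the text. The numbers are the Llama 3 70B example values from the table, and the head dimension is derived as \(D / H\), as the table defines it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelDims:
    """Example dimension values for Llama 3 70B, taken from the table above."""
    D: int = 8192     # model dimension
    F: int = 28672    # FFN intermediate dimension
    H: int = 64       # number of attention heads
    H_KV: int = 8     # number of KV heads (with GQA)
    L: int = 80       # number of layers

    @property
    def d_K(self) -> int:
        # Head dimension is defined in the table as D / H.
        assert self.D % self.H == 0, "D must divide evenly across heads"
        return self.D // self.H

    @property
    def kv_group_size(self) -> int:
        # Hypothetical helper: number of query heads sharing one KV head.
        return self.H // self.H_KV


dims = ModelDims()
print(dims.d_K)            # 128
print(dims.kv_group_size)  # 8
```

Deriving \(d_K\) rather than storing it keeps the three values consistent: changing D or H without updating the head dimension is impossible by construction.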
Tensor edge colors
Tensor rectangles use colored edges to indicate which dimension runs along that axis. Top/bottom edges represent one dimension; left/right edges represent another. This makes it possible to see at a glance how dimensions flow through a computation.
| Edge | Dimension |
|---|---|
| | Model dimension (D) |
| | Head / Q-K dimension (\(d_K\)) |
| | FFN intermediate dimension (F) |
| | Number of attention heads (H) |
| | Number of KV heads (\(H_\text{KV}\)) |
| | Number of layers (L) |
| | Sequence length (S) |
| | Vocabulary (V) |
| | Default / other |
Tensor fill colors
The fill color of a tensor rectangle indicates its role in the computation.
| Fill | Meaning |
|---|---|
| | Input / activations (neutral) |
| | Weight matrices |
| | Queries (Q) |
| | Keys (K) |
| | Values (V) |
| | Attention scores / weights |
| | KV cache entries |
Component colors
Architectural block diagrams use a separate set of fill colors to distinguish transformer components.
| Color | Component |
|---|---|
| | Embedding layers |
| | Layer normalization / Add & Norm |
| | Self-attention blocks |
| | Feed-forward blocks |
| | Linear projections |
| | Softmax |
Reading a data flow diagram
To illustrate how these conventions combine, here is how a typical tensor rectangle should be read:
- The top and bottom edge colors identify the dimension that runs horizontally, that is, the number of columns.
- The left and right edge colors identify the dimension that runs vertically, that is, the number of rows.
- The shape label (e.g., (S, D)) confirms the dimensions explicitly.
- The fill color helps disambiguate what kind of tensor it is (query, key, weight, etc.).
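The reading rules above can be restated as a small data model. This is a sketch for illustration only: `TensorRect` and its fields are hypothetical names, and it assumes shape labels list rows first and columns second, as in (S, D).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorRect:
    """One tensor rectangle in a figure (hypothetical model, for illustration)."""
    rows: str   # dimension shown by the left/right edge colors
    cols: str   # dimension shown by the top/bottom edge colors
    role: str   # what the fill color encodes, e.g. "query" or "weight"

    def label(self) -> str:
        # Assumed convention: rows first, then columns, e.g. (S, D).
        return f"({self.rows}, {self.cols})"

    def edges(self) -> dict:
        # Map each side of the rectangle to the dimension its color stands for.
        return {
            "top": self.cols, "bottom": self.cols,
            "left": self.rows, "right": self.rows,
        }


x = TensorRect(rows="S", cols="D", role="activations")
print(x.label())          # (S, D)
print(x.edges()["left"])  # S
```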
When two tensors share an edge color on a matching side, those dimensions are compatible for the operation that connects them: a matrix multiply, a concatenation, or an elementwise operation. For matrix multiplication, the number of columns of the left-hand tensor must match the number of rows of the right-hand tensor, so the top/bottom edge color of the left-hand tensor matches the left/right edge color of the right-hand tensor. For elementwise operations, all dimensions must usually match; the exception is a dimension of size 1, which is broadcast to match the other tensor.
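These compatibility rules can be written as two small shape checks. The functions below are a minimal sketch with hypothetical names (`matmul_shape`, `elementwise_shape`). They cover only 2-D matrix multiplies and equal-rank elementwise operations, which is narrower than what real tensor libraries accept.

```python
def matmul_shape(left, right):
    """Result shape of a 2-D matrix multiply, or raise if incompatible."""
    (m, k_left), (k_right, n) = left, right
    if k_left != k_right:
        raise ValueError(
            f"columns of left ({k_left}) != rows of right ({k_right})"
        )
    return (m, n)


def elementwise_shape(a, b):
    """Result shape of an elementwise op, broadcasting size-1 dimensions."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles tensors of equal rank")
    out = []
    for da, db in zip(a, b):
        if da == db or db == 1:
            out.append(da)   # equal sizes, or b is broadcast along this axis
        elif da == 1:
            out.append(db)   # a is broadcast along this axis
        else:
            raise ValueError(f"dimensions {da} and {db} do not match")
    return tuple(out)


S, D, F = 2048, 8192, 28672  # example values from the dimension table
print(matmul_shape((S, D), (D, F)))       # (2048, 28672)
print(elementwise_shape((S, D), (1, D)))  # (2048, 8192)
```

In the first call, the shared D is the dimension whose edge color would appear on the top/bottom of the left tensor and on the left/right of the right tensor.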