Components of the forward pass

A forward pass is one complete trip of the input through the network, producing one output distribution. The name distinguishes it from the backward pass used during training, where errors flow back through the network to adjust the weights. A deployed model only ever runs forward: its weights are frozen, and nothing it processes changes it. This lesson names the stations the input passes through.

A transformer LLM is three stages: an embedding layer, then a tall stack of identical transformer blocks, then the LM head.

The forward pass: an embedding layer, a stack of identical transformer blocks, then the LM head producing logits that softmax turns into a probability distribution.

Stage 1: the embedding layer

Familiar territory. Token IDs select rows of the embedding matrix $E$ , position information is stamped in, and the sequence of tokens becomes a sequence of vectors. One vector per token, each with $d$ numbers, where $d$ is the model dimension from lesson 1.6.

Stage 2: the stack of transformer blocks

The body of the model is a stack of transformer blocks: anywhere from a dozen in small models to over a hundred in frontier ones. The blocks are architecturally identical, each with its own learned weights, and they run strictly in order. Each block takes in the sequence of vectors and outputs a refined sequence of the same shape: same number of tokens, same dimension $d$ , different values.

The shape never changing is the key design fact. Because every block’s output looks exactly like its input, blocks can be stacked like floors of a building, and making a model “deeper” is simply adding floors. What changes across the floors is the content of the vectors. A vector enters the stack representing roughly “this token at this position” and exits representing something far richer: that token in the light of everything relevant around it. By the top of the stack, the vector at the word “bank” in “river bank” and the one in “bank account” have moved far apart, even though both started from the same embedding row.

What happens inside one block, the actual refining machinery, is the deepest topic of this chapter and gets its own lesson (lesson 4).

Stage 3: the LM head

After the final block, the vector at the last position of the sequence is handed to the LM head: one final layer that maps a vector of $d$ numbers to $V$ numbers, one logit per vocabulary token. Softmax turns the logits into the probability distribution of lesson 1. Why the last position? Because in a model reading left to right, the last token’s vector is the one that has absorbed the entire sequence, making it the natural summary from which to predict what comes next.

The whole pass in one picture

token IDs
  → embedding layer            vectors, one per token
  → block 1 → block 2 → ... → block N     same shape, increasingly refined
  → LM head (at last position) → logits → softmax
  → probability distribution over the vocabulary

The model has now handed us a probability table with tens of thousands of entries. Somebody still has to pick exactly one token from it, and that choice is where a model’s “personality” settings like temperature live. That is the next lesson.