Table of contents
Everything in the tutorial, in order.
Chapter 1 · Tokenization & Embeddings
- 1.1 Why models need numbers · Language models compute with numbers, so text must be converted before and after every interaction. · 3 min
- 1.2 What is a token? · A token is the smallest unit of text a model works with, and subword tokens are the compromise every modern model uses. · 4 min
- 1.3 Byte pair encoding · BPE builds a tokenizer by repeatedly merging the most frequent pair of symbols, and the algorithm fits in thirty lines of Python. · 7 min
- 1.4 Tokenizers in practice · Encode and decode text with real production tokenizers, and learn why tokenizer and model must never be mixed. · 6 min
- 1.5 Special tokens · Tokenizers reserve IDs with control meanings, inserted by the pipeline rather than produced from user text. · 4 min
- 1.6 From token IDs to vectors · The embedding layer is a learned lookup table that turns each token ID into a vector whose meaning emerges from training. · 6 min
- 1.7 Measuring similarity · Cosine similarity compares the directions of two vectors, which after training tracks similarity of meaning. · 5 min
- 1.8 Position matters · Positional information is added to token embeddings so the model can distinguish one word order from another. · 4 min
- 1.9 Why LLMs act the way they do · Six everyday quirks of language models that follow directly from tokenization and embeddings. · 5 min
- 1.10 Chapter quiz · Six questions testing the core ideas of tokens and embeddings, with instant feedback. · 4 min
Chapter 2 · Inside the Transformer
- 2.1 Inputs and outputs of a trained LLM · A trained transformer takes token IDs in and produces one probability distribution over its entire vocabulary, predicting the next token. · 4 min
- 2.2 Components of the forward pass · One trip through the model passes through an embedding layer, a stack of identical transformer blocks, and the LM head. · 4 min
- 2.3 Sampling and decoding · Greedy decoding, temperature, top-k, and top-p are the standard strategies for choosing one token from the probability distribution. · 6 min
- 2.4 Inside the transformer block · Each block combines self-attention, which moves information between tokens, with a feed-forward network that processes each token. · 7 min
- 2.5 Parallel processing and context size · Prompts are processed in one parallel pass while generation runs token by token, and attention's quadratic cost shapes the context window. · 5 min
- 2.6 Speeding up generation with the KV cache · Caching every token's keys and values avoids recomputing the past at each step, trading memory for speed. · 5 min
- 2.7 Chapter quiz · Six questions testing the core ideas of the transformer's inner workings, with instant feedback. · 4 min