HexaHype

Table of contents

Everything in the tutorial, in order.

Chapter 1 · Tokenization & Embeddings

  1. 1.1 Why models need numbers (3 min)
     Language models compute with numbers, so text must be converted before and after every interaction.
  2. 1.2 What is a token? (4 min)
     A token is the smallest unit of text a model works with, and subword tokens are the compromise every modern model uses.
  3. 1.3 Byte pair encoding (7 min)
     BPE builds a tokenizer by repeatedly merging the most frequent pair of symbols, and the algorithm fits in thirty lines of Python.
  4. 1.4 Tokenizers in practice (6 min)
     Encode and decode text with real production tokenizers, and learn why tokenizer and model must never be mixed.
  5. 1.5 Special tokens (4 min)
     Tokenizers reserve IDs with control meanings, inserted by the pipeline rather than produced from user text.
  6. 1.6 From token IDs to vectors (6 min)
     The embedding layer is a learned lookup table that turns each token ID into a vector, and its meaning emerges from training.
  7. 1.7 Measuring similarity (5 min)
     Cosine similarity compares the directions of two vectors, which after training tracks similarity of meaning.
  8. 1.8 Position matters (4 min)
     Positional information is added to token embeddings so the model can tell different word orders apart.
  9. 1.9 Why LLMs act the way they do (5 min)
     Six everyday quirks of language models that follow directly from tokenization and embeddings.
  10. 1.10 Chapter quiz (4 min)
     Six questions testing the core ideas of tokens and embeddings, with instant feedback.

Chapter 2 · Inside the Transformer

  1. 2.1 Inputs and outputs of a trained LLM (4 min)
     A trained transformer takes token IDs in and produces one probability distribution over its entire vocabulary, predicting the next token.
  2. 2.2 Components of the forward pass (4 min)
     One trip through the model passes through an embedding layer, a stack of identical transformer blocks, and the LM head.
  3. 2.3 Sampling and decoding (6 min)
     Greedy decoding, temperature, top-k, and top-p are the standard strategies for choosing one token from the probability distribution.
  4. 2.4 Inside the transformer block (7 min)
     Each block combines self-attention, which moves information between tokens, with a feed-forward network that processes each token.
  5. 2.5 Parallel processing and context size (5 min)
     Prompts are processed in one parallel pass while generation runs token by token, and attention's quadratic cost shapes the context window.
  6. 2.6 Speeding up generation with the KV cache (5 min)
     Caching every token's keys and values avoids recomputing the past at each step, trading memory for speed.
  7. 2.7 Chapter quiz (4 min)
     Six questions testing the core ideas of the transformer's inner workings, with instant feedback.