Chapter 3 · Part 2

A score for every word

We've said the model "picks the next token." But it doesn't reach in and grab one. At every step it does something more even-handed: it assigns a score to every single token in its vocabulary — all ~50,000 of them — and turns those scores into probabilities. Then it draws one.

That probability distribution is the real output of a language model. Everything else — creativity, repetition, randomness — is just how you read and sample from it.

Scroll to watch the distribution form for the prompt "The sky is".

First, every token gets a raw score called a logit: just a number, where higher means "more likely here".


From scores to probabilities: softmax

The model's final layer emits one number per token, called a logit. Logits are unbounded and not very meaningful on their own: one might be 4.2, another −1.0. To turn them into probabilities we need every value to be positive and the whole set to sum to 1. That's exactly what softmax does: it exponentiates each logit, which makes every value positive, then divides by the sum of all the exponentials, so the results total 1.
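As a small worked example, here is softmax applied to three made-up logits. The values 4.2 and −1.0 are the ones mentioned above; the third logit and the token labels are illustrative assumptions, not real model output.

```python
import numpy as np

# Hypothetical logits for three candidate tokens (illustrative values only).
tokens = ["blue", "falling", "clear"]
logits = np.array([4.2, -1.0, 2.0])

def softmax(x):
    """Exponentiate each score, then normalize so the results sum to 1."""
    exps = np.exp(x - x.max())  # subtracting the max avoids overflow; result is unchanged
    return exps / exps.sum()

probs = softmax(logits)
for token, logit, p in zip(tokens, logits, probs):
    print(f"{token:>8}  logit={logit:+.1f}  prob={p:.3f}")
```

Every probability comes out positive, the three sum to 1, and the ranking of the logits is preserved. Because of the exponential, gaps get stretched: the 2.2-point difference between the first and third logits becomes roughly a ninefold difference in probability.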

The long tail is the whole point

Notice how lopsided the distribution is: a handful of tokens hold most of the probability, and tens of thousands of others split the rest. That shape is what makes a model feel both coherent and capable of surprise.

If the model always took the single highest-probability token (called greedy decoding), the same prompt would always produce the same continuation, and the text would tend to feel repetitive and robotic.
Because it samples instead, rolling a weighted die over the distribution, it usually picks a likely token but occasionally reaches into the tail, which reads as creativity.
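The contrast between the two strategies can be sketched on a toy distribution. The five probabilities below are invented for illustration (one strong favorite and a thin tail), and the sketch covers a single decoding step rather than a full generation loop.

```python
import numpy as np

# Hypothetical 5-token distribution: one strong favorite and a thin tail.
probs = np.array([0.60, 0.20, 0.10, 0.05, 0.05])

rng = np.random.default_rng(seed=0)  # seeded only so the sketch is repeatable

# Greedy decoding: always the index of the largest probability.
greedy_picks = [int(np.argmax(probs)) for _ in range(1000)]

# Sampling: a weighted draw, repeated 1000 times.
sampled_picks = rng.choice(len(probs), size=1000, p=probs)

greedy_counts = np.bincount(greedy_picks, minlength=len(probs))
sampled_counts = np.bincount(sampled_picks, minlength=len(probs))
print("greedy :", greedy_counts)
print("sampled:", sampled_counts)
```

Greedy lands on token 0 every time. Sampling still favors token 0, but the tail tokens show up at roughly the rate their probabilities suggest.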

sample.py — logits to a sampled token

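import numpy as np

# `model` and `tokens` are assumed to exist already: a loaded model and the tokenized prompt
logits = np.asarray(model.forward(tokens), dtype=np.float64)  # one score per vocab token (~50k)

# softmax → probabilities that sum to 1
# subtracting the max first keeps np.exp from overflowing; the result is unchanged
exps = np.exp(logits - logits.max())
probs = exps / exps.sum()

# sample one token id, weighted by probability
rng = np.random.default_rng()
next_id = rng.choice(len(probs), p=probs)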

This is also why asking ChatGPT the same question twice can give different answers: the prompt fixes the distribution, but the sampling step rolls the die again each time.
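To see the die being re-rolled, here is a minimal sketch that draws from one fixed distribution using differently seeded random generators. The distribution and the seeds are made up for illustration; whether a hosted product such as ChatGPT exposes or fixes a seed is outside what this sketch assumes.

```python
import numpy as np

# Hypothetical fixed distribution, standing in for "same prompt, same model".
probs = np.array([0.60, 0.20, 0.10, 0.05, 0.05])

def draw(seed, n=10):
    """Draw n token ids from probs using a generator with the given seed."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(probs), size=n, p=probs).tolist()

run_a = draw(seed=1)
run_b = draw(seed=2)
run_a_again = draw(seed=1)

print(run_a)
print(run_b)
print(run_a == run_a_again)  # same seed, same draws
```

The distribution never changes between runs; only the random draws do. Fixing the seed is what makes a run repeatable, which is why sampling code often accepts one.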

One dial changes everything

Right now we're sampling straight from the probabilities. But there's a single knob that reshapes this distribution before we draw — flattening it for wild, surprising output or sharpening it for safe, predictable output. That knob is temperature, and it's next.