Chapter 6 · Part 4

Steering with words

We can generate images from static — but so far the model produces whatever it wants. The thing that made DALL·E and Stable Diffusion famous is that you can type a sentence and get a matching picture. This final chapter is about that steering wheel: how a text prompt bends the denoising loop toward what you asked for.

It comes down to two pieces. First, the prompt has to become something the network can use — a vector of numbers called an embedding. Second, we need a way to crank up how strongly the model obeys that prompt. That dial is classifier-free guidance.

From words to a vector the model understands

A neural network can't read English; it works in numbers. So the prompt is first run through a text encoder, usually CLIP, a model trained on hundreds of millions of image–caption pairs until its text embeddings and image embeddings line up in the same space. The upshot: CLIP's embedding of "a golden sunset over green hills" lands close to the embeddings of images that actually show a golden sunset over green hills. That text embedding is fed into the U-Net (alongside the noisy image and timestep) at every denoising step, so the noise prediction is conditioned on the text.
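To make "line up in the same space" concrete, here is a minimal sketch of how closeness between embeddings is usually measured: cosine similarity. The three vectors are made-up 3-D numbers chosen for illustration, not real CLIP outputs, and the variable names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-D embeddings purely for illustration; real CLIP vectors
# have hundreds of dimensions and come from trained encoders.
text_sunset = [0.9, 0.1, 0.2]    # "a golden sunset over green hills"
image_sunset = [0.8, 0.2, 0.1]   # a photo of a sunset over hills
image_cat = [0.1, 0.9, 0.3]      # a photo of a cat

sim_match = cosine_similarity(text_sunset, image_sunset)
sim_mismatch = cosine_similarity(text_sunset, image_cat)
print(f"caption vs. sunset photo: {sim_match:.2f}")
print(f"caption vs. cat photo:    {sim_mismatch:.2f}")
```

In this toy setup the caption scores much higher against the matching photo than against the unrelated one, which is the property a jointly trained text and image encoder is meant to have.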

Classifier-free guidance: a strength dial

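Conditioning alone is often too timid: the model only loosely gestures at your prompt. The trick that makes prompts bite is to run the network twice at each step, once with your prompt and once with an empty prompt. The difference between the two predictions points in the direction of "more like the text," and we scale that difference by a guidance scale s (the variable guidance in the code below). With s = 0 the prompt is ignored, with s = 1 we get the plain conditional prediction, and values above 1 exaggerate the prompt's pull; push s too high and the image comes out over-cooked.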

guided.py — one prompt-steered denoising step

guidance = 7.5                           # guidance scale s; a commonly used value (1.0 = plain conditioning)

c = clip_encode(prompt)                  # text → embedding
empty = clip_encode("")                  # the unconditional embedding

# x starts as pure Gaussian noise, exactly as in the unguided loop
for t in reversed(range(T)):
    eps_cond   = unet(x, t, c)           # prediction WITH the prompt
    eps_uncond = unet(x, t, empty)       # prediction with NO prompt

    # start from the unconditional prediction, then move toward (and past) the prompt
    eps = eps_uncond + guidance * (eps_cond - eps_uncond)

    x = step_back(x, eps, t)             # same reverse step as before

Everything else — the schedule, the noise prediction, the reverse loop — is exactly what you already learned. Text conditioning just changes which noise we subtract at each step, steering the same machinery toward your words.
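To see what the guidance scale does over a whole loop, here is a deliberately tiny sketch. It is not a diffusion model: the "image" is a single number, `toy_unet` is a hypothetical stand-in that simply reports the offset from a fixed target, and the reverse step is reduced to a plain subtraction. Only the line that combines the two predictions matches guided.py.

```python
# Toy stand-ins, NOT a real diffusion model: the "image" is one number,
# and the fake network pulls toward a fixed target value.
PROMPT_TARGET = 2.0     # where the prompt-conditioned prediction points
UNCOND_TARGET = 0.0     # where the unconditional prediction points

def toy_unet(x, cond):
    """Return the 'noise' to subtract: the offset from the current target."""
    target = PROMPT_TARGET if cond else UNCOND_TARGET
    return x - target

def sample(guidance, steps=50, step_size=0.2, x_start=5.0):
    """Run the guided loop and return where x ends up."""
    x = x_start
    for _ in range(steps):
        eps_cond = toy_unet(x, cond=True)
        eps_uncond = toy_unet(x, cond=False)
        # Same combination rule as in guided.py
        eps = eps_uncond + guidance * (eps_cond - eps_uncond)
        x = x - step_size * eps   # simplified stand-in for the reverse step
    return x

for s in (0.0, 1.0, 3.0):
    print(f"guidance={s}: x ends near {sample(s):.2f}")
```

With a guidance of 0 the toy loop settles on the unconditional target, with 1 it settles on the prompt target, and with 3 it overshoots well past the prompt target. That overshoot is a rough analogue of the over-cooked look that very high guidance values produce in real samplers, although a real network is far less linear than this sketch.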

You now know how it works

Step back and the whole pipeline is just the ideas from this course, stacked:

Add noise to images until they're static — and learn to run it backwards.
That noise is precise, well-shaped Gaussian noise.
A fixed schedule controls it, with a closed-form jump to any step.
A U-Net learns one thing: predict the noise that was added.
Run it as a loop from static and you generate an image.
A text embedding plus guidance steers that loop toward your prompt.

That's the magic, demystified: a model that's astonishingly good at removing noise, pointed at randomness, and nudged by your words.

Thanks for reading. If you enjoyed this, the other course takes the same visual approach to how ChatGPT works.