Chapter 5 · Part 3
Generating from static
This is the chapter the whole course has been building toward. We have a network that, shown a noisy image, points to the noise inside it. Now we run it as a loop — and an image appears out of nothing.
The key move: we don't start from a photo. We start from a fresh patch of pure random static, something no one has ever seen, and ask the network to denoise that. Whatever coherent picture falls out the other end is a brand-new image. This loop is called sampling, and it retraces the forward noising process from Chapter 1 in reverse, one learned step at a time.
We begin at t = T with pure Gaussian static — no image, just randomness we sampled ourselves.
Why remove the noise gradually?
In Chapter 4 we subtracted all the predicted noise at once, but that only worked because the input was a gently, half-noised image and we secretly knew the answer. Starting from pure static, the network's first guess is necessarily rough, because there's almost no signal to go on. So instead of trusting it completely, we take a small step: remove a little noise, producing a slightly cleaner image, then feed that back in and ask again. Each step is an easier question than the last.
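A quick calculation shows why the one-shot subtraction breaks down at high noise. This is a sketch under assumptions that are not from this course: the standard DDPM forward process `x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps`, and a linear schedule with `T = 1000` and `beta` rising from `1e-4` to `0.02`. The course's own schedule may differ.

```python
import numpy as np

# Assumed schedule (not from this course): T = 1000 steps, beta rising
# linearly from 1e-4 to 0.02.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta)

# One-shot denoising inverts x_t = sqrt(abar) * x_0 + sqrt(1 - abar) * eps:
#   x0_hat = (x_t - sqrt(1 - abar) * eps_hat) / sqrt(abar)
# so an error in eps_hat reaches x0_hat multiplied by this factor.
amplification = np.sqrt((1.0 - alpha_bars) / alpha_bars)

for t in (0, T // 4, T // 2, T - 1):
    print(f"t={t:4d}  signal scale={np.sqrt(alpha_bars[t]):.4f}  "
          f"error amplification={amplification[t]:.2f}")
```

The factor grows steadily with `t`: near `t = 0` a mistake in the predicted noise barely matters, while near the top of the schedule the same mistake is blown up many times over. That is the numerical reason to trust the early guesses only a little.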
That iterative refinement is the entire sampling loop:
- Start with x_T = pure Gaussian noise.
- Predict the noise ε̂ in the current image with the U-Net.
- Use it to step to a slightly less noisy x₍ₜ₋₁₎ (and, except at the very end, add back a touch of fresh randomness).
- Repeat down to t = 0.
import numpy as np

# unet, step_back, sigma, T and image_shape are assumed to be defined elsewhere.
x = np.random.randn(*image_shape)  # start: pure static, x_T
for t in reversed(range(T)):  # walk timesteps T-1 ... 0
    eps_hat = unet(x, t)  # predict the noise in x
    x = step_back(x, eps_hat, t)  # remove a little; nudge to x_{t-1}
    if t > 0:  # no fresh noise on the final step
        x = x + sigma(t) * np.random.randn(*image_shape)  # a touch of noise
image = x  # x_0: a brand-new image

A different patch of static, a different picture
Here's the part that makes these models generative rather than mere photo-restorers: the starting static is something you sample. Feed in a different random patch and the loop walks to a different image. That's why hitting "generate" twice gives two different results — same network, different noise to begin with.
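To see the seed dependence concretely, here is a minimal runnable sketch of the sampling loop. Everything beyond the loop itself is an assumption for illustration. `toy_unet` is a hypothetical stand-in, not a trained network: it is the exact noise predictor for "images" whose pixels are plain unit Gaussians, so the outputs are not pictures. `step_back` and `sigma` are filled in with the standard DDPM update, which may differ from the sampler this course uses, and the schedule values are made up.

```python
import numpy as np

T = 50                                # hypothetical number of timesteps
betas = np.linspace(1e-4, 0.2, T)     # hypothetical noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def toy_unet(x, t):
    """Stand-in for a trained network (assumption): the ideal noise
    prediction when the 'images' are unit-Gaussian pixels."""
    return np.sqrt(1.0 - alpha_bars[t]) * x

def step_back(x, eps_hat, t):
    """Standard DDPM posterior mean (assumed sampler): remove a little noise."""
    return (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])

def sigma(t):
    """Scale of the fresh noise added back (DDPM choice sigma_t^2 = beta_t)."""
    return np.sqrt(betas[t])

def sample(seed, image_shape=(8, 8)):
    rng = np.random.default_rng(seed)           # the seed fixes all the randomness
    x = rng.standard_normal(image_shape)        # start: pure static
    for t in reversed(range(T)):                # walk timesteps T-1 ... 0
        eps_hat = toy_unet(x, t)
        x = step_back(x, eps_hat, t)
        if t > 0:                               # no fresh noise on the final step
            x = x + sigma(t) * rng.standard_normal(image_shape)
    return x

a = sample(seed=0)
b = sample(seed=1)
a_again = sample(seed=0)

print(np.allclose(a, a_again))  # same seed, same network
print(np.allclose(a, b))        # different seed, same network
```

The first check prints True and the second prints False: the same seed retraces the same trajectory, while a different seed ends somewhere else, even though the network never changed.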
Run this enough and a striking fact emerges: the model isn't pulling pictures out of a stored gallery. It only ever learned to remove noise. Coherent images are what you get when an expert denoiser is pointed at randomness and asked to keep cleaning.
One thing missing
We can now conjure images from static — but they're whatever the model feels like producing. We can't yet ask for a specific one. The last chapter adds the steering wheel: text conditioning, the bridge from "an image" to "the image I described."