Chapter 5 · Part 2
Convolution & filters
This is the single most important idea in the whole course. Almost everything in Part 4 — how a convolutional network actually sees — is built on the small animation below. Take your time with it.
So far an image has just been a grid of numbers. Now we ask a question of that grid: "where does something interesting happen?" A convolution answers it by sliding a tiny window, called a kernel, across the image.
Here's a tiny 7×7 image — just numbers. Dark pixels on the left, bright on the right.
What just happened
Three things slid past each other:
- The input — our 7×7 image of numbers.
- The kernel — a 3×3 grid of weights. This particular one is a Sobel filter, hand-designed to detect vertical edges.
- The feature map — the output. One number per kernel position, measuring "how much this kernel matched here."
Because the kernel has negative weights on the left and positive on the right, it produces a big number exactly where the image goes from dark to bright — an edge — and roughly zero across flat regions. We didn't tell it where the edge was. The arithmetic found it.
The size of the output
A 3×3 kernel can't centre on the outermost pixels, so the feature map is a
little smaller than the input. For an N×N image and a K×K kernel sliding one
pixel at a time, the output is (N−K+1) on each side — here, 7 − 3 + 1 = 5.
In code, the naïve version is just four nested loops:
import numpy as np
def convolve2d(image, kernel):
N, K = image.shape[0], kernel.shape[0]
M = N - K + 1 # output size (valid convolution)
out = np.zeros((M, M))
for r in range(M):
for c in range(M):
patch = image[r:r+K, c:c+K]
out[r, c] = np.sum(patch * kernel) # multiply, then add
return out
sobel_x = np.array([[-1, 0, 1],
[-2, 0, 2],
[-1, 0, 1]])Different kernels, different questions
Swap the nine weights and you ask a different question of the image. The same sliding-and-summing machinery becomes a blur, a sharpener, or an edge-detector — it all depends on the kernel:
- Blur — all weights equal (
1/9): each output is the average of its neighbourhood, smoothing the image. - Sharpen — a strong positive centre with negative neighbours: exaggerates differences.
- Sobel — what we used above: responds to edges in one direction.
Why this is the keystone
Right now we chose the kernel's nine numbers by hand. Hold onto one question as you continue: what if the machine could choose those numbers itself?
That is exactly what a convolutional neural network does. The sliding window never changes — only where the weights come from. We'll pick this animation back up, with learnable kernels, in Chapter 10.