Chapter 5 · Part 2

Convolution & filters

This is the single most important idea in the whole course. Almost everything in Part 4 — how a convolutional network actually sees — is built on the small animation below. Take your time with it.

So far an image has just been a grid of numbers. Now we ask a question of that grid: "where does something interesting happen?" A convolution answers it by sliding a tiny window, called a kernel, across the image.

Here's a tiny 7×7 image — just numbers. Dark pixels on the left, bright on the right.

scroll↓

What just happened

Three things slid past each other:

The input — our 7×7 image of numbers.
The kernel — a 3×3 grid of weights. This particular one is a Sobel filter, hand-designed to detect vertical edges.
The feature map — the output. One number per kernel position, measuring "how much this kernel matched here."

Because the kernel has negative weights on the left and positive on the right, it produces a big number exactly where the image goes from dark to bright — an edge — and roughly zero across flat regions. We didn't tell it where the edge was. The arithmetic found it.

The size of the output

A 3×3 kernel can't centre on the outermost pixels, so the feature map is a little smaller than the input. For an N×N image and a K×K kernel sliding one pixel at a time, the output is (N−K+1) on each side — here, 7 − 3 + 1 = 5.

In code, the naïve version is just four nested loops:

convolve.py — a 2D convolution from scratch

import numpy as np

def convolve2d(image, kernel):
  N, K = image.shape[0], kernel.shape[0]
  M = N - K + 1                      # output size (valid convolution)
  out = np.zeros((M, M))
  for r in range(M):
      for c in range(M):
          patch = image[r:r+K, c:c+K]
          out[r, c] = np.sum(patch * kernel)   # multiply, then add
  return out

sobel_x = np.array([[-1, 0, 1],
                  [-2, 0, 2],
                  [-1, 0, 1]])

Different kernels, different questions

Swap the nine weights and you ask a different question of the image. The same sliding-and-summing machinery becomes a blur, a sharpener, or an edge-detector — it all depends on the kernel:

Blur — all weights equal (1/9): each output is the average of its neighbourhood, smoothing the image.
Sharpen — a strong positive centre with negative neighbours: exaggerates differences.
Sobel — what we used above: responds to edges in one direction.

Why this is the keystone

Right now we chose the kernel's nine numbers by hand. Hold onto one question as you continue: what if the machine could choose those numbers itself?

That is exactly what a convolutional neural network does. The sliding window never changes — only where the weights come from. We'll pick this animation back up, with learnable kernels, in Chapter 10.