Chapter 1 · Part 1

Eight cameras, no lidar

Most self-driving prototypes you've seen have a spinning bucket on the roof. That's lidar, a laser scanner that measures distance directly. Tesla made a famous, contrarian bet: skip lidar and drive with cameras alone. The reasoning is that humans drive with two eyes and a brain, so a car with eight cameras and a big enough neural network should be able to as well. Cameras are also cheap, while lidar is expensive and doesn't read text, lane paint or traffic lights.

One important caveat up front: "Full Self-Driving" today is a driver-assistance system (SAE Level 2). It steers, accelerates and brakes, but a human must supervise and stay ready to take over. This course is about how the technology works, not a claim that it drives unsupervised.

Together, the eight cameras give the car a full 360° view.

Three forward cameras — wide, main and narrow — see near-and-wide to far-and-zoomed.

Why cameras, and why it's hard

The case for vision-only:

The world is built for eyes. Lane lines, signs, brake lights and hand signals are all visual. Lidar sees shape and distance but is colorblind to meaning.
Cost and scale. A camera module costs a small fraction of what a lidar unit does, so putting cameras on millions of cars is feasible, which (as we'll see in the last chapter) creates an enormous data advantage.

The case for why it's hard: a camera gives you a flat 2D grid of pixels with no built-in depth. Lidar hands you distance for free; with cameras the car must infer 3D structure — how far, how fast, how big — from 2D images alone. That inference is the job of the neural networks in the next four chapters.
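To see why a single image carries no depth, here is a minimal sketch of an idealised pinhole camera. The focal length and image centre are made-up illustrative values, not the calibration of any real car camera. Two points on the same line of sight, one twice as far away as the other, land on exactly the same pixel.

```python
# Hypothetical pinhole camera parameters (illustrative values only).
FOCAL_PX = 1000.0        # focal length, expressed in pixels
CX, CY = 640.0, 480.0    # pixel coordinates of the image centre


def project(x, y, z):
    """Project a 3D point (metres; x right, y down, z forward) to a pixel (u, v)."""
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    # Dividing by z is where the distance information is lost.
    return (FOCAL_PX * x / z + CX, FOCAL_PX * y / z + CY)


near = (1.0, 0.5, 10.0)              # a point 10 m ahead
far = tuple(2.0 * c for c in near)   # same direction, twice as far away

pixel_near = project(*near)
pixel_far = project(*far)
print(pixel_near, pixel_far)         # both points map to the same pixel
```

Because the projection divides by distance, scaling a point along its ray changes nothing in the image. Anything the car knows about "how far" has to come from other cues: apparent size, motion between frames, or a second viewpoint.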

Eight views, lots of overlap

The cameras differ in where they point and how much they zoom:

Three forward: a wide-angle (close, broad), a main, and a narrow/telephoto for seeing far down the road (distant traffic lights, highway speeds).
Two side pillar and two side repeater: the pillar cameras watch for cross-traffic at junctions, and the rear-facing repeaters watch for cars merging into blind spots.
One rear: for reversing and traffic coming up behind.

Crucially, their fields of view overlap. That removes blind spots and gives the network multiple angles on the same object, which is useful for estimating depth.
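Why do two angles on one object help with depth? A classical way to picture it is triangulation: each camera knows only the direction to the object, but two directions from two known positions cross at a single point. The sketch below works in a flat, top-down view with hypothetical camera positions and a simulated object. It illustrates the geometry only and is not a description of how the car's neural networks actually compute depth.

```python
import math


def bearing_to(cam_xz, target_xz):
    """Angle in radians from the forward (z) axis to the target, positive to the right."""
    dx = target_xz[0] - cam_xz[0]
    dz = target_xz[1] - cam_xz[1]
    return math.atan2(dx, dz)


def triangulate(cam_a, bearing_a, cam_b, bearing_b):
    """Intersect two top-down rays and return the (x, z) point where they cross."""
    da = (math.sin(bearing_a), math.cos(bearing_a))
    db = (math.sin(bearing_b), math.cos(bearing_b))
    denom = da[0] * db[1] - da[1] * db[0]   # 2D cross product of the two directions
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; the second view adds no depth information")
    diff = (cam_b[0] - cam_a[0], cam_b[1] - cam_a[1])
    t_a = (diff[0] * db[1] - diff[1] * db[0]) / denom   # distance along ray A
    return (cam_a[0] + t_a * da[0], cam_a[1] + t_a * da[1])


# Hypothetical setup: two cameras 1 m apart, object 20 m ahead and 2 m to the right.
CAM_A = (-0.5, 0.0)
CAM_B = (0.5, 0.0)
TRUE_OBJECT = (2.0, 20.0)

# Each camera observes only a direction, never a distance.
angle_a = bearing_to(CAM_A, TRUE_OBJECT)
angle_b = bearing_to(CAM_B, TRUE_OBJECT)

estimate = triangulate(CAM_A, angle_a, CAM_B, angle_b)
print(estimate)   # recovers the object's position up to floating-point error
```

Neither bearing alone pins down the object, but the pair does. The closer together the cameras sit relative to the object's distance, the more nearly parallel the rays become and the more a small angular error shifts the estimate.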

Where we're headed

Right now those are just eight streams of raw pixels — millions of numbers per frame, thirty-plus times a second, meaning nothing on their own. The first job is to find the things in them: cars, lanes, lights, people. Next: perception.