Chapter 6 · Part 3

Learning from the fleet

Every network in this course (perception, the bird's-eye view, prediction, planning) is only as good as the data it learned from. That is Tesla's real advantage, and the reason behind the bet on cheap cameras described in Chapter 1: millions of cars on real roads, each one a rolling data-collection device. That fleet powers a loop Tesla calls the data engine.


The fleet runs the network silently and flags moments it would have gotten wrong.


The data engine

The bottleneck in self-driving isn't ordinary driving. It's the long tail of rare events: a couch in the road, a person in a dinosaur costume, an unusual five-way intersection. You can't sit and wait to film these, so Tesla flips the problem and has the fleet find them:

Shadow mode. Cars run the network in the background and compare its decision to what the human driver actually did. A disagreement is a clue something's off.
Trigger & collect. Lightweight detectors flag interesting moments — hard braking, takeovers, rare objects — and upload just those short clips.
Auto-label. Much labeling is automated: with the full clip (and hindsight), the exact 3D paths of every object can be reconstructed offline, far more accurately than in real time.
Retrain & deploy. Add the new examples, retrain, validate, ship the improved net to the whole fleet — which then surfaces the next rare case.
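
The first two steps can be sketched in a few lines of Python. This is a minimal illustration, not Tesla's implementation: the Frame fields, the two thresholds and the function names are all hypothetical, and a real trigger set would cover far more than steering disagreement and hard braking.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Frame:
    t: float             # timestamp in seconds
    human_steer: float   # steering angle the driver applied (radians)
    shadow_steer: float  # steering angle the background network proposed
    decel: float         # deceleration in m/s^2 (positive means braking)

# Hypothetical thresholds, chosen only for illustration.
STEER_DISAGREE_RAD = 0.10
HARD_BRAKE_MS2 = 4.0

def triggers(frame: Frame) -> list[str]:
    """Return the names of the triggers that fire on this frame."""
    fired = []
    if abs(frame.human_steer - frame.shadow_steer) > STEER_DISAGREE_RAD:
        fired.append("shadow_disagreement")   # shadow mode: net and human differ
    if frame.decel > HARD_BRAKE_MS2:
        fired.append("hard_braking")
    return fired

def clips_to_upload(frames: list[Frame], pad_s: float = 2.0) -> list[tuple[float, float]]:
    """Merge triggered frames into short (start, end) clip windows."""
    clips: list[tuple[float, float]] = []
    for f in frames:
        if not triggers(f):
            continue                          # ordinary driving is never uploaded
        start, end = f.t - pad_s, f.t + pad_s
        if clips and start <= clips[-1][1]:
            clips[-1] = (clips[-1][0], end)   # overlaps the previous window: extend it
        else:
            clips.append((start, end))
    return clips

drive = [
    Frame(t=0.0, human_steer=0.00, shadow_steer=0.01, decel=0.2),
    Frame(t=10.0, human_steer=0.30, shadow_steer=0.02, decel=0.5),  # disagreement
    Frame(t=11.0, human_steer=0.25, shadow_steer=0.05, decel=5.0),  # both triggers
    Frame(t=30.0, human_steer=0.00, shadow_steer=0.00, decel=0.3),
]
print(clips_to_upload(drive))
```

Out of thirty seconds of driving, only one five-second window around the two flagged frames is kept. That is the point of the design: the car filters on board, so bandwidth and labeling effort go to the moments most likely to teach the network something.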

Each lap makes the network better exactly where it was weakest. The fleet is, in effect, one enormous distributed teacher.
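
Why hindsight helps in the auto-label step can be shown with a toy example. The sketch below is an assumption-laden stand-in: it tracks one object in one dimension and uses simple moving averages, whereas real offline labeling reconstructs full 3D paths with much richer methods. The names and numbers are illustrative only.

```python
def trailing_mean(xs, i, k):
    """Real-time estimate: average of the last k measurements up to frame i."""
    window = xs[max(0, i - k + 1): i + 1]
    return sum(window) / len(window)

def centered_mean(xs, i, half):
    """Offline estimate: average over measurements before AND after frame i."""
    window = xs[max(0, i - half): i + half + 1]
    return sum(window) / len(window)

# Toy track: an object moving 1 m per frame, measured with alternating noise.
true_pos = [float(t) for t in range(10)]
noise = [0.3 if t % 2 == 0 else -0.3 for t in range(10)]
measured = [p + n for p, n in zip(true_pos, noise)]

def mean_abs_error(estimates, frames):
    return sum(abs(estimates[i] - true_pos[i]) for i in frames) / len(frames)

interior = range(2, 8)  # skip the edges, where the windows are truncated
online = [trailing_mean(measured, i, 3) for i in range(10)]
offline = [centered_mean(measured, i, 1) for i in range(10)]

online_err = mean_abs_error(online, interior)
offline_err = mean_abs_error(offline, interior)
print(f"real-time error: {online_err:.2f} m, offline error: {offline_err:.2f} m")
```

Both estimators average three measurements, so they suppress noise equally well. The real-time one can only look backward, so its estimate lags a moving object by about a metre here, while the offline one sees the frames on both sides and lands within about 0.1 m.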

From a pipeline to one network

Notice that everything so far was a pipeline of hand-designed stages: perceive → build the map → predict → plan. That's interpretable, but every hand-coded seam between stages is a place where a human's guess about what to pass along can be wrong. The newer direction (Tesla's FSD "v12" and beyond) is to replace much of that pipeline with a single end-to-end neural network trained to map camera pixels almost directly to steering and pedals.
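
The contrast can be made concrete with a deliberately tiny sketch. Everything in it is an assumption for illustration: the "image" is two numbers, the hand-written stages and HAND_TUNED_GAIN are invented, and the end-to-end model is a linear function fitted by gradient descent rather than a neural network. It shows the shape of the idea, not how Tesla's stack is built.

```python
# Modular pipeline: each stage, and the value passed between stages, is hand-designed.
def perceive(pixels):
    """Toy perception: reduce the 'image' to a lane offset in metres."""
    return sum(pixels) / len(pixels)

def plan(lane_offset):
    """Toy planner: a hand-written rule with a hand-picked gain."""
    HAND_TUNED_GAIN = -0.3   # a human's guess
    return HAND_TUNED_GAIN * lane_offset

def pipeline_policy(pixels):
    return plan(perceive(pixels))

# End-to-end: one parameterised function from pixels to steering,
# fitted to what human drivers actually did (imitation learning).
def end_to_end_policy(weights, pixels):
    return sum(w * p for w, p in zip(weights, pixels))

def train(demos, n_pixels, lr=0.05, epochs=500):
    weights = [0.0] * n_pixels
    for _ in range(epochs):
        for pixels, human_steer in demos:
            err = end_to_end_policy(weights, pixels) - human_steer
            # Gradient of 0.5 * err**2 with respect to each weight is err * pixel.
            weights = [w - lr * err * p for w, p in zip(weights, pixels)]
    return weights

# Toy demonstrations: the (simulated) human steers with -0.5 * mean(pixels).
demos = [
    ([1.0, 0.0], -0.25),
    ([0.0, 1.0], -0.25),
    ([1.0, 1.0], -0.50),
    ([2.0, 0.0], -0.50),
]
weights = train(demos, n_pixels=2)

new_frame = [3.0, 1.0]       # the simulated human would steer -1.0 here
print("pipeline:  ", pipeline_policy(new_frame))
print("end-to-end:", round(end_to_end_policy(weights, new_frame), 3))
```

The pipeline's answer is off because its gain was guessed, while the learned policy recovers the drivers' behavior from data. The trade-off runs the other way too: the pipeline exposes a lane offset you can inspect, and the end-to-end weights do not.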

You now know how it works

Pull it together and a self-driving car is one loop, running thirty times a second, trained by millions more cars:

It sees with cameras only — eight of them, no lidar.
Perception turns each frame's pixels into objects and lanes.
Those fuse into a bird's-eye-view 3D map of the world.
It predicts the likely futures of everyone around it.
It plans a safe, comfortable path and steers along it.
And the fleet's data engine keeps it improving, even as the stack itself becomes increasingly one end-to-end net.

Two honest reminders: today's system is SAE Level 2 driver assistance and needs an attentive human, and the long tail of rare situations is exactly why "almost solved" has stayed "almost" for years. But the machinery is no longer mysterious: it's perception, a world model, prediction, planning, and a lot of data, turning eight video feeds into a turn of the wheel.

Thanks for reading. If you enjoyed this, the other courses cover how images, language, meaning and recommendations work under the hood.