Chapter 3 · Part 2

A bird's-eye view of the world

You can't drive from eight separate, flat camera images. The front camera sees a car "ahead and slightly left"; the left-pillar camera sees the same car from another angle; neither knows how many meters away it is. To plan a path, the car needs all of that fused into one shared, top-down 3D map — as if a drone were hovering directly overhead. Tesla calls this the vector space (or bird's-eye view, BEV).

Each camera only delivers a flat 2D slice, from its own viewpoint.

From many perspectives to one space

The trick is to let the network fuse information across cameras before it decides what is where. Detecting objects in each image and then stitching the results together fails at camera boundaries. Instead, the network projects features from all eight images into a common top-down grid, then reads objects and lanes off that grid. Where neighboring views overlap, it sees the same point from two angles and can triangulate its depth, recovering the 3D structure the cameras never measured directly.
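To make the triangulation idea concrete, here is a minimal geometric sketch in Python. It is not Tesla's method: the real network learns depth implicitly from image features, while this example intersects two hand-computed rays in the top-down plane. The camera positions, the grid size, and the function names triangulate and to_cell are all assumptions made up for illustration.

```python
import math

def triangulate(cam_a, bearing_a, cam_b, bearing_b):
    """Intersect two top-down rays; return (x, y) in meters, or None.

    cam_* are (x, y) camera positions in the car frame (x forward, y left).
    bearing_* are ray directions in radians, counterclockwise from +x.
    """
    dax, day = math.cos(bearing_a), math.sin(bearing_a)
    dbx, dby = math.cos(bearing_b), math.sin(bearing_b)
    denom = dax * dby - day * dbx  # 2D cross product of the two directions
    if abs(denom) < 1e-9:
        return None  # rays are (nearly) parallel: no usable intersection
    rx, ry = cam_b[0] - cam_a[0], cam_b[1] - cam_a[1]
    t = (rx * dby - ry * dbx) / denom  # distance along ray A to the crossing
    if t <= 0:
        return None  # crossing lies behind camera A
    return (cam_a[0] + t * dax, cam_a[1] + t * day)

def to_cell(point, cell_size=0.5, half_extent=50.0):
    """Map a point in meters to a (row, col) cell of a top-down grid."""
    col = int((point[0] + half_extent) // cell_size)
    row = int((point[1] + half_extent) // cell_size)
    return row, col

# Hypothetical setup: two cameras on the car, one object in the scene.
front_cam = (2.0, 0.0)
pillar_cam = (1.0, 0.9)
target = (12.0, 3.0)

# Each camera alone only knows the direction to the object, not its range.
bearing_front = math.atan2(target[1] - front_cam[1], target[0] - front_cam[0])
bearing_pillar = math.atan2(target[1] - pillar_cam[1], target[0] - pillar_cam[0])

estimate = triangulate(front_cam, bearing_front, pillar_cam, bearing_pillar)
print(estimate)  # approximately the target position, in meters
```

Neither bearing on its own pins down a distance. The two together do, which is why fusing views before deciding what is where pays off.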

The occupancy network

Boxes are fine for cars, but the world is full of weird shapes that don't fit any category: a fallen ladder, an overhanging branch, a strange trailer. So alongside labeled objects, Tesla runs an occupancy network that divides the 3D space around the car into small volumes (voxels) and asks the same category-free question of each one: is this occupied, and is it moving?
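As a rough picture of the data structure this implies, the sketch below builds a sparse voxel grid in Python. It is only an illustration of the representation: a real occupancy network predicts these values from camera features, whereas this example starts from 3D points with known velocities. The voxel size, the speed threshold, and the names voxel_index and build_occupancy are assumptions.

```python
import math
from collections import defaultdict

VOXEL_SIZE = 0.5  # meters per voxel edge; an illustrative value

def voxel_index(x, y, z, size=VOXEL_SIZE):
    """Return the integer (i, j, k) index of the voxel containing a point."""
    return (math.floor(x / size), math.floor(y / size), math.floor(z / size))

def build_occupancy(points, moving_threshold=0.2):
    """Build a sparse occupancy grid from (x, y, z, vx, vy, vz) points.

    Voxels missing from the result are treated as free space. A voxel is
    flagged as moving when the mean speed of its points exceeds the
    threshold (in m/s).
    """
    speeds = defaultdict(list)
    for x, y, z, vx, vy, vz in points:
        speed = math.sqrt(vx * vx + vy * vy + vz * vz)
        speeds[voxel_index(x, y, z)].append(speed)
    return {
        voxel: {"occupied": True, "moving": sum(s) / len(s) > moving_threshold}
        for voxel, s in speeds.items()
    }

# Hypothetical scene: a fallen ladder (static) and a passing trailer (moving).
points = [
    (8.1, 0.2, 0.1, 0.0, 0.0, 0.0),   # ladder
    (8.3, 0.3, 0.1, 0.0, 0.0, 0.0),   # ladder, same voxel as above
    (20.0, -3.0, 1.2, 5.0, 0.0, 0.0),  # trailer
]
grid = build_occupancy(points)
for voxel, state in sorted(grid.items()):
    print(voxel, state)
```

Nothing in the grid says "ladder" or "trailer". The planner only needs to know which volumes to avoid and whether they are likely to stay put.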

Why this representation matters

Once the world is a tidy top-down map in meters, everything downstream gets easier:

  • Distances and speeds are real, physical quantities — not pixel sizes.
  • It's temporally stable: the car remembers a cyclist that's momentarily hidden behind a van, because the map persists across frames.
  • Planning happens in the same space you'd draw a route on — a literal map.
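
The first two points can be sketched together in a few lines of Python. This is a deliberately simple constant-velocity tracker, assumed here for illustration and not a description of how Tesla's system keeps its memory: because positions are in meters, speed falls out of plain subtraction, and a hidden object can be carried forward until it is seen again. The Track class and its fields are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class Track:
    """One object in the top-down map; positions in meters, speeds in m/s."""
    x: float
    y: float
    vx: float = 0.0
    vy: float = 0.0
    frames_unseen: int = 0

    def step(self, dt, observation=None):
        """Advance by dt seconds, with or without a new (x, y) observation."""
        if observation is None:
            # Occluded: coast along the last known velocity.
            self.x += self.vx * dt
            self.y += self.vy * dt
            self.frames_unseen += 1
        else:
            ox, oy = observation
            # Velocity is just displacement over time, in real units.
            self.vx = (ox - self.x) / dt
            self.vy = (oy - self.y) / dt
            self.x, self.y = ox, oy
            self.frames_unseen = 0

    @property
    def speed(self):
        return math.hypot(self.vx, self.vy)

# Hypothetical cyclist: seen once, then hidden behind a van for two frames.
cyclist = Track(x=10.0, y=2.0)
cyclist.step(0.5, observation=(12.0, 2.0))  # visible: moved 2 m in 0.5 s
cyclist.step(0.5)                           # hidden
cyclist.step(0.5)                           # still hidden
print(cyclist.x, cyclist.speed, cyclist.frames_unseen)
```

A production system would also track uncertainty and drop a track that stays unseen for too long; the frames_unseen counter is where that logic would hook in.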

Where we're headed

We now have a live, top-down model of where everything is right now. But driving isn't about now — it's about the next few seconds. Will that car change lanes? Will that pedestrian step off the curb? Next: predicting what others will do.