Chapter 3 · Part 2

Free labor: you are the labeler

Here's the turn that makes this story great. In 2007 a project called reCAPTCHA asked: those hundreds of millions of daily human readings — what if we aimed them at text that computers genuinely couldn't read? Specifically, old books and newspapers being digitized, where the scans were so faded or smudged that OCR threw up its hands.

The catch: if the computer doesn't know the answer, how can it grade you? The solution is elegant — show two words.

Every puzzle had two words — but you couldn't tell them apart.

The control word does double duty

The brilliance is that you can't tell which word is which, so you try equally hard on both:

  • The control word has a known answer. Get it right and the system treats you as human, and counts your answer to the other word as a credible vote.
  • The unknown word is a real scan the digitization project couldn't read. Your guess is a free transcription.

One word does the security; the other does useful work. Same effort, two payoffs.
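To make the two-word trick concrete, here is a minimal sketch in Python. It is not reCAPTCHA's actual code: the names (Challenge, make_challenge, grade), the image ids, and the case-insensitive exact match are all assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Challenge:
    words: list            # image ids, in the order shown to the user
    control_index: int     # position of the control word (never revealed)
    control_answer: str    # known transcription of the control word


def make_challenge(control_id, control_answer, unknown_id, rng=random):
    """Pair a control word with an unknown word, in random order."""
    words = [control_id, unknown_id]
    rng.shuffle(words)  # the user cannot tell which word is graded
    return Challenge(words, words.index(control_id), control_answer)


def normalize(text):
    # Assumption: ignore case and surrounding whitespace when comparing.
    return text.strip().lower()


def grade(challenge, typed):
    """Return (passed, vote).

    passed: whether the control word was transcribed correctly.
    vote:   the guess for the unknown word, kept only if the user passed.
    """
    control_guess = typed[challenge.control_index]
    unknown_guess = typed[1 - challenge.control_index]
    passed = normalize(control_guess) == normalize(challenge.control_answer)
    return passed, (normalize(unknown_guess) if passed else None)
```

Only the control word decides pass or fail. The guess for the unknown word is never graded; it is simply recorded when the person has already shown they can read.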

Consensus turns guesses into truth

A single person could mistype, so no one answer is trusted outright. The same unknown word is shown to many people, and when enough of them independently agree, that answer is accepted as correct — and even fed back as a new control word later.
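The consensus step can be sketched the same way. The thresholds below (at least 3 votes, at least 75% agreement) and the function names are assumptions for illustration, not the values or API the real system used.

```python
from collections import Counter


def consensus(votes, min_votes=3, min_share=0.75):
    """Return the agreed transcription, or None if agreement is too weak.

    min_votes and min_share are illustrative thresholds, not real ones.
    """
    if len(votes) < min_votes:
        return None
    answer, count = Counter(votes).most_common(1)[0]
    return answer if count / len(votes) >= min_share else None


def promote(image_id, votes, control_words):
    """If the votes agree, record the word so it can serve as a control word."""
    answer = consensus(votes)
    if answer is not None:
        control_words[image_id] = answer
    return answer
```

With these assumed thresholds, four votes of "morning", "morning", "morning", "moming" settle on "morning", while a two-against-two split stays unresolved and the word keeps being shown to more people.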

This is how reCAPTCHA helped digitize the archives of The New York Times and books scanned for Google Books: work that would have cost a fortune, done in the cracks of people logging into websites.

The pattern, generalized

Notice the reusable recipe, because the next chapter is just this recipe with a new kind of data:

  1. Mix a known item (verifies the human) with an unknown item (gets labeled).
  2. Use the known one to grade; use agreement on the unknown one to label.
  3. Collect at planetary scale, for free.

Once books were digitized, Google had this machine and a new question: what else needs labeling that computers can't yet do? The answer was images — and that's where the traffic lights come in. Next: why traffic lights.