Chapter 5 · Part 2

How it remembers

People talk about ChatGPT "remembering" a conversation, but it has no memory in the human sense. Every time it predicts the next token, it effectively re-reads the entire conversation so far: your messages, its own replies, the hidden system instructions, any files you pasted. All of it.

The catch: it can only re-read a fixed amount. That fixed amount is the context window, and it's measured in tokens. Once a conversation grows past the limit, the oldest text falls outside the window — and to the model, it simply stops existing.

Scroll to watch a conversation outgrow its window.

A model can only read a fixed number of tokens at once — its context window. Early in a chat, everything fits.

A fixed budget of tokens

The window isn't just your messages — everything the model needs to read shares the same budget:

  • The system prompt (hidden instructions about how to behave).
  • The whole conversation history, both sides.
  • Your latest message and any documents in it.
  • Room reserved for the reply it's about to generate.

Add those up and the total must fit inside one number: the model's maximum context length. Early models held only a couple of thousand tokens; modern ones stretch to 128,000 or more (roughly a long book). But the window is always finite, and a long chat or a big pasted file fills it faster than you'd expect.
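
To make the arithmetic concrete, here is a minimal sketch of that budget. The function name and the token counts are made up for illustration; real limits and real counts depend on the model and its tokenizer.

```python
# Hypothetical limit for illustration; real limits vary by model.
MAX_CONTEXT = 128_000


def remaining_budget(system_tokens, history_tokens, message_tokens,
                     reply_reserve, max_context=MAX_CONTEXT):
    """Return the tokens left once every part of the window is counted.

    A negative result means the request does not fit.
    """
    used = system_tokens + history_tokens + message_tokens + reply_reserve
    return max_context - used


# A long chat plus a pasted file that still fits, just barely.
print(remaining_budget(1_500, 90_000, 30_000, 4_000))   # 2500

# The same chat with a slightly bigger file no longer fits.
print(remaining_budget(1_500, 90_000, 40_000, 4_000))   # -7500
```

Note that the room reserved for the reply counts against the budget before a single word of the reply exists.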

What happens when you run out

When a conversation would exceed the window, something has to give. The two common strategies:

  • Truncation — drop the oldest messages until the rest fits. This is the literal "forgetting the beginning" you saw in the visual: ask about something you said an hour ago and it may be gone.
  • Summarization — replace old turns with a short recap so the gist survives in far fewer tokens. (This is roughly what longer-running assistants do behind the scenes.)
fit.py — trimming history to fit the window
MAX_CONTEXT = 128_000
RESERVED_FOR_REPLY = 4_000
budget = MAX_CONTEXT - RESERVED_FOR_REPLY

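# count_tokens and model are placeholders for your tokenizer and model client
# always keep the system prompt and the newest message;
# drop the oldest turns in between until everything fits
messages = [system_prompt] + history + [user_message]
while count_tokens(messages) > budget and len(messages) > 2:
    messages.pop(1)        # remove the oldest turn after the system prompt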

reply = model.generate(messages)
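
The second strategy, summarization, can be sketched in the same spirit. Everything here is an assumption made for illustration: count_tokens treats each whitespace-separated word as one token, and summarize is a placeholder that keeps the first five words of each turn, where a real system would more likely ask a model to write the recap.

```python
def count_tokens(messages):
    # Crude stand-in for a tokenizer: one token per word.
    return sum(len(m.split()) for m in messages)


def summarize(turns):
    # Placeholder recap: keep only the first five words of each old turn.
    gist = "; ".join(" ".join(t.split()[:5]) for t in turns)
    return "Summary of earlier conversation: " + gist


def compact(system_prompt, history, user_message, budget, keep_recent=2):
    """Swap all but the most recent turns for one recap when over budget."""
    messages = [system_prompt] + history + [user_message]
    if count_tokens(messages) <= budget or len(history) <= keep_recent:
        return messages
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [system_prompt, summarize(old)] + recent + [user_message]


history = [
    "my name is Ada and I am planning a trip to Lisbon",
    "great, Lisbon is lovely in spring, when do you want to go",
    "probably in April, for about a week, with my sister",
    "April works well, the weather is mild and crowds are smaller",
]
compacted = compact("You are a helpful assistant.", history,
                    "what should we pack?", budget=30)
for message in compacted:
    print(message)
```

A recap is shorter than the turns it replaces, but nothing guarantees the result fits; a fuller version would check the count again and fall back to truncation if needed.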

Why this shapes everything

Almost every quirk of prompting traces back to the window:

  • Long chats drift. Once the start scrolls out, the model loses the framing you set up early — so it helps to restate key facts.
  • "Read this document" has a ceiling. A file only helps if it fits; that's why big documents get chunked.
  • RAG exists because of this. Retrieval-augmented generation searches a huge knowledge base and pastes only the most relevant snippets into the window, rather than trying to cram everything in.
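
The chunking and retrieval ideas above can be sketched together. This is a toy under stated assumptions: chunks are measured in words rather than real tokens, and relevance is scored by counting shared words, where real retrieval systems typically compare embeddings. All function names are hypothetical.

```python
def chunk(text, size=50):
    """Split text into pieces of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def score(query, passage):
    """Count distinct query words that also appear in the passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def retrieve(query, chunks, budget):
    """Pick the best-scoring chunks whose total word count fits the budget."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    picked, used = [], 0
    for c in ranked:
        cost = len(c.split())
        # Skip chunks that share no words with the query or would overflow.
        if score(query, c) > 0 and used + cost <= budget:
            picked.append(c)
            used += cost
    return picked


chunks = ["the cat sat on the mat", "dogs bark at night", "a cat chases mice"]
print(retrieve("cat habits", chunks, budget=6))
# ['the cat sat on the mat']
```

Only the selected snippets go into the window alongside the question; the rest of the knowledge base never touches the budget.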

So the model can hold a lot in view, but only what fits, and only right now. And notice what it's doing with all that context: predicting plausible next tokens. It never actually checks whether any of it is true. That gap is the subject of our last chapter: why it makes things up.