Chapter 5 · Part 2

How it remembers

People talk about ChatGPT "remembering" a conversation, but it has no memory in the human sense. Every time it predicts the next word, it re-reads the entire conversation so far from scratch — your messages, its own replies, the hidden system instructions, any files you pasted. All of it.
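A minimal sketch of that re-reading, under stated assumptions: `fake_model` and `chat_turn` are hypothetical names, and `fake_model` is a stand-in for a real model rather than any library's API. The point is that the caller resends the full transcript on every turn, and the model stores nothing between calls.

```python
# Sketch only: fake_model is a hypothetical stand-in for a real language model.
# It keeps no state; all it ever sees is the text passed in on this call.
def fake_model(prompt: str) -> str:
    return f"(reply after reading {len(prompt.split())} words)"


def chat_turn(system_prompt: str, history: list[str], user_message: str) -> str:
    """Run one turn by resending everything: instructions, history, new message."""
    history.append(f"User: {user_message}")
    prompt = "\n".join([system_prompt] + history)  # the whole conversation, every time
    reply = fake_model(prompt)
    history.append(f"Assistant: {reply}")
    return reply


history: list[str] = []
first = chat_turn("You are a helpful assistant.", history, "My name is Ada.")
second = chat_turn("You are a helpful assistant.", history, "What is my name?")
```

The second call reads a longer prompt than the first, because the first exchange is sent again along with the new question. Any "memory" lives in the `history` list the caller maintains.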

The catch: it can only re-read a fixed amount. That fixed amount is the context window, and it's measured in tokens. Once a conversation grows past the limit, the oldest text falls outside the window — and to the model, it simply stops existing.

Visual: a conversation outgrowing its context window. A model can only read a fixed number of tokens at once; early in a chat, everything fits.

A fixed budget of tokens

The window isn't just your messages — everything the model needs to read shares the same budget:

The system prompt (hidden instructions about how to behave).
The whole conversation history, both sides.
Your latest message and any documents in it.
Room reserved for the reply it's about to generate.

Add those up and they must fit inside one number: the model's maximum context length. Early models held only a couple of thousand tokens; modern ones stretch to 128,000 or more (at a rough rule of thumb of three-quarters of an English word per token, that is on the order of a long book). But it's always finite, and a long chat or a big pasted file fills it faster than you'd expect.
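The budget arithmetic can be sketched as follows. `estimate_tokens` and `remaining_budget` are hypothetical helpers. The estimate uses a rough rule of thumb of about four characters per token for English text, where a real tokenizer would give exact counts. The numbers are illustrative.

```python
MAX_CONTEXT = 128_000  # assumed maximum context length, in tokens


def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4) if text else 0


def remaining_budget(system_prompt: str, history: list[str], user_message: str,
                     reserved_for_reply: int, max_context: int = MAX_CONTEXT) -> int:
    """Tokens left over after everything that shares the window is counted.

    A negative result means the conversation no longer fits.
    """
    used = estimate_tokens(system_prompt)
    used += sum(estimate_tokens(turn) for turn in history)
    used += estimate_tokens(user_message)
    return max_context - used - reserved_for_reply


# Illustrative inputs: a short chat fits easily, a huge pasted file does not.
short_chat = remaining_budget("Be concise.", ["Hi!", "Hello."], "Summarize this.", 4_000)
big_paste = remaining_budget("Be concise.", [], "x" * 600_000, 4_000)
```

With these inputs `short_chat` is comfortably positive while `big_paste` comes out negative: a single 600,000-character paste overflows the window on its own.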

What happens when you run out

When a conversation would exceed the window, something has to give. The two common strategies:

Truncation — drop the oldest messages until the rest fits. This is the literal "forgetting the beginning" you saw in the visual: ask about something you said an hour ago and it may be gone.
Summarization — replace old turns with a short recap so the gist survives in far fewer tokens. (This is roughly what longer-running assistants do behind the scenes.)

fit.py — trimming history to fit the window

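# system_prompt, history, user_message, count_tokens() and model come from the surrounding app
MAX_CONTEXT = 128_000        # the model's maximum context length, in tokens
RESERVED_FOR_REPLY = 4_000   # room kept free for the generated reply
budget = MAX_CONTEXT - RESERVED_FOR_REPLY

# always keep the system prompt and the latest user message;
# drop the oldest turns in between until the rest fits
messages = [system_prompt] + history + [user_message]
while count_tokens(messages) > budget and len(messages) > 2:
  messages.pop(1)        # remove the oldest turn after the system prompt

if count_tokens(messages) > budget:
  raise ValueError("system prompt and latest message alone exceed the budget")

reply = model.generate(messages)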
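The block above implements truncation. Summarization could be sketched along the same lines. Here `count_tokens`, `summarize`, and `compact` are hypothetical stand-ins: the first counts words rather than real tokens, and the second builds a crude recap from the first few words of each turn, where a real system would ask a model to write the summary.

```python
def count_tokens(messages: list[str]) -> int:
    """Hypothetical stand-in: counts whitespace-separated words, not real tokens."""
    return sum(len(m.split()) for m in messages)


def summarize(turns: list[str], words_per_turn: int = 3) -> str:
    """Hypothetical stand-in for a model-written recap: first few words of each turn."""
    gist = [" ".join(t.split()[:words_per_turn]) for t in turns]
    return "Summary of earlier turns: " + " / ".join(gist)


def compact(system_prompt: str, history: list[str], user_message: str,
            budget: int, keep_recent: int = 2) -> list[str]:
    """Replace all but the most recent turns with one recap if the prompt is over budget."""
    messages = [system_prompt] + history + [user_message]
    if count_tokens(messages) <= budget or len(history) <= keep_recent:
        return messages
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [system_prompt, summarize(old)] + recent + [user_message]


history = [
    "User: I am planning a trip to Lisbon in May with my sister.",
    "Assistant: Lovely, May is a good month for walking the old town.",
    "User: We both like seafood and dislike long bus rides.",
    "Assistant: Then stay central and eat near the river.",
]
compacted = compact("Be helpful.", history, "User: Where should we stay?", budget=40)
```

The two most recent turns survive word for word, while the older ones shrink to a single recap line. This sketch does not guarantee the result fits the budget; a fuller version would re-check the count and fall back to truncation if needed.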

Why this shapes everything

Almost every quirk of prompting traces back to the window:

Long chats drift. Once the start scrolls out, the model loses the framing you set up early — so it helps to restate key facts.
"Read this document" has a ceiling. A file only helps if it fits; that's why big documents get chunked.
RAG exists because of this. Retrieval-augmented generation searches a huge knowledge base and pastes only the most relevant snippets into the window, rather than trying to cram everything in.
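The chunking and retrieval ideas above can be sketched together. This is a toy under stated assumptions: chunks are fixed-size groups of words, relevance is plain word overlap with the question (real RAG systems typically use embedding similarity instead), word counts stand in for token counts, and all function names are hypothetical.

```python
def chunk_words(text: str, chunk_size: int) -> list[str]:
    """Split a document into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def overlap_score(query: str, chunk: str) -> int:
    """Toy relevance score: number of distinct lowercase words shared with the query."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def retrieve(query: str, chunks: list[str], word_budget: int) -> list[str]:
    """Pick the highest-scoring chunks that fit the budget (words stand in for tokens)."""
    ranked = sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)
    picked, used = [], 0
    for chunk in ranked:
        size = len(chunk.split())
        if overlap_score(query, chunk) == 0 or used + size > word_budget:
            continue  # skip irrelevant chunks and ones that would overflow the budget
        picked.append(chunk)
        used += size
    return picked


document = (
    "The warranty covers battery defects for two years. "
    "Shipping takes about five business days within Europe. "
    "Returns are accepted within thirty days of delivery. "
    "The battery charges fully in about ninety minutes."
)
question = "how long does the battery warranty last"
chunks = chunk_words(document, chunk_size=8)
snippets = retrieve(question, chunks, word_budget=16)
prompt = "\n".join(snippets + [f"Question: {question}"])
```

Only the two battery-related chunks end up in `prompt`. The shipping and returns sentences never enter the window, which is the whole point: the knowledge base can be far larger than the context, as long as the selected snippets fit.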

So the model can hold a lot in view, but only what fits, and only right now. And notice what it's doing with all that context: predicting plausible next tokens. It never actually checks whether any of it is true. That gap is the subject of the last chapter, coming next: why it makes things up.