Chapter 5 · Part 2
How it remembers
People talk about ChatGPT "remembering" a conversation, but it has no memory in the human sense. Every time it predicts the next word, it re-reads the entire conversation so far from scratch — your messages, its own replies, the hidden system instructions, any files you pasted. All of it.
The catch: it can only re-read a fixed amount. That fixed amount is the context window, and it's measured in tokens. Once a conversation grows past the limit, the oldest text falls outside the window — and to the model, it simply stops existing.
A fixed budget of tokens
The window isn't just your messages — everything the model needs to read shares the same budget:
- The system prompt (hidden instructions about how to behave).
- The whole conversation history, both sides.
- Your latest message and any documents in it.
- Room reserved for the reply it's about to generate.
Add those up and they must fit inside one number — the model's maximum context length. Early models held only a couple thousand tokens; modern ones stretch to 128,000 or more (roughly a long book). But it's always finite, and a long chat or a big pasted file fills it faster than you'd expect.
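To make the arithmetic concrete, here is a minimal sketch of the budget check. The `estimate_tokens` helper is a hypothetical stand-in that assumes roughly four characters per token, a common rule of thumb for English text; a real tokenizer would give exact counts. The 128,000 and 4,000 figures are illustrative, not the limits of any particular model.

```python
MAX_CONTEXT = 128_000        # illustrative maximum context length, in tokens
RESERVED_FOR_REPLY = 4_000   # illustrative room kept free for the reply


def estimate_tokens(text: str) -> int:
    """Rough estimate: assume about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4) if text else 0


def fits_in_window(system_prompt: str, history: list[str], user_message: str) -> bool:
    """Return True if every part, plus the reserved reply space, fits."""
    used = (
        estimate_tokens(system_prompt)
        + sum(estimate_tokens(turn) for turn in history)
        + estimate_tokens(user_message)
    )
    return used + RESERVED_FOR_REPLY <= MAX_CONTEXT
```

Under this estimate, a short chat fits easily, while a single pasted file of 600,000 characters (about 150,000 estimated tokens) would not.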
What happens when you run out
When a conversation would exceed the window, something has to give. The two common strategies:
- Truncation: drop the oldest messages until the rest fits. This is literal "forgetting the beginning": ask about something you said an hour ago and it may be gone.
- Summarization: replace old turns with a short recap so the gist survives in far fewer tokens. (Many long-running assistants do something along these lines behind the scenes.)
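The summarization strategy can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular assistant does it: `compact_history` and `toy_summarize` are hypothetical names, and the toy summarizer just keeps the first five words of each turn so the example runs without a model.

```python
from typing import Callable


def compact_history(
    history: list[str],
    keep_recent: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Replace all but the last `keep_recent` turns with a single recap."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    if len(history) <= keep_recent:
        return list(history)  # nothing old enough to fold away
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return ["Summary of earlier conversation: " + summarize(old)] + recent


def toy_summarize(turns: list[str]) -> str:
    """Placeholder recap: the first five words of each turn (not a real summary)."""
    return " / ".join(" ".join(turn.split()[:5]) for turn in turns)
```

In practice the recap would typically come from another model call, which costs tokens of its own. Truncation, the simpler of the two strategies, is sketched next.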
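```python
MAX_CONTEXT = 128_000
RESERVED_FOR_REPLY = 4_000
budget = MAX_CONTEXT - RESERVED_FOR_REPLY

# count_tokens() and model stand in for your tokenizer and model client
# always keep the system prompt; drop oldest turns until we fit
messages = [system_prompt] + history + [user_message]
# stop at two messages so the system prompt and latest message are never dropped
while count_tokens(messages) > budget and len(messages) > 2:
    messages.pop(1)  # remove the oldest turn after the system prompt

reply = model.generate(messages)
```
Why this shapes everything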
Almost every quirk of prompting traces back to the window:
- Long chats drift. Once the start scrolls out, the model loses the framing you set up early — so it helps to restate key facts.
- "Read this document" has a ceiling. A file only helps if it fits; that's why big documents get chunked.
- RAG exists because of this. Retrieval-augmented generation searches a huge knowledge base and pastes only the most relevant snippets into the window, rather than trying to cram everything in.
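The chunking and retrieval ideas in the list above can be sketched together. This is a toy version under loud assumptions: the function names are hypothetical, relevance is scored by simple word overlap (real systems typically rank with embeddings), and the budget is counted in words as a stand-in for tokens.

```python
def chunk_words(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def overlap_score(query: str, chunk: str) -> int:
    """Toy relevance: number of distinct query words that appear in the chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def retrieve(query: str, chunks: list[str], budget_words: int) -> list[str]:
    """Pick the highest-scoring chunks that fit within a word budget."""
    ranked = sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)
    picked, used = [], 0
    for chunk in ranked:
        cost = len(chunk.split())
        # skip chunks with no overlap, and any chunk that would overflow the budget
        if overlap_score(query, chunk) > 0 and used + cost <= budget_words:
            picked.append(chunk)
            used += cost
    return picked
```

Only the chunks returned by `retrieve` would be pasted into the prompt; everything else in the knowledge base stays outside the window.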
So the model can hold a lot in view, but only what fits, and only right now. And notice what it's doing with all that context: predicting plausible next tokens. It never actually checks whether any of it is true. That gap is the subject of our final chapter, coming up next: why it makes things up.