Context Management

Context is the model's working memory — and managing it is the main engineering problem

Long context windows make many problems tractable, but they don't make context free. Every token costs money, adds latency, and competes for the model's attention. The real engineering discipline is deciding what to put in the context, when, and for how long.

The Context Budget

Treat the context window as a budget, not a bucket. Typical allocation:

System prompt — stable instructions, tool definitions.
Shared context — retrieved docs, conversation history.
Task-specific input — the current user turn and anything it needs.
Output headroom — leave room for the model's response.

Going over the window is a failure mode. So is using 90% of the window with noise.

Prompt Caching

Every major provider now supports prompt caching: prefixes that are reused across calls are charged at a deep discount and served faster. The structural implication is that your prompts should be stable-prefix, variable-suffix:

[SYSTEM PROMPT]        ← cached
[TOOL DEFINITIONS]     ← cached
[SHARED DOCUMENTS]     ← cached
[CONVERSATION HISTORY] ← partially cached
[CURRENT USER TURN]    ← not cached

Designing with caching in mind can cut costs by 70–90% and latency materially.

Conversation History

As conversations grow, you have options:

Keep everything — simplest, works until you hit the window.
Truncate oldest — drop the earliest turns once you're near the limit.
Summarize — periodically replace old turns with a summary.
Retrieve — index past turns and pull only relevant ones into the current prompt.

Summarization is the right default for long-lived conversations; retrieval matters once the history is too large to even summarize.

Lost in the middle — the model attends less to tokens in the middle of a long context.
Cost — input tokens are cheap per token, expensive in bulk.
Latency — TTFT grows with input size on most providers.
Coherence — the model can lose track of the original instruction buried 100K tokens up.

Short, focused, cacheable context usually beats long, unfocused context.

The Context Budget

Prompt Caching

Conversation History

Retrieval as Context Management

Compaction and Middle-Out

What Breaks When Context Gets Long

On this page