Steven's Knowledge

Context Management

Context is the model's working memory — and managing it is the main engineering problem

Long context windows make many problems tractable, but they don't make context free. Every token costs money, adds latency, and competes for the model's attention. The real engineering discipline is deciding what to put in the context, when, and for how long.

The Context Budget

Treat the context window as a budget, not a bucket. Typical allocation:

  • System prompt — stable instructions, tool definitions.
  • Shared context — retrieved docs, conversation history.
  • Task-specific input — the current user turn and anything it needs.
  • Output headroom — leave room for the model's response.

Going over the window is a failure mode. So is filling 90% of the window with noise.
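As a rough illustration of treating the window as a budget, the sketch below gives each category a token allowance and reports which parts of a request overspend. The class and function names, the numbers, and the four-characters-per-token estimate are all assumptions made for the example; a real system would count tokens with the model's own tokenizer.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ContextBudget:
    """Token allocation for one request. All numbers are illustrative."""
    window: int           # total context window, in tokens
    system: int           # stable instructions and tool definitions
    shared: int           # retrieved docs and conversation history
    task: int             # current user turn and anything it needs
    output_headroom: int  # reserved for the model's response

    def total(self) -> int:
        return self.system + self.shared + self.task + self.output_headroom

    def fits(self) -> bool:
        # The allocations themselves must not exceed the window.
        return self.total() <= self.window


def estimate_tokens(text: str) -> int:
    # Crude placeholder (assumption): about 4 characters per token.
    return max(1, len(text) // 4)


def over_budget(budget: ContextBudget, system: str, shared: str, task: str) -> Dict[str, int]:
    """Return how many tokens each part exceeds its allocation by."""
    used = {
        "system": estimate_tokens(system),
        "shared": estimate_tokens(shared),
        "task": estimate_tokens(task),
    }
    limits = {"system": budget.system, "shared": budget.shared, "task": budget.task}
    return {name: used[name] - limits[name] for name in used if used[name] > limits[name]}


budget = ContextBudget(window=8000, system=1000, shared=4000, task=1000, output_headroom=2000)
overspend = over_budget(budget, "a" * 400, "b" * 20000, "c" * 40)
print(budget.fits(), overspend)
```

Checking each category separately, rather than only the total, makes it obvious which part of the prompt to trim when a request does not fit.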

Prompt Caching

Most major providers now support prompt caching: a prefix that is reused unchanged across calls is billed at a discount and served faster. The structural implication is that prompts should have a stable prefix and a variable suffix:

[SYSTEM PROMPT]        ← cached
[TOOL DEFINITIONS]     ← cached
[SHARED DOCUMENTS]     ← cached
[CONVERSATION HISTORY] ← partially cached
[CURRENT USER TURN]    ← not cached

Designing with caching in mind can cut input-token costs by 70–90% when most of each prompt is a reused prefix, and it noticeably reduces latency.
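The layout above can be sketched without committing to any particular provider's API. In the example below, `build_prompt`, `prefix_fingerprint`, and `cached_cost_fraction` are hypothetical helpers, and the discount and prefix share passed to the cost function are assumed numbers, not any provider's published pricing. The fingerprint is only a debugging aid for confirming that the prefix really is byte-for-byte identical between calls.

```python
import hashlib
from typing import List, Tuple


def build_prompt(
    system_prompt: str,
    tool_definitions: List[str],
    shared_documents: List[str],
    history: List[str],
    user_turn: str,
) -> Tuple[str, str]:
    """Return (stable_prefix, variable_suffix)."""
    # Sort tools and documents so the same set always serializes identically.
    prefix_parts = [system_prompt, *sorted(tool_definitions), *sorted(shared_documents)]
    stable_prefix = "\n\n".join(prefix_parts)
    # Everything that changes per call goes after the prefix.
    variable_suffix = "\n\n".join([*history, user_turn])
    return stable_prefix, variable_suffix


def prefix_fingerprint(prefix: str) -> str:
    """Short hash for logging whether the prefix changed between calls."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]


def cached_cost_fraction(prefix_share: float, cached_price_ratio: float) -> float:
    """Input cost relative to an uncached call, under the assumed ratios.

    prefix_share: fraction of input tokens served from the cache.
    cached_price_ratio: cached-token price divided by normal price.
    Ignores any surcharge for writing to the cache.
    """
    return (1.0 - prefix_share) + prefix_share * cached_price_ratio


first, _ = build_prompt("You are helpful.", ["tool_b", "tool_a"], ["doc1"], [], "hi")
second, _ = build_prompt("You are helpful.", ["tool_a", "tool_b"], ["doc1"], ["hi", "hello"], "next")
print(prefix_fingerprint(first) == prefix_fingerprint(second))
print(round(cached_cost_fraction(prefix_share=0.9, cached_price_ratio=0.1), 2))
```

For simplicity the sketch places all conversation history in the suffix. In practice the older part of the history is also a stable prefix from one turn to the next, which is why the diagram marks it as partially cached. Anything that varies per call, such as a timestamp in the system prompt, changes the fingerprint and defeats the cache.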

Conversation History

As conversations grow, you have options:

  • Keep everything — simplest, works until you hit the window.
  • Truncate oldest — drop the earliest turns once you're near the limit.
  • Summarize — periodically replace old turns with a summary.
  • Retrieve — index past turns and pull only relevant ones into the current prompt.

Summarization is the right default for long-lived conversations; retrieval matters once the history is too large to even summarize.
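The truncate and summarize options can be sketched as two small functions. This is a minimal illustration: turns are plain strings, `estimate_tokens` is a crude stand-in for a real tokenizer, and `summarize` is a caller-supplied function (in practice probably a model call) that is assumed here, not part of any library.

```python
from typing import Callable, List


def estimate_tokens(text: str) -> int:
    # Crude placeholder (assumption): about 4 characters per token.
    return max(1, len(text) // 4)


def truncate_oldest(turns: List[str], max_tokens: int) -> List[str]:
    """Keep the most recent turns that fit in max_tokens, dropping the earliest."""
    kept: List[str] = []
    total = 0
    for turn in reversed(turns):  # walk from newest to oldest
        cost = estimate_tokens(turn)
        if total + cost > max_tokens:
            break
        kept.append(turn)
        total += cost
    kept.reverse()  # restore chronological order
    return kept


def summarize_old_turns(
    turns: List[str],
    keep_recent: int,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Replace everything except the last keep_recent turns with one summary turn."""
    if len(turns) <= keep_recent:
        return list(turns)
    split = len(turns) - keep_recent
    old, recent = turns[:split], turns[split:]
    return ["[Summary of earlier conversation] " + summarize(old), *recent]


turns = ["a" * 40, "b" * 40, "c" * 40, "d" * 40]  # about 10 tokens each
print(len(truncate_oldest(turns, max_tokens=25)))
print(summarize_old_turns(turns, 2, lambda old: f"{len(old)} earlier turns")[0])
```

Truncation loses early information silently, while summarization keeps a compressed trace of it at the cost of an extra model call, which is one reason to prefer it for long-lived conversations.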

Retrieval as Context Management

Retrieval isn't just for RAG in the product sense — it's a general tool for keeping context focused. Pull in the few specific things relevant to the current turn instead of trying to keep everything in context "just in case."
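To make the idea concrete, here is a toy retriever that selects only the stored snippets sharing words with the current turn. Word overlap is a deliberately simple stand-in chosen for this example; a production system would more likely use embeddings or a search index. The function names are hypothetical.

```python
import re
from typing import List, Set


def words(text: str) -> Set[str]:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, candidates: List[str], top_k: int = 3) -> List[str]:
    """Return up to top_k candidates ranked by word overlap with the query."""
    query_words = words(query)
    scored = []
    for index, text in enumerate(candidates):
        overlap = len(query_words & words(text))
        if overlap > 0:  # skip snippets with nothing in common
            scored.append((overlap, index, text))
    # Highest overlap first; ties keep the earlier candidate first.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [text for _, _, text in scored[:top_k]]


memory = [
    "The deploy script lives in tools/deploy.sh",
    "Lunch is at noon",
    "Deploy failures usually come from a stale cache",
]
for snippet in retrieve("Why did the deploy fail with a stale cache?", memory, top_k=2):
    print(snippet)
```

Only the snippets that are returned go into the prompt; the rest stay in storage until a later turn needs them.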

Compaction and Middle-Out

Some frameworks support compaction strategies: they detect when the context is running out of room and compress the middle of it into a summary, keeping the start (the original instructions) and the most recent turns intact. This is useful for long agentic runs where you can't know in advance how long the interaction will be.
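A minimal sketch of middle-out compaction, assuming messages are plain strings and that `summarize` is a caller-supplied function (hypothetical here, likely a model call in practice). It does not reproduce any specific framework's behavior; it only shows the shape of the technique: leave the head and tail alone and collapse what lies between them once the context exceeds a threshold.

```python
from typing import Callable, List


def estimate_tokens(text: str) -> int:
    # Crude placeholder (assumption): about 4 characters per token.
    return max(1, len(text) // 4)


def compact_middle(
    messages: List[str],
    max_tokens: int,
    keep_head: int,
    keep_tail: int,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Collapse the middle of the context into one summary message when over budget."""
    total = sum(estimate_tokens(message) for message in messages)
    if total <= max_tokens or len(messages) <= keep_head + keep_tail:
        return list(messages)  # nothing to compact
    tail_start = len(messages) - keep_tail
    head = messages[:keep_head]            # original instructions
    middle = messages[keep_head:tail_start]
    tail = messages[tail_start:]           # most recent turns
    return [*head, "[Compacted] " + summarize(middle), *tail]


messages = [f"message {i} " + "x" * 30 for i in range(10)]  # about 10 tokens each
compacted = compact_middle(
    messages, max_tokens=50, keep_head=1, keep_tail=2,
    summarize=lambda middle: f"{len(middle)} messages",
)
print(compacted[1])
```

In a long agentic run this check would be called before each model request, so compaction happens only when the threshold is actually crossed.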

What Breaks When Context Gets Long

  • Lost in the middle — the model attends less to tokens in the middle of a long context.
  • Cost — input tokens are cheap per token, expensive in bulk.
  • Latency — TTFT grows with input size on most providers.
  • Coherence — the model can lose track of the original instruction buried 100K tokens up.

Short, focused, cacheable context usually beats long, unfocused context.
