Tokenization & Context Windows
The unit of work for an LLM is a token, not a word — and that has consequences everywhere
LLMs don't see characters or words; they see tokens. Tokens are the unit of cost, the unit of throughput, and the boundary that defines what the model can hold in mind at once. Almost every operational concern with LLMs — pricing, latency, context limits, retrieval design — comes back to tokens.
What a Token Is
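A token is a sub-word unit produced by a learned tokenizer, typically trained with an algorithm such as byte-pair encoding (BPE) and shipped in libraries such as SentencePiece or tiktoken. For English, one token averages about four characters, or 0.75 words. For code and most non-English text, especially languages that are underrepresented in the tokenizer's training data, the same content takes more tokens, sometimes many more, which has real cost implications.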
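The rule of thumb above can be turned into a quick estimator. This is a minimal sketch that assumes the English averages quoted in the text (about four characters or 0.75 words per token); the function names are illustrative, and a real tokenizer for your model will give different counts, particularly for code and non-English text.

```python
import math

# Rule-of-thumb ratios for English text; real tokenizers vary by model and language.
CHARS_PER_TOKEN = 4.0
WORDS_PER_TOKEN = 0.75


def estimate_tokens_from_chars(text: str) -> int:
    """Estimate token count from character length (about 4 characters per token)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(text: str) -> int:
    """Estimate token count from whitespace-separated words (about 0.75 words per token)."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)
```

An estimate like this is good enough for rough budgeting; for billing or hard limits, count with the tokenizer that matches the model you are calling.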
Why Tokenization Matters
- Pricing — APIs charge per input and output token. A poorly tokenized language can cost 2–3x more for the same content.
- Latency — generation is sequential; output token count drives end-to-end latency more than anything else.
- Edge cases — many famous LLM "failures" (counting letters in a word, arithmetic on long numbers) are tokenization artifacts.
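To make the pricing point concrete, here is a small sketch of per-request cost. The prices are hypothetical placeholders quoted per million tokens, not any provider's actual rates, and the 2.5x multiplier simply stands in for a language that tokenizes less efficiently.

```python
def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Cost of one call, given prices in USD per million tokens."""
    total = input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    return total / 1_000_000


# Hypothetical prices: 3 USD per million input tokens, 15 USD per million output tokens.
INPUT_PRICE = 3.0
OUTPUT_PRICE = 15.0

# The same content in English versus a language needing 2.5x as many tokens.
english_cost = request_cost_usd(2_000, 500, INPUT_PRICE, OUTPUT_PRICE)
other_cost = request_cost_usd(5_000, 1_250, INPUT_PRICE, OUTPUT_PRICE)
print(f"English: ${english_cost:.4f}, other language: ${other_cost:.4f}")
```

Because cost is linear in token count, a 2.5x token multiplier becomes a 2.5x bill for identical content.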
Context Windows
The context window is the maximum number of tokens the model can attend to at once, counting both the prompt and the output generated so far. Modern frontier models advertise windows of 128K, 200K, or 1M tokens and beyond. But useful context is not the same as advertised context:
- Lost in the middle — many models attend more strongly to the start and end of the context than the middle.
- Effective vs nominal — a 1M-token window doesn't mean the model can reason equally well across all of it.
- Cost scales with input — a million-token prompt is a million input tokens, every call.
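Two quick checks follow from the points above: whether a request fits in the window at all, and what a large prompt costs when it is resent on every call. This sketch assumes the window is shared between the prompt and the reserved output, and it uses a hypothetical input price; both function names are illustrative.

```python
def fits_in_window(input_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """True if the prompt plus the reserved output budget fits in the context window."""
    return input_tokens + max_output_tokens <= context_window


def cumulative_input_cost_usd(prompt_tokens: int, calls: int, input_price_per_mtok: float) -> float:
    """Input cost of sending the same prompt on every call, with no caching."""
    return prompt_tokens * calls * input_price_per_mtok / 1_000_000


# A 190K-token prompt with 8K reserved for output fits in a 200K window.
print(fits_in_window(190_000, 8_000, 200_000))  # True

# A 1M-token prompt sent 100 times at a hypothetical 3 USD per million input tokens.
print(cumulative_input_cost_usd(1_000_000, 100, 3.0))  # 300.0
```

Fitting in the window only says the request is allowed; it says nothing about how well the model uses the middle of that window.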
Practical Context Engineering
- Put the question last. Models tend to weight the end of the prompt more.
- Use structure — XML tags, markdown headers, numbered sections. Models follow structured prompts more reliably.
- Cache aggressively. Prompt caching (Anthropic, OpenAI, Gemini) makes long shared prefixes much cheaper and faster on repeat calls, although cached tokens are typically still billed at a reduced rate rather than free.
- Trim retrieved context — the goal of RAG isn't to dump everything in; it's to put the right small amount in.
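The practices above can be combined in one prompt builder. This is a sketch under stated assumptions: the tag names (`instructions`, `documents`, `question`) are arbitrary choices rather than a required schema, token counts use the rough four-characters-per-token estimate, and `ranked_chunks` is assumed to arrive already sorted by relevance from a retriever that is not shown.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token for English text."""
    return math.ceil(len(text) / 4)


def select_chunks(ranked_chunks: list[str], token_budget: int) -> list[str]:
    """Keep the highest-ranked chunks, in order, until the next one would exceed the budget."""
    selected: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if used + cost > token_budget:
            break
        selected.append(chunk)
        used += cost
    return selected


def build_prompt(instructions: str, ranked_chunks: list[str], question: str, token_budget: int) -> str:
    """Stable instructions first, trimmed documents in the middle, the question last."""
    chunks = select_chunks(ranked_chunks, token_budget)
    docs = "\n".join(
        f'<document index="{i}">\n{chunk}\n</document>'
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        f"<instructions>\n{instructions}\n</instructions>\n"
        f"<documents>\n{docs}\n</documents>\n"
        f"<question>\n{question}\n</question>"
    )
```

Keeping the instructions at the front means the part of the prompt that never changes forms a shared prefix, which is the part prompt caching can reuse, while the question stays at the end where models tend to weight it most.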
Output Tokens Are Different
Input tokens are read in parallel. Output tokens are generated one at a time. That asymmetry is why long outputs are slow and expensive, and why streaming matters for UX. When you can, push work into structured output and tool calls rather than long prose responses.
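A back-of-the-envelope latency model shows why output length dominates. This is a simplified sketch: it assumes a fixed time to first token (covering prompt processing) and a constant decode speed, and the numbers used are hypothetical rather than measurements of any model.

```python
def estimate_latency_s(
    output_tokens: int,
    time_to_first_token_s: float,
    decode_tokens_per_s: float,
) -> float:
    """Approximate end-to-end latency: prompt processing plus sequential decoding."""
    return time_to_first_token_s + output_tokens / decode_tokens_per_s


# Hypothetical figures: 0.5 s to first token, 50 output tokens per second.
print(estimate_latency_s(100, 0.5, 50.0))    # 2.5
print(estimate_latency_s(1_000, 0.5, 50.0))  # 20.5
```

With streaming, the user starts reading after roughly the time to first token instead of waiting for the full response, which is why it improves perceived latency even though total generation time is unchanged.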