Latency
LLM latency has an unusual shape — first token, output rate, and total time are three different problems
LLM latency doesn't behave like a typical service call. There's the time before the first token appears (TTFT), the rate at which subsequent tokens arrive (output rate, measured in tokens per second or TPS), and the total time the user waits. Each is driven by different factors and calls for different optimizations.
The Three Numbers
- Time to first token (TTFT) — how long until the user sees anything. Dominated by input length and provider-side queueing.
- Output rate — tokens per second once generation starts. Dominated by model size and the provider's hardware.
- End-to-end latency — total wall time. Roughly TTFT plus the output token count divided by the output rate, plus everything around the model call (network, retrieval, tool execution, post-processing).
Optimizing the wrong one is wasted work. Make sure you know which is biting.
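To find out which number is biting, it helps to measure all three from the same stream. The following is a minimal sketch, assuming the provider's response can be consumed as a lazy Python iterable of tokens; `measure_stream`, `LatencyReport`, and `estimate_total_s` are hypothetical names introduced here, not part of any SDK.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class LatencyReport:
    ttft_s: float      # request start -> first token
    output_tps: float  # tokens per second after the first token
    total_s: float     # request start -> last token
    tokens: int


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> LatencyReport:
    """Consume a (lazy) token stream and report the three latency numbers."""
    start = clock()
    first = None
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now  # timestamp of the first token
        count += 1
    end = clock()
    if first is None:  # empty stream: no first token ever arrived
        return LatencyReport(end - start, 0.0, end - start, 0)
    gen_s = end - first
    # The first token is covered by TTFT, so the rate uses the remaining ones.
    tps = (count - 1) / gen_s if gen_s > 0 else 0.0
    return LatencyReport(first - start, tps, end - start, count)


def estimate_total_s(ttft_s: float, output_tokens: int,
                     output_tps: float, overhead_s: float = 0.0) -> float:
    """Back-of-envelope end-to-end latency: TTFT + tokens / rate + overhead."""
    return ttft_s + output_tokens / output_tps + overhead_s
```

With illustrative numbers, a 0.5 s TTFT and 300 output tokens at 50 tokens per second gives `estimate_total_s(0.5, 300, 50.0)` of 6.5 s: generation time, not TTFT, is the dominant term in that case.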
Streaming Changes Perceived Latency
A 4-second response feels long; a 4-second response that starts producing tokens at 200ms feels fast. Streaming is the highest-impact UX change in most LLM features. Implementations:
- Server-Sent Events (SSE) for HTTP.
- WebSockets for bidirectional voice / agentic interactions.
- Chunked rendering — render partial markdown, code, and JSON as it arrives.
Streaming forces you to handle partial state on the client. Plan for it from the start.
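As an illustration of handling partial state, here is a small sketch of a client that parses an SSE stream and yields the accumulated text after every chunk, so the UI can re-render as tokens arrive. The parser is simplified (it handles only `data:` fields and comment lines), and the payload shape (a JSON object with a `delta` key) and the `[DONE]` end marker are assumptions for this example; real providers define their own event formats.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE event from an iterable of lines."""
    buf: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":            # blank line terminates an event
            if buf:
                yield "\n".join(buf)
                buf = []
            continue
        if line.startswith(":"):  # comment line, often used as keep-alive
            continue
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]     # one optional space after the colon
        if field == "data":
            buf.append(value)
    # An unterminated event at end of stream is dropped.


def accumulate_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the full text so far after each chunk (assumed payload shape)."""
    text = ""
    for payload in iter_sse_data(lines):
        if payload == "[DONE]":   # assumed end-of-stream marker
            break
        text += json.loads(payload)["delta"]
        yield text
```

Yielding the whole text so far, not just the delta, keeps the renderer stateless: each update can be parsed and drawn from scratch, which is the simplest way to cope with half-finished markdown or code fences.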
Reducing TTFT
- Shorter inputs. TTFT scales with input length on most providers.
- Prompt caching. On a cache hit the provider can skip reprocessing the cached part of the prompt, which can cut TTFT substantially for long inputs.
- Smaller models for the first hop in routed workflows.
- Geographic proximity. Provider region matters; use the closest available.
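For prompt caching in particular, prompt layout matters. The sketch below assumes the provider's cache matches on an exact prefix of the prompt, which is a common design but should be checked against your provider's documentation. `build_prompt` and `shared_prefix_len` are hypothetical helpers that treat the prompt as a list of segments.

```python
def build_prompt(system: str, docs: list[str], user_query: str) -> list[str]:
    """Order segments from most stable to most volatile.

    Stable content (system prompt, shared documents) goes first so that
    consecutive requests share the longest possible prefix.
    """
    return [system, *docs, user_query]


def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Count leading segments two prompts have in common.

    Under the prefix-matching assumption, this is the part a cache could reuse.
    """
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n
```

Two requests that differ only in the user query share every segment except the last. Putting the query (or a timestamp, or a request ID) first would reduce the shared prefix to zero.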
Reducing Output Time
- Shorter outputs. Concise instructions, structured output, max-tokens caps.
- Faster models. Model choice often dwarfs prompt-level optimizations.
- Speculative / draft decoding. Largely a provider-side optimization; some APIs expose it, but it is often applied without any visible control.
- Parallelism. When generating multiple independent things, parallel calls beat one long generation.
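The parallelism point can be sketched with `asyncio`. Here `fake_generate` is a stand-in for a real async model call, and `generate_all` and `max_concurrency` are names chosen for this example; the semaphore is there so that fanning out does not run straight into provider rate limits.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call; sleeps instead of hitting an API."""
    await asyncio.sleep(0.01)
    return f"summary of {prompt}"


async def generate_all(
    prompts: Sequence[str],
    generate: Callable[[str], Awaitable[str]] = fake_generate,
    max_concurrency: int = 8,
) -> list[str]:
    """Run independent generations concurrently, preserving input order."""
    sem = asyncio.Semaphore(max_concurrency)

    async def one(prompt: str) -> str:
        async with sem:  # cap the number of in-flight requests
            return await generate(prompt)

    return list(await asyncio.gather(*(one(p) for p in prompts)))


if __name__ == "__main__":
    print(asyncio.run(generate_all(["section 1", "section 2", "section 3"])))
```

Wall time for the batch is roughly that of the slowest single call rather than the sum of all calls, as long as the items really are independent.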
Pipeline Latency
Multi-step flows add latency at each hop. Patterns that help:
- Speculative execution — start the next step in parallel with the current one if it's likely needed.
- Result streaming between steps — pipe tokens from step N into step N+1 as they're produced, not after step N completes.
- Eager retrieval — fire off retrieval the moment intent is clear, in parallel with reasoning.
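Eager retrieval is the easiest of these to sketch. In the example below, `retrieve` and `plan` are hypothetical stand-ins for a vector search and a model reasoning step; the only point is that retrieval is started as a task before the reasoning step is awaited, so the two overlap.

```python
import asyncio


async def retrieve(query: str, log: list[str]) -> list[str]:
    """Stand-in for a retrieval call (e.g. a vector search)."""
    log.append("retrieve:start")
    await asyncio.sleep(0.02)
    log.append("retrieve:end")
    return [f"doc for {query}"]


async def plan(query: str, log: list[str]) -> str:
    """Stand-in for a model reasoning step."""
    log.append("plan:start")
    await asyncio.sleep(0.01)
    log.append("plan:end")
    return f"plan for {query}"


async def answer(query: str) -> tuple[str, list[str], list[str]]:
    log: list[str] = []
    # Fire retrieval as soon as the intent is known; do not await it yet.
    retrieval = asyncio.create_task(retrieve(query, log))
    outline = await plan(query, log)  # reasoning runs while retrieval is in flight
    docs = await retrieval            # usually finished, or nearly, by now
    return outline, docs, log


if __name__ == "__main__":
    print(asyncio.run(answer("refund policy")))
```

The trade-off is wasted work: if the reasoning step decides retrieval was not needed, the task should be cancelled and its cost written off.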
Tool Calls Are Latency Multipliers
A tool-using turn waits for: model decides → tool executes → model reads → model responds. Each round-trip is latency on top of latency. Mitigations:
- Parallel tool calls — supported by most modern APIs; use them.
- Cheap, fast tools — slow tools dominate the loop.
- Bound tool depth — agents that loop more than necessary feel sluggish even if each turn is fast.
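Two of these mitigations, parallel tool calls and bounded depth, fit in one small agent loop. This is a sketch under stated assumptions: `model_step` is a hypothetical async callable that returns either a list of `(tool_name, argument)` calls or a final answer, and the transcript format is invented for the example rather than taken from any provider's API.

```python
import asyncio
from typing import Awaitable, Callable, Optional

ToolCall = tuple[str, str]            # (tool name, argument)
Transcript = list[tuple[str, str]]    # (role, content)
ModelStep = Callable[[Transcript], Awaitable[tuple[list[ToolCall], Optional[str]]]]
Tool = Callable[[str], Awaitable[str]]

FALLBACK = "Sorry, I could not finish in time."


async def run_agent(
    model_step: ModelStep,
    tools: dict[str, Tool],
    user_msg: str,
    max_rounds: int = 3,
) -> str:
    transcript: Transcript = [("user", user_msg)]
    for _ in range(max_rounds):
        calls, final = await model_step(transcript)
        if final is not None:
            return final
        # Run every tool call from this round concurrently, not one by one.
        results = await asyncio.gather(*(tools[name](arg) for name, arg in calls))
        transcript.extend(("tool", r) for r in results)
    # Depth budget spent: ask once more for an answer, without running tools.
    transcript.append(("system", "Answer now with what you have."))
    _, final = await model_step(transcript)
    return final if final is not None else FALLBACK
```

With parallel execution, a round costs as much as its slowest tool rather than the sum of all tools, and `max_rounds` puts a hard ceiling on how many model round-trips a single user turn can incur.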
When You're Latency-Bound on Capacity
Provider rate limits and queueing become latency at high traffic. Mitigations:
- Multi-region / multi-provider failover.
- Priority tiers — many providers offer paid tiers with higher priority or reserved capacity; check what latency commitment, if any, comes with them.
- Request shedding — drop or downgrade non-critical traffic before queues build up.
- Self-hosted inference for predictable workloads (see Infrastructure).
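Request shedding can be as simple as an admission check on the number of in-flight requests. The sketch below uses two thresholds, both hypothetical and to be tuned against real queueing behavior: above `soft_limit` only critical traffic is admitted, and above `hard_limit` nothing is.

```python
from dataclasses import dataclass


@dataclass
class Shedder:
    soft_limit: int      # at or above this, shed non-critical traffic
    hard_limit: int      # at or above this, shed everything
    in_flight: int = 0

    def admit(self, critical: bool) -> bool:
        """Return True and count the request if it may proceed."""
        limit = self.hard_limit if critical else self.soft_limit
        if self.in_flight >= limit:
            return False  # caller should drop, queue, or downgrade the request
        self.in_flight += 1
        return True

    def release(self) -> None:
        """Call when an admitted request finishes (success or failure)."""
        self.in_flight = max(0, self.in_flight - 1)
```

A rejected non-critical request does not have to fail outright; routing it to a smaller model or returning a cached answer are both forms of downgrading. This version is not thread-safe, so it suits a single event loop or needs a lock around `admit` and `release`.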
What "Fast Enough" Means
Different products need different latency budgets. Conversational chat tolerates a few seconds. Inline code completion has to feel instant — under 300ms or it's worse than no completion. Voice agents need barge-in handling and sub-second turn latency. Pick your budget early; design backwards from it.
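Designing backwards from a budget is mostly arithmetic: subtract TTFT and fixed overhead from the budget, and whatever is left, multiplied by the output rate, is the number of tokens you can afford to generate. A minimal sketch, with a hypothetical helper name and illustrative numbers:

```python
import math


def max_output_tokens(budget_s: float, ttft_s: float,
                      output_tps: float, overhead_s: float = 0.0) -> int:
    """Largest output length that still fits the latency budget."""
    generation_s = budget_s - ttft_s - overhead_s
    if generation_s <= 0:
        return 0  # budget is already spent before the first token arrives
    return math.floor(generation_s * output_tps)
```

For example, a 3-second budget with 0.5 s TTFT, 0.25 s of overhead, and 40 tokens per second leaves room for `max_output_tokens(3.0, 0.5, 40.0, 0.25)`, which is 90 tokens. A result of 0 means prompt tweaks will not rescue the budget: TTFT alone exceeds it, so a faster model or a shorter input is the only way in.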