Steven's Knowledge

Streaming & Loading States

Make slow feel fast — the single highest-impact UX move for LLM features

LLM responses are slow by web-app standards. A 5-second response served as a single blob feels broken; the same 5-second response streamed token-by-token feels fluid. Streaming and the design of loading states are where most LLM features win or lose perceived quality.

Why Streaming Wins

A few hundred milliseconds is the threshold where users start noticing latency. LLMs blow past that on almost any non-trivial output. Streaming sidesteps the problem by surfacing partial results immediately:

  • Time to first token (TTFT) becomes the perceived latency, not total response time.
  • The user starts reading while the model is still generating, eliminating most of the wait.
  • Trust improves. A lone blinking cursor or a blank panel for 5 seconds reads as "broken." A response actively appearing reads as "working."
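To make the TTFT point concrete, here is a minimal sketch that measures both numbers for any chunk iterator. The function name `measure_latency` and the injectable `clock` parameter are assumptions made for illustration; they do not come from any provider SDK.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def measure_latency(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Optional[float]]:
    """Consume a stream and report time to first token and total time."""
    start = clock()
    ttft: Optional[float] = None
    count = 0
    for _ in chunks:
        if ttft is None:
            # The first chunk marks the moment the user sees something.
            ttft = clock() - start
        count += 1
    return {"ttft": ttft, "total": clock() - start, "chunks": count}
```

With a stream whose first chunk arrives after 0.4 seconds and whose last arrives after about 5 seconds, `ttft` is 0.4 while `total` is close to 5: the user waits for the former and reads through the latter.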

The Mechanical Layer

Most major provider APIs support streaming via Server-Sent Events (SSE) or a similar mechanism. Each token (or small group of tokens) arrives as soon as the model produces it, and the client appends it to the rendered output. Implementation specifics vary by framework, but the shape is universal: open a stream, consume chunks, render incrementally.
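The "open a stream, consume chunks, render incrementally" shape can be sketched without any network code. The example below parses SSE-formatted lines and yields the growing output after each chunk. It assumes each event's `data` field carries plain text and that the stream ends with a `[DONE]` sentinel; both are assumptions for illustration, since real providers usually wrap text in JSON and differ on how they signal completion.

```python
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE event.

    An event is a run of "field: value" lines terminated by a blank line;
    multiple data lines within one event are joined with newlines.
    """
    data_parts: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # Blank line: dispatch the event collected so far, if any.
            if data_parts:
                yield "\n".join(data_parts)
                data_parts = []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # the format strips one leading space
        if field == "data":
            data_parts.append(value)


def render_incrementally(
    lines: Iterable[str], done_marker: str = "[DONE]"
) -> Iterator[str]:
    """Yield the full rendered text after each chunk arrives."""
    rendered = ""
    for payload in iter_sse_data(lines):
        if payload == done_marker:  # assumed end-of-stream sentinel
            break
        rendered += payload
        yield rendered
```

In a real client, `lines` would come from the HTTP response body, and each yielded string would replace the contents of the message element.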

For agents and multi-step flows, streaming gets richer:

  • Tool call announcements ("calling search...").
  • Intermediate reasoning ("checking the data...").
  • Status events independent of the final output.

Designing the protocol for these events is part of the UX, not just the backend.
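One way to treat the protocol as a designed artifact is to give every event an explicit type and a single wire format. The sketch below uses JSON objects with a `type` discriminator. The event names (`text_delta`, `tool_call_started`, `status`) are hypothetical and not taken from any provider's API.

```python
import json
from dataclasses import asdict, dataclass
from typing import Union


@dataclass
class TextDelta:
    text: str
    type: str = "text_delta"


@dataclass
class ToolCallStarted:
    tool: str
    type: str = "tool_call_started"


@dataclass
class Status:
    message: str
    type: str = "status"


Event = Union[TextDelta, ToolCallStarted, Status]

# Hypothetical registry mapping the wire-level type tag to its class.
_EVENT_TYPES = {
    "text_delta": TextDelta,
    "tool_call_started": ToolCallStarted,
    "status": Status,
}


def encode(event: Event) -> str:
    """Serialize an event to one JSON line."""
    return json.dumps(asdict(event))


def decode(line: str) -> Event:
    """Parse one JSON line back into a typed event."""
    payload = json.loads(line)
    cls = _EVENT_TYPES.get(payload.get("type"))
    if cls is None:
        raise ValueError(f"unknown event type: {payload.get('type')!r}")
    return cls(**payload)
```

Keeping the discriminator explicit lets the UI route text deltas to the message body and everything else to status indicators, and lets old clients reject or skip event types they do not know.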

What to Stream

Not everything benefits from streaming in the same way:

  • Conversational text. Always stream. The single biggest win.
  • Code blocks. Stream, but render in a monospace block as it grows. Syntax highlighting can come at the end or progressively.
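  • Markdown. Stream and re-render. Many markdown renderers tolerate partial input, though unclosed constructs (an open code fence or emphasis marker) can flicker until their closing delimiter arrives.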
  • Structured output / JSON. Tricky. Partial JSON isn't valid; you need either parsers that handle partials or a "structured" rendering mode that builds up known fields as they appear.
  • Tool calls. Show that a tool is being called as soon as the decision is made; don't wait for the result.

For structured output specifically, "streaming JSON" usually means parsing partial JSON with a tolerant parser and rendering the fields that have arrived so far.
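A tolerant parser can be surprisingly small. The sketch below closes any open string and brackets, then drops trailing characters until the result parses. The function names are made up for this example, and the approach is a simplification: it re-parses from scratch on every call, and a number cut off mid-stream (`12` of `123`) will briefly render as its truncated value.

```python
import json
from typing import Any, List, Optional


def _close(prefix: str) -> str:
    """Append the closers a JSON prefix needs: a quote, then brackets."""
    stack: List[str] = []
    in_string = False
    escaped = False
    for ch in prefix:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack:
            stack.pop()
    closers = '"' if in_string else ""
    return prefix + closers + "".join(reversed(stack))


def parse_partial_json(prefix: str) -> Optional[Any]:
    """Best-effort parse of a JSON prefix.

    Trims trailing characters until the closed-off prefix is valid JSON,
    so half-written keys and dangling commas are dropped.
    """
    for end in range(len(prefix), 0, -1):
        try:
            return json.loads(_close(prefix[:end]))
        except json.JSONDecodeError:
            continue
    return None
```

Calling `parse_partial_json` on each accumulated buffer gives the UI a dictionary containing only the fields that have arrived, which can be rendered directly into a structured view.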

Loading States Before Streaming Starts

Even with streaming, there's still a gap before the first token. Design for it:

  • Immediate acknowledgment. Disable the input and show a "thinking" indicator the moment the user submits. No 200ms gap where the UI looks unchanged.
  • Skeleton states. For complex outputs (cards, structured results), show the shape before the content.
  • Progressive disclosure. "Searching docs..." → "Drafting answer..." → "Polishing..." gives users a sense of progress for long operations.
  • Personality-appropriate copy. "Thinking..." is fine. "Reticulating splines..." gets old fast unless your product can carry it.

Loading States Inside Streams

For long agentic flows, a stream of tokens isn't enough. Users want to know:

  • What step are we on?
  • What tool is being called?
  • How long has this been running?
  • Can I cancel?

Surface that with a structured event stream — separate from the output — that updates a progress UI alongside the streamed response.
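One way to drive that progress UI is a small reducer that folds status events into a single state object answering the four questions above. The event shapes here (`run_started`, `step_started`, and so on) are hypothetical; a real protocol would define its own.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Progress:
    step: int = 0                       # what step are we on?
    total_steps: Optional[int] = None
    active_tool: Optional[str] = None   # what tool is being called?
    started_at: Optional[float] = None  # basis for "how long has this run?"
    cancelable: bool = False            # can I cancel?

    def elapsed(self, now: float) -> float:
        return 0.0 if self.started_at is None else now - self.started_at


def reduce_progress(state: Progress, event: dict) -> Progress:
    """Return the next progress state for one status event."""
    kind = event.get("type")
    if kind == "run_started":
        return replace(
            state,
            started_at=event["at"],
            total_steps=event.get("total_steps"),
            cancelable=True,
        )
    if kind == "step_started":
        return replace(state, step=state.step + 1, active_tool=None)
    if kind == "tool_call_started":
        return replace(state, active_tool=event["tool"])
    if kind == "tool_call_finished":
        return replace(state, active_tool=None)
    if kind == "run_finished":
        return replace(state, active_tool=None, cancelable=False)
    return state  # output tokens and unknown events leave progress unchanged
```

Because the reducer ignores anything it does not recognize, the same event stream can carry output text and status updates without the progress UI needing to know about the former.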

Cancellation

Long-running responses must be cancelable. Patterns:

  • Stop button. Always visible during generation. Single-click stop.
  • Server-side cancellation. Propagate the cancel to the provider request so that generation, and the cost of further output tokens, stops when the user cancels; many providers support cancel signals.
  • Preserve partial output. If the user cancels at token 100, keep the first 100. Don't blank the screen.

A feature without cancellation feels uncontrollable. Users disengage.
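The stop-button and preserve-partial-output patterns can be sketched with `asyncio` task cancellation. Everything here is illustrative: `fake_model_stream` stands in for a provider stream, and the "click" is simulated by a callback that fires after the third token.

```python
import asyncio
from typing import AsyncGenerator, Callable, List, Optional


async def fake_model_stream(
    tokens: List[str], delay: float
) -> AsyncGenerator[str, None]:
    """Hypothetical stand-in for a provider stream."""
    for token in tokens:
        await asyncio.sleep(delay)
        yield token


async def generate(
    stream: AsyncGenerator[str, None],
    transcript: List[str],
    on_token: Optional[Callable[[str], None]] = None,
) -> None:
    """Append tokens to the transcript until done or cancelled."""
    try:
        async for token in stream:
            transcript.append(token)
            if on_token is not None:
                on_token(token)
    finally:
        # Runs on normal completion and on cancellation: release the
        # upstream stream so the provider side can stop work too.
        await stream.aclose()


async def demo() -> List[str]:
    transcript: List[str] = []
    stop_clicked = asyncio.Event()

    def on_token(_: str) -> None:
        if len(transcript) == 3:
            stop_clicked.set()  # simulates the user pressing Stop

    task = asyncio.create_task(
        generate(fake_model_stream(list("abcdef"), 0.05), transcript, on_token)
    )
    await stop_clicked.wait()
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass  # expected: the user asked for it
    return transcript  # partial output survives the cancel
```

The transcript lives outside the task, so cancelling the task never discards what was already received; the UI keeps showing it and can mark the message as stopped.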

Error During Streaming

Network blips and provider errors mid-stream are common:

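  • Detect early. Use a watchdog timer: if no token arrives for N seconds, surface a "stalled" state instead of waiting indefinitely.
  • Resume where possible. If the model and prompt support it, retry from where the stream stopped, for example by resending the request with the partial output supplied as a prefix to continue from; otherwise restart the response.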
  • Graceful failure. Make it clear what happened, preserve what was already generated, offer to retry.

A stream that just stops silently is the worst possible failure mode.
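A watchdog can be written as a wrapper that bounds the wait for each chunk. The sketch below uses `asyncio.wait_for` around every `__anext__()` call. The `StreamStalled` exception and the `stalling_stream` generator are invented for the example, and the timeout value is arbitrary.

```python
import asyncio
from typing import AsyncIterator, List, Tuple


class StreamStalled(Exception):
    """Raised when no chunk arrives within the watchdog window."""


async def with_watchdog(
    stream: AsyncIterator[str], timeout: float
) -> AsyncIterator[str]:
    """Re-yield chunks, failing loudly if the gap between them is too long."""
    iterator = stream.__aiter__()
    while True:
        try:
            chunk = await asyncio.wait_for(iterator.__anext__(), timeout)
        except StopAsyncIteration:
            return  # the stream ended normally
        except asyncio.TimeoutError:
            raise StreamStalled(f"no token for {timeout}s") from None
        yield chunk


async def stalling_stream() -> AsyncIterator[str]:
    """Hypothetical stream that hangs after two chunks."""
    yield "Hello"
    yield ", wor"
    await asyncio.sleep(3600)  # simulated network stall
    yield "ld"


async def demo() -> Tuple[str, str]:
    partial: List[str] = []
    try:
        async for chunk in with_watchdog(stalling_stream(), timeout=0.05):
            partial.append(chunk)
    except StreamStalled:
        # Keep what arrived and let the UI offer a retry.
        return "".join(partial), "stalled"
    return "".join(partial), "completed"
```

The caller ends up with both the preserved partial text and an explicit outcome, which is what the UI needs to show a clear failure state rather than a stream that silently stops.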

Long Outputs and Scrolling

A streaming response can outrun the viewport:

  • Auto-scroll to follow until the user manually scrolls up.
  • Detach when the user scrolls. Don't fight the user; let them read past output.
  • Reattach button. "Jump to bottom" appears once the user is scrolled away.

Slack-style scroll behavior is what users expect.
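The follow, detach, and reattach rules reduce to a small state machine that is independent of any UI toolkit. The sketch below models it in plain Python. The class name, the method names, and the 24-pixel "close enough to the bottom" tolerance are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScrollFollower:
    threshold_px: int = 24   # assumed tolerance for "at the bottom"
    following: bool = True   # start attached to the newest content

    def on_user_scroll(
        self, scroll_top: int, viewport_height: int, content_height: int
    ) -> None:
        """Detach when the user scrolls away; reattach if they return."""
        distance_from_bottom = content_height - (scroll_top + viewport_height)
        self.following = distance_from_bottom <= self.threshold_px

    def on_content_appended(
        self, viewport_height: int, content_height: int
    ) -> Optional[int]:
        """Return the scroll position to apply, or None to leave it alone."""
        if not self.following:
            return None  # don't fight the user
        return max(0, content_height - viewport_height)

    def jump_to_bottom(self, viewport_height: int, content_height: int) -> int:
        """Handler for the 'Jump to bottom' button."""
        self.following = True
        return max(0, content_height - viewport_height)

    @property
    def show_jump_button(self) -> bool:
        return not self.following
```

In a browser the same three inputs come from the container's scroll position, visible height, and total content height, and the returned value is what gets assigned as the new scroll position.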

Sound and Motion

Subtle motion during generation reads as alive and working. Examples:

  • A pulsing cursor at the generation point.
  • A gentle shimmer on the skeleton state.
  • A spinning indicator next to a streaming tool call.

Sounds, on the other hand, get old fast and end up muted in shared environments. Stick to motion.

A Quality Bar

The bar to clear: a user submits, sees something change immediately, sees content begin to flow within a second, can stop at any time, knows what's happening if it's slow, gets a graceful state if it fails. Every LLM feature that meets that bar feels good. Most that don't, don't.
