Steven's Knowledge

Voice Agents

Real-time voice changes the rules — latency, turn-taking, and barge-in are first-class concerns

Voice interfaces have moved from "command-and-control" to actual conversation. With end-to-end speech models and sub-second turn latency, voice agents are now usable in the wild — and the engineering pattern is meaningfully different from text chat.

The Two Architectures

Cascaded — three separate models in sequence:

  1. Speech-to-text (STT) — transcribe the user.
  2. LLM — produce a text response.
  3. Text-to-speech (TTS) — speak it back.

Easy to debug, easy to swap parts, predictable failure modes. The dominant pattern until recently.
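As a rough illustration of why the cascaded layout is easy to debug, the sketch below runs one turn through three stages and keeps every intermediate result. The stage signatures, the TurnTrace type, and the stub models are assumptions made for this example; real STT, LLM, and TTS clients expose richer, streaming interfaces.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TurnTrace:
    """Everything produced during one turn, kept for inspection."""
    transcript: str
    reply_text: str
    reply_audio: bytes


def run_turn(
    audio_in: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> TurnTrace:
    """Run one cascaded turn: audio -> transcript -> reply text -> audio."""
    transcript = stt(audio_in)
    reply_text = llm(transcript)
    reply_audio = tts(reply_text)
    return TurnTrace(transcript, reply_text, reply_audio)


# Stub stages standing in for real models so the sketch runs offline.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


trace = run_turn(b"hello", fake_stt, fake_llm, fake_tts)
print(trace)
```

Because each stage is a plain function argument, swapping one part means passing a different callable, and a bad turn can be diagnosed by checking which field of the trace first looks wrong.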

End-to-end speech — one model takes audio in and produces audio out (e.g., GPT-4o realtime, Gemini Live). Lower latency, more natural prosody and turn-taking, harder to introspect or fine-tune.

End-to-end is winning at the frontier; cascaded is still the right default when you need control over each stage, when you're integrating with existing TTS providers, or when your STT/TTS quality bar matters more than turn latency.

Latency Budget

Voice has a much tighter latency budget than chat. Targets:

  • Total turn latency — under 800ms feels conversational; under 500ms feels human.
  • First audio out — should start as soon as the model has enough to begin speaking.

Where the time goes in a cascaded pipeline:

  • Endpointing (detecting end of user speech): 200–400ms
  • STT transcription: 100–500ms
  • LLM time-to-first-token: 200–600ms
  • TTS time-to-first-audio: 100–300ms

You can't afford to run these sequentially; they have to overlap.
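A quick check of the arithmetic, using only the ranges listed above. The dictionary keys are labels chosen for this sketch.

```python
# Per-stage latency ranges in milliseconds, copied from the list above.
STAGES_MS = {
    "endpointing": (200, 400),
    "stt": (100, 500),
    "llm_first_token": (200, 600),
    "tts_first_audio": (100, 300),
}

TARGET_MS = 800  # the "feels conversational" threshold from the targets above

# If every stage waits for the previous one to finish, the delays simply add up.
best_case = sum(low for low, _ in STAGES_MS.values())
worst_case = sum(high for _, high in STAGES_MS.values())

print(f"sequential best case:  {best_case} ms")   # 600 ms
print(f"sequential worst case: {worst_case} ms")  # 1800 ms
print(f"worst case exceeds the target by {worst_case - TARGET_MS} ms")  # 1000 ms
```

Even the best case leaves only 200ms of headroom under the 800ms target, and it already misses the 500ms one, which is why the stages need to overlap rather than queue.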

Streaming Everything

Every stage needs to stream:

  • Streaming STT: produce partial transcripts as the user speaks.
  • Streaming LLM: start generating as soon as a usable transcript exists. On predictable inputs, start speculatively from the partial transcript before the user finishes speaking.
  • Streaming TTS: start producing audio as soon as the first sentence-like chunk is ready.

Buffering at any stage adds latency you can't get back.
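One way to connect a streaming LLM to a streaming TTS is to cut the token stream into sentence-like chunks and hand each one over as soon as it is complete. The sketch below is a minimal version of that idea; the punctuation-based splitting rule and the sample tokens are assumptions, and a production chunker would also need to handle abbreviations, numbers, and languages with different punctuation.

```python
import re
from typing import Iterable, Iterator

# Split after sentence-ending punctuation that is followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends


# Simulated LLM output arriving token by token.
llm_tokens = ["Sure", ". ", "Your ", "order ", "ships ", "today", ". ", "Anything ", "else", "?"]
chunks = list(sentence_chunks(llm_tokens))
print(chunks)  # ['Sure.', 'Your order ships today.', 'Anything else?']
```

Because the function is a generator, the first chunk can be sent to TTS while the LLM is still producing the rest of the reply.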

Endpointing and VAD

When does the user stop talking? Voice activity detection (VAD) decides. Get it wrong and you either cut the user off or wait awkwardly. Modern systems use:

  • Energy-based VAD — fast, dumb, fine in quiet environments.
  • Neural VAD (Silero, WebRTC's RNN-VAD) — better with noise and disfluencies.
  • Semantic endpointing — actually understand whether the user has finished a thought, not just paused.

Semantic endpointing is what end-to-end speech models do implicitly and what cascaded systems are starting to add explicitly.
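The simplest of these, energy-based VAD with a silence timeout, can be sketched in a few lines. The frame length, the RMS threshold, and the 300ms hangover are assumed values for illustration, not recommendations; the hangover is the part that shows up as endpointing delay in the latency budget.

```python
import math
from typing import Optional, Sequence

FRAME_MS = 20            # assumed frame length
ENERGY_THRESHOLD = 0.02  # assumed RMS threshold for samples in [-1.0, 1.0]
HANGOVER_MS = 300        # assumed silence needed before declaring the turn over


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0


def find_endpoint(frames: Sequence[Sequence[float]]) -> Optional[int]:
    """Return the index of the frame at which the turn is declared over, or None."""
    needed = HANGOVER_MS // FRAME_MS
    heard_speech = False
    silent_run = 0
    for i, frame in enumerate(frames):
        if rms(frame) >= ENERGY_THRESHOLD:
            heard_speech = True
            silent_run = 0  # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return i
    return None


speech = [0.1, -0.1] * 80  # synthetic loud frame
silence = [0.0] * 160      # synthetic silent frame
# Leading silence, speech, a short 100ms pause, more speech, then a long silence.
frames = [silence] * 5 + [speech] * 10 + [silence] * 5 + [speech] * 10 + [silence] * 20
endpoint = find_endpoint(frames)
print(endpoint)  # 44
```

The 100ms pause in the middle does not end the turn because it is shorter than the hangover. A shorter hangover responds faster but cuts off slow speakers, which is the trade-off that neural VAD and semantic endpointing try to improve on.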

Barge-In

Real conversations interrupt. A voice agent that can't be interrupted feels broken:

  • Detect user speech mid-response.
  • Stop TTS playback immediately.
  • Cancel in-flight LLM generation.
  • Shift attention to the new user input.

Barge-in handling is the single biggest UX win in voice agents and the one most often overlooked.
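The four steps above map naturally onto task cancellation. The sketch below uses Python's asyncio: an asyncio.Event stands in for the VAD signal, and a single speak() coroutine stands in for both LLM generation and TTS playback. All function names and timings here are hypothetical.

```python
import asyncio


async def speak(text: str, spoken: list) -> None:
    """Stand-in for streaming LLM + TTS playback: emits one word every 50 ms."""
    for word in text.split():
        spoken.append(word)
        await asyncio.sleep(0.05)


async def handle_turn(reply: str, user_spoke: asyncio.Event, spoken: list) -> bool:
    """Play the reply, but cancel it as soon as the user starts talking.

    Returns True if the reply was interrupted, so the caller can switch
    to processing the new user input.
    """
    playback = asyncio.create_task(speak(reply, spoken))
    barge_in = asyncio.create_task(user_spoke.wait())
    await asyncio.wait({playback, barge_in}, return_when=asyncio.FIRST_COMPLETED)
    interrupted = not playback.done()
    # Cancel whichever task is still pending (playback on barge-in, the waiter otherwise).
    for task in (playback, barge_in):
        if not task.done():
            task.cancel()
    await asyncio.gather(playback, barge_in, return_exceptions=True)
    return interrupted


async def demo() -> tuple:
    user_spoke = asyncio.Event()
    spoken: list = []
    # Simulate the VAD detecting user speech 120 ms into the response.
    asyncio.get_running_loop().call_later(0.12, user_spoke.set)
    interrupted = await handle_turn(
        "one two three four five six seven eight", user_spoke, spoken
    )
    return interrupted, spoken


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a real agent the cancelled task would also need to flush any audio already queued on the output device, and the conversation history should record only the part of the reply the user actually heard.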

Tone, Prosody, and Persona

Voice carries information text doesn't:

  • Pace — fast for confirmation, slow for important details.
  • Tone — friendly, formal, urgent, calm.
  • Pauses — natural ones make speech understandable; missing ones make it feel robotic.

End-to-end speech models handle this implicitly. Cascaded systems need explicit prosody control — SSML, voice presets, or fine-tuned voices — and a lot of testing.
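For the SSML route, pace and pauses can be expressed with the standard prosody and break elements. The helper below is a small sketch; to_ssml is a name made up for this example, and support for individual SSML elements and attribute values varies between TTS providers, so the output should be checked against the provider in use.

```python
from xml.sax.saxutils import escape


def to_ssml(text: str, rate: str = "medium", pause_after_ms: int = 0) -> str:
    """Wrap text in SSML with a speaking rate and an optional trailing pause."""
    # escape() protects against characters such as & and < in the reply text.
    body = f'<prosody rate="{rate}">{escape(text)}</prosody>'
    if pause_after_ms > 0:
        body += f'<break time="{pause_after_ms}ms"/>'
    return f"<speak>{body}</speak>"


# Fast for a confirmation, slow with a pause for an important detail.
confirmation = to_ssml("Got it.", rate="fast")
detail = to_ssml("Your confirmation code is 4 7 2 9.", rate="slow", pause_after_ms=400)

print(confirmation)
print(detail)
```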

Tool Use Mid-Conversation

The same tool-use loop applies, but tools have to fit the latency budget:

  • Acknowledge first ("let me check that for you") so the user isn't left in silence.
  • Run tools in parallel with continued speech where possible.
  • Stream partial results — start speaking as soon as you know how to begin the answer.

A 3-second tool call is fine in chat and nearly fatal in voice without these patterns.
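The first two patterns can be combined by starting the tool call before speaking the acknowledgement, so the two overlap. This is a minimal asyncio sketch; say() and slow_lookup() are hypothetical stand-ins for TTS playback and a backend call, with shortened delays so the example runs quickly.

```python
import asyncio


async def say(text: str, transcript: list) -> None:
    """Stand-in for TTS playback; records what was spoken."""
    transcript.append(text)
    await asyncio.sleep(0.05)  # pretend playback takes time


async def slow_lookup(order_id: str) -> str:
    """Hypothetical tool call; the delay stands in for a slow backend."""
    await asyncio.sleep(0.2)
    return f"Order {order_id} ships tomorrow."


async def answer_with_tool(order_id: str) -> list:
    transcript: list = []
    # Start the tool call first so it runs while the acknowledgement plays.
    lookup = asyncio.create_task(slow_lookup(order_id))
    await say("Let me check that for you.", transcript)
    result = await lookup
    await say(result, transcript)
    return transcript


if __name__ == "__main__":
    print(asyncio.run(answer_with_tool("A17")))
```

The user hears speech immediately, and the time spent on the acknowledgement is subtracted from the silence they would otherwise sit through.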

Common Use Cases

  • Customer support — replacing or augmenting IVR systems.
  • Outbound calling — appointment confirmations, surveys.
  • In-car / hands-free — navigation, controls, entertainment.
  • Accessibility — interfaces for users who can't or don't want to type.
  • Real-time translation — speech-to-speech across languages.

What's Still Hard

  • Background noise. Real environments aren't quiet studios.
  • Code-switching. Users mixing languages in a single sentence.
  • Long conversations. Context management gets messier when transcripts grow.
  • Emotion and empathy. Voice raises user expectations of social awareness.
  • Compliance recording. Voice data has different privacy and retention rules than text.

Provider Landscape

  • End-to-end: OpenAI Realtime, Google Gemini Live, with others appearing fast.
  • Cascaded STT: Deepgram, AssemblyAI, OpenAI Whisper (open weights and hosted API), Google.
  • Cascaded TTS: ElevenLabs, OpenAI TTS, Google, Cartesia.
  • Voice infrastructure: Vapi, Retell, LiveKit, Pipecat. These wrap the components above into deployable agents.

The ecosystem is moving fast. Build with provider abstraction; don't lock in.
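One possible shape for that abstraction is a small interface that agent code depends on, with one adapter per vendor behind it. In the sketch below the Protocol, both adapter classes, and the registry are hypothetical; the adapters return tagged bytes instead of calling any real SDK.

```python
from typing import Dict, Iterator, Protocol


class TextToSpeech(Protocol):
    """The only TTS surface the agent code is allowed to depend on."""

    def synthesize(self, text: str) -> Iterator[bytes]:
        ...


class VendorATts:
    """Hypothetical adapter; a real one would call vendor A's SDK here."""

    def synthesize(self, text: str) -> Iterator[bytes]:
        yield b"A:" + text.encode("utf-8")


class VendorBTts:
    """Hypothetical adapter for a second vendor that streams word by word."""

    def synthesize(self, text: str) -> Iterator[bytes]:
        for word in text.split():
            yield b"B:" + word.encode("utf-8")


TTS_PROVIDERS: Dict[str, type] = {"vendor_a": VendorATts, "vendor_b": VendorBTts}


def make_tts(name: str) -> TextToSpeech:
    """Pick a provider from configuration rather than from import statements."""
    return TTS_PROVIDERS[name]()


def speak(tts: TextToSpeech, text: str) -> bytes:
    """Agent-side code: works with any adapter that satisfies the Protocol."""
    return b"".join(tts.synthesize(text))


print(speak(make_tts("vendor_a"), "hi there"))
print(speak(make_tts("vendor_b"), "hi there"))
```

Switching vendors then becomes a configuration change plus one new adapter class, and the same approach applies to the STT and LLM stages.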
