Voice Automation

Real-time, bidirectional voice — LLMs that take and place phone calls instead of analyzing them after the fact

Scenario Abstraction

A business takes (or makes) phone calls that follow a recognizable script: appointment booking, intake screening, outbound reminders, simple support, balance lookups, account changes, lead qualification. The work is real-time, conversational, and turn-bound: the caller is on the line, the system has about a second to respond, and the conversation ends only when something has been accomplished or escalated.

This is distinct from Conversation Intelligence, which analyzes recordings after the call. Voice automation is the call. The same business can run both: a voice agent handles the call, and a conversation-intelligence pass scores and learns from it afterward.

Solution Shape

Telephony layer — SIP / WebRTC / PSTN trunk that brings the audio in and out, manages call control (hold, transfer, hang up, DTMF).
Streaming ASR — incremental transcription with partial hypotheses; endpointing decides when the user finished speaking.
Dialog policy — what the agent is allowed to do, what intents it covers, what tools it can call, when it must escalate.
LLM turn — given conversation state + current user utterance + retrieved business context, decide the next response and any tool calls.
Tool calls — lookup customer, check availability, place a hold, charge a card, book a slot, write a note, send an SMS confirmation.
Streaming TTS — generate audio as the response forms; goal is sub-second first-word latency.
Barge-in handling — if the user starts talking, stop the TTS and listen.
Escalation — clean handoff to a human with full context (transcript so far, identified caller, attempted action).
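Barge-in is the step that is easiest to get wrong, so here is a minimal sketch of it using asyncio. Everything in it is hypothetical: speak stands in for a streaming TTS player, and the asyncio.Event stands in for whatever signal the ASR or voice-activity detector raises when the caller starts talking. The chunk duration and timings are illustrative only.

```python
import asyncio
from contextlib import suppress

async def speak(chunks, played, chunk_s=0.1):
    """Hypothetical TTS player: 'plays' one chunk every chunk_s seconds."""
    for chunk in chunks:
        await asyncio.sleep(chunk_s)
        played.append(chunk)

async def run_turn(chunks, user_speech_started):
    """Play a response, but stop as soon as the caller starts talking."""
    played = []
    tts_task = asyncio.create_task(speak(chunks, played))
    barge_in = asyncio.create_task(user_speech_started.wait())
    done, _ = await asyncio.wait(
        {tts_task, barge_in}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = barge_in in done and not tts_task.done()
    # Cancel whichever side is still pending and wait for it to unwind.
    loser = tts_task if interrupted else barge_in
    loser.cancel()
    with suppress(asyncio.CancelledError):
        await loser
    return played, interrupted

async def demo():
    speech_started = asyncio.Event()
    # Assumption for the demo: the caller starts talking 250 ms in.
    asyncio.get_running_loop().call_later(0.25, speech_started.set)
    return await run_turn(
        ["Your", "appointment", "is", "on", "Tuesday"], speech_started
    )

played, interrupted = asyncio.run(demo())
print(played, interrupted)
```

The list of chunks that were actually played matters: the dialog state should record what the caller heard, not what the model intended to say, so the next LLM turn does not assume the full response was delivered.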

The non-obvious work is latency budgets and turn-taking. Callers tolerate roughly 1s of silence; anything longer feels broken. Streaming in both directions and aggressive pre-fetching are how that budget gets met.
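One way to make the budget concrete is to give each stage a share of the roughly one-second allowance and compare it against profiled numbers. The sketch below does that. The stage names, the per-stage budgets, and the measured values are all illustrative assumptions, not benchmarks, and summing the stages is a simplification: with streaming, stages overlap, so the sum is an upper bound on time to first audio.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    budget_ms: int    # target time to first output for this stage (assumed)
    measured_ms: int  # p95 from profiling (made-up numbers for illustration)

def check_budget(stages, total_budget_ms=1000):
    """Return (total measured ms, within total budget?, stages over their own budget)."""
    total = sum(s.measured_ms for s in stages)
    over = [s.name for s in stages if s.measured_ms > s.budget_ms]
    return total, total <= total_budget_ms, over

stages = [
    Stage("endpointing", 300, 350),
    Stage("asr_final", 100, 80),
    Stage("llm_first_token", 300, 280),
    Stage("tts_first_audio", 150, 140),
    Stage("network_and_playback", 100, 90),
]

print(check_budget(stages))  # (940, True, ['endpointing'])
```

Tracking each stage against its own share, not just the total, shows which stage is eating the slack before the overall budget is actually blown.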

Key Building Blocks

Telephony provider with low-latency media (Twilio, Vonage, Telnyx, Plivo, Daily, LiveKit).
Streaming ASR with custom vocabulary for names / products / IDs.
Realtime-capable LLM (speech-in/speech-out models, or fast text LLM + low-latency TTS).
Tool layer — typed, idempotent, fast. Slow tools kill UX; cache aggressively.
Dialog state store — survives reconnects and short drops.
Guardrails — refusal patterns, PII handling, "I can't help with that" handoff scripts.
Recording + post-call analytics pipeline — feeds conversation intelligence.
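To illustrate what "typed, idempotent, fast" can mean for the tool layer, here is a minimal in-memory sketch. BookingTool, its method names, the slot list, and the 30-second cache TTL are all hypothetical; a real tool would call the scheduling backend. The two ideas shown are an idempotency key, so a retried or repeated tool call cannot double-book, and a short-lived cache on the read path.

```python
import time
from dataclasses import dataclass, field

@dataclass
class BookingTool:
    bookings: dict = field(default_factory=dict)  # idempotency_key -> result
    taken: set = field(default_factory=set)       # (day, slot) pairs already booked
    cache: dict = field(default_factory=dict)     # day -> (expires_at, slots)
    backend_calls: int = 0                        # counts simulated slow lookups

    def check_availability(self, day: str, ttl_s: float = 30.0) -> list[str]:
        """Read path: serve from cache while the entry is fresh."""
        now = time.monotonic()
        hit = self.cache.get(day)
        if hit and hit[0] > now:
            return hit[1]
        self.backend_calls += 1  # stands in for a slow backend query
        slots = [s for s in ("09:00", "10:00", "15:00") if (day, s) not in self.taken]
        self.cache[day] = (now + ttl_s, slots)
        return slots

    def book_slot(self, idempotency_key: str, day: str, slot: str) -> dict:
        """Write path: the same key always returns the same result."""
        if idempotency_key in self.bookings:
            return self.bookings[idempotency_key]
        if (day, slot) in self.taken:
            result = {"status": "unavailable", "day": day, "slot": slot}
        else:
            self.taken.add((day, slot))
            self.cache.pop(day, None)  # availability changed, drop the stale entry
            result = {"status": "confirmed", "day": day, "slot": slot}
        self.bookings[idempotency_key] = result
        return result

tool = BookingTool()
first = tool.book_slot("call-123:book", "2025-01-07", "15:00")
retry = tool.book_slot("call-123:book", "2025-01-07", "15:00")
print(first == retry, tool.check_availability("2025-01-07"))
```

Deriving the key from the call ID plus the intended action is one plausible choice: it lets the agent retry after a timeout or a reconnect without creating a second booking.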

Concrete Cases

Inbound appointment booking. Dental / clinic / salon receives calls, agent looks up availability, books the slot, sends an SMS confirmation, falls back to a human for complex cases.
Outbound appointment reminders & confirmations. Agent calls patients the day before, confirms / reschedules / cancels, writes back to the EMR.
Restaurant reservations. Reservation-only line that handles overflow when staff are busy; integrates with the table-management system.
Lead qualification calls. Inbound web lead is called within seconds; agent qualifies (budget, role, timeline), books a human meeting if qualified.
First-line customer support. Identify caller, look up account, answer common questions (balance, status, "where is my order"), transfer the rest with context.
Field-service dispatch intake. Customer calls about a broken appliance; agent gathers the symptoms, schedules a tech, sends ETA SMS.
Insurance FNOL intake. Capture first notice of loss over the phone, fill the structured claim form, schedule callback.
Survey & research calls. Outbound interview agent runs a structured questionnaire, captures answers, codes free-text responses inline.
Collections reminders (low-sensitivity). Friendly reminder calls, payment-method updates, payment-plan setup.
Voice-driven IVR replacement. Replace push-button menus with conversational routing across complex call trees.

Similar Scenarios

Voice-mode in-app assistants — same realtime stack, in-product instead of over the PSTN.
Drive-thru / kiosk order taking — voice in noisy environments, structured menu output.
Multilingual interpreter — real-time bidirectional translation as a third party on a call.
Voice notes → structured records — async, not realtime; closer to Document-to-Action.

Pitfalls & Evaluation

Latency tax. Each added 200ms of latency makes the agent feel worse. Profile and budget end-to-end (capture → ASR → LLM → tool → TTS → playback).
Endpointing errors. Cutting users off mid-sentence is uniquely irritating in voice. Tune endpoint silence thresholds per population (older callers pause more).
ASR on names / IDs. Custom vocabulary, NATO-alphabet fallback prompts ("did you say B as in Bravo?"), DTMF as escape hatch.
Hallucinated business state. Voice users can't see the screen; if the agent says "your appointment is confirmed for Tuesday at 3," that had better be true. Always ground state changes in tool calls, never in the model's memory.
Escalation that loses context. Transferring to a human who then asks the caller to repeat everything is worse than no agent at all. Pass the transcript and the structured state.
Recording consent and regulation. Requirements are jurisdiction-specific. Build the consent prompt into the opening, log the caller's answer, and skip recording when consent is not granted.
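The endpointing pitfall above comes down to one tunable number, so a small sketch may help. It assumes a voice-activity detector that emits one boolean per 20ms audio frame; the function name and the 400ms and 700ms thresholds are illustrative assumptions, not recommendations. The turn is declared finished once enough consecutive silent frames follow some speech.

```python
def find_endpoint(frames, frame_ms=20, silence_ms=700):
    """Return the index of the frame where the turn is judged finished, or None.

    frames: iterable of booleans, True when the (hypothetical) VAD hears speech.
    """
    needed = silence_ms // frame_ms  # consecutive silent frames required
    heard_speech = False
    silent_run = 0
    for i, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silent_run = 0
        elif heard_speech:  # leading silence before any speech is ignored
            silent_run += 1
            if silent_run >= needed:
                return i
    return None

# A caller who speaks, pauses for 500 ms mid-sentence, then finishes.
utterance = [True] * 10 + [False] * 25 + [True] * 10 + [False] * 50

print(find_endpoint(utterance, silence_ms=400))  # 29: cut off during the pause
print(find_endpoint(utterance, silence_ms=700))  # 79: waits for the real end
```

The trade-off is direct: a longer threshold avoids cutting callers off but adds the same amount of delay to every turn, which is why it counts against the latency budget.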

Useful metrics: containment rate (calls fully handled without escalation) per intent, average handle time, first-word latency p50/p95, escalation reason mix, customer-satisfaction (CSAT) on automated vs human-handled calls, downstream booking / payment / resolution completion rates.
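Two of these metrics are straightforward to compute from call logs, as the sketch below shows. The record layout (intent, escalated) and the sample values are assumptions made for illustration; percentiles use the standard library's statistics.quantiles with the inclusive method, which is one of several reasonable percentile definitions.

```python
from collections import defaultdict
from statistics import quantiles

def containment_by_intent(calls):
    """Share of calls per intent that finished without escalation."""
    totals, contained = defaultdict(int), defaultdict(int)
    for call in calls:
        totals[call["intent"]] += 1
        if not call["escalated"]:
            contained[call["intent"]] += 1
    return {intent: contained[intent] / totals[intent] for intent in totals}

def latency_percentiles(latencies_ms):
    """p50 and p95 of first-word latency; needs at least two samples."""
    cuts = quantiles(latencies_ms, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94]}

# Made-up call records for illustration.
calls = [
    {"intent": "booking", "escalated": False},
    {"intent": "booking", "escalated": False},
    {"intent": "booking", "escalated": True},
    {"intent": "balance", "escalated": False},
]

print(containment_by_intent(calls))
print(latency_percentiles([100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]))
```

Reporting containment per intent rather than overall keeps a high-volume easy intent from hiding a poorly handled one, and reporting p95 alongside p50 surfaces the slow turns that callers actually notice.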