AI Product Architecture & Operations

Voice Agents in Production

The latency stack, conversation flow design, graceful handoff, and the cost economics that make AI voice receptionists viable for service businesses.

TL;DR

  • Latency is the defining constraint. The full stack (speech-to-text, model inference, text-to-speech, network) must stay under 800ms for conversations to feel natural. Above 1,200ms, users doubt the call is working.
  • Voice agents succeed or fail on conversation flow design, not model quality. Structured states with explicit transitions outperform open-ended dialogue for every repeatable business workflow.
  • The economics are clear: $0.05–$0.15 per AI call vs $1.25–$2.30 for a human receptionist, available 24/7, with zero turnover.

Voice AI has a commercially mature application right now: the inbound service call. Booking appointments, answering FAQs, routing enquiries, capturing lead information. Service businesses miss 20 to 40% of inbound calls because staff aren't available. A voice agent is always available, costs roughly a tenth as much per call as a human, and needs no rostering.

The pattern holds across production deployments: conversation design matters far more than the underlying model. The best voice agents in production aren't running the most capable LLM. They're running the most disciplined conversation flow.

The latency stack

Voice is unforgiving about latency in a way that text is not. A 3-second delay in a chat interface is annoying. A 3-second silence in a phone call feels like a dropped connection. Users hang up.

The total latency stack has four components:

| Component | Typical range | Notes |
| --- | --- | --- |
| Speech-to-text (STT) | 100–300ms | Streaming STT is meaningfully faster than batch |
| LLM inference | 200–800ms | First-token latency matters more than full completion time |
| Text-to-speech (TTS) | 100–300ms | Begin streaming TTS before the full response is ready |
| Network round-trip | 50–100ms | Depends on infrastructure proximity to telephony |

Target total: under 800ms. Above 1,200ms, conversation quality degrades noticeably. Two tactics keep latency in range.

Stream everything. Don't wait for the full LLM response before starting TTS. Begin synthesising speech as tokens arrive. This alone cuts perceived latency by 40 to 50%.
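The streaming tactic can be sketched as a small pipeline that flushes text to TTS at each sentence boundary instead of waiting for the full completion. This is an illustrative sketch, not a specific vendor API: `token_iter` stands in for your LLM's token stream and `speak` for a streaming TTS call.

```python
import re

def stream_to_tts(token_iter, speak):
    """Forward LLM tokens to TTS as soon as a sentence boundary
    arrives, instead of waiting for the full completion."""
    buffer = ""
    for token in token_iter:
        buffer += token
        # Flush on sentence-ending punctuation so TTS starts speaking early.
        match = re.search(r"[.!?]\s", buffer)
        if match:
            chunk, buffer = buffer[:match.end()], buffer[match.end():]
            speak(chunk.strip())
    if buffer.strip():
        speak(buffer.strip())  # flush whatever remains at end of response

# Simulated usage: collect "spoken" chunks instead of calling a real TTS.
spoken = []
tokens = ["Your ", "appointment ", "is ", "booked. ", "Anything ", "else?"]
stream_to_tts(iter(tokens), spoken.append)
# spoken → ["Your appointment is booked.", "Anything else?"]
```

The caller hears the first sentence while the model is still generating the second, which is where the 40 to 50% reduction in perceived latency comes from.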

Use filler responses. When processing will exceed 500ms, insert a conversational filler: "Let me check that for you" or "One moment." This buys one to two seconds without the conversation feeling broken. Vary the phrases or they sound scripted.
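The filler tactic amounts to a threshold check plus rotation so the same phrase never plays twice in a row. A minimal sketch, with the 500ms threshold and phrases taken from above; the class and its names are illustrative:

```python
import itertools
import random

FILLERS = [
    "Let me check that for you.",
    "One moment.",
    "Just looking that up now.",
]

class FillerPicker:
    """Rotate through shuffled filler phrases so repeated
    lookups in one call don't sound scripted."""
    def __init__(self, phrases, threshold_ms=500):
        self.threshold_ms = threshold_ms
        self._cycle = itertools.cycle(random.sample(phrases, len(phrases)))

    def maybe_filler(self, estimated_ms):
        # Only interject when the pending work exceeds the threshold;
        # for fast responses, silence is better than chatter.
        if estimated_ms > self.threshold_ms:
            return next(self._cycle)
        return None

picker = FillerPicker(FILLERS)
picker.maybe_filler(900)  # returns one of the filler phrases
picker.maybe_filler(200)  # returns None: fast enough, stay silent
```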

Conversation flow design

The quality gap between voice agents is not in the model. It's in the flow.

Open-ended voice agents (a system prompt plus free-form conversation) fail in production. They go off-track, handle ambiguity poorly, and improvise edge case responses that confuse or frustrate callers. The fix is structured states: discrete conversation stages with defined inputs, outputs, and transitions.

A booking flow for a service business has four states.

Opening. Greet, identify the caller's intent, and route to the correct state. Keep this under 10 seconds. Callers are impatient and on mobile.

Information gathering. Collect the required inputs for the task: name, preferred time, service type. One question at a time. Confirm each answer before moving on. Don't ask for three things at once.

Confirmation. Summarise what was captured, state what happens next, and get explicit agreement. "I've booked you in for Thursday at 2pm. You'll get a confirmation SMS in the next few minutes. Is there anything else?" This step is not optional. It builds trust and catches errors before they become support tickets.

Edge case handling. Define explicit paths for the scenarios you know will occur: account not found, requested time unavailable, out-of-scope request. Don't let the model improvise these. Script them. Improvised edge case responses are where voice agents embarrass the business they represent.

Outside these states, route to a human. Every voice agent needs a clean handoff path.
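The four states plus the handoff path can be expressed as an explicit transition table, so the agent can never wander into an unscripted path. A sketch under the assumption of the booking flow above; the enum and transition sets are illustrative, not a framework API:

```python
from enum import Enum, auto

class State(Enum):
    OPENING = auto()
    GATHERING = auto()
    CONFIRMATION = auto()
    EDGE_CASE = auto()
    HANDOFF = auto()
    DONE = auto()

# Explicit transitions: each state declares where it may go next.
TRANSITIONS = {
    State.OPENING: {State.GATHERING, State.EDGE_CASE, State.HANDOFF},
    # GATHERING loops on itself: one question at a time.
    State.GATHERING: {State.GATHERING, State.CONFIRMATION,
                      State.EDGE_CASE, State.HANDOFF},
    # CONFIRMATION can return to GATHERING if the summary catches an error.
    State.CONFIRMATION: {State.DONE, State.GATHERING, State.HANDOFF},
    State.EDGE_CASE: {State.GATHERING, State.HANDOFF, State.DONE},
}

def advance(current, proposed):
    """Reject any transition the flow designer didn't script;
    anything unscripted routes to a human instead of improvising."""
    if proposed in TRANSITIONS.get(current, set()):
        return proposed
    return State.HANDOFF
```

The design choice worth noting: the default for an unknown transition is `HANDOFF`, not an error or a free-form reply, which enforces the "outside these states, route to a human" rule structurally.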

Graceful handoff

Handoff design is where most voice agent projects fail.

A good handoff transfers three things simultaneously: the call itself, the full conversation transcript, and a structured summary of what was attempted, what was gathered, and why the handoff is happening. The human agent should be able to continue the conversation without asking the caller to repeat anything.

What this requires technically:

  • Real-time transcript sent to the receiving agent's screen before the call arrives
  • A structured data object (name, intent, captured fields, reason for handoff) passed via the telephony platform
  • A warm transfer, not a cold blind transfer: the AI briefly introduces the human before dropping off
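The structured data object can be as simple as a dataclass serialised for the telephony platform. A sketch with hypothetical field names; real platforms will dictate their own payload shape:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class HandoffPayload:
    """Structured summary sent to the human agent's screen
    alongside the live call transfer."""
    caller_name: str
    intent: str
    captured: dict = field(default_factory=dict)    # fields gathered so far
    reason: str = ""                                # why the AI is handing off
    transcript: list = field(default_factory=list)  # full conversation turns

payload = HandoffPayload(
    caller_name="Sam",
    intent="booking",
    captured={"service": "plumbing", "preferred_time": "Thu 2pm"},
    reason="requested_time_unavailable",
    transcript=[("ai", "How can I help?"),
                ("caller", "I need a plumber Thursday.")],
)
asdict(payload)  # JSON-ready dict to pass through the transfer API
```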

The warm transfer script: "I'm going to connect you now with [Name] who can help with this. I'll let them know what we've discussed." Then transfer. The AI doesn't say goodbye and hang up. It bridges the caller to the human.

This matters because users evaluate the entire call experience, not just the AI portion. A technically competent AI that hands off badly poisons the caller's memory of the whole call. Users remember the last thing that happened.

Where voice agents work and where they fail

Voice works for workflows that are:

  • High-volume, repetitive, and structured (bookings, FAQs, routing, lead capture)
  • Time-sensitive for the caller (a missed call has a real cost to the business)
  • Short: under 3 minutes end-to-end

Voice fails for workflows that are:

  • Emotionally charged (complaints, disputes, cancellations). These need humans.
  • Highly variable with unpredictable branching
  • Dependent on context the agent can't access (an account lookup with no API integration)

The common mistake: scoping the voice agent too broadly at launch. Start with the narrowest, highest-volume workflow. Get it to 95%+ task completion rate. Then expand. Adding scope before the baseline is solid multiplies failure modes.

Production checklist

Before shipping a voice agent to real callers:

  • Latency tested at the 95th percentile, not average (averages hide the tail)
  • All four conversation states tested with real calls, not synthetic prompts
  • Edge cases scripted and tested: no-account, unavailable-slot, out-of-scope intent
  • Handoff tested end-to-end with a real human receiving the transfer
  • Transcript delivery confirmed before the handoff completes
  • Disclosure language in place (many jurisdictions require callers be informed they're speaking with AI)
  • Fallback defined: what happens if the telephony integration fails mid-call
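The first checklist item, testing at the 95th percentile rather than the average, is a one-liner with the standard library. A sketch; the synthetic sample data below is purely illustrative:

```python
import statistics

def p95(samples_ms):
    """95th-percentile latency. The average hides the slow tail
    that actually drives hangups."""
    return statistics.quantiles(samples_ms, n=100)[94]

# Mostly-fast distribution with a slow tail: the mean is 645ms
# and looks healthy, but the p95 is well above the 1,200ms ceiling.
samples = [600] * 95 + [1500] * 5
```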

The economics

| | AI voice agent | Human receptionist |
| --- | --- | --- |
| Cost per call | $0.05–$0.15 | $1.25–$2.30 |
| Availability | 24/7 | Business hours |
| Consistency | 100% | Variable |
| Ramp time | 0 | 2–4 weeks |

For a business fielding 200 calls per month, that's roughly $20 in AI costs vs $300+ for part-time human coverage. The economics work at modest call volumes.
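The arithmetic behind that comparison, using the per-call ranges from the table (the "$300+" figure also reflects fixed costs of part-time coverage beyond the per-call wage):

```python
def monthly_cost(calls, cost_per_call):
    """Direct per-call cost for a month of inbound volume."""
    return calls * cost_per_call

calls = 200
ai_low = monthly_cost(calls, 0.05)    # $10 at the cheap end
ai_high = monthly_cost(calls, 0.15)   # $30 at the expensive end
human_low = monthly_cost(calls, 1.25) # $250 before rostering overhead
```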

The non-obvious savings: reduced missed calls, no roster management, no sick days, no turnover. For small service businesses, those operational costs often exceed the direct wage.

What voice agent PMs look like

| Behaviour | In practice |
| --- | --- |
| Latency-obsessed | Measures 95th percentile, not average. Treats anything above 1,200ms as a blocking issue before launch. |
| Flow designers | Scripts conversation states explicitly. Doesn't rely on model improvisation for predictable scenarios. |
| Handoff engineers | Designs the human transfer as carefully as the AI conversation. Tests it live, not in simulation. |
| Scope disciplined | Starts with the narrowest viable workflow. Resists complexity until the baseline completion rate is solid. |

v2.0 · Updated Mar 2026