As of 2026-05-30

Voice is the most latency-sensitive AI use case in production. Chat users wait three seconds for a response and keep using your product. Voice users wait three seconds and the conversation is broken — the human on the other end thinks the call dropped, or the agent is malfunctioning, or worse, the human starts talking again and the agent interrupts. The latency budget is not a nice-to-have.

This article is the working engineer's breakdown of where the time actually goes in a voice agent, the published numbers from production teams, and the levers that meaningfully move each one.

The 600-millisecond classical-pipeline budget

The clearest published breakdown of a production voice agent pipeline comes from Retell AI's 2026 article on how real-time voice AI actually works. For a well-tuned classical STT-to-LLM-to-TTS pipeline, the typical layer budget from end of user speech to first audio out:

Layer                      Budget
Network round-trip         30 to 80 ms
VAD / turn-taking          150 to 300 ms
LLM time-to-first-token    150 to 400 ms
TTS time-to-first-audio    100 to 200 ms
Total round-trip           ~600 ms

The number is not magic; it is the sum of four streaming layers each tuned independently. The interesting thing about voice latency engineering is that the layers stream in parallel — STT runs while the user is still talking, so its incremental contribution at the end is small; TTS starts producing audio while the LLM is still generating tokens, so its impact is also smaller than the per-layer budget suggests. The numbers above are the incremental latency each layer adds at the critical end-of-speech-to-first-audio point.
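As a quick sanity check on the arithmetic, the per-layer ranges in the table can be summed into a best-case and worst-case bound. This is only a sketch: the dictionary keys and the `budget_range` helper are illustrative names, and the figures are the ones quoted in the table above.

```python
# Per-layer incremental latency in milliseconds, as (low, high) bounds
# copied from the classical-pipeline table. Key names are illustrative.
CLASSICAL_BUDGET_MS = {
    "network_round_trip": (30, 80),
    "vad_turn_taking": (150, 300),
    "llm_ttft": (150, 400),
    "tts_ttfa": (100, 200),
}


def budget_range(layers):
    """Sum each layer's low and high bound into (best_case, worst_case)."""
    low = sum(lo for lo, _ in layers.values())
    high = sum(hi for _, hi in layers.values())
    return low, high


low, high = budget_range(CLASSICAL_BUDGET_MS)
print(f"best case {low} ms, worst case {high} ms")
```

The ~600 ms total sits inside that range, which is what you would expect from a typical figure rather than a hard bound.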

Speech-to-speech architectures (OpenAI Realtime, Gemini Live) collapse the STT and TTS layers into the model itself. LiveKit's voice agent architecture article puts these systems at "under 500 milliseconds" from end of user speech to first audio, which is consistent with shaving 100 to 200 milliseconds off the classical pipeline by removing the explicit STT and TTS hops.

The thresholds that matter

What the latency budget is buying you is the difference between a conversation that feels human and one that feels like talking to an old IVR system. The thresholds, roughly:

  • Under 500 milliseconds total round-trip — feels conversational. Users do not consciously notice the gap.
  • 500 to 800 milliseconds — feels fine. There is a small noticeable pause but it does not break the flow.
  • 800 milliseconds to 1.5 seconds — feels slow. Users start to wonder if the agent is processing or stuck.
  • Over 1.5 seconds — feels broken. Users either start talking again (causing barge-in collisions) or hang up.

These thresholds are observable in production. Voice agent platforms that cannot hit sub-second round-trips on average lose to platforms that can. The sub-second threshold is non-negotiable for any production-grade voice agent in 2026.
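One way to put the thresholds to work is to bucket measured round-trip times so a dashboard can show how many turns land in each band. The sketch below is illustrative: the function name, the bucket labels, and the handling of exact boundary values (500, 800, 1500 ms) are assumptions, since the list above gives approximate ranges only.

```python
from collections import Counter


def perceived_quality(round_trip_ms: float) -> str:
    """Map an end-of-speech-to-first-audio time onto the bands listed above."""
    if round_trip_ms < 500:
        return "conversational"
    if round_trip_ms < 800:
        return "fine"
    if round_trip_ms <= 1500:
        return "slow"
    return "broken"


# Hypothetical per-turn measurements in milliseconds.
measured_ms = [420, 610, 950, 1800, 480]
print(Counter(perceived_quality(ms) for ms in measured_ms))
```

Tracking the share of turns in the "slow" and "broken" bands is usually more informative than the average alone, because a few long turns are what users remember.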

Layer-by-layer, what you control

The honest version of "how do I make my voice agent faster" goes layer by layer.

Network: 30 to 80 milliseconds

Almost entirely out of your direct control once you have picked a region. The dominant factor is the geographic distance between the user, your service, and the model API. Two things you can do:

  • Run in the same region as your model provider. OpenAI, Anthropic, and Google all have multi-region endpoints. Deploy your service next to the endpoint you call, and pick the endpoint region closest to your users.
  • Use the model provider's edge / accelerated endpoints if they offer them. Some vendors expose lower-latency edge endpoints for premium tiers.

Beyond that, network is fixed. Do not spend engineering time here unless your traceroutes show pathological routing.

VAD and turn-taking: 150 to 300 milliseconds

The variable that makes a voice agent feel responsive or sluggish. Two levers:

  • Tune the end-of-utterance detection. A pure silence-duration threshold (say, 200 milliseconds) is the naive baseline. Production systems use prosody-aware end-of-utterance models that can hit ~150 milliseconds without interrupting users. Most voice agent platforms expose this as a configurable parameter; Vapi, Retell, and LiveKit Agents all do. The trade-off is interruption rate against responsiveness.
  • Use a barge-in-tolerant design. If the agent occasionally interrupts the user because turn-taking was too aggressive, the recovery should be graceful — agent stops, listens, responds. A good barge-in design makes a 200-millisecond turn-taking budget feel fine even on the occasional false positive.

This is the layer where platform expertise matters most. The big platforms have spent years tuning these models on real call data. Building your own from scratch is possible but rarely worth it.
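To make the naive baseline concrete, here is a minimal sketch of a pure silence-duration endpointer. It is not how any of the named platforms implement turn-taking; the 20 ms frame size, the energy threshold, and the function name are assumptions chosen for illustration, and a prosody-aware model would replace the simple energy test.

```python
from typing import Iterable, Optional


def end_of_utterance_ms(
    frame_energies: Iterable[float],
    energy_threshold: float = 0.01,  # assumed: below this a frame counts as silence
    silence_ms: int = 200,           # silence duration that ends the turn
    frame_ms: int = 20,              # assumed audio frame size
) -> Optional[int]:
    """Return the time (ms) at which end of utterance is declared, else None."""
    heard_speech = False
    silent_run_ms = 0
    for index, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            heard_speech = True
            silent_run_ms = 0  # any speech frame resets the silence timer
        elif heard_speech:
            silent_run_ms += frame_ms
            if silent_run_ms >= silence_ms:
                return (index + 1) * frame_ms
    return None


# Synthetic utterance: speech, a 160 ms mid-sentence pause, more speech, silence.
frames = [0.5] * 5 + [0.0] * 8 + [0.5] * 5 + [0.0] * 10
print(end_of_utterance_ms(frames, silence_ms=200))
print(end_of_utterance_ms(frames, silence_ms=150))
```

With the 200 ms threshold the detector waits through the pause and fires after the final silence; with 150 ms it fires inside the pause. That is the interruption-versus-responsiveness trade-off in miniature, and it is why a bare silence timer is hard to push much below 200 ms.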

LLM time-to-first-token: 150 to 400 milliseconds

The biggest software-side lever and the one most teams under-optimize.

  • Enable prompt caching. A stable system prompt cached on the provider side drops TTFT by 200 to 400 milliseconds in many configurations. See Prompt Caching, Explained. For voice, this is the single biggest cost-free win.
  • Pick a faster model. A small-tier model (GPT-4o-mini-class, Claude Haiku-class, Gemini Flash-class) typically has 2x to 3x faster TTFT than a frontier model. For conversational voice, the accuracy gap is often invisible while the latency gap is very visible. See When to Use a Smaller Model.
  • Shorten the system prompt. Less prefill is less latency. Tight, focused system prompts win on voice in a way they do not on text chat.
  • Skip reasoning models for voice. A reasoning model that thinks for 2 seconds before answering is unusable for live voice. Save reasoning models for use cases where the wait is acceptable.
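None of these levers can be evaluated without measuring TTFT on your own traffic. A minimal sketch of that measurement follows, assuming your LLM client exposes the response as an iterable of text tokens. `measure_ttft` and `fake_stream` are hypothetical names, and the fake stream stands in for a real provider call.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def measure_ttft(
    token_stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], str]:
    """Consume a token stream and return (ttft_ms, full_text).

    The timer starts just before iteration begins, so for a lazy stream it
    includes the time spent waiting for the first token. ttft_ms is None if
    the stream yields nothing.
    """
    start = clock()
    ttft_ms: Optional[float] = None
    parts = []
    for token in token_stream:
        if ttft_ms is None:
            ttft_ms = (clock() - start) * 1000.0
        parts.append(token)
    return ttft_ms, "".join(parts)


def fake_stream(first_token_delay_s: float = 0.05) -> Iterator[str]:
    """Stand-in for a streaming LLM response (hypothetical, for illustration)."""
    time.sleep(first_token_delay_s)
    yield "Sure, "
    yield "I can help."


ttft_ms, text = measure_ttft(fake_stream())
print(f"TTFT {ttft_ms:.0f} ms, text: {text!r}")
```

Run it with caching on and off, and across model tiers, and compare percentiles rather than single samples. The `clock` parameter exists only so the function can be tested without real delays.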

TTS time-to-first-audio: 100 to 200 milliseconds

The layer most teams ignore because the numbers are already small.

  • Pick a fast TTS provider. OpenAI's TTS, Cartesia, and Deepgram's Aura all target sub-200-millisecond TTFA. ElevenLabs is excellent for quality but historically slower on TTFA. Match the provider to your latency target.
  • Stream the audio chunks. Do not buffer the full response before playing. The first chunk should reach the user's ear within 100 to 200 milliseconds of the TTS receiving the first LLM tokens; subsequent chunks follow as the LLM produces more text.
  • Pre-warm the connection. Cold TTS connections add 50 to 100 milliseconds. Keeping a warm connection per active call removes this.
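Streaming only helps if the TTS receives text as soon as there is something speakable. One common approach, sketched below under assumptions, is to group LLM tokens into clause-sized chunks and forward each chunk immediately. The punctuation set, the `min_chars` cutoff, and the function name are illustrative choices; some TTS providers accept token-level input directly, in which case this step is unnecessary.

```python
from typing import Iterable, Iterator

# Assumed clause boundaries; tune for your language and TTS voice.
CLAUSE_ENDINGS = (".", "!", "?", ",", ";", ":")


def speakable_chunks(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Group streamed LLM tokens into clause-sized chunks for a streaming TTS."""
    buffer = ""
    for token in tokens:
        buffer += token
        chunk = buffer.strip()
        # Emit once the buffer ends on a clause boundary and is long enough
        # to avoid sending tiny fragments such as "Sure,".
        if len(chunk) >= min_chars and chunk.endswith(CLAUSE_ENDINGS):
            yield chunk
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the LLM finishes


tokens = ["Sure", ",", " I", " can", " check", " that", ".", " One", " moment", "."]
for chunk in speakable_chunks(tokens):
    print(chunk)
```

The first chunk leaves as soon as the first clause is complete, so TTS synthesis overlaps with the rest of the LLM generation instead of waiting for the full response.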

Speech-to-speech: the ~500-millisecond budget

The speech-to-speech architectures (OpenAI Realtime, Gemini Live) cut the latency budget by collapsing STT, LLM, and TTS into a single audio-native multimodal model. The budget shape:

Layer                           Budget
Network round-trip              30 to 80 ms
VAD / turn-taking               150 to 300 ms
Speech-to-speech model TTFA     Under 200 ms (typical)
Total round-trip                Under 500 ms

The model itself absorbs what would have been the STT, LLM TTFT, and TTS TTFA layers. This is the architectural win.

The price you pay: you are bound to the speech-to-speech vendor's voice options, their model behavior, and their pricing model. The classical pipeline keeps each layer swappable; speech-to-speech does not. For applications where the latency improvement materially changes the UX (real-time multilingual interpretation, gaming dialogue, certain accessibility uses), speech-to-speech is worth the lock-in. For most phone-based voice agents, the classical pipeline at 600 milliseconds is still the default.

What does not actually help

Three optimizations that look like they should help and mostly do not:

  • Streaming partial transcripts to the LLM before VAD fires. Tempting because you save the VAD wait, but in practice the LLM produces low-quality responses when fed unfinished user input. The few teams that ship this end up reverting it.
  • Pre-generating common responses. Looks clever, breaks the moment the user says anything unexpected, and produces a brittle experience.
  • Squeezing the network layer. The network budget is already small. Optimization effort there has diminishing returns.

The wins are where the budget is fat: VAD and LLM TTFT, starting with prompt caching if you have not already enabled it. The wins are not where the budget is already thin.

The cost-of-latency trade-off

One point worth flagging: there is a real quality-versus-latency curve at the LLM layer.

  • A frontier model with reasoning enabled: highest quality, 2 to 10 seconds TTFT, unusable for live voice.
  • A frontier model without reasoning: high quality, 400 to 800 milliseconds TTFT, on the edge of acceptable.
  • A mid-tier model: very good quality, 200 to 400 milliseconds TTFT, comfortable.
  • A small-tier model: good-enough quality for conversational use, 100 to 250 milliseconds TTFT, fast.

Most production voice agents in 2026 default to mid-tier or small-tier models for the conversational LLM, and reserve frontier models for specific complex tools (an explicit "let me check the policy on that" step) where a 3-second pause is acceptable. The "use the biggest model for everything" instinct from chat does not transfer to voice.
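The tier list above can be turned into a simple selection rule: take the highest-quality tier whose worst-case TTFT still fits the budget for that step. The sketch below is one possible encoding, not a recommendation from any vendor. The tier names and the `pick_tier` helper are hypothetical, and the TTFT ranges are the ones quoted in the list.

```python
from typing import NamedTuple, Optional, Sequence


class ModelTier(NamedTuple):
    name: str
    ttft_ms_low: int
    ttft_ms_high: int


# Ordered from highest to lowest quality; TTFT ranges from the list above.
TIERS = (
    ModelTier("frontier_with_reasoning", 2000, 10000),
    ModelTier("frontier_no_reasoning", 400, 800),
    ModelTier("mid_tier", 200, 400),
    ModelTier("small_tier", 100, 250),
)


def pick_tier(
    ttft_budget_ms: int, tiers: Sequence[ModelTier] = TIERS
) -> Optional[ModelTier]:
    """Return the highest-quality tier whose worst-case TTFT fits the budget."""
    for tier in tiers:
        if tier.ttft_ms_high <= ttft_budget_ms:
            return tier
    return None  # nothing fits; the budget is below every tier's worst case


print(pick_tier(400))   # conversational turn
print(pick_tier(3000))  # explicit "let me check" tool step
```

Using the worst-case bound is a conservative choice. Selecting on measured percentiles from your own traffic would be a reasonable refinement.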