As of 2026-05-30
A voice agent is the thing that picks up your call, listens to what you said, decides what to do, and says something back. Built well, it does all of that in under a second, so the conversation feels natural rather than like a long-distance walkie-talkie. In 2026 two architectures dominate new production voice agents, with various hybrid and legacy patterns still in service (NLU-plus-IVR front ends paired with an LLM, hand-written dialogue flows wrapping an LLM, and so on). The comparison below covers the two dominant shapes. They share enough structure that the choice mostly comes down to which layers you control and where the latency goes.
This article is the working engineer's read on the end-to-end pipeline, where the design choices actually live, and what separates a polished voice agent from one that feels broken.
The two architectures
Architecture A — the classical streaming pipeline. Audio comes in. A voice activity detector decides when the user has stopped speaking. A streaming speech-to-text model emits partial transcripts as the user talks. The transcript goes to an LLM, which streams tokens back. A streaming text-to-speech model converts those tokens into audio chunks as they arrive. The audio is played back to the user. Every stage streams; nothing waits for the previous stage to finish before starting.
This is the architecture every major voice agent platform — Vapi, Retell, Bland, LiveKit Agents — supports as the default. It is what almost every shipped phone-based voice agent in 2026 actually runs.
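A minimal sketch of the streaming hand-off in Architecture A, using Python async generators. The `mic`, `stt`, `llm`, and `tts` functions are hypothetical stand-ins that pass words around instead of audio; they are not any vendor's API, and real stages would wrap network streams. The only thing the sketch tries to show is the ordering: the first audio chunk is produced before the LLM has emitted its second token.

```python
import asyncio
from typing import AsyncIterator, List


async def mic(frames: List[str]) -> AsyncIterator[str]:
    """Hypothetical audio source: yields one 'frame' (a word here) at a time."""
    for frame in frames:
        await asyncio.sleep(0)  # yield control, as a real socket read would
        yield frame


async def stt(audio: AsyncIterator[str], trace: List[str]) -> str:
    """Stand-in STT: logs a growing partial per frame, returns the final transcript."""
    words: List[str] = []
    async for frame in audio:
        words.append(frame)
        trace.append("stt partial: " + " ".join(words))
    return " ".join(words)


async def llm(transcript: str, trace: List[str]) -> AsyncIterator[str]:
    """Stand-in LLM: streams an echo reply token by token."""
    for token in ["You", "said:"] + transcript.split():
        await asyncio.sleep(0)
        trace.append("llm token: " + token)
        yield token


async def tts(tokens: AsyncIterator[str], trace: List[str]) -> AsyncIterator[bytes]:
    """Stand-in TTS: turns each token into an 'audio chunk' as soon as it arrives."""
    async for token in tokens:
        trace.append("tts audio: " + token)
        yield token.encode()


async def run_turn(frames: List[str], trace: List[str]) -> List[bytes]:
    transcript = await stt(mic(frames), trace)  # completes at end of the user turn
    # TTS pulls tokens from the LLM one at a time, so audio starts early.
    return [chunk async for chunk in tts(llm(transcript, trace), trace)]


if __name__ == "__main__":
    trace: List[str] = []
    asyncio.run(run_turn(["book", "a", "table"], trace))
    print("\n".join(trace))
```

In the printed trace, `tts audio: You` appears before `llm token: said:`, which is the overlap the pipeline relies on.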
Architecture B — speech-to-speech. Audio comes in. It goes directly to a multimodal model (OpenAI Realtime, Google Gemini Live) that ingests raw audio and emits raw audio without an intermediate text representation. The model handles the understanding, the response generation, and the speech synthesis as one unified step. There is no separate STT or TTS layer in your stack — they live inside the model.
This is the newer pattern. LiveKit's voice agent architecture article describes it as "audio in, audio out," with the speech-to-speech model handling everything internally. The win is latency: by skipping the explicit STT and TTS hops, speech-to-speech models can hit conversational round-trip times under 500 milliseconds, against roughly 600 milliseconds for the classical pipeline.
Why most production voice agents still use the pipeline
A reasonable question: if speech-to-speech is lower latency and conceptually simpler, why does the classical pipeline still dominate production?
Three reasons:
Provider freedom on each layer. With a pipeline, you pick the STT model, the LLM, and the TTS model independently of one another. You can use Deepgram Nova for STT, Claude or GPT for the LLM, and ElevenLabs or Cartesia for TTS. With speech-to-speech, you take what the vendor's audio-native model gives you. If you want a specific voice that the speech-to-speech vendor does not offer, you cannot have it.
Telephony integration depth. The classical pipeline maps cleanly onto telephony's media model: Twilio, Telnyx, or Daily streams audio to your service over WebRTC or SIP; you decode it; you run it through STT. Speech-to-speech telephony has matured fast and there are documented production patterns for OpenAI Realtime and Gemini Live behind SIP/WebRTC gateways, but the classical pipeline still has the broader integration ecosystem in 2026. See Telephony + LLM — the Architecture That Actually Ships for the integration patterns.
Eval and observability are easier with text in the middle. When the model's understanding and response are routed through visible text (the STT transcript and the LLM output), you can log, evaluate, and debug the conversation as a sequence of strings. In a pure speech-to-speech architecture, the model's internal state is audio-native and not as cleanly inspectable. For production debugging this matters.
The cost of the pipeline is a latency penalty of roughly 100 to 200 milliseconds relative to speech-to-speech, plus a more complex stack to maintain.
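The first and third reasons can be sketched together. The interfaces below (`SpeechToText`, `LanguageModel`, `TextToSpeech`) and the `Pipeline` class are hypothetical names invented for illustration, not any platform's SDK, and the sketch is deliberately non-streaming to stay short. It shows two things: each layer sits behind its own small interface, so it can be swapped, and every turn passes through text that can be logged as plain strings.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LanguageModel(Protocol):
    def reply(self, transcript: str) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


@dataclass
class Pipeline:
    """Hypothetical wrapper: any object with the right method fits each slot."""
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech
    turns: List[Dict[str, str]] = field(default_factory=list)

    def handle_turn(self, audio: bytes) -> bytes:
        transcript = self.stt.transcribe(audio)
        response = self.llm.reply(transcript)
        # Text in the middle: the whole exchange is loggable as strings.
        self.turns.append({"user": transcript, "agent": response})
        return self.tts.synthesize(response)


# Fake providers so the sketch runs offline; real ones would call vendor APIs.
class FakeSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class FakeLLM:
    def reply(self, transcript: str) -> str:
        return "You said: " + transcript


class FakeTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


if __name__ == "__main__":
    pipeline = Pipeline(FakeSTT(), FakeLLM(), FakeTTS())
    pipeline.handle_turn(b"what are your opening hours")
    print(pipeline.turns)
```

Swapping a provider means replacing one constructor argument. A speech-to-speech endpoint would replace all three slots at once, and the `turns` log would have to be rebuilt from whatever transcript events the vendor chooses to emit.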
The pipeline, layer by layer
Walking through the classical pipeline in the order audio flows:
1. Telephony / media transport. If this is a phone-based agent, calls arrive via a Twilio, Telnyx, or LiveKit integration. The provider terminates the call, encodes the audio (typically G.711 μ-law at 8 kHz on the phone-network leg, Opus on WebRTC legs), and streams it to your service over WebRTC, SIP, or a WebSocket media stream. If this is a web-based voice agent, the browser captures audio via the WebRTC API and streams it to your service the same way.
2. Voice activity detection (VAD) and turn-taking. The agent needs to know when the user has finished speaking so it can start responding. A VAD model — usually a small classifier running on a sliding window of audio — decides this. The naive version waits for a fixed silence period; better versions use an end-of-utterance detection model that reads prosody and intonation. Retell's production breakdown puts VAD and turn-taking at 150 to 300 milliseconds of latency, and it is one of the dominant contributors to total round-trip time.
3. Streaming speech-to-text. The audio is fed continuously to an STT model that emits partial transcripts as it goes. The major options in 2026: OpenAI Whisper large-v3 (multilingual, the standard open-source fallback) and its faster sibling large-v3-turbo, the community Distil-Whisper variants for lower latency, Deepgram Nova (commercial, low latency), and NVIDIA Parakeet for ultra-low-latency streaming. The Open ASR Leaderboard is the standard reference for current English WER rankings if you want to pick on accuracy. Streaming STT emits partials every ~50 milliseconds; the final transcript arrives shortly after end of speech, often masked by the VAD turn-detection wait.
4. The LLM. The transcript goes to the LLM, which streams tokens back. Time-to-first-token is the variable that matters most here — typically 150 to 400 milliseconds for a frontier model on a short conversational turn, longer if the system prompt is long, longer still if you have not enabled prompt caching. For voice, picking a fast model usually beats picking a "smart" model that takes 1.5 seconds to first token. See Latency Budgets for Real-Time Voice for the budget math.
5. Streaming text-to-speech. As the LLM's tokens arrive, they are fed to a streaming TTS model that emits audio chunks. Modern cloud TTS (OpenAI, Gemini, ElevenLabs, Cartesia) typically hits 100 to 200 milliseconds time-to-first-audio, with subsequent chunks arriving every few hundred milliseconds. The chunks are played back to the user as they arrive.
6. Audio back to the user. The same telephony or WebRTC channel that brought audio in carries audio back out. The user hears speech start while the rest of the response is still being generated.
The entire pipeline streams. There is no point at which one stage waits for the previous stage to fully complete. This is the design discipline that makes the latency budget work; abandon it anywhere and the whole stack collapses to "round-trip = sum of all stages serialized," which is unusable.
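The arithmetic behind that claim, as a rough sketch. The first three ranges are the figures quoted in the steps above; the two "full output" ranges are assumptions made up for illustration, since the time to generate and synthesize a whole response depends heavily on its length. Network hops are ignored, and the final STT transcript is treated as masked by the VAD wait, as step 3 describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ms:
    """A latency range in milliseconds."""
    low: int
    high: int

    def __add__(self, other: "Ms") -> "Ms":
        return Ms(self.low + other.low, self.high + other.high)


# Figures quoted in this article.
VAD_TURN_DETECTION = Ms(150, 300)
LLM_FIRST_TOKEN = Ms(150, 400)
TTS_FIRST_AUDIO = Ms(100, 200)

# ASSUMED, illustrative only: time for a stage to finish its whole output.
LLM_FULL_RESPONSE = Ms(1000, 2000)
TTS_FULL_SYNTHESIS = Ms(800, 1500)

# Streaming: each stage starts on the first output of the previous one.
streamed = VAD_TURN_DETECTION + LLM_FIRST_TOKEN + TTS_FIRST_AUDIO

# Serialized: each stage waits for the previous one to finish completely.
serialized = VAD_TURN_DETECTION + LLM_FULL_RESPONSE + TTS_FULL_SYNTHESIS

print(f"streamed:   {streamed.low}-{streamed.high} ms to first audio")
print(f"serialized: {serialized.low}-{serialized.high} ms to first audio")
```

With these inputs the streamed path lands at 400 to 900 milliseconds to first audio, while the serialized path lands at 1950 to 3800 milliseconds. The exact serialized number is only as good as the assumed full-output times, but the gap is the point.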
The speech-to-speech variant, simplified
The speech-to-speech architecture collapses steps 3, 4, and 5 into the model:
- Telephony / media transport (same as classical).
- VAD and turn-taking (still needed — the speech-to-speech model needs to know when to respond).
- Audio frames to the speech-to-speech endpoint (OpenAI Realtime, Gemini Live).
- Audio frames back from the endpoint.
- Audio back to the user (same as classical).
The model handles the understanding, the response generation, and the speech synthesis as one unified call. The wire format is a bidirectional event stream — audio frames in, audio frames out, with metadata events for things like tool calls and partial transcripts.
The win: roughly 100 to 200 milliseconds of latency saved by not serializing through text. The trade-off: you give up the flexibility of picking each layer independently, and observability gets harder.
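A sketch of what consuming such a bidirectional event stream can look like on the client side. The event names (`audio.delta`, `transcript.delta`, `tool.call`) and their fields are hypothetical, chosen for illustration; they are not the OpenAI Realtime or Gemini Live schema, so check the vendor's reference for the real event types. Base64-encoded audio inside JSON events is likewise an assumption of this sketch.

```python
import base64
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class SessionState:
    audio_out: bytearray = field(default_factory=bytearray)  # audio to play back
    transcript: List[str] = field(default_factory=list)      # partial transcript text
    tool_calls: List[dict] = field(default_factory=list)     # pending tool requests
    ignored: List[str] = field(default_factory=list)         # unrecognized event types


def handle_event(raw: str, state: SessionState) -> None:
    """Route one server event. Event names and fields are HYPOTHETICAL."""
    event = json.loads(raw)
    kind = event.get("type", "")
    if kind == "audio.delta":
        state.audio_out.extend(base64.b64decode(event["audio"]))
    elif kind == "transcript.delta":
        state.transcript.append(event["text"])
    elif kind == "tool.call":
        state.tool_calls.append(
            {"name": event["name"], "arguments": event["arguments"]}
        )
    else:
        # Keep unknown types visible instead of silently dropping them.
        state.ignored.append(kind)


if __name__ == "__main__":
    state = SessionState()
    incoming = [
        {"type": "audio.delta", "audio": base64.b64encode(b"\x00\x01").decode()},
        {"type": "transcript.delta", "text": "Sure, "},
        {"type": "tool.call", "name": "lookup_order", "arguments": {"id": "42"}},
    ]
    for message in incoming:
        handle_event(json.dumps(message), state)
    print(len(state.audio_out), "".join(state.transcript), state.tool_calls)
```

Collecting the transcript events is also the main observability hook available in this architecture, since there is no STT or LLM text to log otherwise.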
Barge-in and turn-taking are the user-facing details
Two specific behaviors separate voice agents that feel alive from voice agents that feel like a 1990s IVR system:
Barge-in. When the user starts speaking while the agent is still talking, the agent should stop. Immediately. Then listen, and respond to what the user just said. Without barge-in, the user has to either yell over the agent or wait for it to finish, and the conversation feels one-directional. The implementation is a continuous VAD running on the user's input channel even while the agent is talking, with a cancel signal sent to the TTS stream when user speech is detected.
Turn-taking. When the user pauses mid-sentence, the agent should not immediately leap in to respond, because that produces awkward overlap. When the user stops at the end of their thought, the agent should respond quickly without waiting through several seconds of silence. End-of-utterance prediction is hard; the best models look at prosodic and syntactic cues (falling intonation, completed grammatical units) rather than pure silence-duration heuristics.
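The barge-in mechanism described above, a VAD that keeps running during playback and cancels the TTS stream on user speech, can be sketched with `asyncio` task cancellation. Everything here is a stand-in: `play_tts` pretends to play chunks, and `vad_frames` is a pre-computed list of speech/no-speech flags rather than output from a real VAD model. A production version would also need to flush audio already buffered in the transport.

```python
import asyncio
from typing import List, Tuple


async def play_tts(chunks: List[str], played: List[str]) -> None:
    """Stand-in playback loop: 'plays' one chunk, then yields control."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0)  # a real player would await the audio device


async def watch_for_speech(vad_frames: List[bool], playback: asyncio.Task) -> bool:
    """Check VAD output while the agent talks; cancel playback on user speech."""
    for is_speech in vad_frames:
        await asyncio.sleep(0)  # let playback make progress between frames
        if is_speech:
            playback.cancel()   # the cancel signal sent to the TTS stream
            return True
    return False


async def agent_turn(chunks: List[str], vad_frames: List[bool]) -> Tuple[List[str], bool]:
    played: List[str] = []
    playback = asyncio.create_task(play_tts(chunks, played))
    interrupted = await watch_for_speech(vad_frames, playback)
    try:
        await playback
    except asyncio.CancelledError:
        pass  # expected when the user barged in
    return played, interrupted


if __name__ == "__main__":
    chunks = [f"chunk-{i}" for i in range(50)]
    played, interrupted = asyncio.run(agent_turn(chunks, [False, False, True]))
    print(f"interrupted={interrupted}, played {len(played)} of {len(chunks)} chunks")
```

The design choice worth noting is that playback and listening are separate tasks. If they shared one loop, the agent could not hear the user until it had finished talking.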
Both behaviors are platform-level concerns that the voice agent platforms (Vapi, Retell, Bland, LiveKit Agents) have spent years tuning. Building them from scratch is non-trivial and is one of the main reasons to use a platform rather than wiring the layers together yourself.
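To see why, consider the naive silence-duration heuristic in code. This is a sketch of the fixed-silence approach only, not of the prosody-based models the platforms use; `end_of_turn_frame` and its parameters are hypothetical names, and the input is a list of per-frame speech flags assumed to come from a VAD.

```python
from typing import Iterable, Optional


def end_of_turn_frame(
    speech_flags: Iterable[bool], frame_ms: int, silence_ms: int
) -> Optional[int]:
    """Return the index of the frame at which the turn is declared over.

    The turn ends once `silence_ms` of continuous silence follows some speech.
    Returns None if that never happens in the given frames.
    """
    heard_speech = False
    silent_run_ms = 0
    for index, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silent_run_ms = 0
        elif heard_speech:
            silent_run_ms += frame_ms
            if silent_run_ms >= silence_ms:
                return index
    return None


# 20 ms frames: 100 ms of speech, a 300 ms mid-sentence pause,
# 100 ms more speech, then 600 ms of trailing silence.
flags = [True] * 5 + [False] * 15 + [True] * 5 + [False] * 30

short_timeout = end_of_turn_frame(flags, frame_ms=20, silence_ms=200)
long_timeout = end_of_turn_frame(flags, frame_ms=20, silence_ms=500)

print("200 ms timeout fires at frame", short_timeout)  # inside the pause
print("500 ms timeout fires at frame", long_timeout)   # after the real end
```

The 200 millisecond timeout fires at frame 14, in the middle of the user's pause, so the agent interrupts. The 500 millisecond timeout survives the pause and fires at frame 49, but every turn now carries half a second of dead air before the agent can start. No single threshold fixes both cases, which is the gap end-of-utterance models are meant to close.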
Where this hub goes next
The rest of this cluster goes deeper on each piece:
- Latency Budgets for Real-Time Voice walks through the layer-by-layer budget and what you can move where.
- Voice Agent Platforms Compared covers Vapi, Retell, Bland, LiveKit Agents, and when to use a platform versus build from primitives.
- Telephony + LLM — the Architecture That Actually Ships goes into the call-handling integration patterns.
- TTS in 2026 — What Got Good and What Still Hasn't covers the synthesis layer in depth.
The shape of the architecture has been stable for about eighteen months. The model layers underneath move every few months. Pick the architecture first, then keep the model layers swappable.