As of 2026-05-30

If you have written a chat-style LLM application, the jump to a phone-based voice agent is mostly about understanding the telephony layer underneath. The LLM part is essentially the same. The new vocabulary is media transport, voice activity detection, barge-in, and the standard ways that a phone call becomes a stream of audio frames your service can do something with.

This article is the working engineer's read of how the telephony plus LLM stack actually fits together in 2026, the four providers that own the space, and the failure modes that surface when you wire it together yourself.

What the telephony layer does

A phone call originates somewhere (a regular phone, a softphone, a SIP client) and traverses the PSTN (public switched telephone network) or a VoIP network until it reaches a number you control. That number is provisioned with one of the telephony providers, and the provider terminates the call: it handles the signaling and the carrier-side audio (usually G.711 or Opus) and exposes the call to your application as a media stream.

The standard provider responsibilities:

  • Phone number provisioning (DIDs). A direct inward dial number you own, mapped to your application.
  • Inbound call routing. When the number rings, the provider notifies your application and hands off the audio stream.
  • Outbound call origination. You tell the provider to dial a number; it places the call and bridges the audio stream to your application.
  • Media transport. Audio frames flow between the caller and your service over WebRTC (for browser-based participants), SIP/RTP (for traditional telephony), or a vendor-specific streaming protocol.
  • Call lifecycle events. Ringing, answered, busy, no-answer, hangup, DTMF presses — all surfaced as events your application can react to.
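
Event names and payloads differ from provider to provider, so the following is only a minimal sketch of how an application might track the lifecycle events listed above. The event names (`ringing`, `answered`, `busy`, `no-answer`, `hangup`, `dtmf`), the `Call` class, and the `TRANSITIONS` table are all hypothetical; map your provider's real webhook or event names onto them.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical event names, not any real provider's schema.
# Keys are (current_state, event); values are the next state.
TRANSITIONS = {
    ("idle", "ringing"): "ringing",
    ("ringing", "answered"): "in_call",
    ("ringing", "busy"): "ended",
    ("ringing", "no-answer"): "ended",
    ("ringing", "hangup"): "ended",
    ("in_call", "hangup"): "ended",
}


@dataclass
class Call:
    call_id: str
    state: str = "idle"
    digits: List[str] = field(default_factory=list)

    def handle(self, event: str, digit: Optional[str] = None) -> str:
        """Apply one lifecycle event and return the resulting state."""
        if event == "dtmf":
            # Keypad presses only make sense while the call is connected.
            if self.state == "in_call" and digit is not None:
                self.digits.append(digit)
            return self.state
        # Unknown or out-of-order events leave the state unchanged.
        self.state = TRANSITIONS.get((self.state, event), self.state)
        return self.state
```

Keeping the transitions in a table makes late or duplicated events harmless: a stray `answered` that arrives after `hangup` simply does nothing.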

What you do not get from the telephony layer: voice activity detection, turn-taking, STT, the LLM, TTS. Those are your responsibility (or the responsibility of the voice agent platform you use on top of the telephony provider).

The four providers worth knowing

Twilio. The reference telephony provider. Broadest international coverage, the most mature developer tools, the largest community, and typically the highest pricing. Twilio's Media Streams API is the standard way to receive raw audio from a call: you point Twilio at a WebSocket endpoint you host, and it sends you the call's audio as base64-encoded 8 kHz mu-law frames wrapped in JSON messages; you do whatever you want with them. For most production voice agents, Twilio is the safe default if budget is not the binding constraint.
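
To make "base64-encoded mu-law frames" concrete, here is a minimal sketch of turning one incoming WebSocket message into linear PCM. It assumes the message is JSON with an `event` field and, for audio, a base64 string at `media.payload`; treat that shape as an assumption and check it against the current Media Streams documentation. The helper names are hypothetical. The mu-law expansion itself is the standard G.711 algorithm, written out by hand because Python's `audioop` module was removed in 3.13.

```python
import base64
import json
import struct
from typing import Optional


def mulaw_byte_to_pcm16(b: int) -> int:
    """Expand one G.711 mu-law byte into a signed 16-bit linear sample."""
    u = ~b & 0xFF                # mu-law bytes are transmitted bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_media_message(raw: str) -> Optional[bytes]:
    """Return little-endian PCM16 audio for a media message, else None."""
    msg = json.loads(raw)
    if msg.get("event") != "media":
        return None              # non-audio messages carry no payload
    mulaw = base64.b64decode(msg["media"]["payload"])
    samples = [mulaw_byte_to_pcm16(b) for b in mulaw]
    return struct.pack(f"<{len(samples)}h", *samples)
```

In a real service this function would sit inside the WebSocket receive loop, and its output would feed the VAD and STT stages. A per-byte Python loop is fine for a sketch; a 256-entry lookup table is the usual optimization.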

Telnyx. The cheaper, technically strong alternative. Telnyx's Voice API exposes essentially the same primitives as Twilio (media streaming, call control, DTMF), often at meaningfully lower per-minute pricing. The trade-off is a smaller community and a slightly less polished developer experience. For volume workloads where Twilio's pricing becomes the dominant cost line, Telnyx is the standard alternative.

Daily. WebRTC-first. Daily's strength is browser-and-mobile real-time audio with very low latency and good media quality. If your voice agent is primarily web-based — embedded in a SaaS product, called via a "talk to support" button rather than a phone number — Daily is often a cleaner fit than Twilio. Daily also supports SIP for telephony integration, but it is not their primary surface.

LiveKit. The full WebRTC media stack plus an Agents framework for building voice agents on top. LiveKit is what you pick when you want the media plane and the agent runtime in the same stack, especially for use cases involving multi-party audio (group calls, virtual rooms, real-time translation across participants). It is open source and self-hostable if you have a reason to run it yourself. See Voice Agent Platforms Compared for how LiveKit Agents fits in the broader platform comparison.

The standard architecture

Bringing it together, the realistic shape of a production telephony plus LLM application:

  1. Telephony provider terminates the call (Twilio, Telnyx, Daily, or LiveKit). The provider authenticates the call, handles the PSTN/SIP layer, and exposes the call as a media stream plus a control channel.

  2. Media gateway — your service, or the voice agent platform you are using — receives the audio stream. The gateway handles:

    • Audio decoding from the wire format (mu-law, Opus, etc.) into PCM frames.
    • Voice activity detection — deciding when the user is speaking vs silent.
    • Turn-taking — deciding when the user has finished their thought.
    • Barge-in — stopping the agent's audio output when the user starts speaking.
    • Audio encoding back into the wire format for outbound audio.

  3. STT / LLM / TTS pipeline. This is the conversational AI proper: streaming audio in, streaming text through the LLM, streaming audio out. See How a Voice Agent Actually Works End to End for the layer-by-layer pipeline.

  4. Tool execution and side effects. When the LLM decides to look up an account, transfer to a human, send a confirmation email, or schedule a callback, your application executes those tools and feeds the results back into the conversation.

  5. Call lifecycle and termination. When the user hangs up (or the agent transfers the call, or the agent ends it after the conversation completes), the telephony provider closes the call and emits a final event. Your application logs the call, records billing, and persists whatever state needs to live past the call.

Layers 2 and 3 (media gateway plus pipeline) are the bulk of what the voice agent platforms ship. Choosing and integrating the telephony layer (1), executing side effects (4), and handling call termination and persistence (5) remain your responsibility regardless of which platform you use.
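
Of the gateway responsibilities in step 2, barge-in is the easiest to show in a few lines. The sketch below is an illustration under stated assumptions, not how any particular platform implements it: `PlaybackController` and its methods are hypothetical, and it covers only the local queue of agent audio waiting to be sent.

```python
from collections import deque
from typing import Deque, Optional


class PlaybackController:
    """Holds agent audio frames awaiting transmission; drops them on barge-in."""

    def __init__(self) -> None:
        self._queue: Deque[bytes] = deque()

    def enqueue(self, frame: bytes) -> None:
        """Called by the TTS stage as synthesized frames arrive."""
        self._queue.append(frame)

    def next_frame(self) -> Optional[bytes]:
        """Called by the sender loop; returns None when nothing is pending."""
        return self._queue.popleft() if self._queue else None

    def on_user_speech_start(self) -> int:
        """Barge-in: discard all pending agent audio, return frames dropped."""
        dropped = len(self._queue)
        self._queue.clear()
        return dropped
```

Clearing the local queue is only part of the job. A real gateway also has to flush any audio the provider has already buffered and cancel the in-flight TTS and LLM generation, otherwise the agent keeps talking over the caller for a moment after the interruption.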

What people actually get wrong wiring this themselves

Three failure modes that show up across teams who roll their own telephony plus LLM stack:

Underestimating the media gateway. "Just stream audio to my STT" looks simple in the diagram. In practice the gateway has to handle audio decoding, jitter buffering, partial frames, packet loss, codec negotiation, mid-call codec changes, mute/unmute events, and a dozen other operational concerns. The Twilio Media Streams documentation makes this look like a one-evening project; production-quality gateway code is months of engineering.

Tuning VAD and turn-taking from scratch. End-of-utterance detection is a research problem. Voice agent platforms have models trained on millions of call minutes. A naive 200-millisecond silence threshold works on quiet lab audio and breaks immediately on background noise, accents, slow speakers, fast speakers, hold-and-resume patterns, and phone-system artifacts.
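
For reference, this is roughly what the naive approach looks like: an energy gate plus a fixed silence timer. It is a sketch of the technique the paragraph above warns against, not a recommendation. The 20 ms frame size, the RMS cutoff of 500, and the `NaiveEndpointer` class are illustrative assumptions; input is assumed to be little-endian 16-bit PCM.

```python
import math
import struct

FRAME_MS = 20             # assumed duration of each incoming frame
SILENCE_MS = 200          # the naive end-of-utterance threshold
ENERGY_THRESHOLD = 500.0  # assumed RMS cutoff on 16-bit samples


def rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a little-endian PCM16 frame."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


class NaiveEndpointer:
    def __init__(self) -> None:
        self.heard_speech = False
        self.silent_ms = 0

    def push(self, frame: bytes) -> bool:
        """Return True once silence following speech reaches SILENCE_MS."""
        if rms(frame) >= ENERGY_THRESHOLD:
            self.heard_speech = True
            self.silent_ms = 0
            return False
        if not self.heard_speech:
            return False  # leading silence before the user says anything
        self.silent_ms += FRAME_MS
        return self.silent_ms >= SILENCE_MS
```

The weaknesses are visible in the code. A caller who pauses for a quarter of a second mid-sentence is cut off, and steady background noise above the cutoff means the timer never fires at all.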

Underestimating reliability requirements. A 1-percent failure rate on a chat application is annoying. A 1-percent failure rate on a phone-based customer support agent is unacceptable. Voice agents need clean handling of dropped calls, mid-call provider failovers, partial pipeline failures, and graceful degradation. Building this reliability from scratch is real work.
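
One small building block for graceful degradation is a timeout with a fallback around each pipeline stage, so that a stalled dependency produces a degraded response instead of dead air. The sketch below uses only the standard library; `with_fallback`, `slow_tts`, `broken_tts`, and `canned_prompt` are hypothetical names, and the set of exceptions worth catching depends on the clients you actually use.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_fallback(
    primary: Callable[[], Awaitable[T]],
    fallback: Callable[[], Awaitable[T]],
    timeout_s: float,
) -> T:
    """Await primary(); on timeout or connection error, use fallback()."""
    try:
        return await asyncio.wait_for(primary(), timeout=timeout_s)
    except (asyncio.TimeoutError, ConnectionError):
        return await fallback()


# Stand-ins for real pipeline stages, for illustration only.
async def slow_tts() -> bytes:
    await asyncio.sleep(1.0)          # simulates a stalled TTS provider
    return b"synthesized-audio"


async def broken_tts() -> bytes:
    raise ConnectionError("TTS provider unreachable")


async def canned_prompt() -> bytes:
    return b"prerecorded-apology"     # e.g. "One moment, please."
```

Calling `asyncio.run(with_fallback(slow_tts, canned_prompt, timeout_s=0.05))` returns the prerecorded audio after roughly 50 ms rather than leaving the caller in silence for a full second. This covers one stage only; dropped calls and provider failover need their own handling.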

The implication: unless you have a specific reason to build the media gateway yourself, use a voice agent platform. Twilio plus a voice agent platform (Vapi, Retell, Bland, LiveKit Agents) is the standard production architecture in 2026 for good reasons.

Outbound vs inbound

Two patterns worth distinguishing because they have different operational shapes:

Inbound. A user calls a number you control. Your application picks up. The flow is reactive: respond to whatever the user says. The challenges are call quality, voice clarity, and the conversational handling itself.

Outbound. Your application dials a number. The user picks up. The flow is goal-directed: the agent has something it is trying to do (qualify a lead, confirm an appointment, follow up on an issue). The challenges include detecting whether a human or voicemail picked up, navigating IVR menus on the other end, handling holds and transfers, and, crucially, regulatory compliance (the Telephone Consumer Protection Act, or TCPA, in the US, and similar regimes elsewhere). Outbound at scale is a legal compliance exercise in addition to a technical one. Bland AI is specifically focused on this use case for that reason.

For most teams the right starting point is inbound: simpler to ship, lower legal risk, the same underlying voice agent technology. Outbound at volume opens additional concerns that are worth understanding before you build it.