As of 2026-05-30
Text-to-speech is the layer of the voice AI stack that quietly became extremely good while no one was paying attention. In 2022, AI-generated speech mostly sounded synthetic: flat prosody, odd emphasis, occasional artifacts. By 2026, the major proprietary systems sit close enough to human-recorded speech in blind evaluations that the question has shifted from "can users tell" to "does it matter."
This article is the working engineer's read of where TTS actually is in 2026, the vendors that own the production space, what the open-source alternatives can do, and the things synthesis still fails at.
The MOS-4.2 club: proprietary TTS in 2026
Mean Opinion Score (MOS) is the standard subjective quality metric for synthesized speech — a 1-to-5 scale where human listeners rate how natural the audio sounds. Human-recorded speech typically scores 4.4 to 4.6 on this scale. The threshold for "near-indistinguishable from a person" is roughly 4.2.
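To make the metric concrete, here is a minimal sketch of how a MOS figure could be computed from raw listener ratings. The ratings are invented for illustration, and the normal-approximation confidence interval is an assumption of this sketch, not something the cited benchmarks describe.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mos, ci95_half_width) for a list of 1-to-5 listener ratings."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-to-5 scale")
    mos = statistics.mean(ratings)
    # Normal-approximation 95% interval: 1.96 * standard error of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings from ten listeners for a single clip.
ratings = [4, 5, 4, 4, 5, 4, 3, 5, 4, 4]
mos, half_width = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")  # prints: MOS = 4.20 +/- 0.39
```

With only ten listeners the interval is far wider than a 0.1-point difference between two systems, which is worth keeping in mind when comparing published MOS numbers.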
Per Inworld's 2026 speech-to-speech benchmark roundup and a 2026 TTS comparison piece from Coval AI, the major proprietary TTS systems in 2026 sit in the 4.2 to 4.3 MOS range for English:
- OpenAI TTS (the same family that powers GPT-4o's speech output) — strong naturalness, good latency, integrated with the broader OpenAI stack.
- Google Gemini TTS — equivalent quality tier, native integration with Gemini Live.
- ElevenLabs — historically the quality leader; broad voice catalog, strong emotional expression, voice cloning available.
- Cartesia — specialist for low-latency, controllable voices; popular in voice-agent platform stacks for its TTFA characteristics.
- Deepgram Aura — Deepgram's TTS, optimized for real-time agent use; pairs naturally with their Nova STT.
The gap between these five is small enough that the choice comes down to four things: voice catalog (ElevenLabs has the widest), latency (Cartesia and Deepgram Aura target the lowest time-to-first-audio, or TTFA), integration with the rest of your stack (OpenAI if you are already on GPT-4o; Gemini if you are on Google), and price.
Open-source TTS closed most of the gap
The story that did not get enough coverage: open-source TTS has gotten extremely good.
Kokoro is the headline open-source TTS as of mid-2026. A 2026 TTS benchmark video puts it just behind the major proprietary systems on MOS for English, often within 0.2 points. For a permissively licensed open model that runs on modest hardware, this is the largest single quality jump open TTS has seen.
Piper covers the lightweight-neural-TTS space. Designed for edge devices and mobile, Piper trades some quality for footprint: the model runs comfortably on a Raspberry Pi or in-browser. MOS scores are above 4.0 in good configurations, below Kokoro but acceptable for many use cases.
Other open options. A long tail of models has emerged (XTTS, OuteTTS, MaskGCT, and others) covering specific niches — multilingual cloning, expressive prosody, specific language coverage. The space moves fast; current rankings shift every few months.
The implication: for accessibility tooling, internal-use products, high-volume content generation, or anywhere per-character pricing matters, open-source TTS in 2026 is genuinely viable in a way it was not eighteen months ago. The decision is no longer "use the proprietary one or accept robotic synthesis" — it is "does the quality gap matter enough for your use case to pay the per-character premium."
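The per-character trade-off can be sketched as simple break-even arithmetic. Every figure below is a placeholder chosen for illustration, not a real vendor price or hosting quote; substitute your own numbers.

```python
def monthly_api_cost(chars_per_month, price_per_million_chars):
    """Cost in dollars of a per-character API at a given monthly volume."""
    return chars_per_month / 1_000_000 * price_per_million_chars


def break_even_chars(self_host_monthly_cost, price_per_million_chars):
    """Monthly character volume above which self-hosting is cheaper."""
    return self_host_monthly_cost / price_per_million_chars * 1_000_000


# Placeholder figures (assumptions, not real prices).
PRICE_PER_MILLION_CHARS = 15.0   # dollars per million characters
SELF_HOST_MONTHLY_COST = 300.0   # dollars per month for hardware and ops

volume = 50_000_000  # characters synthesized per month
api_cost = monthly_api_cost(volume, PRICE_PER_MILLION_CHARS)
threshold = break_even_chars(SELF_HOST_MONTHLY_COST, PRICE_PER_MILLION_CHARS)

print(f"API cost at {volume:,} chars: ${api_cost:,.2f}")
print(f"Self-hosting breaks even at {threshold:,.0f} chars per month")
```

Under these assumed numbers the API bill is $750 per month and self-hosting wins above 20 million characters; the quality gap then has to justify the difference.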
Latency: the under-discussed dimension
Quality dominates the discourse, but for voice agents latency is what determines whether TTS feels live or not.
Production TTS targets in 2026, per Retell's latency breakdown:
- Time-to-first-audio (TTFA): 100 to 200 milliseconds from the TTS receiving the first text tokens to the first audio chunk being available.
- Chunk size: 200 to 400 milliseconds of audio per chunk in streaming mode.
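A rough way to check the TTFA target against any streaming backend is to time the arrival of the first non-empty chunk. The sketch below assumes the backend exposes an iterable of raw audio bytes; `fake_tts_stream` is a hypothetical stand-in that yields silent 16-bit mono PCM, not a real vendor call.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttfa(chunks: Iterable[bytes]) -> Tuple[float, int]:
    """Consume an audio stream; return (ttfa_seconds, total_bytes)."""
    start = time.perf_counter()
    ttfa = None
    total = 0
    for chunk in chunks:
        if ttfa is None and chunk:
            # First non-empty chunk: this is the time-to-first-audio.
            ttfa = time.perf_counter() - start
        total += len(chunk)
    if ttfa is None:
        raise RuntimeError("stream produced no audio")
    return ttfa, total


def fake_tts_stream(first_delay_s: float = 0.05, n_chunks: int = 3,
                    chunk_ms: int = 200,
                    sample_rate: int = 24_000) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call (silent PCM)."""
    bytes_per_chunk = sample_rate * 2 * chunk_ms // 1000  # 2 bytes per sample
    time.sleep(first_delay_s)  # simulated model and network delay
    for _ in range(n_chunks):
        yield bytes(bytes_per_chunk)


ttfa_s, total_bytes = measure_ttfa(fake_tts_stream())
print(f"TTFA: {ttfa_s * 1000:.0f} ms, received {total_bytes} bytes")
```

Against a real provider, replace `fake_tts_stream()` with the provider's streaming iterator and run the measurement many times, since a single sample says little about tail latency.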
The vendors hit this differently. Cartesia and Deepgram Aura target the lower end. OpenAI TTS lands in the middle. ElevenLabs has historically been slower on TTFA — quality-tuned more than latency-tuned — though they have been closing the gap.
The practical implication for voice agent design: if your TTS adds 500 milliseconds to TTFA, your entire latency budget breaks. The TTS provider choice is a real engineering decision, not just a brand preference.
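As a worked example of that arithmetic, the sketch below sums per-stage latencies against a total response budget. The 800 ms budget and the upstream stage figures are assumptions made for illustration; only the 500 ms TTS case comes from the paragraph above.

```python
def budget_slack_ms(total_budget_ms, stage_latencies_ms):
    """Return remaining milliseconds; a negative value means over budget."""
    return total_budget_ms - sum(stage_latencies_ms.values())


# Assumed figures for illustration only.
ASSUMED_TOTAL_BUDGET_MS = 800
ASSUMED_UPSTREAM_MS = {
    "stt_finalization": 200,
    "llm_first_token": 350,
    "network": 100,
}

slack_by_ttfa = {}
for tts_ttfa_ms in (150, 500):
    stages = dict(ASSUMED_UPSTREAM_MS, tts_ttfa=tts_ttfa_ms)
    slack = budget_slack_ms(ASSUMED_TOTAL_BUDGET_MS, stages)
    slack_by_ttfa[tts_ttfa_ms] = slack
    status = "within budget" if slack >= 0 else "over budget"
    print(f"TTS TTFA {tts_ttfa_ms} ms: slack {slack} ms ({status})")
```

With these assumed stages, a 150 ms TTS exactly fits, while a 500 ms TTS overshoots by 350 ms with no other stage changing.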
What synthesis still does badly
Despite the quality progress, several specific failures persist:
Emotion the script does not explicitly call out. TTS can render "I'm so sorry to hear that" sympathetically when the input is well-formatted, but it does not infer emotion from the situation. If your LLM produces a neutral response to a user who is clearly upset, the TTS will not soften it. You need to either tag the text with emotion markers or use a speech-to-speech model that does the inference inside the model.
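The tagging approach could look something like the sketch below, where a mood label detected upstream selects a style marker that is prepended to the LLM reply. The bracketed tag syntax and the mood labels are hypothetical; each vendor defines its own markup for style control, so treat this as the shape of the idea rather than a working integration.

```python
from typing import Optional

# Hypothetical mapping from detected user mood to a TTS style marker.
STYLE_BY_MOOD = {
    "upset": "sympathetic",
    "excited": "upbeat",
}


def tag_reply(text: str, user_mood: Optional[str]) -> str:
    """Prefix the reply with a style marker when the mood calls for one."""
    style = STYLE_BY_MOOD.get(user_mood) if user_mood else None
    if style is None:
        return text  # neutral delivery: leave the text untouched
    return f"[{style}] {text}"


tagged = tag_reply("I'm so sorry to hear that.", "upset")
print(tagged)
```

The key design point is that the mood signal comes from outside the TTS, for example from the LLM or a sentiment classifier, because the synthesizer only sees the text it is handed.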
Long-form expressive reading. A 2,000-word passage that needs to vary pace, emphasis, and tone across paragraphs is still uneven. Synthesized speech tends to flatten over long stretches, losing the natural variation a human reader would bring. For audiobook-style content, this matters.
Code-switching mid-sentence. A sentence that mixes English and another language ("Let me check ese archivo for you") often produces unnatural pronunciation at the language boundary. Multilingual TTS has improved a lot for monolingual passages; mid-sentence switches remain a weakness.
Less-resourced languages. English is at MOS 4.2 to 4.3. Other major languages (Mandarin, Spanish, German, French, Japanese) are usually within 0.2 to 0.3 points of that. Lower-resourced languages can be a full point below, sometimes more. If your product ships in many languages, expect uneven quality.
Multispeaker dialogue. Generating a conversation between two distinct voices with appropriate turn-taking, prosody handoffs, and emotional consistency is still hard. Most TTS engines render one speaker at a time and stitch the output; full multi-speaker rendering is a research-grade feature.
Real-time reaction. TTS renders a known script. If the agent needs to react to events mid-utterance — "wait, hold on, I see you just typed something" — the synthesis has to stop, possibly mid-word, and restart. The barge-in mechanics handle this clumsily at the audio layer; truly fluid mid-utterance reaction is not yet a solved problem.
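A minimal sketch of the audio-layer barge-in described above: playback checks an interrupt flag before each chunk and stops at the next chunk boundary. The `play` callback and the chunk contents are placeholders; a real agent would set the event from its voice-activity detector.

```python
import threading
from typing import Callable, Iterable


def play_until_interrupted(chunks: Iterable[bytes],
                           play: Callable[[bytes], None],
                           interrupted: threading.Event) -> int:
    """Play chunks in order; stop at the first boundary after barge-in.

    Returns the number of chunks actually played.
    """
    played = 0
    for chunk in chunks:
        if interrupted.is_set():
            break  # drop the rest of the utterance
        play(chunk)
        played += 1
    return played


# Demo: simulate the user starting to speak while chunk 2 is playing.
barge_in = threading.Event()
heard = []


def fake_play(chunk: bytes) -> None:
    heard.append(chunk)
    if len(heard) == 2:
        barge_in.set()


count = play_until_interrupted([b"a", b"b", b"c", b"d"], fake_play, barge_in)
print(f"played {count} of 4 chunks")  # prints: played 2 of 4 chunks
```

Because the cut lands on a chunk boundary rather than a word boundary, smaller chunks react faster but the stop can still fall mid-word, which is the clumsiness the paragraph describes.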
Voice cloning, separately
Voice cloning sits in a different product category from standard TTS. The technical requirements differ (few-shot voice learning), the ethical and legal concerns differ (consent, impersonation, copyright), and the use cases differ (content production, dubbing, and accessibility rather than live agents).
The state in 2026:
- ElevenLabs is the most visible vendor; their voice cloning is the de facto reference for the category.
- OpenAI, Cartesia, and others ship limited cloning capabilities under various consent gates.
- Open-source options (XTTS family, some others) make voice cloning increasingly accessible, with all the safety and consent concerns that implies.
For voice agents specifically, cloned voices are rarely the right call: users tend to find that an obviously AI voice sets clearer expectations than a near-perfect clone of a specific person. Cloning fits better in content production, dubbing, and accessibility tools.
What to pick
A short decision guide for the TTS layer of a voice agent in 2026:
- Default for production voice agents: OpenAI TTS or Cartesia, for their balance of latency and quality.
- If voice variety matters more than latency: ElevenLabs.
- If you are already on Gemini Live: Gemini TTS (it is just part of the same stack).
- If you need to self-host or per-character cost is critical: Kokoro.
- If you are deploying to edge devices or browsers: Piper.
- If quality on less-resourced languages is essential: evaluate each vendor on your specific language; assume nothing.
The TTS layer is one of the easier ones to swap once it is wrapped in the voice agent platform. Pick a sensible default, monitor user complaints about voice quality, and revisit if a specific failure mode shows up enough to matter.
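One way to keep the layer swappable is to have the agent depend on a small interface rather than on a vendor SDK. The sketch below uses hypothetical names throughout (`TTSBackend`, `SilenceBackend`, `VoiceAgent`); the placeholder backend yields silent PCM so the example runs without any provider.

```python
from typing import Iterator, Protocol


class TTSBackend(Protocol):
    """Anything that can turn text into a stream of audio chunks."""

    def synthesize(self, text: str) -> Iterator[bytes]: ...


class SilenceBackend:
    """Placeholder backend: one chunk of silent 16-bit mono PCM per word."""

    def __init__(self, sample_rate: int = 24_000, chunk_ms: int = 200) -> None:
        self.bytes_per_chunk = sample_rate * 2 * chunk_ms // 1000

    def synthesize(self, text: str) -> Iterator[bytes]:
        for _ in text.split():
            yield bytes(self.bytes_per_chunk)


class VoiceAgent:
    """Depends only on the TTSBackend interface, not on a vendor."""

    def __init__(self, tts: TTSBackend) -> None:
        self.tts = tts

    def speak(self, text: str) -> int:
        """Stream the utterance; return total audio bytes produced."""
        return sum(len(chunk) for chunk in self.tts.synthesize(text))


agent = VoiceAgent(SilenceBackend())
total = agent.speak("hello there")
print(f"{total} bytes of audio")
```

Switching providers then means writing one new class with a `synthesize` method and passing it to `VoiceAgent`, with no change to the agent itself.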