As of 2026-05-22
Audio and video are where multimodal prompting diverges most clearly from text prompting. The shape of the work is different, the latency model is different, and the provider landscape is more fragmented. The good news is the choices cluster into a small set.
Speech-to-text: two paths
Path A — Batch transcription with a specialist. Whisper from OpenAI is still the default. The model is open-source, the API is cheap, and the workflow is straightforward: send an audio file, get back a transcript with optional timestamps and segments. Alternatives at similar quality: Deepgram, AssemblyAI, Azure Speech, Google Cloud Speech-to-Text. Whisper wins on cost; Deepgram and AssemblyAI add features like speaker diarization, custom vocabulary, and lower latency.
When this path fits: archives, podcasts, meetings, anything where the transcript is the deliverable or feeds into a downstream text pipeline.
Path B — Native multimodal audio input. OpenAI's GPT-4o and 5.x family, Google's Gemini, and a growing list of others ingest audio directly. You send the audio file as part of the prompt; the model produces a text or audio response without an explicit transcription step. Useful when you want the model to do something with the audio beyond transcribing it ("summarize this call," "what was the customer upset about," "did the speaker commit to a deadline").
The two paths are not in direct competition. Many production pipelines use both: Whisper for bulk transcription at the storage layer, and native multimodal input for downstream understanding when needed.
Realtime voice: a different shape
If you want a voice assistant, a phone agent, or any back-and-forth voice interface, neither batch transcription nor native-audio-input is what you want. You want realtime APIs.
- OpenAI Realtime API: a streaming API, available over WebSocket or WebRTC, where you send microphone audio and receive model audio (or text) in real time. The model handles speech understanding, reasoning, and speech generation in a single connection. Tool use and function calling work inside the Realtime API. Latency is in the few-hundred-millisecond range for typical turns.
- Gemini Live API: Google's equivalent. Streaming bidirectional audio, low-latency, with similar tool-use integration. Strong on long conversations.
- Specialized voice platforms: Vapi, Retell, Pipecat, and similar tools wrap these and other APIs for phone-native applications. Worth considering if you are building specifically for phone calls.
Realtime APIs are dramatically more natural to interact with than the older "press a button, wait for transcription, wait for response" loops. They are also harder to test, debug, and observe; building a robust voice agent is a real engineering exercise on top of these primitives.
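The choice between batch transcription, native audio input, and realtime can be written down as a small routing rule. The sketch below is only an illustration of the decision described above; the enum, the function name, and the two flags are hypothetical and not part of any provider SDK.

```python
from enum import Enum


class AudioPath(Enum):
    """The three shapes of audio work discussed in this article."""
    BATCH_TRANSCRIPTION = "batch_transcription"  # Path A: specialist STT
    NATIVE_AUDIO_INPUT = "native_audio_input"    # Path B: audio in the prompt
    REALTIME = "realtime"                        # streaming voice interface


def choose_audio_path(*, interactive: bool, transcript_is_deliverable: bool) -> AudioPath:
    """Pick a starting point; real systems often combine several paths."""
    if interactive:
        # Back-and-forth voice needs a streaming, low-latency API.
        return AudioPath.REALTIME
    if transcript_is_deliverable:
        # Archives, podcasts, meetings: the transcript is the product.
        return AudioPath.BATCH_TRANSCRIPTION
    # The model should reason about the audio, not just transcribe it.
    return AudioPath.NATIVE_AUDIO_INPUT
```

A phone agent would map to `choose_audio_path(interactive=True, transcript_is_deliverable=False)`, while a podcast archive would set `interactive=False` and `transcript_is_deliverable=True`.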
Text-to-speech (TTS)
The TTS market is its own ecosystem:
- ElevenLabs — widely considered the quality leader. Multiple voices, voice cloning, multilingual. The default for high-quality TTS in production applications.
- OpenAI TTS — solid quality, cheaper, integrated with the OpenAI API. Lower customization than ElevenLabs.
- Google Cloud TTS / Microsoft Azure TTS — long-standing enterprise options with broad language support.
- Native multimodal output — GPT-4o/5.x and Gemini can produce audio directly as part of a response. Quality has caught up to dedicated TTS services for many use cases; cost is often lower because it shares the same model call.
For most products in 2026, the question is "ElevenLabs or native multimodal output?" The answer often comes down to whether you need specific voice characteristics (cloning, custom personas) — that is ElevenLabs territory — or whether you just need a reasonable voice attached to your LLM (native is fine).
Video understanding
Video is the most uneven modality across providers:
- Gemini: the most mature native video understanding. Send a video file directly; the model handles frame sampling internally. Useful for "describe this video," "find the moment when X happens," "transcribe the dialog and align it to scenes." Its long context window handles longer videos well.
- GPT-5.x family: video support is in the API; quality is competitive but the workflow is younger than Gemini's.
- Everyone else: typically requires you to do the frame sampling yourself. Pull frames at 1 fps (or some other rate), pass them as a sequence of images, optionally pair with a separate audio transcription. Works, but more fragile and more expensive than native video.
A common pattern that works across providers: pre-process the video into a structured representation (a few keyframes plus a transcript) and pass that to a text-only LLM. Loses some information but is cheap, reliable, and easy to debug.
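One way to picture that structured representation is a single timestamped timeline that interleaves keyframe descriptions with transcript segments. The sketch below assumes the keyframes have already been captioned and the audio already transcribed by some earlier step; the `Keyframe` and `TranscriptSegment` types and the `build_video_digest` function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Keyframe:
    timestamp_s: float
    description: str  # caption produced by an earlier step (assumed)


@dataclass
class TranscriptSegment:
    start_s: float
    text: str


def _clock(seconds: float) -> str:
    """Format seconds as MM:SS for readability in the prompt."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def build_video_digest(keyframes: List[Keyframe],
                       segments: List[TranscriptSegment]) -> str:
    """Merge frames and speech into one chronological text block."""
    # The middle tuple element breaks ties so a frame sorts before
    # speech that starts at the same timestamp.
    events = [(k.timestamp_s, 0, f"[frame] {k.description}") for k in keyframes]
    events += [(s.start_s, 1, f"[speech] {s.text}") for s in segments]
    events.sort(key=lambda event: (event[0], event[1]))
    return "\n".join(f"{_clock(t)} {label}" for t, _, label in events)
```

The resulting string can be placed in an ordinary text prompt, which is what makes this approach easy to log and debug.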
Common patterns
A few patterns that work across the audio/video space:
Two-pass audio understanding. Pass 1: transcribe (Whisper or native). Pass 2: ask the LLM questions about the transcript. Cheaper than running a multimodal model end-to-end on a long file; works for most "understand this recording" tasks.
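A minimal sketch of the two-pass shape follows. It deliberately takes the transcription and LLM calls as injected functions rather than naming a provider SDK, so `transcribe` and `ask_llm` are placeholders you would supply; the class name and prompt wording are also assumptions. Caching the transcript is what makes follow-up questions cheap.

```python
from typing import Callable, Dict


class TwoPassAudioQA:
    """Pass 1 transcribes once; pass 2 answers questions over the text."""

    def __init__(self,
                 transcribe: Callable[[str], str],
                 ask_llm: Callable[[str], str]) -> None:
        self._transcribe = transcribe  # e.g. a wrapper around your STT service
        self._ask_llm = ask_llm        # e.g. a wrapper around your text LLM
        self._transcripts: Dict[str, str] = {}

    def transcript_for(self, audio_path: str) -> str:
        # Pass 1: transcribe only the first time a file is seen.
        if audio_path not in self._transcripts:
            self._transcripts[audio_path] = self._transcribe(audio_path)
        return self._transcripts[audio_path]

    def ask(self, audio_path: str, question: str) -> str:
        # Pass 2: a plain text prompt built from the cached transcript.
        transcript = self.transcript_for(audio_path)
        prompt = (
            f"Transcript:\n{transcript}\n\n"
            f"Question: {question}\n"
            "Answer using only the transcript."
        )
        return self._ask_llm(prompt)
```

Because both dependencies are plain callables, the pipeline can be unit-tested with fakes before any real service is wired in.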
Streaming TTS with sentence boundaries. When generating audio output, stream the model's text out and feed it to TTS sentence by sentence rather than waiting for the whole response. Latency drops from "wait for the full answer" to "hear the first word in 1-2 seconds."
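The buffering step in this pattern can be sketched as a generator that accumulates streamed text and emits each sentence as soon as it is complete. The function name is hypothetical, and the boundary rule (terminal punctuation followed by whitespace) is a simple heuristic that will mis-split abbreviations such as "Dr. Smith"; a production system would likely want something more careful.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at '.', '!' or '?' followed by whitespace. Requiring the
# whitespace avoids splitting decimals such as "3.5".
_SENTENCE_END = re.compile(r"[.!?]\s+")


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a text stream."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while True:
            match = _SENTENCE_END.search(buffer)
            if match is None:
                break  # no complete sentence yet; wait for more text
            sentence = buffer[:match.end()].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence
    # Flush whatever remains once the stream ends.
    tail = buffer.strip()
    if tail:
        yield tail
```

Each yielded sentence would be handed to the TTS service immediately, so audio for the first sentence can start playing while the model is still generating the rest.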
Audio for input, text for output. Even with native multimodal, returning text instead of audio is often cleaner for testing, logging, and rendering. A workable mix: take audio in (so users can talk naturally) and produce text out (or text plus optional TTS), so you can audit what the model said.
Frame sampling rate for video. For most non-Gemini video work, 1 frame per second is a reasonable default. Action-heavy content benefits from 2-4 fps; static content can drop to 0.25 fps. Higher rates are almost always wasteful — the LLM does not gain much from successive nearly-identical frames.
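The sampling arithmetic is simple enough to show directly. The sketch below computes which timestamps to extract for a given rate; the function name and the preset labels are hypothetical, and the "action" preset uses the lower end of the 2-4 fps range mentioned above. Actually pulling the frames would be done with a separate tool and is not shown.

```python
from typing import Dict, List

# Illustrative presets based on the rates discussed above.
SUGGESTED_FPS: Dict[str, float] = {
    "static": 0.25,   # slides, talking heads
    "default": 1.0,
    "action": 2.0,    # lower end of the 2-4 fps range
}


def frame_timestamps(duration_s: float, fps: float = 1.0) -> List[float]:
    """Return the timestamps (in seconds) at which to sample frames."""
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    if fps <= 0:
        raise ValueError("fps must be positive")
    timestamps: List[float] = []
    index = 0
    # Sample at 0, 1/fps, 2/fps, ... strictly before the end of the video.
    while index / fps < duration_s:
        timestamps.append(round(index / fps, 3))
        index += 1
    return timestamps
```

For a 10-minute video this gives 600 frames at the default rate and 150 at the static rate, which is a useful way to estimate image-token cost before sending anything to a model.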
What is changing fast
The multimodal landscape moves faster than other parts of the LLM stack. The shapes (batch vs realtime, native vs specialist, frame-sampled vs native video) are likely to stay; the specific best-in-class tool inside each shape changes every few months. Bookmark provider docs, re-check before major architectural decisions, and treat this article as a map of the categories — the specific stars in each category may have moved by the time you read it.