Do all the big models support all modalities?

No, not quite yet. Image input is universal across the frontier and most open-weight models. Document/PDF as a first-class input is supported by Claude and Gemini, partial elsewhere. Audio input is solid on OpenAI (Realtime API + Whisper) and Gemini (Live API); other providers usually require you to transcribe first. Video input is most developed on Gemini; partial on OpenAI; experimental elsewhere. The picture changes quickly — check the docs for your target provider.

How does prompting actually change when you add an image?

Two main differences. First, you usually want to anchor the prompt to the visual content explicitly ("In the image above, count the cars..." rather than just "Count the cars"). Second, vision models hallucinate visual details that are not present — so prompts that ask the model to ground its answer ("describe what you see before answering") and to admit uncertainty ("if you cannot tell, say so") improve accuracy noticeably.

Should I use a native multimodal model or chain separate models?

It depends on the task. For tightly-coupled understanding (a document where text and images interact, a video where audio matches what is happening on screen), native multimodal wins. For decoupled tasks — transcribe an audio file, then summarize the text — chaining a specialist (Whisper) with a text LLM often costs less and works better. Native does not always beat specialized; check both on your task.

Prompting Multimodal Models

As of 2026-05-22

For the first few years of LLMs, "prompting" meant typing English at a model and reading English back. By 2026 the major models all accept images and most accept more — documents, audio, sometimes video. The fundamentals do not change: instruction, context, examples, output spec still drive output quality. What changes is the shape of the levers, the failure modes, and the cost.

The major models, by modality

The mid-2026 picture, by provider:

OpenAI — GPT-4o and the GPT-5.x family are positioned as "one model for text, image, and audio." Image and audio input both via the standard API; audio also via the Realtime API for streaming. Video on the GPT-5 generation is supported in the API, with frame-sampling for non-Realtime use. Primary docs at platform.openai.com/docs.

Anthropic — Claude accepts text and images natively. PDF input is a first-class capability: each page is converted to an image and Claude reads both text and layout in one pass. Page-count limits apply (around 100 pages of visual analysis on current models). Audio is via tool-use integrations or third-party transcription; not native. Docs at docs.anthropic.com.

Google Gemini — The most modality-complete of the major models. Native image, document, audio, and video input. Gemini Live API handles streaming voice. Long-context behavior on multimodal input is genuinely useful (entire films, long meetings). Docs at ai.google.dev.

Open-weight (Llama, Qwen) — Vision is well-supported (Llama-3.2-Vision, Qwen2.5-VL, the Qwen-VL line). Audio and video are partial or experimental, usually through companion models rather than a single multimodal stack. Hugging Face is the canonical source per family.

The honest 2026 picture: vision is standardized, document understanding is converging, audio is bifurcated (batch vs realtime), video is mostly Gemini.

What stays the same

The prompting fundamentals carry over. As covered in how-to-prompt-ai-effectively, there are four:

Instruction — say what you want the model to do.
Context — give the model what it needs (the image, the document, the audio).
Examples — show what good output looks like.
Output spec — say what shape the answer should take.

A vision prompt that does all four still beats a vision prompt that does one. The fundamentals do not stop working because the input is now a JPEG.
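As a minimal sketch of a vision prompt that covers all four parts, the text half of the request could be assembled as below. The `VisionPrompt` class and its field names are hypothetical, made up for illustration; they are not part of any provider SDK, and the image itself is assumed to be attached separately by whatever client you use.

```python
from dataclasses import dataclass, field


@dataclass
class VisionPrompt:
    """Hypothetical container for the four prompt parts; not a provider API."""

    instruction: str
    output_spec: str
    examples: list[str] = field(default_factory=list)

    def render(self) -> str:
        # Instruction first, then a pointer to the context (the image),
        # then any examples, then the output spec last.
        parts = [f"Instruction: {self.instruction}"]
        parts.append("Context: the attached image.")
        for i, example in enumerate(self.examples, start=1):
            parts.append(f"Example {i}: {example}")
        parts.append(f"Output format: {self.output_spec}")
        return "\n".join(parts)


prompt = VisionPrompt(
    instruction="In the attached image, count the cars in the parking lot.",
    output_spec='JSON like {"cars": <integer>, "confidence": "high|medium|low"}',
    examples=['{"cars": 3, "confidence": "high"}'],
)
text = prompt.render()
print(text)
```

The ordering here is a design choice rather than a rule; the point is only that each of the four parts is present and labeled.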

What changes

Three things shift when modalities enter:

The cost model. Images and audio are not "free" to send. Each image is billed as a chunk of tokens (typically a few hundred to a few thousand depending on resolution and provider), and audio is billed by duration or token equivalent. A casually-uploaded high-resolution photo can quietly use more tokens than a long text document. Compress, downscale, or sample where you can.
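To make the cost point concrete, here is a rough sketch of estimating image tokens and downscaling to fit a budget. The `pixels_per_token` default of 750 is an assumed placeholder, not a universal rule: providers publish their own formulas (some are tile-based), so substitute the one for your target model. Both function names are hypothetical.

```python
import math


def estimate_image_tokens(width: int, height: int,
                          pixels_per_token: float = 750.0) -> int:
    """Rough token estimate under an assumed area-based pricing rule."""
    return math.ceil(width * height / pixels_per_token)


def downscale_to_budget(width: int, height: int, max_tokens: int,
                        pixels_per_token: float = 750.0) -> tuple[int, int]:
    """Return (width, height) that keeps the aspect ratio and fits the budget."""
    current = width * height / pixels_per_token
    if current <= max_tokens:
        return width, height
    # Token count scales with area, so each edge scales by the square root.
    scale = math.sqrt(max_tokens / current)
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


original = estimate_image_tokens(4000, 3000)
new_w, new_h = downscale_to_budget(4000, 3000, max_tokens=1600)
print(original, (new_w, new_h), estimate_image_tokens(new_w, new_h))
```

Under this assumed rule a 4000x3000 photo comes to about 16,000 tokens, and shrinking each edge by roughly a third of its length brings it under a 1,600-token budget. Whether the smaller image still carries the detail your task needs is something to check on real inputs.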

The failure modes. Vision models hallucinate visual details with confidence. Audio models mis-transcribe accents and acronyms. Document models miss handwritten annotations. The mitigations differ per modality but the meta-pattern is the same: ground the answer ("describe before reasoning"), allow uncertainty ("say so if you cannot tell"), verify what matters.

The output shape. A vision-language task often benefits from a two-step prompt: first describe what you see, then answer the question. This is the multimodal counterpart of chain-of-thought, and it tends to improve accuracy on vision tasks in much the same way that text chain-of-thought does on reasoning tasks.
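The describe-then-answer pattern can be sketched as two calls. The `ask` callable is a hypothetical stand-in for your own client code: it is assumed to send a text prompt along with the already-attached image and return the model's reply as a string. The prompt wording is illustrative only.

```python
from typing import Callable

# Hypothetical signature: text prompt in, model reply out. Attaching the
# image is assumed to happen inside the caller's own client code.
AskFn = Callable[[str], str]


def describe_then_answer(ask: AskFn, question: str) -> dict[str, str]:
    # Step 1: ground the model in what is actually visible.
    description = ask(
        "Describe what you see in the attached image, factually. "
        "If something is unclear or not visible, say so."
    )
    # Step 2: answer using the description, with permission to be uncertain.
    answer = ask(
        "Here is your description of the attached image:\n"
        f"{description}\n\n"
        f"Using the image and that description, answer: {question}\n"
        "If you cannot tell, say so instead of guessing."
    )
    return {"description": description, "answer": answer}


def fake_ask(prompt: str) -> str:
    """Stub model used only to show the call flow."""
    if prompt.startswith("Describe"):
        return "A parking lot with three cars."
    return "3"


result = describe_then_answer(fake_ask, "How many cars are there?")
print(result)
```

Keeping the description in the returned value makes it easy to log and spot-check, which helps with the "verify what matters" step. The same two steps can also be folded into a single prompt if the extra round trip costs too much.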

When to use a native multimodal model vs chain specialists

The native-vs-chain question comes up constantly. A useful framing:

Use a native multimodal model when the modalities interact. A document where chart trends contradict body text. A video where the speaker's facial expression matters. A screenshot where the layout matters as much as the text. These are tasks where decoupling loses information.
Use a specialist + text LLM when modalities are independent. Transcribe a podcast, then summarize. Extract text from a scan, then classify. Detect objects in an image, then write copy. These pipelines often beat native multimodal on cost and quality because each stage uses a model tuned for one job.

A reasonable rule of thumb: if you find yourself wanting to "ask questions about" the multimedia, use native. If you want to "convert it to text and process," chain specialists.
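The "convert it to text and process" route can be sketched as plain function composition. Both stages are hypothetical callables that you would back with real models (for example a speech-to-text specialist and a text LLM); the stubs below exist only so the sketch runs.

```python
from typing import Callable


def chain_specialists(
    to_text: Callable[[bytes], str],
    process_text: Callable[[str], str],
) -> Callable[[bytes], str]:
    """Compose a media-to-text specialist with a text-only processing step."""

    def pipeline(media: bytes) -> str:
        text = to_text(media)
        # Fail loudly if the first stage produced nothing, instead of
        # letting the text stage invent content from an empty input.
        if not text.strip():
            raise ValueError("specialist produced no text; nothing to process")
        return process_text(text)

    return pipeline


def stub_transcribe(audio: bytes) -> str:
    """Stand-in for a transcription model."""
    return audio.decode("utf-8")


def stub_summarize(text: str) -> str:
    """Stand-in for a text LLM: keeps the first three words."""
    return " ".join(text.split()[:3])


summarize_audio = chain_specialists(stub_transcribe, stub_summarize)
summary = summarize_audio(b"welcome to the show and thanks for listening")
print(summary)
```

Because each stage is just a callable, a stage can be swapped or a native multimodal call compared against the chain without changing the surrounding code.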

What this cluster covers

Each linked article goes deeper on one modality:

Prompting Vision Models — image input across providers, common tasks, failure modes, techniques.
Working with Document Images — PDFs, scans, screenshots, when to use vision LLMs vs OCR, structured extraction.
Audio and Video Prompting — speech-to-text, text-to-speech, video understanding, realtime voice.

If you are already prompting text-only models effectively, the multimodal cluster is mostly about the new failure modes and the new cost model. The fundamentals stay.
