Subject
Prompting when the input is not just text: images, document scans, audio, and video. How the major multimodal models handle each modality, where they reliably fail, and what changes when you stop typing prompts and start mixing media.
As of 2026-05-22
For the first few years of LLMs, "prompting" meant typing English at a model and reading English back. By 2026 the major models all accept images and most accept more — documents, audio, sometimes video. The fundamentals do not change: instruction, context, examples, output spec still drive output quality. What changes is the shape of the levers, the failure modes, and the cost.
The mid-2026 picture, by provider:
OpenAI: GPT-4o and the GPT-5.x family are positioned as "one model for text, image, and audio." Image and audio input are both available through the standard API; audio is also available through the Realtime API for streaming. Video is supported in the API on the GPT-5 generation, with frame sampling for non-Realtime use. Primary docs are at platform.openai.com/docs.
Anthropic: Claude accepts text and images natively. PDF input is a first-class capability: each page is converted to an image, and Claude reads both text and layout in one pass. Page-count limits apply (around 100 pages of visual analysis on current models). Audio is not native; it goes through tool-use integrations or third-party transcription. Docs are at docs.anthropic.com.
Google Gemini: The most modality-complete of the major models, with native image, document, audio, and video input. The Gemini Live API handles streaming voice. Long-context behavior on multimodal input is genuinely useful for long material such as entire films or long meetings. Docs are at ai.google.dev.
Open-weight (Llama, Qwen): Vision is well supported (Llama-3.2-Vision, Qwen2.5-VL, the Qwen-VL line). Audio and video are partial or experimental, usually handled by companion models rather than a single multimodal stack. Hugging Face is the canonical source for each family.
The honest 2026 picture: vision is standardized, document understanding is converging, audio is bifurcated (batch vs realtime), video is mostly Gemini.
The prompting fundamentals carry over. From how-to-prompt-ai-effectively, they are the four named above: instruction, context, examples, and output spec.
A vision prompt that does all four still beats a vision prompt that does one. The fundamentals do not stop working because the input is now a JPEG.
Three things shift when modalities enter:
The cost model. Images and audio are not "free" to send. Each image is billed as a chunk of tokens (typically a few hundred to a few thousand depending on resolution and provider), and audio is billed by duration or token equivalent. A casually-uploaded high-resolution photo can quietly use more tokens than a long text document. Compress, downscale, or sample where you can.
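To make the downscaling advice concrete, here is a minimal sketch of the arithmetic. The helper names (`fit_within`, `estimate_image_tokens`), the 1568-pixel edge limit, and the rate of one token per 750 pixels are all assumptions chosen for illustration, not any provider's documented billing formula; check the pricing page of the model you use for the real numbers.

```python
import math


def fit_within(width: int, height: int, max_edge: int) -> tuple[int, int]:
    """Scale (width, height) down so the longer edge is at most max_edge.

    Aspect ratio is preserved. Images that already fit are returned unchanged.
    """
    longest = max(width, height)
    if longest <= max_edge:
        return width, height
    scale = max_edge / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def estimate_image_tokens(width: int, height: int, pixels_per_token: int = 750) -> int:
    """Rough token estimate for one image.

    pixels_per_token is an ASSUMED illustrative constant, not a provider rate.
    """
    return math.ceil(width * height / pixels_per_token)


original = (4032, 3024)  # dimensions of a typical phone photo
resized = fit_within(*original, max_edge=1568)  # hypothetical edge limit

print("original estimate:", estimate_image_tokens(*original))
print("resized estimate: ", estimate_image_tokens(*resized))
```

The sketch only computes target dimensions and an estimate; the actual resize would be done with whatever image library you already use before the upload.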
The failure modes. Vision models hallucinate visual details with confidence. Audio models mis-transcribe accents and acronyms. Document models miss handwritten annotations. The mitigations differ per modality but the meta-pattern is the same: ground the answer ("describe before reasoning"), allow uncertainty ("say so if you cannot tell"), verify what matters.
The output shape. A vision-language task often benefits from a two-step prompt: first describe what you see, then answer the question. This is the multimodal version of chain-of-thought, and it consistently improves vision accuracy, much as text chain-of-thought improves accuracy on reasoning evals.
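The two-step pattern, combined with the "allow uncertainty" mitigation, can be sketched as a small prompt builder. The function name `build_two_step_prompt` and the exact wording are illustrative assumptions; how the image itself is attached depends on the provider SDK and is deliberately left out.

```python
def build_two_step_prompt(question: str) -> str:
    """Return prompt text that asks for a description first, then the answer.

    The wording is a hypothetical example, not a tested or recommended template.
    """
    return "\n".join([
        "You will be shown an image.",
        "Step 1: Describe what you see, listing only details that are actually visible.",
        "Step 2: Using only that description, answer the question below.",
        "If you cannot tell from the image, say so instead of guessing.",
        "",
        f"Question: {question}",
    ])


prompt = build_two_step_prompt("How many people are wearing hard hats?")
print(prompt)
```

Keeping the description and the answer as separate steps also gives you something to verify: if the description is wrong, the answer built on it should not be trusted.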
The native-vs-chain question comes up constantly. A useful framing:
A reasonable rule of thumb: if you find yourself wanting to "ask questions about" the multimedia, use native. If you want to "convert it to text and process," chain specialists.
Each linked article goes deeper on one modality:
If you are already prompting text-only models effectively, the multimodal cluster is mostly about the new failure modes and the new cost model. The fundamentals stay.
A short editorial reading list. Pick whichever fits how you like to learn.