Subject

Multimodal Prompting

Prompting when the input is not just text: images, document scans, audio, and video. How the major multimodal models handle each modality, where they reliably fail, and what changes when you stop typing prompts and start mixing media.

As of 2026-05-22

For the first few years of LLMs, "prompting" meant typing English at a model and reading English back. By 2026 the major models all accept images and most accept more — documents, audio, sometimes video. The fundamentals do not change: instruction, context, examples, output spec still drive output quality. What changes is the shape of the levers, the failure modes, and the cost.

The major models, by modality

The mid-2026 picture, by provider:

OpenAI — GPT-4o and the GPT-5.x family are positioned as "one model for text, image, and audio." Image and audio input both via the standard API; audio also via the Realtime API for streaming. Video on the GPT-5 generation is supported in the API, with frame-sampling for non-Realtime use. Primary docs at platform.openai.com/docs.

Anthropic — Claude accepts text and images natively. PDF input is a first-class capability: each page is converted to an image and Claude reads both text and layout in one pass. Page-count limits apply (around 100 pages of visual analysis on current models). Audio is via tool-use integrations or third-party transcription; not native. Docs at docs.anthropic.com.

Google Gemini — The most modality-complete of the major models. Native image, document, audio, and video input. Gemini Live API handles streaming voice. Long-context behavior on multimodal input is genuinely useful (entire films, long meetings). Docs at ai.google.dev.

Open-weight (Llama, Qwen) — Vision is well-supported (Llama-3.2-Vision, Qwen2.5-VL, the Qwen-VL line). Audio and video are partial or experimental, usually through companion models rather than a single multimodal stack. Hugging Face is the canonical source per family.

The honest 2026 picture: vision is standardized, document understanding is converging, audio is bifurcated (batch vs realtime), video is mostly Gemini.

What stays the same

The prompting fundamentals carry over. From how-to-prompt-ai-effectively:

Instruction — say what you want the model to do.
Context — give the model what it needs (the image, the document, the audio).
Examples — show what good output looks like.
Output spec — say what shape the answer should take.

A vision prompt that does all four still beats a vision prompt that does one. The fundamentals do not stop working because the input is now a JPEG.

What changes

Three things shift when modalities enter:

The cost model. Images and audio are not "free" to send. Each image is billed as a chunk of tokens (typically a few hundred to a few thousand depending on resolution and provider), and audio is billed by duration or token equivalent. A casually-uploaded high-resolution photo can quietly use more tokens than a long text document. Compress, downscale, or sample where you can.

The failure modes. Vision models hallucinate visual details with confidence. Audio models mis-transcribe accents and acronyms. Document models miss handwritten annotations. The mitigations differ per modality but the meta-pattern is the same: ground the answer ("describe before reasoning"), allow uncertainty ("say so if you cannot tell"), verify what matters.

The output shape. A vision-language task often benefits from a two-step prompt: first describe what you see, then answer the question. This is the multimodal version of chain-of-thought, and it consistently improves vision accuracy on the same evals that text-CoT improves on for reasoning.

When to use a native multimodal model vs chain specialists

The native-vs-chain question comes up constantly. A useful framing:

Use a native multimodal model when the modalities interact. A document where chart trends contradict body text. A video where the speaker's facial expression matters. A screenshot where the layout matters as much as the text. These are tasks where decoupling loses information.
Use a specialist + text LLM when modalities are independent. Transcribe a podcast, then summarize. Extract text from a scan, then classify. Detect objects in an image, then write copy. These pipelines often beat native multimodal on cost and quality because each stage uses a model tuned for one job.

A reasonable rule of thumb: if you find yourself wanting to "ask questions about" the multimedia, use native. If you want to "convert it to text and process," chain specialists.

What this cluster covers

Each linked article goes deeper on one modality:

Prompting Vision Models — image input across providers, common tasks, failure modes, techniques.
Working with Document Images — PDFs, scans, screenshots, when to use vision LLMs vs OCR, structured extraction.
Audio and Video Prompting — speech-to-text, text-to-speech, video understanding, realtime voice.

If you are already prompting text-only models effectively, the multimodal cluster is mostly about the new failure modes and the new cost model. The fundamentals stay.

Forthcoming

Chart and Table Understanding
Realtime Voice Prompting
Multimodal Evals

Where to go next

A short editorial reading list. Pick whichever fits how you like to learn.

NerdSip: 5-minute AI micro-course on almost any topic, on iOS and Android

Comments 0

No comments yet. Be the first to share your thoughts.