Why does the model miss small text in my image?

Two reasons usually. One, the providers downscale large images before processing (varies by provider; OpenAI, Anthropic, and Gemini all do this) so the effective resolution the model sees is lower than what you uploaded. Two, OCR-level reading of dense text is genuinely harder than picture-level understanding. For dense documents, either upload higher-resolution targeted crops, use a provider with explicit high-detail mode (OpenAI's "detail: high"), or chain with a dedicated OCR step first.

How do I get the model to count accurately?

Counting is one of the weakest spots for vision models. Three mitigations: explicitly ask the model to count by listing each item it sees (forces enumeration), use a structured output that demands one entry per item, and verify with a second model or a non-vision check where possible. For high-stakes counting, do not rely on a single shot.

Should I describe the image first, then ask the question?

Yes, usually. The vision equivalent of chain-of-thought — asking the model to describe before answering — produces measurably better answers on multi-step visual reasoning. It costs more output tokens but the accuracy gain is generally worth it. For simple one-shot tasks (caption an image, transcribe a sign) it is unnecessary overhead.

Prompting Vision Models

Vision models in 2026 — Claude with image input, OpenAI GPT-4o/5.x with vision, Gemini's multimodal stack, Llama-3.2-Vision, Qwen-VL — share enough common shape that the prompting techniques transfer. They also share enough common failure modes that the same mitigations apply.

The five tasks vision models do well

In rough order of how reliably they work:

1. Image description. Tell the model "describe this image" and you get a decent description. Refinements come from constraining what to describe (scene, objects, text, colors) and asking the model not to infer beyond what is visible.

A working prompt shape:

Describe what you can clearly see in this image, without guessing.
Cover: scene, main objects, any visible text, notable colors.
If you cannot tell something, say "not clearly visible."

2. OCR on clean text. Reading printed text from images works well on signs, slides, screenshots, and reasonably clean documents. Failure cases: handwriting, low-resolution scans, anything where the text is small relative to the image. For dense documents, see working-with-document-images.

3. Visual question answering. "Is the door open?" "What's the brand of car?" "Does the chart show growth?" These work reliably on clearly-visible features and start to fail on edge cases (small features, ambiguous answers, things the model wants to hedge on).

4. Chart and table reading. Modern vision models read bar charts, line charts, and pie charts well enough to extract trends. They are less reliable on small numeric values, dense legends, and unconventional chart types. Tables get read as structured data on Claude and Gemini particularly well; OpenAI is competitive.

5. UI and screenshot understanding. Identifying what is on a UI, what buttons do, where the user is in a flow. Critical for agentic tools and assistant use cases. Mostly reliable on clear screenshots; less so on dense interfaces or screenshots taken at very small sizes.

The three failure modes

The failures cluster predictably:

Small text. Providers downscale large images before the model sees them, so a photo with a tiny "no parking" sign in the corner is often invisible to the model. Mitigations: upload higher resolution where the provider supports it (OpenAI's detail: high mode), crop to the region you care about, or chain with explicit OCR for the small-text portion.

Counting and finding. "How many cars are in this picture?" gets wrong answers more often than you would expect, especially as the count grows. "Find the person wearing a red shirt" is often confidently wrong about which person they meant. Mitigations: ask the model to enumerate each item it sees before counting, use structured output that forces one entry per item, verify high-stakes counts.

Hallucinated details. Vision models will sometimes describe details that are not in the image at all — text that does not exist, objects that are not there, plausibly-but-falsely identified people or places. The mitigations: explicit grounding ("describe what you actually see, do not infer"), ask for uncertainty, never trust a single shot for high-stakes claims.

Techniques that actually move accuracy

A short list of patterns that consistently help:

Ground before reasoning. Ask the model to describe what it sees before it answers your question. This is the visual equivalent of chain-of-thought and produces measurable accuracy gains on multi-step tasks.

First, describe what you see in the image.
Then, answer the question: [question]

Allow uncertainty. "If you cannot tell, say so" is one of the highest-ROI sentences in vision prompting. Without it, the model will produce plausible answers to questions it cannot actually answer; with it, you get useful "I'm not sure because..." responses that you can route differently downstream.

Constrain output structure. A free-form vision answer drifts. A structured output (JSON with explicit fields) constrains the model and makes the answer mechanically parseable.

Return a JSON object with these fields:
- objects: list of objects you can clearly see
- text: any text visible in the image
- scene: indoor/outdoor and rough setting
- uncertain: list of things you saw partially but cannot confirm

Detail mode where supported. OpenAI's vision API has detail: low and detail: high modes. Low is cheaper but lower-resolution; high uses more tokens but reads small details better. Anthropic and Gemini handle this more automatically but you can influence it by image dimensions. For dense detail, use high; for general scene understanding, low is usually fine.

Multi-image comparison. All three major providers handle multiple images in one prompt. "Compare these two charts and identify what changed" is a natural multi-image task. Watch the cost: multiple images mean multiple per-image token charges.

Provider-specific notes

A few specifics worth knowing:

Anthropic Claude — explicitly recommends honesty about uncertainty in vision prompts. Their docs encourage telling the model to admit when it cannot see something. Strong at chart and document understanding.
OpenAI — detail: low/high modes give you explicit control over the cost/accuracy tradeoff. Strong at general scene description.
Google Gemini — the longest practical context for multi-image work. Good for "compare these 50 product photos" style tasks. Strong at video too.
Open-weight (Llama-3.2-Vision, Qwen-VL) — competitive on general tasks; the gap closes every release. Lower per-call cost; you handle the infrastructure.

The picking question rarely turns on raw vision capability at this point — all the frontier models are good enough at the basics. It turns on cost, the rest of your stack, and which provider's docs you find easier to read.

Comments 2

u/cv_anna · 1 month ago

the tip about being explicit about what to look at in the image is gold. vision models love to describe everything except the one thing you actually asked about
u/vis_v2 · 1 month ago

being explicit about the region of interest is the exact tip i needed. vision models will describe the whole image unless you pin them right down

Prompting Vision Models

Article summary

The five tasks vision models do well

The three failure modes

Techniques that actually move accuracy

Provider-specific notes

Frequently asked questions

See also

Where to go next

Comments 2