Should I OCR first and then send the text, or send the PDF directly?

Depends on the task. For semantic Q&A over a document, send the PDF directly (or page images) — the model uses layout cues that get lost in plain text. For bulk text extraction at scale, OCR first is cheaper and usually accurate enough. For structured extraction (forms, tables), test both — vision LLMs often win on accuracy but cost more per page.

How many pages can I send in a single prompt?

Provider-specific. Claude documents a ~100-page limit for visual PDF analysis on current models. Gemini handles larger PDFs with its long-context window, but cost rises with page count. OpenAI handles PDFs through tools or via per-page image conversion. For long documents, chunk by page or section, process in batches, and aggregate the results. This is the same chunking pattern used in RAG, just with page chunks instead of text chunks.
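
A minimal sketch of the chunk, process, and aggregate loop, assuming a 100-page cap per request (the figure quoted above for Claude; confirm the limit for your provider). The names page_batches, process_document, and process_batch are hypothetical; process_batch stands in for whatever provider call you actually make.

```python
from typing import Callable, Iterator, List


def page_batches(total_pages: int, max_pages: int = 100) -> Iterator[range]:
    """Yield 1-based page ranges, each at most max_pages long."""
    if total_pages < 0 or max_pages < 1:
        raise ValueError("total_pages must be >= 0 and max_pages >= 1")
    for start in range(1, total_pages + 1, max_pages):
        yield range(start, min(start + max_pages, total_pages + 1))


def process_document(
    total_pages: int,
    process_batch: Callable[[range], list],  # hypothetical provider call
    max_pages: int = 100,
) -> List:
    """Run process_batch on each page range and collect results in page order."""
    results = []
    for pages in page_batches(total_pages, max_pages):
        results.extend(process_batch(pages))
    return results
```

Keeping the batching separate from the provider call makes it easy to change the cap or swap providers without touching the aggregation step.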

How do I handle handwriting and low-quality scans?

Vision LLMs handle clear handwriting better than they used to but still degrade fast on cursive, small handwriting, or low-resolution scans. For mixed-quality archives, run a quality check first (resolution, contrast, skew), then pre-process the pages that fail it (deskew, denoise, upscale) before sending. For high-stakes handwriting (historical archives, medical records), specialized handwriting OCR systems still outperform general vision LLMs.
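
A rough sketch of what such a quality gate could look like for the resolution and contrast checks. It assumes you already have the page's pixel width, its physical width in inches, and a flat list of 0-255 grayscale values from whatever image library you use. The function names and both thresholds (200 DPI, 0.15 RMS contrast) are illustrative assumptions, not recommended values, and skew detection is left out because it needs real image processing.

```python
from statistics import pstdev
from typing import List, Sequence


def effective_dpi(width_px: int, page_width_in: float) -> float:
    """Pixels per inch across the page width."""
    return width_px / page_width_in


def rms_contrast(gray_pixels: Sequence[int]) -> float:
    """Standard deviation of intensities scaled to 0..1 (RMS contrast)."""
    return pstdev([p / 255 for p in gray_pixels])


def needs_preprocessing(
    width_px: int,
    page_width_in: float,
    gray_pixels: Sequence[int],
    min_dpi: float = 200.0,      # assumed threshold, tune on your own archive
    min_contrast: float = 0.15,  # assumed threshold, tune on your own archive
) -> List[str]:
    """Return the names of failed checks; an empty list means send as-is."""
    problems = []
    if effective_dpi(width_px, page_width_in) < min_dpi:
        problems.append("low_resolution")
    if rms_contrast(gray_pixels) < min_contrast:
        problems.append("low_contrast")
    return problems
```

Pages that come back with a non-empty list go through upscaling or contrast correction first; the rest can be sent unchanged.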

Working with Document Images

Document processing was one of the first places vision LLMs showed they could compete with specialized tools. The honest picture in 2026: vision LLMs are good enough to be the default for many document tasks, but traditional OCR and document AI services still earn their keep at the bulk and high-accuracy ends of the spectrum.

When to use a vision LLM

Vision LLMs handle documents differently from OCR. They convert each page to an image, read both text and layout, and reason about the content. The implication: they understand what they are reading in a way pure OCR does not.

Vision LLMs are the right choice when:

You want answers, not just text. "Summarize the findings of this report," "what does page 3 say about indemnification," "extract the action items from this meeting summary." Tasks where the output is an interpretation of the content, not a transcription of it.
Layout matters. Tables, forms, contracts where the structure carries meaning. Vision LLMs preserve the spatial relationships pure OCR loses.
Charts and figures are part of the content. A research paper with key results in a chart, a financial filing with embedded tables, a slide deck. Vision LLMs read these natively; OCR misses the visual content.
You need flexible extraction. Vision LLMs can be told what to extract in plain English ("pull every reference to a deadline"); OCR pipelines require either rigid templates or significant post-processing.

The canonical providers:

Claude PDF support — Anthropic processes each page as an image alongside extracted text. First-class on the Opus and Sonnet tiers. ~100-page visual analysis limit on current models. Docs at docs.anthropic.com/en/docs/build-with-claude/pdf-support.
Gemini multimodal — accepts PDFs and page images directly; long-context window helps with long documents. Docs at ai.google.dev/gemini-api/docs/document-processing.
GPT-4o/5.x with vision — handles documents via image conversion (you typically render pages to images then send). Works well; less native than Claude or Gemini for PDF specifically.

When traditional OCR still wins

Some workloads still benefit from specialized OCR or document AI:

High-volume bulk text extraction. Tesseract is free and processes thousands of pages cheaply. AWS Textract, Azure Document Intelligence, and Google Document AI are higher quality and still cheaper per page than a vision LLM at scale.
Forms with rigid structure. If you process the same form template at scale, Textract Forms, Document Intelligence Layout, or Mistral OCR often beat general LLMs because they are tuned for form fields and tables.
Maximum text accuracy on clean documents. Specialized OCR is generally more accurate than vision LLMs for raw character recognition on clean printed text.
Compliance and audit settings. When you need to defend why a particular character was read a particular way, OCR with confidence scores is more legible than "the LLM said so."

The hybrid pattern that wins

Production document pipelines in 2026 increasingly use both. A common shape:

Run OCR first for bulk text extraction. Cheap, fast, scales to millions of pages.
Index the text for retrieval (see how-rag-actually-works).
For semantic questions or extraction, retrieve the relevant pages and send the page images to a vision LLM with the OCR text as a starting point. The vision LLM uses both — the structure from the image and the text from the OCR — and produces better extractions than either alone.

This is the document equivalent of hybrid retrieval: OCR as the BM25-like exact-text layer, vision LLM as the semantic layer.
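
The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions: the keyword-overlap scorer is a toy stand-in for a real index (BM25 or embeddings), the request dictionary is a made-up shape rather than any provider's API, and top_pages and build_request are hypothetical names.

```python
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_pages(ocr_pages: Dict[int, str], question: str, k: int = 2) -> List[int]:
    """Rank pages by how often the question's terms occur in their OCR text."""
    terms = set(tokenize(question))
    scores = {}
    for page, text in ocr_pages.items():
        counts = Counter(tokenize(text))
        scores[page] = sum(counts[t] for t in terms)
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return [p for p in ranked if scores[p] > 0][:k]


def build_request(ocr_pages: Dict[int, str], page_images: Dict[int, bytes],
                  question: str, k: int = 2) -> dict:
    """Pair each retrieved page image with its OCR text for the vision LLM."""
    return {
        "question": question,
        "pages": [
            {"page": p, "image": page_images[p], "ocr_text": ocr_pages[p]}
            for p in top_pages(ocr_pages, question, k)
        ],
    }
```

Only the retrieved pages reach the vision model, which is where the cost saving comes from; the OCR text travels with each image so the model can use both.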

Practical prompting for documents

When you do use a vision LLM directly on documents, a few patterns help:

Provide the page count and ask the model to acknowledge it. Vision models occasionally miss pages or process them out of order. A simple "This document has N pages. After reading, confirm the number of pages you analyzed" catches the error.
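
One way to make that check automatic, as a sketch: ask for the count on a marked line and parse it back out. The PAGES_ANALYZED marker and the function names are assumptions introduced here for illustration; any unambiguous marker would do.

```python
import re
from typing import Optional


def page_count_prompt(n_pages: int) -> str:
    """Build the instruction that asks the model to report its page count."""
    return (
        f"This document has {n_pages} pages. After reading, start your reply "
        "with a line of the form 'PAGES_ANALYZED: <number>'."
    )


def confirmed_pages(reply: str) -> Optional[int]:
    """Extract the page count the model reported, or None if it is missing."""
    match = re.search(r"PAGES_ANALYZED:\s*(\d+)", reply)
    return int(match.group(1)) if match else None


def page_count_ok(reply: str, n_pages: int) -> bool:
    """True only when the model reported exactly the expected page count."""
    return confirmed_pages(reply) == n_pages
```

A reply that fails page_count_ok can be retried or split into smaller batches before anyone reads the answer.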

Ask for structured extraction with citations. Instead of "summarize this contract," ask "list every party named in this contract, with the page where they first appear." Citations let you verify and route exceptions.

For each party named in this contract, return:
- name: the party as written in the contract
- role: their role (buyer, seller, etc.)
- first_page: the page number where they first appear

Format as JSON. If something is ambiguous, include "uncertain": true.
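
On the receiving side, the citations and the "uncertain" flag are what make routing possible. A minimal sketch, assuming the model replies with a JSON array of objects using the field names from the prompt above; route_parties is a hypothetical helper.

```python
import json
from typing import List, Tuple


def route_parties(reply: str, total_pages: int) -> Tuple[List[dict], List[dict]]:
    """Split extracted parties into accepted rows and rows needing review."""
    accepted, review = [], []
    for row in json.loads(reply):
        name = row.get("name")
        page = row.get("first_page")
        valid = (
            isinstance(name, str)
            and bool(name.strip())
            and type(page) is int          # rejects strings, floats, and booleans
            and 1 <= page <= total_pages   # a citation must point at a real page
        )
        if valid and not row.get("uncertain", False):
            accepted.append(row)
        else:
            review.append(row)
    return accepted, review
```

Rows with an impossible page number are sent to review along with the ones the model flagged itself, since a bad citation is a cheap signal that the extraction went wrong.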

Tables and forms benefit from explicit format requests. "Return this table as JSON with the column headers as keys" produces cleaner output than "extract the table." Same for forms: "return all the form fields as a JSON object."

For multi-page documents, work in passes. A first pass to identify the structure ("which pages contain the financials?"), then targeted second passes on the relevant pages. Cheaper than asking the model to reason across the full document in one shot.
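
The two passes might be wired together like this. It is a sketch only: ask is a hypothetical callable that takes a prompt and a list of page numbers and returns the model's text reply, and the comma-separated reply format is an assumption you would enforce in your own prompt.

```python
from typing import Callable, List


def two_pass(
    ask: Callable[[str, List[int]], str],  # hypothetical wrapper around your provider
    all_pages: List[int],
    topic: str,
    question: str,
) -> str:
    """First locate the relevant pages, then ask the question on those only."""
    locate = (
        f"Which pages contain {topic}? "
        "Reply with page numbers separated by commas and nothing else."
    )
    reply = ask(locate, all_pages)
    wanted = {int(tok) for tok in reply.replace(",", " ").split() if tok.isdigit()}
    # Keep only page numbers that exist in the document, in document order.
    relevant = [p for p in all_pages if p in wanted]
    if not relevant:
        relevant = all_pages  # fall back to the full document rather than guess
    return ask(question, relevant)
```

Filtering the model's answer against the real page list guards against page numbers that do not exist in the document.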

What to skip

A few patterns that look productive but are not:

Sending high-resolution PDFs without consideration. Some providers downscale aggressively, so your high-res scan is processed at much lower effective resolution. Either confirm your provider preserves resolution or pre-process to a target resolution before sending.
One-shot extraction on long documents. Asking the model to "extract everything important" from a 200-page report produces something, but it is rarely complete. Split the work.
Relying on a single response for compliance-critical extraction. Vision LLMs are sometimes confidently wrong. For anything that matters, verify against the source page and use a second model or a human reviewer on the lowest-confidence outputs.
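
For the second-model check, a simple field-by-field comparison is often enough to decide what a human needs to look at. A sketch, assuming both extractions arrive as flat dictionaries of field name to value; disagreements is a hypothetical helper, and the normalization (whitespace and case only) is deliberately conservative.

```python
from typing import Dict, Optional, Tuple


def disagreements(primary: dict, secondary: dict) -> Dict[str, Tuple[Optional[object], Optional[object]]]:
    """Map each field the two extractions disagree on to both values."""

    def norm(value) -> str:
        # Ignore differences in case and whitespace, nothing else.
        return " ".join(str(value).split()).casefold()

    diffs = {}
    for field in sorted(set(primary) | set(secondary)):
        a, b = primary.get(field), secondary.get(field)
        if a is None or b is None or norm(a) != norm(b):
            diffs[field] = (a, b)
    return diffs
```

Fields that appear in the result go to a reviewer together with the cited source page; fields where both models agree can pass through with lighter sampling.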

Documents were one of the first places LLMs proved they were not just text-only. By 2026 they are a routine part of LLM applications. The skill is knowing which tool covers which part of the pipeline — and the answer is usually "more than one."

Comments 2

u/doc_parser · 1 month ago

ocr plus vision models changed how we handle scanned docs completely. wish the article went a bit deeper on handling multi page documents tho

u/doc_d2 · 1 month ago

ocr plus a vision model for layout is the combo that finally works for our scanned forms. wish it had covered table extraction more tho
