What chunk size should I start with?

For generic prose, 512–800 tokens with 50–100 tokens of overlap is a reasonable default in 2026 — large enough to carry full ideas, small enough to embed cleanly. For dense technical documentation, go smaller (256–400). For long-form narrative, go larger (1,000–1,500). Adjust based on your retrieval evals, not on what feels right.

Do I need semantic chunking?

For most cases, no. Fixed-size chunking with structural awareness (split on paragraph boundaries, never mid-sentence) covers the 90% case. Semantic chunking — splitting at detected topic boundaries — helps for highly variable document lengths but is sensitive to the splitter's judgment and not always worth the complexity. Start with structure-aware fixed-size; add semantic if your evals point to it.

What is Anthropic's "contextual retrieval"?

A pre-processing step published by Anthropic in 2024: before embedding each chunk, you prepend a short LLM-generated context summary ("This chunk is from the Q3 earnings call discussing X..."). The added context improves retrieval because the embedding now carries information about where the chunk came from, not just what it says. Worth using if quality matters more than ingestion cost.

Chunking Strategies, Explained

Chunking is the part of RAG where naive implementations quietly bleed quality. The default move — split every document at exactly 500 tokens, no overlap, ignore structure — works on slides but fails on real documents. The good news is that the variables that matter are a small set, and the defaults that work for most teams are well established.

What chunking actually decides

Three things change downstream when you change chunking:

Embedding quality. Embeddings work best on coherent passages. Splitting a sentence in half produces two vectors that mean less than the original.
Retrieval granularity. A chunk is the unit that gets returned. If your chunks are paragraphs, you can retrieve paragraphs. If they are 500-token blocks that cross paragraph boundaries, your retrieval is structurally noisier.
Context efficiency. Smaller chunks give the model less to read per retrieved unit; larger chunks give more context per unit but fewer units fit in the budget.

Chunking is a tradeoff curve, not a "correct answer" problem.

Size: what teams actually use

The honest 2026 ranges, drawn from LlamaIndex/LangChain reference apps, Pinecone guidance, and the Anthropic contextual-retrieval examples:

Short — 256–512 tokens. Dense technical content, code documentation, support tickets. Each chunk is one tight idea.
Medium — 512–1,024 tokens. The most common production range. Generic prose, knowledge base articles, internal docs.
Large — 1,500–2,000 tokens. Long-form narrative, research papers, legal sections. Used when the model needs the full surrounding context to answer well.

The historical advice of "200–400 tokens" came from a 2023 era of smaller context windows; current practice runs larger because frontier models handle longer chunks fine and re-rankers do better with more text per candidate.

Overlap: small but important

Without overlap, a sentence that straddles a chunk boundary effectively disappears — it ends one chunk and begins the next, and neither chunk carries the full thought. Overlap fixes this cheaply.

Typical defaults: 10–20% overlap, e.g., 100 tokens of overlap on a 600-token chunk. Higher overlap (30%+) is overkill and wastes index space without measurably helping retrieval.
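As a rough illustration of the overlap arithmetic above, here is a minimal sliding-window sketch. It assumes the text is already tokenized into a list; the function name chunk_with_overlap and the use of placeholder tokens are illustrative choices, not part of any library.

```python
def chunk_with_overlap(tokens, size=600, overlap=100):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # each window starts this many tokens after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Placeholder tokens stand in for real tokenizer output.
tokens = [f"w{i}" for i in range(1400)]
chunks = chunk_with_overlap(tokens, size=600, overlap=100)
print([len(c) for c in chunks])  # [600, 600, 400]
```

With a 600-token window and 100 tokens of overlap, each new chunk repeats the last 100 tokens of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.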

Structure-aware chunking beats fixed-size

The biggest quality win in chunking is not the size; it is respecting document structure. Concretely:

Never split mid-sentence. Use paragraph or sentence boundaries, even if it makes the chunk a bit smaller or larger than the target.
Never split inside a code block. Code chunks should be one logical unit (function, class, file section).
Never split a table. Markdown tables and structured data should chunk as a unit, even if it exceeds the size target.
Respect headings. A chunk should ideally start at a heading and include the heading text, so the embedding "knows" what section the content is from.

Most modern chunkers (LangChain's RecursiveCharacterTextSplitter, LlamaIndex's SentenceSplitter, custom Markdown-aware splitters) handle much of this with a small amount of configuration: they prefer natural boundaries such as paragraph breaks before falling back to a hard cut. Use them.
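To make the rules above concrete, here is a small standard-library sketch of a structure-aware splitter. It assumes Markdown-style input (blank lines between blocks, "#" headings, triple-backtick code fences) and uses word counts as a rough stand-in for tokens; split_blocks and structure_aware_chunks are hypothetical names, not the API of LangChain or LlamaIndex.

```python
FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the sketch fence-safe


def split_blocks(text):
    """Split on blank lines, but keep fenced code blocks in one piece."""
    blocks, current, in_fence = [], [], False
    for line in text.splitlines():
        if line.strip().startswith(FENCE):
            in_fence = not in_fence
        if not line.strip() and not in_fence:
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


def structure_aware_chunks(text, target_tokens=600):
    """Pack whole blocks into chunks; start a new chunk at each heading."""
    chunks, current, current_len = [], [], 0
    for block in split_blocks(text):
        n = len(block.split())  # word count as a rough token proxy
        starts_section = block.startswith("#")
        if current and (starts_section or current_len + n > target_tokens):
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(block)  # an oversized block still stays whole
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


doc = "\n\n".join([
    "# Setup",
    "Install the tool.",
    FENCE + "python\ndef f():\n\n    return 1\n" + FENCE,
    "# Usage",
    "Run it daily.",
])
chunks = structure_aware_chunks(doc, target_tokens=50)
print(len(chunks))  # 2
```

Because blocks are never cut, a long table or code block can exceed the target size, which matches the rule that structure wins over the size target. Markdown tables stay intact here simply because their rows are not separated by blank lines.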

Hierarchical / parent-child chunking

A pattern that has become standard: index small chunks for retrieval, but feed larger parent chunks (or full sections) to the model.

The flow:

Split documents into small chunks (256–400 tokens) for embedding.
Also store the larger parent (a section, a page, the original document).
At query time, retrieve by the small chunks; pass the corresponding parent to the model.

This gives you the best of both worlds: precise retrieval signal (small chunks are easy to match by query) plus rich context for the model (large parents have room to answer well).

Both LangChain's ParentDocumentRetriever and LlamaIndex's hierarchical retrievers implement this directly. It is one of the highest-leverage upgrades you can make over naive chunking.
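The flow described above can be sketched without any framework. This is a toy version under stated assumptions: word-overlap scoring stands in for embedding similarity, and build_index, score, and retrieve_parents are hypothetical helpers rather than the LangChain or LlamaIndex implementations.

```python
from collections import Counter


def build_index(parents, child_size=40):
    """Return (child_text, parent_id) pairs; `parents` maps id -> full text."""
    index = []
    for parent_id, text in parents.items():
        words = text.split()
        for start in range(0, len(words), child_size):
            child = " ".join(words[start:start + child_size])
            index.append((child, parent_id))
    return index


def score(query, passage):
    """Toy lexical overlap; a real system would compare embeddings."""
    q = Counter(query.lower().split())
    p = Counter(passage.lower().split())
    return sum((q & p).values())


def retrieve_parents(query, index, parents, top_k=3):
    """Rank small children, then return their de-duplicated parents."""
    ranked = sorted(index, key=lambda item: score(query, item[0]), reverse=True)
    seen, results = set(), []
    for _child, parent_id in ranked[:top_k]:
        if parent_id not in seen:
            seen.add(parent_id)
            results.append(parents[parent_id])
    return results


parents = {
    "billing": "Refunds are issued within five days. Invoices are sent monthly by email.",
    "security": "Passwords must be rotated every ninety days. Two factor login is required.",
}
index = build_index(parents, child_size=6)
print(retrieve_parents("when are refunds issued", index, parents, top_k=2))
```

The de-duplication step matters: several top-ranked children often share one parent, and passing that parent to the model twice only wastes context budget.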

Semantic and contextual approaches

Two newer techniques worth knowing about:

Semantic chunking splits at detected topic boundaries — usually by computing embeddings of consecutive sentences and cutting where the embedding distance jumps. Useful when document structure is unreliable; sensitive to the threshold you pick. Not a default move.
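A minimal sketch of that boundary detection follows. It assumes sentences are already split, and it uses bag-of-words count vectors as a stand-in for a real embedding model; embed, cosine_distance, semantic_chunks, and the 0.8 threshold are illustrative choices only.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Stand-in embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine_distance(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def semantic_chunks(sentences, threshold=0.8):
    """Cut between consecutive sentences whose distance exceeds `threshold`."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for prev, nxt in zip(sentences, sentences[1:]):
        if cosine_distance(embed(prev), embed(nxt)) > threshold:
            chunks.append(current)
            current = []
        current.append(nxt)
    chunks.append(current)
    return chunks


sentences = [
    "The cache stores recent query results.",
    "The cache evicts old query results first.",
    "Invoices go out every month.",
    "Late invoices accrue interest every month.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

In this toy run the cut lands between the second and third sentences, where the vocabulary changes completely. Raising or lowering the threshold moves the cuts, which is the sensitivity noted above.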

Contextual retrieval (Anthropic, 2024): before embedding each chunk, prepend a short LLM-generated description of where the chunk sits in its source document. So instead of embedding "the rate was reduced by 25 basis points," you embed "From the Q3 earnings transcript discussing Federal Reserve policy: the rate was reduced by 25 basis points." Anthropic reports that contextual embeddings alone cut failed top-20 retrievals by 35%. The improvement is most noticeable on corpora where chunks lack context on their own. The cost is one extra LLM call per chunk at ingestion time. See Anthropic's contextual retrieval post.
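The ingestion step might look roughly like the sketch below. The LLM call is replaced by a stub (stub_generate_context) that simply reuses the document's first line, so the example runs offline; in practice that function would prompt a model with the document and the chunk. All names here are hypothetical.

```python
def stub_generate_context(document, chunk):
    """Placeholder for the per-chunk LLM call; uses the first line as a title."""
    lines = document.strip().splitlines()
    title = lines[0] if lines else "document"
    return f"From the {title}:"


def contextualize_chunks(document, chunks, generate_context=stub_generate_context):
    """Pair each chunk with the context-prefixed text that gets embedded."""
    records = []
    for chunk in chunks:
        context = generate_context(document, chunk)  # one call per chunk
        records.append({
            "text": chunk,                        # what the model reads later
            "embed_text": f"{context} {chunk}",  # what the embedder sees
        })
    return records


document = (
    "Q3 earnings transcript discussing Federal Reserve policy\n\n"
    "Analysts asked about borrowing costs. "
    "the rate was reduced by 25 basis points"
)
records = contextualize_chunks(document, ["the rate was reduced by 25 basis points"])
print(records[0]["embed_text"])
```

Keeping the original chunk alongside the prefixed text is a design choice in this sketch: it lets you embed the contextualized version while still deciding separately what to show the model.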

Document-type adaptations

Generic settings work for prose. They do not work for everything else. A few adaptations that pay off:

Code. Chunk by function or class boundaries, not by line count. Keep imports with the code they support. Include the file path in the chunk metadata.
Tables. Linearize structured rows into "field: value" formats when chunking; index each row or small group of rows separately if rows answer queries on their own.
Slides and PDFs. Run OCR and layout-aware extraction first (for example, with unstructured.io). Chunk by slide or page when the document structure is meaningful; chunk by token count when it is not.
Conversational logs. Chunk by turn or by exchange (Q + A together). Strip system metadata that does not help retrieval.
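Two of these adaptations are easy to sketch with the standard library. The example below assumes Python source files for the code case and a simple header-plus-rows table for the tabular case; chunk_python_source and linearize_rows are hypothetical helpers. As a simplification, every top-level import is attached to every code chunk rather than only the imports each function uses.

```python
import ast


def chunk_python_source(source, file_path):
    """One chunk per top-level function or class, with imports and file path."""
    tree = ast.parse(source)
    imports = [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.Import, ast.ImportFrom))
    ]
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            body = ast.get_source_segment(source, node)
            chunks.append({
                "text": "\n".join(imports + [body]),
                "metadata": {"file_path": file_path, "name": node.name},
            })
    return chunks


def linearize_rows(header, rows):
    """Turn each table row into a 'field: value' string that can stand alone."""
    return [
        "; ".join(f"{field}: {value}" for field, value in zip(header, row))
        for row in rows
    ]


source = "import math\n\ndef area(r):\n    return math.pi * r * r\n\nclass Shape:\n    pass\n"
code_chunks = chunk_python_source(source, "geometry.py")
print([c["metadata"]["name"] for c in code_chunks])  # ['area', 'Shape']

row_chunks = linearize_rows(["plan", "price"], [["Pro", "$20"], ["Team", "$50"]])
print(row_chunks[0])  # plan: Pro; price: $20
```

Parsing with the language's own syntax tree avoids cutting a function in half, which line-count splitting cannot guarantee.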

What to do this week

If you have a working RAG setup and you want to improve it:

1. Measure your current retrieval recall on a 20-input eval set. Most teams have not actually measured it.
2. Switch to structure-aware chunking with paragraph boundaries if you have not.
3. Add 10–15% overlap if you have none.
4. Try parent-child retrieval. The biggest single quality win for most setups.
5. Test contextual retrieval if precision is still your bottleneck and the ingestion cost is acceptable.

Steps 2 and 3 are usually one configuration change. Step 4 is a refactor but a small one. Step 5 is a paid upgrade in compute that often pays for itself in answer quality.
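For the measurement in the first step, a small harness is usually enough. The sketch below assumes an eval set of (query, relevant chunk ids) pairs and a retrieve callable that returns ranked chunk ids. It computes a hit-rate style recall@k, counting a query as a success when any relevant chunk appears in the top k. The names and the fake retriever are illustrative only.

```python
def recall_at_k(eval_set, retrieve, k=5):
    """Fraction of queries with at least one relevant chunk in the top k."""
    if not eval_set:
        return 0.0
    hits = 0
    for query, relevant_ids in eval_set:
        retrieved = retrieve(query)[:k]
        if any(chunk_id in relevant_ids for chunk_id in retrieved):
            hits += 1
    return hits / len(eval_set)


# A fake retriever with canned rankings stands in for the real pipeline.
canned = {
    "how do refunds work": ["c7", "c2", "c9"],
    "password rotation policy": ["c4", "c1", "c8"],
}
eval_set = [
    ("how do refunds work", {"c2"}),
    ("password rotation policy", {"c5"}),
]
print(recall_at_k(eval_set, lambda q: canned[q], k=3))  # 0.5
```

Rerunning the same eval set after each chunking change gives a like-for-like comparison, which is the point of measuring before adjusting anything.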
