Chunking is the part of RAG where naive implementations quietly bleed quality. The default move — split every document at exactly 500 tokens, no overlap, ignore structure — works on slides but fails on real documents. The good news is that the variables that matter are a small set, and the defaults that work for most teams are well established.
What chunking actually decides
Three things change downstream when you change chunking:
- Embedding quality. Embeddings work best on coherent passages. Splitting a sentence in half produces two vectors that mean less than the original.
- Retrieval granularity. A chunk is the unit that gets returned. If your chunks are paragraphs, you can retrieve paragraphs. If they are 500-token blocks that cross paragraph boundaries, your retrieval is structurally noisier.
- Context efficiency. Smaller chunks give the model less to read per retrieved unit; larger chunks give more context per unit but fewer units fit in the budget.
Chunking is a tradeoff curve, not a "correct answer" problem.
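To make the context-efficiency side of that tradeoff concrete, here is a small arithmetic sketch. The 4,000-token budget is a hypothetical figure chosen for illustration, not a recommendation, and `chunks_that_fit` is an illustrative helper name.

```python
def chunks_that_fit(context_budget_tokens: int, chunk_tokens: int) -> int:
    """Return how many whole chunks of a given size fit in a token budget."""
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")
    return context_budget_tokens // chunk_tokens


# Hypothetical budget reserved for retrieved context (an assumption).
budget = 4000
for size in (256, 512, 1024, 2000):
    print(size, chunks_that_fit(budget, size))
# prints:
# 256 15
# 512 7
# 1024 3
# 2000 2
```

Doubling the chunk size roughly halves the number of distinct passages the model sees, which is the tradeoff in one line of integer division.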
Size: what teams actually use
The honest 2026 ranges, drawn from LlamaIndex/LangChain reference apps, Pinecone guidance, and the Anthropic contextual-retrieval examples:
- Short — 256–512 tokens. Dense technical content, code documentation, support tickets. Each chunk is one tight idea.
- Medium — 512–1,024 tokens. The most common production range. Generic prose, knowledge base articles, internal docs.
- Large — 1,500–2,000 tokens. Long-form narrative, research papers, legal sections. Used when the model needs the full surrounding context to answer well.
The historical advice of "200–400 tokens" came from a 2023 era of smaller context windows; current practice runs larger because frontier models handle longer chunks fine and re-rankers do better with more text per candidate.
Overlap: small but important
Without overlap, a sentence that straddles a chunk boundary is cut in two: its first half ends one chunk and its second half begins the next, so neither chunk carries the full thought. Overlap fixes this cheaply.
Typical defaults: 10–20% overlap, e.g., 100 tokens of overlap on a 600-token chunk (about 17%). Higher overlap (30%+) is usually overkill: it inflates the index with near-duplicate text and rarely improves retrieval measurably.
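A minimal sketch of fixed-size chunking with overlap, using the 600/100 example above. It assumes the input is already a list of tokens (any tokenizer will do; the function name `split_with_overlap` is illustrative).

```python
def split_with_overlap(tokens, chunk_size=600, overlap=100):
    """Slide a window of chunk_size over tokens, repeating `overlap` tokens
    at the start of each chunk after the first."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the window has reached the end of the input,
        # so no trailing chunk is made up purely of overlap.
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = list(range(1300))  # stand-in for real token ids
chunks = split_with_overlap(tokens)
print([len(c) for c in chunks])  # prints [600, 600, 300]
```

Each chunk begins with the last 100 tokens of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.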
Structure-aware chunking beats fixed-size
The biggest quality win in chunking is not the size; it is respecting document structure. Concretely:
- Never split mid-sentence. Use paragraph or sentence boundaries, even if it makes the chunk a bit smaller or larger than the target.
- Never split inside a code block. Code chunks should be one logical unit (function, class, file section).
- Never split a table. Markdown tables and structured data should chunk as a unit, even if it exceeds the size target.
- Respect headings. A chunk should ideally start at a heading and include the heading text, so the embedding "knows" what section the content is from.
Most modern chunkers (LangChain's RecursiveCharacterTextSplitter, LlamaIndex's SentenceSplitter, custom Markdown-aware splitters) handle this with a small amount of configuration. Use them.
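To show what those rules amount to, here is a dependency-free sketch rather than any library's actual implementation. It assumes Markdown-style input (blank-line paragraphs, `#` headings, triple-backtick fences) and approximates token counts by whitespace-separated words; both function names are illustrative.

```python
FENCE = "`" * 3  # Markdown code-fence marker


def split_blocks(text):
    """Split text into paragraphs on blank lines, keeping fenced code whole."""
    blocks, current, in_fence = [], [], False
    for line in text.splitlines():
        if line.strip().startswith(FENCE):
            in_fence = not in_fence
            current.append(line)
        elif not line.strip() and not in_fence:
            if current:  # a blank line outside a fence ends the block
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


def pack_blocks(blocks, target_tokens=512):
    """Greedily pack whole blocks into chunks near target_tokens.

    A new chunk starts at every heading; a block larger than the target
    becomes its own oversized chunk instead of being split.
    """
    chunks, current, current_len = [], [], 0
    for block in blocks:
        n = len(block.split())  # rough token estimate (assumption)
        starts_section = block.startswith("#")
        if current and (starts_section or current_len + n > target_tokens):
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(block)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Usage is `pack_blocks(split_blocks(markdown_text))`. Tables are not handled specially here; a real splitter would treat them as atomic blocks the same way the fence check treats code.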
Hierarchical / parent-child chunking
A pattern that has become standard: index small chunks for retrieval, but feed larger parent chunks (or full sections) to the model.
The flow:
- Split documents into small chunks (256–400 tokens) for embedding.
- Also store the larger parent (a section, a page, the original document).
- At query time, retrieve by the small chunks; pass the corresponding parent to the model.
This gives you the best of both worlds: precise retrieval signal (small chunks are easy to match by query) plus rich context for the model (large parents have room to answer well).
Both LangChain's ParentDocumentRetriever and LlamaIndex's hierarchical retrievers (HierarchicalNodeParser paired with AutoMergingRetriever) implement this directly. It is one of the highest-leverage upgrades you can make over naive chunking.
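The flow can be sketched without either library. This is a toy illustration, not how those retrievers are implemented: it scores children by word overlap with the query instead of using embeddings, and every name in it (`Child`, `build_index`, `retrieve_parents`) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Child:
    text: str
    parent_id: str


def build_index(parents, child_size=50):
    """Split each parent into small word-window children that remember their parent."""
    children = []
    for parent_id, text in parents.items():
        words = text.split()
        for i in range(0, len(words), child_size):
            children.append(Child(" ".join(words[i:i + child_size]), parent_id))
    return children


def score(query, text):
    """Toy relevance score: number of distinct words shared with the query."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parents(query, children, parents, k=3):
    """Rank children, then return the de-duplicated parents of the top k."""
    ranked = sorted(children, key=lambda c: score(query, c.text), reverse=True)
    seen, results = set(), []
    for child in ranked[:k]:
        if child.parent_id not in seen:
            seen.add(child.parent_id)
            results.append(parents[child.parent_id])
    return results


parents = {
    "billing": "Refunds are issued within five days. Invoices are emailed monthly.",
    "auth": "Password resets expire after one hour. Tokens rotate daily.",
}
children = build_index(parents, child_size=5)
print(retrieve_parents("password resets", children, parents, k=1))
```

The de-duplication step matters in practice: several top-ranked children often share one parent, and passing that parent to the model more than once wastes context.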
Semantic and contextual approaches
Two newer techniques worth knowing about:
Semantic chunking splits at detected topic boundaries — usually by computing embeddings of consecutive sentences and cutting where the embedding distance jumps. Useful when document structure is unreliable; sensitive to the threshold you pick. Not a default move.
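A minimal sketch of that boundary detection. To stay self-contained it uses bag-of-words counts as a stand-in for a real embedding model, which is an assumption made purely for illustration; the threshold of 0.8 is likewise arbitrary.

```python
import math
from collections import Counter


def embed(sentence):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(sentence.lower().split())


def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def semantic_chunks(sentences, threshold=0.8):
    """Start a new chunk wherever consecutive sentences are farther apart
    than the threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    previous = embed(sentences[0])
    for sentence in sentences[1:]:
        current = embed(sentence)
        if cosine_distance(previous, current) > threshold:
            chunks.append([])
        chunks[-1].append(sentence)
        previous = current
    return [" ".join(chunk) for chunk in chunks]


sentences = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "stock prices fell sharply today",
    "stock markets fell again",
]
print(len(semantic_chunks(sentences)))  # prints 2
```

Lowering the threshold to 0.5 would also cut between the two stock sentences, which illustrates the threshold sensitivity mentioned above.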
Contextual retrieval (Anthropic, 2024): before embedding each chunk, prepend a short LLM-generated summary of where the chunk came from. So instead of embedding "the rate was reduced by 25 basis points," you embed "From the Q3 earnings transcript discussing Federal Reserve policy: the rate was reduced by 25 basis points." Substantial precision improvement, especially on corpora where chunks lack context on their own. The cost is one extra LLM call per chunk at ingestion time. See Anthropic's contextual retrieval post.
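The ingestion step can be sketched as below. This is not Anthropic's code: `generate_context` is a hypothetical callable standing in for the LLM call, and `stub_context` is a placeholder that only formats the document title so the example runs offline.

```python
def contextualize(chunks, document_title, generate_context):
    """Pair each chunk with the context-prefixed text that gets embedded.

    The original chunk is kept separately so it can still be shown to the
    model or the user without the generated prefix.
    """
    records = []
    for chunk in chunks:
        context = generate_context(document_title, chunk)
        records.append({"text": chunk, "embed_text": f"{context} {chunk}"})
    return records


def stub_context(document_title, chunk):
    """Placeholder for an LLM call that would describe where the chunk sits
    in the document (hypothetical; replace with a real model request)."""
    return f"From {document_title}:"


records = contextualize(
    ["the rate was reduced by 25 basis points."],
    "the Q3 earnings transcript",
    stub_context,
)
print(records[0]["embed_text"])
# prints: From the Q3 earnings transcript: the rate was reduced by 25 basis points.
```

Passing the generator in as a parameter keeps the one-call-per-chunk cost visible and makes it easy to cache or batch those calls at ingestion time.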
Document-type adaptations
Generic settings work for prose. They do not work for everything else. A few adaptations that pay off:
- Code. Chunk by function or class boundaries, not by line count. Keep imports with the code they support. Include the file path in the chunk metadata.
- Tables. Linearize structured rows into "field: value" formats when chunking; index each row or small group of rows separately if rows answer queries on their own.
- Slides and PDFs. Run OCR and layout-aware extraction first (for example, with unstructured.io). Chunk by slide or page when the document structure is meaningful; chunk by token count when it is not.
- Conversational logs. Chunk by turn or by exchange (Q + A together). Strip system metadata that does not help retrieval.
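Two of these adaptations, code and tables, are easy to sketch with the standard library. The example below assumes Python source files and uses the `ast` module; the chunk dictionary layout and the "field: value" separator are illustrative choices, not a standard.

```python
import ast


def chunk_python_source(source, path):
    """Return one chunk per top-level function or class, with the module's
    imports prepended and the file path stored as metadata."""
    tree = ast.parse(source)
    imports = [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.Import, ast.ImportFrom))
    ]
    header = "\n".join(imports)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            body = ast.get_source_segment(source, node)
            text = f"{header}\n\n{body}" if header else body
            chunks.append({"path": path, "name": node.name, "text": text})
    return chunks


def linearize_row(headers, row):
    """Turn one table row into a 'field: value' string for embedding."""
    return "; ".join(f"{h}: {v}" for h, v in zip(headers, row))


source = "import os\n\ndef a():\n    return os.getcwd()\n\nclass B:\n    pass\n"
for chunk in chunk_python_source(source, "pkg/mod.py"):
    print(chunk["path"], chunk["name"])

print(linearize_row(["sku", "price"], ["A1", "9.99"]))
# prints: sku: A1; price: 9.99
```

This sketch prepends every module-level import to every chunk, which is the simplest way to keep imports with the code; filtering to only the imports each function uses would be a refinement.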
What to do this week
If you have a working RAG setup and you want to improve it:
1. Measure your current retrieval recall on a 20-input eval set. Most teams have not actually measured it.
2. Switch to structure-aware chunking with paragraph boundaries if you have not.
3. Add 10–15% overlap if you have none.
4. Try parent-child retrieval. It is often the biggest single quality win for most setups.
5. Test contextual retrieval if precision is still your bottleneck and the ingestion cost is acceptable.
Steps 2 and 3 are usually one configuration change. Step 4 is a refactor but a small one. Step 5 is a paid upgrade in compute that often pays for itself in answer quality.
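For the measurement step, a minimal sketch of one common recall-style metric: the fraction of eval queries for which at least one known-relevant chunk appears in the top k results. The data layout (dicts keyed by query id) and the chunk ids are assumptions for illustration.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of queries with at least one relevant chunk in the top k.

    retrieved: query id -> ranked list of chunk ids returned by the retriever
    relevant:  query id -> set of chunk ids a human marked as relevant
    """
    if not relevant:
        return 0.0
    hits = 0
    for query_id, relevant_ids in relevant.items():
        top_k = retrieved.get(query_id, [])[:k]
        if any(chunk_id in relevant_ids for chunk_id in top_k):
            hits += 1
    return hits / len(relevant)


relevant = {"q1": {"c3"}, "q2": {"c9"}}
retrieved = {"q1": ["c1", "c3"], "q2": ["c2", "c4"]}
print(recall_at_k(retrieved, relevant, k=2))  # prints 0.5
```

Run it once before changing anything, then again after each step, so every chunking change is judged against the same eval set.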