Before you embed anything longer than a paragraph, you have to decide how to cut it up. That step is chunking, and it quietly determines how well your whole retrieval system works. Get it wrong and even a great embedding model returns the wrong passages.
Why chunk at all
Two reasons force the issue.
First, embedding models have an input limit. Feed a model more tokens than its maximum and, depending on the library or API, it either rejects the input or silently truncates it, dropping the end of your text. A book cannot become one vector.
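To make the cost of truncation concrete, here is a minimal sketch. It assumes whitespace-separated words stand in for real tokens and uses a hypothetical limit of 512; the function name is illustrative and your model's tokenizer and maximum will differ.

```python
def truncation_report(text, max_tokens):
    """Split text into the tokens a truncating model keeps and the ones it drops."""
    tokens = text.split()  # assumption: whitespace words stand in for model tokens
    return tokens[:max_tokens], tokens[max_tokens:]


# A 1,000-"token" document sent to a model with a hypothetical 512-token limit.
document = " ".join(f"word{i}" for i in range(1000))
kept, dropped = truncation_report(document, max_tokens=512)
print(f"kept {len(kept)} tokens, lost {len(dropped)}")
```

Nearly half of the document never reaches the vector, and nothing in the embedding itself tells you so.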
Second, and more important, a whole document embedded as a single vector is too blurry to retrieve well. An embedding is one point in meaning-space. If that point has to represent a 30-page document covering ten topics, it lands in some bland average position that is near nothing in particular. A user asking about topic seven will not find it, because the document's single vector does not point at topic seven; it points at the mush of all ten. Splitting the document into focused chunks gives each idea its own sharp vector that a query can actually match.
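A toy calculation shows the effect. This sketch assumes each of the ten topics sits on its own axis and models the whole-document vector as the plain average of the topic vectors. Real embedding models are not that tidy, so treat the numbers as an illustration of the direction of the effect, not a prediction.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


NUM_TOPICS = 10

# Toy assumption: one axis per topic, so each focused chunk is a one-hot vector.
topic_vectors = [
    [1.0 if axis == topic else 0.0 for axis in range(NUM_TOPICS)]
    for topic in range(NUM_TOPICS)
]

# Toy assumption: the whole-document vector is the average of its topics.
doc_vector = [sum(column) / NUM_TOPICS for column in zip(*topic_vectors)]

query = topic_vectors[6]  # a question about topic seven
whole_doc_score = cosine(query, doc_vector)
best_chunk_score = max(cosine(query, chunk) for chunk in topic_vectors)

print(f"whole document: {whole_doc_score:.2f}")
print(f"best chunk:     {best_chunk_score:.2f}")
```

In this toy setup the averaged document scores about 0.32 against the query, while the chunk for topic seven scores 1.00.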
The three common strategies
Fixed-size chunking cuts every N tokens. It is the simple baseline: predictable, fast, trivial to implement. Its weakness is that it ignores meaning and will happily slice through the middle of a sentence or a table.
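A minimal sketch of the baseline, assuming the text is already tokenized (whitespace words stand in for tokens here) and using an illustrative function name:

```python
def fixed_size_chunks(tokens, size):
    """Cut a token list into consecutive pieces of at most `size` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


# Assumption: whitespace words stand in for model tokens.
tokens = "The quick brown fox jumps over the lazy dog. It then naps.".split()
chunks = fixed_size_chunks(tokens, size=5)
for chunk in chunks:
    print(" ".join(chunk))
```

The second chunk ends with the first word of the next sentence, which is exactly the weakness described above: the cut lands wherever the count says, not where the meaning breaks.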
Recursive chunking splits on a priority list of natural boundaries before falling back to a hard cut. It tries paragraph breaks first, then sentence breaks, then a raw token split only if a piece is still too big. This keeps most chunks aligned with the document's structure while still respecting the size cap. For most corpora this is the right default.
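One way to sketch that fallback order is below. The boundary patterns, the whitespace token count, and the function names are assumptions for illustration, not any particular library's implementation.

```python
import re

# Boundaries in priority order: paragraph breaks first, then sentence ends.
BOUNDARIES = [r"\n\s*\n", r"(?<=[.!?])\s+"]
JOINERS = ["\n\n", " "]  # how to glue pieces back together at each level


def count_tokens(text):
    return len(text.split())  # assumption: whitespace words stand in for tokens


def recursive_chunks(text, max_tokens, level=0):
    """Split on the highest-priority boundary that gets pieces under the cap."""
    text = text.strip()
    if count_tokens(text) <= max_tokens:
        return [text] if text else []
    if level == len(BOUNDARIES):
        # No natural boundary left: fall back to a hard cut every max_tokens.
        words = text.split()
        return [" ".join(words[i:i + max_tokens])
                for i in range(0, len(words), max_tokens)]

    chunks, buffer = [], ""
    for part in re.split(BOUNDARIES[level], text):
        part = part.strip()
        if not part:
            continue
        candidate = buffer + JOINERS[level] + part if buffer else part
        if count_tokens(candidate) <= max_tokens:
            buffer = candidate  # keep packing neighbors into one chunk
            continue
        if buffer:
            chunks.append(buffer)
        if count_tokens(part) <= max_tokens:
            buffer = part
        else:
            # This piece alone is too big: retry it at the next boundary level.
            chunks.extend(recursive_chunks(part, max_tokens, level + 1))
            buffer = ""
    if buffer:
        chunks.append(buffer)
    return chunks


text = "Alpha one two. Beta three four five.\n\nGamma six."
chunks = recursive_chunks(text, max_tokens=4)
print(chunks)
```

With a cap of four tokens, the first paragraph is too big and gets split at its sentence boundary, while the short second paragraph stays whole. The hard cut only runs for a sentence that exceeds the cap on its own.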
Semantic chunking places boundaries where the topic actually shifts, often by embedding sentences and cutting where consecutive sentences stop being similar. It can produce the cleanest, most self-contained chunks, but it is slower, because every sentence has to be embedded before anything is indexed, and its quality depends on the similarity threshold and the model used to detect shifts. It is the step up you take when measurement says recursive chunking is not enough.
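The consecutive-similarity idea can be sketched as follows. The bag-of-words `toy_embed` is a stand-in for a real sentence embedding model, and the 0.2 threshold is an arbitrary assumption you would tune; both are here only so the example runs on its own.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Stand-in embedding: word counts. Swap in a real model in practice."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, embed, threshold):
    """Start a new chunk wherever two consecutive sentences are dissimilar."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    previous = embed(sentences[0])
    for sentence in sentences[1:]:
        current = embed(sentence)
        if cosine(previous, current) < threshold:
            groups.append([sentence])    # similarity dropped: topic shift
        else:
            groups[-1].append(sentence)  # still the same topic
        previous = current
    return [" ".join(group) for group in groups]


sentences = [
    "Cats sleep most of the day.",
    "Cats also sleep at night.",
    "Python lists are mutable.",
    "Python tuples are not mutable.",
]
chunks = semantic_chunks(sentences, toy_embed, threshold=0.2)
print(chunks)
```

The only cut lands between the second and third sentences, where the vocabulary stops overlapping. A production splitter would typically also enforce a maximum chunk size, which this sketch leaves out.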
The deeper treatment of these, including parent-child retrieval and document-type-aware splitting, lives in "Chunking strategies explained." This article is about how chunking interacts specifically with embeddings.
Overlap and the core tradeoff
Whatever you split on, add a little overlap so each chunk repeats a slice of its neighbor. Overlap protects ideas that straddle a boundary: without it, a thought split across two chunks embeds badly in both. A common starting point is 10 to 20 percent of the chunk size.
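A minimal sketch of overlap on top of fixed-size windows, assuming overlap is given as a fraction of the chunk size (so 0.2 means 20 percent) and that the text is already tokenized. The function name is illustrative.

```python
def chunks_with_overlap(tokens, size, overlap):
    """Sliding windows of `size` tokens; each repeats the tail of the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be a fraction in [0, 1)")
    # Advance by the non-overlapping part of the window, at least one token.
    step = max(1, size - round(size * overlap))
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
        start += step
    return chunks


tokens = list(range(10))  # integers stand in for tokens
chunks = chunks_with_overlap(tokens, size=5, overlap=0.2)
print(chunks)
```

With a size of 5 and 20 percent overlap, each window starts on the last token of the one before it, so an idea that straddles a boundary appears intact in at least one chunk.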
Then there is the tradeoff that sits at the center of all of this:
- Chunks too small lose context. A 50-token fragment may be too thin to carry a complete idea, and its vector matches surface phrasing more than meaning. Retrieval finds snippets that lack the surrounding information the chat model needs to answer.
- Chunks too big dilute the vector. The blurry-average problem from a whole document returns in miniature: a 2,000-token chunk covering several sub-topics produces a vague embedding that matches everything weakly and nothing strongly.
The right size lives between these, and it depends on your documents, your embedding model's input limit, and your queries.
Settle it by measuring retrieval
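You cannot eyeball the best chunk size. Measure it. Build a small set of real queries, each paired with the passage that should be retrieved. Label the passage itself rather than a chunk ID, because chunk boundaries move with every configuration. Then sweep the parameters: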
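```python
# chunk, embed_and_index, and recall_at_k are placeholders for your own pipeline.
for size in [256, 512, 768, 1024]:       # chunk size in tokens
    for overlap in [0, 0.1, 0.2]:        # overlap as a fraction of chunk size
        chunks = chunk(corpus, size=size, overlap=overlap)
        index = embed_and_index(chunks)
        score = recall_at_k(index, eval_queries, k=5)
        print(size, overlap, score)
```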
Pick the configuration with the best recall on your own pairs, not the one that sounded reasonable. Chunking is a tuning problem with a measurable answer, and a couple of hours of sweeping beats months of guessing.
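For reference, the metric itself is small. The sketch below is one possible shape, not the only one: it assumes a `search(query, k)` callable that returns chunk texts best-first, which differs from the index object passed in the sweep, and it counts a hit when any returned chunk contains the expected passage. The keyword-overlap `toy_search` and the sample data are made up so the example runs without a vector store.

```python
def recall_at_k(search, eval_pairs, k=5):
    """Fraction of queries whose expected passage shows up in the top-k chunks."""
    if not eval_pairs:
        return 0.0
    hits = 0
    for query, expected_passage in eval_pairs:
        if any(expected_passage in chunk for chunk in search(query, k)):
            hits += 1
    return hits / len(eval_pairs)


# Hypothetical data and retriever, for illustration only.
chunks = [
    "refunds are issued within 14 days",
    "shipping takes 3 to 5 business days",
    "support is open on weekdays",
]


def toy_search(query, k):
    """Rank chunks by shared words with the query (stand-in for vector search)."""
    query_words = set(query.lower().split())
    ranked = sorted(chunks, key=lambda c: -len(query_words & set(c.split())))
    return ranked[:k]


eval_pairs = [
    ("how long do refunds take", "within 14 days"),
    ("when is support open", "open on weekdays"),
    ("do you ship abroad", "international shipping"),  # not in the corpus
]
score = recall_at_k(toy_search, eval_pairs, k=1)
print(f"recall@1 = {score:.2f}")
```

Two of the three queries find their passage, so recall at 1 is about 0.67. The third pair is a deliberate miss, which is worth keeping in a real evaluation set too: it shows you what a failure looks like in the numbers.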