Chunking Text for Embeddings

Article summary

You chunk text because embedding models have an input limit and because a whole document embedded as one vector is too blurry to retrieve precisely. Fixed-size chunking is the simple baseline, recursive chunking splits on natural boundaries first, and semantic chunking splits where the topic shifts. Add a little overlap so ideas are not cut in half. The core tradeoff: chunks too small lose context, chunks too big dilute the vector. Settle it by measuring retrieval quality, not by intuition.

Before you embed anything longer than a paragraph, you have to decide how to cut it up. That step is chunking, and it quietly determines how well your whole retrieval system works. Get it wrong and even a great embedding model returns the wrong passages.

Why chunk at all

Two reasons force the issue.

First, embedding models have an input limit. Feed a model more tokens than its maximum and, depending on the model and API, it either rejects the request or silently truncates the input, dropping the end of your text. A book cannot become one vector.

Second, and more important, a whole document embedded as a single vector is too blurry to retrieve well. An embedding is one point in meaning-space. If that point has to represent a 30-page document covering ten topics, it lands in some bland average position that is near nothing in particular. A user asking about topic seven will not find it, because the document's single vector does not point at topic seven; it points at the mush of all ten. Splitting the document into focused chunks gives each idea its own sharp vector that a query can actually match.
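To put a rough number on the blur, here is a toy sketch. It assumes each of the ten topics sits on its own axis and that the whole-document vector behaves like the average of its topic vectors. Real embedding models do not literally average like this, so treat the figures as an illustration of the effect, not a measurement.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

NUM_TOPICS = 10

# Toy assumption: each topic is a one-hot vector on its own axis.
topic_vectors = [
    [1.0 if i == t else 0.0 for i in range(NUM_TOPICS)]
    for t in range(NUM_TOPICS)
]

# Toy assumption: the whole-document vector is the mean of its topic vectors.
document_vector = [sum(column) / NUM_TOPICS for column in zip(*topic_vectors)]

query = topic_vectors[6]  # a query about "topic seven"

whole_doc_score = cosine(query, document_vector)        # 1 / sqrt(10), about 0.32
focused_chunk_score = cosine(query, topic_vectors[6])   # 1.0

print(round(whole_doc_score, 2), round(focused_chunk_score, 2))
```

Under these assumptions the single document vector scores about 0.32 against the query, while a chunk devoted to topic seven scores 1.0.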

The three common strategies

Fixed-size chunking cuts every N tokens. It is the simple baseline: predictable, fast, trivial to implement. Its weakness is that it ignores meaning and will happily slice through the middle of a sentence or a table.
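A minimal sketch of the baseline follows. The function name is made up for illustration, and whitespace-separated words stand in for model tokens; a real pipeline would count tokens with the embedding model's own tokenizer.

```python
def fixed_size_chunks(text, size):
    """Cut text into consecutive chunks of at most `size` tokens.

    Assumption: whitespace-separated words approximate model tokens.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[start:start + size])
        for start in range(0, len(tokens), size)
    ]

sample = "The cat sat on the mat. The dog slept by the door."
chunks = fixed_size_chunks(sample, size=4)
for chunk in chunks:
    print(repr(chunk))
```

With a size of 4, the second chunk comes out as "the mat. The dog", which is exactly the mid-sentence slicing described above.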

Recursive chunking splits on a priority list of natural boundaries before falling back to a hard cut. It tries paragraph breaks first, then sentence breaks, then a raw token split only if a piece is still too big. This keeps most chunks aligned with the document's structure while still respecting the size cap. For most corpora this is the right default.
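One way to sketch that priority list is below. The separator patterns, the function names, and the word-based token count are all assumptions made for illustration, and adjacent small pieces are packed together greedily, which is one reasonable policy among several. Library splitters differ in the details.

```python
import re

def count_tokens(text):
    # Assumption: whitespace-separated words approximate model tokens.
    return len(text.split())

def recursive_chunks(text, max_tokens,
                     separators=(r"\n\s*\n", r"(?<=[.!?])\s+")):
    """Split on paragraphs, then sentences, then raw tokens as a last resort."""
    text = text.strip()
    if not text:
        return []
    if count_tokens(text) <= max_tokens:
        return [text]
    if not separators:
        # No natural boundary left: fall back to a hard token cut.
        tokens = text.split()
        return [" ".join(tokens[i:i + max_tokens])
                for i in range(0, len(tokens), max_tokens)]

    chunks, buffer = [], []
    for piece in re.split(separators[0], text):
        piece = piece.strip()
        if not piece:
            continue
        if count_tokens(piece) > max_tokens:
            # Flush what we have, then retry this piece with finer separators.
            if buffer:
                chunks.append(" ".join(buffer))
                buffer = []
            chunks.extend(recursive_chunks(piece, max_tokens, separators[1:]))
        elif count_tokens(" ".join(buffer + [piece])) > max_tokens:
            # Piece fits on its own but not alongside the buffered pieces.
            chunks.append(" ".join(buffer))
            buffer = [piece]
        else:
            buffer.append(piece)
    if buffer:
        chunks.append(" ".join(buffer))
    return chunks

doc = "Alpha one two. Beta three four five.\n\nGamma six."
result = recursive_chunks(doc, max_tokens=4)
print(result)
```

Note that this sketch joins packed pieces with a single space, so paragraph breaks inside a chunk are not preserved.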

Semantic chunking places boundaries where the topic actually shifts, often by embedding sentences and cutting where consecutive sentences stop being similar. It can produce the cleanest, most self-contained chunks, but it is slower and depends on the splitter's judgment. It is the step up you take when measurement says recursive chunking is not enough.
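The "cut where consecutive sentences stop being similar" idea can be sketched as follows. To stay self-contained, the example uses a bag-of-words counter as a stand-in for a real embedding model, and the 0.2 threshold is an arbitrary value chosen for this toy data. Both are assumptions; in practice you would pass in your own embedding function and tune the threshold.

```python
import math
import re
from collections import Counter

def toy_embed(sentence):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences, threshold=0.2, embed=toy_embed):
    """Start a new chunk wherever neighboring sentences fall below `threshold`."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    previous = embed(sentences[0])
    for sentence in sentences[1:]:
        current = embed(sentence)
        if cosine(previous, current) < threshold:
            groups.append([sentence])      # topic shift: open a new chunk
        else:
            groups[-1].append(sentence)    # same topic: extend current chunk
        previous = current
    return [" ".join(group) for group in groups]

sentences = [
    "Cats sleep most of the day.",
    "Cats also sleep at night.",
    "Interest rates rose sharply.",
    "Higher rates slow borrowing.",
]
topic_chunks = semantic_chunks(sentences)
print(topic_chunks)
```

On this toy input the boundary falls between the cat sentences and the interest-rate sentences. A production version would also need a size cap, since nothing here stops a long single-topic run from exceeding the model's input limit.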

The deeper treatment of these, including parent-child retrieval and document-type-aware splitting, lives in "Chunking strategies explained." This article is about how chunking interacts specifically with embeddings.

Overlap and the core tradeoff

Whatever you split on, add a little overlap so each chunk repeats a slice of its neighbor. Overlap protects ideas that straddle a boundary: without it, a thought split across two chunks embeds badly in both. A common starting point is 10 to 20 percent of the chunk size.
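As a sketch, overlap is just a sliding window whose step is smaller than the chunk size. The function name is hypothetical, and it assumes the text is already tokenized into a list and that overlap is given as a fraction of the chunk size.

```python
def chunks_with_overlap(tokens, size, overlap=0.1):
    """Sliding window over a token list.

    `overlap` is a fraction of `size`; each window repeats that many
    tokens from the end of the previous window.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be a fraction in [0, 1)")
    overlap_tokens = round(size * overlap)
    step = max(1, size - overlap_tokens)  # always move forward
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return windows

tokens = list(range(20))  # integers stand in for tokens
windows = chunks_with_overlap(tokens, size=10, overlap=0.2)
for window in windows:
    print(window)
```

With a size of 10 and 20 percent overlap, each window starts 8 tokens after the previous one, so tokens 8 and 9 appear at the end of the first window and again at the start of the second.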

Then there is the tradeoff that sits at the center of all of this:

Chunks too small lose context. A 50-token fragment may be too thin to carry a complete idea, and its vector matches surface phrasing more than meaning. Retrieval finds snippets that lack the surrounding information the chat model needs to answer.
Chunks too big dilute the vector. The blurry-average problem from a whole document returns in miniature: a 2,000-token chunk covering several sub-topics produces a vague embedding that matches everything weakly and nothing strongly.

The right size lives between these, and it depends on your documents, your embedding model's input limit, and your queries.

Settle it by measuring retrieval

You cannot eyeball the best chunk size. Measure it. Build a small set of real queries each paired with the chunk that should be retrieved, then sweep the parameters:

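# chunk, embed_and_index, and recall_at_k are placeholders for your own pipeline.
for size in [256, 512, 768, 1024]:      # chunk size in tokens
    for overlap in [0, 0.1, 0.2]:       # overlap as a fraction of chunk size
        chunks = chunk(corpus, size=size, overlap=overlap)
        index = embed_and_index(chunks)
        score = recall_at_k(index, eval_queries, k=5)
        print(size, overlap, score)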

Pick the configuration with the best recall on your own pairs, not the one that sounded reasonable. Chunking is a tuning problem with a measurable answer, and a couple of hours of sweeping beats months of guessing.
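For reference, a recall-at-k metric might look like the sketch below. The search callable, the chunk ids, and the shape of the eval pairs are assumptions for illustration, and the signature differs slightly from the placeholder in the sweep: it takes a search function, not an index object. It also glosses over one practical wrinkle: chunk ids change whenever you change the chunk size, so a real eval set usually needs to identify the target by source passage, not by chunk id.

```python
def recall_at_k(search, eval_queries, k=5):
    """Fraction of queries whose expected chunk appears in the top k results.

    `search(query, k)` is a hypothetical callable returning ranked chunk ids.
    `eval_queries` is a list of (query, expected_chunk_id) pairs.
    """
    if not eval_queries:
        return 0.0
    hits = sum(
        1
        for query, expected_id in eval_queries
        if expected_id in search(query, k)[:k]
    )
    return hits / len(eval_queries)

# Canned rankings stand in for a real vector index.
fake_rankings = {
    "how do refunds work": ["c3", "c7", "c1"],
    "reset my password": ["c2", "c9", "c4"],
    "shipping times": ["c5", "c6", "c8"],
}

def fake_search(query, k):
    return fake_rankings[query][:k]

eval_queries = [
    ("how do refunds work", "c7"),
    ("reset my password", "c4"),
    ("shipping times", "c0"),   # expected chunk is never retrieved
]

score_at_2 = recall_at_k(fake_search, eval_queries, k=2)
score_at_3 = recall_at_k(fake_search, eval_queries, k=3)
print(score_at_2, score_at_3)
```

Here one of three queries is answered within the top 2 and two of three within the top 3, which is the kind of number you compare across chunk sizes and overlaps.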

Frequently asked questions

What chunk size should I start with?

For general prose, somewhere around 256 to 800 tokens with a small overlap of 10 to 20 percent is a reasonable default. Dense technical text usually wants smaller chunks so each vector stays focused; long narrative tolerates larger ones. The number that matters is not what feels right but what scores best on your retrieval eval. Start in this range, then move the size based on measured recall.

Why add overlap between chunks?

Overlap means each chunk repeats a slice of the previous one. Without it, a sentence or idea that straddles a chunk boundary gets cut in half, and neither resulting chunk embeds it well, so a query about that idea retrieves poorly. A modest overlap (often 10 to 20 percent of chunk size) gives boundary-crossing ideas a fair chance of living intact in at least one chunk. Too much overlap wastes storage and returns near-duplicate results.

Is semantic chunking worth the complexity?

Sometimes. Semantic chunking splits where the meaning shifts rather than at a fixed token count, which can produce cleaner, more self-contained chunks. But it is slower, depends on the splitter making good judgments, and does not always beat a structure-aware recursive approach. Start with recursive splitting that respects paragraphs and headings. Only reach for semantic chunking if your retrieval eval shows recursive splitting is leaving quality on the table.