Subject
Retrieval-augmented generation, end to end: the pipeline that actually fires in production, how to chunk documents without breaking meaning, when to use vector vs keyword vs hybrid retrieval, and when RAG beats long-context (and when it does not).
Retrieval-augmented generation has gone from a research idea in 2020 to the default architecture for "talk to my documents" applications. In 2026 it sits inside almost every production LLM stack that touches non-public data — customer support, internal documentation, research tools, knowledge bots. The basic shape is straightforward; the production version is not.
Conceptually, RAG is a small pipeline with one offline half and one online half.
Offline (runs once per document, then on updates):
Online (runs per user query):
That is RAG. Every production stack is a more careful version of this shape.
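As a rough illustration of that two-half shape, here is a minimal, dependency-free sketch. Everything in it is an assumption made for illustration: `embed` is a toy bag-of-words stand-in for a real embedding model, the "index" is a plain list, and paragraph splitting stands in for real chunking. The function names are hypothetical and do not come from any library.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents: list[str]) -> list[tuple[str, Counter]]:
    """Offline half: split into chunks (one per paragraph here), embed, store."""
    chunks = [p.strip() for doc in documents for p in doc.split("\n\n") if p.strip()]
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Online half, part 1: embed the query and take the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query: str, contexts: list[str]) -> str:
    """Online half, part 2: place the retrieved chunks in front of the question."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {query}"
```

The prompt returned by `build_prompt` is what would be sent to the model. In a real system each function would be replaced by a proper component (parser and chunker, embedding model, vector store, LLM call), but the order of the calls stays the same.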
The gap between a demo that works on five documents and a system that handles a real corpus is mostly in five places:
Data quality. Most RAG failures trace back to bad ingestion: garbled PDF parsing, duplicate content, leftover navigation text, and stale data. The amount of effort production teams put into ingestion pipelines surprises people, but no amount of clever retrieval recovers from data that was wrong before it was embedded.
Chunking. Naive fixed-size chunks at 500 tokens with no overlap break sentences in half, lose document structure, and miss section boundaries. Production setups use structure-aware chunking (respecting headings, code blocks, tables) and often parent-child setups where the retrieved small chunk leads back to a larger parent for context.
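A minimal sketch of structure-aware, parent-child chunking, assuming the input is Markdown. Sections are split at headings, each section becomes a parent, and paragraphs are packed into small child chunks that remember their parent. The names (`Chunk`, `split_sections`, `child_chunks`) are hypothetical, word counts are a rough proxy for tokens, and this simplified version does not protect `#` lines inside fenced code blocks.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str
    parent_id: str  # id of the section this chunk was cut from
    heading: str
    text: str


def split_sections(markdown: str) -> list[tuple[str, str]]:
    """Split on Markdown headings so a chunk never straddles two sections."""
    sections, heading, lines = [], "", []
    for line in markdown.splitlines():
        if re.match(r"#{1,6}\s", line):
            if lines or heading:
                sections.append((heading, "\n".join(lines).strip()))
            heading, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    sections.append((heading, "\n".join(lines).strip()))
    return [(h, body) for h, body in sections if body]


def child_chunks(markdown: str, max_words: int = 40) -> tuple[dict[str, str], list[Chunk]]:
    """Return (parents, children): parents map section id to full section text."""
    parents: dict[str, str] = {}
    children: list[Chunk] = []
    for s_idx, (heading, body) in enumerate(split_sections(markdown)):
        parent_id = f"sec-{s_idx}"
        parents[parent_id] = body
        current: list[str] = []
        count, c_idx = 0, 0
        for para in re.split(r"\n\s*\n", body):
            words = len(para.split())
            # Close the current child before it would exceed the word budget.
            if current and count + words > max_words:
                children.append(Chunk(f"{parent_id}-{c_idx}", parent_id, heading, "\n\n".join(current)))
                c_idx += 1
                current, count = [], 0
            current.append(para)
            count += words
        if current:
            children.append(Chunk(f"{parent_id}-{c_idx}", parent_id, heading, "\n\n".join(current)))
    return parents, children
```

At query time the small children are what get embedded and searched. Once a child is retrieved, `parents[child.parent_id]` supplies the larger surrounding section to pass to the model.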
Retrieval intelligence. Pure vector search misses exact-term matches (error codes, API names, product SKUs). Pure keyword search misses paraphrase. Hybrid retrieval — vector + BM25, combined via reciprocal rank fusion — beats either alone on most realistic corpora.
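Reciprocal rank fusion itself is only a few lines. The sketch below assumes each retriever has already returned an ordered list of document ids; the function name and sample ids are illustrative. Each document scores 1 / (k + rank) per list it appears in, and the scores are summed. The constant `k = 60` is a commonly used default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked id lists into one list sorted by summed 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical results from a vector retriever and a BM25 retriever.
vector_hits = ["d1", "d2", "d3"]
bm25_hits = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Because the fusion uses only ranks, not raw scores, it needs no calibration between the cosine similarities of the vector retriever and the BM25 scores of the keyword retriever. Documents that both retrievers rank well rise to the top.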
Re-ranking. The top-k from your retriever is usually noisy. A re-ranker (Cohere Rerank, BGE rerankers, Voyage rerank) takes 20–50 candidates, scores each one directly against the query, and reorders them so the most relevant passages come first. The latency cost is real; the improvement in what finally reaches the model is usually larger.
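The retrieve-then-rerank step can be sketched with a pluggable scoring function. Here `overlap_score` is a deliberately crude lexical stand-in, assumed only for illustration; in a real system `score_fn` would wrap a cross-encoder model or a hosted rerank endpoint. All names are hypothetical.

```python
import re
from typing import Callable


def overlap_score(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query terms that appear in the passage."""
    q_terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    p_terms = set(re.findall(r"[a-z0-9]+", passage.lower()))
    return len(q_terms & p_terms) / len(q_terms) if q_terms else 0.0


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float] = overlap_score,
    top_n: int = 5,
) -> list[str]:
    """Score every candidate against the query and keep the top_n best."""
    scored = [(score_fn(query, c), i, c) for i, c in enumerate(candidates)]
    # Sort by score descending; the original position breaks ties.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [c for _, _, c in scored[:top_n]]
```

The usual pattern is to retrieve generously (the 20–50 candidates mentioned above), rerank, and pass only the few best passages into the prompt.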
Evaluation. Without a measured eval set you cannot tell whether changes help or hurt. The Evals cluster on this site covers this in detail; for RAG specifically, Ragas provides metrics targeted at retrieval quality (context precision, context recall) and answer faithfulness.
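Even before adopting a framework, a small labeled eval set gives a usable signal. The sketch below computes plain set-based precision@k and recall@k over hand-labeled relevant chunk ids. These are simplified stand-ins, not the Ragas metric definitions, and the eval-set layout and function names are assumptions made for illustration.

```python
from typing import Callable


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-k retrieved ids that are labeled relevant."""
    top = retrieved[:k]
    return sum(1 for d in top if d in relevant) / len(top) if top else 0.0


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the labeled relevant ids that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def evaluate(
    eval_set: list[dict],
    retrieve_fn: Callable[[str], list[str]],
    k: int = 5,
) -> dict[str, float]:
    """Average both metrics over every labeled query in the eval set."""
    if not eval_set:
        return {"precision": 0.0, "recall": 0.0}
    precisions, recalls = [], []
    for example in eval_set:
        retrieved = retrieve_fn(example["query"])
        precisions.append(precision_at_k(retrieved, example["relevant"], k))
        recalls.append(recall_at_k(retrieved, example["relevant"], k))
    return {
        "precision": sum(precisions) / len(precisions),
        "recall": sum(recalls) / len(recalls),
    }
```

Running `evaluate` before and after a change to chunking, retrieval, or re-ranking, with the same eval set and the same `k`, shows whether the change helped or hurt.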
There is no single "right" stack, but the choices have converged enough that most production setups look like a variant of:
Embeddings: text-embedding-3-large, Cohere embed-v3, Voyage voyage-3, or open-source BGE-large / E5-large via a self-hosted runner. For specialized corpora, fine-tuned embeddings outperform generic ones.
This cluster goes deeper on the parts that matter:
If you are also doing context engineering work, the Context Engineering cluster pairs naturally with this one — RAG is a specific shape of context construction, and most RAG quality problems are context-engineering problems in disguise.
A short editorial reading list. Pick whichever fits how you like to learn.