RAG & Retrieval

Retrieval-augmented generation, end to end: the pipeline that actually runs in production, how to chunk documents without breaking meaning, when to use vector, keyword, or hybrid retrieval, and when RAG beats long-context (and when it does not).

Retrieval-augmented generation has gone from a research idea in 2020 to the default architecture for "talk to my documents" applications. In 2026 it sits inside almost every production LLM stack that touches non-public data — customer support, internal documentation, research tools, knowledge bots. The basic shape is straightforward; the production version is not.

The basic pipeline

Conceptually, RAG is a small pipeline with one offline half and one online half.

Offline (runs once per document, then on updates):

Ingest. Pull documents from their source (filesystem, CMS, Drive, Confluence, database).
Parse and clean. Convert to text, strip boilerplate, normalize structure. PDFs and Office docs need real parsers (unstructured.io, Apache Tika, Mistral OCR for hard cases).
Chunk. Split each document into retrievable units, usually a few hundred tokens each. See Chunking Strategies, Explained.
Embed. Run each chunk through an embedding model to produce a fixed-dimensional vector. For the plain-English picture of what an embedding actually represents, Feynmanpedia's What Are Embeddings? covers the underlying intuition.
Store. Index the vectors in a vector database (or in a hybrid store that also keeps the original text for BM25).
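The offline half can be sketched in a few lines. This is a minimal illustration under loud assumptions: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, the "vector store" is a plain Python list, and `clean`, `chunk`, and `build_index` are hypothetical helper names, not any library's API.

```python
import hashlib
import math
import re
from dataclasses import dataclass

DIM = 64  # toy dimensionality; real embedding models use far more dimensions


@dataclass
class IndexedChunk:
    doc_id: str
    text: str
    vector: list[float]


def clean(raw: str) -> str:
    """Parse-and-clean stand-in: only collapses whitespace."""
    return re.sub(r"\s+", " ", raw).strip()


def chunk(text: str, max_words: int = 50) -> list[str]:
    """Naive fixed-size chunking by word count (placeholder for real chunking)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(text: str) -> list[float]:
    """Hashed bag-of-words vector, L2-normalized. NOT a real embedding model."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.sha256(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(docs: dict[str, str]) -> list[IndexedChunk]:
    """Ingest -> clean -> chunk -> embed -> store (an in-memory list here)."""
    index = []
    for doc_id, raw in docs.items():
        for piece in chunk(clean(raw)):
            index.append(IndexedChunk(doc_id, piece, toy_embed(piece)))
    return index


index = build_index({
    "faq": "Reset your password   from the account settings page.",
    "guide": "Invoices are emailed on the first day of each month.",
})
```

In a real system each function is replaced by something heavier (a document parser, a structure-aware chunker, an embedding API, a vector database), but the order of the steps stays the same.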

Online (runs per user query):

Query processing. Sometimes the query gets rewritten, expanded, or decomposed before retrieval.
Retrieve. Embed the (rewritten) query, run a similarity search, return the top-k chunks. Optionally combine with BM25 keyword search.
Re-rank. Pass the candidates through a cross-encoder or specialized re-ranker to reorder them by relevance to the query, so that fewer off-topic chunks reach the generation step.
Generate. Feed the top chunks plus the original query to the LLM, with an instruction to answer from the retrieved context.
Post-process. Citation tracking, faithfulness checks, logging.
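The retrieve and generate steps above might look roughly like this. It is a sketch, not a reference implementation: `embed` is a toy bag-of-words stand-in for an embedding model, the prompt wording is an assumption, and the final LLM call is left out because it depends on the provider.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Embed the query, score every chunk, return the top-k by similarity."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Number the chunks so the model can cite them, then append the question."""
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(context, start=1))
    return (
        "Answer using only the context below. Cite sources as [n]. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{numbered}\n\nQuestion: {query}"
    )


chunks = [
    "Invoices are emailed on the first day of each month.",
    "Reset your password from the account settings page.",
    "Refunds are processed within five business days.",
]
question = "How do I reset my password?"
top = retrieve(question, chunks, k=2)
prompt = build_prompt(question, top)  # this string is what gets sent to the LLM
```

Numbering the chunks in the prompt is one simple way to make citation tracking possible in the post-processing step.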

That is RAG. Every production stack is a more careful version of this shape.

What "production-grade" actually means

The gap between a demo that works on five documents and a system that handles a real corpus is mostly in five places:

Data quality. Most RAG failures trace back to bad ingestion: garbled PDF parsing, duplicate content, leftover navigation, stale data. The amount of effort production teams put into ingestion pipelines surprises people, but no amount of clever retrieval recovers from data that was wrong before it was embedded.
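One of the cheaper ingestion checks is dropping duplicate documents before they are embedded. The sketch below assumes documents arrive as plain strings and treats two documents as duplicates when they match after lowercasing and whitespace collapsing; `normalize` and `dedupe` are illustrative names.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedupe(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, comparing normalized content."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


docs = ["Hello  World", "hello world", "Something else"]
unique_docs = dedupe(docs)
```

This only catches exact duplicates after normalization. Near-duplicates, such as the same page with a different footer, need a fuzzier technique like MinHash.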

Chunking. Naive fixed-size chunks at 500 tokens with no overlap break sentences in half, lose document structure, and miss section boundaries. Production setups use structure-aware chunking (respecting headings, code blocks, tables) and often parent-child setups where the retrieved small chunk leads back to a larger parent for context.
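A parent-child setup can be sketched as follows. The assumptions are stated up front: headings are lines that start with `#`, chunk sizes are counted in words rather than tokens, and `split_sections` and `child_chunks` are hypothetical helpers. Each section becomes a parent, and small overlapping children are cut inside it so that no child crosses a section boundary.

```python
from dataclasses import dataclass


@dataclass
class Child:
    parent_id: int
    text: str


def split_sections(doc: str) -> list[str]:
    """Start a new section at every line beginning with '#' (assumed heading marker)."""
    sections, current = [], []
    for line in doc.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]


def child_chunks(parents: list[str], size: int = 8, overlap: int = 2) -> list[Child]:
    """Slide a word window over each parent; windows never cross section boundaries."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    children = []
    for pid, parent in enumerate(parents):
        words = parent.split()
        for start in range(0, len(words), step):
            children.append(Child(pid, " ".join(words[start:start + size])))
            if start + size >= len(words):
                break  # the window already reached the end of this parent
    return children


doc = """# Install
Run the installer and accept the defaults.
# Configure
Edit config.yaml and set the API key before the first start."""

parents = split_sections(doc)
children = child_chunks(parents)
# At query time: search over children, then pass parents[child.parent_id] to the LLM.
```

The children are what get embedded and searched; the parent is what gets placed in the prompt, which is how the small retrieved chunk leads back to its larger context.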

Retrieval intelligence. Pure vector search misses exact-term matches (error codes, API names, product SKUs). Pure keyword search misses paraphrase. Hybrid retrieval — vector + BM25, combined via reciprocal rank fusion — beats either alone on most realistic corpora.
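Reciprocal rank fusion itself is small enough to show in full. Each document's fused score is the sum of 1 / (k + rank) over every ranked list it appears in, so it needs only ranks, not comparable scores. The sketch below assumes each retriever returns a ranked list of document IDs; the IDs are made up, and k = 60 is the commonly used default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties fall back to the ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical vector-search ranking
bm25_hits = ["doc_c", "doc_a", "doc_d"]    # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
# fused == ["doc_a", "doc_c", "doc_b", "doc_d"]
```

Documents that both retrievers rank highly rise to the top, while a document found by only one retriever still survives lower in the list.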

Re-ranking. The top-k from your retriever is usually noisy. A re-ranker (Cohere Rerank, BGE rerankers, Voyage rerank) takes 20–50 candidates and reorders them by relevance to the query. It cannot recover documents the retriever missed, so what it buys is precision at the top of the list. The latency cost is real; the precision improvement is usually worth more.
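The two-stage shape is easy to see in code. In the sketch below the scoring function is pluggable: `overlap_score` is a deliberately crude stand-in (token overlap) for a trained cross-encoder or a hosted re-rank API, and `rerank` is a hypothetical helper rather than any vendor's interface.

```python
import re
from typing import Callable


def overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer: Jaccard overlap of tokens. A real re-ranker is a trained model."""
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    p = set(re.findall(r"[a-z0-9]+", passage.lower()))
    return len(q & p) / len(q | p) if q | p else 0.0


def rerank(query: str, candidates: list[str], top_n: int,
           score: Callable[[str, str], float] = overlap_score) -> list[str]:
    """Score every (query, candidate) pair, then keep the best top_n."""
    ordered = sorted(candidates, key=lambda c: score(query, c), reverse=True)
    return ordered[:top_n]


candidates = [
    "Billing overview and payment methods.",
    "Error E1042 means the API key has expired.",
    "How to rotate an API key.",
]
best = rerank("what does error E1042 mean", candidates, top_n=1)
```

The scorer sees the query and the passage together, which is what distinguishes re-ranking from the first-stage retriever, where query and documents are embedded independently.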

Evaluation. Without a measured eval set you cannot tell whether changes help or hurt. The Evals cluster on this site covers this in detail; for RAG specifically, Ragas provides metrics targeted at retrieval quality (context precision, context recall) and answer faithfulness.
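Even before adopting a framework, a golden set and two simple numbers go a long way. The sketch below computes ID-based precision@k and recall@k for a retriever. These are the plain information-retrieval definitions and are not claimed to match how Ragas computes its context precision and context recall; the golden set, document IDs, and `evaluate` helper are all made up for illustration.

```python
from typing import Callable


def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    """Precision and recall of the top-k retrieved IDs against the known relevant IDs."""
    top = retrieved[:k]
    hits = len(set(top) & relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def evaluate(golden: list[dict], retriever: Callable[[str], list[str]], k: int = 3) -> dict[str, float]:
    """Average precision@k and recall@k over every query in the golden set."""
    if not golden:
        return {"precision@k": 0.0, "recall@k": 0.0}
    pairs = [precision_recall_at_k(retriever(g["query"]), g["relevant"], k) for g in golden]
    return {
        "precision@k": sum(p for p, _ in pairs) / len(pairs),
        "recall@k": sum(r for _, r in pairs) / len(pairs),
    }


golden = [
    {"query": "reset password", "relevant": {"d1", "d2"}},
    {"query": "refund policy", "relevant": {"d7"}},
]
fake_results = {  # stands in for a real retriever's output
    "reset password": ["d1", "d9", "d2"],
    "refund policy": ["d3", "d4", "d5"],
}
report = evaluate(golden, lambda q: fake_results[q], k=3)
```

Running the same golden set before and after a change to chunking or retrieval is what turns "it feels better" into a measured difference.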

The standard 2026 stack

There is no single "right" stack, but the choices have converged enough that most production setups look like a variant of:

Embeddings: OpenAI text-embedding-3-large, Cohere embed-v3, Voyage voyage-3, or open-source BGE-large / E5-large via a self-hosted runner. For specialized corpora, fine-tuned embeddings outperform generic ones.
Vector store: Pinecone, Weaviate, Qdrant, or pgvector on Postgres. The right choice depends mostly on scale, hosting preference, and whether you need hybrid search built in.
Retrieval + generation orchestration: LangChain, LlamaIndex, or in-house code. Frameworks help with the wiring; they are not magic.
Re-ranker: Cohere Rerank, BGE-reranker-large, or Voyage Rerank, typically called on the top 25–50 candidates.
Generation model: whichever frontier or open-weight model fits your cost/capability target.
Evaluation: Ragas plus your golden set; observability via Phoenix, Braintrust, or LangSmith.

Where this fits on the site

This cluster goes deeper on the parts that matter:

Chunking Strategies, Explained — fixed-size, semantic, hierarchical, and document-type-aware chunking.
Hybrid vs. Vector vs. Keyword Retrieval — what each retrieval mode is good at and how hybrid works.
When RAG Beats Long-Context — the cost/quality tradeoff against million-token context windows.

If you are also doing context engineering work, the Context Engineering cluster pairs naturally with this one — RAG is a specific shape of context construction, and most RAG quality problems are context-engineering problems in disguise.

Forthcoming

Embedding Models Compared
Reranking Explained
Evaluating RAG Quality
