Vector search dominated the 2023 RAG conversation, partly because vector databases were a new and shiny thing to sell. By 2026 the picture is more honest: vector search is one retrieval mode among three (keyword, vector, and hybrid), and the production answer for most real corpora is hybrid retrieval with a re-ranker on top.

What each mode is doing

Keyword (BM25) retrieval scores documents by which query terms they contain, weighted by term frequency, inverse document frequency, and document length. It is the classic information-retrieval algorithm: well understood, fast, and the scoring function behind Lucene, Elasticsearch, and most full-text search systems. Standard reference: Manning, Raghavan, Schütze, Introduction to Information Retrieval, ch. 11.
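To make the three weighting factors concrete, here is a minimal BM25 sketch over pre-tokenized documents. It assumes the Lucene-style IDF variant and the common defaults k1 = 1.2 and b = 0.75. The function name `bm25_scores` and the toy documents are illustrative only, and a real engine would use an inverted index rather than scanning every document.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            # Lucene-style IDF; the +1 inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus (made up for illustration).
docs = [
    "error 504 returned by the payments gateway".split(),
    "how to issue a refund to a customer".split(),
    "refund fails with error 504 on the payments endpoint".split(),
]
scores = bm25_scores("error 504 refund".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Here the third document ranks first because it matches all three query terms, even though it is the longest and so pays the largest length penalty.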

Vector (dense) retrieval scores documents by the cosine similarity (or dot product) of an embedding of the document against an embedding of the query. The embedding model has been trained to put semantically related text close together in the vector space, so the retrieval is by meaning rather than by literal word overlap.
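The scoring step can be sketched as a brute-force cosine search. The three-dimensional vectors below are hand-written stand-ins for real embedding-model output, and `dense_search` is a hypothetical helper name. A production system would get vectors from an embedding model and use an approximate nearest-neighbour index instead of scanning every document.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def dense_search(query_vec, doc_vecs, top_n=2):
    """Return (doc_id, score) pairs sorted by cosine similarity, best first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Toy vectors (assumed, not produced by any real model).
doc_vecs = {
    "page-load-guide": [0.9, 0.1, 0.0],
    "refund-policy": [0.0, 0.2, 0.9],
    "caching-howto": [0.7, 0.6, 0.1],
}
query_vec = [0.8, 0.3, 0.0]  # pretend embedding of "How do I speed up my website?"
results = dense_search(query_vec, doc_vecs)
```

The query shares no literal words with any document id or text here; the ranking comes entirely from vector geometry, which is the point of dense retrieval.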

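Hybrid retrieval runs both, then merges the result lists. The standard merger is reciprocal rank fusion (RRF): for each document, sum 1/(k + rank) across all retrievers (k usually 60), where rank is the document's 1-based position in that retriever's list and a document missing from a list contributes nothing, then sort by the fused score.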
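A minimal sketch of RRF as described above, assuming each retriever hands back a list of document ids ordered best first. The function name and the document ids are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several best-first rankings of doc ids into one scored ranking."""
    fused = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks are 1-based; documents absent from a list add nothing.
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

bm25_ranking = ["doc_codes", "doc_both", "doc_misc"]
vector_ranking = ["doc_refunds", "doc_both", "doc_other"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

In this toy input, `doc_both` is second in each list but first after fusion, because it is the only document both retrievers agree on. RRF uses ranks only, so it needs no score normalization across retrievers.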

When each wins

BM25 wins when:

  • The corpus uses precise vocabulary (legal, medical, technical) and queries reuse that vocabulary.
  • Rare terms carry meaning (error codes, API names, drug names, product SKUs). BM25's IDF weighting pumps these up; vector embeddings often treat them as noise.
  • You want explainable retrieval. BM25 results are trivially interpretable ("this document contains your query terms"); embedding similarities are not.
  • The corpus is small or recently changed and you have not had time to choose and tune an embedding model for it.

Vector search wins when:

  • Users phrase queries differently than the documents are worded. ("How do I speed up my website?" vs documentation that talks about "optimizing page load time.")
  • The corpus is large and well-curated, and you have a good embedding model.
  • Conceptual matching matters more than exact terms.
  • The application is conversational and user queries vary widely in phrasing.

Hybrid wins on most realistic corpora because real queries mix both. A support question might be "I'm getting error 504 on /v2/payments/create after a refund" — the error codes and endpoints are BM25 territory, the surrounding phrasing is vector territory. BM25 finds documents with the codes; vector finds documents about refunds; hybrid finds documents about refunds that mention the codes.

The Weaviate hybrid docs, Qdrant hybrid examples, and LlamaIndex hybrid retrievers all use this same shape with RRF or a configurable alpha weighting.
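For comparison with RRF, here is a rough sketch of alpha-weighted fusion. It assumes min-max normalization of each retriever's raw scores and the convention that alpha = 1 means pure vector and alpha = 0 means pure keyword. Libraries differ in how they normalize and which way alpha points, so treat both as assumptions of this sketch rather than any particular product's behavior. All names and scores are invented.

```python
def min_max(scores):
    """Rescale a {doc_id: score} map to the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def alpha_fusion(keyword_scores, vector_scores, alpha=0.5):
    """Blend normalized scores: alpha weights vector, (1 - alpha) weights keyword."""
    kw, vec = min_max(keyword_scores), min_max(vector_scores)
    fused = {
        # A document missing from one retriever gets 0 from that side.
        doc_id: alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in kw.keys() | vec.keys()
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Raw scores on different scales (BM25 is unbounded, cosine is at most 1).
keyword_scores = {"doc_codes": 7.2, "doc_both": 5.1, "doc_misc": 1.3}
vector_scores = {"doc_refunds": 0.83, "doc_both": 0.79, "doc_codes": 0.41}
balanced = alpha_fusion(keyword_scores, vector_scores, alpha=0.5)
keyword_only = alpha_fusion(keyword_scores, vector_scores, alpha=0.0)
```

With alpha = 0.5 the document that scores well on both sides comes out on top; with alpha = 0.0 the ranking collapses to the keyword ordering. Unlike RRF, this approach is sensitive to how raw scores are distributed, which is why the normalization step matters.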

A concrete hybrid setup

In production, the most common pattern is:

  1. Run both retrievers in parallel. Each returns its top-N (often 50–100 candidates).
  2. Merge via RRF. Produces a single ranked list of candidates.
  3. Re-rank the top 25–50. A cross-encoder re-ranker (Cohere Rerank, BGE-reranker-large, Voyage Rerank) scores query-document pairs more carefully than embeddings can. Slower per pair, but only applied to a small set.
  4. Pass the final top-k to the model. Usually 3–8 chunks.
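The four steps above can be sketched as one function. Everything passed in is a placeholder: `keyword_search`, `vector_search`, and `rerank_score` are hypothetical callables you would back with your own search engine, vector store, and re-ranker. The toy versions below exist only so the sketch runs without external services.

```python
from concurrent.futures import ThreadPoolExecutor

def rrf(ranked_lists, k=60):
    """Reciprocal rank fusion over best-first lists of doc ids."""
    fused = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

def hybrid_retrieve(query, keyword_search, vector_search, rerank_score,
                    top_n=50, rerank_n=25, top_k=5):
    # Step 1: run both retrievers in parallel; each returns ranked doc ids.
    with ThreadPoolExecutor(max_workers=2) as pool:
        kw_future = pool.submit(keyword_search, query, top_n)
        vec_future = pool.submit(vector_search, query, top_n)
        rankings = [kw_future.result(), vec_future.result()]
    # Step 2: merge into a single candidate list with RRF.
    candidates = rrf(rankings)
    # Step 3: re-rank only the head of the fused list.
    head = candidates[:rerank_n]
    reranked = sorted(head, key=lambda doc_id: rerank_score(query, doc_id),
                      reverse=True)
    # Step 4: keep the final top-k for the model.
    return reranked[:top_k]

# Toy stand-ins (hypothetical), not real retrievers or a real re-ranker.
DOCS = {
    "a": "error 504 on payments create",
    "b": "refund policy overview",
    "c": "refund fails with error 504 on payments create",
    "d": "speeding up page load time",
}

def toy_keyword_search(query, n):
    terms = set(query.lower().split())
    hits = [(len(terms & set(text.split())), doc_id) for doc_id, text in DOCS.items()]
    return [doc_id for overlap, doc_id in sorted(hits, reverse=True) if overlap][:n]

def toy_vector_search(query, n):
    return ["b", "c", "a"][:n]  # pretend nearest neighbours for a refund question

def toy_rerank_score(query, doc_id):
    # Word overlap as a crude proxy for a cross-encoder relevance score.
    return len(set(query.lower().split()) & set(DOCS[doc_id].split()))

top = hybrid_retrieve("error 504 after refund", toy_keyword_search,
                      toy_vector_search, toy_rerank_score, top_k=2)
```

The design point is the funnel shape: `top_n` is wide and cheap, `rerank_n` bounds the expensive step, and `top_k` is what the model actually reads. Threads are enough here because both retrievers are typically I/O-bound network calls.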

Latency adds up, but most of the cost is in the re-ranker, and that step also delivers the largest quality lift per millisecond. A reasonable budget: 100ms hybrid retrieve + 300ms re-rank + the LLM call.

Re-ranking, briefly

A re-ranker is a model that takes a query and a candidate document together and produces a relevance score. Because it sees the pair jointly (unlike an embedding model, which sees them separately), it can score relevance much more sharply.

Three options you will see:

  • Cohere Rerank — hosted API, multilingual, low latency. The default commercial choice.
  • BGE rerankers — open-source, free to self-host, competitive quality.
  • Voyage Rerank — hosted, specialized variants for different domains.

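The latency penalty is typically 100–400ms for 25–50 candidates. The gain in top-k precision is usually large enough to be worth it: re-ranking cannot add documents the retrievers missed, but it moves the relevant ones into the few slots the model actually sees. Improvements in answer quality after adding re-ranking are commonly reported in the 10–30% range, with diminishing returns past that.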

What to skip

  • Pure vector search with no re-ranking on a corpus larger than a few thousand documents. You are leaving quality on the table.
  • Hybrid without a fusion strategy. Don't merge by just concatenating result lists; use RRF or weighted-alpha fusion.
  • Re-ranking the entire result set. Re-rankers are slow per pair. Retrieve broadly, re-rank narrowly.
  • Buying a vector database when you have 5,000 chunks. Postgres + pgvector handles this fine. Scale up the storage layer when retrieval latency or filtering complexity demands it.

The honest 2026 default

For a new RAG project on a non-trivial corpus, the default stack is:

  • Hybrid retrieval (vector + BM25) merged via RRF.
  • Re-rank the top 25–50 candidates with Cohere or BGE.
  • Pass the final top 3–8 to the model.

This is more complex than "just use vector search," and the complexity earns its keep. The floor on RAG quality rises substantially when you move from "pure vector + no rerank" to "hybrid + rerank," even with the same embeddings and the same generation model.