As of 2026-05-28
There is no best embedding model, only the best one for your data and your constraints. The good news is that the decision breaks into five concrete axes, and four of them you can reason about before you run a single test.
1. Dimensions: quality versus cost
The dimension is how many numbers each vector holds. Common values run from 384 up to a few thousand. Higher dimensions can encode finer distinctions, but every extra dimension is extra bytes to store and extra math per similarity comparison. At a million vectors stored as 32-bit floats, 384 dimensions is about 1.5 GB of raw vectors and 1,536 dimensions is about 6 GB (a quarter of that if you quantize to 8-bit integers), and each similarity comparison costs roughly proportionally more.
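The storage figure depends on how many bytes each value takes, so it is worth doing the arithmetic for your own setup. A minimal sketch, assuming raw vectors only (a real index adds overhead on top) and a hypothetical helper name `raw_vector_bytes`:

```python
# Bytes per stored value for a few common encodings.
BYTES_PER_VALUE = {"float32": 4, "float16": 2, "int8": 1}


def raw_vector_bytes(num_vectors: int, dims: int, dtype: str = "float32") -> int:
    """Bytes needed for the vectors themselves, ignoring index overhead."""
    return num_vectors * dims * BYTES_PER_VALUE[dtype]


if __name__ == "__main__":
    # Compare a few dimension counts at one million vectors.
    for dims in (384, 768, 1536):
        for dtype in ("float32", "int8"):
            gb = raw_vector_bytes(1_000_000, dims, dtype) / 1e9
            print(f"{dims:>5} dims, {dtype:>7}: {gb:.2f} GB")
```

At one byte per value, a million 384-dimension vectors take about 384 MB. At four bytes per value they take four times that.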
The quality curve flattens, too. Doubling dimensions does not double retrieval quality. Some newer models are trained so you can truncate the vector (keep the first 256 of 1,024 dimensions) and lose only a little accuracy, which lets you tune the tradeoff per deployment. Start at the smaller end and only go bigger if your eval shows it helps.
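Truncation itself is simple: keep the leading dimensions and rescale to unit length so cosine or dot-product scores stay comparable. A minimal sketch, assuming the model was trained to support truncation (often described as Matryoshka-style training) and using a hypothetical helper name; on a model not trained this way, expect a larger quality drop:

```python
import math


def truncate_and_normalize(vec: list[float], keep: int) -> list[float]:
    """Keep the first `keep` dimensions and rescale the result to unit length."""
    head = vec[:keep]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all zeros: nothing to rescale
    return [x / norm for x in head]


if __name__ == "__main__":
    full = [3.0, 4.0, 12.0, 0.5]  # toy vector, not real model output
    print(truncate_and_normalize(full, keep=2))
```

Truncate the corpus and the queries to the same length, then rerun your eval at each size to see where quality starts to drop.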
2. API versus open-weight or local
Hosted embedding APIs (from the major model providers) are the path of least resistance: one HTTP call, no GPU, nothing to host. You pay per token, your text leaves your machine, and you depend on the provider's uptime and rate limits.
Open-weight models you run yourself (via libraries like sentence-transformers, or a local server) cost nothing per call after the hardware, keep data in-house, and can be fine-tuned on your domain. The cost is operational: you host it, you scale it, you patch it. For privacy-sensitive data or very high volume, local usually wins. For a prototype or modest volume, an API is faster to ship.
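The volume question can be put in numbers with a rough break-even calculation. A minimal sketch, assuming a flat per-token API price and a fixed monthly cost for the local setup; both function names are hypothetical and the prices below are placeholders, not real quotes:

```python
def monthly_api_cost(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """API spend for one month at a flat per-token price."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def break_even_tokens(local_usd_per_month: float, usd_per_million_tokens: float) -> float:
    """Monthly token volume above which the fixed-cost local setup is cheaper."""
    return local_usd_per_month / usd_per_million_tokens * 1_000_000


if __name__ == "__main__":
    # Placeholder prices: substitute your provider's rate and your hardware cost.
    api_price = 0.5      # USD per million tokens (made up)
    local_cost = 200.0   # USD per month for hardware and upkeep (made up)
    print("break-even tokens per month:", break_even_tokens(local_cost, api_price))
```

This ignores engineering time, which is often the larger cost of self-hosting, so treat the result as a lower bound on the volume where local starts to pay off.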
3. The MTEB leaderboard, as a starting point only
The MTEB leaderboard (Massive Text Embedding Benchmark) ranks embedding models across dozens of retrieval, clustering, and classification tasks. It is the right place to build a shortlist and to filter by model size and dimension.
It is the wrong place to make the final call. MTEB scores are averages over public datasets that are almost certainly not your data. A model tuned to climb the leaderboard can underperform on your specific documents. Treat the top five to ten as candidates, not as a ranking you inherit.
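Building the shortlist amounts to filtering by your constraints first and sorting by score second. A minimal sketch, assuming you have copied a few leaderboard rows by hand into a list; the `Candidate` fields, the function name, and every value in the example are placeholders, not real leaderboard data:

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    avg_score: float       # leaderboard average, copied by hand
    params_millions: int   # model size
    dims: int              # embedding dimension


def build_shortlist(candidates, max_params_millions, max_dims, top_n=5):
    """Drop models that break the size or dimension budget, then keep the top scorers."""
    fits = [
        c for c in candidates
        if c.params_millions <= max_params_millions and c.dims <= max_dims
    ]
    fits.sort(key=lambda c: c.avg_score, reverse=True)
    return fits[:top_n]


if __name__ == "__main__":
    rows = [  # made-up entries for illustration only
        Candidate("model-a", 60.0, 7000, 4096),
        Candidate("model-b", 58.0, 300, 1024),
        Candidate("model-c", 55.0, 30, 384),
    ]
    for c in build_shortlist(rows, max_params_millions=500, max_dims=1024):
        print(c.name, c.avg_score)
```

The output is a list of candidates to test on your own data, not a ranking to ship from.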
4. Domain match and sequence length
Two practical filters narrow the shortlist fast:
- Domain match. General-purpose models blur fine distinctions in specialized fields. If your corpus is legal contracts, clinical notes, or code, a model trained or tuned on that kind of text usually beats a stronger general model. Check what the model was trained on.
- Max sequence length. Every embedding model has a token limit on its input. If you feed it more, most libraries truncate, silently dropping the tail of your chunk. Match the model's max length to the chunk size you chose when chunking. A model with a 512-token limit cannot honestly embed a 1,000-token chunk.
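Because truncation is usually silent, it helps to check chunk lengths before embedding. A minimal sketch with hypothetical helper names; the whitespace counter is a crude stand-in, since real tokenizers usually produce more tokens than words, so pass the model's own tokenizer as `count_tokens` when you have it:

```python
from typing import Callable, Sequence


def whitespace_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def chunks_over_limit(
    chunks: Sequence[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> list[int]:
    """Return the indices of chunks whose token count exceeds the model's limit."""
    return [i for i, chunk in enumerate(chunks) if count_tokens(chunk) > max_tokens]


if __name__ == "__main__":
    chunks = ["short chunk", "word " * 600]
    print("chunks that would be truncated:", chunks_over_limit(chunks, max_tokens=512))
```

Any index this returns is a chunk whose tail the model would never see.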
5. Run your own eval
This is the step people skip and regret. Build a small evaluation set: 30 to 50 real queries, each paired with the document or chunk that should be the top result. For each candidate model, embed your corpus and your queries, run the search, and measure how often the right answer lands in the top k (recall@k) or how high it ranks (MRR).
# sketch: for each model, score retrieval on your own pairs
# embed() and search() stand in for your embedding call and your vector search
for model in shortlist:
    corpus_vecs = embed(model, corpus)  # embed the corpus once per model
    hits = 0
    for query, gold_id in eval_pairs:
        top_ids = search(embed(model, query), corpus_vecs, k=5)
        if gold_id in top_ids:
            hits += 1
    print(model, "recall@5 =", hits / len(eval_pairs))
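The sketch above leaves `search` undefined and only reports recall@k. One possible way to fill both gaps, using brute-force cosine similarity in plain Python, follows. It assumes the corpus is a dict from document id to vector and that every gold id is present in the corpus. The names and the toy vectors are illustrative, and at real corpus sizes you would use a vector library or index instead:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(query_vec: list[float], corpus_vecs: dict[str, list[float]], k: int = 5) -> list[str]:
    """Brute-force search: ids of the k most similar documents, best first."""
    ranked = sorted(
        corpus_vecs,
        key=lambda doc_id: cosine(query_vec, corpus_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


def evaluate(query_vecs, gold_ids, corpus_vecs, k: int = 5) -> tuple[float, float]:
    """Return (recall@k, MRR) over paired query vectors and gold document ids."""
    hits = 0
    reciprocal_rank_sum = 0.0
    for query_vec, gold_id in zip(query_vecs, gold_ids):
        ranked = search(query_vec, corpus_vecs, k=len(corpus_vecs))
        rank = ranked.index(gold_id) + 1  # 1-based; raises if gold_id is missing
        if rank <= k:
            hits += 1
        reciprocal_rank_sum += 1.0 / rank
    n = len(gold_ids)
    return hits / n, reciprocal_rank_sum / n


if __name__ == "__main__":
    # Toy vectors for illustration, not real embeddings.
    corpus = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    queries = [[1.0, 0.1], [0.1, 1.0]]
    gold = ["a", "c"]
    recall, mrr = evaluate(queries, gold, corpus, k=1)
    print("recall@1 =", recall, "MRR =", mrr)
```

Reporting both numbers is useful because recall@k only says whether the right chunk made the cut, while MRR also rewards ranking it first.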
Half a day of this beats a week of arguing about leaderboard ranks. The model that wins on your pairs is the model you ship.