As of 2026-05-19
There are now enough LLM eval frameworks that picking one feels like its own evaluation problem. The good news: they cluster cleanly by shape and by what they optimize for. Pick the cluster that matches your workflow first; pick a specific tool within it after.
The shape clusters
Most production teams pick one tool from each of two clusters:
- Eval runner — the thing that runs your tests. CLI-first (lives in your repo, runs in CI) or SaaS (lives behind a UI, with hosted state).
- Observability — the thing that captures production traces. Often the same product as the eval runner, sometimes a separate observability tool that feeds traces into the eval runner.
A brief view of each tool follows.
Inspect — AI Safety Institute (UK)
What it is. An open-source AI evaluation platform from the UK AI Security Institute (formerly the AI Safety Institute). Designed for frontier-model safety and capability evals: jailbreakability, autonomy, harmful content, dangerous-capability assessments.
Best at. Structured evaluation pipelines with governance and safety emphasis. Suitable for regulators, labs, and teams running risk-focused evaluations rather than routine product QA.
Shape. Open-source Python package + web app, self-hosted, no SaaS.
Primary docs. inspect.aisi.org.uk and github.com/UKGovernmentBEIS/inspect_ai.
Promptfoo
What it is. An open-source CLI and SDK for prompt testing. Config-driven (YAML/JSON), heavily CI-focused.
Best at. Prompt regression testing with rich assertion types (exact match, contains, regex, model-graded, JSON schema, etc.). Excellent for "treat prompts like code" workflows.
Shape. Open-source CLI + library; optional hosted dashboard. Runs locally, in CI, anywhere.
Primary docs. promptfoo.dev/docs.
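To make the assertion types above concrete, here is a minimal sketch in plain Python of what exact-match, contains, regex, and JSON-shape checks do to a model output. This is not Promptfoo's API or config format; the helper names (equals, contains, matches, is_json_with_keys, run_case) and the sample output are assumptions made up for illustration.

```python
import json
import re
from typing import Callable

# Hypothetical assertion helpers; names are illustrative, not Promptfoo's API.
def equals(expected: str) -> Callable[[str], bool]:
    return lambda output: output.strip() == expected

def contains(needle: str) -> Callable[[str], bool]:
    return lambda output: needle in output

def matches(pattern: str) -> Callable[[str], bool]:
    return lambda output: re.search(pattern, output) is not None

def is_json_with_keys(*keys: str) -> Callable[[str], bool]:
    def check(output: str) -> bool:
        try:
            data = json.loads(output)
        except json.JSONDecodeError:
            return False
        return isinstance(data, dict) and all(k in data for k in keys)
    return check

def run_case(output: str, assertions: dict[str, Callable[[str], bool]]) -> dict[str, bool]:
    """Apply every named assertion to one model output."""
    return {name: check(output) for name, check in assertions.items()}

# A canned string stands in for a real model call.
sample_output = '{"intent": "refund", "order_id": "A-1042"}'
results = run_case(sample_output, {
    "is_json": is_json_with_keys("intent", "order_id"),
    "mentions_refund": contains("refund"),
    "order_id_format": matches(r"A-\d{4}"),
})
print(results)
```

In a real setup the assertions would live in the tool's config file and the output would come from the model under test; the point here is only that each assertion is a cheap, deterministic predicate that can fail a CI run.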
Braintrust
What it is. A data-centric eval and observability platform. Strong dataset management, scoring functions, and experiment tracking. Started in the open-source orbit and is now a hosted product with a significant free tier.
Best at. Production teams that want a hosted UI for managing eval datasets and comparing prompt versions, with first-class CI integration.
Shape. SaaS + open-source SDK; self-hosted option exists.
Primary docs. braintrust.dev/docs.
LangSmith — LangChain
What it is. LangChain's production platform: tracing, datasets, experiments, evals. Tight integration with the LangChain library, but works without it.
Best at. Teams already using LangChain or LangGraph who want production traces, dataset management, and prompt-version experiments in one place.
Shape. SaaS + self-hosted option.
Primary docs. docs.smith.langchain.com.
DeepEval — Confident AI
What it is. Open-source Python SDK that markets itself as "pytest for LLMs." Defines tests as Python functions; offers a library of metrics (faithfulness, toxicity, bias, summarization, RAG).
Best at. Teams that want eval inside their normal Python testing flow, no separate CLI, no SaaS account required.
Shape. Open-source Python package; companion hosted platform (Confident AI) for trace storage.
Primary docs. deepeval.com and github.com/confident-ai/deepeval.
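The "tests as Python functions" idea can be sketched without any eval library at all. The example below is a rough illustration of the pattern, not DeepEval's API: generate_answer is a hypothetical stand-in for the application under test, and keyword_coverage is a deliberately crude, made-up metric. A function named test_* like this one would be collected by pytest as-is.

```python
# Hypothetical stand-in for the application under test; replace with a real model call.
def generate_answer(question: str) -> str:
    canned = {
        "What is the refund window?": "You can request a refund within 30 days of purchase.",
    }
    return canned.get(question, "")

def keyword_coverage(answer: str, required: list[str]) -> float:
    """Fraction of required keywords that appear in the answer (case-insensitive)."""
    if not required:
        return 1.0
    text = answer.lower()
    hits = sum(1 for kw in required if kw.lower() in text)
    return hits / len(required)

def test_refund_answer_mentions_policy_terms():
    # The test fails if the metric falls below a threshold chosen for this case.
    answer = generate_answer("What is the refund window?")
    score = keyword_coverage(answer, ["refund", "30 days"])
    assert score >= 0.9, f"keyword coverage too low: {score:.2f}"
```

A metric library replaces keyword_coverage with something stronger (model-graded faithfulness, for example), but the shape stays the same: compute a score, compare it to a threshold, fail the test run if it falls short.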
HELM — Stanford CRFM
What it is. A public model-vs-model benchmark suite from Stanford (HELM stands for Holistic Evaluation of Language Models). It is not really an "application eval" framework; it is infrastructure for public leaderboards.
Best at. Comparing research models against standardized tasks. Reading published numbers. Not useful for evaluating your specific application against your specific users.
Shape. Open-source code, public leaderboards.
Primary docs. crfm.stanford.edu/helm.
Phoenix — Arize AI
What it is. An open-source LLM observability and eval platform. Strong on tracing, retrieval evaluation, and hallucination detection.
Best at. Teams that want tracing-first observability with built-in evals for RAG, agents, and retrieval.
Shape. Open-source server + Python SDK; commercial Arize platform behind it.
Primary docs. docs.arize.com/phoenix.
Weave — Weights & Biases
What it is. W&B's product for LLM application tracing, evals, and experiment tracking. Integrates with the broader W&B ecosystem (model training, experiment management).
Best at. Teams already on W&B for ML experiments who want their LLM app evals in the same place.
Shape. SaaS (W&B platform) + open-source SDK.
Primary docs. weave-docs.wandb.ai.
Ragas
What it is. An open-source framework specifically for evaluating RAG (retrieval-augmented generation) systems. Defines RAG-specific metrics: context precision, context recall, faithfulness, answer relevance.
Best at. Teams whose application is primarily RAG and who want metrics that target retrieval quality directly.
Shape. Open-source Python package; integrates with most of the above platforms.
Primary docs. docs.ragas.io and github.com/explodinggradients/ragas.
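For intuition about the retrieval-side metrics, here is a simplified, set-based sketch of context precision and context recall. It is an assumption-laden illustration, not how Ragas computes them: Ragas's own metrics differ in detail and typically rely on an LLM judge, whereas this version assumes the relevant chunk IDs have already been labeled by hand. The chunk IDs are made up.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for chunk in retrieved if chunk in relevant) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of relevant chunks that the retriever managed to return."""
    if not relevant:
        return 1.0
    return len(set(retrieved) & relevant) / len(relevant)

# Hypothetical labels: the retriever returned four chunks, three chunks are relevant.
retrieved = ["doc-1", "doc-4", "doc-7", "doc-9"]
relevant = {"doc-1", "doc-7", "doc-8"}

precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved are relevant
recall = context_recall(retrieved, relevant)        # 2 of 3 relevant were retrieved
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision suggests the retriever is padding the context with noise; low recall suggests it is missing material the answer needs. Faithfulness and answer relevance then judge the generated answer rather than the retrieved context.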
How to pick
A decision tree that fits most teams:
- What do you mainly evaluate?
  - Prompts (and prompt variants in CI) → Promptfoo first.
  - RAG quality → Ragas (sometimes alongside one of the others).
  - Production traces of an agent or chat app → Braintrust, LangSmith, Phoenix, or Weave.
  - Frontier-model safety/capability evals → Inspect.
- Where do you want state to live?
  - In your repo and CI → Promptfoo, DeepEval, Inspect.
  - In a hosted UI → Braintrust, LangSmith, Phoenix, Weave.
- What is already in your stack?
  - Already on LangChain → LangSmith is the easiest integration.
  - Already on Weights & Biases → Weave is the easiest integration.
  - Nothing yet → Promptfoo + Phoenix is a common starter combo (CLI evals + open-source observability).
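The three questions above can be written down as a small lookup, which makes it easy to see where the answers agree. The sketch below encodes the mappings exactly as listed; the vote-counting rule used to combine the three answers is an assumption of this example, not something the decision tree prescribes, and the keys ("prompts", "repo", "langchain", and so on) are hypothetical labels.

```python
# Mappings copied from the decision tree; the keys are illustrative labels.
BY_FOCUS = {
    "prompts": ["Promptfoo"],
    "rag": ["Ragas"],
    "traces": ["Braintrust", "LangSmith", "Phoenix", "Weave"],
    "safety": ["Inspect"],
}
BY_STATE = {
    "repo": ["Promptfoo", "DeepEval", "Inspect"],
    "hosted": ["Braintrust", "LangSmith", "Phoenix", "Weave"],
}
BY_STACK = {
    "langchain": ["LangSmith"],
    "wandb": ["Weave"],
    "none": ["Promptfoo", "Phoenix"],
}

def shortlist(focus: str, state: str, stack: str) -> list[str]:
    """Rank tools by how many of the three answers point to them (assumed rule)."""
    votes: dict[str, int] = {}
    for tool in BY_FOCUS[focus] + BY_STATE[state] + BY_STACK[stack]:
        votes[tool] = votes.get(tool, 0) + 1
    # Sort by vote count (descending), then by name for a stable order.
    return sorted(votes, key=lambda tool: (-votes[tool], tool))

# A team evaluating production traces, wanting a hosted UI, already on LangChain.
print(shortlist("traces", "hosted", "langchain"))
```

When one tool collects a vote from every question, the choice is easy; when the answers point in different directions, that usually means pairing two tools (an eval runner and an observability tool) rather than forcing one to do both jobs.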
A reasonable starting setup for most teams: Promptfoo in CI for prompt regression, plus whichever observability tool already has your production traces (or Phoenix if you do not have one yet). Add Ragas if you are doing RAG; add LangSmith or Braintrust later when the hosted UI starts paying for itself.
What you should not do is adopt three tools at once and then never use any of them. Pick the smallest setup that catches regressions on the cases you care about, and grow from there.