Promptfoo or LangSmith?

They optimize for different shapes. Promptfoo is CLI-first and config-driven, and it lives in your repo and CI. LangSmith is a hosted platform with a trace UI, a dataset manager, and tight LangChain integration. If your prompts live in code and you want CI gates, Promptfoo is the simpler path. If you also want to inspect production traces and you are already on LangChain, LangSmith's integration tightens the loop.

Do I need a framework at all?

Not strictly. The simplest eval — a Python script that reads a CSV, calls the API, asserts on outputs — is a perfectly good first version. Frameworks add value when you want CI integration, score history, hosted UIs, dataset management, or judge orchestration. Reach for one when you can name the specific thing it gives you that a script does not.
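As a rough illustration of that first version, here is a minimal sketch. The `call_model` function is a hypothetical stand-in that returns canned answers so the script runs offline; in practice you would replace it with your provider's client. The CSV column names (`prompt`, `expected_substring`) are also assumptions, not a standard.

```python
import csv
import io


def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a real API call; swap in your provider's client."""
    canned = {
        "What is 2 + 2?": "2 + 2 equals 4.",
        "Capital of France?": "The capital of France is Paris.",
    }
    return canned.get(prompt, "")


def run_eval(csv_text: str) -> list[dict]:
    """Run every CSV row through the model and record whether the expected substring appears."""
    results = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        output = call_model(row["prompt"])
        results.append({
            "prompt": row["prompt"],
            "output": output,
            "passed": row["expected_substring"].lower() in output.lower(),
        })
    return results


# Inline CSV keeps the sketch self-contained; a real script would open a file instead.
CASES = """prompt,expected_substring
What is 2 + 2?,4
Capital of France?,Paris
"""

results = run_eval(CASES)
failures = [r for r in results if not r["passed"]]
assert not failures, f"{len(failures)} eval case(s) failed: {failures}"
print(f"{len(results)} cases passed")
```

Once a script like this starts growing its own reporting, history, or judge logic, that is usually the signal that a framework would earn its keep.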

How do I pick between two SaaS options?

Three questions. Where do your traces already go? (Pick the tool that already ingests your existing traces.) How much do you value managing datasets and prompts in a UI versus in your repo? (More UI-driven: Braintrust/LangSmith. Repo-driven: Promptfoo.) How does pricing behave at your traffic shape? (Compare token-billed vs trace-billed pricing; differences can be large.)
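To see why traffic shape matters for the pricing question, a back-of-the-envelope calculation helps. The sketch below uses entirely hypothetical prices and volumes (no vendor's real rates) to compare a token-billed plan with a trace-billed plan under two traffic shapes.

```python
def monthly_cost_token_billed(traces: int, tokens_per_trace: int, usd_per_million_tokens: float) -> float:
    """Cost when the vendor bills on total tokens ingested."""
    return traces * tokens_per_trace * usd_per_million_tokens / 1_000_000


def monthly_cost_trace_billed(traces: int, usd_per_thousand_traces: float) -> float:
    """Cost when the vendor bills per trace, regardless of trace size."""
    return traces * usd_per_thousand_traces / 1_000


# Hypothetical prices, chosen only to make the arithmetic easy to follow.
USD_PER_MILLION_TOKENS = 0.50
USD_PER_THOUSAND_TRACES = 0.50

# Two hypothetical traffic shapes: many short traces vs few long agent traces.
shapes = {
    "many short traces": {"traces": 1_000_000, "tokens_per_trace": 500},
    "few long traces": {"traces": 50_000, "tokens_per_trace": 40_000},
}

costs = {}
for name, shape in shapes.items():
    costs[name] = {
        "token_billed": monthly_cost_token_billed(
            shape["traces"], shape["tokens_per_trace"], USD_PER_MILLION_TOKENS
        ),
        "trace_billed": monthly_cost_trace_billed(shape["traces"], USD_PER_THOUSAND_TRACES),
    }
    print(name, costs[name])
```

With these made-up numbers, token billing is cheaper for the short-trace workload and trace billing is far cheaper for the long-trace workload, which is the kind of reversal worth checking against real price sheets.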

Eval Frameworks Compared

As of 2026-05-19

There are now enough LLM eval frameworks that picking one feels like its own evaluation problem. The good news: they cluster cleanly by shape and by what they optimize for. Pick the cluster that matches your workflow first; pick a specific tool within it after.

The shape clusters

Most production teams pick one tool from each of two clusters:

Eval runner — the thing that runs your tests. CLI-first (lives in your repo, runs in CI) or SaaS (lives behind a UI, with hosted state).
Observability — the thing that captures production traces. Often the same product as the eval runner, sometimes a separate observability tool that feeds traces into the eval runner.

Below is a brief view of each tool.

Inspect — AI Safety Institute (UK)

What it is. An open-source framework for LLM evaluations from the UK AI Security Institute (formerly the AI Safety Institute). Designed for frontier-model safety and capability evals: jailbreakability, autonomy, harmful content, dangerous-capability assessments.

Best at. Structured evaluation pipelines with governance and safety emphasis. Suitable for regulators, labs, and teams running serious risk-focused evaluations rather than routine product QA.

Shape. Open source Python package + web app, self-hosted, no SaaS.

Primary docs. inspect.aisi.org.uk and github.com/UKGovernmentBEIS/inspect_ai.

Promptfoo

What it is. An open-source CLI and SDK for prompt testing. Config-driven (YAML/JSON), heavily CI-focused.

Best at. Prompt regression testing with rich assertion types (exact match, contains, regex, model-graded, JSON schema, etc.). Excellent for "treat prompts like code" workflows.
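To make the assertion types concrete, here is a plain-Python sketch of what each kind of check amounts to. This is not Promptfoo's API (Promptfoo declares assertions in its config file); the function names are hypothetical, and the JSON check is simplified to "required keys are present" rather than full JSON Schema validation.

```python
import json
import re


def check_equals(output: str, expected: str) -> bool:
    """Exact match, ignoring leading and trailing whitespace."""
    return output.strip() == expected.strip()


def check_contains(output: str, needle: str) -> bool:
    """Substring match."""
    return needle in output


def check_regex(output: str, pattern: str) -> bool:
    """Pattern match anywhere in the output."""
    return re.search(pattern, output) is not None


def check_json_keys(output: str, required_keys: list[str]) -> bool:
    """Simplified structural check: output parses as a JSON object with the required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(key in data for key in required_keys)


# A hypothetical model output to run the checks against.
output = '{"city": "Paris", "population_millions": 2.1}'

checks = {
    "contains": check_contains(output, "Paris"),
    "regex": check_regex(output, r'"population_millions":\s*\d+(\.\d+)?'),
    "json_keys": check_json_keys(output, ["city", "population_millions"]),
}
print(checks)
```

Model-graded assertions follow the same shape, except the boolean comes from a second model call judging the output against a rubric.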

Shape. Open-source CLI + library; optional hosted dashboard. Runs locally, in CI, anywhere.

Primary docs. promptfoo.dev/docs.

Braintrust

What it is. A data-centric eval and observability platform. Strong dataset management, scoring functions, and experiment tracking. Started in the open-source orbit and is now a hosted product with a significant free tier.

Best at. Production teams that want a hosted UI for managing eval datasets and comparing prompt versions, with first-class CI integration.

Shape. SaaS + open-source SDK; self-hosted option exists.

Primary docs. braintrust.dev/docs.

LangSmith — LangChain

What it is. LangChain's production platform: tracing, datasets, experiments, evals. Tight integration with the LangChain library, but works without it.

Best at. Teams already using LangChain or LangGraph who want production traces, dataset management, and prompt-version experiments in one place.

Shape. SaaS + self-hosted option.

Primary docs. docs.smith.langchain.com.

DeepEval — Confident AI

What it is. Open-source Python SDK that markets itself as "pytest for LLMs." Defines tests as Python functions; offers a library of metrics (faithfulness, toxicity, bias, summarization, RAG).

Best at. Teams that want eval inside their normal Python testing flow, no separate CLI, no SaaS account required.

Shape. Open-source Python package; companion hosted platform (Confident AI) for trace storage.

Primary docs. deepeval.com and github.com/confident-ai/deepeval.

HELM — Stanford CRFM

What it is. HELM (Holistic Evaluation of Language Models) is a public model-vs-model benchmark suite from Stanford's Center for Research on Foundation Models. It is less an "application eval" framework than public leaderboard infrastructure.

Best at. Comparing research models against standardized tasks. Reading published numbers. Not useful for evaluating your specific application against your specific users.

Shape. Open-source code, public leaderboards.

Primary docs. crfm.stanford.edu/helm.

Phoenix — Arize AI

What it is. An open-source LLM observability and eval platform. Strong on tracing, retrieval evaluation, and hallucination detection.

Best at. Teams that want tracing-first observability with built-in evals for RAG, agents, and retrieval.

Shape. Open-source server + Python SDK; commercial Arize platform behind it.

Primary docs. docs.arize.com/phoenix.

Weave — Weights & Biases

What it is. W&B's product for LLM application tracing, evals, and experiment tracking. Integrates with the broader W&B ecosystem (model training, experiment management).

Best at. Teams already on W&B for ML experiments who want their LLM app evals in the same place.

Shape. SaaS (W&B platform) + open-source SDK.

Primary docs. weave-docs.wandb.ai.

Ragas

What it is. An open-source framework specifically for evaluating RAG (retrieval-augmented generation) systems. Defines RAG-specific metrics: context precision, context recall, faithfulness, answer relevance.
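For intuition about two of these metrics, the sketch below computes simplified, set-overlap versions of context precision and context recall over hypothetical chunk IDs. This is not how Ragas itself computes them (its metrics are more nuanced and typically rely on an LLM judge); it only shows what the two numbers are trying to capture.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for chunk in retrieved if chunk in relevant) / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of relevant chunks that the retriever managed to find."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


# Hypothetical chunk IDs: what the retriever returned vs what a labeler marked relevant.
retrieved = ["doc1", "doc4", "doc7", "doc9"]
relevant = {"doc1", "doc7", "doc8"}

precision = context_precision(retrieved, relevant)
recall = context_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision suggests the retriever is padding the context with noise; low recall suggests it is missing material the answer needs.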

Best at. Teams whose application is primarily RAG and who want metrics that target retrieval quality directly.

Shape. Open-source Python package; integrates with most of the above platforms.

Primary docs. docs.ragas.io and github.com/explodinggradients/ragas.

How to pick

A decision tree that fits most teams:

What do you mainly evaluate?
- Prompts (and prompt variants in CI) → Promptfoo first.
- RAG quality → Ragas (sometimes alongside one of the others).
- Production traces of an agent or chat app → Braintrust, LangSmith, Phoenix, or Weave.
- Frontier-model safety/capability evals → Inspect.
Where do you want state to live?
- In your repo and CI → Promptfoo, DeepEval, Inspect.
- In a hosted UI → Braintrust, LangSmith, Phoenix, Weave.
What is already in your stack?
- Already on LangChain → LangSmith is the easiest integration.
- Already on Weights & Biases → Weave is the easiest integration.
- Nothing yet → Promptfoo + Phoenix is a common starter combo (CLI evals + open-source observability).
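The three questions above can be encoded as lookup tables, with each answer casting a vote for the tools it points to. This is only one possible encoding, and the answer keys (`"prompts"`, `"repo"`, `"none"`, and so on) are hypothetical labels chosen for the sketch.

```python
from collections import Counter

# Each table mirrors one question in the decision tree.
BY_TARGET = {
    "prompts": ["Promptfoo"],
    "rag": ["Ragas"],
    "traces": ["Braintrust", "LangSmith", "Phoenix", "Weave"],
    "safety": ["Inspect"],
}
BY_STATE = {
    "repo": ["Promptfoo", "DeepEval", "Inspect"],
    "hosted": ["Braintrust", "LangSmith", "Phoenix", "Weave"],
}
BY_STACK = {
    "langchain": ["LangSmith"],
    "wandb": ["Weave"],
    "none": ["Promptfoo", "Phoenix"],
}


def shortlist(target: str, state: str, stack: str) -> list[str]:
    """Return tools ordered by how many of the three answers point to them."""
    votes = Counter()
    for table, answer in ((BY_TARGET, target), (BY_STATE, state), (BY_STACK, stack)):
        votes.update(table[answer])
    return [tool for tool, _ in votes.most_common()]


print(shortlist("traces", "hosted", "langchain"))
print(shortlist("prompts", "repo", "none"))
```

Treat the ranking as a conversation starter rather than a verdict: a tool with one vote may still be the right pick if that one answer is the constraint that matters most to you.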

A reasonable starting setup for most teams: Promptfoo in CI for prompt regression, plus whichever observability tool already has your production traces (or Phoenix if you do not have one yet). Add Ragas if you are doing RAG; add LangSmith or Braintrust later when the hosted UI starts paying for itself.

What you should not do is adopt three tools at once and then never use any of them. Pick the smallest setup that catches regressions on the cases you care about, and grow from there.
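As a sketch of what "the smallest setup that catches regressions" can look like, the snippet below compares per-case scores from a current run against a stored baseline and flags drops beyond a tolerance. The case IDs, scores, and the 0.05 tolerance are all hypothetical; a real setup would load both score sets from files produced by your eval runner.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.05,
) -> list[str]:
    """Return case IDs whose score dropped by more than `tolerance`, or that are missing from the current run."""
    regressed = []
    for case_id, old_score in baseline.items():
        new_score = current.get(case_id)
        if new_score is None or old_score - new_score > tolerance:
            regressed.append(case_id)
    return sorted(regressed)


# Hypothetical scores for three cases the team cares about.
baseline = {"refund-policy": 0.90, "greeting": 1.00, "escalation": 0.80}
current = {"refund-policy": 0.88, "greeting": 1.00, "escalation": 0.60}

regressions = find_regressions(baseline, current)

# In CI you would pass this to sys.exit() so a regression fails the build.
exit_code = 1 if regressions else 0
print(f"regressions={regressions} exit_code={exit_code}")
```

A small tolerance keeps judge noise from failing every build, while a real drop on a case you care about still blocks the merge.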