Subject
How working teams actually evaluate LLM applications: building golden sets, designing rubrics, using LLM-as-judge without inheriting its biases, regression-testing prompts in CI, and which of the dozen eval frameworks is worth picking up.
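Most production LLM evaluation stacks in 2026 converge on the same shape, even when the teams using them never compared notes. It is five layers, each with a different job: prompt unit tests, a frozen golden set, LLM-as-judge scoring, human review, and online evals in production.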
You do not need every layer for every app. You do need a deliberate choice about which layers you have, why, and what each catches.
Treat your prompts like code: define small, focused tests with explicit pass criteria, and run them on every change.
What goes in:
Most eval frameworks (Promptfoo, Braintrust, LangSmith, Inspect) support this directly. Promptfoo's whole CLI is built around it. Anthropic's production guidance and OpenAI's evals library both treat this layer as table stakes.
Unit tests do not catch quality regressions; they catch contract violations. That is enough to be worth the cost.
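A minimal sketch of a contract-style unit test, assuming the prompt is supposed to return a JSON object with a sentiment label and a confidence score. The schema, the `fake_model` stub, and the `check_contract` helper are all hypothetical and stand in for whatever your app and framework actually use; the point is only that the pass criteria are explicit and cheap to check.

```python
import json

# Assumed contract for this illustration: the labels the prompt may return.
ALLOWED_SENTIMENTS = ("positive", "negative", "neutral")


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; returns a canned reply."""
    return json.dumps({"sentiment": "positive", "confidence": 0.92})


def check_contract(raw: str) -> list:
    """Return a list of contract violations; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    errors = []
    if data.get("sentiment") not in ALLOWED_SENTIMENTS:
        errors.append("sentiment missing or outside the allowed labels")

    conf = data.get("confidence")
    is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
    if not is_number or not 0.0 <= conf <= 1.0:
        errors.append("confidence missing or outside [0, 1]")
    return errors


def test_reply_honours_contract():
    # Written so a pytest-style runner would collect it; it checks shape, not quality.
    assert check_contract(fake_model("Classify: I love this product")) == []
```

A test like this says nothing about whether "positive" was the right label. It only guarantees that downstream code will not break on the reply, which is the contract-violation role described above.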
This is where most of the actual signal comes from. Curate 10–50 real inputs with documented expected behaviors. Freeze them. Re-run them on every prompt change, model change, or dependency change.
The discipline:
The companion article in the Bench cluster, how-we-bench-llms-methodology, describes this approach in more detail.
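As a rough illustration of "curate, freeze, re-run", the sketch below keeps a tiny golden set in code, fingerprints it so an accidental edit is detectable, and re-runs every case against a generation function. The case contents, the substring-based pass criterion, and the `fake_generate` stub are assumptions made for the example, not a description of any particular framework.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    prompt: str
    must_contain: tuple  # documented expected behavior, kept deliberately simple


# Hypothetical cases; a real set would be curated from real inputs.
GOLDEN_SET = (
    GoldenCase("refund-01", "Customer asks for a refund after 40 days.", ("30-day",)),
    GoldenCase("greeting-01", "Customer says hello.", ("help",)),
)


def fingerprint(cases) -> str:
    """Hash the set so a change to a frozen case shows up as a changed digest."""
    payload = json.dumps([[c.case_id, c.prompt, list(c.must_contain)] for c in cases])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_golden_set(cases, generate) -> dict:
    """Run every case through `generate` and record pass/fail per case id."""
    results = {}
    for case in cases:
        output = generate(case.prompt).lower()
        results[case.case_id] = all(s.lower() in output for s in case.must_contain)
    return results


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for the app under test."""
    if "refund" in prompt:
        return "Our 30-day refund window has passed, but I can help with store credit."
    return "Hello! How can I help you today?"
```

Storing the fingerprint next to recorded results makes it obvious when two runs are not comparable because the set itself changed between them.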
Once you are evaluating open-ended outputs at scale (writing quality, code review quality, agent behavior), human scoring stops being practical. LLM-as-judge fills the gap: one model scores another's output against an explicit rubric.
It works, with caveats. The biases are real and well-documented (see llm-as-judge-explained). Mitigated correctly, LLM-as-judge correlates well enough with human scoring to be useful as an automated layer. Mitigated badly, it gives you confident-looking numbers that drift away from what users actually experience.
Use LLM-as-judge where the cost of being slightly wrong is low (regression detection, A/B testing prompt variants) and back it with human review for the high-stakes decisions.
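One commonly discussed judge bias is position bias in pairwise comparisons, where the judge favors whichever answer it sees first. The sketch below shows one possible mitigation: ask twice with the order swapped and accept a verdict only when both runs agree. This is an illustration under assumptions, not the method from the linked article; the `Judge` signature, the rubric text, and both stub judges are hypothetical, and a real judge would wrap a model call.

```python
from typing import Callable

RUBRIC = "Prefer the answer that is correct, complete, and concise."

# Assumed interface: (rubric, question, first_answer, second_answer) -> "first" or "second".
Judge = Callable[[str, str, str, str], str]


def compare_with_swap(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge twice with the answer order flipped; keep only a consistent verdict."""
    forward = judge(RUBRIC, question, answer_a, answer_b)
    backward = judge(RUBRIC, question, answer_b, answer_a)

    forward_winner = "A" if forward == "first" else "B"
    # In the swapped call, "first" refers to answer_b.
    backward_winner = "B" if backward == "first" else "A"

    return forward_winner if forward_winner == backward_winner else "tie"


def position_biased_judge(rubric: str, question: str, first: str, second: str) -> str:
    """Hypothetical judge that always prefers whatever it sees first."""
    return "first"


def shorter_answer_judge(rubric: str, question: str, first: str, second: str) -> str:
    """Hypothetical judge with an order-independent preference: the strictly shorter answer."""
    return "first" if len(first) < len(second) else "second"
```

With the purely position-driven stub, the two runs disagree and the result collapses to a tie, so the bias produces no winner instead of a confident wrong one. Order swapping addresses only this one bias; others need their own checks.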
For the cases where LLM-as-judge is not enough (model launches, regulatory questions, anything where being subtly wrong is expensive), you need humans actually reading the outputs.
What it looks like in practice:
Anthropic's production guidance recommends combining LLM-judge and human review in exactly this layered way.
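When humans and an LLM judge label the same sample of outputs, it helps to quantify how far the judge can be trusted on the rest. The sketch below computes Cohen's kappa, a chance-corrected agreement score, over two label lists. Choosing kappa here is an assumption of this example and not something the guidance above prescribes; the label values are hypothetical.

```python
from collections import Counter


def cohens_kappa(human: list, judge: list) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(human)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Agreement expected by chance, from each rater's own label frequencies.
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

A kappa near 1 suggests the judge tracks the human reviewers on this sample; a value near 0 means it agrees no more often than chance would, which is a signal to rework the rubric before relying on the automated scores.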
Offline evals predict; online evals confirm. The standard pattern:
This layer catches the failures that the golden set missed, usually distribution shift or edge cases your eval set did not include.
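One way to get an online signal without scoring every request is to sample a fixed fraction of production traffic deterministically and run a cheap check over it. The sketch below is an assumption-laden illustration: the record shape (`request_id`, `output`), the hash-based sampler, and the `check` callable are all hypothetical.

```python
import hashlib
from typing import Callable, Optional


def in_eval_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests by hashing their id."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def online_pass_rate(records: list, check: Callable[[str], bool], rate: float) -> Optional[float]:
    """Pass rate of `check` over the sampled records, or None if nothing was sampled."""
    sampled = [r for r in records if in_eval_sample(r["request_id"], rate)]
    if not sampled:
        return None
    return sum(1 for r in sampled if check(r["output"])) / len(sampled)
```

Hashing the request id, instead of drawing a random number, means the same request is always in or out of the sample, so a rerun scores the same traffic. A gap between this number and the offline pass rate is a hint that production inputs have drifted away from the eval set.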
Three patterns that look like eval but do not deliver:
For a small team starting from zero:
You can add the other layers as the app grows. That setup catches more regressions on day one than a five-layer dream stack you never actually run.
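Whatever layers you start with, they only help if they run on every change. A minimal sketch of a CI gate, assuming each run produces a mapping from case id to pass/fail and that a baseline mapping from the last accepted run is available; the function names and the exit-code convention are choices made for this example.

```python
def find_regressions(baseline: dict, current: dict) -> list:
    """Case ids that passed in the baseline but fail, or are missing, in the current run."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not current.get(case_id, False)
    )


def gate(baseline: dict, current: dict) -> int:
    """Return a process exit code: 0 lets the change through, 1 blocks it."""
    regressions = find_regressions(baseline, current)
    for case_id in regressions:
        print(f"REGRESSION: {case_id}")
    return 1 if regressions else 0
```

In a CI job this could end with `sys.exit(gate(baseline, current))`, so that a prompt change that breaks a previously passing case fails the build. Cases that were already failing in the baseline do not block the change, which keeps the gate focused on regressions.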
A short editorial reading list. Pick whichever fits how you like to learn.