Evals & Testing

How working teams actually evaluate LLM applications: building golden sets, designing rubrics, using LLM-as-judge without inheriting its biases, regression-testing prompts in CI, and which of the dozen eval frameworks is worth picking up.

Most production LLM evaluation stacks in 2026 converge on the same shape, even when the teams using them never compared notes. It is five layers, each with a different job:

  1. Prompt and component unit tests. Cheap, fast, run on every change.
  2. Golden eval sets with rubrics. The backbone — your contract with the application's behavior.
  3. LLM-as-judge scoring. Automated grading of open-ended outputs.
  4. Human-in-the-loop review. For the cases where automated scoring is not enough.
  5. Online monitoring and shadow deploys. Real traffic in production.

You do not need every layer for every app. You do need a deliberate choice about which layers you have, why, and what each catches.

Layer 1: prompt and component unit tests

Treat your prompts like code: define small, focused tests with explicit pass criteria, and run them on every change.

What goes in:

  • A classification prompt should return one of N labels — assert that.
  • An extraction prompt should return valid JSON matching a schema — assert that.
  • A summarization prompt should stay under a word cap — assert that.
  • An agent's first tool call should be one of an allowed set — assert that.

Most eval frameworks (Promptfoo, Braintrust, LangSmith, Inspect) support this directly. Promptfoo's whole CLI is built around it. Anthropic's production guidance and OpenAI's evals library both treat this layer as table stakes.

Unit tests do not catch quality regressions; they catch contract violations. That is enough to be worth the cost.
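The four contract checks listed above can be sketched as plain functions that any test runner could call. This is a minimal illustration, not any framework's API: the label set, the tool names, and the function names are all hypothetical, and the model output is assumed to arrive as a plain string (or a list of tool names for the agent case).

```python
import json

# Hypothetical contracts for illustration; substitute your application's own.
ALLOWED_LABELS = {"billing", "bug", "feature_request"}
ALLOWED_FIRST_TOOLS = {"search_docs", "lookup_account"}


def check_label(output: str) -> bool:
    """Classification contract: the output is exactly one allowed label."""
    return output.strip().lower() in ALLOWED_LABELS


def check_json_fields(output: str, required: dict[str, type]) -> bool:
    """Extraction contract: a JSON object carrying the required typed fields."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), kind) for key, kind in required.items())


def check_word_cap(output: str, max_words: int) -> bool:
    """Summarization contract: whitespace-separated word count within the cap."""
    return len(output.split()) <= max_words


def check_first_tool(tool_calls: list[str]) -> bool:
    """Agent contract: the first tool call exists and is in the allowed set."""
    return bool(tool_calls) and tool_calls[0] in ALLOWED_FIRST_TOOLS
```

Each function returns a boolean, so wrapping it in an `assert` inside whatever test runner you already use is enough to fail the build on a contract violation.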

Layer 2: golden eval sets with rubrics

This is where most of the actual signal comes from. Curate 10–50 real inputs with documented expected behaviors. Freeze them. Re-run them on every prompt change, model change, or dependency change.

The discipline:

  • Real inputs, not synthetic. Past user messages, past meeting transcripts, past tickets. Distribution-aware: include edge cases at the rate they actually occur.
  • Explicit rubric. Binary pass/fail when possible; a four-point rubric for open-ended outputs. Write it down before you score anything.
  • Frozen set. Same inputs every time. If you change the eval set, you have to re-baseline.
  • Versioned. Track which prompt version got which score against which eval set version. This is your audit trail.
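One way to make "frozen" and "versioned" concrete is to derive the eval set version from the content of the cases, so any edit to the set shows up as a new version and forces a re-baseline. The sketch below assumes binary pass/fail scoring; `GoldenCase`, `generate`, and `passes` are hypothetical names, with `generate` standing in for your application call and `passes` for your written rubric.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    input_text: str
    expected: str  # documented expected behavior, written before scoring


def eval_set_version(cases: list[GoldenCase]) -> str:
    """Content hash of the set: order-independent, changes on any edit."""
    ordered = sorted(cases, key=lambda case: case.case_id)
    payload = json.dumps([asdict(case) for case in ordered], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def run_golden_set(
    cases: list[GoldenCase],
    prompt_version: str,
    generate: Callable[[str], str],
    passes: Callable[[GoldenCase, str], bool],
) -> dict:
    """Score every case and return one audit-trail record for this run."""
    if not cases:
        raise ValueError("golden set is empty")
    results = {case.case_id: passes(case, generate(case.input_text)) for case in cases}
    return {
        "prompt_version": prompt_version,
        "eval_set_version": eval_set_version(cases),
        "pass_rate": sum(results.values()) / len(results),
        "results": results,
    }
```

Appending each returned record to a log file gives you the prompt version, eval set version, and score in one place, which is the audit trail the last bullet asks for.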

The companion article in the Bench cluster, how-we-bench-llms-methodology, describes this approach in more detail.

Layer 3: LLM-as-judge

Once you are evaluating open-ended outputs at scale (writing quality, code review quality, agent behavior), human scoring stops being practical. LLM-as-judge fills the gap: one model scores another's output against an explicit rubric.

It works, with caveats. The biases are real and well documented: position bias, verbosity bias, and self-preference are the commonly cited ones (see llm-as-judge-explained). Mitigated correctly, LLM-as-judge correlates well enough with human scoring to be useful as an automated layer. Mitigated badly, it gives you confident-looking numbers that drift away from what users actually experience.

Use LLM-as-judge where the cost of being slightly wrong is low (regression detection, A/B testing prompt variants) and back it with human review for the high-stakes decisions.
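As one example of a mitigation, here is a sketch of a pairwise comparison (the A/B case) that guards against position bias by asking the judge twice with the order swapped and counting a win only when both orderings agree. This is one common technique, not the only one, and it does nothing for other biases. The `judge` callable is a hypothetical wrapper around whatever model call you use; it is assumed to return "first", "second", or "tie".

```python
from typing import Callable

# Hypothetical judge signature: (rubric, task_input, first_output, second_output)
# returns "first", "second", or "tie". The model call itself is not shown.
Judge = Callable[[str, str, str, str], str]


def compare_with_swap(
    judge: Judge, rubric: str, task_input: str, output_a: str, output_b: str
) -> str:
    """Return "a", "b", or "tie"; a win must survive swapping presentation order."""
    forward = judge(rubric, task_input, output_a, output_b)
    backward = judge(rubric, task_input, output_b, output_a)
    if forward == "first" and backward == "second":
        return "a"
    if forward == "second" and backward == "first":
        return "b"
    # Inconsistent or tied verdicts are treated as no detectable preference.
    return "tie"
```

A judge that always favors whichever output it sees first produces "tie" here instead of a spurious win, at the price of two judge calls per comparison.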

Layer 4: human-in-the-loop review

For the cases where LLM-as-judge is not enough — model launches, regulatory questions, anything where being subtly wrong is expensive — you need humans actually reading the outputs.

What it looks like in practice:

  • A small panel reviews a random sample of recent outputs against the same rubric the LLM judge uses.
  • Disagreements between the LLM judge and humans are logged and investigated — they usually surface a weakness in the rubric or in the judge prompt itself.
  • High-risk decisions get full human review; routine decisions get the LLM judge.

Anthropic's production guidance recommends combining LLM-judge and human review in exactly this layered way.
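To make the judge-versus-human comparison measurable, a team might compute chance-corrected agreement (Cohen's kappa) on the reviewed sample and list the cases where the two disagree. This is a minimal sketch assuming both the panel and the judge emit one categorical label per case; the function names are illustrative.

```python
from collections import Counter


def cohen_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    total = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / total
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (total * total)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def disagreements(case_ids: list[str], human: list[str], judge: list[str]) -> list[str]:
    """IDs of cases where the judge and the human panel gave different labels."""
    return [cid for cid, h, j in zip(case_ids, human, judge) if h != j]
```

The kappa value tells you whether the judge is trustworthy enough for routine decisions; the disagreement list is what you actually read to find weaknesses in the rubric or the judge prompt.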

Layer 5: online monitoring and shadow deploys

Offline evals predict; online evals confirm. The standard pattern:

  • Log every model call in production with the inputs, outputs, and any user feedback signals.
  • Run lightweight automatic checks on a sample of production traffic (the same checks from layer 1, plus drift detection).
  • Shadow-deploy candidate prompts against real traffic without affecting users; compare outputs.
  • A/B test changes that pass offline.

This layer catches the failures that the golden set missed — usually distribution shift or edge cases your eval set did not include.
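The sampling and shadow-comparison steps might look like the following sketch. It assumes each log record is a dict with `request_id`, `input`, and `output` keys, which is a hypothetical schema; `check` is any layer 1 contract check, `candidate` is the shadow prompt's generate function, and `same` is whatever equivalence test suits your outputs.

```python
import hashlib
from typing import Callable


def in_sample(request_id: str, rate: float) -> bool:
    """Deterministic sampling: a given request is always in or always out."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") < rate * 2**64


def monitor(logs: list[dict], check: Callable[[str], bool], rate: float) -> dict:
    """Run a contract check on a sample of logged production outputs."""
    sampled = [record for record in logs if in_sample(record["request_id"], rate)]
    failed = [r["request_id"] for r in sampled if not check(r["output"])]
    return {
        "sampled": len(sampled),
        "failed": failed,
        "failure_rate": len(failed) / len(sampled) if sampled else 0.0,
    }


def shadow_diff_rate(
    logs: list[dict],
    candidate: Callable[[str], str],
    same: Callable[[str, str], bool],
) -> float:
    """Fraction of logged inputs where the candidate differs from what shipped."""
    if not logs:
        return 0.0
    differing = sum(not same(r["output"], candidate(r["input"])) for r in logs)
    return differing / len(logs)
```

Hashing the request ID rather than calling a random number generator keeps the sample stable across reruns, so a failure you saw yesterday is still in the sample when you investigate it today.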

What to skip

Three patterns that look like eval but do not deliver:

  • Vibe scoring. Eyeballing 3–5 outputs and calling it done. That is below the sample size at which you can tell signal from noise.
  • Generic benchmark leaderboards as a proxy. MMLU, HumanEval, and GSM8K are interesting, but they do not predict whether your prompt handles your tickets correctly.
  • One-shot pre-launch evals with no CI integration. If evals are not run on every change, prompts will regress between evals.

The simplest version that works

For a small team starting from zero:

  1. Curate 20 real inputs. Write the expected behavior for each.
  2. Pick one framework (Promptfoo for CLI-first; LangSmith or Braintrust for SaaS).
  3. Wire it into CI. Block prompt-change PRs on regressions.
  4. Look at the failures every time. Update the eval set when real-world inputs surprise you.

This setup catches more regressions on day one than a five-layer dream stack you never actually run. Add the other layers as the app grows.
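If your chosen framework does not already provide a gate, step 3 can be approximated with a small script that compares per-case results against a stored baseline and returns a non-zero exit code on any regression. This sketch assumes both files are JSON objects mapping case ID to a pass/fail boolean; the file layout and function names are hypothetical.

```python
import json
from pathlib import Path


def find_regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Cases that passed in the baseline but now fail or are missing."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not current.get(case_id, False)
    )


def main(baseline_path: str, current_path: str) -> int:
    """Print regressions and return a process exit code (0 = safe to merge)."""
    baseline = json.loads(Path(baseline_path).read_text(encoding="utf-8"))
    current = json.loads(Path(current_path).read_text(encoding="utf-8"))
    regressions = find_regressions(baseline, current)
    for case_id in regressions:
        print(f"REGRESSION: {case_id}")
    return 1 if regressions else 0
```

Calling `sys.exit(main(baseline_path, current_path))` from the CI job's entry point is enough to block the PR. Treating a missing case as a regression is a deliberate choice here: it stops someone from "fixing" a failure by deleting the case.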

Forthcoming

  • Designing a Golden Eval Set
  • Rubric Design for Open-Ended Tasks
  • Online vs. Offline Evals
