How large should the golden set be for CI?

Small enough that it runs in a few minutes on a PR, big enough that one or two failing cases are a meaningful signal. For most teams that is 20–50 inputs. Tight, curated, and representative beats large and noisy every time. If a single failed case has to be ignored regularly, that case or its rubric is too noisy: fix the test instead of waving the failure through.

What about flakiness?

Run evals at temperature 0 wherever possible. Pin the model version. Pin the seed where the API supports it. Even then, judge-based evals can flake a few percent of the time, so do not block on a single flaky run. Require a small but consistent regression (e.g., 2+ cases that each fail in two separate runs) before blocking.
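
A minimal sketch of that blocking rule, assuming each CI run reports the set of case IDs that failed. The function names and the default thresholds (2 cases, 2 runs) are illustrative, not taken from any particular eval tool.

```python
from collections import Counter


def consistently_failing(runs: list[set[str]], min_repeats: int = 2) -> set[str]:
    """Return the case IDs that failed in at least `min_repeats` runs."""
    counts = Counter(case_id for failed in runs for case_id in failed)
    return {case_id for case_id, n in counts.items() if n >= min_repeats}


def should_block(runs: list[set[str]], min_cases: int = 2, min_repeats: int = 2) -> bool:
    """Block only when enough cases fail repeatedly, not on one flaky run."""
    return len(consistently_failing(runs, min_repeats)) >= min_cases


# Hypothetical results from two runs of the same eval set.
first_run = {"case-07", "case-12"}
second_run = {"case-07", "case-12", "case-03"}
print(should_block([first_run, second_run]))
```

Here case-03 failed only once, so it is treated as possible flake, while case-07 and case-12 failed in both runs and count toward the block.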

How do I handle "the output changed but it is fine"?

Write that distinction into the rubric. A rubric that says "output must equal expected_string" treats every change as a regression. A rubric that says "output must follow this structure and not include hallucinated names" tolerates style drift while still catching the things you care about. The cost of vague rubrics is exactly the cost of every "is this a real regression?" debate.

Prompt Regression Testing

Prompts regress. A change that looks like a clarity tweak quietly makes the model worse on cases nobody thought to check. A model update on the backend silently shifts behavior on a fraction of inputs. A new "improvement" to the system prompt makes the assistant more helpful on the obvious cases and weirder on the edge cases.

Without a regression-testing setup, you find out from users. With one, you find out in CI, before the change merges.

Treat prompts like code

The pattern is genuinely the same as code regression testing:

Define a contract — a curated set of inputs with documented expected behaviors. Often called a "golden set."
Test that contract on every change — run the eval set in CI for any PR that touches prompts, models, or related infrastructure.
Block merges on regressions — failing evals stop the change from shipping.
Version the contract itself — when behavior genuinely should change, update the eval set in the same PR as the prompt change, with a justification.

Tools like Promptfoo, Braintrust, LangSmith, and Inspect all support this directly. The specifics differ; the pattern is identical.

The golden set, in practice

A useful golden set has three qualities:

Real. Inputs taken from past user messages, past tickets, past transcripts. Not synthetic, not generated by an LLM, not edge cases you imagined. Things your users actually sent.
Distribution-aware. Match the mix of input types you actually see. If 20% of real traffic is short-and-friendly, 20% of your eval set should be short-and-friendly.
Tight. 20–50 inputs is usually right. Fewer than that is too small to be meaningful; more than that takes too long to run on every PR and accumulates noisy cases nobody investigates.
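
The distribution-aware property can be checked mechanically. The sketch below assumes every traffic sample and every golden case carries a category label; the category names and the 10-point tolerance are hypothetical.

```python
from collections import Counter


def category_shares(labels: list[str]) -> dict[str, float]:
    """Fraction of items per category; empty input gives an empty dict."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


def coverage_gaps(traffic_labels: list[str], golden_labels: list[str],
                  tolerance: float = 0.10) -> dict[str, float]:
    """Categories whose share in the golden set differs from real traffic
    by more than `tolerance`. Positive means over-represented."""
    traffic = category_shares(traffic_labels)
    golden = category_shares(golden_labels)
    gaps = {}
    for label in traffic.keys() | golden.keys():
        diff = golden.get(label, 0.0) - traffic.get(label, 0.0)
        if abs(diff) > tolerance:
            gaps[label] = round(diff, 2)
    return gaps


# Hypothetical labels: 20% of traffic is short-and-friendly, but half the golden set is.
traffic = ["short-friendly"] * 20 + ["long-detailed"] * 80
golden = ["short-friendly"] * 5 + ["long-detailed"] * 5
print(coverage_gaps(traffic, golden))
```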

For each input, document the expected behavior, not the expected exact output. "Must classify as 'refund'" is a useful expectation. "Must equal 'refund: yes'" is brittle and treats every wording change as a regression. The rubric is more important than the inputs.
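
One way to record an expected behavior rather than an expected string is to pair each input with a small check function. This is a sketch only: the `GoldenCase` shape, the `classifies_as` helper, and the sample input are all invented for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenCase:
    case_id: str
    user_input: str
    expectation: str               # human-readable description of the behavior
    check: Callable[[str], bool]   # programmatic version of that expectation


def classifies_as(label: str) -> Callable[[str], bool]:
    """Pass if the label appears as a whole word, in any casing or phrasing."""
    pattern = re.compile(rf"\b{re.escape(label)}\b", re.IGNORECASE)
    return lambda output: bool(pattern.search(output))


case = GoldenCase(
    case_id="refund-001",
    user_input="I was charged twice and want my money back.",  # hypothetical input
    expectation="Must classify as 'refund'",
    check=classifies_as("refund"),
)

# Both wordings satisfy the behavior; an exact-match test would reject the second.
print(case.check("refund: yes"), case.check("Category: Refund"))
```

A whole-word match is deliberately naive: an output such as "not a refund" would also pass, so real checks usually need a stricter output format or a judge.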

What to run on every PR

In a typical setup, the CI pipeline runs three layers on every prompt-touching PR:

Schema checks — fastest, cheapest. Does the output match the expected shape? (Classification labels in the allowed set; extraction JSON validates against the schema; structured output fields are present.)
Golden-set scoring — run the curated inputs through the candidate prompt; score each against the rubric. Often a mix of programmatic checks (string contains, regex match) and LLM-as-judge for open-ended bits.
Diff against baseline — compare the candidate prompt's scores to the current production prompt's scores on the same set. Block when the candidate is worse by more than a small threshold.
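
The first layer, schema checks, needs nothing beyond the standard library. The sketch below assumes the model returns a JSON object; the allowed labels and required fields are hypothetical placeholders for whatever your own output contract specifies.

```python
import json

ALLOWED_LABELS = {"refund", "billing", "other"}   # hypothetical label set
REQUIRED_FIELDS = {"label", "confidence"}         # hypothetical output shape


def schema_errors(raw_output: str) -> list[str]:
    """Return a list of schema problems; an empty list means the output passes."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    errors = []
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if "label" in data:
        label = data["label"]
        if not isinstance(label, str) or label not in ALLOWED_LABELS:
            errors.append(f"label {label!r} not in allowed set")
    return errors


print(schema_errors('{"label": "refund", "confidence": 0.9}'))
print(schema_errors('{"label": "pizza"}'))
```

Because this layer is cheap, it can run before any scoring and fail fast on malformed output.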

This last step is where most teams either thrive or fail. Comparing absolute scores invites flakiness; comparing per-case diffs against the same baseline gives you stable, actionable signal: "this PR regressed on cases 7, 12, and 18; here are the new outputs."
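
A per-case diff can be sketched as follows, assuming both prompts were scored on the same case IDs with scores between 0 and 1. The 0.05 threshold and the choice to treat a missing case as a score of 0 are assumptions, not a standard.

```python
def regressed_cases(baseline: dict[str, float], candidate: dict[str, float],
                    threshold: float = 0.05) -> dict[str, float]:
    """Map case ID to score drop for every case where the candidate prompt
    scores worse than the baseline by more than `threshold`."""
    regressions = {}
    for case_id, base_score in baseline.items():
        cand_score = candidate.get(case_id, 0.0)  # a missing case counts as a failure
        drop = base_score - cand_score
        if drop > threshold:
            regressions[case_id] = drop
    return regressions


# Hypothetical scores for the production prompt and the PR's prompt.
baseline = {"case-03": 0.5, "case-07": 1.0, "case-12": 1.0, "case-18": 0.8}
candidate = {"case-03": 0.9, "case-07": 0.0, "case-12": 0.5, "case-18": 0.5}

for case_id in sorted(regressed_cases(baseline, candidate)):
    print(f"regressed: {case_id}")
```

Note that case-03 improved, which would raise an aggregate score and could mask the three regressions if only averages were compared.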

Versioning the prompts themselves

Three practices that compound:

Prompts live in code. Not in a database, not in a config service that nobody version-controls. Same repo as the application, same review process as the application.
Prompts have IDs and versions. When you change a prompt, the ID stays, the version bumps. Logs and traces include both. Rollback is "revert to version N-1," not "remember what we changed yesterday."
Eval scores are stored against prompt version. Every CI run produces a record: prompt version, model version, eval set version, scores. Drift over time becomes a chart.
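
The record produced by each CI run might look like the sketch below. The field names, the example prompt ID, and the model version string are hypothetical; hosted tools store equivalent data in their own formats.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class EvalRecord:
    prompt_id: str            # stays fixed across edits to the same prompt
    prompt_version: int       # bumps on every change
    model_version: str
    eval_set_version: str
    scores: dict[str, float]  # per-case scores, keyed by case ID


def to_jsonl_line(record: EvalRecord) -> str:
    """Serialize one record as a single JSON line, suitable for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


record = EvalRecord(
    prompt_id="support-triage",              # hypothetical prompt ID
    prompt_version=3,
    model_version="example-model-2026-01",   # hypothetical pinned model
    eval_set_version="golden-v5",
    scores={"case-07": 1.0, "case-12": 0.5},
)
print(to_jsonl_line(record))
```

Appending one such line per run gives you the raw data for a score-over-time chart, and rollback to version N-1 becomes a lookup instead of a memory exercise.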

The eval frameworks listed above support this differently. Promptfoo treats your repo as the source of truth and CI as the runner. Braintrust and LangSmith add hosted UIs for browsing scores over time. Inspect adds governance-flavored reporting. Pick what matches your team's habits.

"The output changed but it is fine"

The single most common bug-report pattern in a regression-test setup is: the test failed, but the new output is actually fine — just worded differently. Two paths to handle it without losing the protection:

Tighten the rubric. If the test was scoring exact string equality on something where wording does not matter, replace it with a rubric that scores the structural property you actually cared about. Then the test does not fail on benign wording change.
Use LLM-as-judge for open-ended cases. A judge prompt that scores "does this output give the same recommendation as the expected one" tolerates style drift; an exact-match assertion does not. (See llm-as-judge-explained.)
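
The judge approach can be sketched without committing to any provider by passing the model call in as a function. Everything here is an assumption for illustration: the template wording, the YES/NO protocol, and the `ask_judge` callable, which in practice would wrap your own pinned judge model.

```python
from typing import Callable

JUDGE_TEMPLATE = """You are grading an assistant's answer.
Expected recommendation: {expected}
Actual output: {actual}
Does the actual output give the same recommendation as the expected one?
Answer with exactly one word: YES or NO."""


def same_recommendation(expected: str, actual: str,
                        ask_judge: Callable[[str], str]) -> bool:
    """Ask a judge model whether two outputs agree in substance.

    `ask_judge` takes the full judge prompt and returns the judge's raw reply.
    """
    verdict = ask_judge(JUDGE_TEMPLATE.format(expected=expected, actual=actual))
    return verdict.strip().upper().startswith("YES")


# Stub judge for local testing; a real one would call a model API.
def fake_judge(prompt: str) -> str:
    return "YES"


print(same_recommendation("Offer a full refund.",
                          "We should refund the customer in full.",
                          fake_judge))
```

Keeping the judge behind a plain callable also makes the scoring logic testable in CI without network access.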

The discipline: when you find yourself manually approving a "fine" failure, fix the rubric in the same PR. If you do not, the eval set decays and people start ignoring it.

What this gets you

A working prompt-regression setup gives you three things:

Confidence to ship. You know the change does not break the cases you care about because the test said so.
A history. When a user reports something weird, you have a trace and can add the case to the golden set.
A culture. Once "evals must pass to merge" is the norm, prompts stop being a thing that mysteriously changes overnight. They become a thing the team owns.

Most production-grade LLM teams in 2026 have something like this, even when the team is small. The cost of setting it up is about one engineer-week. The cost of not having it accumulates over the entire life of the application.

Comments 2

u/ci_dev · 1 month ago

prompt regression testing quietly saved us after a model update changed our outputs without warning. cannot recommend setting it up early enough

u/newtoevals · 1 month ago

set this up after a nasty scare in prod. honestly did a quick NerdSip primer on testing concepts first since i came from design not engineering, kept me from drowning in the jargon
