Prompts regress. A change that looks like a clarity tweak quietly makes the model worse on cases nobody thought to check. A model update on the backend silently shifts behavior on a fraction of inputs. A new "improvement" to the system prompt makes the assistant more helpful on the obvious cases and weirder on the edge cases.
Without a regression-testing setup, you find out from users. With one, you find out in CI, before the change merges.
Treat prompts like code
The pattern is genuinely the same as code regression testing:
- Define a contract — a curated set of inputs with documented expected behaviors. Often called a "golden set."
- Test that contract on every change — run the eval set in CI for any PR that touches prompts, models, or related infrastructure.
- Block merges on regressions — failing evals stop the change from shipping.
- Version the contract itself — when behavior genuinely should change, update the eval set in the same PR as the prompt change, with a justification.
Tools like Promptfoo, Braintrust, LangSmith, and Inspect all support this directly. The specifics differ; the pattern is identical.
The golden set, in practice
A useful golden set has three qualities:
- Real. Inputs taken from past user messages, past tickets, past transcripts. Not synthetic, not generated by an LLM, not edge cases you imagined. Things your users actually sent.
- Distribution-aware. Match the mix of input types you actually see. If 20% of real traffic is short-and-friendly, roughly 20% of your eval set should be short-and-friendly.
- Tight. 20–50 inputs is usually right. Fewer than that is too small to be meaningful; more than that takes too long to run on every PR and accumulates noisy cases nobody investigates.
For each input, document the expected behavior, not the expected exact output. "Must classify as 'refund'" is a useful expectation. "Must equal 'refund: yes'" is brittle and treats every wording change as a regression. The rubric is more important than the inputs.
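As a rough illustration of "expected behavior, not expected exact output," a golden-set entry can pair a human-readable expectation with a small check function. This is a minimal sketch, not any framework's format: `GoldenCase`, `classifies_as`, the `"<label>: <free text>"` output convention, and the sample input are all hypothetical (a real set would use messages your users actually sent).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    user_input: str
    expectation: str              # human-readable rubric line
    check: Callable[[str], bool]  # programmatic version of the rubric


def classifies_as(label: str) -> Callable[[str], bool]:
    """Build a check that passes when the output's label equals `label`."""
    def check(output: str) -> bool:
        # Assumed output convention: "<label>: <free text>" or just "<label>".
        predicted = output.split(":", 1)[0].strip().lower()
        return predicted == label
    return check


GOLDEN_SET = [
    GoldenCase(
        case_id="refund-001",
        # Placeholder input for illustration only.
        user_input="I was charged twice, can I get my money back?",
        expectation="Must classify as 'refund'",
        check=classifies_as("refund"),
    ),
]

case = GOLDEN_SET[0]
print(case.check("refund: yes"))                          # True
print(case.check("Refund: customer was double-charged"))  # True
print(case.check("billing: yes"))                         # False
```

The second output is worded differently from the first but still passes, which is the point: only a change in the label counts as a regression.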
What to run on every PR
In a typical setup, the CI pipeline runs three layers on every prompt-touching PR:
- Schema checks — fastest, cheapest. Does the output match the expected shape? (Classification labels in the allowed set; extraction JSON validates against the schema; structured output fields are present.)
- Golden-set scoring — run the curated inputs through the candidate prompt; score each against the rubric. Often a mix of programmatic checks (string contains, regex match) and LLM-as-judge for open-ended bits.
- Diff against baseline — compare the candidate prompt's scores to the current production prompt's scores on the same set. Block when the candidate is worse by more than a small threshold.
This last step is where most teams succeed or fail. Comparing absolute scores invites flakiness; comparing per-case diffs against the same baseline gives you a stable, actionable signal: "this PR regressed on cases 7, 12, and 18; here are the new outputs."
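The per-case comparison described above can be sketched in a few lines. This assumes each run produces a mapping from case ID to a score between 0 and 1; the function name `find_regressions`, the 0.05 tolerance, and the sample scores are illustrative choices, not values from any particular tool.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.05,
) -> List[str]:
    """Return IDs of cases where the candidate is worse than baseline by more than `tolerance`."""
    regressed = []
    for case_id, base_score in baseline.items():
        # Assumption: a case missing from the candidate run is scored as 0.0.
        cand_score = candidate.get(case_id, 0.0)
        if base_score - cand_score > tolerance:
            regressed.append(case_id)
    return sorted(regressed)


# Made-up scores for the production prompt and the PR's prompt on the same set.
baseline_scores = {"case-07": 1.0, "case-12": 1.0, "case-18": 0.8, "case-21": 0.5}
candidate_scores = {"case-07": 0.0, "case-12": 0.5, "case-18": 0.6, "case-21": 1.0}

regressions = find_regressions(baseline_scores, candidate_scores)
print(regressions)  # ['case-07', 'case-12', 'case-18']

# A CI job would typically exit non-zero when the list is non-empty.
exit_code = 1 if regressions else 0
```

Note that `case-21` improved and is not reported; the gate only cares about cases that got worse, and the list of IDs tells the reviewer exactly which outputs to inspect.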
Versioning the prompts themselves
Three practices that compound:
- Prompts live in code. Not in a database, not in a config service that nobody version-controls. Same repo as the application, same review process as the application.
- Prompts have IDs and versions. When you change a prompt, the ID stays, the version bumps. Logs and traces include both. Rollback is "revert to version N-1," not "remember what we changed yesterday."
- Eval scores are stored against prompt version. Every CI run produces a record: prompt version, model version, eval set version, scores. Drift over time becomes a chart.
The eval frameworks listed above support this differently. Promptfoo treats your repo as the source of truth and CI as the runner. Braintrust and LangSmith add hosted UIs for browsing scores over time. Inspect adds governance-flavored reporting. Pick what matches your team's habits.
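Independent of the tool, the record each CI run produces can be quite small. The sketch below is one possible shape, assuming a JSON-lines log; the `EvalRun` class, its field names, and every value shown are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass(frozen=True)
class EvalRun:
    prompt_id: str            # stays fixed across edits to the prompt
    prompt_version: int       # bumps on every change
    model_version: str
    eval_set_version: str
    scores: Dict[str, float]  # case ID -> score

    def to_json_line(self) -> str:
        """Serialize the run as one JSON line, suitable for appending to a log."""
        return json.dumps(asdict(self), sort_keys=True)


run = EvalRun(
    prompt_id="support-triage",
    prompt_version=7,
    model_version="example-model-2026-01",  # placeholder, not a real model name
    eval_set_version="golden-v3",
    scores={"refund-001": 1.0, "refund-002": 0.5},
)

line = run.to_json_line()
record = json.loads(line)
print(record["prompt_id"], record["prompt_version"])  # support-triage 7
```

With records like this accumulated per run, "revert to version N-1" is a lookup, and plotting a score by `prompt_version` or `model_version` gives the drift chart.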
"The output changed but it is fine"
The single most common bug-report pattern in a regression-test setup is: the test failed, but the new output is actually fine — just worded differently. Two paths to handle it without losing the protection:
- Tighten the rubric. If the test was scoring exact string equality on something where wording does not matter, replace it with a rubric that scores the structural property you actually cared about. Then the test does not fail on benign wording changes.
- Use LLM-as-judge for open-ended cases. A judge prompt that scores "does this output give the same recommendation as the expected one" tolerates style drift; an exact-match assertion does not. (See llm-as-judge-explained.)
The discipline: when you find yourself manually approving a "fine" failure, fix the rubric in the same PR. If you do not, the eval set decays and people start ignoring it.
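To make "tighten the rubric" concrete, here is a minimal sketch contrasting an exact-match assertion with a check on the one structural property that matters. It assumes the model returns JSON with an `action` field; that output shape, the function names, and the sample outputs are invented for illustration.

```python
import json


def exact_match(output: str, expected: str) -> bool:
    """Brittle: any change in wording counts as a failure."""
    return output == expected


def same_action(output: str, expected_action: str) -> bool:
    """Pass when the parsed `action` field matches; all other wording is ignored."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False  # unparseable output is still a failure
    return isinstance(parsed, dict) and parsed.get("action") == expected_action


old_output = '{"action": "refund", "reason": "Customer was charged twice."}'
new_output = '{"action": "refund", "reason": "Duplicate charge on the account."}'
changed_decision = '{"action": "escalate", "reason": "Unclear."}'

print(exact_match(new_output, old_output))      # False
print(same_action(new_output, "refund"))        # True
print(same_action(changed_decision, "refund"))  # False
```

The reworded reason no longer fails the test, but a changed decision still does, so the protection is kept while the noise goes away.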
What this gets you
A working prompt-regression setup gives you three things:
- Confidence to ship. You know the change does not break the cases you care about because the test said so.
- A history. When a user reports something weird, you have a trace and can add the case to the golden set.
- A culture. Once "evals must pass to merge" is the norm, prompts stop being a thing that mysteriously changes overnight. They become a thing the team owns.
Most production-grade LLM teams in 2026 have something like this, even when the team is small. The cost of setting it up is roughly one engineer-week. The cost of not having it accumulates over the entire life of the application.