Most prompt iteration goes like this: change the prompt, run it once, eyeball the output, decide. That process is almost always wrong. A sample size of one cannot distinguish "this is better" from "this happened to look better this time."
A proper prompt eval takes about half an hour and ends with an answer you can defend.
Step 1: build the eval set
Collect 10–50 representative inputs. "Representative" means:
- Real inputs if you have them. Past customer messages, past meeting transcripts, past code reviews.
- Distribution-aware. If 20% of your real traffic is short and 5% is multilingual, mirror that.
- Edge cases included. Empty inputs, garbage inputs, inputs in the wrong format. The model's behavior on edge cases often matters more than its behavior on the easy ones.
Freeze the eval set. Same inputs, every time. If you change the eval set, you have to re-run all candidates.
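If you have no tooling, a frozen eval set can just be a JSONL file checked into the repo and loaded the same way every run. A minimal sketch; the field names (`id`, `input`, `expected`, `tag`) are illustrative, not a standard:

```python
# eval_set.jsonl -- one frozen example per line, e.g.:
# {"id": "001", "input": "Reset my password please", "expected": "account_access", "tag": "short"}
# {"id": "002", "input": "", "expected": "needs_clarification", "tag": "edge_empty"}

import json

def load_eval_set(path="eval_set.jsonl"):
    """Load the frozen eval set. Never edit it between candidate runs."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```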
Step 2: define the rubric
Be specific about what "better" means. Three common rubric shapes:
- Binary: did it produce the right answer? Best for classification, extraction, simple QA. Cheap to score, mechanical.
- 4-point: how well did it produce the answer? 0 = wrong/broken, 1 = mostly wrong, 2 = mostly right, 3 = right. Best for open-ended tasks.
- Multi-dimensional: score each dimension separately (faithfulness, structure, voice, hard errors). Best when one dimension can be great while another is terrible.
Write the rubric down before you run anything. Otherwise you will subconsciously score in favor of the prompt you wrote last.
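One way to force the rubric to exist before any outputs do is to write it down as code. A small sketch: the 0–3 labels come from the list above, and the binary scorer assumes an exact-match task; swap in your own comparison for anything fuzzier.

```python
# Pin the rubric down before running anything. Wording is illustrative.
FOUR_POINT = {
    0: "wrong or broken output",
    1: "mostly wrong",
    2: "mostly right",
    3: "right",
}

def score_binary(output: str, expected: str) -> int:
    """Binary rubric: 1 if the normalized output matches exactly, else 0."""
    return int(output.strip().lower() == expected.strip().lower())
```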
Step 3: run both candidates
Fix everything you can:
- Same model and version.
- Temperature 0 (or fix a seed and run k samples).
- Same system prompt baseline.
- Same input order.
- Same parser if you are checking format.
If you want to test sampling variance too, run each candidate at 3–5 different seeds and look at the worst-case score, not the best-case.
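A minimal runner under those constraints might look like the sketch below. `call_model` is a hypothetical placeholder for whatever client you actually use; the point is that model version, temperature, and input order are identical for both candidates.

```python
# Runner sketch. call_model() stands in for your real API client.
PROMPT_A = "You are a support-ticket classifier. ..."   # candidate prompts
PROMPT_B = "Classify the ticket into exactly one category. ..."

def call_model(system_prompt: str, user_input: str) -> str:
    """Placeholder: wire this to your model client with a pinned model
    version and temperature=0 so both candidates run under identical settings."""
    raise NotImplementedError

def run_candidate(system_prompt: str, eval_set: list[dict]) -> list[str]:
    """Run one prompt candidate over the frozen eval set, in a fixed order."""
    return [call_model(system_prompt, example["input"]) for example in eval_set]
```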
Step 4: score and compare
For binary tasks, count: prompt_A gets 38/50 correct, prompt_B gets 41/50. That difference might or might not be real. A binomial confidence interval (or a quick Fisher exact test) tells you whether the gap is significant at your sample size. At 50 examples, a 3-point gap is comfortably inside the noise; you need a bigger gap, or a bigger eval set, before you can call it a win.
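Checking the 38/50 vs 41/50 example takes a few lines, assuming SciPy is available:

```python
from scipy.stats import fisher_exact

n = 50
correct_a, correct_b = 38, 41
table = [[correct_a, n - correct_a],
         [correct_b, n - correct_b]]

_, p_value = fisher_exact(table)
print(f"Fisher exact p-value: {p_value:.2f}")  # ~0.6, far above 0.05: the gap is noise at n=50
```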
For 4-point tasks, average the scores and look at the per-example diff. A new prompt that lifts the average but drops one specific case from "right" to "wrong" might be a regression you do not want.
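Once you have both score lists, the per-example diff is a few lines. The scores below are hypothetical; note that B wins on average while losing one case outright:

```python
# Hypothetical per-example scores on the 0-3 rubric, same eval set order.
scores_a = [3, 2, 2, 1, 3]
scores_b = [3, 3, 3, 3, 0]   # higher average, but the last example regressed to 0

print(f"mean A = {sum(scores_a)/len(scores_a):.2f}, "
      f"mean B = {sum(scores_b)/len(scores_b):.2f}")

for i, (a, b) in enumerate(zip(scores_a, scores_b)):
    if b < a:
        print(f"example {i}: regressed {a} -> {b}")
```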
Step 5: write it down
The single most useful artifact from prompt iteration is a tiny log: which prompt variant, on which eval set, on which model, got which score, on what date. Six iterations later, it saves you from re-discovering a variant you have already tried.
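The log needs no tooling beyond a file you append to. A sketch with illustrative field names:

```python
import csv
import datetime

def log_run(prompt_name, eval_set_name, model, score, path="prompt_eval_log.csv"):
    """Append one row per eval run: date, prompt variant, eval set, model, score."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.date.today().isoformat(),
            prompt_name, eval_set_name, model, score,
        ])

# Example: log_run("summarizer_v3", "support_tickets_v1", "pinned-model-version", "41/50")
```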
The whole loop should take 30 minutes for a small eval. That is a tenth of the time most people spend "iterating" by vibes, and it produces actual answers.