Most prompt iteration goes like this: change the prompt, run it once, eyeball the output, decide. That process is almost always wrong. A sample size of one cannot distinguish "this is better" from "this happened to look better this time."
A proper prompt eval takes about half an hour and ends with an answer you can defend.
Step 1: build the eval set
Collect 10–50 representative inputs. "Representative" means:
- Real inputs if you have them. Past customer messages, past meeting transcripts, past code reviews.
- Distribution-aware. If 20% of your real traffic is short and 5% is multilingual, mirror that.
- Edge cases included. Empty inputs, garbage inputs, inputs in the wrong format. The model's behavior on edge cases often matters more than its behavior on the easy ones.
Freeze the eval set. Same inputs, every time. If you change the eval set, you have to re-run all candidates.
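If you have no tooling, a frozen eval set can just be a JSONL file checked into the repo and loaded the same way every run. A minimal sketch; the field names (`id`, `input`, `expected`, `tag`) are illustrative, not a standard:

```python
# eval_set.jsonl -- one frozen example per line, e.g.:
# {"id": "001", "input": "Reset my password please", "expected": "account_access", "tag": "short"}
# {"id": "002", "input": "", "expected": "needs_clarification", "tag": "edge_empty"}

import json

def load_eval_set(path="eval_set.jsonl"):
    """Load the frozen eval set. Never edit it between candidate runs."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```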
Step 2: define the rubric
Be specific about what "better" means. Three common rubric shapes:
- Binary: did it produce the right answer? Best for classification, extraction, simple QA. Cheap to score, mechanical.
- 4-point: how well did it produce the answer? 0 = wrong/broken, 1 = mostly wrong, 2 = mostly right, 3 = right. Best for open-ended tasks.
- Multi-dimensional: score each dimension separately (faithfulness, structure, voice, hard errors). Best when one dimension can be great while another is terrible.
Write the rubric down before you run anything. Otherwise you will subconsciously score in favor of the prompt you wrote last.
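One way to force the rubric to exist before any outputs do is to write it down as code. A small sketch: the 0–3 labels come from the list above, and the binary scorer assumes an exact-match task; swap in your own comparison for anything fuzzier.

```python
# Pin the rubric down before running anything. Wording is illustrative.
FOUR_POINT = {
    0: "wrong or broken output",
    1: "mostly wrong",
    2: "mostly right",
    3: "right",
}

def score_binary(output: str, expected: str) -> int:
    """Binary rubric: 1 if the normalized output matches exactly, else 0."""
    return int(output.strip().lower() == expected.strip().lower())
```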
Step 3: run both candidates
Fix everything you can:
- Same model and version.
- Temperature 0 (or fix a seed and run k samples).
- Same system prompt baseline.
- Same input order.
- Same parser if you are checking format.
If you want to test sampling variance too, run each candidate at 3–5 different seeds and look at the worst-case score, not the best-case.
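A minimal runner under those constraints might look like the sketch below. `call_model` is a hypothetical placeholder for whatever client you actually use; the point is that model version, temperature, and input order are identical for both candidates.

```python
# Runner sketch. call_model() stands in for your real API client.
PROMPT_A = "You are a support-ticket classifier. ..."   # candidate prompts
PROMPT_B = "Classify the ticket into exactly one category. ..."

def call_model(system_prompt: str, user_input: str) -> str:
    """Placeholder: wire this to your model client with a pinned model
    version and temperature=0 so both candidates run under identical settings."""
    raise NotImplementedError

def run_candidate(system_prompt: str, eval_set: list[dict]) -> list[str]:
    """Run one prompt candidate over the frozen eval set, in a fixed order."""
    return [call_model(system_prompt, example["input"]) for example in eval_set]
```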
Step 4: score and compare
For binary tasks, count: prompt_A gets 38/50 correct, prompt_B gets 41/50. That difference might or might not be real. A binomial confidence interval (or a quick Fisher exact test) tells you whether the gap is significant at your sample size. At 50 examples, a 3-point gap is comfortably inside the noise; you need a bigger gap, or a bigger eval set, before you can call it a win.
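Checking the 38/50 vs 41/50 example takes a few lines, assuming SciPy is available:

```python
from scipy.stats import fisher_exact

n = 50
correct_a, correct_b = 38, 41
table = [[correct_a, n - correct_a],
         [correct_b, n - correct_b]]

_, p_value = fisher_exact(table)
print(f"Fisher exact p-value: {p_value:.2f}")  # ~0.6, far above 0.05: the gap is noise at n=50
```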
For 4-point tasks, average the scores and look at the per-example diff. A new prompt that lifts the average but drops one specific case from "right" to "wrong" might be a regression you do not want.
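Once you have both score lists, the per-example diff is a few lines. The scores below are hypothetical; note that B wins on average while losing one case outright:

```python
# Hypothetical per-example scores on the 0-3 rubric, same eval set order.
scores_a = [3, 2, 2, 1, 3]
scores_b = [3, 3, 3, 3, 0]   # higher average, but the last example regressed to 0

print(f"mean A = {sum(scores_a)/len(scores_a):.2f}, "
      f"mean B = {sum(scores_b)/len(scores_b):.2f}")

for i, (a, b) in enumerate(zip(scores_a, scores_b)):
    if b < a:
        print(f"example {i}: regressed {a} -> {b}")
```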
Step 5: write it down
The single most useful artifact from prompt iteration is a tiny log: which prompt variant, on which eval set, on which model, got which score, on what date. Six iterations later, it saves you from re-discovering a variant you have already tried.
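The log needs no tooling beyond a file you append to. A sketch with illustrative field names:

```python
import csv
import datetime

def log_run(prompt_name, eval_set_name, model, score, path="prompt_eval_log.csv"):
    """Append one row per eval run: date, prompt variant, eval set, model, score."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.date.today().isoformat(),
            prompt_name, eval_set_name, model, score,
        ])

# Example: log_run("summarizer_v3", "support_tickets_v1", "pinned-model-version", "41/50")
```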
The whole loop should take 30 minutes for a small eval. That is a tenth of the time most people spend "iterating" by vibes, and it produces actual answers.