How biased is LLM-as-judge in practice?

Biased enough to matter, but mitigatable. Zheng et al. (2023) and follow-up work documented several systematic biases: position bias (the judge favors the first option), length bias (the judge prefers longer answers regardless of quality), self-preference (a model prefers outputs from the same family), and hedging bias (the judge over-rewards cautious-sounding answers). Untreated, these can flip A/B test conclusions. Treated, they shrink to small constant offsets.

Pairwise comparison or Likert scoring?

Pairwise (A vs B) tends to be more reliable than absolute scoring on a 1–5 scale, especially when comparing prompt variants. The judge has to discriminate, not anchor to a number it has no calibration for. Pairwise with randomized order is the default move; reach for Likert when you genuinely need an absolute score (e.g., for a quality dashboard).

Should the judge be the same model as the candidate?

Avoid it when you can. Self-preference bias is well-documented: a model gives slightly higher scores to outputs from the same family. Using a stronger judge model (a frontier model judging a smaller production model) is a common pattern. Alternatively, ensemble two different judges from different families and require agreement.

LLM-as-Judge, Explained

LLM-as-judge has gone from a research idea to standard infrastructure in three years. By 2026 it is the default automation layer for scoring open-ended LLM outputs at scale — chatbots, summarization, RAG answers, agent traces, anything where a binary pass/fail is too crude. The technique works. It also has known biases. Both halves of that sentence matter.

The basic pattern

In its simplest form:

The task model receives input x and produces output y.
The judge model receives x, y, optionally a reference answer r, and a rubric.
The judge produces a verdict: a score, a label, a pairwise winner, plus (usually) a short rationale.

What the rubric looks like depends on the task. A common rubric for summarization, for example, scores faithfulness (does the summary match the source), structure (does it follow the requested shape), and hard errors (count of factual mistakes). The judge applies this rubric across hundreds or thousands of outputs in the time a human takes to read ten.
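The pattern above can be sketched in a few lines. This is a minimal illustration, not any framework's API: `Verdict`, `build_judge_prompt`, `judge`, and the `call_model` callable are hypothetical names, and `call_model` stands in for whatever client you use to reach the judge model.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    """Hypothetical verdict shape: a score plus a short rationale."""
    score: int
    rationale: str


def build_judge_prompt(x: str, y: str, rubric: str, r: Optional[str] = None) -> str:
    """Assemble the judge's input from task input x, output y, the rubric,
    and an optional reference answer r."""
    parts = [
        f"Rubric:\n{rubric}",
        f"Input:\n{x}",
        f"Output to evaluate:\n{y}",
    ]
    if r is not None:
        parts.append(f"Reference answer:\n{r}")
    parts.append("Give a short rationale, then a score.")
    return "\n\n".join(parts)


def judge(x: str, y: str, rubric: str,
          call_model: Callable[[str], Verdict],
          r: Optional[str] = None) -> Verdict:
    """Run one judgment. call_model is an assumed wrapper around your
    judge model that turns a prompt into a Verdict."""
    return call_model(build_judge_prompt(x, y, rubric, r))
```

Keeping prompt assembly separate from the model call makes it easy to log the exact text the judge saw when a score looks wrong.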

Frameworks like Arize Phoenix, Patronus, and LangSmith all implement variations of the pattern. The terminology is fairly standardized at this point.

The biases (and what to do about them)

The foundational work here is Zheng et al., 2023, "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (arXiv:2306.05685). It documented several systematic biases that the follow-up literature has confirmed and extended:

Position bias. When comparing two answers (A vs B), judges tend to prefer whichever appears first. The effect is usually a few percentage points but is large enough to flip close A/B test outcomes. Mitigation: randomize order; run each pairing twice with swapped positions; only count consistent wins.
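The swap-and-compare mitigation can be sketched as follows. The judge signature here is an assumption for illustration: a callable that takes the prompt and two answers in display order and returns "first" or "second".

```python
from typing import Callable

# Hypothetical judge signature: (prompt, answer_shown_first, answer_shown_second)
# -> "first" or "second".
PairwiseJudge = Callable[[str, str, str], str]


def consistent_winner(prompt: str, a: str, b: str, judge: PairwiseJudge) -> str:
    """Judge the pair in both orders and return "A", "B", or "inconsistent"."""
    first_pass = judge(prompt, a, b)   # A is shown first
    second_pass = judge(prompt, b, a)  # B is shown first
    for verdict in (first_pass, second_pass):
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected judge verdict: {verdict!r}")

    a_wins_first_pass = first_pass == "first"
    a_wins_second_pass = second_pass == "second"
    if a_wins_first_pass and a_wins_second_pass:
        return "A"
    if not a_wins_first_pass and not a_wins_second_pass:
        return "B"
    # The verdict flipped with the ordering, so position decided it.
    return "inconsistent"
```

A judge that always picks whichever answer comes first will return "inconsistent" on every pair, which is exactly the signal you want. The share of inconsistent pairs is also a rough measure of how strong the position bias is on your data.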

Length bias. Judges prefer longer answers, regardless of quality. Models that produce padded, verbose output get higher scores than terse correct ones. Mitigation: include length as an explicit dimension in the rubric and ask the judge to score brevity favorably; normalize for length post-hoc; check that "best" outputs are not consistently longer than the rest.
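One cheap way to run the last check is to compare the average length of the top-scored outputs with the rest. This is a rough sketch under stated assumptions: `length_gap` is a hypothetical helper, length is counted in whitespace-separated words, and the 25% cutoff is arbitrary.

```python
from statistics import mean
from typing import Sequence


def length_gap(outputs: Sequence[str], scores: Sequence[float],
               top_fraction: float = 0.25) -> float:
    """Ratio of mean word count in the top-scored outputs to mean word
    count in the remaining outputs. Values well above 1.0 suggest the
    judge may be rewarding length."""
    if len(outputs) != len(scores):
        raise ValueError("outputs and scores must have the same length")
    ranked = sorted(zip(scores, outputs), key=lambda pair: pair[0], reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    top = [len(text.split()) for _, text in ranked[:k]]
    rest = [len(text.split()) for _, text in ranked[k:]]
    if not rest or mean(rest) == 0:
        raise ValueError("need non-empty outputs outside the top group")
    return mean(top) / mean(rest)
```

A ratio near 1.0 does not prove the judge is length-neutral, and a high ratio does not prove bias (longer answers can be better). Treat it as a prompt to read a sample by hand.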

Self-preference / family bias. A model gives higher scores to outputs that came from the same family. GPT judges score GPT outputs higher; Claude judges score Claude outputs higher. The effect is real but usually small. Mitigation: use a judge from a different family than the candidate, or ensemble two judges and require agreement.
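The "require agreement" variant is simple to express. In this sketch, `ensemble_verdict` is a hypothetical helper and each judge is assumed to be a callable, backed by a different model family, that returns "A" or "B" for a pair.

```python
from typing import Callable, Sequence

# Hypothetical judge signature: (prompt, answer_a, answer_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def ensemble_verdict(prompt: str, a: str, b: str, judges: Sequence[Judge]) -> str:
    """Return the shared verdict if every judge agrees, else "no_agreement"."""
    if not judges:
        raise ValueError("at least one judge is required")
    votes = {judge(prompt, a, b) for judge in judges}
    return votes.pop() if len(votes) == 1 else "no_agreement"
```

Pairs that come back as "no_agreement" are good candidates for human review rather than something to discard silently.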

Verbosity / hedging bias. Judges over-reward answers that hedge heavily and under-reward direct answers, sometimes scoring confident correct answers below cautious wrong ones. Mitigation: explicit rubric items about directness and confidence; calibrate against human scoring.

Concreteness illusion. Judges sometimes reward answers that sound specific (numbers, names, citations) even when those specifics are fabricated. Mitigation: combine LLM-judge with cheap factual verifications (does the cited paper exist? does the number match the source?) where you can.
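As one example of a cheap check, the sketch below flags numbers in an answer that never appear in the source. `unsupported_numbers` is a hypothetical helper, and the matching is deliberately naive: it compares digit strings exactly, so "1,000" versus "1000" or "12%" versus "twelve percent" will not line up.

```python
import re
from typing import List

# Matches integers and simple decimals such as "12" or "4.5".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(answer: str, source: str) -> List[str]:
    """Return numbers that appear in the answer but nowhere in the source."""
    source_numbers = set(NUMBER.findall(source))
    return [n for n in NUMBER.findall(answer) if n not in source_numbers]
```

For instance, with the source "Revenue reached 4.5 million in 2023." and the answer "Revenue grew 12% to 4.5 million in 2023.", the helper returns `["12"]`. A non-empty result does not prove fabrication, only that the figure deserves a second look.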

Pairwise vs Likert

Two common scoring shapes:

Likert / absolute. The judge assigns a score (1–5, 1–7) on each rubric dimension.
Pairwise / preference. The judge picks which of two outputs is better, optionally with a margin.

Pairwise tends to be more reliable, for a simple reason: the judge has to discriminate, not anchor to a numerical scale it has no calibration for. Pairwise with randomized order is the default move for comparing prompt variants or model candidates.

Reach for Likert when you need absolute scores for a quality dashboard, or when you need to rank N>2 candidates and running all N(N-1)/2 pairwise comparisons is too expensive. Calibrate any Likert system against a human-scored subset before you trust the numbers.

Designing the judge prompt

The judge prompt is itself a piece of prompt engineering and benefits from the same discipline:

Make the rubric explicit and structured. "Score on faithfulness (0–3), structure (0–3), hard errors (count)." Vague rubrics produce vague scores.
Ask for the rationale before the score. Standard chain-of-thought logic applies; the rationale forces the judge to engage with the rubric, and you get something to debug when scores look wrong.
Constrain the output format. JSON with explicit fields beats free-form prose. Strict schema validation catches malformed verdicts.
Show the judge the inputs and the outputs, not just the outputs. Faithfulness scoring is impossible without the source.
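Strict validation of the judge's output might look like the sketch below. The field names mirror the example rubric above, but the schema itself and the `parse_verdict` helper are assumptions for illustration. Listing the rationale first in the requested JSON is one way to nudge the judge to write it before the scores.

```python
import json

# Assumed schema, based on the example rubric: rationale first, then scores.
EXPECTED_FIELDS = {
    "rationale": str,
    "faithfulness": int,   # 0-3
    "structure": int,      # 0-3
    "hard_errors": int,    # count, >= 0
}


def parse_verdict(raw: str) -> dict:
    """Parse a judge response and raise ValueError on anything malformed."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("verdict must be a JSON object")
    if set(data) != set(EXPECTED_FIELDS):
        raise ValueError(f"unexpected fields: {sorted(data)}")
    for field, expected_type in EXPECTED_FIELDS.items():
        value = data[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"{field} has the wrong type")
    for field in ("faithfulness", "structure"):
        if not 0 <= data[field] <= 3:
            raise ValueError(f"{field} must be between 0 and 3")
    if data["hard_errors"] < 0:
        raise ValueError("hard_errors must be non-negative")
    return data
```

Counting how often `parse_verdict` raises is a useful health metric in its own right: a rising rate of malformed verdicts often shows up before score drift does.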

When LLM-as-judge is the right tool

It works well for:

Regression detection — has my new prompt made things worse? (Pairwise comparison vs. baseline.)
A/B testing — which of two prompt variants is better on a held-out set?
Continuous quality monitoring — is the production system drifting?
Triage — flag the bottom 5% of outputs for human review.
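The triage case is the easiest to automate. A minimal sketch, assuming each output already has a numeric judge score and an identifier; `flag_for_review` is a hypothetical helper, and rounding up so that at least one item is always flagged is a design choice, not a rule.

```python
import math
from typing import Hashable, List, Sequence, Tuple


def flag_for_review(scored: Sequence[Tuple[Hashable, float]],
                    fraction: float = 0.05) -> List[Hashable]:
    """Return the ids of the lowest-scoring `fraction` of outputs.

    `scored` is a sequence of (output_id, judge_score) pairs.
    """
    if not scored:
        return []
    k = max(1, math.ceil(len(scored) * fraction))
    lowest = sorted(scored, key=lambda pair: pair[1])[:k]
    return [output_id for output_id, _ in lowest]
```

With 100 scored outputs and the default fraction, this returns the five lowest-scoring ids, ready to drop into a human review queue.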

It is not the right tool for:

Final approval of high-stakes outputs. Human review still wins for anything regulatory, legal, or safety-critical.
Calibrating a model against a fixed quality bar. LLM judges drift; the same judge prompt against the same outputs can produce different numbers across model updates.
Adversarial settings. A judge can be prompt-injected through the candidate's output.

Validating your judge

The best way to trust LLM-as-judge is to keep validating it. Periodically:

Sample 50–100 recent judgments.
Have humans re-score the same outputs against the same rubric.
Measure agreement (Cohen's kappa for categorical labels, Pearson or Spearman correlation for continuous or ordinal scores).
If agreement is dropping, investigate — usually a rubric ambiguity, sometimes a judge-model update.

A judge that agrees with humans 80–90% of the time on routine outputs is doing useful work. A judge that agrees 60% of the time is just adding noise with confidence.
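Raw percent agreement can flatter a judge when one label dominates, which is why the checklist above suggests Cohen's kappa for categorical verdicts: it corrects observed agreement for the agreement expected by chance. A small self-contained sketch (the `cohens_kappa` function is written here for illustration; libraries such as scikit-learn ship their own implementation):

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Cohen's kappa between two raters' categorical labels.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected from each rater's label frequencies.
    """
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(human_counts[label] * judge_counts[label]
                   for label in human_counts) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)
```

For example, if humans label four outputs pass, pass, fail, fail and the judge labels them pass, fail, fail, fail, raw agreement is 75% but kappa is 0.5, because half of that agreement would be expected by chance.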

Comments 2

u/judge_j · 1 month ago

the length bias is so real, we had to explicitly tell the judge to ignore verbosity. the judges end up needing their own evals lol

u/h.berg · 1 month ago

Useful, but worth stressing harder that the judge model has its own biases. We caught ours consistently favoring longer answers regardless of actual quality.
