As of 2026-05-15

Most LLM comparisons you read online either repeat published benchmarks (which leak into training data) or vibe-rank models with no method. Neither is what a working reader needs. This page is the public methodology for every dated head-to-head on this site.

What we test

Each comparison is task-shaped: a small, named scenario with a fixed input distribution. Examples:

  • Code-review battery (12 prompts). Real diffs from open-source repos with a planted bug each. We score how many bugs each model finds and how many spurious comments it adds.
  • Summarization battery (10 prompts). News articles + technical write-ups + meeting transcripts. Rubric: faithfulness, key-point coverage, length adherence, hallucination count.
  • Long-context retrieval (8 prompts). Needle-in-a-haystack at 16k, 64k, 128k, 200k tokens. Measure success vs. fabricated answers.
  • Hard-reasoning battery (15 prompts). Problems with verifiable answers (logic puzzles, arithmetic chains, planning). Measure correctness, not chain-of-thought style.

Each battery lives in source control; the prompts are reproducible.
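
To make that concrete, here is a minimal sketch of how a battery could be laid out as checked-in data. It is illustrative only: the Battery and Prompt names, the JSON layout, and the save/load helpers are assumptions, not our actual harness code.

```python
# Hypothetical layout for a task battery kept in source control:
# one JSON file per battery, prompts and expected findings as plain data.
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class Prompt:
    prompt_id: str   # stable ID so raw outputs can be cross-referenced
    text: str        # exact prompt sent to every model
    expected: str    # planted bug / needle / reference answer


@dataclass
class Battery:
    name: str        # e.g. "code-review"
    prompts: list    # the fixed input distribution, checked into git

    def save(self, path: Path) -> None:
        path.write_text(json.dumps(asdict(self), indent=2))

    @staticmethod
    def load(path: Path) -> "Battery":
        data = json.loads(path.read_text())
        return Battery(name=data["name"],
                       prompts=[Prompt(**p) for p in data["prompts"]])


if __name__ == "__main__":
    battery = Battery(
        name="code-review",
        prompts=[Prompt("cr-001", "Review this diff: ...", "off-by-one in loop bound")],
    )
    battery.save(Path("code-review.json"))
    print(Battery.load(Path("code-review.json")).prompts[0].prompt_id)
```

Because the file is plain data under version control, anyone can re-run the same prompts against a new model and diff the results.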

Fixed parameters

For every model in every test:

  • Temperature: 0 unless the task explicitly tests sampling behavior.
  • Top-p: 1.
  • Random seed: fixed wherever the API exposes one (most local runtimes do; only some hosted APIs do).
  • Max tokens: identical for the task; disclosed.
  • System prompt: identical for the task; disclosed.
  • Few-shot examples: identical or none; disclosed.
  • Model version: pinned to a specific snapshot or build date. No "auto-route" endpoints.

If any of those vary, the test is invalid.
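
For illustration, a pinned run configuration might look like the sketch below. The RunConfig fields, the version strings, and the commented-out call_model wrapper are hypothetical; the point is that one frozen set of parameters is applied to every model in the run.

```python
# Sketch of a pinned, per-task run configuration (field names are assumptions).
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    temperature: float = 0.0      # 0 unless the task explicitly tests sampling
    top_p: float = 1.0
    seed: int | None = 1234       # None when the API exposes no seed
    max_tokens: int = 1024        # identical for the task, disclosed
    system_prompt: str = ""       # identical for the task, disclosed
    few_shot: tuple = ()          # identical or empty, disclosed


# Model versions are pinned snapshots, never auto-routing aliases.
PINNED_MODELS = [
    "example-model-a-2026-04-01",   # illustrative version strings
    "example-model-b-build-118",
]

CONFIG = RunConfig(system_prompt="You are a code reviewer.")

# for model in PINNED_MODELS:
#     output = call_model(model, prompt, CONFIG)   # hypothetical API wrapper
```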

Rubric for open-ended tasks

For writing and analysis tasks where there is no single right answer, we score against a four-criterion rubric:

  1. Faithfulness. Does the output actually match the source / prompt? Count hallucinations, not vibes.
  2. Structure. Does the output follow the requested shape? (Length, format, label discipline.)
  3. Voice. Does it read like what was asked for, or does it read like "AI"?
  4. Hard errors. Count of objectively wrong factual claims.

Every output is numbered, and the raw outputs are linked. You are free to disagree with our scoring — the data is public.
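
A rough sketch of what one scoring record looks like in data form, assuming a simple 0–5 scale for the first three criteria; the scale and field names are illustrative, not a fixed spec.

```python
# One record per output; scores stay per-criterion, never rolled into one number.
from dataclasses import dataclass


@dataclass
class RubricScore:
    output_id: str       # matches the numbered, publicly linked raw output
    faithfulness: int    # 0-5 (assumed scale); hallucinations counted, not vibes
    structure: int       # 0-5: length, format, label discipline
    voice: int           # 0-5: reads as requested, not as generic "AI"
    hard_errors: int     # count of objectively wrong factual claims


example = RubricScore(
    output_id="sum-007",
    faithfulness=4,
    structure=5,
    voice=3,
    hard_errors=1,
)
print(example)
```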

What we report

Every comparison post includes:

  • Models tested, with version pins.
  • Date of the run.
  • Task batteries and prompts, linked.
  • Per-task scores, with the rubric where applicable.
  • A short editorial verdict, marked as opinion.
  • Caveats — what we did not test, what surprised us, where we doubt our own scoring.

If the verdict and the numbers disagree, we say so.
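
For a sense of the shape, each post's metadata amounts to something like the dictionary below. Keys and values are placeholders, not output from a real run, and the URL is a stand-in.

```python
# Hypothetical metadata block for one comparison post; values are placeholders.
REPORT = {
    "run_date": "2026-05-15",
    "models": {
        "example-model-a": "2026-04-01",   # version pin, never an alias
        "example-model-b": "build-118",
    },
    "batteries": ["code-review", "summarization", "long-context", "hard-reasoning"],
    "prompt_links": ["https://example.com/batteries/code-review.json"],  # placeholder URL
    "scores": {"code-review": {"example-model-a": 9, "example-model-b": 7}},  # bugs found of 12
    "verdict": "opinion: model-a for review-heavy workflows",
    "caveats": ["did not test non-English inputs"],
}
```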

What we do not do

  • No leaderboards across all tasks. Aggregate scores hide what matters. Each task has its own table.
  • No "best model" headlines. "Best for what" is the question; the headline reflects the task.
  • No undated snapshots. Every benchmark article carries an "as of" line. Old snapshots stay live; they are not silently updated.
  • No paid placement. Nothing in our methodology gives any model a structural advantage. If a vendor sent us hardware, we would say so.

Why this approach beats vibes

Vibe-ranking models is fun and worthless. Standardized benchmarks are rigorous and misleading. The middle path — small disclosed task batteries with fixed parameters and a visible rubric — is what actual engineers do when they are picking a model for a project. We just write it down.