As of 2026-05-16

When a lab announces a "1M token context window," the headline number does not tell you whether the model actually uses tokens 500k–700k or treats them as background noise. The honest test is a context-window stress battery.

Test 1: needle in a haystack

Hide a fact in a block of unrelated text. Ask for it back.

[lots of unrelated text — 64,000 tokens]

The secret number is 7142.

[lots more unrelated text — 64,000 tokens]

Question: What is the secret number?

Vary two axes:

  • Position: start, 25%, 50%, 75%, end of context.
  • Length: 8k, 32k, 64k, 128k, 200k tokens.

Plot the success rate as a heatmap. The classic "lost in the middle" failure shows up as a hole near the 50% position. Models with good long-context training have a clean grid; weaker models have a noticeable dip in the middle and at the longest lengths.
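A minimal sketch of the position × length sweep is below. It is not a definitive harness: ask_model() is a placeholder for whatever client you use, and "tokens" are approximated by whitespace-delimited filler words rather than a real tokenizer.

    # Sketch of a needle-in-a-haystack sweep over position and length.
    # Assumptions: ask_model() is a stand-in for your model client, and
    # context length is approximated by counting filler words, not tokens.
    import random
    import matplotlib.pyplot as plt

    POSITIONS = [0.0, 0.25, 0.5, 0.75, 1.0]               # where the needle goes
    LENGTHS = [8_000, 32_000, 64_000, 128_000, 200_000]   # approximate context sizes
    TRIALS = 5                                            # repeats per grid cell

    def build_haystack(length, position, secret):
        filler = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dogs"]
        words = [random.choice(filler) for _ in range(length)]
        idx = int(position * (len(words) - 1))
        words.insert(idx, f"The secret number is {secret}.")
        return " ".join(words) + "\n\nQuestion: What is the secret number?"

    def ask_model(prompt):
        raise NotImplementedError("replace with a call to your model client")

    def run_grid():
        grid = [[0.0] * len(LENGTHS) for _ in POSITIONS]
        for i, pos in enumerate(POSITIONS):
            for j, length in enumerate(LENGTHS):
                hits = 0
                for _ in range(TRIALS):
                    secret = random.randint(1000, 9999)
                    answer = ask_model(build_haystack(length, pos, secret))
                    hits += str(secret) in answer
                grid[i][j] = hits / TRIALS
        return grid

    def plot(grid):
        fig, ax = plt.subplots()
        im = ax.imshow(grid, vmin=0, vmax=1, cmap="viridis")
        ax.set_xticks(range(len(LENGTHS)), [f"{n // 1000}k" for n in LENGTHS])
        ax.set_yticks(range(len(POSITIONS)), [f"{int(p * 100)}%" for p in POSITIONS])
        ax.set_xlabel("context length")
        ax.set_ylabel("needle position")
        fig.colorbar(im, label="success rate")
        plt.show()

Exact substring match on the secret is deliberately strict: it keeps the scoring script trivial to audit, at the cost of marking a paraphrased answer as a miss.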

Test 2: ordered retrieval

It sounds like a small extension of the needle test, but it is harder in practice. Plant five facts at known positions and ask the model to list them in the order they appeared:

[long document with 5 numbered "secrets" buried at positions]

Question: List the 5 secrets in the order they appear in the document.

The pure needle test measures retrieval alone. Ordered retrieval also tests whether the model preserves positional information, which is what RAG-without-citations relies on.
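Construction and scoring can stay simple. In this sketch (same placeholder assumptions as above: filler words stand in for tokens), the score distinguishes "recalled everything in order," "recalled everything but scrambled the order," and "dropped a secret":

    # Sketch: ordered retrieval. Plant five secrets at spread-out positions,
    # then check whether the answer lists them in document order.
    import random

    def build_doc(length=64_000, n_secrets=5):
        filler = ["alpha", "beta", "gamma", "delta", "epsilon"]
        words = [random.choice(filler) for _ in range(length)]
        secrets = [random.randint(1000, 9999) for _ in range(n_secrets)]
        step = len(words) // (n_secrets + 1)
        for k, s in enumerate(secrets, start=1):
            words.insert(k * step, f"Secret {k}: {s}.")
        prompt = " ".join(words) + (
            "\n\nQuestion: List the 5 secrets in the order they appear in the document."
        )
        return prompt, secrets

    def score_order(answer, secrets):
        # Position of each secret in the model's answer; -1 means it was dropped.
        positions = [answer.find(str(s)) for s in secrets]
        if any(p < 0 for p in positions):
            return 0.0                                      # missing at least one secret
        return 1.0 if positions == sorted(positions) else 0.5   # partial credit: recalled, wrong order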

Test 3: end-of-context summarization

Paste a long document and ask for a summary that emphasizes the last section. Many models silently overweight the first chunk of long input because that is where the attention budget concentrates. A model that summarizes the first chapter when asked about the last is failing at usable long context.
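One way to score this without a human grader is a crude lexical proxy: compare how much of the summary's vocabulary overlaps the last section versus the first. The split into sections and the overlap metric here are assumptions, not part of the original battery.

    # Sketch: does the summary actually reflect the last section?
    # Crude proxy: word overlap between the summary and the first vs. last sections.
    def overlap(summary, section):
        summary_words = set(summary.lower().split())
        section_words = set(section.lower().split())
        if not section_words:
            return 0.0
        return len(summary_words & section_words) / len(section_words)

    def summary_bias(summary, sections):
        # sections: the source document split into ordered chunks
        return {"first": overlap(summary, sections[0]),
                "last": overlap(summary, sections[-1])}

A summary that scores far higher on "first" than "last," when the prompt asked to emphasize the last section, is exactly the failure described above.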

Test 4: distractor resistance

Plant the fact you want and a similar-looking distractor:

The secret number is 7142.
...
The secret PIN is 4217.
...
Question: What is the secret PIN?

Weak long-context models will sometimes return 7142 because it was the first plausible-looking fact. Strong models pick correctly even when the wrong answer is closer to the prompt.
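The scoring rule for this test should reject the distractor explicitly rather than just search for the right answer, since an answer that hedges by quoting both numbers is still a miss. A sketch, using the values from the example above:

    # Sketch: distractor resistance. The answer must contain the PIN and
    # must not contain the distractor number; quoting both counts as a miss.
    def score_distractor(answer, pin="4217", distractor="7142"):
        if distractor in answer:
            return 0.0      # grabbed (or hedged toward) the plausible-looking wrong fact
        return 1.0 if pin in answer else 0.0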

What the results mean

A model that aces 16k but falls apart at 128k is not useless — it is great for chat, fine for short-doc analysis, and unsafe for "paste this whole codebase" or "ingest this entire transcript" use cases. Knowing the practical ceiling is more useful than the advertised maximum.

The Bench cluster keeps a dated leaderboard of context-window stress results for the major models. Each test is reproducible — the prompts and the scoring script are public.