As of 2026-05-16

When a lab announces a "1M token context window," the headline number does not tell you whether the model actually uses tokens 500k–700k or treats them as background noise. The honest test is a context-window stress battery.

Test 1: needle in a haystack

Hide a fact in a block of unrelated text. Ask for it back.

[lots of unrelated text — 64,000 tokens]

The secret number is 7142.

[lots more unrelated text — 64,000 tokens]

Question: What is the secret number?

Vary two axes:

  • Position: start, 25%, 50%, 75%, end of context.
  • Length: 8k, 32k, 64k, 128k, 200k tokens.

Plot the success rate as a heatmap. The classic "lost in the middle" failure shows up as a hole near the 50% position. Models with good long-context training have a clean grid; weaker models have a noticeable dip in the middle and at the longest lengths.
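A minimal sketch of the position × length sweep is below. It is not a definitive harness: ask_model() is a placeholder for whatever client you use, and "tokens" are approximated by whitespace-delimited filler words rather than a real tokenizer.

    # Sketch of a needle-in-a-haystack sweep over position and length.
    # Assumptions: ask_model() is a stand-in for your model client, and
    # context length is approximated by counting filler words, not tokens.
    import random
    import matplotlib.pyplot as plt

    POSITIONS = [0.0, 0.25, 0.5, 0.75, 1.0]               # where the needle goes
    LENGTHS = [8_000, 32_000, 64_000, 128_000, 200_000]   # approximate context sizes
    TRIALS = 5                                            # repeats per grid cell

    def build_haystack(length, position, secret):
        filler = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dogs"]
        words = [random.choice(filler) for _ in range(length)]
        idx = int(position * (len(words) - 1))
        words.insert(idx, f"The secret number is {secret}.")
        return " ".join(words) + "\n\nQuestion: What is the secret number?"

    def ask_model(prompt):
        raise NotImplementedError("replace with a call to your model client")

    def run_grid():
        grid = [[0.0] * len(LENGTHS) for _ in POSITIONS]
        for i, pos in enumerate(POSITIONS):
            for j, length in enumerate(LENGTHS):
                hits = 0
                for _ in range(TRIALS):
                    secret = random.randint(1000, 9999)
                    answer = ask_model(build_haystack(length, pos, secret))
                    hits += str(secret) in answer
                grid[i][j] = hits / TRIALS
        return grid

    def plot(grid):
        fig, ax = plt.subplots()
        im = ax.imshow(grid, vmin=0, vmax=1, cmap="viridis")
        ax.set_xticks(range(len(LENGTHS)), [f"{n // 1000}k" for n in LENGTHS])
        ax.set_yticks(range(len(POSITIONS)), [f"{int(p * 100)}%" for p in POSITIONS])
        ax.set_xlabel("context length")
        ax.set_ylabel("needle position")
        fig.colorbar(im, label="success rate")
        plt.show()

Exact substring match on the secret is deliberately strict: it keeps the scoring script trivial to audit, at the cost of marking a paraphrased answer as a miss.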

Test 2: ordered retrieval

It sounds like a small extension of the needle test, but it is harder in practice. Plant five facts at known positions and ask the model to list them in the order they appeared:

[long document with 5 numbered "secrets" buried at positions]

Question: List the 5 secrets in the order they appear in the document.

The pure needle test measures retrieval alone. Ordered retrieval also tests whether the model preserves positional information, which is what RAG-without-citations relies on.
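Construction and scoring can stay simple. In this sketch (same placeholder assumptions as above: filler words stand in for tokens), the score distinguishes "recalled everything in order," "recalled everything but scrambled the order," and "dropped a secret":

    # Sketch: ordered retrieval. Plant five secrets at spread-out positions,
    # then check whether the answer lists them in document order.
    import random

    def build_doc(length=64_000, n_secrets=5):
        filler = ["alpha", "beta", "gamma", "delta", "epsilon"]
        words = [random.choice(filler) for _ in range(length)]
        secrets = [random.randint(1000, 9999) for _ in range(n_secrets)]
        step = len(words) // (n_secrets + 1)
        for k, s in enumerate(secrets, start=1):
            words.insert(k * step, f"Secret {k}: {s}.")
        prompt = " ".join(words) + (
            "\n\nQuestion: List the 5 secrets in the order they appear in the document."
        )
        return prompt, secrets

    def score_order(answer, secrets):
        # Position of each secret in the model's answer; -1 means it was dropped.
        positions = [answer.find(str(s)) for s in secrets]
        if any(p < 0 for p in positions):
            return 0.0                                      # missing at least one secret
        return 1.0 if positions == sorted(positions) else 0.5   # partial credit: recalled, wrong order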

Test 3: end-of-context summarization

Paste a long document and ask for a summary that emphasizes the last section. Many models silently overweight the first chunk of long input because that is where the attention budget concentrates. A model that summarizes the first chapter when asked about the last is failing at usable long context.
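One way to score this without a human grader is a crude lexical proxy: compare how much of the summary's vocabulary overlaps the last section versus the first. The split into sections and the overlap metric here are assumptions, not part of the original battery.

    # Sketch: does the summary actually reflect the last section?
    # Crude proxy: word overlap between the summary and the first vs. last sections.
    def overlap(summary, section):
        summary_words = set(summary.lower().split())
        section_words = set(section.lower().split())
        if not section_words:
            return 0.0
        return len(summary_words & section_words) / len(section_words)

    def summary_bias(summary, sections):
        # sections: the source document split into ordered chunks
        return {"first": overlap(summary, sections[0]),
                "last": overlap(summary, sections[-1])}

A summary that scores far higher on "first" than "last," when the prompt asked to emphasize the last section, is exactly the failure described above.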

Test 4: distractor resistance

Plant the fact you want and a similar-looking distractor:

The secret number is 7142.
...
The secret PIN is 4217.
...
Question: What is the secret PIN?

Weak long-context models will sometimes return 7142 because it was the first plausible-looking fact. Strong models pick correctly even when the wrong answer is closer to the prompt.
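The scoring rule for this test should reject the distractor explicitly rather than just search for the right answer, since an answer that hedges by quoting both numbers is still a miss. A sketch, using the values from the example above:

    # Sketch: distractor resistance. The answer must contain the PIN and
    # must not contain the distractor number; quoting both counts as a miss.
    def score_distractor(answer, pin="4217", distractor="7142"):
        if distractor in answer:
            return 0.0      # grabbed (or hedged toward) the plausible-looking wrong fact
        return 1.0 if pin in answer else 0.0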

What the results mean

A model that aces 16k but falls apart at 128k is not useless — it is great for chat, fine for short-doc analysis, and unsafe for "paste this whole codebase" or "ingest this entire transcript" use cases. Knowing the practical ceiling is more useful than the advertised maximum.

The Bench cluster keeps a dated leaderboard of context-window stress results for the major models. Each test is reproducible — the prompts and the scoring script are public.