Is context rot the same as "lost in the middle"?

Closely related but slightly broader. "Lost in the middle" specifically describes the U-shaped position effect (Liu et al., 2023). "Context rot" or "context decay" is the umbrella term practitioners use for the general pattern: as the input window grows, accuracy drops, instructions get ignored, position effects worsen, and distractor passages compound. Same family of failures, slightly different framing.

Do long-context models like Gemini 1M or Llama 4 (10M experimental) solve this?

They reduce the severity. Long-context-tuned models with retrieval-aware attention and explicit long-context training handle the middle of long contexts much better than older models. They do not eliminate position effects, and the gap between advertised and usable context is still real. Test on your task at the length you actually plan to use.

What is the practical maximum I should plan for?

Run a needle-in-a-haystack battery on your target model at the lengths you care about (see [context-window-stress-tests](/model-benchmarks/context-window-stress-tests/)). Most teams find that practical reliability falls off well below the advertised maximum — for a model rated to 200k tokens, the comfortable working zone might be 30k–60k for retrieval-heavy tasks. The exact number depends on the model and the task.
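One way to turn a battery of results into a planning number is to take the longest tested length at which accuracy still clears a bar you choose. The sketch below assumes you already have accuracy per context length from your own tests; the function name `comfortable_length`, the 0.9 threshold, and the example figures are all hypothetical and are not measurements of any real model.

```python
from typing import Dict


def comfortable_length(accuracy_by_length: Dict[int, float],
                       threshold: float = 0.9) -> int:
    """Largest tested length where it and every shorter tested length
    meet the accuracy threshold. Returns 0 if even the shortest fails."""
    best = 0
    for length in sorted(accuracy_by_length):
        if accuracy_by_length[length] < threshold:
            break  # stop at the first length that falls below the bar
        best = length
    return best


# Hypothetical numbers for illustration only, not real benchmark results.
example_results = {8_000: 0.99, 32_000: 0.97, 64_000: 0.91,
                   128_000: 0.78, 200_000: 0.64}
print(comfortable_length(example_results))
```

The threshold is a product decision rather than a property of the model: a retrieval-heavy task may need a stricter bar than a summarization task.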

Context Rot, Explained

Context rot is the practitioner term for the systematic degradation of LLM performance as the input window fills up. The terminology is community-converged rather than academic — Drew Breunig's 2025 piece "How Long Contexts Fail" is widely cited as the post that popularized the phrase, building on a research thread that goes back to Liu et al.'s 2023 paper "Lost in the Middle: How Language Models Use Long Contexts".

The phenomenon is real and well documented, and it is one of the most important things to understand when building long-context applications.

The core finding

Liu et al. ran a clean test: place a single relevant passage among many distractor passages that do not contain the answer, ask a question that depends on it, and chart accuracy against the passage's position. The result was a U-shaped curve. Models reliably used the passage when it sat near the beginning or the end of the context. Accuracy dropped sharply when it sat in the middle, even at lengths the model was supposedly trained to handle.

Three pieces of that result matter:

The drop is sharp, not gradual. The middle of a long context behaves as if it is not really there.
More distractors make it worse, not just longer documents. Filling the window with marginally related text hurts retrieval of the relevant text.
Models within their official context window can still fail. Advertised window length is not the same as the length at which the model actually uses every token.

This held across GPT-3.5, GPT-4, and the open-weight models tested at the time. Follow-up work from 2024 through 2026 has continued to find the effect in later model generations, with the severity decreasing for models trained with explicit long-context objectives.
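The position sweep described above can be sketched in a few lines. This is a minimal illustration, not the protocol from the paper: `ask_model` is a hypothetical callable you would wire to your own model client, the filler passages stand in for real distractor documents, and scoring by substring match is a deliberately crude assumption.

```python
from typing import Callable, Dict, List


def build_haystack(filler: List[str], needle: str, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(filler))
    passages = filler[:index] + [needle] + filler[index:]
    return "\n\n".join(passages)


def position_sweep(ask_model: Callable[[str], str],
                   filler: List[str], needle: str,
                   question: str, expected: str,
                   depths: List[float]) -> Dict[float, bool]:
    """Ask the same question with the needle at each depth.

    Returns a mapping of depth -> whether the expected answer appeared.
    """
    results = {}
    for depth in depths:
        prompt = (build_haystack(filler, needle, depth)
                  + "\n\nQuestion: " + question)
        answer = ask_model(prompt)
        results[depth] = expected.lower() in answer.lower()
    return results


# Stub model for demonstration only; replace with a real client call.
def stub_model(prompt: str) -> str:
    return "The code is 7421."


filler_passages = [f"Filler passage number {i}." for i in range(20)]
sweep = position_sweep(stub_model, filler_passages,
                       "The access code is 7421.",
                       "What is the access code?", "7421",
                       [0.0, 0.25, 0.5, 0.75, 1.0])
print(sweep)
```

With a real model, repeating the sweep over several needles and filler sets and averaging per depth is what produces a curve rather than a handful of pass/fail points.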

The broader failure family

"Context rot" extends the position effect into a family of related failures that all get worse as the window grows:

Position bias — the U-shaped curve. Edges of the context get more attention than the middle.
Distractor sensitivity — more irrelevant material in the window depresses accuracy on the relevant material.
Instruction drift — system-prompt rules get followed less reliably as the conversation grows long. The model "forgets" constraints that were stated 30k tokens ago.
Tool-result blindness — in agent loops, observations from earlier tool calls become less likely to influence the next decision the further back they are.
Coherence loss — long generation in long contexts starts to repeat, drift, or contradict earlier content.

These are not separate bugs. They share a common cause: attention is a finite resource and gets diluted across the context.
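The dilution intuition can be made concrete with a toy calculation. The sketch below assumes a single softmax over one relevant token and many equally scored distractor tokens; real transformers use many heads and layers, so treat this as an illustration of the arithmetic rather than a model of any actual system. The function name and the scores are hypothetical.

```python
import math


def relevant_attention_share(relevant_score: float,
                             num_distractors: int,
                             distractor_score: float = 0.0) -> float:
    """Softmax weight on one relevant token competing with
    num_distractors tokens that all share the same lower score."""
    relevant = math.exp(relevant_score)
    distractors = num_distractors * math.exp(distractor_score)
    return relevant / (relevant + distractors)


# The relevant token keeps the same score; only the crowd around it grows.
for n in (10, 100, 1_000, 10_000):
    print(n, round(relevant_attention_share(5.0, n), 4))
```

Because softmax weights sum to one, the share left for the relevant token shrinks as distractors are added, even though its own score never changes.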

What modern long-context training changes

Frontier models in 2026 (Gemini 2.x with its million-token window, Claude 4.x, Llama 4 with its experimental 10M-token configuration) have been explicitly trained for long-context retention and use specialized attention patterns to mitigate the failure modes above.

The honest picture:

Long-context-tuned models do markedly better on needle-in-a-haystack at long lengths than the 2023 generation did.
The U-shaped position curve is flatter but not flat.
The advertised maximum context still overstates the practically reliable length. For most retrieval-heavy tasks, a model rated to a million tokens has a comfortable working zone well below that.
Performance on synthetic needle tests overstates performance on real tasks. The synthetic tests use a single hidden fact against neutral filler; real applications have multiple relevant pieces, distractors that look relevant, and questions that require reasoning across positions.

The practical conclusion stays the same as it has been since 2023: fewer tokens is usually better than more.

How to fight context rot

The defenses are exactly the techniques covered elsewhere in this cluster, and they compound:

Aggressive selection. Send the model the few passages it actually needs, not every plausibly-relevant one. See how-to-choose-what-goes-in-the-context-window.
Edge placement. Put critical instructions at the top, critical questions at the bottom. The middle is for material the model uses less reliably anyway.
Compaction. In long conversations, summarize old turns rather than carrying them verbatim. See memory-systems-for-llm-apps.
Re-statement. When a rule needs to hold across many turns, restate it in the user message near the point of action, not just in a system message 50k tokens earlier.
Stress test before shipping. Run a needle-in-a-haystack battery at your actual deployment length. If the model is unreliable at that length, drop the length or pick a different model.
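Selection, edge placement, compaction, and re-statement can all live in the code that assembles the prompt. The following is a minimal sketch under stated assumptions: passages arrive with relevance scores from some retriever you already have, `summarize` is a hypothetical callable (for example a cheap model call), and every function name here is illustrative rather than part of any library.

```python
from typing import Callable, List, Tuple


def select_passages(scored: List[Tuple[float, str]], k: int) -> List[str]:
    """Aggressive selection: keep only the k highest-scoring passages."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [text for _, text in ranked[:k]]


def compact_history(turns: List[str], keep_recent: int,
                    summarize: Callable[[List[str]], str]) -> List[str]:
    """Compaction: replace older turns with one summary and keep
    the most recent turns verbatim."""
    if len(turns) <= keep_recent:
        return list(turns)
    split = len(turns) - keep_recent
    return [summarize(turns[:split])] + turns[split:]


def build_prompt(instructions: str, passages: List[str],
                 question: str, rule_to_restate: str = "") -> str:
    """Edge placement: instructions at the top, question at the bottom.
    Re-statement: repeat a critical rule right before the question."""
    parts = [instructions]                  # top edge
    parts.extend(passages)                  # middle: supporting material
    if rule_to_restate:
        parts.append("Reminder: " + rule_to_restate)
    parts.append("Question: " + question)   # bottom edge
    return "\n\n".join(parts)


# Hypothetical inputs for illustration.
scored_passages = [(0.21, "Loosely related note."),
                   (0.93, "Refund policy: 30 days with receipt."),
                   (0.58, "Store hours: 9 to 5 on weekdays.")]
prompt = build_prompt("Answer only from the passages provided.",
                      select_passages(scored_passages, k=2),
                      "How long do customers have to request a refund?",
                      rule_to_restate="answer only from the passages.")
print(prompt)
```

The value of `k` and the number of turns kept verbatim are the knobs worth tuning against your own stress-test results.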

Context rot is one of the few failure modes where the technique that fixes it (use less context) is also the technique that makes the rest of your stack cheaper and faster. Taking it seriously is about as close to free engineering as you will get.

Comments 2

u/rot_r · 1 month ago

watched a long thread slowly lose the plot in real time last week and now i finally have a name for it. context rot is very real

u/deepa.r · 1 month ago

context rot is such a good name for it. ive watched answers slowly degrade as the window fills up and never had a clean term to describe what was happening
