Context rot is the practitioner term for the systematic degradation of LLM performance as the input window fills up. The terminology is community-converged rather than academic: Drew Breunig's 2025 piece "How Long Contexts Fail" is widely cited for cataloguing the failure modes, and Chroma's 2025 technical report "Context Rot: How Increasing Input Tokens Impacts LLM Performance" helped spread the phrase itself. Both sit on a research thread that goes back to Liu et al.'s 2023 paper "Lost in the Middle: How Language Models Use Long Contexts".
The phenomenon is real, well documented, and one of the most important things to understand about long-context applications.
The core finding
Liu et al. ran a clean test: place the one passage that contains the answer among many distractor passages that do not, ask the model a question that depends on it, and chart accuracy against the passage's position. The result was a U-shaped curve. Models reliably retrieved the passage when it sat near the beginning or the end of the context. Accuracy dropped sharply when it sat in the middle, even at lengths the model was supposedly trained to handle.
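The shape of that experiment is easy to reproduce in miniature. The sketch below is not Liu et al.'s harness. It assumes a hypothetical `ask_model(prompt) -> str` callable that you supply, and the names `build_prompt` and `position_sweep`, the case format, and the substring scoring rule are all illustrative choices. It places one needle passage at a range of relative depths and records how often the expected answer string comes back.

```python
from typing import Callable, Dict, Sequence


def build_prompt(needle: str, fillers: Sequence[str], depth: float, question: str) -> str:
    """Insert `needle` among `fillers` at relative `depth` (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(fillers))
    passages = list(fillers[:index]) + [needle] + list(fillers[index:])
    return "\n\n".join(passages) + f"\n\nQuestion: {question}\nAnswer:"


def position_sweep(
    ask_model: Callable[[str], str],  # hypothetical: wraps whatever model API you use
    cases: Sequence[Dict[str, str]],  # each case has "needle", "question", "answer"
    fillers: Sequence[str],
    depths: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> Dict[float, float]:
    """Return accuracy per depth; a hit means the answer string appears in the reply."""
    if not cases:
        raise ValueError("need at least one test case")
    accuracy: Dict[float, float] = {}
    for depth in depths:
        hits = 0
        for case in cases:
            prompt = build_prompt(case["needle"], fillers, depth, case["question"])
            reply = ask_model(prompt)
            hits += case["answer"].lower() in reply.lower()
        accuracy[depth] = hits / len(cases)
    return accuracy
```

Plotting the returned mapping with depth on the x-axis gives the position curve for your model at whatever total length the fillers add up to. Substring matching is a crude scorer; a real evaluation would likely need something stricter.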
Three pieces of that result matter:
- The drop is sharp, not gradual. The middle of a long context behaves as if it is not really there: in the paper's multi-document QA runs, GPT-3.5-Turbo with the answer mid-context scored below its own closed-book baseline.
- More distractors make it worse, not just longer documents. Filling the window with marginally related text hurts retrieval of the relevant text.
- Models within their official context window can still fail. Advertised window length is not the same as the length at which the model actually uses every token.
This held across GPT-3.5, GPT-4, and the open-weight models tested at the time. Follow-up work from 2024 onward has reproduced the effect on later model generations, with the severity decreasing for models trained with explicit long-context objectives.
The broader failure family
"Context rot" extends the position effect into a family of related failures that all get worse as the window grows:
- Position bias — the U-shaped curve. Edges of the context get more attention than the middle.
- Distractor sensitivity — more irrelevant material in the window depresses accuracy on the relevant material.
- Instruction drift — system-prompt rules get followed less reliably as the conversation grows long. The model "forgets" constraints that were stated 30k tokens ago.
- Tool-result blindness — in agent loops, observations from earlier tool calls become less likely to influence the next decision the further back they are.
- Coherence loss — long generation in long contexts starts to repeat, drift, or contradict earlier content.
These are not separate bugs. The usual explanation is a shared one: attention is a finite resource, and every added token dilutes the share available to the tokens that matter.
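One way to build intuition for the dilution idea is a toy softmax calculation. This is an illustrative model, not a description of any real transformer: it assumes a single attention head in which the one relevant token's logit sits a fixed `margin` above every other token's, and all other tokens score the same. Under those assumptions the relevant token's share of attention is e^margin / (e^margin + n - 1), which shrinks as the token count n grows.

```python
import math


def relevant_attention_share(n_tokens: int, margin: float) -> float:
    """Softmax weight on one relevant token among n_tokens - 1 equal-scoring distractors.

    Toy assumption: the relevant logit exceeds every distractor logit by `margin`.
    """
    if n_tokens < 1:
        raise ValueError("n_tokens must be at least 1")
    boosted = math.exp(margin)
    return boosted / (boosted + (n_tokens - 1))


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000, 1_000_000):
        share = relevant_attention_share(n, margin=8.0)
        print(f"{n:>9} tokens: {share:.4f}")
```

With an arbitrary margin of 8, the share falls from roughly 0.75 at 1,000 tokens to roughly 0.003 at 1,000,000. Real models have many heads and layers and learned position effects, so treat this as intuition for the direction of the effect, not a prediction of its size.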
What modern long-context training changes
Frontier models in 2026 (Gemini 2.x with its million-token window, Claude 4.x, Llama 4 with its experimental 10M-token configuration) have been explicitly trained for long-context retention and use specialized attention patterns to mitigate the failure modes above.
The honest picture:
- Long-context-tuned models do markedly better on needle-in-a-haystack at long lengths than the 2023 generation did.
- The U-shaped position curve is flatter but not flat.
- The advertised maximum context still overstates the practically reliable length. For most retrieval-heavy tasks, a model rated to a million tokens has a comfortable working zone well below that.
- Performance on synthetic needle tests overstates performance on real tasks. The synthetic tests use a single hidden fact against neutral filler; real applications have multiple relevant pieces, distractors that look relevant, and questions that require reasoning across positions.
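The gap between advertised and reliable length can be turned into a number once you have measurements. The helper below is a sketch under assumptions: `results` is a hypothetical mapping from tested context length (in tokens) to accuracy on your own task, and the 0.9 default threshold is an arbitrary placeholder. It returns the longest tested length for which that length and every shorter tested length meet the threshold.

```python
from typing import Mapping, Optional


def reliable_length(results: Mapping[int, float], threshold: float = 0.9) -> Optional[int]:
    """Longest tested length with accuracy >= threshold at it and at every shorter tested length.

    Returns None when even the shortest tested length falls below the threshold.
    """
    best: Optional[int] = None
    for length in sorted(results):
        if results[length] < threshold:
            break  # stop at the first failure; later recoveries are not trusted
        best = length
    return best
```

Stopping at the first failure is a deliberately conservative choice: a model that dips at one length and recovers at a longer one is treated as unreliable past the dip. The number is only as meaningful as the measurements behind it, so accuracy figures from the real task are a better input than synthetic needle scores.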
The practical conclusion stays the same as it has been since 2023: fewer tokens is usually better than more.
How to fight context rot
The defenses are exactly the techniques covered elsewhere in this cluster, and they compound:
- Aggressive selection. Send the model the few passages it actually needs, not every plausibly relevant one. See how-to-choose-what-goes-in-the-context-window.
- Edge placement. Put critical instructions at the top, critical questions at the bottom. Reserve the middle for supporting material, where less reliable recall costs the least.
- Compaction. In long conversations, summarize old turns rather than carrying them verbatim. See memory-systems-for-llm-apps.
- Re-statement. When a rule needs to hold across many turns, restate it in the user message near the point of action, not just in a system message 50k tokens earlier.
- Stress test before shipping. Run a needle-in-a-haystack battery at your actual deployment length. If the model is unreliable at that length, drop the length or pick a different model.
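The first four defenses can be combined in one prompt-assembly step. The sketch below is one possible arrangement, not a reference implementation. Everything in it is an assumption made for illustration: the four-characters-per-token estimate stands in for a real tokenizer, the relevance scores are presumed to come from your own retriever, and `summarize` is a hypothetical callable (often another model call) that condenses old turns.

```python
from typing import Callable, List, Sequence, Tuple


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: roughly four characters per token."""
    return max(1, len(text) // 4)


def select_passages(scored: Sequence[Tuple[float, str]], budget_tokens: int) -> List[str]:
    """Aggressive selection: keep the highest-scoring passages that fit the token budget."""
    chosen: List[str] = []
    used = 0
    for _score, passage in sorted(scored, key=lambda pair: pair[0], reverse=True):
        cost = estimate_tokens(passage)
        if used + cost > budget_tokens:
            continue  # skip passages that would overflow; smaller ones may still fit
        chosen.append(passage)
        used += cost
    return chosen


def compact_history(
    turns: Sequence[str],
    keep_recent: int,
    summarize: Callable[[Sequence[str]], str],  # hypothetical summarizer you supply
) -> List[str]:
    """Compaction: replace all but the most recent turns with a single summary."""
    if keep_recent < 0:
        raise ValueError("keep_recent must be non-negative")
    if len(turns) <= keep_recent:
        return list(turns)
    split = len(turns) - keep_recent
    return [f"Summary of earlier conversation: {summarize(turns[:split])}"] + list(turns[split:])


def assemble_prompt(
    instructions: str,
    history: Sequence[str],
    passages: Sequence[str],
    question: str,
    standing_rule: str = "",
) -> str:
    """Edge placement: instructions first, question last, bulk material in between."""
    parts = [instructions, *history, *passages]
    if standing_rule:
        parts.append(f"Reminder: {standing_rule}")  # re-statement near the point of action
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)
```

A typical call chain would be `assemble_prompt(rules, compact_history(turns, 4, summarize), select_passages(scored, 2000), question, standing_rule=rule)`, where the numbers are placeholders to tune against your own measurements.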
Context rot is one of the few failure modes where the technique that fixes it (use less context) is also the technique that makes the rest of your stack cheaper and faster. Taking it seriously is about as close to free engineering as it gets.