Should I put the instruction first or last?

For most tasks, the high-level instruction goes at the top (in the system message). The specific question or input goes right before the model's turn. This puts both at "edge" positions of the context, where attention is strongest, and matches the pattern that most modern instruction-tuned models were trained on.

How many retrieved chunks should I send?

Fewer than you think. Most production RAG setups peak around 3–8 well-chosen chunks. Past about a dozen, additional chunks usually hurt: the relevant one gets buried, distractor chunks pull attention, and the cost goes up. Tune chunk count on your evals, not on what fits in the window.

Do long-context models fix this?

They reduce the symptom but rarely eliminate it. Models trained explicitly for long context (Gemini's 1M+, Claude's 200k+, Llama 4's 10M experimental) handle the middle of long contexts much better than older models, but position effects do not go to zero. Test on a needle-in-a-haystack battery at the lengths you actually use.

How to Choose What Goes in the Context Window

Treat the context window as a constrained optimization problem. You have a token budget. You decide three things: what goes in, where it sits, and how much of the budget each part gets. Every other context engineering question is a special case of these three.

What goes in

The default mistake is to put in whatever fits and let the model sort it out. That works at 4k tokens. It stops working as soon as you have real retrieval, real history, or real tool output.

The reliable approach: for each request, ask what the model actually needs to answer correctly. Then include only that.

The instruction — the standing rules. Usually in the system message.
The narrow context — the specific documents, fields, code, or transcripts the answer depends on.
Worked examples — one or two if the task has a precise shape (extraction, classification, formatting). Skip them if it does not.
Tool results and intermediate state — only what the next decision needs. Drop what was already used.
The actual question or input — last.

What does not belong in the context:

Whole documents when you only need a section. Pre-filter or retrieve.
Past conversation turns whose information has already been distilled into a summary.
Tool observations from earlier loops that nothing downstream reads.
Decorative content (preambles, "let me think out loud about your question..." style scaffolding from earlier turns).

If you are not sure something belongs, leave it out and see if accuracy drops on your eval set. It usually does not.

Where it goes

The single most useful piece of research here is Liu et al., "Lost in the Middle: How Language Models Use Long Contexts" (arXiv:2307.03172). The headline finding is a U-shaped position curve: models reliably use information at the start and end of a long context, and reliably miss information placed in the middle.

The result has held up across follow-up work. Newer long-context-tuned models flatten the U but do not eliminate it. The practical implication is the same as it was in 2023:

Top of context — system prompt, durable rules, persona, output format constraints. The model will pay attention to these the most.
Just before the model's turn — the specific question or instruction for this turn. Last-in attention is strong on every model we have tested.
The middle — the part you trust least. Long retrieved documents go here if anywhere, but expect the model to use them less effectively than equivalent material at the edges.

For RAG specifically, putting retrieved chunks between the system message and the user question (i.e., at the bottom of the context, just before the question) consistently beats putting them at the top. Some teams put a one-line summary of each chunk at the top and the full chunks at the bottom — that is a good compromise.

How much each part gets

Token budgets matter even when you have a million tokens to play with. Three reasons: cost is linear, latency grows with input length, and the "lost in the middle" effect gets worse as the window fills up.

A reasonable starting budget for a typical RAG + chat setup:

Part	Tokens (typical)	Notes
System message	200–800	Persona, scope, rules, output format.
Conversation summary	200–600	Replaces earlier turns.
Recent turns (verbatim)	1k–3k	Sliding window of the last few exchanges.
Retrieved chunks	1k–4k	3–8 chunks of 200–500 tokens each.
Current user input	50–500	The actual question.
Reserved for model output	500–4k	Depends on task.

Total: usually 5k–10k tokens, far less than the advertised window. That is fine. Using less context tends to improve quality, not hurt it, until you starve the model of information it actually needs.

Make the construction observable

The number-one debugging move in context engineering is logging the actual input the model saw. Not the template, not the parameters, the literal final string. Re-read it after a bad answer. Most of the time the cause is obvious in five seconds — the wrong chunk got retrieved, an earlier summary dropped a critical fact, a tool result is in the middle of the context and the model never saw it.

If your application makes context construction observable, debugging gets easy. If it does not, every wrong answer feels like the model "just got it wrong." Almost none of them did.

Comments 2

u/n_walsh · 1 month ago

the curation mindset is basically the whole game. funny enough i first got the mental model from a NerdSip course on information density, then realised it applied directly here
u/curate_c · 1 month ago

treating the window like a tight budget completely changed my results. more context is not more better, signal to noise is everything

How to Choose What Goes in the Context Window

Article summary

What goes in

Where it goes

How much each part gets

Make the construction observable

Frequently asked questions

See also

Where to go next

Comments 2