Treat the context window as a constrained optimization problem. You have a token budget. You decide three things: what goes in, where it sits, and how much of the budget each part gets. Every other context engineering question is a special case of these three.

What goes in

The default mistake is to put in whatever fits and let the model sort it out. That works at 4k tokens. It stops working as soon as you have real retrieval, real history, or real tool output.

The reliable approach: for each request, ask what the model actually needs to answer correctly. Then include only that.

  • The instruction — the standing rules. Usually in the system message.
  • The narrow context — the specific documents, fields, code, or transcripts the answer depends on.
  • Worked examples — one or two if the task has a precise shape (extraction, classification, formatting). Skip them if it does not.
  • Tool results and intermediate state — only what the next decision needs. Drop what was already used.
  • The actual question or input — last.
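
The ordering in the list above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the `ContextParts` container, the `assemble` function, and the bracketed section labels are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ContextParts:
    """Hypothetical container for the five kinds of content listed above."""
    instruction: str
    narrow_context: list[str] = field(default_factory=list)
    examples: list[str] = field(default_factory=list)
    tool_results: list[str] = field(default_factory=list)
    question: str = ""


def assemble(parts: ContextParts, max_examples: int = 2) -> str:
    """Join the parts in the listed order, skipping anything empty."""
    sections = [
        ("INSTRUCTION", [parts.instruction]),
        ("CONTEXT", parts.narrow_context),
        ("EXAMPLES", parts.examples[:max_examples]),  # one or two at most
        ("TOOL RESULTS", parts.tool_results),
        ("QUESTION", [parts.question]),  # always last
    ]
    blocks = []
    for label, items in sections:
        cleaned = [item.strip() for item in items if item and item.strip()]
        if cleaned:
            blocks.append(f"[{label}]\n" + "\n\n".join(cleaned))
    return "\n\n".join(blocks)
```

Empty sections disappear entirely, so a request with no tool results carries no empty header for the model to puzzle over.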

What does not belong in the context:

  • Whole documents when you only need a section. Pre-filter or retrieve.
  • Past conversation turns whose information has already been distilled into a summary.
  • Tool observations from earlier loops that nothing downstream reads.
  • Decorative content (preambles, "let me think out loud about your question..." style scaffolding from earlier turns).

If you are not sure something belongs, leave it out and see if accuracy drops on your eval set. It usually does not.
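
The leave-it-out test is easy to automate. The sketch below assumes your eval set is a list of (context parts by name, expected answer) pairs and that `run_model` is your own function that builds a prompt from the parts and returns the model's answer. Both shapes are assumptions made for this example, as is exact-match scoring.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical eval-set shape: (context parts keyed by name, expected answer).
Example = Tuple[Dict[str, str], str]
RunModel = Callable[[Dict[str, str]], str]


def accuracy(examples: List[Example], run_model: RunModel,
             drop: Optional[str] = None) -> float:
    """Exact-match accuracy, optionally with one named part left out."""
    if not examples:
        raise ValueError("eval set is empty")
    correct = 0
    for parts, expected in examples:
        kept = {name: text for name, text in parts.items() if name != drop}
        if run_model(kept) == expected:
            correct += 1
    return correct / len(examples)


def ablation_report(examples: List[Example], run_model: RunModel) -> Dict[str, float]:
    """Accuracy change from dropping each part; about zero means it was not needed."""
    baseline = accuracy(examples, run_model)
    names = sorted({name for parts, _ in examples for name in parts})
    return {name: accuracy(examples, run_model, drop=name) - baseline
            for name in names}
```

A part whose delta sits near zero is a candidate for removal. A clearly negative delta means the model depends on it.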

Where it goes

The single most useful piece of research here is Liu et al., "Lost in the Middle: How Language Models Use Long Contexts" (arXiv:2307.03172). The headline finding is a U-shaped position curve: models use information most reliably when it sits at the start or end of a long context, and accuracy drops noticeably when the relevant information sits in the middle.

The result has held up across follow-up work. Newer long-context-tuned models flatten the U but do not eliminate it. The practical implication is the same as it was in 2023:

  • Top of context — system prompt, durable rules, persona, output format constraints. The model will pay attention to these the most.
  • Just before the model's turn — the specific question or instruction for this turn. Last-in attention is strong on every model we have tested.
  • The middle — the part you trust least. Long retrieved documents go here if anywhere, but expect the model to use them less effectively than equivalent material at the edges.

For RAG specifically, putting retrieved chunks after the system message and any history, immediately before the user question (that is, at the bottom of the context), consistently beats putting them at the top. Some teams put a one-line summary of each chunk at the top and the full chunks at the bottom; that is a good compromise.
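
A minimal sketch of the summaries-at-top, chunks-at-bottom layout follows. The function name, the (summary, full text) pair shape, and the label strings are all assumptions made for this example. Conversation history, if you have any, would sit between the summaries and the full chunks.

```python
from typing import List, Tuple


def layout_rag_prompt(system: str, chunks: List[Tuple[str, str]], question: str) -> str:
    """Build a prompt from (one-line summary, full text) pairs.

    Summaries go near the top, full chunks go just before the question.
    """
    summary_lines = [f"{i}. {summary}" for i, (summary, _) in enumerate(chunks, 1)]
    full_chunks = [f"[chunk {i}]\n{text}" for i, (_, text) in enumerate(chunks, 1)]

    parts = [system]
    if summary_lines:
        parts.append("Retrieved material (summaries):\n" + "\n".join(summary_lines))
    parts.extend(full_chunks)                 # bottom of the context
    parts.append(f"Question: {question}")     # last thing the model reads
    return "\n\n".join(parts)
```

The numbering is shared between the summary list and the chunk labels, so the model can tie a summary at the top to its full text at the bottom.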

How much each part gets

Token budgets matter even when you have a million tokens to play with. Three reasons: cost scales linearly with input tokens, latency grows with input length, and the "lost in the middle" effect gets worse as the window fills up.

A reasonable starting budget for a typical RAG + chat setup:

Part                        Tokens (typical)   Notes
System message              200–800            Persona, scope, rules, output format.
Conversation summary        200–600            Replaces earlier turns.
Recent turns (verbatim)     1k–3k              Sliding window of the last few exchanges.
Retrieved chunks            1k–4k              3–8 chunks of 200–500 tokens each.
Current user input          50–500             The actual question.
Reserved for model output   500–4k             Depends on task.

Total: usually 5k–10k tokens (the ranges above sum to roughly 3k–13k), far less than the advertised window. That is fine. Using less context tends to improve quality, not hurt it, until you starve the model of information it actually needs.
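
One way to enforce per-part budgets is sketched below, with loud assumptions. `count_tokens` is a crude whitespace word count standing in for your model's real tokenizer. The `BUDGET` values are just the upper ends of the ranges in the table above, and `fit_to_budget` is a hypothetical helper, not a library function.

```python
from typing import Dict, List

# Upper ends of the input ranges in the table above; tune per application.
# Output tokens are reserved separately, through the model's max-output setting.
BUDGET: Dict[str, int] = {
    "system": 800,
    "summary": 600,
    "recent_turns": 3000,
    "retrieved": 4000,
    "user_input": 500,
}


def count_tokens(text: str) -> int:
    """ASSUMPTION: whitespace words as a rough proxy. Use your real tokenizer."""
    return len(text.split())


def fit_to_budget(items: List[str], limit: int, keep: str = "first") -> List[str]:
    """Keep whole items until the limit would be exceeded.

    keep="first" favours the head of the list (e.g. top-ranked chunks);
    keep="last" favours the tail (e.g. the most recent turns).
    Original order is preserved either way.
    """
    ordered = items if keep == "first" else list(reversed(items))
    kept: List[str] = []
    used = 0
    for item in ordered:
        cost = count_tokens(item)
        if used + cost > limit:
            break
        kept.append(item)
        used += cost
    return kept if keep == "first" else list(reversed(kept))
```

Typical use is `fit_to_budget(turns, BUDGET["recent_turns"], keep="last")` for history and `fit_to_budget(ranked_chunks, BUDGET["retrieved"])` for retrieval. Items are dropped whole, never truncated mid-text, so the model does not see half a chunk.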

Make the construction observable

The number-one debugging move in context engineering is logging the actual input the model saw. Not the template, not the parameters, the literal final string. Re-read it after a bad answer. Most of the time the cause is obvious in five seconds: the wrong chunk got retrieved, an earlier summary dropped a critical fact, or a tool result sat in the middle of the context and the model overlooked it.
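
A minimal sketch of that logging, using only the Python standard library. `call_model` stands in for whatever function actually sends the string to your model. It, the `"context"` logger name, and the record fields are assumptions made for this example.

```python
import hashlib
import json
import logging
from typing import Callable

logger = logging.getLogger("context")


def call_with_logging(final_prompt: str, call_model: Callable[[str], str]) -> str:
    """Log the literal final string, then the answer, under a shared id."""
    # A short content hash lets you match an answer to the exact input later.
    prompt_id = hashlib.sha256(final_prompt.encode("utf-8")).hexdigest()[:12]
    logger.info(json.dumps({
        "event": "model_input",
        "prompt_id": prompt_id,
        "chars": len(final_prompt),
        "prompt": final_prompt,  # the literal string, not the template
    }))
    answer = call_model(final_prompt)
    logger.info(json.dumps({
        "event": "model_output",
        "prompt_id": prompt_id,
        "answer": answer,
    }))
    return answer
```

Logging happens at the last step before the model call, after every template, retrieval, and truncation stage has run. Logged any earlier, the string you re-read may not be the one the model received.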

If your application makes context construction observable, debugging gets easy. If it does not, every wrong answer feels like the model "just got it wrong." Almost none of them did.