For the first few years of working with large language models, the job was prompt engineering: read the task, write the cleverest instruction, ship the model. As context windows grew from 4k to 200k to a million tokens, and as tool use and retrieval became standard, the bottleneck moved. Output quality stopped depending mostly on how you worded the instruction and started depending mostly on what else lived in the window alongside it.

That second job is what people now call context engineering.

The working definition

Prompt engineering is sentence-level. Context engineering is system-level.

Concretely, context engineering covers:

  • Selection — which retrieved chunks, which past conversation turns, which tool results, which examples actually make it into the window.
  • Ordering — where each piece goes. Instruction first or last? Retrieved chunks before or after the question? System prompt how long?
  • Compaction — when the history overflows, what gets summarized, what gets dropped, what gets kept verbatim.
  • Allocation — system prompt vs user prompt vs assistant history vs tool results. How many tokens does each get.
  • Memory — what persists across sessions, what is rebuilt each time, what gets indexed for retrieval later.

A prompt engineer asks "what should the instruction say?" A context engineer asks "what should the whole input look like by the time the model sees it?"

Why the shift happened

Three things converged.

Context windows grew, but using them well stayed hard. Frontier models advertise 200k, 1M, even 10M tokens. Research like Liu et al.'s "Lost in the Middle" showed that the advertised window and the usable window are not the same number. Filling a long context is easy; making the model actually use the middle of it is not. Choosing what to include became more important than how to phrase the include.

Retrieval became standard. Once your application is reading from a knowledge base, the question stops being "what is the best phrasing?" and starts being "which five passages out of 50,000 should the model see?" That is a context problem.

Agents and tool use multiplied the input. Every tool call produces an observation that becomes part of the next turn's context. By turn 10 of a chatty loop, you can be carrying tens of thousands of tokens of intermediate state, and most of the model's output quality depends on what survived from earlier turns. That is a context problem too.

What a context-engineering mindset looks like

A few practical heuristics that fall out of the discipline:

  • Treat the context window as a budget, not a buffer. You have N tokens. Allocate them deliberately. Track where they go. When you overflow, decide what to cut rather than letting the API decide.
  • Position matters. Critical instructions at the top of the system message. Critical context immediately before the model's turn. The middle of a long context is the part you trust least.
  • Smaller is often better. Adding more retrieved chunks does not monotonically improve answers; past a few well-chosen passages, additional chunks usually hurt. Test before you stuff.
  • Compact aggressively in long sessions. Summaries beat verbatim history once you cross a few thousand tokens of chat. Drop tool observations that nothing downstream needs.
  • Make context construction observable. Log what went into the window. When the model is wrong, ninety percent of the time you can see why by reading the actual input it saw — not the input you thought it saw.

The rest of this cluster

Each linked article goes deeper on one piece:

Prompt engineering will not go away — the four levers from How to Prompt AI Effectively still matter. They just stopped being the whole job.