An LLM with no memory is a stateless function. An LLM with infinite raw context is impractical (and would suffer context rot anyway). Production applications sit in the middle: they keep just enough information accessible to the model to be useful, and shed the rest.

The discipline of building that middle layer is memory engineering — a sub-genre of context engineering, important enough to deserve its own page.

Two layers

In every real LLM app, memory splits into two distinct layers with different costs and different patterns:

Short-term memory lives within a single session — one conversation, one task, one agent run. It is bounded by the context window and has to fit on every turn. The challenge is keeping the recent conversation usable without overflowing.

Long-term memory persists across sessions: when a new session opens, facts from earlier sessions come back. Storage is effectively unbounded, so retrieval has to be selective, and what it returns has to still be current.

Most production systems implement both. Almost no systems get the second one right without explicit design.

Short-term: three patterns

Sliding window. Keep the last k turns verbatim, drop the rest. Cheap, predictable, dominant in production. LangChain calls this ConversationBufferWindowMemory. The trade-off is that everything older than k turns is invisible to the model — fine for casual chat, broken for long workflows.
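A minimal sketch of the sliding window, independent of any framework. The class name SlidingWindowMemory and the message shape are assumptions made for illustration, and a "turn" here means a single message; count user/assistant pairs instead if that fits your app better.

```python
from collections import deque


class SlidingWindowMemory:
    """Keeps only the last k turns; anything older is dropped."""

    def __init__(self, k: int):
        # A deque with maxlen evicts the oldest item automatically on append.
        self.turns = deque(maxlen=k)

    def add(self, role: str, content: str) -> None:
        self.turns.append({"role": role, "content": content})

    def messages(self) -> list:
        # What would be sent to the model on the next call.
        return list(self.turns)


memory = SlidingWindowMemory(k=2)
memory.add("user", "My order number is 1234.")
memory.add("assistant", "Thanks, looking it up.")
memory.add("user", "Any update?")
# The order number is now outside the window and invisible to the model.
```

The last comment is the trade-off in miniature: the fact that mattered most was the first thing evicted.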

Rolling summary. Periodically condense the older conversation into a short summary; carry the summary plus the most recent turns. LangChain's ConversationSummaryBufferMemory does this directly: when the token budget is exceeded, it asks an LLM to update the summary and drops the verbatim older messages. The summary becomes the model's view of the deep history; the recent window keeps the texture of the immediate conversation.
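The same idea can be sketched without LangChain. Everything below is an illustrative assumption rather than LangChain's implementation: count_tokens is a crude word count standing in for a real tokenizer, and fake_summarize stands in for the LLM call that would rewrite the summary.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Crude stand-in: whitespace word count. Swap in a real tokenizer.
    return len(text.split())


def fake_summarize(previous: str, evicted: List[str]) -> str:
    # Placeholder for an LLM call that folds evicted turns into the summary.
    prefix = previous + " | " if previous else ""
    return prefix + " / ".join(evicted)


class RollingSummaryMemory:
    def __init__(self, summarize: Callable[[str, List[str]], str], max_tokens: int):
        self.summarize = summarize
        self.max_tokens = max_tokens
        self.summary = ""
        self.recent: List[str] = []

    def add(self, turn: str) -> None:
        self.recent.append(turn)
        evicted = []
        # Evict oldest turns until the verbatim window fits the budget,
        # always keeping at least the newest turn.
        while len(self.recent) > 1 and sum(map(count_tokens, self.recent)) > self.max_tokens:
            evicted.append(self.recent.pop(0))
        if evicted:
            self.summary = self.summarize(self.summary, evicted)

    def context(self) -> str:
        parts = ["Summary so far: " + self.summary] if self.summary else []
        return "\n".join(parts + self.recent)
```

Summarizing only when the budget is exceeded keeps the extra LLM calls proportional to overflow, not to the number of turns.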

Hierarchical summary. Same idea, layered. Recent turns verbatim, last-hour turns summarized briefly, last-day turns summarized very briefly, all-time summary as a one-paragraph user profile. Each layer gets a smaller token budget. Useful for very long-running agents; over-engineered for chat.
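One way to picture the layering is as a table of age cutoffs and budgets. The sketch below is an assumption-heavy illustration: the cutoffs, the budgets, and the names Layer and build_context are all hypothetical, and truncate_words is a stand-in for the LLM summarizer a real system would call per layer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Layer:
    name: str
    max_age_s: float             # turns up to this age fall into the layer
    budget_words: Optional[int]  # None means "keep verbatim"


# Hypothetical cutoffs and budgets; older layers get smaller budgets.
LAYERS = [
    Layer("recent", 5 * 60, None),
    Layer("last_hour", 3600, 80),
    Layer("last_day", 86400, 40),
    Layer("all_time", float("inf"), 20),
]


def layer_for(age_s: float) -> Layer:
    for layer in LAYERS:
        if age_s <= layer.max_age_s:
            return layer
    return LAYERS[-1]


def truncate_words(text: str, budget: int) -> str:
    # Stand-in for an LLM summarizer that respects a budget.
    return " ".join(text.split()[:budget])


def build_context(turns, now: float, compress=truncate_words) -> dict:
    """turns is a list of (unix_timestamp, text) pairs."""
    buckets = {layer.name: [] for layer in LAYERS}
    for ts, text in turns:
        buckets[layer_for(now - ts).name].append(text)
    context = {}
    for layer in reversed(LAYERS):  # oldest layer first, so it reads chronologically
        texts = buckets[layer.name]
        if not texts:
            continue
        joined = "\n".join(texts)
        context[layer.name] = (
            joined if layer.budget_words is None else compress(joined, layer.budget_words)
        )
    return context
```

Even in sketch form the bookkeeping is visible, which is why this is overkill for ordinary chat.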

The rule of thumb: if your sessions are short, use a sliding window. If they get long enough that you start dropping facts that should be remembered, add a rolling summary. Reach for hierarchical only when both are insufficient.

Long-term: vector store + episodic notes

The 2024–2026 wave of memory frameworks converges on one dominant pattern:

  1. After each session (or each meaningful turn), the system writes one or more notes to a persistent store. Notes are short summaries of what happened: "user prefers metric units," "decided to go with option B because of cost," "expressed frustration with X."
  2. The notes are embedded and indexed in a vector store, usually keyed by user / project / agent.
  3. On a new turn, the system retrieves the k most relevant notes (by embedding similarity to the current query) and includes them in the context window.
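The three steps above can be sketched in a few dozen lines. This is a toy under stated assumptions: embed is a bag-of-words counter standing in for a real embedding model, NoteStore is an in-memory dictionary standing in for a vector database, and both names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would call an embedding model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class NoteStore:
    def __init__(self):
        self.notes = {}  # user_id -> list of (text, vector)

    def write(self, user_id: str, text: str) -> None:
        # Steps 1 and 2: persist the note and index its embedding under the user key.
        self.notes.setdefault(user_id, []).append((text, embed(text)))

    def retrieve(self, user_id: str, query: str, k: int = 3) -> list:
        # Step 3: rank this user's notes by similarity to the current query.
        q = embed(query)
        scored = [(cosine(q, vec), text) for text, vec in self.notes.get(user_id, [])]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for score, text in scored[:k] if score > 0]


store = NoteStore()
store.write("u1", "user prefers metric units")
store.write("u1", "decided to go with option B because of cost")
hits = store.retrieve("u1", "which units does the user want", k=1)
```

The retrieved notes would then be prepended to the prompt alongside the short-term window.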

That is the entire pattern. Differences across frameworks are mostly differences in how aggressively the framework auto-extracts and auto-writes:

  • Mem0 focuses on automatic extraction: it watches the conversation and writes structured memory entries without explicit calls.
  • Letta (formerly MemGPT) treats the LLM itself as a memory manager — the model uses tool calls to read and write its own long-term store.
  • LangChain and LlamaIndex provide primitives (memory modules, vector stores) and let you wire the policy yourself.
  • Anthropic's memory tool (and similar provider-native primitives) puts read/write operations behind the standard tool-use API so the model can manage simple cases without an external framework.

Where it goes wrong

Three failure modes show up repeatedly:

Memory pollution. The store fills with low-quality notes — "user said okay," "agent confirmed the request" — that crowd out the important ones. Fix: be selective about what gets written. Most turns do not produce a memorable fact.
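A cheap write gate catches the worst offenders before they reach the store. The sketch below is a heuristic illustration only: should_write, the LOW_SIGNAL pattern, and the word threshold are hypothetical, and many systems would instead (or additionally) ask an LLM whether a turn produced a memorable fact.

```python
import re

# Hypothetical patterns for notes that record an acknowledgment, not a fact.
LOW_SIGNAL = re.compile(
    r"\b(said (ok|okay|yes|no|thanks)|confirmed the request|acknowledged)\b",
    re.IGNORECASE,
)


def should_write(note: str, existing: list, min_words: int = 3) -> bool:
    text = note.strip()
    if len(text.split()) < min_words:
        return False  # too short to carry a fact
    if LOW_SIGNAL.search(text):
        return False  # acknowledgment, not information
    if text.lower() in {e.strip().lower() for e in existing}:
        return False  # exact duplicate of a stored note
    return True
```

The default answer should be "do not write": a gate that rejects too much loses a few facts, while a gate that accepts too much degrades every future retrieval.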

Wrong-context retrieval. Embedding similarity retrieves notes that sound related but are not. A note about a totally different project gets surfaced because it shares vocabulary. Fix: scope retrieval by user/project keys; add recency weighting; consider a hybrid of vector + keyword retrieval.
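Scoping and recency weighting can be combined in one ranking step. In this sketch the Note fields, the rank function, and the 30-day half-life are all assumptions chosen for illustration, and the similarity score is taken as already computed against the current query.

```python
from dataclasses import dataclass


@dataclass
class Note:
    text: str
    user_id: str
    project: str
    created_at: float  # unix seconds
    similarity: float  # similarity to the current query, computed upstream


def rank(notes, user_id, project, now, k=3, half_life_days=30.0):
    # Hard scope first: notes from other users or projects never compete.
    in_scope = [n for n in notes if n.user_id == user_id and n.project == project]

    def score(n: Note) -> float:
        age_days = max(0.0, now - n.created_at) / 86400
        # Exponential decay: a note loses half its weight every half_life_days.
        return n.similarity * 0.5 ** (age_days / half_life_days)

    return sorted(in_scope, key=score, reverse=True)[:k]
```

Filtering before scoring matters: a hard scope cannot be outvoted by a high similarity score, whereas a soft penalty can.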

Stale memory. Old notes that were once true (user preferences, decisions, identities) keep getting retrieved long after they have been superseded. Fix: timestamp notes; explicitly invalidate or update; on retrieval, prefer recent over old when they conflict.
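Explicit invalidation is easiest when notes about the same thing share a key. The sketch below assumes such a key exists (for example "units" or "shipping_address"); Fact and FactStore are hypothetical names, and deciding which key a free-text note belongs to is the hard part this code does not solve.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    key: str
    value: str
    updated_at: float
    active: bool = True


class FactStore:
    def __init__(self):
        self.history = []  # superseded facts are kept for auditing

    def upsert(self, key: str, value: str, now: float) -> None:
        # Invalidate whatever was previously believed about this key.
        for fact in self.history:
            if fact.key == key and fact.active:
                fact.active = False
        self.history.append(Fact(key, value, now))

    def current(self, key: str):
        live = [f for f in self.history if f.key == key and f.active]
        # If several are somehow active, prefer the most recent.
        return max(live, key=lambda f: f.updated_at).value if live else None
```

Keeping superseded facts in the history, marked inactive, makes it possible to explain later why the model believed something.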

How to start

The boring version that works for most apps:

  1. Sliding window of the last k turns.
  2. Vector store keyed by user ID; write one or two short summary notes after each meaningful turn; retrieve top 3–5 by similarity at the start of each new turn.
  3. Log everything that gets read or written, so you can debug when memory misbehaves.
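Step 3 is the one most often skipped, and it costs only a few lines. A minimal sketch using the standard logging module; the function name log_memory_event and the record fields are assumptions, so adapt them to whatever your log pipeline expects.

```python
import json
import logging
import time

logger = logging.getLogger("memory")


def log_memory_event(op: str, user_id: str, payload: dict) -> dict:
    """Emit one structured log line per memory read or write."""
    record = {"ts": time.time(), "op": op, "user_id": user_id, **payload}
    logger.info(json.dumps(record))
    return record


# Example calls around a retrieval and a write:
log_memory_event("read", "u1", {"query": "which units?", "hits": ["user prefers metric units"]})
log_memory_event("write", "u1", {"note": "decided to go with option B because of cost"})
```

Logging the query together with the hits is what makes wrong-context retrieval debuggable after the fact.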

Start there. Add rolling summaries if sessions get long. Add a framework if you find yourself reinventing one. Most "advanced memory architecture" content online is a solution looking for a problem you do not have yet.