As of 2026-05-27

Prompt caching is the single most underused LLM cost lever in production today. Reads on a cached prefix cost meaningfully less than fresh input on every major provider that ships the feature. Teams skip it mostly because the four major providers implement it differently enough that the docs are confusing on first read.

This article is the working engineer's read of how each provider's caching actually works, what counts as a cache hit, and when it saves real money. The mechanics are stable; the exact pricing multipliers are not — every provider publishes current numbers, and they have moved more than once since each feature shipped.

What "prompt caching" actually means

The underlying mechanic is the same everywhere: during prefill the model computes attention key/value tensors for the input tokens (the KV cache), and the provider keeps that state around for a while. If your next request starts with exactly the same prefix, the provider can skip redoing that prefill and charge you less for those tokens.

The shape of the bill across providers in 2026:

  • Cache read (hit on an existing cached prefix): discounted relative to fresh input. The exact multiplier is model- and provider-specific; assume "much cheaper than fresh, much more expensive than free."
  • Cache write (first time the prefix is submitted): equal to fresh input on some providers, more on others (notably Anthropic, which charges a premium on cache writes that is larger for its longer TTL mode).
  • TTL: best treated as "minutes, not hours" by default, with explicit longer modes available on some providers.

The differences between providers are in how you trigger caching, in the minimum cacheable size, and in what visibility you get into hits — not in whether the underlying optimization works.

OpenAI: automatic

OpenAI prompt caching, introduced in October 2024, runs automatically on supported models: GPT-4o family, GPT-4.1, the o-series reasoning models, and subsequent generations. No code changes. The rules:

  • Caching is matched on prefix: the hit covers the longest leading run of your prompt that is identical to something already in the cache. Any change earlier in the prompt invalidates everything after it.
  • A minimum cacheable prefix length applies, originally documented at 1,024 tokens but per-model and subject to change. Prompts shorter than the minimum do not cache.
  • TTL is treated as an internal optimization rather than a public contract. In practice it behaves on the order of minutes, with each hit refreshing the lifetime.
  • Cached input is billed at a discounted rate versus standard input; the exact discount varies by model and is published on the OpenAI pricing page.

Practical implication: put everything stable at the front of your prompt. System prompt, then tool definitions, then RAG context that does not change between calls, then the user message at the very end. Reorganizing prompts so the variable part is last is the single biggest change that makes OpenAI caching effective.
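The ordering rule can be illustrated with a small sketch. It is a simplification: providers match on tokens, not on whole prompt parts, and the helper names and sample strings below are hypothetical.

```python
from itertools import takewhile


def build_prompt(system: str, tools: str, stable_context: str, user_message: str) -> list[str]:
    """Order prompt parts from most stable to most variable."""
    return [system, tools, stable_context, user_message]


def shared_prefix_parts(previous: list[str], current: list[str]) -> int:
    """Count the leading parts that are identical in both prompts."""
    pairs = takewhile(lambda pair: pair[0] == pair[1], zip(previous, current))
    return sum(1 for _ in pairs)


# Hypothetical prompt content, for illustration only.
SYSTEM = "You are a support assistant for ExampleCo."
TOOLS = "tool: lookup_order(order_id)"
DOCS = "Refund policy: 30 days."

first = build_prompt(SYSTEM, TOOLS, DOCS, "Where is my order?")
second = build_prompt(SYSTEM, TOOLS, DOCS, "Can I get a refund?")
print(shared_prefix_parts(first, second))  # 3: everything before the user message

# Putting the variable part first breaks the shared prefix immediately.
bad_first = ["Where is my order?", SYSTEM, TOOLS, DOCS]
bad_second = ["Can I get a refund?", SYSTEM, TOOLS, DOCS]
print(shared_prefix_parts(bad_first, bad_second))  # 0: nothing is reusable
```

The same content produces either a three-part reusable prefix or none at all, depending only on where the user message sits.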

Anthropic: explicit cache_control blocks

Anthropic shipped prompt caching in beta in August 2024 and made it generally available in December 2024. It is explicit: you put a cache_control field on a content block, and everything in the request up to and including that block becomes the cached prefix.

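import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4",  # placeholder; substitute a current model ID
    max_tokens=1024,  # required by the Messages API
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Everything up to and including this block is cached.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_message}],
)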

Notable details:

  • A minimum cacheable size applies (around the 1,024-token mark for Sonnet-class models, higher on Haiku-class; the per-model values are in the Anthropic docs). Below the minimum, the marker is silently ignored.
  • Two TTL modes exist: the default (roughly five minutes, refreshed on each hit) and a longer 1-hour TTL that launched in beta. Both are requested through the same ephemeral cache_control type, and the longer mode is priced higher on writes.
  • Up to four cache breakpoints per request, so you can have nested layers (e.g. a stable system prompt + semi-stable conversation history + variable user message).
  • Current per-token multipliers for cache reads, ephemeral writes, and 1-hour writes are listed on the Anthropic pricing page — check there rather than hardcoding numbers.

The explicit model is more code but also more controllable. You always know what is being cached and what is not.
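A layered setup with two of the four available breakpoints might look like the sketch below. It only builds the keyword arguments for client.messages.create and makes no network call; build_request and with_breakpoint are hypothetical helper names, and the model string is a placeholder.

```python
def with_breakpoint(block: dict) -> dict:
    """Return a copy of a content block marked as a cache breakpoint."""
    return {**block, "cache_control": {"type": "ephemeral"}}


def build_request(system_prompt: str, history: list[dict], user_message: str) -> dict:
    """Build kwargs for client.messages.create with two cache breakpoints."""
    messages = [dict(m) for m in history]  # shallow copies; history is not mutated
    if messages:
        last = messages[-1]
        blocks = last["content"]
        # String content must become a block list to carry cache_control.
        if isinstance(blocks, str):
            blocks = [{"type": "text", "text": blocks}]
        else:
            blocks = [dict(b) for b in blocks]
        # Breakpoint 2: end of the semi-stable conversation history.
        blocks[-1] = with_breakpoint(blocks[-1])
        last["content"] = blocks
    # The new user message comes after the last breakpoint and is never cached.
    messages.append({"role": "user", "content": user_message})
    return {
        "model": "claude-sonnet-4",  # placeholder; substitute a current model ID
        "max_tokens": 1024,
        # Breakpoint 1: end of the stable system prompt.
        "system": [with_breakpoint({"type": "text", "text": system_prompt})],
        "messages": messages,
    }


history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello, how can I help?"},
]
request = build_request("LONG SYSTEM PROMPT", history, "What is your refund policy?")
```

On the next turn the history breakpoint moves forward by one exchange, while the system-prompt breakpoint keeps hitting as long as the system prompt is unchanged.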

Google Gemini: explicit cached content

Google launched context caching for Gemini in June 2024. It is the most explicit of the four — you create a CachedContent object with the content you want cached, set a TTL directly, and reference its handle in subsequent requests.

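import datetime

import google.generativeai as genai
from google.generativeai import caching

# Legacy google-generativeai SDK; assumes the API key is already configured.
# The newer google-genai SDK offers client.caches.create for the same purpose.
cache = caching.CachedContent.create(
    model="gemini-2.5-pro",
    contents=[long_document],
    ttl=datetime.timedelta(minutes=10),
)
# Bind a model to the cache; requests made through it reuse the cached tokens.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("Summarize section 3")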

The differences worth knowing:

  • You pay storage for as long as the cache lives, on top of the per-token input/output charges. Storage is priced per token per hour of cache lifetime (see the Gemini API and Vertex AI pricing pages) and is non-trivial for large caches kept for hours or days.
  • You set the TTL directly. Useful for batch workloads where you want a cache to live exactly as long as the job.
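  • Minimum cache size launched far higher than OpenAI's or Anthropic's (32,768 tokens) and has since come down for newer models, but explicit caches still pay off mainly for large contexts such as whole document corpora rather than short system prompts. Current minimums and per-model pricing are on the same Vertex pricing page.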

Gemini's explicit caching is best when you have a very large stable context (a multi-megabyte document, a long codebase) and you want to query it dozens of times. For a short system prompt it is the wrong tool; rely on automatic prefix caching instead.
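Because storage is billed for the cache's whole lifetime, it is worth checking that the read savings outweigh it. The sketch below is a rough model under stated assumptions: cache creation is billed once at the normal input rate, every query then reads the cached tokens at the discounted rate, and all rates are hypothetical placeholders expressed per million tokens.

```python
def explicit_cache_net_saving(
    cached_tokens: int,
    queries: int,
    hours_alive: float,
    input_rate: float,            # price per 1M fresh input tokens (hypothetical)
    read_rate: float,             # price per 1M cached tokens read (hypothetical)
    storage_rate_per_hour: float, # price per 1M cached tokens per hour (hypothetical)
) -> float:
    """Net saving from an explicit cache versus resending the context every time."""
    millions = cached_tokens / 1_000_000
    uncached = millions * queries * input_rate
    cached = (
        millions * input_rate                           # one-time cache creation
        + millions * queries * read_rate                # discounted reads
        + millions * hours_alive * storage_rate_per_hour  # storage for the lifetime
    )
    return uncached - cached


# A 500k-token document queried 40 times within one hour.
busy = explicit_cache_net_saving(500_000, 40, 1.0, 1.00, 0.25, 1.00)

# The same document queried once, with the cache left alive for a day.
idle = explicit_cache_net_saving(500_000, 1, 24.0, 1.00, 0.25, 1.00)

print(busy, idle)  # positive for the busy case, negative for the idle one
```

A negative result means the cache cost more than it saved, which is the usual outcome when a long TTL is set and the cache is then barely used.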

DeepSeek: automatic, with a low floor

DeepSeek launched context caching in August 2024. Like OpenAI, it is automatic: the API detects shared prefixes across your requests and charges cache reads at a discounted rate, with the current multiplier on the DeepSeek pricing page. The minimum prefix length is materially shorter than OpenAI's, which makes DeepSeek caching useful even for moderately short prompts that see repetition.

When caching saves real money

The break-even arithmetic is simple. For a workload with a stable prefix of size P and a variable suffix of size V, the cost of N calls is roughly:

  • Without caching: N × (P + V) × input_rate
  • With caching: P × write_rate + (N - 1) × P × read_rate + N × V × input_rate
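The arithmetic can be sketched as a few functions. This assumes the first call writes only the prefix at the write rate and pays the normal input rate for the suffix; the multipliers used in the example (1.25 for writes, 0.10 for reads) are illustrative assumptions, not any provider's published prices.

```python
def cost_without_cache(n: int, prefix: int, suffix: int, input_rate: float) -> float:
    """Every call pays the full input rate for prefix and suffix."""
    return n * (prefix + suffix) * input_rate


def cost_with_cache(
    n: int, prefix: int, suffix: int,
    input_rate: float, write_rate: float, read_rate: float,
) -> float:
    """First call writes the prefix; later calls read it. The suffix is never cached."""
    first_call = prefix * write_rate + suffix * input_rate
    later_calls = (n - 1) * (prefix * read_rate + suffix * input_rate)
    return first_call + later_calls


def saving_fraction(n, prefix, suffix, input_rate, write_rate, read_rate) -> float:
    """Share of the uncached bill that caching removes (negative means a loss)."""
    base = cost_without_cache(n, prefix, suffix, input_rate)
    return 1 - cost_with_cache(n, prefix, suffix, input_rate, write_rate, read_rate) / base


def break_even_calls(prefix, suffix, input_rate, write_rate, read_rate, max_calls=1000):
    """Smallest number of calls at which caching is strictly cheaper, or None."""
    for n in range(1, max_calls + 1):
        cached = cost_with_cache(n, prefix, suffix, input_rate, write_rate, read_rate)
        if cached < cost_without_cache(n, prefix, suffix, input_rate):
            return n
    return None


# Illustrative rates: input 1.0 per token unit, writes at 1.25x, reads at 0.10x.
RATES = dict(input_rate=1.0, write_rate=1.25, read_rate=0.10)
for n in (1, 2, 5, 10, 100):
    print(n, round(saving_fraction(n, 4000, 200, **RATES), 3))
print("break-even at call", break_even_calls(4000, 200, **RATES))
```

Under these assumed multipliers a single call loses money to the write premium, and the second call inside the TTL window already puts the workload ahead.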

The savings grow as N grows. With typical read multipliers, you collect most of the maximum saving within roughly five to ten reuses inside the TTL window. Workloads that look great for caching:

  • Customer support assistants with a long stable system prompt asked dozens of times per minute.
  • Coding agents with a large stable codebase indexed into the prompt.
  • RAG pipelines with a fixed instructional prefix and changing retrieved context — cache up to the retrieval boundary.
  • Eval runs that repeatedly score outputs against the same rubric.

Workloads where caching does little:

  • One-shot user queries with no repeated prefix.
  • Highly variable prompts where every part changes between calls.
  • Short prompts below the provider minimum.

Verification is the part most teams skip. Every provider exposes some kind of cache-hit accounting: OpenAI reports cached_tokens under usage.prompt_tokens_details in the API response, Anthropic surfaces cache_creation_input_tokens and cache_read_input_tokens in the API response, Gemini reports cached_content_token_count in usage_metadata, and DeepSeek returns prompt_cache_hit_tokens and prompt_cache_miss_tokens in the usage object. Read those numbers. The bill is the only honest answer to "is caching actually working here?"
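One way to track this is to normalize each provider's usage payload into a single cached-input fraction and log it per request. The sketch below works on usage data already converted to a plain dict; the function name is hypothetical, the sample numbers are made up, and the field names should be checked against each provider's current API reference.

```python
def cached_fraction(provider: str, usage: dict) -> float:
    """Fraction of input tokens served from cache, given a usage payload as a dict."""
    if provider == "openai":
        # prompt_tokens already includes the cached tokens.
        total = usage["prompt_tokens"]
        cached = (usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0)
    elif provider == "anthropic":
        cached = usage.get("cache_read_input_tokens", 0)
        written = usage.get("cache_creation_input_tokens", 0)
        # input_tokens excludes cache reads and writes, so add them back.
        total = usage["input_tokens"] + cached + written
    elif provider == "gemini":
        # prompt_token_count is assumed to include the cached content tokens.
        total = usage["prompt_token_count"]
        cached = usage.get("cached_content_token_count", 0)
    elif provider == "deepseek":
        cached = usage.get("prompt_cache_hit_tokens", 0)
        total = cached + usage.get("prompt_cache_miss_tokens", 0)
    else:
        raise ValueError(f"unknown provider: {provider}")
    return cached / total if total else 0.0


# Made-up usage payloads for illustration.
openai_usage = {"prompt_tokens": 2000, "prompt_tokens_details": {"cached_tokens": 1536}}
anthropic_usage = {
    "input_tokens": 50,
    "cache_read_input_tokens": 1900,
    "cache_creation_input_tokens": 50,
}
print(cached_fraction("openai", openai_usage))
print(cached_fraction("anthropic", anthropic_usage))
```

A fraction that stays near zero on a workload expected to cache well usually points to a prefix that changes between calls or a prompt below the provider minimum.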