Subject
Where your LLM bill actually comes from, and what to do about it. Prompt caching across providers, when to drop to a smaller model, streaming vs batching, latency budgets, and cutting tokens without cutting accuracy.
As of 2026-05-27
Your monthly LLM bill is, almost without exception, four line items in a trench coat:
That is it. Every other variable (concurrency, streaming, batch APIs, model size) modifies one of those four numbers; it does not add a fifth.
The reason teams overspend is rarely that their prompts are bloated. It is almost always one of two things: they default to the largest model when a smaller one would have worked, or they pay full price for tokens they could have cached. The rest of this article is the decomposition that lets you see which one (or both) is happening to you.
Each major provider segments its lineup into roughly three tiers — though the names and the exact ratios move from generation to generation. The current shape across OpenAI, Anthropic, and Google:
The model names move; the structure does not. Always check the current per-token rates on the OpenAI, Anthropic, and Google AI pricing pages before sizing a workload. The mental model that survives generations: each step up the ladder is roughly an order of magnitude more expensive, and reasoning is a further multiplier on top.
The first useful exercise on a real bill: pull last month's usage by model and by direction. Most providers expose this in the dashboard or via the usage endpoints. You want a table like this:
| Model | Input tokens | Output tokens | Reasoning tokens | Spend |
|---|---|---|---|---|
| GPT-4o-mini | 4.2B | 280M | — | $X |
| GPT-4o | 90M | 18M | — | $Y |
| o1-mini | 4M | 1M | 9M | $Z |
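A table like the one above can be assembled from raw usage records with a few lines of code. The sketch below is illustrative only: the `UsageRecord` shape, the model names, and every rate in `RATES_PER_MTOK` are hypothetical placeholders, not real prices or a real provider export format. It also assumes reasoning tokens are billed as their own category, which may not match how a given provider reports them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class UsageRecord:
    """One row of a usage export (hypothetical shape)."""
    model: str
    input_tokens: int
    output_tokens: int
    reasoning_tokens: int = 0


# Placeholder USD rates per million tokens. Replace with current published prices.
RATES_PER_MTOK = {
    "small-model": {"input": 0.10, "output": 0.40, "reasoning": 0.40},
    "large-model": {"input": 2.00, "output": 8.00, "reasoning": 8.00},
}


def summarize(records):
    """Aggregate token counts per model and direction, then price them."""
    totals = defaultdict(lambda: {"input": 0, "output": 0, "reasoning": 0})
    for r in records:
        t = totals[r.model]
        t["input"] += r.input_tokens
        t["output"] += r.output_tokens
        t["reasoning"] += r.reasoning_tokens

    rows = {}
    for model, t in totals.items():
        rates = RATES_PER_MTOK[model]
        spend = sum(t[k] * rates[k] for k in t) / 1_000_000
        rows[model] = {**t, "spend": spend}
    return rows


if __name__ == "__main__":
    sample = [
        UsageRecord("small-model", 1_000_000, 500_000),
        UsageRecord("large-model", 500_000, 250_000, 250_000),
    ]
    for model, row in summarize(sample).items():
        print(model, row)
```

Splitting by direction matters because input and output are priced differently, so a per-model total alone hides where the spend comes from.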
Three things almost always jump out:
Once you can see the bill, every cost intervention is one of these:
Cut tokens. Less input, less output. Tighter system prompts, retrieval over context dumping, output length caps, structured output instead of prose where possible. See Cutting Tokens Without Cutting Accuracy.
Cut the rate. Use a smaller model where the smaller model is sufficient. The cascade pattern (small model first, escalate only when confidence is low) captures most of the win. See When to Use a Smaller Model.
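A minimal sketch of the cascade idea, under stated assumptions: `small` and `large` are hypothetical callables that wrap whichever models you use and return an answer together with a confidence score in [0, 1]. How that score is produced (log-probabilities, a validator, a self-reported rating) is left open here, and the 0.8 threshold is an arbitrary starting point rather than a recommendation.

```python
from typing import Callable, Tuple

# Hypothetical interface: prompt in, (answer, confidence in [0, 1]) out.
ModelFn = Callable[[str], Tuple[str, float]]


def cascade(prompt: str, small: ModelFn, large: ModelFn,
            threshold: float = 0.8) -> Tuple[str, str]:
    """Try the small model first; escalate only when its confidence is low.

    Returns (answer, tier) so the caller can log how often escalation happens.
    """
    answer, confidence = small(prompt)
    if confidence >= threshold:
        return answer, "small"
    answer, _ = large(prompt)
    return answer, "large"


if __name__ == "__main__":
    # Stub models standing in for real API calls.
    def small_stub(prompt: str) -> Tuple[str, float]:
        return ("short answer", 0.95 if len(prompt) < 40 else 0.3)

    def large_stub(prompt: str) -> Tuple[str, float]:
        return ("careful answer", 0.99)

    print(cascade("What is 2 + 2?", small_stub, large_stub))
    print(cascade("x" * 100, small_stub, large_stub))
```

Logging the returned tier gives you the escalation rate, which is the number that decides whether the cascade actually saves money.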
Cache the stable parts. Long stable prefixes (system prompts, RAG context, document corpora, few-shot examples) belong in the cache. Every major provider offers some form of prompt or context caching now; the mechanics and exact discounts differ. See Prompt Caching, Explained.
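To get a feel for what caching a stable prefix is worth, a rough estimate like the following can help. It is a simplification under stated assumptions: `cached_read_multiplier` (the fraction of the normal input rate charged for a cache hit) and `hit_rate` are inputs you supply from your provider's pricing and your own traffic, and any surcharge on cache writes is ignored.

```python
def input_cost_per_request(prefix_tokens: int, suffix_tokens: int,
                           rate_per_mtok: float, hit_rate: float,
                           cached_read_multiplier: float) -> float:
    """Expected input cost in USD for one request with a cacheable prefix.

    prefix_tokens: stable part (system prompt, few-shot examples, shared context)
    suffix_tokens: per-request part that is never cached
    hit_rate: fraction of requests whose prefix is served from cache (0..1)
    cached_read_multiplier: price of a cached token relative to a normal one (assumed)
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # Billable "token equivalents": cached prefix tokens are discounted.
    cached = prefix_tokens * hit_rate * cached_read_multiplier
    uncached = prefix_tokens * (1.0 - hit_rate) + suffix_tokens
    return (cached + uncached) * rate_per_mtok / 1_000_000


if __name__ == "__main__":
    # Illustrative numbers only: 8,000-token prefix, 500-token suffix.
    without = input_cost_per_request(8000, 500, 1.0, 0.0, 0.1)
    with_cache = input_cost_per_request(8000, 500, 1.0, 0.9, 0.1)
    print(f"no cache: ${without:.6f}  with cache: ${with_cache:.6f}")
```

The longer the stable prefix is relative to the per-request suffix, the larger the share of the input bill that caching can touch.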
Batch what does not need to be realtime. OpenAI and Anthropic both offer Batch APIs that trade up to 24 hours of completion latency for a meaningful per-token discount (around 50% at the time the features launched; check the current rate). For nightly jobs, eval runs, and content pipelines this is one of the largest unconditional cost levers available. See Streaming, Batching, and Latency Budgets.
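The arithmetic behind the batch lever is simple enough to sketch. The function below assumes a flat discount on whatever fraction of spend can tolerate batch latency. The 0.5 default mirrors the roughly 50% figure mentioned above and should be replaced with the current published rate. The function name and parameters are illustrative.

```python
def spend_with_batch(total_spend: float, batchable_fraction: float,
                     batch_discount: float = 0.5) -> float:
    """Estimate spend after moving the batchable share of work to a batch API.

    total_spend: current spend at realtime rates
    batchable_fraction: share of that spend that can wait for batch completion (0..1)
    batch_discount: fractional discount on batched tokens (assumed, check current rate)
    """
    if not 0.0 <= batchable_fraction <= 1.0:
        raise ValueError("batchable_fraction must be between 0 and 1")
    if not 0.0 <= batch_discount <= 1.0:
        raise ValueError("batch_discount must be between 0 and 1")
    realtime = total_spend * (1.0 - batchable_fraction)
    batched = total_spend * batchable_fraction * (1.0 - batch_discount)
    return realtime + batched


if __name__ == "__main__":
    # Illustrative: $10,000 per month, 40% of which is nightly jobs and evals.
    print(spend_with_batch(10_000, 0.4))
```

The saving scales linearly with the batchable share, so the useful work is classifying which workloads genuinely need a realtime response.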
Most teams that take cost seriously combine all four. None of them is exotic; they are all configuration changes.
Three line items that do not show up as a "token cost" but should be in your mental model:
Optimizing the bill in isolation, without these three, ends with a slower or worse product that costs slightly less to run. The point of cost engineering is not to minimize the LLM line item. It is to maximize value per dollar across the whole system.