Cost & Performance Engineering

Where your LLM bill actually comes from, and what to do about it. Prompt caching across providers, when to drop to a smaller model, streaming vs batching, latency budgets, and cutting tokens without cutting accuracy.

As of 2026-05-27

Your monthly LLM bill is, almost without exception, four line items in a trench coat:

Input tokens × input rate. Everything you send into the model — system prompt, conversation history, retrieved context, user message.
Output tokens × output rate. Everything the model generates back at you. Billed at a higher per-token rate than input across every major provider — usually a few times more, sometimes much more on reasoning models. Check the current rates on the OpenAI, Anthropic, and Google AI pricing pages.
Reasoning tokens × output rate (if you use reasoning models). Hidden thinking tokens the user never sees, billed at the output tier on the providers that surface them. The exact billing convention varies by provider and model.
Cache reads and writes (if you enable prompt caching). Cache hits are cheaper than fresh input; cache writes are equal to or slightly more expensive than fresh input, depending on the provider and the cache duration mode.

That is it. Every other variable — concurrency, streaming, batch APIs, model size — modifies one of those four numbers; it does not add a fifth.
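The decomposition above can be written as a few lines of arithmetic. This is a minimal sketch, not any provider's billing logic: the `Rates` and `Usage` names are hypothetical, the prices in the usage example are made-up placeholders, and reasoning tokens are assumed to bill at the output rate, which you should confirm for your provider and model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical prices in dollars per million tokens (placeholders, not real rates)."""
    input: float
    output: float
    cache_read: float
    cache_write: float


@dataclass(frozen=True)
class Usage:
    input_tokens: int = 0        # fresh, uncached input
    output_tokens: int = 0       # visible output
    reasoning_tokens: int = 0    # hidden thinking; assumed billed at the output rate
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0


def bill(usage: Usage, rates: Rates) -> dict[str, float]:
    """Return the four line items plus their total, in dollars."""
    million = 1_000_000
    items = {
        "input": usage.input_tokens * rates.input / million,
        "output": usage.output_tokens * rates.output / million,
        "reasoning": usage.reasoning_tokens * rates.output / million,
        "cache": (usage.cache_read_tokens * rates.cache_read
                  + usage.cache_write_tokens * rates.cache_write) / million,
    }
    items["total"] = sum(items.values())
    return items


if __name__ == "__main__":
    # Placeholder rates chosen only to make the arithmetic easy to follow.
    rates = Rates(input=1.0, output=4.0, cache_read=0.1, cache_write=1.25)
    usage = Usage(input_tokens=2_000_000, output_tokens=500_000,
                  reasoning_tokens=250_000, cache_read_tokens=10_000_000,
                  cache_write_tokens=1_000_000)
    for name, dollars in bill(usage, rates).items():
        print(f"{name:>9}: ${dollars:.2f}")
```

Any intervention discussed later changes one of the `Usage` counts or one of the `Rates` fields; nothing adds a new key to the dictionary.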

The reason teams overspend is rarely that their prompts are bloated. It is almost always one of two things: they default to the largest model when a smaller one would have worked, or they pay full price for tokens they could have cached. The rest of this article is the decomposition that lets you see which one (or both) is happening to you.

The price ladder, sanity-checked

Each major provider segments its lineup into roughly three tiers — though the names and the exact ratios move from generation to generation. The current shape across OpenAI, Anthropic, and Google:

Small / fast tier — the smallest, fastest models in each lineup (GPT-4o-mini-class, Haiku-class, Flash-class, and open-weight equivalents like Llama and Qwen served on cheap infra). These are the workhorse models for the boring 80% of LLM calls.
Mid tier — the general-purpose default (GPT-4o-class, Sonnet-class, Pro-class). Multiple times more expensive than the small tier; where most production "frontier-ish" workloads land.
Frontier / reasoning tier — the flagship and reasoning models (Opus-class, the o-series, Gemini's Deep Think mode). Materially more expensive again, especially once reasoning tokens are counted.

The model names move; the structure does not. Always check the current per-token rates on the OpenAI, Anthropic, and Google AI pricing pages before sizing a workload. The mental model that survives generations: each step up the ladder costs several times to roughly an order of magnitude more, and reasoning is a further multiplier on top.

Reading your own bill

The first useful exercise on a real bill: pull last month's usage by model and by direction. Most providers expose this in the dashboard or via the usage endpoints. You want a table like this:

Model	Input tokens	Output tokens	Reasoning tokens	Spend
GPT-4o-mini	4.2B	280M	—	$X
GPT-4o	90M	18M	—	$Y
o1-mini	4M	1M	9M	$Z

Three things almost always jump out:

One model is dominating spend. Usually the largest one you allow your code to fall back to. Look at what tasks routed there.
Output is a bigger share of spend than its share of tokens. This is just the price ratio; it tells you that shortening responses moves the bill more than shortening inputs of equal length.
Reasoning tokens, when present, are a surprise line. Teams often discover they enabled an extended-thinking mode somewhere and have been silently paying for hidden tokens.
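The second observation is easy to check on your own numbers. The sketch below compares output's share of tokens with its share of spend for one model. The token counts are taken from the GPT-4o row of the example table; the 4:1 output-to-input price ratio is an assumption for illustration, not a quoted rate, and `direction_shares` is a hypothetical helper name.

```python
def direction_shares(input_tokens: int, output_tokens: int,
                     input_rate: float, output_rate: float) -> dict[str, float]:
    """Compare output's share of tokens with its share of spend.

    Rates can be in any unit (for example dollars per million tokens),
    because only their ratio affects the shares.
    """
    input_cost = input_tokens * input_rate
    output_cost = output_tokens * output_rate
    return {
        "output_token_share": output_tokens / (input_tokens + output_tokens),
        "output_spend_share": output_cost / (input_cost + output_cost),
    }


if __name__ == "__main__":
    # 90M input and 18M output tokens, with an assumed 4:1 price ratio.
    shares = direction_shares(90_000_000, 18_000_000,
                              input_rate=1.0, output_rate=4.0)
    print(f"output share of tokens: {shares['output_token_share']:.1%}")
    print(f"output share of spend:  {shares['output_spend_share']:.1%}")
```

Under that assumed ratio, output is about a sixth of the tokens but well over a third of the spend, which is why trimming responses tends to pay off faster than trimming inputs of the same length.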

The four leverage points

Once you can see the bill, every cost intervention is one of these:

Cut tokens. Less input, less output. Tighter system prompts, retrieval over context dumping, output length caps, structured output instead of prose where possible. See Cutting Tokens Without Cutting Accuracy.

Cut the rate. Use a smaller model where the smaller model is sufficient. The cascade pattern — small model first, escalate only on low confidence — captures most of the win. See When to Use a Smaller Model.

Cache the stable parts. Long stable prefixes (system prompts, RAG context, document corpora, few-shot examples) belong in the cache. Every major provider offers some form of prompt or context caching now; the mechanics and exact discounts differ. See Prompt Caching, Explained.

Batch what does not need to be realtime. OpenAI and Anthropic both offer Batch APIs that trade up to 24 hours of completion latency for a meaningful per-token discount (around 50% at the time the features launched; check the current rate). For nightly jobs, eval runs, and content pipelines this is one of the largest unconditional cost levers available. See Streaming, Batching, and Latency Budgets.
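For the caching lever, the question is how many times a prefix must be reused before the write premium is paid back. The sketch below works in units of the fresh-input price, so `write_mult` and `read_mult` are multipliers relative to it. The values 1.25 and 0.1 in the usage example are assumptions for illustration; substitute your provider's current figures. It also assumes the cache entry stays alive between requests, which depends on the cache duration mode.

```python
import math


def cached_cost(prefix_tokens: int, requests: int,
                write_mult: float, read_mult: float) -> float:
    """Cost of sending the prefix `requests` times with caching on.

    The first request writes the cache; every later one reads from it.
    The result is in units of the fresh-input price per token.
    """
    if requests <= 0:
        return 0.0
    return prefix_tokens * (write_mult + (requests - 1) * read_mult)


def uncached_cost(prefix_tokens: int, requests: int) -> float:
    """Cost of sending the same prefix as fresh input every time."""
    return float(prefix_tokens * max(requests, 0))


def break_even_requests(write_mult: float, read_mult: float) -> int:
    """Smallest request count N at which caching is strictly cheaper.

    Solves write_mult + (N - 1) * read_mult < N for integer N.
    """
    if not 0 <= read_mult < 1:
        raise ValueError("read_mult must be in [0, 1) for caching to pay off")
    threshold = (write_mult - read_mult) / (1 - read_mult)
    return max(1, math.floor(threshold) + 1)


if __name__ == "__main__":
    w, r = 1.25, 0.1  # assumed multipliers, not quoted prices
    print("break-even requests:", break_even_requests(w, r))
    print("cached:  ", cached_cost(10_000, 50, w, r))
    print("uncached:", uncached_cost(10_000, 50))
```

With multipliers in this range the write premium is recovered on the first reuse, so the practical question is usually whether the prefix is reused before the cache entry expires.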

Most teams that take cost seriously combine all four. None of them is exotic; they are all configuration changes.
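The cascade pattern mentioned under "Cut the rate" can be sketched as follows. Everything here is illustrative: `small` and `large` stand in for whatever client calls you use, each is assumed to return a text plus a confidence score in [0, 1], and how you obtain that score (a verifier, log-probabilities, a self-check prompt) is left open. The threshold of 0.8 is an arbitrary placeholder.

```python
from typing import Callable, Tuple

# A model call is assumed to return (text, confidence in [0, 1]).
ModelCall = Callable[[str], Tuple[str, float]]


def cascade(prompt: str, small: ModelCall, large: ModelCall,
            threshold: float = 0.8) -> Tuple[str, str]:
    """Try the small model first; escalate only when confidence is low.

    Returns the answer text and the name of the tier that produced it.
    """
    text, confidence = small(prompt)
    if confidence >= threshold:
        return text, "small"
    text, _ = large(prompt)
    return text, "large"


def expected_cost(small_cost: float, large_cost: float,
                  escalation_rate: float) -> float:
    """Average cost per request under the cascade.

    Every request pays for the small call; escalated requests pay for
    the large call as well.
    """
    return small_cost + escalation_rate * large_cost


if __name__ == "__main__":
    # Stub models so the sketch runs without any API client.
    def unsure_small(prompt: str) -> Tuple[str, float]:
        return "not sure", 0.4

    def large_stub(prompt: str) -> Tuple[str, float]:
        return "detailed answer", 0.95

    print(cascade("classify this ticket", unsure_small, large_stub))
    # Assumed relative costs: large is 10x small, 20% of requests escalate.
    print(expected_cost(small_cost=1.0, large_cost=10.0, escalation_rate=0.2))
```

The `expected_cost` formula shows where the saving comes from and where it stops: the cascade only beats sending everything to the large model while the escalation rate stays below `1 - small_cost / large_cost`.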

The hidden costs

Three line items that do not show up as a "token cost" but should be in your mental model:

Latency cost. Slow responses lose users. Latency is bought back by streaming (improves time-to-first-token), batching the prefill, picking faster models, and shortening output. There is a real budget here even if it does not show up as dollars.
Engineering cost. A clever 20% saving that requires a custom retry framework, a custom cache layer, and ongoing maintenance can cost more in engineering time than the dollars it saved. Pick the cheap big wins first.
Failure cost. Models that fail or hallucinate cost money you do not see — wasted user time, support tickets, downstream errors. Evals and guardrails are a real cost lever even though they do not show up on the LLM invoice.

Optimizing the bill in isolation, without these three, ends with a slower or worse product that costs slightly less to run. The point of cost engineering is not to minimize the LLM line item. It is to maximize value per dollar across the whole system.

Forthcoming

LLM Observability Tools Compared
Fallback and Cascade Patterns
Rate Limits and Quota Engineering
