Why is output so much more expensive than input?

Output tokens come from autoregressive decoding: the model generates them one at a time, each conditioned on everything that came before. Input tokens are processed together in a single forward pass over the prompt (prefill). The compute cost per output token is materially higher than per input token, and providers price accordingly. Across the major APIs in 2026, output is typically 3–5x the input rate for the same model. Doubling the length of a reply roughly doubles the output charge for that call.

Do reasoning tokens cost extra?

Yes. Reasoning models (OpenAI o-series, Anthropic extended-thinking modes, Gemini Deep Think) generate hidden thinking tokens before emitting the visible answer. These tokens are billed at the output rate even though the user never sees them, and reasoning runs can spend tens of thousands of tokens before producing a one-line answer. If you turn reasoning on, budget for it — the line item that surprises teams is "I thought this was a cheap query and the model thought for 12,000 tokens before answering."

What single change cuts the bill the most?

For most apps the highest-leverage change is dropping the default model tier for tasks that do not need frontier capability. Classification, routing, simple extraction, reformatting, short-context QA — these work well on a small model and cost a fraction. The second-highest-leverage change is enabling prompt caching for any prompt that repeats a long stable prefix (system prompts, RAG context, few-shot examples). Both changes are typically a config flag, not a refactor.

Where Your LLM Bill Actually Comes From

As of 2026-05-27

Your monthly LLM bill is, almost without exception, four line items in a trench coat:

Input tokens × input rate. Everything you send into the model — system prompt, conversation history, retrieved context, user message.
Output tokens × output rate. Everything the model generates back at you. Billed at a higher per-token rate than input across every major provider — usually a few times more, sometimes much more on reasoning models. Check the current rates on the OpenAI, Anthropic, and Google AI pricing pages.
Reasoning tokens × output rate (if you use reasoning models). Hidden thinking tokens the user never sees, billed at the output tier on the providers that surface them. The exact billing convention varies by provider and model.
Cache reads and writes (if you enable prompt caching). Cache hits are cheaper than fresh input; cache writes are equal to or slightly more expensive than fresh input, depending on the provider and the cache duration mode.

That is it. Every other variable (concurrency, streaming, batch APIs, model size) modifies one of those four numbers; it does not add a fifth.
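A minimal sketch of that four-part decomposition, in Python. The rates are placeholder numbers in dollars per million tokens, not any provider's prices, and the names `Rates`, `Usage`, and `bill` are illustrative. It also assumes reasoning tokens are reported separately from visible output; some providers fold them into the output count instead, so check how your usage data is shaped before reusing this.

```python
from dataclasses import dataclass

PER_MILLION = 1_000_000


@dataclass(frozen=True)
class Rates:
    """Hypothetical prices in dollars per million tokens (not real provider rates)."""
    input: float
    output: float
    cache_read: float
    cache_write: float


@dataclass(frozen=True)
class Usage:
    input_tokens: int = 0        # fresh, uncached input
    output_tokens: int = 0       # visible output only (assumption: excludes reasoning)
    reasoning_tokens: int = 0    # hidden thinking tokens
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0


def bill(usage: Usage, rates: Rates) -> dict[str, float]:
    """Split spend into the four line items plus a total, in dollars."""
    items = {
        "input": usage.input_tokens * rates.input / PER_MILLION,
        "output": usage.output_tokens * rates.output / PER_MILLION,
        # Assumption: reasoning tokens are billed at the output rate.
        "reasoning": usage.reasoning_tokens * rates.output / PER_MILLION,
        "cache": (
            usage.cache_read_tokens * rates.cache_read
            + usage.cache_write_tokens * rates.cache_write
        ) / PER_MILLION,
    }
    items["total"] = sum(items.values())
    return items


if __name__ == "__main__":
    example_rates = Rates(input=1.0, output=4.0, cache_read=0.1, cache_write=1.25)
    example_usage = Usage(
        input_tokens=2_000_000,
        output_tokens=500_000,
        reasoning_tokens=250_000,
        cache_read_tokens=10_000_000,
        cache_write_tokens=1_000_000,
    )
    for name, dollars in bill(example_usage, example_rates).items():
        print(f"{name:>10}: ${dollars:.2f}")
```

Keeping the line items separate, rather than returning only a total, is what makes it possible to see which of the four is driving the bill.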

The reason teams overspend is rarely that their prompts are bloated. It is almost always one of two things: they default to the largest model when a smaller one would have worked, or they pay full price for tokens they could have cached. The rest of this article is the decomposition that lets you see which one (or both) is happening to you.

The price ladder, sanity-checked

Each major provider segments its lineup into roughly three tiers — though the names and the exact ratios move from generation to generation. The current shape across OpenAI, Anthropic, and Google:

Small / fast tier — the smallest, fastest models in each lineup (GPT-4o-mini-class, Haiku-class, Flash-class, and open-weight equivalents like Llama and Qwen served on cheap infra). These are the workhorse models for the boring 80% of LLM calls.
Mid tier — the general-purpose default (GPT-4o-class, Sonnet-class, Pro-class). Multiple times more expensive than the small tier; where most production "frontier-ish" workloads land.
Frontier / reasoning tier — the flagship and reasoning models (Opus-class, the o-series, Gemini's Deep Think mode). Materially more expensive again, especially once reasoning tokens are counted.

The model names move; the structure does not. Always check the current per-token rates on the OpenAI, Anthropic, and Google AI pricing pages before sizing a workload. The mental model that survives generations: each step up the ladder costs several times more than the one below it, in some lineups close to an order of magnitude, and reasoning is a further multiplier on top.

Reading your own bill

The first useful exercise on a real bill: pull last month's usage by model and by direction. Most providers expose this in the dashboard or via the usage endpoints. You want a table like this:

Model	Input tokens	Output tokens	Reasoning tokens	Spend
GPT-4o-mini	4.2B	280M	—	$X
GPT-4o	90M	18M	—	$Y
o1-mini	4M	1M	9M	$Z

Three things almost always jump out:

One model is dominating spend. Usually the largest one you allow your code to fall back to. Look at what tasks routed there.
Output is a bigger share of spend than its share of tokens. This is just the price ratio at work: cutting a given number of output tokens saves more than cutting the same number of input tokens.
Reasoning tokens, when present, are a surprise line. Teams often discover they enabled an extended-thinking mode somewhere and have been silently paying for hidden tokens.
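As a rough illustration of that exercise, the sketch below takes the token counts from the table above and applies placeholder prices to them. The tier labels (`small-model`, `mid-model`, `reasoning-model`) stand in for the models in the table, the rates are hypothetical rather than real provider prices, and `summarize` is an illustrative helper, not part of any provider SDK.

```python
# (label, input tokens, output tokens, reasoning tokens); counts taken from the table.
USAGE = [
    ("small-model", 4_200_000_000, 280_000_000, 0),
    ("mid-model", 90_000_000, 18_000_000, 0),
    ("reasoning-model", 4_000_000, 1_000_000, 9_000_000),
]

# Hypothetical (input, output) prices in dollars per million tokens.
RATES = {
    "small-model": (0.10, 0.40),
    "mid-model": (2.00, 8.00),
    "reasoning-model": (1.00, 4.00),
}


def summarize(usage, rates):
    """Return per-model spend (largest first) and output's share of tokens vs. spend."""
    rows = []
    for model, tok_in, tok_out, tok_reason in usage:
        rate_in, rate_out = rates[model]
        # Assumption: reasoning tokens are billed at the output rate.
        generated = tok_out + tok_reason
        spend_in = tok_in * rate_in / 1_000_000
        spend_out = generated * rate_out / 1_000_000
        rows.append({
            "model": model,
            "tokens_in": tok_in,
            "tokens_out": generated,
            "spend_in": spend_in,
            "spend_out": spend_out,
            "spend": spend_in + spend_out,
        })
    total_spend = sum(r["spend"] for r in rows)
    total_tokens = sum(r["tokens_in"] + r["tokens_out"] for r in rows)
    return {
        "rows": sorted(rows, key=lambda r: r["spend"], reverse=True),
        "total_spend": total_spend,
        "output_token_share": sum(r["tokens_out"] for r in rows) / total_tokens,
        "output_spend_share": sum(r["spend_out"] for r in rows) / total_spend,
    }


if __name__ == "__main__":
    report = summarize(USAGE, RATES)
    for row in report["rows"]:
        print(f'{row["model"]:>16}: ${row["spend"]:,.2f}')
    print(f'output share of tokens: {report["output_token_share"]:.1%}')
    print(f'output share of spend:  {report["output_spend_share"]:.1%}')
```

Which model comes out on top depends entirely on the rates you plug in; the comparison between output's share of tokens and its share of spend is the part worth reading off your own numbers.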

The four leverage points

Once you can see the bill, every cost intervention is one of these:

Cut tokens. Less input, less output. Tighter system prompts, retrieval over context dumping, output length caps, structured output instead of prose where possible. See Cutting Tokens Without Cutting Accuracy.

Cut the rate. Use a smaller model where the smaller model is sufficient. The cascade pattern — small model first, escalate only on low-confidence — captures most of the win. See When to Use a Smaller Model.
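One way the cascade pattern could look, as a sketch under stated assumptions: each model is represented by a plain Python callable that returns an answer and a confidence score between 0 and 1. How that confidence is obtained (log-probabilities, a verifier, a self-rating) is left open here, and `Answer`, `cascade`, `expected_cost`, and the stub models are all hypothetical names for illustration.

```python
from typing import Callable, NamedTuple


class Answer(NamedTuple):
    text: str
    confidence: float  # 0.0 to 1.0; how it is measured is an assumption


Model = Callable[[str], Answer]


def cascade(prompt: str, small: Model, large: Model,
            threshold: float = 0.8) -> tuple[Answer, str]:
    """Try the small model first; escalate only when its confidence is low."""
    first = small(prompt)
    if first.confidence >= threshold:
        return first, "small"
    return large(prompt), "large"


def expected_cost(cost_small: float, cost_large: float,
                  escalation_rate: float) -> float:
    """Average cost per request: every request pays for the small call,
    and the escalated fraction also pays for the large call."""
    return cost_small + escalation_rate * cost_large


# Stub models standing in for real API calls (hypothetical behavior).
def small_stub(prompt: str) -> Answer:
    confident = len(prompt) < 40  # pretend short prompts are easy
    return Answer(f"small: {prompt}", 0.9 if confident else 0.4)


def large_stub(prompt: str) -> Answer:
    return Answer(f"large: {prompt}", 0.95)


if __name__ == "__main__":
    for p in ["Is this spam?", "Summarize the trade-offs across these three contracts in detail."]:
        answer, tier = cascade(p, small_stub, large_stub)
        print(tier, "->", answer.text)
```

Note that an escalated request pays for both calls, so the cascade only wins when the escalation rate is low enough that `expected_cost` stays below the cost of always calling the large model.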

Cache the stable parts. Long stable prefixes (system prompts, RAG context, document corpora, few-shot examples) belong in the cache. Every major provider offers some form of prompt or context caching now; the mechanics and exact discounts differ. See Prompt Caching, Explained.
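To make the trade-off concrete, here is a small break-even sketch. It assumes a cache write costs a multiple of the fresh input rate and a cache read costs a fraction of it; the default multipliers (1.25 and 0.10) are illustrative placeholders, not a quote of any provider's pricing. It also assumes every call after the first hits the cache, which only holds if calls arrive within the cache lifetime. `prefix_cost` and `break_even_calls` are hypothetical helper names.

```python
import math


def prefix_cost(calls: int, prefix_tokens: int, input_rate: float, cached: bool,
                write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Dollar cost of sending the same prefix `calls` times.

    input_rate is in dollars per million tokens. With caching, the first call
    pays the write price and every later call pays the read price.
    """
    if calls <= 0:
        return 0.0
    base = prefix_tokens * input_rate / 1_000_000
    if not cached:
        return calls * base
    return base * write_mult + (calls - 1) * base * read_mult


def break_even_calls(write_mult: float = 1.25, read_mult: float = 0.10) -> int:
    """Smallest number of calls at which caching is strictly cheaper."""
    if not 0 <= read_mult < 1:
        raise ValueError("read_mult must be in [0, 1) for caching to pay off")
    # cached < uncached  <=>  w + (n - 1) * h < n  <=>  n > (w - h) / (1 - h)
    threshold = (write_mult - read_mult) / (1 - read_mult)
    return max(1, math.floor(threshold) + 1)


if __name__ == "__main__":
    without = prefix_cost(100, 10_000, 1.0, cached=False)
    with_cache = prefix_cost(100, 10_000, 1.0, cached=True)
    print(f"uncached: ${without:.4f}  cached: ${with_cache:.4f}")
    print("break-even at", break_even_calls(), "calls")
```

Under these assumed multipliers the write premium is recovered on the second call, which is why a long, stable prefix that repeats even a handful of times is usually worth caching.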

Batch what does not need to be realtime. OpenAI and Anthropic both offer Batch APIs that trade up to 24 hours of completion latency for a meaningful per-token discount (around 50% at the time the features launched; check the current rate). For nightly jobs, eval runs, and content pipelines this is one of the largest unconditional cost levers available. See Streaming, Batching, and Latency Budgets.

Most teams that take cost seriously combine all four. None of them is exotic; they are all configuration changes.

The hidden costs

Three line items that do not show up as a "token cost" but should be in your mental model:

Latency cost. Slow responses lose users. Latency is bought back by streaming (improves time-to-first-token), batching the prefill, picking faster models, and shortening output. There is a real budget here even if it does not show up as dollars.
Engineering cost. A clever 20% saving that requires a custom retry framework, a custom cache layer, and ongoing maintenance can cost more in engineering time than the dollars it saved. Pick the cheap big wins first.
Failure cost. Models that fail or hallucinate cost money you do not see — wasted user time, support tickets, downstream errors. Evals and guardrails are a real cost lever even though they do not show up on the LLM invoice.

Optimizing the bill in isolation, without these three, ends with a slower or worse product that costs slightly less to run. The point of cost engineering is not to minimize the LLM line item. It is to maximize value per dollar across the whole system.

Comments 2

u/fin_ops · 1 month ago

the hidden cost of retries and bloated system prompts is so real. our bill made zero sense until we started logging token usage per call

u/bill_b2 · 1 month ago

the retries hiding inside your bill is the silent killer. logging token usage per call is genuinely the first thing anyone should set up
