Does streaming change cost?

No. Streaming changes how the response is delivered — token-by-token via Server-Sent Events instead of as one final payload — but the token count and pricing are identical. What it changes is perceived latency: the user sees the first token in 100–300ms instead of waiting 2–5 seconds for the whole response. For interactive UX this is the single biggest perceived-quality lever; for backend jobs it is irrelevant.

When should I use the Batch API?

When you can afford up to 24 hours of completion time and want a meaningful per-token discount on both input and output (around 50% at the time both OpenAI and Anthropic launched their Batch APIs — check the current rate on their docs). Good fits: nightly content generation pipelines, eval runs across thousands of test cases, bulk data enrichment, document processing backlogs. Bad fits: anything a user is waiting on, anything where the result feeds another system within minutes, anything where order matters strictly.

What is a realistic latency budget for a user-facing LLM call?

A common rule of thumb: under 500ms time-to-first-token feels responsive, under 1 second feels fine, over 2 seconds feels slow. Total time-to-completion depends on response length; 5–10 seconds is acceptable for substantive answers if streaming is on and the first tokens arrive fast. Without streaming, anything over 2–3 seconds feels broken to most users. Latency budgets degrade harshly above those thresholds — engagement metrics start dropping.

Streaming, Batching, and Latency Budgets

Latency for LLM calls is not one number. It is three numbers that move independently:

Time-to-first-token (TTFT) — how long until the model emits its first output token. Dominated by prefill cost, which scales with input length.
Tokens-per-second (TPS) — the streaming output rate once generation has started. Dominated by model size and the provider's serving stack.
Total time-to-completion (TTC) — TTFT plus (output_length / TPS). What the user actually waits for if you are not streaming.

Streaming makes TTFT visible to the user; batching trades TTC for a per-token discount; right-sizing the model changes both TPS and TTFT. The right combination depends on whether anyone is watching.

Streaming: cheaper feel, same price

Every major API supports streaming via Server-Sent Events. You set stream=true, you iterate over partial deltas, you display them as they arrive. The cost is exactly the same — same tokens, same rate.

What streaming buys you is TTFT visibility. A 4-second total response feels slow when delivered in one chunk and feels fine when the first words appear in 200ms and the rest follow. For any chat-style UX, streaming is non-negotiable.

What streaming does not buy you:

Total time savings. If TTC is 4 seconds, the last token still arrives at second 4 regardless of streaming.
Cost savings. Token count and rate are unchanged.
Reliability. Streaming connections can drop mid-response; you need reconnection logic. The structured-output story is messier with streams because you have to assemble a valid JSON object from partial chunks.

Practical: turn streaming on for any user-facing interactive call. Leave it off for backend pipelines where you want one complete payload and the simpler error model.

Batching: half price, slow

OpenAI's Batch API and Anthropic's Message Batches API work the same way: you submit a file of requests, the provider promises completion within 24 hours, and you pay a discounted per-token rate on both input and output. Both launched at roughly 50% of standard pricing; the current rate is on each provider's pricing page and is subject to change. In practice batches typically complete in well under the SLA, often within an hour for smaller batches, but you cannot count on it.

The discount is unconditional — no minimum tokens, no caching tricks — and it composes with prompt caching (you still get caching within a batch). For workloads that can wait, it is one of the largest cost levers available after model selection.

Good fits:

Nightly content pipelines. Newsletter summaries, daily reports, scheduled enrichment runs.
Eval runs. You are scoring 5,000 test cases against a new prompt; you do not care if it finishes in 20 minutes or 4 hours.
Bulk processing backlogs. Re-classifying a year of historical messages, enriching a CRM, re-embedding a document corpus when the model changes.

Bad fits:

Anything user-facing.
Anything where downstream systems depend on near-realtime results.
Workloads where you need responses in a specific order, since batches do not guarantee ordering.

A common pattern: have two code paths for the same logic — a realtime path for live traffic, a batch path that catches up overnight on the queue of non-urgent work. The batch path is where 30–50% of LLM spend ends up at scale.

The latency budget, decomposed

For interactive UX, the budget breaks into three layers:

Layer	Typical target	Lever
TTFT	< 500ms	Smaller model, prompt caching (cuts prefill), shorter input
TPS	> 30–50 tok/s	Smaller model, provider's serving infrastructure
TTC	< 5–10s for chat, < 2s for short answers	All of the above + capping output length

A 4,000-token prompt to GPT-4o without caching typically hits TTFT around 1–2 seconds. The same prompt with a 3,500-token cached prefix typically drops TTFT under 500ms — caching is a TTFT lever before it is a cost lever, and on first-token-sensitive UX that matters more than the dollars.

TPS varies more by provider than people expect. The same model class on different providers can show 30 tok/s vs 60 tok/s for output. If TPS is your bottleneck, benchmark a few providers on your specific prompt shape before assuming the model is the issue.

For non-interactive workloads (background jobs, agents that take many steps), TTC matters more than TTFT. The right move is to drop streaming, drop the realtime API path, and consider whether the workload fits the Batch API.

When latency is itself the cost

For interactive products, a slow response is a real cost — engagement drops, abandonment rises, support tickets accumulate. The point of latency engineering is not raw speed; it is staying inside the budget that keeps user behavior healthy.

The empirical thresholds, roughly:

TTFT under 500ms feels responsive. Users barely notice the model is thinking.
TTFT 500ms–1s feels fine for a thoughtful answer; the loading state covers it.
TTFT 1–3s feels slow; users start looking away.
TTFT over 3s feels broken; users assume something failed.

For TTC the budget is wider because the response itself is the content — a 6-second streaming response that produced a substantive answer feels different from a 6-second blank screen. The user is reading the first part while the rest streams in.

The combination that loses on both axes: a large model, a long uncached prefix, no streaming. The combination that wins: cached prefix on a right-sized model, streaming on, capped output length. The cost saving and the latency win are often the same change.

Streaming, Batching, and Latency Budgets

Article summary

Streaming: cheaper feel, same price

Batching: half price, slow

The latency budget, decomposed

When latency is itself the cost

Frequently asked questions

See also

Where to go next

Comments 1