Latency for LLM calls is not one number. It is three numbers that move independently:
- Time-to-first-token (TTFT) — how long until the model emits its first output token. Dominated by prefill cost, which scales with input length.
- Tokens-per-second (TPS) — the streaming output rate once generation has started. Dominated by model size and the provider's serving stack.
- Total time-to-completion (TTC) — TTFT plus (output_length / TPS). What the user actually waits for if you are not streaming.
Streaming makes TTFT visible to the user; batching trades TTC for a per-token discount; right-sizing the model changes both TPS and TTFT. The right combination depends on whether anyone is watching.
Streaming: cheaper feel, same price
Every major API supports streaming via Server-Sent Events. You set stream=true, you iterate over partial deltas, you display them as they arrive. The cost is exactly the same — same tokens, same rate.
What streaming buys you is TTFT visibility. A 4-second total response feels slow when delivered in one chunk and feels fine when the first words appear in 200ms and the rest follow. For any chat-style UX, streaming is non-negotiable.
What streaming does not buy you:
- Total time savings. If TTC is 4 seconds, the last token still arrives at second 4 regardless of streaming.
- Cost savings. Token count and rate are unchanged.
- Reliability. Streaming connections can drop mid-response; you need reconnection logic. The structured-output story is messier with streams because you have to assemble a valid JSON object from partial chunks.
Practical: turn streaming on for any user-facing interactive call. Leave it off for backend pipelines where you want one complete payload and the simpler error model.
Batching: half price, slow
OpenAI's Batch API and Anthropic's Message Batches API work the same way: you submit a file of requests, the provider promises completion within 24 hours, and you pay a discounted per-token rate on both input and output. Both launched at roughly 50% of standard pricing; the current rate is on each provider's pricing page and is subject to change. In practice batches typically complete in well under the SLA, often within an hour for smaller batches, but you cannot count on it.
The discount is unconditional — no minimum tokens, no caching tricks — and it composes with prompt caching (you still get caching within a batch). For workloads that can wait, it is one of the largest cost levers available after model selection.
Good fits:
- Nightly content pipelines. Newsletter summaries, daily reports, scheduled enrichment runs.
- Eval runs. You are scoring 5,000 test cases against a new prompt; you do not care if it finishes in 20 minutes or 4 hours.
- Bulk processing backlogs. Re-classifying a year of historical messages, enriching a CRM, re-embedding a document corpus when the model changes.
Bad fits:
- Anything user-facing.
- Anything where downstream systems depend on near-realtime results.
- Workloads where you need responses in a specific order, since batches do not guarantee ordering.
A common pattern: have two code paths for the same logic — a realtime path for live traffic, a batch path that catches up overnight on the queue of non-urgent work. The batch path is where 30–50% of LLM spend ends up at scale.
The latency budget, decomposed
For interactive UX, the budget breaks into three layers:
| Layer | Typical target | Lever |
|---|---|---|
| TTFT | < 500ms | Smaller model, prompt caching (cuts prefill), shorter input |
| TPS | > 30–50 tok/s | Smaller model, provider's serving infrastructure |
| TTC | < 5–10s for chat, < 2s for short answers | All of the above + capping output length |
A 4,000-token prompt to GPT-4o without caching typically hits TTFT around 1–2 seconds. The same prompt with a 3,500-token cached prefix typically drops TTFT under 500ms — caching is a TTFT lever before it is a cost lever, and on first-token-sensitive UX that matters more than the dollars.
TPS varies more by provider than people expect. The same model class on different providers can show 30 tok/s vs 60 tok/s for output. If TPS is your bottleneck, benchmark a few providers on your specific prompt shape before assuming the model is the issue.
For non-interactive workloads (background jobs, agents that take many steps), TTC matters more than TTFT. The right move is to drop streaming, drop the realtime API path, and consider whether the workload fits the Batch API.
When latency is itself the cost
For interactive products, a slow response is a real cost — engagement drops, abandonment rises, support tickets accumulate. The point of latency engineering is not raw speed; it is staying inside the budget that keeps user behavior healthy.
The empirical thresholds, roughly:
- TTFT under 500ms feels responsive. Users barely notice the model is thinking.
- TTFT 500ms–1s feels fine for a thoughtful answer; the loading state covers it.
- TTFT 1–3s feels slow; users start looking away.
- TTFT over 3s feels broken; users assume something failed.
For TTC the budget is wider because the response itself is the content — a 6-second streaming response that produced a substantive answer feels different from a 6-second blank screen. The user is reading the first part while the rest streams in.
The combination that loses on both axes: a large model, a long uncached prefix, no streaming. The combination that wins: cached prefix on a right-sized model, streaming on, capped output length. The cost saving and the latency win are often the same change.