Streaming a model response is the highest-leverage UX change in any LLM product. It is also where polished UIs and amateur ones diverge most visibly, because the underlying mechanic — a stream of tokens arriving asynchronously — interacts badly with most front-end assumptions about how the DOM should update.
This article is the working engineer's catalog of what breaks and the patterns that fix it.
The protocol layer
All the major LLM providers expose streaming via Server-Sent Events. On the SDK side it shows up as an async iterator:
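for await (const chunk of stream) {
  // Some chunks (e.g. a trailing usage chunk) can arrive with no choices,
  // so guard the lookup instead of indexing blindly.
  const delta = chunk.choices[0]?.delta?.content;
  if (delta) accumulate(delta);
}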
You proxy the upstream stream through your own server (you do not put API keys in the browser) and forward it to the client as SSE. The client subscribes via the EventSource API or, in practice, via a higher-level helper like the Vercel AI SDK useChat hook, LangChain.js streaming, or one of the framework-specific options. EventSource can only issue GET requests, so clients that POST the conversation as a request body typically read the stream from a fetch response instead.
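As a rough illustration of what travels over that connection, here is a minimal sketch of SSE framing on the server and incremental decoding on the client. The names encodeSseFrame and createSseDecoder are hypothetical helpers, not part of any SDK, and the sketch assumes "\n" line endings and data fields only (the full SSE format also allows "\r\n" endings and event, id, and retry fields).

```typescript
// Hypothetical helper: wrap one payload in an SSE frame.
// Each payload line becomes a "data:" field; a blank line ends the frame.
function encodeSseFrame(payload: string): string {
  const fields = payload.split("\n").map((line) => `data: ${line}`);
  return fields.join("\n") + "\n\n";
}

// Hypothetical helper: incremental decoder for the client side.
// Network chunks can split a frame anywhere, so the unfinished tail is
// kept in a buffer and only complete frames are emitted.
function createSseDecoder(): (chunk: string) => string[] {
  let buffer = "";
  return (chunk) => {
    buffer += chunk;
    const frames = buffer.split("\n\n");
    buffer = frames.pop() ?? ""; // last piece is an unfinished frame (or "")
    const payloads: string[] = [];
    for (const frame of frames) {
      const dataLines = frame
        .split("\n")
        .filter((line) => line.startsWith("data:"))
        .map((line) => line.slice(5).replace(/^ /, ""));
      // Frames with no data field (comments, heartbeats) are skipped.
      if (dataLines.length > 0) payloads.push(dataLines.join("\n"));
    }
    return payloads;
  };
}
```

The decoder returns an empty array until a full frame has arrived, which is why a chunk boundary in the middle of a token never reaches the renderer.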
The protocol layer is well-trodden territory. The interesting problems start once tokens hit the DOM.
Rendering partial markdown without flicker
Most LLM responses are markdown. Rendering markdown incrementally is surprisingly hard because the parse tree is unstable mid-stream:
- A line that has just shown two backticks is briefly an inline-code start until the third backtick arrives and it becomes a code block.
- An asterisk at the start of a line is ambiguous until the next character — list item, italic, or literal.
- A code fence opens; the language tag arrives; the content arrives line by line. Each step changes the rendered structure.
The naive solution — re-parse on every token — works but flickers badly, especially around code blocks. Three fixes that compose:
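Throttle the parse. Re-parsing and re-rendering on every token is wasteful, since the user cannot read that fast anyway. Batch incoming tokens and re-render at most once every 50–100ms; this cuts flicker dramatically while keeping the stream feeling live. Note that a strict debounce, which waits for a pause in the input, would stall the display for as long as tokens keep arriving, so what you want is a fixed-interval batch.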
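A minimal sketch of that rate-limited re-render, assuming a render callback that takes the full accumulated text. The name createBatchedRenderer and the injectable schedule parameter are hypothetical; the parameter exists only so the timing can be driven by hand in tests.

```typescript
type ScheduleFn = (fn: () => void, ms: number) => unknown;

// Hypothetical helper: accumulate deltas and call render() at most once
// per interval, always with the latest full text.
function createBatchedRenderer(
  render: (fullText: string) => void,
  intervalMs = 80,
  schedule: ScheduleFn = (fn, ms) => setTimeout(fn, ms),
) {
  let text = "";
  let dirty = false; // text has changed since the last render
  let scheduled = false; // a flush is already queued

  const flush = (): void => {
    scheduled = false;
    if (!dirty) return;
    dirty = false;
    render(text);
  };

  return {
    // Call for every streamed delta; cheap, no rendering happens here.
    push(delta: string): void {
      text += delta;
      dirty = true;
      if (!scheduled) {
        scheduled = true;
        schedule(flush, intervalMs);
      }
    },
    // Call when the stream ends so the final text renders immediately.
    finish(): void {
      flush();
    },
  };
}
```

Because the first token after an idle period schedules a flush rather than waiting for a pause, the display keeps updating during a continuous stream.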
Use a streaming-aware renderer. Render an open code fence as a code block immediately, even without the closing fence. Render an in-progress list item as a list item, not as inline text. The Vercel AI SDK's default markdown components and similar streaming-friendly libraries handle this case; rolling your own with react-markdown requires custom transformations.
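One way to approximate the open-fence behavior with an ordinary markdown renderer is to patch the partial text before parsing it. The sketch below uses a hypothetical closeOpenFence helper and makes simplifying assumptions: only backtick fences are handled, and fence lengths and tilde fences are ignored.

```typescript
// Built with repeat() so the example does not contain a literal fence.
const FENCE = "`".repeat(3);

// Hypothetical helper: if the partial markdown has an unclosed code fence,
// append a closing fence so the renderer treats the tail as a code block.
function closeOpenFence(partial: string): string {
  let open = false;
  for (const line of partial.split("\n")) {
    // Every fence line toggles between "inside" and "outside" a block.
    if (line.trimStart().startsWith(FENCE)) open = !open;
  }
  if (!open) return partial;
  const separator = partial.endsWith("\n") ? "" : "\n";
  return partial + separator + FENCE;
}
```

The patched string is only what gets parsed for display; the accumulated buffer itself stays untouched so the real closing fence can still arrive.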
Render structural blocks atomically. A finished paragraph should not re-layout when the next paragraph arrives. Give each block-level element (paragraphs, code blocks, list items) a stable React key so the reconciler does not throw it away and rebuild it on every parse.
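A minimal sketch of how stable keys can fall out of the append-only nature of the stream. The helper splitIntoBlocks is hypothetical and deliberately crude: it splits on blank lines outside backtick fences only, so list items are not separated individually.

```typescript
const FENCE_MARKER = "`".repeat(3);

interface Block {
  key: string; // stable across re-parses because it is the block's position
  text: string;
}

// Hypothetical helper: split streamed markdown into top-level blocks.
// Text only ever grows at the end, so every block except the last keeps
// the same key and the same text from one parse to the next.
function splitIntoBlocks(markdown: string): Block[] {
  const blocks: string[] = [];
  let current: string[] = [];
  let inFence = false;
  for (const line of markdown.split("\n")) {
    if (line.trimStart().startsWith(FENCE_MARKER)) inFence = !inFence;
    if (line.trim() === "" && !inFence) {
      // A blank line outside a code fence ends the current block.
      if (current.length > 0) {
        blocks.push(current.join("\n"));
        current = [];
      }
    } else {
      current.push(line);
    }
  }
  if (current.length > 0) blocks.push(current.join("\n"));
  return blocks.map((text, index) => ({ key: `block-${index}`, text }));
}
```

In React, each block would be rendered with its key and a memoized component, so only the final, still-growing block re-renders as tokens arrive.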
Auto-scroll that does not fight the user
The default behavior — auto-scroll to bottom on every new token — feels right for the first 200ms and broken thereafter. The user starts reading; the auto-scroll yanks them away from the line they were on; they scroll back; the next token yanks them again. Either ship a working auto-scroll or do not ship it at all.
The pattern that works:
- Track whether the user is currently near the bottom of the scroll container (within ~100px).
- If yes, auto-scroll as new tokens arrive.
- If the user scrolls up, stop auto-scrolling. Show a small "jump to bottom" pill they can click to re-anchor.
- Resume auto-scroll only when the user explicitly returns to the bottom.
This three-state model (anchored, drifted-up, jumped-back) is what well-tuned chat UIs implement: ChatGPT and Claude.ai use it, and so do Discord and Slack, for the same reason. The pattern is the same whether your messages arrive token by token or in bursts.
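The decision logic above can be sketched as a few pure functions, kept free of DOM types so they are easy to test. All names here (ViewportMetrics, AnchorState, onUserScroll, onJumpClicked, scrollTopAfterToken) are hypothetical; the 100px threshold is the figure from the list above.

```typescript
interface ViewportMetrics {
  scrollTop: number;
  scrollHeight: number;
  clientHeight: number;
}

interface AnchorState {
  anchored: boolean; // auto-scroll is active
  showJumpPill: boolean; // the "jump to bottom" pill is visible
}

// True when the viewport's bottom edge is within thresholdPx of the end.
function isNearBottom(m: ViewportMetrics, thresholdPx = 100): boolean {
  return m.scrollHeight - m.scrollTop - m.clientHeight <= thresholdPx;
}

// Run on every user scroll event: drifting up releases the anchor,
// returning to the bottom restores it.
function onUserScroll(m: ViewportMetrics, thresholdPx = 100): AnchorState {
  const anchored = isNearBottom(m, thresholdPx);
  return { anchored, showJumpPill: !anchored };
}

// Run when the user clicks the pill.
function onJumpClicked(): AnchorState {
  return { anchored: true, showJumpPill: false };
}

// Run after new tokens are rendered: the scrollTop the container should have.
function scrollTopAfterToken(state: AnchorState, m: ViewportMetrics): number {
  return state.anchored ? m.scrollHeight - m.clientHeight : m.scrollTop;
}
```

In a browser, the metrics would be read from the scroll container in a scroll listener, and the result of scrollTopAfterToken assigned back to the container's scrollTop after each render.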
Partial JSON for structured streaming
When the model is producing structured output (a JSON object, a function call payload), streaming gets harder. The fragment "{"name": "a" is not yet valid JSON; you cannot run JSON.parse on it. But you may want to show the user something as the fields fill in.
Two approaches, depending on use case:
Accumulate and parse at the end. For short structured outputs (a 200-token JSON object), buffer the whole thing and parse once. Show a loading state until complete. Simple and reliable; loses the streaming feel but is fine for short outputs.
Use a partial JSON parser for genuinely long structured outputs (long lists, large objects). Libraries like partial-json (and equivalents in Python, e.g. json-stream-parser) parse incomplete JSON tolerantly, returning the best valid object so far. You can then render fields as they appear. Useful for streaming a long classified list, a multi-section research output, or an agent's tool calls as they form.
The trap to avoid: trying to parse the JSON ad-hoc with regex or string-split. It breaks as soon as the input contains a quoted string with a brace inside it.
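To make the brace-in-a-string failure concrete, here is a small sketch that contrasts a naive brace count with a scanner that tracks string and escape state. It is not a partial JSON parser, only an illustration of the state an ad-hoc approach tends to miss; scanJsonFragment, naiveDepth, and parseIfComplete are hypothetical names.

```typescript
interface FragmentState {
  depth: number; // open objects/arrays not yet closed
  inString: boolean; // fragment ends inside a quoted string
}

// Walk the fragment once, ignoring brackets that appear inside strings.
function scanJsonFragment(fragment: string): FragmentState {
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (const ch of fragment) {
    if (inString) {
      if (escaped) escaped = false; // character after a backslash is literal
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
    } else if (ch === '"') {
      inString = true;
    } else if (ch === "{" || ch === "[") {
      depth += 1;
    } else if (ch === "}" || ch === "]") {
      depth -= 1;
    }
  }
  return { depth, inString };
}

// The ad-hoc version: counts every brace, including those inside strings.
function naiveDepth(fragment: string): number {
  const opens = fragment.split("{").length - 1;
  const closes = fragment.split("}").length - 1;
  return opens - closes;
}

// Only hand the buffer to JSON.parse once the scanner says it is balanced.
function parseIfComplete(buffer: string): unknown {
  const { depth, inString } = scanJsonFragment(buffer);
  if (buffer.trim() === "" || depth !== 0 || inString) return undefined;
  try {
    return JSON.parse(buffer);
  } catch {
    return undefined;
  }
}
```

For the fragment `{"a": "}"` the naive count reports a balanced object while the scanner correctly reports one object still open.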
Cancellation that actually cancels
When a user closes a tab, navigates away, or hits a "stop" button, two things should happen:
- The client stops rendering (free — the SSE connection drops).
- The server stops generating (not free — the upstream LLM call needs an explicit abort).
The pattern that works:
const controller = new AbortController();
const response = await fetch("/api/chat", {
  signal: controller.signal,
});
// ... later
button.onclick = () => controller.abort();
On the server, forward the AbortSignal to the LLM SDK:
const stream = await openai.chat.completions.create(
  { model: "gpt-5", stream: true, messages },
  { signal: req.signal }
);
Without this plumbing, the upstream LLM call keeps running after the user has left, and you keep paying for tokens no one will ever see. On a busy product this is a meaningful line item — both because of the wasted spend and because canceled-but-still-running calls consume rate-limit quota.
Common bugs worth knowing
Things that go wrong in real production streaming UIs:
- Markdown code blocks render as inline code while the opening fence is mid-stream. Fix: render an unclosed fence as an open code block.
- Math expressions ($$…$$) flicker badly during streaming. Fix: defer math rendering until after the stream completes, or render delimited blocks atomically.
- Long lists cause layout thrash because every new item re-flows the page. Fix: use CSS containment (contain: layout) on the message container.
- Slow networks make the stream burst-arrive in chunks of 30 tokens at a time, defeating the perceived-latency win. Fix: nothing on the client side; this is a delivery issue. Verify your server is flushing the stream correctly and that no proxy is buffering it.
- Server-rendered frameworks (Next.js, Remix) cache the SSE response by accident. Fix: ensure the response is marked as dynamic and includes Cache-Control: no-store.
The Vercel AI SDK's useChat hook ships with sensible defaults for most of these. Rolling your own gives more control and more bugs.
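If you do roll your own, the caching and buffering items above mostly come down to response headers. A minimal sketch, assuming a hypothetical sseResponseHeaders helper; the X-Accel-Buffering header is specific to nginx and only matters if nginx sits in front of your server.

```typescript
// Hypothetical helper: headers for a streamed SSE response.
function sseResponseHeaders(): Record<string, string> {
  return {
    // Tells the client this is an event stream, not a regular body.
    "Content-Type": "text/event-stream",
    // Keeps frameworks and intermediaries from caching the stream.
    "Cache-Control": "no-store",
    // Asks nginx not to buffer the response (assumption: nginx is the proxy).
    "X-Accel-Buffering": "no",
  };
}
```

With a fetch-style server API, these would be passed as the headers option when constructing the streaming Response.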
When not to stream
Streaming is the default but not the right answer for everything. Skip it when:
- The response goes to another system, not a human reader.
- The output is a short structured object (under ~100 tokens of JSON) that the user does not read incrementally.
- The use case requires the full response for validation before rendering anything (e.g. a tool-call payload that has to be schema-checked).
- You are running inside a constrained environment that cannot hold a long-lived connection.
In those cases, the simpler request-response shape with a single loading state is cleaner. Most user-facing chat and long-form generation, though, should be streaming the moment you have it working.