Subject
Designing UIs around LLMs that feel good despite the underlying model being slow, non-deterministic, and occasionally wrong. Streaming patterns, citation rendering, error and refusal states, and the conversation mechanics — regenerate, edit, branch — that separate a polished LLM product from a wrapper around an API.
Every LLM-powered product is solving the same underlying UX problem: the model in the middle is slow, non-deterministic, and occasionally wrong. A good UI hides none of those facts; it makes each one feel like a feature instead of a bug.
This is the version of the article we wish we had read first. It explains why those three properties are the actual constraints, what concrete moves they imply, and how to avoid the most common trap: shipping a ChatGPT-shaped UI for a product that is not a chat.
Latency. Frontier models take 2 to 30 seconds to produce a full response. Mid-tier models are faster but never instant. Reasoning models can take a minute or more before the visible answer arrives. Designing as if the model is fast produces UIs that look broken every time they are used.
Variance. The same prompt produces different outputs across runs. Sometimes meaningfully different — the model phrases something differently, picks a different example, returns a slightly different structure. Designing as if the model is deterministic produces UIs that break when the user runs the same thing twice.
Fallibility. The model sometimes makes things up, sometimes refuses, sometimes gets it half-right, sometimes times out, sometimes hits a rate limit, sometimes returns invalid JSON. Designing for the happy path only produces UIs that abandon the user the moment something goes wrong.
The five moves below are the ones that show up across almost every LLM product worth studying.
Replacing a loading spinner with a token-by-token stream is the single highest-ROI change in LLM UX. The token count is unchanged, the bill is unchanged, the model is unchanged; only the delivery is different. But a 4-second response that arrives in one chunk feels slow, while the same 4-second response that starts showing words within 300 ms feels responsive.
Streaming is non-negotiable for any chat-style or long-form-generation UX. It is also genuinely hard to get right — see Streaming UI Patterns That Don't Break for the scroll behavior, partial-markdown rendering, and cancellation patterns that actually work.
The exception: backend pipelines and short structured outputs. If the response is going into another system, or if it is a 30-token JSON object the user does not read incrementally, drop streaming for the simpler request/response shape.
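One way to keep the streaming states explicit is a small reducer that the rendering layer subscribes to. The sketch below is illustrative only: the event names, `StreamState`, `reduceStream`, and `timeToFirstChunkMs` are hypothetical and not tied to any provider SDK. It appends chunks as they arrive, records when the first one landed (the number that drives perceived responsiveness), and ignores chunks that show up after the user has cancelled.

```typescript
// Hypothetical event and state shapes for illustration; not any SDK's API.
type StreamEvent =
  | { type: "request_sent"; atMs: number }
  | { type: "chunk"; text: string; atMs: number }
  | { type: "done" }
  | { type: "cancelled" };

interface StreamState {
  phase: "idle" | "waiting" | "streaming" | "done" | "cancelled";
  text: string;
  requestSentAtMs: number | null;
  firstChunkAtMs: number | null;
}

const initialStreamState: StreamState = {
  phase: "idle",
  text: "",
  requestSentAtMs: null,
  firstChunkAtMs: null,
};

function isInFlight(state: StreamState): boolean {
  return state.phase === "waiting" || state.phase === "streaming";
}

function reduceStream(state: StreamState, event: StreamEvent): StreamState {
  switch (event.type) {
    case "request_sent":
      // A new request resets the buffer and starts the clock.
      return { phase: "waiting", text: "", requestSentAtMs: event.atMs, firstChunkAtMs: null };
    case "chunk":
      // Late chunks after cancel or completion are dropped, not rendered.
      if (!isInFlight(state)) return state;
      return {
        ...state,
        phase: "streaming",
        text: state.text + event.text,
        firstChunkAtMs: state.firstChunkAtMs ?? event.atMs,
      };
    case "done":
      return isInFlight(state) ? { ...state, phase: "done" } : state;
    case "cancelled":
      // Keep whatever text already arrived so the user does not lose it.
      return isInFlight(state) ? { ...state, phase: "cancelled" } : state;
  }
}

// Milliseconds between sending the request and showing the first words.
function timeToFirstChunkMs(state: StreamState): number | null {
  if (state.requestSentAtMs === null || state.firstChunkAtMs === null) return null;
  return state.firstChunkAtMs - state.requestSentAtMs;
}
```

Keeping the transition logic pure makes it straightforward to unit-test the awkward cases (cancel mid-stream, chunks arriving after cancel) without a network or a DOM.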
For any answer grounded in external information — RAG over your documents, web search, retrieved code, tool results — citations are part of the answer, not a footnote. The reader needs to be able to verify, follow up, and gauge confidence at a glance.
Three placements that work in practice:
The deeper problem is making sure the citations are real — that the source the model claims it used actually says what the model implied. See Citations, Inline and Sticky for how to wire model-emitted citations to retrieval results and what to do when they do not match.
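A minimal sketch of that wiring, under stated assumptions: the model is asked to emit a source id plus a short supporting quote for each citation, and the retrieval layer still holds the text it returned. The types and the `checkCitations` function are hypothetical, and the substring match is deliberately naive; a real system would likely need fuzzier matching.

```typescript
// Hypothetical shapes: what retrieval returned, and what the model claimed.
interface RetrievedSource {
  id: string;
  title: string;
  text: string;
}

interface Citation {
  sourceId: string;
  quote: string;
}

type CitationStatus = "verified" | "quote_not_found" | "unknown_source";

interface CheckedCitation {
  citation: Citation;
  status: CitationStatus;
}

// Collapse case and whitespace so line breaks do not cause false mismatches.
function normalizeText(s: string): string {
  return s.toLowerCase().replace(/\s+/g, " ").trim();
}

function checkCitations(citations: Citation[], sources: RetrievedSource[]): CheckedCitation[] {
  return citations.map((citation): CheckedCitation => {
    const source: RetrievedSource | undefined = sources.filter(
      (s) => s.id === citation.sourceId
    )[0];
    if (source === undefined) {
      // The model cited something retrieval never returned.
      return { citation, status: "unknown_source" };
    }
    const quote = normalizeText(citation.quote);
    const found = quote.length > 0 && normalizeText(source.text).indexOf(quote) !== -1;
    return { citation, status: found ? "verified" : "quote_not_found" };
  });
}
```

The UI can then render only `verified` citations as links and treat the other two statuses as a signal to hide the marker or flag the claim as unverified; which of those to do is a product decision.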
The model varies. Sometimes the first response is exactly what the user wanted; sometimes it is not. The friction between "got something not quite right" and "got something better" is one of the strongest predictors of whether a user keeps using your product.
The mechanics worth shipping:
Each has implications for context-window management, cost (cache invalidation), and the user's mental model of what they are looking at. See Regenerate, Undo, Branch — Conversation Mechanics for the design choices that work and the ones that confuse users.
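One common way to model regenerate, edit, and branch is to store the conversation as a tree rather than a flat list, so alternatives live side by side instead of overwriting each other. The sketch below assumes that model; the types and function names are hypothetical. `activePath` returns the root-to-leaf path the user is currently looking at, which is also the candidate context to send on the next turn.

```typescript
// Hypothetical tree-shaped conversation store, for illustration only.
interface MessageNode {
  id: string;
  parentId: string | null;
  role: "user" | "assistant";
  text: string;
}

interface Conversation {
  nodes: MessageNode[];
  activeLeafId: string | null;
}

function findNode(conv: Conversation, id: string): MessageNode {
  for (const node of conv.nodes) {
    if (node.id === id) return node;
  }
  throw new Error("Unknown message id: " + id);
}

// Appends a node under parentId and makes it the active leaf.
function addMessage(
  conv: Conversation,
  parentId: string | null,
  role: MessageNode["role"],
  text: string
): Conversation {
  const node: MessageNode = { id: "m" + (conv.nodes.length + 1), parentId, role, text };
  return { nodes: conv.nodes.concat([node]), activeLeafId: node.id };
}

// Regenerate: a new assistant reply becomes a sibling of the old one.
function regenerate(conv: Conversation, assistantId: string, newText: string): Conversation {
  const original = findNode(conv, assistantId);
  if (original.role !== "assistant") throw new Error("Can only regenerate assistant messages");
  return addMessage(conv, original.parentId, "assistant", newText);
}

// Edit: the edited user message starts a new branch; the old one is kept.
function editUserMessage(conv: Conversation, userId: string, newText: string): Conversation {
  const original = findNode(conv, userId);
  if (original.role !== "user") throw new Error("Can only edit user messages");
  return addMessage(conv, original.parentId, "user", newText);
}

// Root-to-leaf path for the branch currently on screen.
function activePath(conv: Conversation): MessageNode[] {
  const path: MessageNode[] = [];
  let cursor = conv.activeLeafId;
  while (cursor !== null) {
    const node = findNode(conv, cursor);
    path.unshift(node);
    cursor = node.parentId;
  }
  return path;
}

// Alternatives at the same position, e.g. to drive a "2 / 3" pager.
function siblings(conv: Conversation, id: string): MessageNode[] {
  const parentId = findNode(conv, id).parentId;
  return conv.nodes.filter((n) => n.parentId === parentId);
}
```

Because every branch shares its prefix with its siblings, this shape also makes the cost question visible: regenerating a reply reuses the whole prefix, while editing an early user message changes the path from that point on.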
Long-running model calls — reasoning models, agent loops, multi-tool workflows — need a UI that tells the user something is happening, what specifically, and roughly when it will end.
The patterns that work:
What does not work: a generic spinning indicator with no information. After about 3 seconds of "thinking…" with no further signal, users start wondering whether the product has crashed.
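As a rough sketch of the "something is happening, what specifically, and roughly when it will end" requirement, a status line can be derived from a small progress snapshot. Everything here is an assumption for illustration: the `ProgressSnapshot` shape, the `describeProgress` name, and the choice to show elapsed time once the roughly 3-second mark has passed.

```typescript
// Hypothetical snapshot the backend or agent loop reports to the UI.
interface ProgressSnapshot {
  currentStep: string | null; // e.g. "Searching documents"; null if unknown
  completedSteps: number;
  totalSteps: number | null; // null for open-ended agent loops
  elapsedMs: number;
}

function describeProgress(p: ProgressSnapshot): string {
  const parts = [p.currentStep ?? "Working"];
  if (p.totalSteps !== null) {
    // Show the step in progress, never more than the known total.
    const stepNumber = Math.min(p.completedSteps + 1, p.totalSteps);
    parts.push("step " + stepNumber + " of " + p.totalSteps);
  }
  const seconds = Math.floor(p.elapsedMs / 1000);
  if (seconds >= 3) {
    // Past the point where silence starts to look like a crash.
    parts.push(seconds + "s elapsed");
  }
  return parts.join(", ");
}
```

For example, a snapshot with `currentStep: "Searching documents"`, one of four steps complete, and 7.2 seconds elapsed renders as `Searching documents, step 2 of 4, 7s elapsed`.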
Refusal, rate limit, content filter, timeout, network drop, partial response, "model is overloaded." Each is a real state the user will hit. Each deserves its own copy, its own recovery path, and at minimum a sense that the product knows what happened.
Bad: a generic "Something went wrong" with a refresh button. Better: state-specific messaging that names the cause and offers the next step. "The model is at capacity right now — try again in a minute, or switch to the faster model in settings." "Your message hit our content filter — here's what triggered it and how to rephrase." "This conversation is too long — start a new thread or summarize the previous turns."
See Error and Refusal States Without Anger for the specific states worth designing for and what good copy looks like.
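One way to make sure no failure falls through to a generic message is to model each state as its own variant and let the type checker enforce coverage. The sketch below is an assumption-laden illustration: the `FailureState` variants, the `RecoveryAction` ids, and the copy strings are all hypothetical placeholders, loosely based on the examples above rather than recommended wording.

```typescript
// Hypothetical failure states; one variant per state the user can hit.
type FailureState =
  | { kind: "refusal" }
  | { kind: "rate_limit"; retryAfterSeconds: number | null }
  | { kind: "content_filter" }
  | { kind: "timeout" }
  | { kind: "network_drop" }
  | { kind: "partial_response"; receivedText: string }
  | { kind: "overloaded" }
  | { kind: "context_too_long" };

type RecoveryAction =
  | "retry" | "rephrase" | "switch_model" | "continue" | "new_thread" | "summarize";

interface FailureView {
  message: string;
  actions: RecoveryAction[]; // first entry is the primary button
}

// Placeholder copy only. The exhaustive switch means adding a new
// FailureState variant fails to compile until it gets its own message.
function describeFailure(state: FailureState): FailureView {
  switch (state.kind) {
    case "refusal":
      return { message: "The model declined to answer this request.", actions: ["rephrase"] };
    case "rate_limit": {
      const wait = state.retryAfterSeconds === null
        ? "shortly"
        : "in " + state.retryAfterSeconds + " seconds";
      return { message: "You have hit the rate limit. Try again " + wait + ".", actions: ["retry"] };
    }
    case "content_filter":
      return { message: "Your message hit our content filter.", actions: ["rephrase"] };
    case "timeout":
      return { message: "The model took too long to respond.", actions: ["retry", "switch_model"] };
    case "network_drop":
      return { message: "The connection dropped before the answer finished.", actions: ["retry"] };
    case "partial_response":
      return {
        message: "The answer was cut off after " + state.receivedText.length + " characters.",
        actions: ["continue", "retry"],
      };
    case "overloaded":
      return {
        message: "The model is at capacity right now. Try again in a minute, or switch to the faster model.",
        actions: ["retry", "switch_model"],
      };
    case "context_too_long":
      return {
        message: "This conversation is too long. Start a new thread or summarize the previous turns.",
        actions: ["new_thread", "summarize"],
      };
  }
}
```

The mapping from raw provider errors to these variants is left out on purpose, since error codes and payloads differ between providers.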
The single most common mistake in the 2024–2026 cohort of LLM features is wrapping every capability in a chat box. Chat is the right shape for open-ended exploration; it is the wrong shape for almost everything else.
Tasks with a fixed shape (summarize this, classify that, rewrite the highlighted text) work better as a single transformation: the input visible, the output visible alongside it, a small action bar to regenerate or refine. No conversation log, no scroll, no "what should I ask next" cognitive load.
Tasks embedded in another product (an issue tracker, an IDE, a design tool) work better as inline actions or context-aware commands. Cursor's inline-edit (Cmd-K), Linear's "ask AI about this issue," Figma's "generate variant" — none of these are chats, and that is the point. The capability lives where the user already is.
The chat is a default. Defaults matter; they are also the most common reason a product feels generic.
Three products worth disassembling because each gets a different part right:
These five moves cover the majority of LLM UX work. The rest of this hub goes deep on each one.
A short editorial reading list. Pick whichever fits how you like to learn.