Every LLM-powered product is solving the same underlying UX problem: the model in the middle is slow, non-deterministic, and occasionally wrong. A good UI hides none of those facts; it makes each one feel like a feature instead of a bug.
This is the article we wish we had read first. It explains why those three properties are the real constraints, what concrete moves they imply, and how to avoid the most common trap: shipping a ChatGPT-shaped UI for a product that is not a chat.
The three constraints
Latency. Frontier models take 2 to 30 seconds to produce a full response. Mid-tier models are faster but never instant. Reasoning models can take a minute or more before the visible answer arrives. Designing as if the model is fast produces UIs that look broken every time they are used.
Variance. The same prompt produces different outputs across runs, sometimes meaningfully so: the model phrases something differently, picks a different example, or returns a different structure. Designing as if the model is deterministic produces UIs that break when the user runs the same thing twice.
Fallibility. The model sometimes makes things up, sometimes refuses, sometimes gets it half-right, sometimes times out, sometimes hits a rate limit, sometimes returns invalid JSON. Designing for the happy path only produces UIs that abandon the user the moment something goes wrong.
The five moves below are the ones that show up across almost every LLM product worth studying.
Move 1: stream the output
Replacing a loading spinner with a token-by-token stream is the single highest-ROI change in LLM UX. The token count is unchanged, the bill is unchanged, the model is unchanged — only the delivery is different. But a 4-second response that arrives in one chunk feels slow; the same 4-second response that starts showing words in 300ms feels responsive.
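To make the delivery difference concrete, here is a minimal sketch of incremental rendering. It assumes nothing about a real model API: `fake_token_stream` is a hypothetical stand-in for a streaming response, and `on_update` stands in for whatever repaints your UI.

```python
from typing import Callable, Iterable, Iterator, List


def fake_token_stream(text: str) -> Iterator[str]:
    """Hypothetical stand-in for a model stream: yields one word-sized chunk at a time."""
    words = text.split(" ")
    for i, word in enumerate(words):
        # Re-attach the space that split() removed, except before the first word.
        yield word if i == 0 else " " + word


def render_streaming(chunks: Iterable[str], on_update: Callable[[str], None]) -> str:
    """Push the accumulated text to the UI after every chunk instead of once at the end."""
    buffer: List[str] = []
    for chunk in chunks:
        buffer.append(chunk)
        on_update("".join(buffer))  # the UI repaints with everything received so far
    return "".join(buffer)


# Each entry in `frames` is one repaint the user would see.
frames: List[str] = []
final_text = render_streaming(fake_token_stream("Streaming feels faster"), frames.append)
```

The final text is identical to what a single request/response call would return; the only difference is that the user sees three intermediate frames instead of a spinner.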
Streaming is non-negotiable for any chat-style or long-form-generation UX. It is also genuinely hard to get right — see Streaming UI Patterns That Don't Break for the scroll behavior, partial-markdown rendering, and cancellation patterns that actually work.
The exception: backend pipelines and short structured outputs. If the response is going into another system, or if it is a 30-token JSON object the user does not read incrementally, drop streaming for the simpler request/response shape.
Move 2: cite the sources
For any answer grounded in external information — RAG over your documents, web search, retrieved code, tool results — citations are part of the answer, not a footnote. The reader needs to be able to verify, follow up, and gauge confidence at a glance.
Three placements that work in practice:
- Inline footnote markers (Perplexity-style superscripts that link to a source list below or beside the answer). Good for dense factual answers with many sources.
- Hover-cards that show the source title and a snippet on hover. Good when sources are mostly background and the reader needs them on demand.
- Sticky source panel alongside the answer, showing the retrieved chunks the model actually used. Good for high-trust use cases (research tools, legal, medical).
The deeper problem is making sure the citations are real — that the source the model claims it used actually says what the model implied. See Citations, Inline and Sticky for how to wire model-emitted citations to retrieval results and what to do when they do not match.
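A first line of defense can be sketched as follows, assuming the model emits `[n]`-style markers and that retrieval results are keyed by the same numbers (both are assumptions for illustration, as is the `check_citations` name). This only confirms that each marker points at a source that was actually retrieved; it does not verify that the source supports the claim.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CitationCheck:
    resolved: Dict[int, str]   # marker number -> source title
    unmatched: List[int]       # markers the model emitted that match no retrieved source


def check_citations(answer: str, sources: Dict[int, str]) -> CitationCheck:
    """Match [n]-style markers in the answer against the retrieved sources."""
    resolved: Dict[int, str] = {}
    unmatched: List[int] = []
    for match in re.finditer(r"\[(\d+)\]", answer):
        number = int(match.group(1))
        if number in sources:
            resolved[number] = sources[number]
        elif number not in unmatched:
            unmatched.append(number)
    return CitationCheck(resolved, unmatched)


# Hypothetical retrieval results and a model answer that cites a source that was never retrieved.
sources = {1: "Pricing FAQ", 2: "Refund policy"}
result = check_citations("Refunds take 5 days [2]. Plans start at $10 [1][4].", sources)
```

Here `result.unmatched` contains `4`, which the UI could render as an unlinked marker or strip rather than present as a real source.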
Move 3: make retry and edit cheap
The model varies. Sometimes the first response is exactly what the user wanted; sometimes it is not. The friction between "got something not quite right" and "got something better" is one of the strongest predictors of whether a user keeps using your product.
The mechanics worth shipping:
- Regenerate — replay the same input. One click. Should preserve everything before the regeneration point in the conversation.
- Edit — change the user's last message and resubmit. Implies discarding everything after that turn (since context has changed).
- Branch — fork from a turn into a parallel thread. Heavier UX but useful for coding agents, research tools, and any workflow where the user wants to explore alternatives without losing the original.
Each has implications for context-window management, cost (cache invalidation), and the user's mental model of what they are looking at. See Regenerate, Undo, Branch — Conversation Mechanics for the design choices that work and the ones that confuse users.
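The three mechanics can be sketched over a flat list of turns. This is an illustrative data model, not a prescribed one: `Thread`, `Turn`, and the method names are hypothetical, and the model call itself is left out, so `regenerate` takes the new reply as an argument.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str


@dataclass
class Thread:
    turns: List[Turn] = field(default_factory=list)

    def regenerate(self, new_reply: str) -> None:
        """Replace the last assistant reply; everything before it is preserved."""
        if not self.turns or self.turns[-1].role != "assistant":
            raise ValueError("nothing to regenerate")
        self.turns[-1] = Turn("assistant", new_reply)

    def edit_last_user_message(self, new_content: str) -> None:
        """Rewrite the last user message and discard every turn after it."""
        for index in range(len(self.turns) - 1, -1, -1):
            if self.turns[index].role == "user":
                self.turns = self.turns[:index] + [Turn("user", new_content)]
                return
        raise ValueError("no user message to edit")

    def branch(self, at_turn: int) -> "Thread":
        """Fork a new thread holding turns 0..at_turn; the original list is not modified."""
        return Thread(turns=list(self.turns[: at_turn + 1]))


thread = Thread([Turn("user", "Name a color"), Turn("assistant", "Blue")])
fork = thread.branch(at_turn=0)              # parallel thread starting from the first turn
thread.regenerate("Green")                   # same input, new reply
thread.edit_last_user_message("Name a fruit")  # drops the stale reply
```

Note how edit shortens the thread while branch leaves the original alone; that difference is what the UI has to make visible to the user.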
Move 4: surface reasoning state honestly
Long-running model calls — reasoning models, agent loops, multi-tool workflows — need a UI that tells the user something is happening, what specifically, and roughly when it will end.
The patterns that work:
- Status text that updates as the model moves through stages: "Searching…" → "Reading 4 results…" → "Drafting answer…" Anthropic's web search UI and Perplexity's pipeline both follow this shape.
- Collapsible reasoning panel for reasoning models. The full chain of thought is interesting to some users and noise to others; let them choose.
- Tool-call display for agent loops. The user sees what the agent is about to do before it does it (or just after), with the option to intervene. Cursor's agent mode and Claude's Computer Use UI both lean into this.
What does not work: a generic spinning indicator with no information. After about 3 seconds of "thinking…" with no further signal, users start wondering if it crashed.
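One way to drive stage-specific status text is to map pipeline events to copy. The event names below (`search_started`, `results_fetched`, `drafting_started`, `heartbeat`) are hypothetical; a real pipeline would emit its own. The sketch keeps the last specific status when it sees an event it does not recognize, rather than dropping back to a generic message.

```python
from typing import Any, Dict, List


def status_for(event: Dict[str, Any], previous: str) -> str:
    """Translate a pipeline event into user-facing status text."""
    kind = event.get("type")
    if kind == "search_started":
        return "Searching…"
    if kind == "results_fetched":
        return f"Reading {event['count']} results…"
    if kind == "drafting_started":
        return "Drafting answer…"
    # Unknown event: keep the last specific status instead of a generic one.
    return previous


status = ""
history: List[str] = []
for event in [
    {"type": "search_started"},
    {"type": "results_fetched", "count": 4},
    {"type": "heartbeat"},
    {"type": "drafting_started"},
]:
    status = status_for(event, status)
    history.append(status)
```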
Move 5: design the error states like you mean it
Refusal, rate limit, content filter, timeout, network drop, partial response, "model is overloaded." Each is a real state the user will hit. Each deserves its own copy, its own recovery path, and at minimum a sense that the product knows what happened.
Bad: a generic "Something went wrong" with a refresh button. Better: state-specific messaging that names the cause and offers the next step. "The model is at capacity right now — try again in a minute, or switch to the faster model in settings." "Your message hit our content filter — here's what triggered it and how to rephrase." "This conversation is too long — start a new thread or summarize the previous turns."
See Error and Refusal States Without Anger for the specific states worth designing for and what good copy looks like.
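State-specific messaging can start as a simple lookup from error state to copy and recovery action. The sketch below is a minimal illustration under assumed names: the `ErrorState` members, the `ErrorView` shape, and the button labels are hypothetical, and classifying a raw failure into one of these states is left to the caller.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorState(Enum):
    OVERLOADED = "overloaded"
    CONTENT_FILTER = "content_filter"
    CONTEXT_TOO_LONG = "context_too_long"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class ErrorView:
    message: str
    action: str   # label for the recovery button


ERROR_VIEWS = {
    ErrorState.OVERLOADED: ErrorView("The model is at capacity right now.", "Try again"),
    ErrorState.CONTENT_FILTER: ErrorView("Your message hit our content filter.", "Rephrase"),
    ErrorState.CONTEXT_TOO_LONG: ErrorView("This conversation is too long.", "Start a new thread"),
}

# Last resort only: reaching this means a state is missing from the table above.
FALLBACK = ErrorView("Something went wrong on our side.", "Retry")


def view_for(state: ErrorState) -> ErrorView:
    """Return the copy and recovery action for a given error state."""
    return ERROR_VIEWS.get(state, FALLBACK)
```

Keeping the table explicit makes gaps easy to spot: every state that falls through to `FALLBACK` is a state nobody has designed yet.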
The biggest trap: defaulting to chat
The single most common mistake in the 2024–2026 cohort of LLM features is wrapping every capability in a chat box. Chat is the right shape for open-ended exploration; it is the wrong shape for almost everything else.
Tasks with a fixed shape (summarize this, classify that, rewrite the highlighted text) work better as a single transformation: the input visible, the output visible alongside it, a small action bar to regenerate or refine. No conversation log, no scroll, no "what should I ask next" cognitive load.
Tasks embedded in another product (an issue tracker, an IDE, a design tool) work better as inline actions or context-aware commands. Cursor's inline-edit (Cmd-K), Linear's "ask AI about this issue," Figma's "generate variant" — none of these are chats, and that is the point. The capability lives where the user already is.
Chat is a default. Defaults matter; they are also the most common reason a product feels generic.
What to study
Three products worth disassembling because each gets a different part right:
- Perplexity for citation rendering. The inline footnote model with a sources panel below is well-tuned and copies cleanly into other products.
- Cursor for diff and agentic UI. Editing code via AI without losing trust requires very careful diff display and approval mechanics; Cursor's iterations are instructive.
- Claude.ai with artifacts for the side-panel pattern. When the model is producing a substantive deliverable (a document, code, a diagram), pulling it out of the conversation log into its own pane works better than letting it scroll past.
These five moves cover the majority of LLM UX work. The rest of this hub goes deep on each one.