Does shortening the prompt always hurt accuracy?

No, and surprisingly often the opposite. Long prompts add noise and dilute the relevant instruction. Tightening a bloated system prompt usually improves both cost and accuracy, especially on smaller models. The mistake to avoid is removing genuinely load-bearing context (constraints, examples, format specs); the right move is removing decorative throat-clearing, unused tool descriptions, redundant instructions, and stale conversation history.

Is LLMLingua worth using?

For large stable contexts (5,000+ tokens of documents, code, or transcript), LLMLingua and similar compressors can cut 30–60% of input tokens with modest accuracy loss. The catch is latency — the compression itself takes a model call or a local compressor pass — and complexity, since you now have a moving piece in your pipeline. For most apps, simpler measures (retrieval, caching, structured output) get you most of the saving without the extra moving part. Bring in research compression when you have already done the simple things.

How do I limit output length?

Three levers, used together. First, ask for it directly in the prompt — "respond in at most 3 sentences" or "return only the JSON object, no commentary." Second, set max_tokens to a reasonable cap so the model cannot run away even if the prompt fails to constrain it. Third, ask for structured output (JSON, a short list, a specific schema) rather than open-ended prose; structure naturally bounds length. Output is typically 3–5x more expensive per token than input, so capping it is high-leverage.

Cutting Tokens Without Cutting Accuracy

Most teams have a 30–50% token reduction available without doing anything clever. The wins are unglamorous: tighten the system prompt, retrieve instead of dumping, cap output, ask for structure, and roll up conversation history. Research-grade prompt compression exists and works, but the simple measures get you most of the way and do not require a moving part in your pipeline.

This is the order I would attack a bloated workload in.

1. Tighten the system prompt

System prompts have a tendency to grow. A new edge case shows up; somebody adds a paragraph. Two months later the system prompt is 4,000 tokens of layered instructions, half of which contradict each other and a third of which describe tools the model no longer has access to.

The exercise: print the current system prompt, and for every sentence ask "what would change if I deleted this line?" If the answer is "nothing observable," delete it. If the answer is "the model would do X wrong on the Y kind of input," keep it and write a test that catches that case.

What usually goes:

Throat-clearing introductions ("You are a helpful AI assistant...").
Personality directives that do not affect outputs measurably.
Instructions for tools the model no longer has.
Redundant constraints already enforced by a schema or downstream validator.
Multi-paragraph examples where a single tight example would do.

What stays:

The role and the specific task.
Constraints that have visible failure modes if removed.
One or two tight few-shot examples of the correct output shape.
Output format specification, if the consumer is code.

Done well, this is often a 30–50% reduction on the system prompt alone, with no accuracy regression and frequently an improvement — smaller models in particular handle tight prompts better than sprawling ones.
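The deletion exercise can be partly automated. The following is a minimal sketch, assuming you already have a test suite that can say whether a given prompt still behaves correctly. The names ablate_prompt and fake_passes_tests are hypothetical, and the stub stands in for a real evaluation run against the model.

```python
from typing import Callable, List, Tuple


def ablate_prompt(
    sentences: List[str],
    passes_tests: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Try deleting each sentence in turn; keep the deletion if tests still pass.

    Returns (kept, removed), both in original order.
    """
    keep = [True] * len(sentences)
    for i in range(len(sentences)):
        keep[i] = False  # tentatively delete sentence i
        candidate = " ".join(s for s, k in zip(sentences, keep) if k)
        if not passes_tests(candidate):
            keep[i] = True  # something observable changed, so restore it
    kept = [s for s, k in zip(sentences, keep) if k]
    removed = [s for s, k in zip(sentences, keep) if not k]
    return kept, removed


def fake_passes_tests(prompt: str) -> bool:
    # Stand-in for a real eval suite (assumption): pretend the tests only
    # fail when the task description or the format spec is missing.
    return "refund" in prompt and "JSON" in prompt


sentences = [
    "You are a helpful AI assistant.",
    "Answer questions about the refund policy.",
    "Always be cheerful.",
    "Return only a JSON object.",
]
kept, removed = ablate_prompt(sentences, fake_passes_tests)
print(kept)     # the two sentences the stub tests depend on
print(removed)  # the throat-clearing and the personality directive
```

A greedy pass like this is only as good as the tests behind it: a sentence that guards an untested edge case will be deleted silently, which is why the exercise pairs every kept sentence with a test.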

2. Retrieve instead of dumping

The other place tokens accumulate is the user-supplied context: dumping in a whole document, the whole conversation, all retrieved chunks, every tool result.

The fix is to retrieve more selectively. For a 50-page document, do not paste the whole thing — extract or retrieve the 1–2 sections relevant to the current question. For a long conversation, retrieve the last few turns plus a rolling summary, not the verbatim history. For RAG, return the top-k chunks that actually match, not the top-50 to "be safe."

Selective retrieval improves accuracy for the same reason tight system prompts do: less noise, better focus. The accuracy curve in long-context evals is famously non-monotonic: more context can hurt past a threshold (see Context Rot, Explained). Cutting from 20,000 tokens to 5,000 tokens of retrieved-and-relevant context often improves answer quality and also cuts the input cost of that context by 75%.
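A minimal sketch of selective retrieval, assuming a crude word-overlap score in place of the embedding search a real RAG pipeline would use. The function names and sample chunks are hypothetical; the point is only that chunks with no match are dropped rather than passed along "to be safe."

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_k_chunks(question: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return at most k chunks, ranked by how many words they share with the question."""
    q = tokenize(question)
    scored = [(len(q & tokenize(c)), i, c) for i, c in enumerate(chunks)]
    # Drop chunks with no overlap at all instead of padding the context.
    matching = [t for t in scored if t[0] > 0]
    matching.sort(key=lambda t: (-t[0], t[1]))  # best score first, then original order
    return [c for _, _, c in matching[:k]]


chunks = [
    "Refunds are issued within 14 days of a return request.",
    "Our office is closed on public holidays.",
    "Shipping takes 3 to 5 business days.",
    "A return request must include the order number.",
]
question = "How long do refunds take after a return request?"
selected = top_k_chunks(question, chunks, k=2)
print(selected)

# The arithmetic from the paragraph above: 20,000 -> 5,000 input tokens.
saving = 1 - 5_000 / 20_000
print(f"{saving:.0%} fewer context tokens")  # 75% fewer context tokens
```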

3. Cap and shape the output

Output is typically 3–5x more expensive per token than input. A long, rambling answer is a more expensive answer.
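To make the asymmetry concrete, here is a small worked calculation. The prices are hypothetical placeholders (1 dollar per million input tokens, 4 dollars per million output tokens, a 4x ratio inside the range above); substitute your provider's actual rates.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Cost in dollars of one request, given per-million-token prices."""
    return (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000


# Hypothetical prices, for illustration only.
INPUT_PRICE = 1.0
OUTPUT_PRICE = 4.0

rambling = request_cost(2_000, 800, INPUT_PRICE, OUTPUT_PRICE)
capped = request_cost(2_000, 200, INPUT_PRICE, OUTPUT_PRICE)

print(f"rambling answer: ${rambling:.4f}")
print(f"capped answer:   ${capped:.4f}")
print(f"saving:          {1 - capped / rambling:.0%}")
```

Under these assumed prices, trimming 600 output tokens saves more than removing the entire 2,000-token input would.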

Three levers, used in combination:

Ask for it in the prompt. "Respond in at most 3 sentences." "Return only the JSON object, no preamble." "Bullet points, maximum 5." Models respect these instructions well, especially if reinforced by an example.
Set max_tokens. A reasonable upper bound prevents the worst-case runaway response. Set it to the realistic maximum you actually want, not the model's default ceiling.
Use structured output. A JSON object with a defined schema is naturally shorter than the equivalent prose. It also makes the response easier to validate downstream, which is its own kind of saving (less reprompting, less reasoning over noisy output).

For chat-style UX the lever is mostly the prompt and the max_tokens. For agent-style pipelines where the output is consumed by code, structured output is the single biggest win — both for cost and for downstream reliability.
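For the agent-style case, the saving from structured output only materializes if the consumer rejects replies that drift from the schema. The sketch below assumes a hypothetical two-field schema and shows one way to validate a raw reply with the standard library; it makes no claim about any particular provider's structured-output feature.

```python
import json
from typing import Any, Dict, Optional

# Hypothetical schema: field name -> required Python type.
REQUIRED_KEYS = {"sentiment": str, "confidence": float}


def parse_reply(raw: str) -> Optional[Dict[str, Any]]:
    """Return the parsed object if it matches the expected shape, else None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None  # preamble, commentary, or truncated JSON
    if not isinstance(obj, dict) or set(obj) != set(REQUIRED_KEYS):
        return None  # missing or extra fields
    for key, expected_type in REQUIRED_KEYS.items():
        if not isinstance(obj[key], expected_type):
            return None
    return obj


good = parse_reply('{"sentiment": "positive", "confidence": 0.9}')
chatty = parse_reply('Sure! Here is the JSON: {"sentiment": "positive", "confidence": 0.9}')
partial = parse_reply('{"sentiment": "positive"}')
print(good, chatty, partial)
```

A reply cut off by max_tokens fails the json.loads step, so the same check also catches a cap that was set too low.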

4. Roll up the conversation history

In a long-running chat, the cost of every turn includes the cost of replaying every previous turn. By turn 50, the input is dominated by ancient context the model rarely needs.

The standard fix is a rolling summary: every N turns, ask the model to summarize the conversation so far into a short paragraph. Replace the verbatim history older than the last few turns with the summary. The next turn sees: summary + last K turns + new user message — usually under 1,000 tokens regardless of how long the conversation has actually run.

The trade-off is some fidelity loss on details from earlier turns. Two mitigations:

Keep the last K turns verbatim (3–5 is usually enough to preserve immediate coherence).
Cache the summary aggressively — it changes only every N turns, so the long stable prefix benefits from prompt caching.
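The rolling-summary scheme can be sketched as follows. This is a minimal illustration, not a reference implementation: RollingHistory and fake_summarize are hypothetical names, and the stub summarizer stands in for the model call that would produce the real summary.

```python
from typing import Callable, List


class RollingHistory:
    """Keep the last K turns verbatim and fold older turns into a summary."""

    def __init__(
        self,
        summarize: Callable[[str, List[str]], str],
        keep_last: int = 4,
        summarize_every: int = 6,
    ) -> None:
        if keep_last < 1 or summarize_every < 1:
            raise ValueError("keep_last and summarize_every must be at least 1")
        self.summarize = summarize
        self.keep_last = keep_last
        self.summarize_every = summarize_every
        self.summary = ""
        self.recent: List[str] = []

    def add_turn(self, turn: str) -> None:
        self.recent.append(turn)
        # Fold once N turns have piled up beyond the K we keep verbatim,
        # so the summary changes only every N turns.
        if len(self.recent) >= self.keep_last + self.summarize_every:
            old = self.recent[: -self.keep_last]
            self.recent = self.recent[-self.keep_last:]
            self.summary = self.summarize(self.summary, old)

    def context(self, new_message: str) -> List[str]:
        """Build the next request: summary + last K turns + new user message."""
        parts: List[str] = []
        if self.summary:
            parts.append("Summary of earlier conversation: " + self.summary)
        return parts + self.recent + [new_message]


def fake_summarize(previous: str, turns: List[str]) -> str:
    # Stand-in for a model call (assumption): just records how much was folded.
    note = f"[{len(turns)} turns condensed]"
    return f"{previous} {note}" if previous else note


history = RollingHistory(fake_summarize, keep_last=2, summarize_every=3)
for n in range(1, 8):
    history.add_turn(f"t{n}")
print(history.context("new question"))
```

Because the summary is rewritten only when a fold happens, the summary line at the front of the context stays byte-identical between folds, which is what lets prompt caching apply to it.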

For agentic workflows where the "conversation" is actually a chain of tool calls and results, the same idea applies: summarize old tool results once they are no longer relevant to the current step.

5. (Only after the above) Research-grade compression

Microsoft Research's LLMLingua and LongLLMLingua compress prompts by dropping tokens that contribute little information to the answer, using a small language model to score importance. They typically cut 30–60% of input tokens with modest accuracy loss on long contexts.
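To give a feel for score-and-drop compression, here is a toy sketch. It is not LLMLingua's algorithm or API: where those tools score tokens with a small language model, this stand-in uses a hypothetical word-frequency table and treats common words as low-information. The function name and the counts are invented for illustration.

```python
import math
from collections import Counter
from typing import List


def prune_tokens(text: str, keep_ratio: float, reference_counts: Counter) -> str:
    """Keep roughly keep_ratio of the words, dropping the most predictable ones."""
    words: List[str] = text.split()
    total = sum(reference_counts.values())
    vocab = len(reference_counts) + 1

    def information(word: str) -> float:
        # Add-one smoothing: unseen words get the lowest probability,
        # and therefore the highest information score.
        p = (reference_counts[word.lower()] + 1) / (total + vocab)
        return -math.log(p)

    n_keep = max(1, round(len(words) * keep_ratio))
    # Rank positions by information, highest first; ties keep original order.
    ranked = sorted(range(len(words)), key=lambda i: (-information(words[i]), i))
    keep = set(ranked[:n_keep])
    return " ".join(w for i, w in enumerate(words) if i in keep)


# Hypothetical frequency table standing in for a small scoring model.
counts = Counter({"the": 50, "of": 30, "is": 25, "a": 20, "in": 15})
compressed = prune_tokens("the capital of France is Paris", 0.5, counts)
print(compressed)  # capital France Paris
```

The output also shows why compressed text is awkward to quote from: the surviving words are real, but the sentence no longer is.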

These work, but they add a moving piece:

Compression itself takes a model call or a local compressor pass, which adds latency and a separate model dependency.
Compression artifacts can confuse downstream consumers if they need to quote or cite from the input.
Below a few thousand input tokens, the simpler measures above dominate; compression has overhead but little to compress.

The honest order of operations: do steps 1–4 first, measure where you actually still have bloat, and only reach for research compression on the workloads where the input is genuinely large, stable enough to compress once, and not amenable to retrieval. RAG pipelines with very long documents, code-search agents over big repositories, and historical-transcript analysis are the sweet spot.

What not to do

Two compression strategies look smart and are usually traps:

Aggressive abbreviation in prompts. Replacing words with letters, deleting articles, switching to bullet-point shorthand. Saves a few percent, makes the prompt fragile, and confuses smaller models that have less ability to parse degraded English. The token saving is not worth the accuracy loss.
Removing examples to save tokens. Few-shot examples are some of the most useful tokens in a prompt — they teach the model the output shape and pin down edge cases. Cutting examples is almost always a false economy. If examples are the largest part of the prompt, cache them; do not delete them.

The pattern across all five steps: cut the noise, keep the signal. Token reduction that hurts accuracy is not a saving, it is a quality regression with a smaller invoice.

Comments 2

u/billtran · 1 month ago

we cut our bill like 40% just by trimming few-shot examples we didnt actually need. this matches my experience almost exactly

u/tokenmiser · 1 month ago

trimming the redundant few-shots is the easy 30%. the scary part is figuring out which examples were quietly load bearing all along
