Fine-Tuning vs. Prompting — When to Actually Change the Model

Article summary

Exhaust the cheap options first: a clearer prompt, then few-shot examples, then retrieval (RAG) for knowledge. Fine-tuning earns its cost when you need a consistent format, style, or behavior at scale, lower latency and cost per call by shrinking prompts, or a narrow skill baked in. It is the wrong tool for adding knowledge; that is what RAG is for. Fine-tuning also adds a data-collection cost, a staleness problem, and an ongoing eval burden, so reach for it last, not first.

Fine-tuning is the option people reach for first and need last. It feels like the serious move, the one that actually changes the model instead of fiddling with words. But changing the model is expensive in ways that do not show up until later: you have to build a dataset, you have to retrain when the base model updates, and you have to keep an eval suite alive to know the fine-tune did not quietly regress. The cheaper options solve most problems, so the real skill is knowing when they run out.

Climb the ladder in order

Think of the options as a ladder, with the cheapest at the bottom. Each rung is cheaper and faster to undo than the one above it.

Write a clearer prompt. Name the task, the output format, and the constraints. Most "the model is bad at this" problems are "the prompt is vague" problems wearing a costume.
Add few-shot examples. Show the model two or three worked input/output pairs. This pins down format and edge-case handling that prose cannot.
Add retrieval (RAG). If the gap is missing knowledge, put the knowledge in front of the model at call time instead of in the weights. See how-to-give-an-llm-context for the mechanics.
Fine-tune. Only when the rungs below leave a real, measured gap that is about behavior, not information.
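The first three rungs are all changes to the text you send, which is part of why they are cheap to undo. Here is a minimal sketch of how they might stack in one prompt builder. The function name `build_prompt`, its parameters, and the section labels are illustrative assumptions, not a required layout.

```python
from typing import Sequence


def build_prompt(
    task: str,
    output_format: str,
    constraints: Sequence[str] = (),
    examples: Sequence[tuple[str, str]] = (),
    context: Sequence[str] = (),
    user_input: str = "",
) -> str:
    """Assemble one prompt string. Section labels are illustrative only."""
    # Rung 1: name the task, the output format, and the constraints.
    parts = [f"Task: {task}", f"Output format: {output_format}"]
    if constraints:
        parts.append("Constraints:\n" + "\n".join(f"- {c}" for c in constraints))
    # Rung 3: retrieved passages, supplied at call time rather than trained in.
    if context:
        parts.append("Context:\n" + "\n\n".join(context))
    # Rung 2: worked input/output pairs that pin down format and edge cases.
    for source, target in examples:
        parts.append(f"Input: {source}\nOutput: {target}")
    # The real request goes last, in the same shape as the examples.
    parts.append(f"Input: {user_input}\nOutput:")
    return "\n\n".join(parts)
```

Calling it with only `task` and `output_format` gives the rung-one prompt; passing `examples` or `context` adds a rung without touching the rest, so each step can be measured and reverted on its own.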

You cannot skip rungs honestly. If you have not measured the gap on an eval set, you do not know which rung you are on.
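Measuring the gap can be as simple as scoring each rung on the same eval set. The sketch below assumes exact-match scoring and a target you pick yourself; `accuracy`, `cheapest_rung_meeting`, and the rung labels are hypothetical names, and real tasks often need a softer metric than exact match.

```python
from typing import Callable, Optional, Sequence

Example = tuple[str, str]  # (input, expected output)

# Rungs in climbing order, cheapest first.
RUNGS = ("clear prompt", "few-shot", "retrieval", "fine-tune")


def accuracy(predict: Callable[[str], str], eval_set: Sequence[Example]) -> float:
    """Fraction of eval examples where the prediction matches exactly."""
    if not eval_set:
        raise ValueError("eval set is empty; there is nothing to measure")
    hits = sum(predict(x).strip() == y.strip() for x, y in eval_set)
    return hits / len(eval_set)


def cheapest_rung_meeting(scores: dict[str, float], target: float) -> Optional[str]:
    """Return the lowest measured rung whose score meets the target, else None."""
    for rung in RUNGS:
        if rung in scores and scores[rung] >= target:
            return rung
    return None
```

If `cheapest_rung_meeting` returns `None` with the first three rungs measured, that is the measured gap; if a rung is missing from `scores`, you have not tried it yet.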

What fine-tuning is actually good at

Fine-tuning earns its cost in three situations:

Consistent format, style, or behavior at scale. When you need every output to follow the same structure, tone, or policy, and few-shot examples are not getting you there reliably, baking the behavior into the weights makes it the default instead of something you have to re-request every call.
Lower latency and cost per call. A fine-tuned model can do with a short prompt what a base model needs a long one for. If you are paying for a giant system prompt on every request at high volume, moving that behavior into the weights cuts tokens, cost, and latency.
A narrow skill. Teaching the model a specific transformation it does badly out of the box (a particular classification scheme, a house code style, a structured-extraction shape) is squarely in fine-tuning's strike zone.
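"Not getting you there reliably" is something you can put a number on. As a rough sketch, assuming the required structure is a JSON object with a fixed set of keys (the keys in `REQUIRED_KEYS` are hypothetical), a compliance rate over a batch of outputs might look like this:

```python
import json
from typing import Iterable

REQUIRED_KEYS = {"label", "confidence"}  # hypothetical output schema


def is_compliant(output: str) -> bool:
    """True if the output parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and REQUIRED_KEYS <= parsed.keys()


def compliance_rate(outputs: Iterable[str]) -> float:
    """Fraction of outputs that follow the required structure."""
    results = [is_compliant(o) for o in outputs]
    if not results:
        raise ValueError("no outputs to score")
    return sum(results) / len(results)
```

Tracking this rate for the prompted model gives you the baseline a fine-tune would have to beat.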

What fine-tuning is bad at

The biggest mistake is using it to add knowledge. Fine-tuning on a pile of facts does not reliably make the model know those facts; it makes the model fluent at producing text that looks like those facts, which often means hallucinating in the same register with more confidence. Knowledge that changes belongs in a retrievable source, not in frozen weights you would have to retrain to update.
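To make the "retrievable source" point concrete, here is a toy sketch in which a plain dictionary stands in for the knowledge store and word overlap stands in for a real retriever. Both are deliberate simplifications, and the function names and document text are made up for illustration.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase whitespace tokens; crude on purpose."""
    return set(text.lower().split())


def retrieve(query: str, store: dict[str, str], k: int = 1) -> list[str]:
    """Return the k stored passages sharing the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        store.values(),
        key=lambda text: len(query_words & tokenize(text)),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical knowledge store.
store = {
    "refunds": "Refunds are accepted within 30 days.",
    "shipping": "Shipping takes 5 business days.",
}

before = retrieve("How do refunds work", store)

# The fact changes: edit the source. No retraining is involved.
store["refunds"] = "Refunds are accepted within 60 days."
after = retrieve("How do refunds work", store)
```

The update is one assignment. Making the same change in fine-tuned weights would mean another training run.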

The honest tradeoffs

Before you commit, price in the parts nobody mentions on the sales slide:

Data cost. A good fine-tune needs a clean, consistent dataset. That is real work; the dataset is the product. See building-a-fine-tuning-dataset.
Staleness. Your fine-tune is frozen at the data you trained on. When the base model gets a better version, your adaptation does not automatically come along.
Eval burden. A fine-tune you cannot measure is a fine-tune you cannot trust. You need a held-out eval to prove it helped and to catch regressions later.
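One way to keep the eval burden manageable is to fix the held-out set once and compare every new fine-tune against a stored baseline. The sketch below is one possible approach, not a standard: it assumes each example has a stable ID and that you track a named score per metric. `is_held_out`, `regressions`, and the tolerance value are hypothetical.

```python
import hashlib


def is_held_out(example_id: str, fraction: float = 0.2) -> bool:
    """Assign an example to the held-out set by hashing its ID.

    Hashing keeps the split stable as the dataset grows, so an example
    does not drift between training and eval from one run to the next.
    """
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < fraction * 2**64


def regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,
) -> list[str]:
    """Names of metrics where the candidate falls below baseline by more than tolerance.

    A metric missing from the candidate is scored 0.0, so it shows up as a regression.
    """
    return sorted(
        name
        for name, base in baseline.items()
        if candidate.get(name, 0.0) < base - tolerance
    )
```

An empty list from `regressions` does not prove the fine-tune is better; it only says that nothing you measure got worse.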

If a clearer prompt plus a few examples plus retrieval closes the gap, ship that. Fine-tune when, and only when, you have measured a behavioral gap those cannot close and your volume justifies the ongoing cost.

Frequently asked questions

Should I fine-tune to teach the model new facts?

Almost never. Fine-tuning is good at teaching behavior, format, and style; it is unreliable at teaching facts, and worse, it tends to teach the model to state facts confidently whether or not it actually learned them. If your problem is "the model does not know our internal docs," the answer is retrieval (RAG), where the facts live in a source you can update without retraining. Fine-tune to change how the model acts, retrieve to change what it knows.

How do I know I have exhausted prompting?

You have exhausted prompting when a clear instruction, a specified output format, and a handful of few-shot examples still leave a measurable gap on a real eval set, and that gap is about consistency or behavior rather than missing information. If you have not built an eval set yet, you have not exhausted prompting. Most "we need to fine-tune" conclusions are reached on a couple of bad outputs, which is too small a sample to tell a prompt problem from a model problem.
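To see why a couple of bad outputs is too little evidence, it helps to put an interval around the failure rate. The sketch below uses the Wilson score interval, one common choice for a proportion; the function name and the counts in the usage lines are illustrative, not measurements.

```python
import math


def wilson_interval(failures: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a failure rate (z = 1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= failures <= n:
        raise ValueError("need n > 0 and 0 <= failures <= n")
    p = failures / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Same observed failure rate (40%), very different certainty.
small = wilson_interval(failures=2, n=5)     # a couple of bad outputs
large = wilson_interval(failures=80, n=200)  # a real eval set
```

With five samples the interval spans most of the range from "mostly fine" to "mostly broken"; with two hundred it is narrow enough to compare one rung against another.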

Is fine-tuning cheaper than a long prompt at scale?

It can be. If every request carries a 2,000-token system prompt with twenty few-shot examples, you pay for those tokens on every single call. Baking that behavior into the weights lets you ship a short prompt instead, which cuts per-call cost and latency. Whether that saving beats the cost of building the dataset and retraining depends on your volume; it pays off at high call counts and rarely pays off at low ones.
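The break-even point is simple arithmetic once you fill in your own numbers. In the sketch below only the 2,000-token prompt comes from the example above; the short-prompt length, the token price, and the one-time cost are placeholders. It also assumes the fine-tuned model is billed at the same per-token rate as the base model, which may not hold for your provider, and it ignores the recurring cost of retraining.

```python
def break_even_calls(
    tokens_saved_per_call: int,
    price_per_1k_input_tokens: float,
    one_time_cost: float,
) -> float:
    """Number of calls at which the prompt-token savings equal the one-time cost."""
    saving_per_call = tokens_saved_per_call / 1000 * price_per_1k_input_tokens
    if saving_per_call <= 0:
        raise ValueError("fine-tuning must shorten the prompt to save anything")
    return one_time_cost / saving_per_call


long_prompt_tokens = 2000   # from the example above
short_prompt_tokens = 200   # placeholder
calls_needed = break_even_calls(
    tokens_saved_per_call=long_prompt_tokens - short_prompt_tokens,
    price_per_1k_input_tokens=0.001,  # placeholder price in dollars
    one_time_cost=5000.0,             # placeholder: dataset work plus training
)
```

Under these placeholder numbers the break-even lands in the millions of calls, which is the shape of the answer to expect: the saving per call is tiny, so only volume makes it add up.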