Fine-tuning is the option people reach for first and need last. It feels like the serious move, the one that actually changes the model instead of fiddling with words. But changing the model is expensive in ways that do not show up until later: you have to build a dataset, you have to retrain when the base model updates, and you have to keep an eval suite alive to know the fine-tune did not quietly regress. The cheaper options solve most problems, so the real skill is knowing when they run out.
Climb the ladder in order
Think of it as a ladder. Each rung is cheaper and faster to undo than the one above it.
- Write a clearer prompt. Name the task, the output format, and the constraints. Most "the model is bad at this" problems are "the prompt is vague" problems wearing a costume.
- Add few-shot examples. Show the model two or three worked input/output pairs. This pins down format and edge-case handling that prose cannot.
- Add retrieval (RAG). If the gap is missing knowledge, put the knowledge in front of the model at call time instead of in the weights. See how-to-give-an-llm-context for the mechanics.
- Fine-tune. Only when the rungs below leave a real, measured gap that is about behavior, not information.
You cannot honestly skip rungs. Until you have measured the gap on an eval set, you do not know which rung you are on.
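One way to make "measure the gap" concrete is to score each rung against the same eval set and stop at the first one that clears your bar. The sketch below is illustrative only: `accuracy`, `first_rung_that_passes`, the exact-match scoring, and the stand-in predictors are all assumptions made for the example, not part of any library. In practice each predictor would wrap a real model call.

```python
from typing import Callable, Optional, Sequence

# Hypothetical types: an eval set is (input, expected_output) pairs, and a
# "rung" is any callable that maps an input string to the model's output.
EvalSet = Sequence[tuple[str, str]]
Predictor = Callable[[str], str]


def accuracy(predict: Predictor, eval_set: EvalSet) -> float:
    """Fraction of eval examples whose output exactly matches the expected one."""
    if not eval_set:
        raise ValueError("eval set is empty")
    hits = sum(1 for text, expected in eval_set if predict(text).strip() == expected)
    return hits / len(eval_set)


def first_rung_that_passes(
    rungs: Sequence[tuple[str, Predictor]],
    eval_set: EvalSet,
    target: float,
) -> Optional[tuple[str, float]]:
    """Walk the rungs cheapest-first; return the first one that meets the target."""
    for name, predict in rungs:
        score = accuracy(predict, eval_set)
        if score >= target:
            return name, score
    return None  # no cheaper rung closed the gap


# Stand-in data and predictors so the sketch runs without a model.
EVAL_SET = [
    ("refund please", "refund"),
    ("card declined", "billing"),
    ("where is my order", "shipping"),
    ("cancel my plan", "billing"),
]


def vague_prompt(text: str) -> str:
    return "billing"  # always guesses the same label


def few_shot_prompt(text: str) -> str:
    return dict(EVAL_SET)[text]  # pretends the examples fixed the behavior


result = first_rung_that_passes(
    [("clearer prompt", vague_prompt), ("few-shot", few_shot_prompt)],
    EVAL_SET,
    target=0.9,
)
```

A result of `None` is the only outcome that argues for moving up to the next rung. Exact match is the simplest possible metric; a real eval would likely need something task-specific.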
What fine-tuning is actually good at
Fine-tuning earns its cost in three situations:
- Consistent format, style, or behavior at scale. When you need every output to follow the same structure, tone, or policy, and few-shot examples are not getting you there reliably, baking the behavior into the weights makes it the default instead of something you have to re-request every call.
- Lower latency and cost per call. A fine-tuned model can do with a short prompt what a base model needs a long one for. If you are paying for a giant system prompt on every request at high volume, moving that behavior into the weights cuts tokens, cost, and latency.
- A narrow skill. Teaching the model a specific transformation it does badly out of the box (a particular classification scheme, a house code style, a structured-extraction shape) is squarely in fine-tuning's strike zone.
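The latency-and-cost argument is easy to check with arithmetic before committing. The sketch below compares monthly prompt spend for a long prompt on a base model against a short prompt on a fine-tuned one. Every number in it is a made-up placeholder, the function names are hypothetical, and it counts input tokens only. It also assumes the fine-tuned model may be priced differently per token, so check your provider's actual rates.

```python
def monthly_prompt_cost(calls: int, prompt_tokens: int, price_per_million: float) -> float:
    """Input-token spend per month, in the same currency as the price."""
    return calls * prompt_tokens * price_per_million / 1_000_000


def monthly_savings(
    calls: int,
    long_prompt_tokens: int,
    short_prompt_tokens: int,
    base_price_per_million: float,
    tuned_price_per_million: float,
) -> float:
    """Positive when the short prompt on the tuned model is cheaper."""
    before = monthly_prompt_cost(calls, long_prompt_tokens, base_price_per_million)
    after = monthly_prompt_cost(calls, short_prompt_tokens, tuned_price_per_million)
    return before - after


# Placeholder figures for illustration only, not real prices.
savings = monthly_savings(
    calls=1_000_000,
    long_prompt_tokens=3_000,
    short_prompt_tokens=200,
    base_price_per_million=1.00,
    tuned_price_per_million=2.00,
)
```

With these placeholder inputs the long prompt costs 3,000 per month and the short one 400, so the saving is 2,600 per month. That figure still has to cover dataset work, retraining, and eval upkeep before the fine-tune pays for itself, and at low call volume it usually will not.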
What fine-tuning is bad at
The biggest mistake is using it to add knowledge. Fine-tuning on a pile of facts does not reliably make the model know those facts; it makes the model fluent at producing text that looks like those facts, which often means hallucinating in the same register with more confidence. Knowledge that changes belongs in a retrievable source, not in frozen weights you would have to retrain to update.
The honest tradeoffs
Before you commit, price in the parts nobody mentions on the sales slide:
- Data cost. A good fine-tune needs a clean, consistent dataset. That is real work; the dataset is the product. See building-a-fine-tuning-dataset.
- Staleness. Your fine-tune is frozen at the data you trained on. When the base model gets a better version, your adaptation does not carry over automatically; you have to retrain on the new base.
- Eval burden. A fine-tune you cannot measure is a fine-tune you cannot trust. You need a held-out eval to prove it helped and to catch regressions later.
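A held-out eval only proves something if its examples never leak into training, and a regression check only works if the split stays the same between runs. One minimal way to get both, sketched here under assumptions, is to assign each example to train or eval by hashing a stable ID. The `id` field, the 10% eval fraction, and the tolerance are illustrative choices; only `hashlib` is a real library.

```python
import hashlib


def is_held_out(example_id: str, eval_fraction: float = 0.1) -> bool:
    """Deterministically map an ID to [0, 1) and hold out the lowest slice."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < eval_fraction


def split(examples: list[dict], eval_fraction: float = 0.1) -> tuple[list[dict], list[dict]]:
    """Return (train, held_out); the same ID always lands on the same side."""
    train, held_out = [], []
    for example in examples:
        target = held_out if is_held_out(example["id"], eval_fraction) else train
        target.append(example)
    return train, held_out


def regressed(baseline_score: float, new_score: float, tolerance: float = 0.01) -> bool:
    """True when the new score drops below the baseline by more than the tolerance."""
    return new_score < baseline_score - tolerance


# Hypothetical dataset: each example needs a stable, unique "id".
examples = [{"id": f"ex-{i}", "input": f"text {i}", "output": "label"} for i in range(1000)]
train, held_out = split(examples)
```

Because the assignment depends only on the ID, adding new examples later never moves an old one across the boundary, so scores from different retraining runs stay comparable.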
If a clearer prompt plus a few examples plus retrieval closes the gap, ship that. Fine-tune when, and only when, you have measured a behavioral gap those cannot close and your volume justifies the ongoing cost.