How to Build a Fine-Tuning Dataset

Article summary

In fine-tuning, the dataset is the product. Favor quality over quantity: a few hundred consistent, correct, well-formatted examples usually beat thousands of noisy ones, because the model copies whatever pattern dominates the data, including the mistakes. Cover the real distribution of inputs plus the edge cases that matter, keep formatting rigidly consistent, and hold out an eval set the model never sees. The most common failure is data that teaches the model the right format but the wrong habit, namely answering confidently when it should not.

In fine-tuning, the model architecture is fixed, the training recipe is mostly fixed, and the one thing you fully control is the data. That makes the dataset the product. A mediocre dataset produces a mediocre fine-tune no matter how good the base model is, because fine-tuning is imitation: the model learns to reproduce the patterns in your examples, faithfully, including the patterns you did not mean to teach.

Quality beats quantity, and it is not close

The instinct is to gather as much data as possible. The better instinct is to gather as little bad data as possible. A few hundred examples that are correct, consistent, and representative will usually beat several thousand that are noisy, because the model latches onto whatever pattern dominates the set. Noise does not average out into nothing; it teaches the model that noise is acceptable output.

The practical implication: read your data. Every example you keep should be one you would be happy to ship as a real output. If you would not, fix it or cut it. A dataset you have actually inspected line by line is worth more than a scraped pile you never opened.

Cover the real distribution, then the edges

A good dataset looks like the traffic the model will actually see, plus the hard cases that matter.

Mirror the distribution. If most real inputs are short and a fraction are long or multilingual, your examples should reflect those proportions. Train only on tidy inputs and the model will be brittle on the messy ones that show up in production.
Include the edge cases. That means empty inputs, malformed inputs, and inputs where the correct answer is "I cannot do that" or "the source does not say." These are where fine-tunes quietly fail, and the only way to get them right is to put them in the data.
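
One rough way to check the mix is to bucket a sample of production inputs and the training inputs the same way, then compare the shares. The sketch below is only an illustration: the bucket names, the 200-character threshold, and the helpers bucket, shares, and share_gap are all assumptions, and real traffic will need buckets that fit the task.

```python
from collections import Counter

def bucket(text: str) -> str:
    """Assign an input to a coarse bucket. Names and thresholds are illustrative assumptions."""
    stripped = text.strip()
    if not stripped:
        return "empty"
    if len(stripped) > 200:  # assumed cutoff for a "long" input
        return "long"
    if not stripped.isascii():  # crude stand-in for "probably not plain English"
        return "non_ascii"
    return "short"

def shares(texts):
    """Return the fraction of inputs that fall in each bucket."""
    counts = Counter(bucket(t) for t in texts)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}

def share_gap(train_texts, production_texts):
    """Training share minus production share, per bucket.

    A negative number means the bucket is under-represented in training.
    """
    train, prod = shares(train_texts), shares(production_texts)
    names = sorted(set(train) | set(prod))
    return {name: train.get(name, 0.0) - prod.get(name, 0.0) for name in names}
```

A bucket that exists in production but shows a share of zero in training is a sign the fine-tune will likely be brittle there.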

Keep the format rigidly consistent

The model is extremely sensitive to format, because format is one of the easiest patterns for it to pick up. Decide once how an example looks (the field names, the casing, how missing values appear, how the assistant turn starts), and then make every single example match. If one example uses null and another uses "none" for the same idea, the model learns that both are fine, and now your downstream parser has to handle both. Consistency in the data is consistency in the output.
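
A small validator run over every line makes this rule enforceable rather than aspirational. The following is a minimal sketch that assumes a JSONL file with one example per line; the field names "prompt" and "completion", the list of rejected spellings, and the function name format_problems are hypothetical, chosen only for illustration.

```python
import json

REQUIRED_KEYS = {"prompt", "completion"}       # hypothetical field names
MISSING_ALIASES = {"none", "n/a", "null", ""}  # string spellings to reject in favor of JSON null

def format_problems(line: str) -> list[str]:
    """Return a list of format problems for one JSONL line (empty list means clean)."""
    try:
        example = json.loads(line)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err.msg}"]
    if not isinstance(example, dict):
        return ["example is not a JSON object"]

    problems = []
    if set(example) != REQUIRED_KEYS:
        problems.append(f"keys {sorted(example)} != {sorted(REQUIRED_KEYS)}")
    for key, value in example.items():
        # A string such as "None" or "n/a" is a second convention for "missing".
        if isinstance(value, str) and value.strip().lower() in MISSING_ALIASES:
            problems.append(f"{key!r} uses {value!r} for a missing value; use null")
    return problems
```

Running this before every training job, and refusing to train while any line reports a problem, keeps a second convention from slipping in unnoticed.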

Hold out an eval set, before you train

Carve off a portion of your examples and never train on them. This held-out set is how you find out whether the fine-tune actually helped, and later whether a new version regressed. Without it you are flying blind, judging the model on the same examples it memorized. Build the eval set the same way you build any prompt eval: freeze the inputs, define a clear rubric, and score honestly. The methodology in prompt-eval-methodology applies directly.
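
One way to keep the held-out set stable as the dataset grows is to decide membership from a hash of each example rather than from a random shuffle. This is a sketch under assumptions: the "input" field, the 10% default fraction, and the function names is_held_out and split are hypothetical.

```python
import hashlib

def is_held_out(example_id: str, eval_fraction: float = 0.1) -> bool:
    """Deterministically map an identifier to the eval set.

    The same identifier always lands on the same side, across runs and machines.
    """
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    position = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return position < eval_fraction

def split(examples, key=lambda ex: ex["input"], eval_fraction=0.1):
    """Split examples into (train, held_out) by hashing key(example)."""
    train, held_out = [], []
    for ex in examples:
        target = held_out if is_held_out(key(ex), eval_fraction) else train
        target.append(ex)
    return train, held_out
```

Because the split depends only on the input text, duplicate inputs always land on the same side, and adding new examples later never moves an old one from eval into training.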

The failure that looks like success

The most damaging mistake is subtle. Suppose every example in your data gives a complete, confident answer, including the questions where the honest answer is "not enough information." The fine-tune will learn your format perfectly and also learn a habit you did not intend: always answer, never hesitate. On real inputs where the right move is to decline or flag uncertainty, the model now confidently makes something up, because that is the only behavior it ever saw.

The fix is to teach the behavior you actually want, uncertainty included. Put refusals, "unknown" answers, and partial answers into the data in the same proportion you want to see them in production. A dataset that only ever shows confident success teaches confident failure. The dataset is the product, and the product includes knowing when to say no.
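
A quick audit can show whether the data ever declines at all. The sketch below estimates the share of examples whose output abstains. It assumes a "completion" field and a fixed set of abstention phrasings, both hypothetical; a dataset with a human-assigned outcome label would be more reliable than prefix matching.

```python
# Assumed phrasings that open an abstaining answer; adjust to the dataset's own wording.
ABSTAIN_PREFIXES = (
    "i cannot",
    "the source does not say",
    "not enough information",
)

def abstains(completion: str) -> bool:
    """True if the completion opens with one of the assumed abstention phrasings."""
    return completion.strip().lower().startswith(ABSTAIN_PREFIXES)

def abstention_rate(examples) -> float:
    """Fraction of examples whose completion abstains (0.0 for an empty dataset)."""
    if not examples:
        return 0.0
    return sum(abstains(ex["completion"]) for ex in examples) / len(examples)
```

If this rate is zero while production traffic includes unanswerable questions, the dataset is teaching the always-answer habit described above.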

Frequently asked questions

How many examples do I need?

Fewer than people expect, if they are good. For a narrow task with a consistent format, a few hundred excellent examples often outperform thousands of inconsistent ones, because the model learns the dominant pattern in the data and noise dilutes that pattern. Start small and clean, measure on a held-out set, and only add more data once you have evidence that the current set is the limiting factor rather than your prompt or your base model.

Why does a noisy dataset hurt so much?

Because fine-tuning is imitation. The model does not judge your examples; it copies their patterns. If a tenth of your outputs are sloppy, inconsistent in format, or subtly wrong, you are teaching the model that sloppy, inconsistent, and subtly wrong are acceptable outputs. A small set you have actually read and corrected line by line is worth more than a large set you scraped and never checked.

What is the difference between teaching format and teaching hallucination?

If every example in your data answers the question fully, even the questions where the honest answer is "the source does not say," you are not just teaching format. You are teaching the model that the correct behavior is always to produce a confident answer. It learns the shape of your outputs and the habit of never admitting uncertainty. Include examples where the right answer is a refusal or an "unknown," or the fine-tune will train confident hallucination into the model.