In fine-tuning, the model architecture is fixed, the training recipe is mostly fixed, and the one thing you fully control is the data. That makes the dataset the product. A mediocre dataset produces a mediocre fine-tune no matter how good the base model is, because fine-tuning is imitation: the model learns to reproduce the patterns in your examples, faithfully, including the patterns you did not mean to teach.

Quality beats quantity, and it is not close

The instinct is to gather as much data as possible. The better instinct is to gather as little bad data as possible. A few hundred examples that are correct, consistent, and representative will usually beat several thousand that are noisy, because the model imitates whatever patterns recur in the set, the mistakes included. Noise does not average out into nothing; it teaches the model that noise is acceptable output.

The practical implication: read your data. Every example you keep should be one you would be happy to ship as a real output. If you would not, fix it or cut it. A dataset you have actually inspected line by line is worth more than a scraped pile you never opened.

Cover the real distribution, then the edges

A good dataset looks like the traffic the model will actually see, plus the hard cases that matter.

  • Mirror the distribution. If most real inputs are short and a fraction are long or multilingual, your examples should reflect those proportions. Train only on tidy inputs and the model will be brittle on the messy ones that show up in production.
  • Include the edge cases: empty inputs, malformed inputs, and inputs where the correct answer is "I cannot do that" or "the source does not say." These are where fine-tunes quietly fail, and the only way to get them right is to put them in the data.
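
One way to check the first point is to sort inputs into coarse buckets and compare the shares in the training set against a sample of real traffic. The sketch below is a minimal illustration, not a standard tool: the bucket names, the 200-word cutoff for "long", the ASCII test standing in for "multilingual", and the 5-point tolerance are all assumptions to replace with whatever matters for your task.

```python
from collections import Counter


def bucket(text: str) -> str:
    """Assign an input to a coarse bucket (hypothetical categories)."""
    stripped = text.strip()
    if not stripped:
        return "empty"
    if not stripped.isascii():
        return "non_ascii"  # crude stand-in for "multilingual"
    if len(stripped.split()) > 200:  # assumed cutoff for "long"
        return "long"
    return "short"


def bucket_shares(inputs):
    """Return the fraction of inputs that fall into each bucket."""
    counts = Counter(bucket(text) for text in inputs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}


def coverage_gaps(train_inputs, production_inputs, tolerance=0.05):
    """Return buckets whose training share differs from the production share
    by more than `tolerance`. Positive values mean the bucket is
    over-represented in training, negative values under-represented."""
    train = bucket_shares(train_inputs)
    prod = bucket_shares(production_inputs)
    gaps = {}
    for name in sorted(set(train) | set(prod)):
        diff = train.get(name, 0.0) - prod.get(name, 0.0)
        if abs(diff) > tolerance:
            gaps[name] = round(diff, 3)
    return gaps
```

A negative entry for a bucket such as "empty" is the signal to go and write examples for it, not to resample the ones you already have.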

Keep the format rigidly consistent

The model is extremely sensitive to format, because format is one of the easiest patterns for it to pick up. Decide once how an example looks: the field names, the casing, how missing values appear, how the assistant turn starts. Then make every single example match. One example that uses null and another that uses "none" for the same idea teach the model that both are fine, and now your downstream parser has to handle both. Consistency in the data is consistency in the output.
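
A check like this is easy to automate. The sketch below assumes a hypothetical layout, one JSON object per line with exactly the fields "input" and "output" in that order, and a rule that a missing output is always JSON null. Both the schema and the list of stray spellings are illustrative assumptions, not a required format.

```python
import json

# Hypothetical schema: every example has exactly these fields, in this order.
REQUIRED_FIELDS = ("input", "output")
# Stray ways of writing "no value" that should have been JSON null (assumed rule).
STRAY_MISSING = {"none", "null", "n/a"}


def format_problems(line: str):
    """Return a list of format problems for one JSONL line (empty if clean)."""
    try:
        example = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(example, dict):
        return ["top level is not an object"]

    problems = []
    # Exact comparison catches missing fields, extra fields, wrong casing,
    # and wrong order in one check.
    if tuple(example) != REQUIRED_FIELDS:
        problems.append(
            f"fields are {list(example)}, expected {list(REQUIRED_FIELDS)}"
        )
    output = example.get("output")
    if isinstance(output, str) and output.strip().lower() in STRAY_MISSING:
        problems.append(f"output spells a missing value as {output!r}; use null")
    return problems
```

Run it over every line before training and treat any non-empty result as a reason to fix the example rather than to loosen the check.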

Hold out an eval set, before you train

Carve off a portion of your examples and never train on them. This held-out set is how you find out whether the fine-tune actually helped, and later whether a new version regressed. Without it you are flying blind, judging the model on the same examples it memorized. Build the eval set the same way you build any prompt eval: freeze the inputs, define a clear rubric, and score honestly. The methodology in prompt-eval-methodology applies directly.
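
A minimal sketch of one way to make the split stick: hash each example's input and let the hash decide which side it lands on. The split is then stable across reruns, and duplicate inputs cannot end up on both sides. The "input" key and the 10% fraction are assumptions for illustration.

```python
import hashlib


def is_held_out(text: str, eval_fraction: float = 0.1) -> bool:
    """Decide from a hash of the text whether an example belongs to the eval set."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Read the first 8 bytes as an integer in [0, 2**64) and compare it
    # against the matching fraction of that range.
    value = int.from_bytes(digest[:8], "big")
    return value < eval_fraction * 2**64


def split(examples, key="input", eval_fraction=0.1):
    """Split examples into (train, held_out) without any randomness."""
    train, held_out = [], []
    for example in examples:
        if is_held_out(example[key], eval_fraction):
            held_out.append(example)
        else:
            train.append(example)
    return train, held_out
```

Because membership depends only on the input text, adding new examples later never moves an old example from the eval set into training.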

The failure that looks like success

The most damaging mistake is subtle. Suppose every example in your data gives a complete, confident answer, even to the questions where the honest answer is "not enough information." The fine-tune will learn your format perfectly and also learn a habit you did not intend: always answer, never hesitate. On real inputs where the right move is to decline or flag uncertainty, the model now confidently makes something up, because that is the only behavior it ever saw.

The fix is to teach the behavior you actually want, uncertainty included. Put refusals, "unknown" answers, and partial answers into the data in the same proportion you want to see them in production. A dataset that only ever shows confident success teaches confident failure. The dataset is the product, and the product includes knowing when to say no.
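
As a rough sketch, the mix can be checked mechanically once each example carries a label for the kind of response it shows. Everything specific here is an assumption: the "behavior" field, the label names, and the target shares in the usage note are placeholders for numbers you would set from your own production traffic.

```python
from collections import Counter


def behavior_mix(examples, label_key="behavior"):
    """Return the share of examples carrying each behavior label."""
    counts = Counter(example[label_key] for example in examples)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


def mix_shortfalls(examples, target_mix, label_key="behavior"):
    """Return {label: (actual_share, target_share)} for every behavior
    that appears less often in the data than the target asks for."""
    actual = behavior_mix(examples, label_key)
    shortfalls = {}
    for label, wanted in target_mix.items():
        have = actual.get(label, 0.0)
        if have < wanted:
            shortfalls[label] = (have, wanted)
    return shortfalls
```

For example, calling mix_shortfalls(examples, {"answer": 0.8, "decline": 0.15, "partial": 0.05}) on a set with 18 "answer" examples and 2 "decline" examples reports both "decline" and "partial" as under-represented, which is the cue to write more of those before training.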