Full fine-tuning updates every weight in a model. For a large model that means holding billions of parameters plus their gradients and optimizer state in memory, which is why it lives on clusters and not on the machine under your desk. LoRA and QLoRA are the techniques that brought fine-tuning down to hardware normal people own, and the trick in both is the same: leave the base model alone and train something small on top.
LoRA: train a small adapter, freeze the base
LoRA stands for low-rank adaptation. The idea is that you do not need to change the original weights to teach a narrow skill; you can leave them frozen and learn a small correction that gets added alongside them.
Concretely, for each layer you adapt, LoRA learns two small matrices. Their product has the same shape as the layer's weight matrix, so it can stand in for a full weight update, but the two matrices share a small inner dimension (the rank). For a layer whose weight matrix is d_out by d_in and a rank of r, the adapter holds r * (d_in + d_out) parameters instead of d_out * d_in, which is a tiny fraction when r is small. During training, only those small matrices learn. During inference, their product is added to the frozen weights, nudging the model toward the new behavior.
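To make the shapes concrete, here is a minimal sketch in plain Python. The names `lora_param_counts`, `lora_forward`, `A`, `B`, and `scale` are illustrative and not from any particular library. `scale` stands in for the scaling factor that common implementations apply to the adapter output, and the 4096 by 4096 layer with rank 8 is just an example size.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (parameters in a full update, parameters in a LoRA adapter)."""
    full = d_out * d_in            # one value per weight in the layer
    lora = rank * (d_in + d_out)   # A is rank x d_in, B is d_out x rank
    return full, lora


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, scale=1.0):
    """Frozen base output plus the low-rank correction: W x + scale * B (A x)."""
    base = matvec(W, x)               # W is frozen and never updated
    delta = matvec(B, matvec(A, x))   # project down to the rank, then back up
    return [b + scale * d for b, d in zip(base, delta)]


# Example size (an assumption for illustration): a 4096 x 4096 layer, rank 8.
full, lora = lora_param_counts(4096, 4096, 8)
print(f"full update: {full:,} params, adapter: {lora:,} params "
      f"({lora / full:.2%} of full)")

# Tiny numeric check with a 2 x 2 base layer and a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]                 # rank x d_in
B_zero = [[0.0], [0.0]]          # d_out x rank, all zeros: adapter is a no-op
B_trained = [[1.0], [0.0]]       # pretend training moved B away from zero
x = [2.0, 3.0]

untrained_out = lora_forward(W, A, B_zero, x)
trained_out = lora_forward(W, A, B_trained, x)
```

With `B` at zero the output matches the base layer exactly, which is why adapters are typically initialised that way: training starts from the unmodified model and only drifts as `A` and `B` learn.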
The practical consequences are large:
- Far less memory. You are not storing gradients and optimizer state for the whole model, only for the small adapters.
- Small artifacts. A trained LoRA adapter is often a few megabytes to a few hundred megabytes, not the size of the whole model. You can keep many adapters for one base and swap them.
- Reversible. The base is untouched, so you can detach the adapter and you are back to the original model.
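The swap-and-detach behavior in the last two points can be sketched as a toy layer that holds one frozen base and several named adapters. `AdaptedLayer`, `add_adapter`, `use`, and the adapter names `"style"` and `"sql"` are all hypothetical and chosen only for illustration; real libraries expose their own interfaces for this.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class AdaptedLayer:
    """Toy layer: one frozen base matrix, any number of swappable adapters."""

    def __init__(self, W):
        self.W = W            # frozen base weights, never modified
        self.adapters = {}    # adapter name -> (A, B)
        self.active = None    # None means "run the plain base model"

    def add_adapter(self, name, A, B):
        self.adapters[name] = (A, B)

    def use(self, name):
        """Activate an adapter by name, or pass None to detach."""
        if name is not None and name not in self.adapters:
            raise KeyError(name)
        self.active = name

    def forward(self, x):
        y = matvec(self.W, x)
        if self.active is None:
            return y
        A, B = self.adapters[self.active]
        delta = matvec(B, matvec(A, x))
        return [yi + di for yi, di in zip(y, delta)]


layer = AdaptedLayer([[1.0, 0.0], [0.0, 1.0]])
layer.add_adapter("style", A=[[1.0, 1.0]], B=[[0.5], [0.0]])
layer.add_adapter("sql", A=[[1.0, -1.0]], B=[[0.0], [2.0]])

x = [2.0, 3.0]
base_out = layer.forward(x)    # no adapter attached

layer.use("style")
style_out = layer.forward(x)   # base plus the "style" correction

layer.use("sql")
sql_out = layer.forward(x)     # same base, different adapter

layer.use(None)                # detach: back to the original model
detached_out = layer.forward(x)
```

Because the base matrix is never written to, detaching is just a matter of skipping the correction, and the output after detaching is identical to the output before any adapter was attached.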
The cost is capacity. A low-rank adapter can only express a limited correction. For a narrow task that is exactly what you want; for a sweeping behavior change it can be too small a lever.
QLoRA: quantize the base, then adapt
LoRA shrinks the trainable part. QLoRA shrinks the frozen part too. It loads the base model in 4-bit precision, then trains LoRA adapters on top of that quantized base. Compared with 16-bit weights, 4-bit quantization cuts the memory needed just to hold the weights to roughly a quarter (see quantization-explained-for-llms for how that works), and because only the small adapters are training, the optimizer overhead stays small as well.
Put those two savings together and the headline result follows: you can fine-tune a large model on a single consumer GPU instead of a rack of them. The technique was introduced in the QLoRA paper (arxiv.org/abs/2305.14314), which is the reference to cite for the specifics. The adapters themselves are trained in higher precision even though the base is 4-bit, which is what keeps quality competitive with full-precision LoRA on many tasks.
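A rough back-of-the-envelope calculation shows how the two savings stack. Everything numeric here is an assumption for illustration: a hypothetical 7-billion-parameter base, a hypothetical 20 million adapter parameters, and 12 bytes of training state per trainable parameter (a 32-bit gradient plus two 32-bit optimizer moments, as an Adam-style optimizer would keep). The estimate ignores activations, quantization metadata, and framework overhead, so real usage will be higher.

```python
def weights_gb(n_params: int, bits_per_weight: int) -> float:
    """Memory to hold the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def training_state_gb(n_trainable: int, bytes_per_param: int = 12) -> float:
    """Gradient plus optimizer state for the trainable parameters only.

    bytes_per_param=12 is an assumption: 4 bytes of gradient and
    two 4-byte optimizer moments per trainable parameter.
    """
    return n_trainable * bytes_per_param / 1e9


BASE_PARAMS = 7_000_000_000    # hypothetical base model size
ADAPTER_PARAMS = 20_000_000    # hypothetical total adapter size

estimates = {
    "full fine-tuning": weights_gb(BASE_PARAMS, 16)
                        + training_state_gb(BASE_PARAMS),
    "LoRA (16-bit base)": weights_gb(BASE_PARAMS, 16)
                          + training_state_gb(ADAPTER_PARAMS),
    "QLoRA (4-bit base)": weights_gb(BASE_PARAMS, 4)
                          + training_state_gb(ADAPTER_PARAMS),
}

for name, gb in estimates.items():
    print(f"{name:>20}: about {gb:.1f} GB before activations")
```

Under these assumptions the full fine-tune is dominated by training state, LoRA removes almost all of that, and QLoRA then shrinks the remaining weight memory to a quarter, which is the step that moves the job from a multi-GPU setup toward a single card.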
When full fine-tuning still wins
LoRA and QLoRA are adapter methods, not a replacement for full fine-tuning. Full fine-tuning still has the edge when:
- The change is deep, not narrow. Adapting the model across many tasks at once, or shifting its baseline behavior substantially, can exceed what a low-rank adapter can hold.
- The domain is far from the base. If your data is very unlike anything the base saw, a small correction may not be enough and you need the capacity of updating every weight.
- You have the hardware anyway. If compute is not your constraint, full fine-tuning removes the rank as a thing to tune.
For the common case, a single narrow skill or a consistent house style, start with LoRA or QLoRA. They get close enough that the extra cost of a full fine-tune rarely pays for itself.