LoRA and QLoRA Explained

Article summary

LoRA (low-rank adaptation) freezes the base model and trains a pair of small adapter matrices that nudge its behavior, so you update a tiny fraction of the parameters instead of all of them. QLoRA goes further by loading the frozen base in 4-bit precision and training LoRA adapters on top, which is what lets people fine-tune large models on a single consumer GPU. Full fine-tuning still wins when you need a deep behavior change rather than a narrow nudge, and when you have the hardware to do it.

Full fine-tuning updates every weight in a model. For a large model that means holding billions of parameters plus their gradients and optimizer state in memory, which is why it lives on clusters and not on the machine under your desk. LoRA and QLoRA are the techniques that brought fine-tuning down to hardware normal people own, and the trick in both is the same: leave the base model alone and train something small on top.

LoRA: train a small adapter, freeze the base

LoRA stands for low-rank adaptation. The idea is that you do not need to change the original weights to teach a narrow skill; you can leave them frozen and learn a small correction that gets added alongside them.

Concretely, for the layers you adapt, LoRA learns two small matrices. Their product is the same shape as the layer's weight update, but because the two matrices share a small inner dimension (the rank), together they hold a tiny fraction of the parameters a full update would. During training, only those small matrices learn. During inference, their product is added to the frozen weights, nudging the model toward the new behavior.
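
The mechanics above can be sketched in a few lines of NumPy. This is a minimal illustration, not any library's implementation: the layer sizes, the rank, and the names W, A, B, and alpha are assumptions chosen for the example. It follows the common convention of starting B at zero, so the adapter is a no-op before training, and scaling the correction by alpha / rank.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, rank, alpha = 64, 48, 4, 8          # hypothetical sizes for illustration

W = rng.normal(size=(d_out, d_in))               # frozen base weight, never updated
A = rng.normal(scale=0.01, size=(rank, d_in))    # trainable, small random init
B = np.zeros((d_out, rank))                      # trainable, zero init: adapter starts as a no-op
scale = alpha / rank                             # conventional LoRA scaling factor


def lora_forward(x, W, A, B, scale):
    """Base output plus the low-rank correction, without forming B @ A."""
    return x @ W.T + scale * (x @ A.T) @ B.T


x = rng.normal(size=(3, d_in))                   # batch of 3 inputs

# With B at zero, the adapted layer matches the frozen layer.
y_start = lora_forward(x, W, A, B, scale)

# Stand-in for training: pretend the optimizer moved B away from zero.
B_trained = rng.normal(scale=0.1, size=(d_out, rank))
y_adapted = lora_forward(x, W, A, B_trained, scale)

# Adding the scaled product to the frozen weights gives the same output.
W_merged = W + scale * (B_trained @ A)
y_merged = x @ W_merged.T
```

Here B_trained @ A has the same shape as W, which is why the correction can either be applied alongside the frozen layer or folded into a merged copy of the weights.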

The practical consequences are large:

Far less memory. You are not storing gradients and optimizer state for the whole model, only for the small adapters.
Small artifacts. A trained LoRA adapter is often a few megabytes to a few hundred megabytes, not the size of the whole model. You can keep many adapters for one base and swap them.
Reversible. The base is untouched, so you can detach the adapter and you are back to the original model.
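
The swap-and-detach behavior in the list above can be sketched as follows. The adapter names, the sizes, and the make_adapter helper are hypothetical, the "trained" adapters are just random values, and the alpha / rank scaling is left out to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, rank = 32, 32, 2                # hypothetical sizes

W = rng.normal(size=(d_out, d_in))           # shared frozen base weight
W_snapshot = W.copy()                        # kept only to check the base never changes


def make_adapter(seed):
    """Stand-in for a trained adapter: a (B, A) pair filled with random values."""
    r = np.random.default_rng(seed)
    B = r.normal(scale=0.1, size=(d_out, rank))
    A = r.normal(scale=0.1, size=(rank, d_in))
    return B, A


# Hypothetical adapter names; each pair is far smaller than W.
adapters = {"support_tone": make_adapter(10), "sql_helper": make_adapter(11)}


def forward(x, adapter=None):
    """Run the base layer, optionally with one named adapter attached."""
    y = x @ W.T
    if adapter is not None:
        B, A = adapters[adapter]
        y = y + (x @ A.T) @ B.T
    return y


x = rng.normal(size=(1, d_in))
y_base = forward(x)
y_support = forward(x, "support_tone")
y_sql = forward(x, "sql_helper")
y_detached = forward(x)                      # detaching means no longer adding the correction

adapter_params = sum(B.size + A.size for B, A in adapters.values())
```

Both adapters together hold 256 values against 1,024 in W, and W itself is never written to, so detaching an adapter reproduces the base output.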

The cost is capacity. A low-rank adapter can only express a limited correction. For a narrow task that is exactly what you want; for a sweeping behavior change it can be too small a lever.

QLoRA: quantize the base, then adapt

LoRA shrinks the trainable part. QLoRA shrinks the frozen part too. It loads the base model in 4-bit precision, then trains LoRA adapters on top of that quantized base. Compared with 16-bit weights, 4-bit quantization roughly quarters the memory needed just to hold the model (see quantization-explained-for-llms for how that works), and because only the small adapters are training, the optimizer overhead stays small as well.

Put those two savings together and the headline result follows: you can fine-tune a large model on a single consumer GPU instead of a rack of them. The technique was introduced in the QLoRA paper (arxiv.org/abs/2305.14314), which is the reference to cite for the specifics. The adapters themselves are trained in higher precision even though the base is 4-bit, which is what keeps quality competitive with full-precision LoRA on many tasks.
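
To make the split between a low-precision base and higher-precision adapters concrete, here is a toy sketch. It swaps in a simple blockwise absmax quantizer as a stand-in; the QLoRA paper uses its own 4-bit data type (NF4), so this is not the paper's scheme. The sizes, the block length, and all function names are assumptions for illustration, and the codes are stored one per int8 for readability where a real implementation would pack two per byte.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out, rank, block = 64, 64, 4, 32     # hypothetical sizes

W = rng.normal(scale=0.05, size=(d_out, d_in)).astype(np.float32)


def quantize_4bit(w, block):
    """Blockwise absmax quantization to integer codes in [-7, 7], which fit in 4 bits."""
    flat = w.reshape(-1, block)
    absmax = np.abs(flat).max(axis=1, keepdims=True)
    absmax[absmax == 0] = 1.0                # avoid dividing by zero for all-zero blocks
    codes = np.round(flat / absmax * 7).astype(np.int8)
    return codes, absmax.astype(np.float32)


def dequantize_4bit(codes, absmax, shape):
    """Map the integer codes back to approximate float32 weights."""
    return (codes.astype(np.float32) / 7 * absmax).reshape(shape)


codes, absmax = quantize_4bit(W, block)      # frozen base, stored in low precision

# The adapter matrices stay in float32 and are the only trainable values.
A = rng.normal(scale=0.01, size=(rank, d_in)).astype(np.float32)
B = np.zeros((d_out, rank), dtype=np.float32)


def qlora_forward(x):
    """Dequantize the base for the matmul, then add the adapter correction."""
    W_deq = dequantize_4bit(codes, absmax, W.shape)
    return x @ W_deq.T + (x @ A.T) @ B.T


x = rng.normal(size=(2, d_in)).astype(np.float32)
y = qlora_forward(x)
max_error = float(np.abs(dequantize_4bit(codes, absmax, W.shape) - W).max())
```

The quantized base introduces a small, bounded error (max_error), while gradients would only ever flow into A and B, which keep full float32 resolution.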

When full fine-tuning still wins

LoRA and QLoRA are adapters, not replacements. Full fine-tuning still has the edge when:

The change is deep, not narrow. Adapting the model across many tasks at once, or shifting its baseline behavior substantially, can exceed what a low-rank adapter can hold.
The domain is far from the base. If your data is very unlike anything the base saw, a small correction may not be enough and you need the capacity of updating every weight.
You have the hardware anyway. If compute is not your constraint, full fine-tuning removes the rank as a thing to tune.

For the common case, a single narrow skill or a consistent house style, start with LoRA or QLoRA. They get close enough that the extra cost of a full fine-tune rarely pays for itself.

Frequently asked questions

What does "low rank" actually mean here?

The full weight update for a layer is a large matrix. LoRA approximates that update as the product of two much smaller matrices, whose shared inner dimension is the "rank." A low rank (often 8, 16, or 32) means very few new parameters, which is why training is cheap and the resulting adapter file is small. The bet is that the useful adaptation for a narrow task lives in a low-dimensional subspace, so you do not need the full-rank update to capture it.
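
As a rough worked example, the parameter counts for the ranks mentioned above can be computed directly. The 4096 x 4096 layer size is a hypothetical choice, and lora_param_count is a helper name introduced here.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters in the two LoRA matrices: (rank x d_in) plus (d_out x rank)."""
    return rank * (d_in + d_out)


d_in = d_out = 4096                  # hypothetical layer size
full_update = d_in * d_out           # parameters in a full-rank update

rows = []
for rank in (8, 16, 32):
    n = lora_param_count(d_in, d_out, rank)
    rows.append((rank, n, n / full_update))
    print(f"rank {rank:>2}: {n:>7,} params, {n / full_update:.2%} of the full update")
```

For this layer the full update has 16,777,216 parameters, while rank 8 needs 65,536 (about 0.39%), rank 16 needs 131,072 (about 0.78%), and rank 32 needs 262,144 (about 1.56%).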

Why does QLoRA fit on one GPU when full fine-tuning does not?

Two reasons. First, the base model is loaded in 4-bit precision, which roughly quarters the memory needed just to hold the weights compared with 16-bit storage. Second, because the base is frozen and only the small LoRA adapters are trained, you avoid storing optimizer state and gradients for the billions of base parameters, which is usually the part that blows the memory budget. The QLoRA paper (arxiv.org/abs/2305.14314) is the canonical reference for the technique.
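
A back-of-the-envelope budget shows how the two effects combine. Everything here is an assumption for illustration: a 7B-parameter base, 40M adapter parameters, 32-bit weights and gradients for the full fine-tune, and an Adam-style optimizer that keeps two 32-bit values per trained parameter. Activations, quantization constants, and framework overhead are left out, so real numbers will be higher.

```python
GB = 1e9


def full_finetune_bytes(n_params, weight_bytes=4, grad_bytes=4, optimizer_bytes=8):
    """Weights + gradients + Adam-style state (two 32-bit values per parameter)."""
    return n_params * (weight_bytes + grad_bytes + optimizer_bytes)


def qlora_bytes(n_base, n_adapter, base_bytes=0.5, adapter_bytes=16):
    """4-bit frozen base (half a byte per weight) plus fully trained adapters."""
    return n_base * base_bytes + n_adapter * adapter_bytes


n_base = 7_000_000_000               # hypothetical 7B-parameter model
n_adapter = 40_000_000               # hypothetical adapter size

full_gb = full_finetune_bytes(n_base) / GB
qlora_gb = qlora_bytes(n_base, n_adapter) / GB
print(f"full fine-tune: ~{full_gb:.0f} GB, QLoRA: ~{qlora_gb:.1f} GB")
```

Under these assumptions the full fine-tune needs about 112 GB before activations, while the QLoRA setup needs about 4.1 GB, most of it the quantized base itself.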

When should I do a full fine-tune instead?

When the change you need is deep rather than narrow: a substantial shift in the model's behavior across many tasks, or a domain so far from the base distribution that a small adapter cannot cover it. Full fine-tuning updates every weight, so it has more capacity to absorb a large change, at the cost of far more compute and memory. For most narrow, single-task adaptations, LoRA or QLoRA gets close enough that the extra cost is not worth it.