Quantization is the trick that makes useful local LLMs possible. The same 70B-parameter model that takes 140GB of VRAM at full precision fits in roughly 40GB at Q4_K_M, which means it runs on a pair of 24GB consumer GPUs instead of needing a small datacenter.
What quantization actually does
Modern LLMs store each weight as a 16-bit floating-point number (BF16 or FP16), two bytes per weight, so a 70B model is 140GB of weights.
Quantization replaces those 16-bit floats with smaller integer representations, plus a small amount of bookkeeping data (a scale factor, sometimes a zero-point) per group of weights. At 4 bits per weight, the same 70B model becomes about 35GB of packed weights plus roughly 10–20% of bookkeeping overhead, which is why a nominally 4-bit quant lands near the 40GB figure above rather than at an exact quarter of the original size.
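To make the arithmetic concrete, here is a back-of-the-envelope sketch in Python. The group size of 32 and the single FP16 scale per group are illustrative assumptions, not the exact layout of Q4_K_M or any other specific format:

```python
params = 70e9                                  # 70B parameters

fp16_gb = params * 2 / 1e9                     # 2 bytes per weight
q4_gb = params * 0.5 / 1e9                     # 4 bits = 0.5 bytes per weight
group_size = 32                                # weights sharing one scale (illustrative)
overhead_gb = (params / group_size) * 2 / 1e9  # one FP16 scale per group

print(f"FP16 weights : {fp16_gb:.0f} GB")             # ~140 GB
print(f"4-bit weights: {q4_gb + overhead_gb:.0f} GB")  # ~39 GB including overhead
```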
The cost is precision. Each weight is now one of just 16 discrete levels within its group's range (instead of any of the ~65,000 values an FP16 can represent). For the vast majority of weights this is fine; neural networks are surprisingly robust to rounding. For the small minority of weights doing critical work, modern quantization schemes use calibration data to identify them and keep them at higher precision.
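A minimal sketch of the mechanism and its cost: symmetric 4-bit group quantization applied to one group of random weights, then dequantized to measure how much error the round trip introduces. The group size, scale choice, and error metric are illustrative, not how any particular runtime implements its quants:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=32).astype(np.float32)  # one group of weights

# Symmetric 4-bit quantization: one scale per group, every weight rounded
# to an integer in [-8, 7], i.e. one of 16 discrete levels.
scale = np.abs(weights).max() / 7.0
q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)

# Dequantize: this approximation is what the model actually computes with.
restored = q.astype(np.float32) * scale

rel_error = np.abs(restored - weights).mean() / np.abs(weights).mean()
print(f"mean relative rounding error: {rel_error:.1%}")
```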
The quality cliff
Quality degrades gracefully from FP16 down to about 4 bits; below that, it falls off quickly:
- 8-bit: indistinguishable from full precision on essentially every task.
- 6-bit: indistinguishable for almost all use cases.
- 5-bit: detectable drop only on the hardest reasoning and code tasks.
- 4-bit: noticeable drop on hard reasoning, fine for chat, writing, summarization.
- 3-bit: real drop. Math gets worse, code gets sloppier, instruction following weakens.
- 2-bit: often broken outside narrow benchmarks. Avoid unless desperate.
The modern default — Q4_K_M in GGUF — sits right at the corner of the cliff. Below it you give up real quality; above it you mostly pay for memory you do not need to spend.
How to pick a quant level
Two questions to answer:
- What is your VRAM budget? Subtract what the OS, the context-window allocation, and any other GPU users take; then compare what is left against the model's size at each quant level and see which one gets you under the line (a rough sketch of this arithmetic follows the list).
- How hard is your task? Chat and writing tolerate Q4 easily. Code generation, math, structured reasoning, and long-context retrieval benefit measurably from Q5–Q6. Pick the quant that fits, then bump up one notch if your task is in the hard column and you have headroom.
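A back-of-the-envelope picker that encodes the VRAM side of this decision. The bits-per-weight figures and the reserved-VRAM allowance are rough assumptions you should replace with numbers for your own models and context lengths:

```python
# Approximate bits-per-weight for some common GGUF quants. These figures and
# the reserved-VRAM allowance below are rough assumptions, not measured values.
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q3_K_M": 3.9}

def pick_quant(params_b: float, vram_gb: float, reserved_gb: float = 4.0) -> str | None:
    """Return the highest-precision listed quant whose weights fit the budget.

    params_b    -- model size in billions of parameters
    vram_gb     -- total VRAM across GPUs
    reserved_gb -- rough allowance for the OS, context/KV cache, other GPU users
    """
    budget_gb = vram_gb - reserved_gb
    for name, bpw in sorted(BITS_PER_WEIGHT.items(), key=lambda kv: -kv[1]):
        weights_gb = params_b * bpw / 8  # billions of params times bytes per weight
        if weights_gb <= budget_gb:
            return name
    return None

print(pick_quant(params_b=70, vram_gb=48))  # e.g. two 24GB cards -> likely "Q4_K_M"
```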
The most common upgrade path: start at Q4_K_M, run a battery of your real tasks, and step up to Q5_K_M or Q6_K if anything fails in a way that smells like precision rather than capability.
What quantization cannot fix
A quantized model is still the same model. Quantization cannot make a 7B model think like a 70B model; it can only make a 70B model fit. If your task needs a bigger model, you need a bigger model, not a better quant.