Quantization is the trick that makes useful local LLMs possible. The same 70B-parameter model that takes 140GB of VRAM at full precision fits in roughly 40GB at Q4_K_M, which means it runs on a pair of 24GB consumer GPUs instead of needing a small datacenter.
What quantization actually does
Modern LLMs store each weight as a 16-bit floating-point number (BF16 or FP16), two bytes per weight, so a 70B model is 140GB of weights.
Quantization replaces those 16-bit floats with smaller integer representations, plus a small amount of bookkeeping data (a scale factor, sometimes a zero-point) per group of weights. At 4 bits per weight, the same 70B model becomes about 35GB of packed weights plus roughly 10–20% of bookkeeping overhead, which is why a nominally 4-bit quant lands near the 40GB figure above rather than at an exact quarter of the original size.
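To make the arithmetic concrete, here is a back-of-the-envelope sketch in Python. The group size of 32 and the single FP16 scale per group are illustrative assumptions, not the exact layout of Q4_K_M or any other specific format:

```python
params = 70e9                                  # 70B parameters

fp16_gb = params * 2 / 1e9                     # 2 bytes per weight
q4_gb = params * 0.5 / 1e9                     # 4 bits = 0.5 bytes per weight
group_size = 32                                # weights sharing one scale (illustrative)
overhead_gb = (params / group_size) * 2 / 1e9  # one FP16 scale per group

print(f"FP16 weights : {fp16_gb:.0f} GB")             # ~140 GB
print(f"4-bit weights: {q4_gb + overhead_gb:.0f} GB")  # ~39 GB including overhead
```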
The cost is precision. Each weight is now one of just 16 discrete levels within its group's range (instead of any of the ~65,000 values an FP16 can represent). For the vast majority of weights this is fine; neural networks are surprisingly robust to rounding. For the small minority of weights doing critical work, modern quantization schemes use calibration data to identify them and keep them at higher precision.
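A minimal sketch of the mechanism and its cost: symmetric 4-bit group quantization applied to one group of random weights, then dequantized to measure how much error the round trip introduces. The group size, scale choice, and error metric are illustrative, not how any particular runtime implements its quants:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=32).astype(np.float32)  # one group of weights

# Symmetric 4-bit quantization: one scale per group, every weight rounded
# to an integer in [-8, 7], i.e. one of 16 discrete levels.
scale = np.abs(weights).max() / 7.0
q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)

# Dequantize: this approximation is what the model actually computes with.
restored = q.astype(np.float32) * scale

rel_error = np.abs(restored - weights).mean() / np.abs(weights).mean()
print(f"mean relative rounding error: {rel_error:.1%}")
```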
The quality cliff
Quality degrades gracefully from FP16 down to about 4 bits; below that, it falls off quickly:
- 8-bit: indistinguishable from full precision on essentially every task.
- 6-bit: indistinguishable for almost all use cases.
- 5-bit: detectable drop only on the hardest reasoning and code tasks.
- 4-bit: noticeable drop on hard reasoning, fine for chat, writing, summarization.
- 3-bit: real drop. Math gets worse, code gets sloppier, instruction following weakens.
- 2-bit: often broken outside narrow benchmarks. Avoid unless desperate.
The modern default — Q4_K_M in GGUF — sits right at the corner of the cliff. Below it you give up real quality; above it you mostly pay for memory you do not need to spend.
How to pick a quant level
Two questions to answer:
- What is your VRAM budget? Subtract what the OS, the context-window allocation, and any other GPU users take; then compare what is left against the model's size at each quant level and see which one gets you under the line (a rough sketch of this arithmetic follows the list).
- How hard is your task? Chat and writing tolerate Q4 easily. Code generation, math, structured reasoning, and long-context retrieval benefit measurably from Q5–Q6. Pick the quant that fits, then bump up one notch if your task is in the hard column and you have headroom.
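A back-of-the-envelope picker that encodes the VRAM side of this decision. The bits-per-weight figures and the reserved-VRAM allowance are rough assumptions you should replace with numbers for your own models and context lengths:

```python
# Approximate bits-per-weight for some common GGUF quants. These figures and
# the reserved-VRAM allowance below are rough assumptions, not measured values.
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q3_K_M": 3.9}

def pick_quant(params_b: float, vram_gb: float, reserved_gb: float = 4.0) -> str | None:
    """Return the highest-precision listed quant whose weights fit the budget.

    params_b    -- model size in billions of parameters
    vram_gb     -- total VRAM across GPUs
    reserved_gb -- rough allowance for the OS, context/KV cache, other GPU users
    """
    budget_gb = vram_gb - reserved_gb
    for name, bpw in sorted(BITS_PER_WEIGHT.items(), key=lambda kv: -kv[1]):
        weights_gb = params_b * bpw / 8  # billions of params times bytes per weight
        if weights_gb <= budget_gb:
            return name
    return None

print(pick_quant(params_b=70, vram_gb=48))  # e.g. two 24GB cards -> likely "Q4_K_M"
```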
The most common upgrade path: start at Q4_K_M, run a battery of your real tasks, and step up to Q5_K_M or Q6_K if anything fails in a way that smells like precision rather than capability.
What quantization cannot fix
A quantized model is still the same model. Quantization cannot make a 7B model think like a 70B model; it can only make a 70B model fit. If your task needs a bigger model, you need a bigger model, not a better quant.