GGUF is the dominant file format for llama.cpp-family local LLM runtimes — llama.cpp itself, Ollama, LM Studio, and text-generation-webui's llama.cpp backend. Other runtimes (vLLM, Transformers, TensorRT-LLM) can use other formats; GGUF is not a universal standard. Knowing how GGUF names its quantization variants makes the "which file should I download?" question decidable instead of intimidating.
What GGUF actually is
A GGUF file is a single binary container with three parts:
- Metadata: the model name, architecture, vocab size, context window, default sampling parameters, the chat template, plus tokenizer data. Everything the runtime needs to set up inference is in here.
- Tensor descriptors: a table of every weight tensor (name, shape, quantization type, byte offset).
- Tensor data: the quantized weights themselves, packed end-to-end.
GGUF replaced llama.cpp's earlier GGML-era file formats (GGML, GGMF, GGJT), with a design goal of being extensible: new fields and new quantization variants can be added without breaking existing files.
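To make the container concrete, here is a minimal sketch that reads just the fixed-size GGUF header using only the Python standard library. The field layout (4-byte GGUF magic, uint32 version, uint64 tensor count, uint64 metadata key-value count, all little-endian) follows the published GGUF spec; model.gguf is a placeholder path. For anything beyond the header, the gguf Python package maintained in the llama.cpp repo ships a full reader.

```python
import struct

def read_gguf_header(path: str) -> dict:
    """Read the fixed GGUF header: magic, version, tensor count, KV count."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic was {magic!r})")
        (version,) = struct.unpack("<I", f.read(4))                 # uint32
        tensor_count, kv_count = struct.unpack("<QQ", f.read(16))   # two uint64s
    return {"version": version, "tensors": tensor_count, "metadata_kvs": kv_count}

print(read_gguf_header("model.gguf"))  # placeholder filename
```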
The naming convention
Filenames usually look like:
some-model-7b-instruct.Q4_K_M.gguf
some-model-32b.Q6_K.gguf
some-model-8b-instruct.IQ4_XS.gguf
The tag just before .gguf is the quantization spec. These are family labels rather than precise descriptions of bits-per-weight; the underlying schemes use different block sizes and some mixed precision. Rough decoding (a filename-parsing sketch follows the list):
- Q prefix — the "standard" llama.cpp quantization families.
- IQ prefix — importance-aware quantization families that use calibration data to keep critical weights at higher precision.
- The number (2, 3, 4, 5, 6, 8) is a shorthand for the average bits-per-weight target in that family. The actual storage size depends on the model's architecture and metadata.
- _S / _M / _L — different parameterizations of a family (often called small / medium / large), trading more memory for more precision in critical layers.
- _XS / _XXS — more aggressive variants, mostly used with IQ families.
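The decoding is mechanical enough to script. Here is an illustrative sketch that pulls the quantization tag out of a filename; the regex and the helper name are assumptions of this example, not an official convention, and real repositories also publish unquantized files tagged F16 or BF16 that the pattern deliberately skips.

```python
import re

# Matches a Q/IQ family tag (Q4_K_M, Q8_0, IQ4_XS, ...) just before .gguf.
QUANT_TAG = re.compile(r"\.(I?Q\d(?:_[A-Za-z0-9]+)*)\.gguf$", re.IGNORECASE)

def quant_tag(filename: str) -> str | None:
    m = QUANT_TAG.search(filename)
    return m.group(1) if m else None

for name in ("some-model-7b-instruct.Q4_K_M.gguf",
             "some-model-32b.Q6_K.gguf",
             "some-model-8b-instruct.IQ4_XS.gguf"):
    print(name, "->", quant_tag(name))
```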
A practical shortlist. The size ratios are approximate and depend on the specific model; treat them as rules of thumb:
| Tag | Bits | Size vs FP16 (approx.) | Quality | When to use |
|---|---|---|---|---|
| Q8_0 | 8 | ~50% | Near-original | When you have RAM/VRAM to spare |
| Q6_K | 6 | ~38% | Very close to original | Safe upgrade from Q4 |
| Q5_K_M | 5 | ~32% | Subtle drop | A middle ground |
| Q4_K_M | 4 | ~25% | Slight drop on most tasks | Common default for chat |
| IQ4_XS | 4 | ~22% | Often comparable to Q4_K_M, model-dependent | Tight on VRAM |
| IQ3_M | 3 | ~18% | Noticeable drop | Big model on a small GPU |
| Q3_K_S | 3 | ~16% | Real quality drop | When you have no choice |
| Q2_K | 2 | ~10% | Heavy quality drop; usually significantly degraded | Almost never, but it does run |
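To see what those ratios mean in gigabytes: FP16 stores two bytes per weight, so a 7B model is roughly 14 GB at FP16, and each row above scales that down. The sketch below just applies the table's approximate ratios; treat the outputs as rules of thumb, not measurements.

```python
# Approximate size ratios vs FP16, copied from the table above.
RATIO_VS_FP16 = {"Q8_0": 0.50, "Q6_K": 0.38, "Q5_K_M": 0.32,
                 "Q4_K_M": 0.25, "IQ4_XS": 0.22, "Q2_K": 0.10}

def approx_gguf_gb(params_billions: float, tag: str) -> float:
    fp16_gb = params_billions * 2  # FP16 = 2 bytes per weight
    return fp16_gb * RATIO_VS_FP16[tag]

for tag in ("Q8_0", "Q6_K", "Q4_K_M"):
    print(f"7B at {tag}: ~{approx_gguf_gb(7, tag):.1f} GB")
```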
What to download
Q4_K_M is a sensible default for chat on most models in 2026 — not a universal rule. If your task is quality-sensitive (code, math, structured reasoning) and you have the memory, Q5_K_M or Q6_K are common upgrades. Many current guides recommend testing Q4_K_M, Q5_K_M, and Q6_K against your specific eval set before committing.
For the "bigger model at Q4 vs smaller model at Q6" tradeoff: empirically it often favors the bigger model on chat-style tasks, but not universally. Some tasks and some model families behave differently — for example, very small quantized models can collapse on math while the same model at FP16 handles it fine. Test both on your actual workload rather than trusting a global rule.