GGUF is the dominant file format for llama.cpp-family local LLM runtimes — llama.cpp itself, Ollama, LM Studio, and text-generation-webui's llama.cpp backend. Other runtimes (vLLM, Transformers, TensorRT-LLM) can use other formats; GGUF is not a universal standard. Knowing how GGUF names its quantization variants makes the "which file should I download?" question decidable instead of intimidating.
What GGUF actually is
A GGUF file is a single binary container with three parts:
- Metadata: the model name, architecture, vocab size, context window, default sampling parameters, the chat template, plus tokenizer data. Everything the runtime needs to set up inference is in here.
- Tensor descriptors: a table of every weight tensor (name, shape, quantization type, byte offset).
- Tensor data: the quantized weights themselves, packed end-to-end.
GGUF replaced an earlier llama.cpp format called GGML, with a design goal of being extensible: new fields and new quantization variants can be added without breaking existing files.
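To make the container layout concrete, here is a minimal sketch of reading the fixed-size header that precedes the metadata section. It assumes the version 2/3 layout (4 magic bytes, a uint32 version, then uint64 tensor and metadata counts) and little-endian byte order; the function name `read_gguf_header` and the synthetic header bytes are illustrative, not part of any library.

```python
import struct

GGUF_MAGIC = b"GGUF"

def read_gguf_header(data: bytes) -> dict:
    """Parse the first 24 bytes of a GGUF file (assumed v2/v3, little-endian)."""
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file: bad magic bytes")
    # uint32 version, uint64 tensor count, uint64 metadata key/value count
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }

# Synthetic header built for illustration only; the counts are made up.
example = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
header = read_gguf_header(example)
print(header)
```

With a real file you would pass in the first 24 bytes, for example `open(path, "rb").read(24)`. The metadata key/value pairs and tensor descriptors follow immediately after and need a fuller parser.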
The naming convention
Filenames usually look like:
some-model-7b-instruct.Q4_K_M.gguf
some-model-32b.Q6_K.gguf
some-model-8b-instruct.IQ4_XS.gguf
The chunk before .gguf is the quantization spec. These are family labels rather than precise descriptions of bits-per-weight; the underlying schemes use different block sizes and some mixed precision. Rough decoding:
- Q prefix — the "standard" llama.cpp quantization families.
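- IQ prefix — the "i-quant" families, which use codebook-based (non-linear) quantization. They are typically produced with an importance matrix computed from calibration data, so the weights that matter most keep more precision.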
- The number (4, 5, 6, 8) is a shorthand for the average bits-per-weight target in that family. The actual storage size depends on the model's architecture and metadata.
- _S / _M / _L — different parameterizations of a family (often called small / medium / large), trading more memory for more precision in critical layers.
- _XS / _XXS — more aggressive variants, mostly used with IQ families.
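The decoding rules above can be sketched as a small parser. This is an illustration under assumptions: it treats the last dot- or dash-separated chunk before `.gguf` as the tag, and the pattern only covers the Q/IQ shapes discussed here (not tags such as F16 or BF16). The names `QUANT_TAG` and `parse_quant_tag` are hypothetical.

```python
import re
from typing import Optional

# Illustrative pattern for tags like Q4_K_M, Q6_K, Q8_0, IQ4_XS, IQ3_M.
QUANT_TAG = re.compile(
    r"(?P<family>IQ|Q)(?P<bits>\d)"
    r"(?:_(?P<scheme>K|\d))?"
    r"(?:_(?P<variant>XXS|XS|S|M|L))?"
)

def parse_quant_tag(filename: str) -> Optional[dict]:
    """Split a GGUF filename's quantization tag into its parts, or return None."""
    stem = filename[:-5] if filename.endswith(".gguf") else filename
    tag = re.split(r"[.\-]", stem)[-1]  # last chunk after '.' or '-'
    match = QUANT_TAG.fullmatch(tag.upper())
    if match is None:
        return None
    parts = match.groupdict()
    parts["bits"] = int(parts["bits"])
    return parts

for name in (
    "some-model-7b-instruct.Q4_K_M.gguf",
    "some-model-32b.Q6_K.gguf",
    "some-model-8b-instruct.IQ4_XS.gguf",
):
    print(name, "->", parse_quant_tag(name))
```

Real repositories do not all follow one naming scheme, so treat a `None` result as "unknown tag" rather than "unquantized".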
A practical shortlist. The size ratios are approximate and depend on the specific model; treat them as rules of thumb:
| Tag | Nominal bits | Size vs FP16 (approx.) | Quality | When to use |
|---|---|---|---|---|
| Q8_0 | 8 | ~53% | Near-original | When you have RAM/VRAM to spare |
| Q6_K | 6 | ~41% | Very close to original | Safe upgrade from Q4 |
| Q5_K_M | 5 | ~35% | Subtle drop | A middle ground |
| Q4_K_M | 4 | ~30% | Slight drop on most tasks | Common default for chat |
| IQ4_XS | 4 | ~27% | Often comparable to Q4_K_M, model-dependent | Tight on VRAM |
| IQ3_M | 3 | ~23% | Noticeable drop | Big model on a small GPU |
| Q3_K_S | 3 | ~22% | Real quality drop | When you have no choice |
| Q2_K | 2 | ~20% | Heavy quality drop | Almost never, but it does run |
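For a rough feel of where such ratios come from, the sketch below estimates weight storage from a parameter count and the bits per weight of a few llama.cpp base block types. The figures are derived from the block layouts (block bytes times 8, divided by weights per block), which is why they sit above the nominal bit count: each block also stores scales. Files tagged _S / _M / _L mix several block types, so they land near these figures rather than exactly on them. The function names and the 7e9 parameter count are assumptions for illustration.

```python
# Bits per weight for base block types, computed from block layouts.
BITS_PER_WEIGHT = {
    "Q8_0": 8.5,     # 34 bytes per 32 weights
    "Q6_K": 6.5625,  # 210 bytes per 256 weights
    "Q5_K": 5.5,     # 176 bytes per 256 weights
    "Q4_K": 4.5,     # 144 bytes per 256 weights
    "Q3_K": 3.4375,  # 110 bytes per 256 weights
    "Q2_K": 2.625,   # 84 bytes per 256 weights
}

def estimate_size_gb(n_params: float, quant: str) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring metadata."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

def ratio_vs_fp16(quant: str) -> float:
    """Size relative to 16-bit weights."""
    return BITS_PER_WEIGHT[quant] / 16

n_params = 7e9  # hypothetical 7B-parameter model
for quant in BITS_PER_WEIGHT:
    size = estimate_size_gb(n_params, quant)
    print(f"{quant}: ~{size:.2f} GB, {ratio_vs_fp16(quant):.0%} of FP16")
```

This counts weights only. At runtime you also need memory for the KV cache and activations, which grows with context length.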
What to download
Q4_K_M is a sensible default for chat on most models in 2026 — not a universal rule. If your task is quality-sensitive (code, math, structured reasoning) and you have the memory, Q5_K_M or Q6_K are common upgrades. Many current guides recommend testing Q4_K_M, Q5_K_M, and Q6_K against your specific eval set before committing.
For the "bigger model at Q4 vs smaller model at Q6" tradeoff: empirically it often favors the bigger model on chat-style tasks, but not universally. Some tasks and some model families behave differently — for example, very small quantized models can collapse on math while the same model at FP16 handles it fine. Test both on your actual workload rather than trusting a global rule.
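One way to act on "test both" is a tiny comparison harness like the sketch below. It scores each candidate by exact-match accuracy on your own prompts. Everything here is a stand-in: the `generate` callables are stubs returning canned strings, not real model output, and in practice you would wrap whatever runtime you use behind the same one-argument interface.

```python
from typing import Callable, Dict, List, Tuple

EvalSet = List[Tuple[str, str]]  # (prompt, expected answer)

def exact_match_accuracy(generate: Callable[[str], str], eval_set: EvalSet) -> float:
    """Fraction of prompts whose stripped output equals the expected answer."""
    hits = sum(generate(prompt).strip() == expected for prompt, expected in eval_set)
    return hits / len(eval_set)

def compare_candidates(
    candidates: Dict[str, Callable[[str], str]], eval_set: EvalSet
) -> Dict[str, float]:
    """Score every candidate on the same eval set."""
    return {name: exact_match_accuracy(fn, eval_set) for name, fn in candidates.items()}

# Hypothetical eval set and stub "models"; the answers are canned, not measured.
eval_set = [("2+2=", "4"), ("3*3=", "9"), ("10-7=", "3"), ("8/2=", "4")]

def stub_small_q6(prompt: str) -> str:
    return {"2+2=": "4", "3*3=": "9", "10-7=": "3", "8/2=": "4"}[prompt]

def stub_big_q4(prompt: str) -> str:
    return {"2+2=": "4", "3*3=": "9", "10-7=": "3", "8/2=": "5"}[prompt]

scores = compare_candidates(
    {"small-model-Q6_K": stub_small_q6, "big-model-Q4_K_M": stub_big_q4}, eval_set
)
print(scores)
```

Exact match suits short, checkable answers. For open-ended chat you would swap in a scoring function that fits the task, keeping the same structure.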
gguf finally explained without me having to dig through github issues. the quant suffixes always confused the hell out of me
the quant suffix decoder ring alone is worth the read. q4_k_m versus q5 finally actually means something to me instead of being noise