Why not just use safetensors?

Safetensors is the format Hugging Face uses for model weights loaded from Python (Transformers, vLLM), usually at full or half precision. GGUF was designed for CPU+GPU hybrid inference in llama.cpp, with quantization built into the format. They solve different problems: use safetensors for GPU serving with Transformers or vLLM, and GGUF for anything that runs through llama.cpp.

What is the difference between Q4_K_M and Q4_K_S?

The suffix picks a variant within the same quantization family. _S (small) quantizes more aggressively and loses a bit more quality. _M (medium) is the usual default. _L (large) keeps more precision at the cost of a bigger file. For a 4-bit baseline, Q4_K_M is the safe pick.

Q is the standard llama.cpp quantization family; IQ is the importance-aware family that uses calibration data to preserve quality at lower bit counts. IQ variants punch above their bit count: IQ4_XS is often comparable to Q4_K_M at a smaller size, and IQ3_M lets a big model fit on a small GPU at the cost of a noticeable quality drop. Results are model-dependent. Use them when you are squeezing a model into tight VRAM.

GGUF Format Explained

GGUF is the dominant file format for llama.cpp-family local LLM runtimes — llama.cpp itself, Ollama, LM Studio, and text-generation-webui's llama.cpp backend. Other runtimes (vLLM, Transformers, TensorRT-LLM) can use other formats; GGUF is not a universal standard. Knowing how GGUF names its quantization variants makes the "which file should I download?" question decidable instead of intimidating.

What GGUF actually is

A GGUF file is a single binary container with three parts:

Metadata: the model name, architecture, vocab size, context window, default sampling parameters, the chat template, plus tokenizer data. Everything the runtime needs to set up inference is in here.
Tensor descriptors: a table of every weight tensor (name, shape, quantization type, byte offset).
Tensor data: the quantized weights themselves, packed end-to-end.

GGUF replaced an earlier llama.cpp format called GGML, with a design goal of being extensible: new fields and new quantization variants can be added without breaking existing files.
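
The container layout described above can be checked from the first few bytes of a file. The sketch below is a minimal header reader, not a full parser. It assumes the header layout used by GGUF versions 2 and 3 (a 4-byte magic `GGUF`, then a little-endian uint32 version, a uint64 tensor count, and a uint64 metadata key/value count); that layout comes from the published GGUF spec rather than from this article, so verify it against the spec before relying on it. The function name `read_gguf_header` is hypothetical.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) for a GGUF file.

    Assumption: GGUF v2+ header layout (little-endian uint32, uint64, uint64
    directly after the 4-byte magic). Older v1 files used narrower counts.
    """
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != GGUF_MAGIC:
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        raw = f.read(20)  # 4 bytes version + 8 bytes + 8 bytes
        if len(raw) != 20:
            raise ValueError("file too short for a GGUF header")
        version, tensor_count, kv_count = struct.unpack("<IQQ", raw)
    return version, tensor_count, kv_count
```

The metadata entries and the tensor descriptor table follow this header; reading them needs the type-tagged value encoding from the spec, which this sketch leaves out.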

The naming convention

Filenames usually look like:

some-model-7b-instruct.Q4_K_M.gguf
some-model-32b.Q6_K.gguf
some-model-8b-instruct.IQ4_XS.gguf

The chunk before .gguf is the quantization spec. These are family labels rather than precise descriptions of bits-per-weight; the underlying schemes use different block sizes and some mixed precision. Rough decoding:

Q prefix — the "standard" llama.cpp quantization families.
IQ prefix — importance-aware quantization families that use calibration data to keep critical weights at higher precision.
The number (2, 3, 4, 5, 6, 8) is shorthand for the nominal bits-per-weight target in that family. The actual storage size depends on the model's architecture and metadata.
_S / _M / _L — different parameterizations of a family (often called small / medium / large), trading more memory for more precision in critical layers.
_XS / _XXS — more aggressive variants, mostly used with IQ families.
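
The decoding rules above can be sketched as a small filename parser. This is an illustration under assumptions: the helper name `parse_quant_tag` is hypothetical, the pattern only covers tags shaped like the examples in this article, and it accepts either `.` or `-` before the tag because both separators appear in the wild.

```python
import re

# Matches a trailing quantization tag such as ".Q4_K_M.gguf" or ".IQ4_XS.gguf".
QUANT_RE = re.compile(r"[.-](?P<prefix>IQ|Q)(?P<bits>\d)_(?P<rest>[A-Z0-9_]+)\.gguf$")

# Size suffixes described in the list above.
SIZE_VARIANTS = {"XXS", "XS", "S", "M", "L"}


def parse_quant_tag(filename):
    """Split a GGUF filename's quantization tag into family, bits, and variant.

    Returns None when the filename has no recognizable tag.
    """
    m = QUANT_RE.search(filename)
    if m is None:
        return None
    parts = m.group("rest").split("_")
    variant = parts[-1] if parts[-1] in SIZE_VARIANTS else None
    return {
        "tag": f'{m.group("prefix")}{m.group("bits")}_{m.group("rest")}',
        "family": m.group("prefix"),   # "Q" or "IQ"
        "bits": int(m.group("bits")),  # nominal bits per weight
        "variant": variant,            # "S", "M", "L", "XS", "XXS", or None
    }
```

For example, `parse_quant_tag("some-model-32b.Q6_K.gguf")` reports family `Q`, 6 bits, and no size variant, since `Q6_K` carries no _S/_M/_L suffix.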

A practical shortlist. The size ratios are approximate and depend on the specific model; treat them as rules of thumb:

Tag	Bits	Size vs FP16 (approx.)	Quality	When to use
Q8_0	8	~50%	Near-original	When you have RAM/VRAM to spare
Q6_K	6	~38%	Very close to original	Safe upgrade from Q4
Q5_K_M	5	~32%	Subtle drop	A middle ground
Q4_K_M	4	~25%	Slight drop on most tasks	Common default for chat
IQ4_XS	4	~22%	Often comparable to Q4_K_M, model-dependent	Tight on VRAM
IQ3_M	3	~18%	Noticeable drop	Big model on a small GPU
Q3_K_S	3	~16%	Real quality drop	When you have no choice
Q2_K	2	~10%	Heavy quality drop; usually significantly degraded	Almost never, but it does run
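
To turn the ratios in the table into a rough file size, multiply the FP16 size (about 2 bytes per parameter) by the ratio for the tag. The sketch below does only that arithmetic. The ratios are copied from the table and are just as approximate; the function name `estimate_gguf_gb` is hypothetical, and the result covers the weights file only, not the KV cache or other runtime memory.

```python
# Approximate size ratios versus FP16, copied from the shortlist table.
SIZE_VS_FP16 = {
    "Q8_0": 0.50,
    "Q6_K": 0.38,
    "Q5_K_M": 0.32,
    "Q4_K_M": 0.25,
    "IQ4_XS": 0.22,
    "IQ3_M": 0.18,
    "Q3_K_S": 0.16,
    "Q2_K": 0.10,
}


def estimate_gguf_gb(params_billion, tag):
    """Rule-of-thumb file size in GB for a model with the given parameter count."""
    fp16_gb = params_billion * 2.0  # 2 bytes per weight => 2 GB per billion params
    return fp16_gb * SIZE_VS_FP16[tag]


if __name__ == "__main__":
    for tag in SIZE_VS_FP16:
        print(f"7B at {tag}: about {estimate_gguf_gb(7, tag):.1f} GB")
```

A 7B model is about 14 GB at FP16, so the table's ~25% for Q4_K_M works out to roughly 3.5 GB. Real files often come out somewhat larger, so leave headroom.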

What to download

Q4_K_M is a sensible default for chat on most models in 2026, but not a universal rule. If your task is quality-sensitive (code, math, structured reasoning) and you have the memory, Q5_K_M or Q6_K are common upgrades. Before committing, test Q4_K_M, Q5_K_M, and Q6_K against your own eval set.

For the "bigger model at Q4 vs smaller model at Q6" tradeoff: empirically it often favors the bigger model on chat-style tasks, but not universally. Some tasks and some model families behave differently — for example, very small quantized models can collapse on math while the same model at FP16 handles it fine. Test both on your actual workload rather than trusting a global rule.
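
Testing on your own workload can be as simple as scoring each candidate on a handful of prompts with known answers. The sketch below is a minimal harness under assumptions: each candidate is any Python callable that maps a prompt string to a response string (how you wrap your runtime is up to you), exact-match scoring stands in for whatever metric fits your task, and the names `exact_match_accuracy` and `compare_candidates` are hypothetical.

```python
def exact_match_accuracy(generate, cases):
    """Fraction of (prompt, expected) pairs where the model's output matches exactly."""
    hits = sum(1 for prompt, expected in cases if generate(prompt).strip() == expected)
    return hits / len(cases)


def compare_candidates(candidates, cases):
    """Score every candidate on the same cases; return (name, score) pairs, best first."""
    scores = {name: exact_match_accuracy(fn, cases) for name, fn in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Stub callables stand in for real models; replace them with wrappers
    # around your own runtime. Their answers here are made up for the demo.
    cases = [("2+2=", "4"), ("3*3=", "9")]
    candidates = {
        "candidate_a": lambda p: {"2+2=": "4", "3*3=": "9"}.get(p, ""),
        "candidate_b": lambda p: {"2+2=": "4"}.get(p, ""),
    }
    for name, score in compare_candidates(candidates, cases):
        print(name, score)
```

Running the same cases through both the bigger model at Q4 and the smaller model at Q6 gives a direct comparison for your task instead of a global rule.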

Edit log

2026-05-16 — original treated GGUF as a universal local-LLM standard, presented quantization size/quality percentages as exact, and made absolute claims ("bigger-at-Q4 almost always wins"). Rewrite (after Sonar Pro fact-check) scopes GGUF to llama.cpp-family runtimes, marks all size ratios as approximate and model-dependent, softens Q2_K from "often broken" to "heavy quality drop but still runs," and turns the bigger-vs-smaller tradeoff into a heuristic to test rather than a rule.
