GGUF is the dominant file format for llama.cpp-family local LLM runtimes — llama.cpp itself, Ollama, LM Studio, and text-generation-webui's llama.cpp backend. Other runtimes (vLLM, Transformers, TensorRT-LLM) can use other formats; GGUF is not a universal standard. Knowing how GGUF names its quantization variants makes the "which file should I download?" question decidable instead of intimidating.

What GGUF actually is

A GGUF file is a single binary container with three parts:

  1. Metadata: the model name, architecture, vocab size, context window, default sampling parameters, the chat template, plus tokenizer data. Everything the runtime needs to set up inference is in here.
  2. Tensor descriptors: a table of every weight tensor (name, shape, quantization type, byte offset).
  3. Tensor data: the quantized weights themselves, packed end-to-end.
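
To make the layout concrete, here is a minimal Python sketch that reads just the fixed-size header at the front of the file: magic bytes, format version, tensor count, and metadata key/value count. It follows the published GGUF layout for version 2 and later (version 1 used 32-bit counts) and stops before the metadata section itself, which needs a real parser; the filename is a placeholder.

import struct

def read_gguf_header(path):
    # Fixed-size GGUF header (v2+): 4-byte magic, uint32 version,
    # uint64 tensor count, uint64 metadata key/value count, all little-endian.
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file: magic={magic!r}")
        version, = struct.unpack("<I", f.read(4))
        tensor_count, kv_count = struct.unpack("<QQ", f.read(16))
    return {"version": version, "tensors": tensor_count, "metadata_kv": kv_count}

print(read_gguf_header("some-model-7b-instruct.Q4_K_M.gguf"))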

GGUF replaced an earlier llama.cpp format called GGML, with a design goal of being extensible: new fields and new quantization variants can be added without breaking existing files.

The naming convention

Filenames usually look like:

some-model-7b-instruct.Q4_K_M.gguf
some-model-32b.Q6_K.gguf
some-model-8b-instruct.IQ4_XS.gguf

The segment just before the .gguf extension (Q4_K_M, Q6_K, IQ4_XS above) is the quantization spec. These are family labels rather than precise descriptions of bits-per-weight; the underlying schemes use different block sizes and some mixed precision. Rough decoding:

  • Q prefix — the "standard" llama.cpp quantization families.
  • IQ prefix — importance-aware quantization families that use calibration data to keep critical weights at higher precision.
  • The number (2 through 8) is shorthand for the average bits-per-weight target in that family. The actual storage size depends on the model's architecture and metadata.
  • _S / _M / _L — different parameterizations of a family (often called small / medium / large), trading more memory for more precision in critical layers.
  • _XS / _XXS — more aggressive variants, mostly used with IQ families.
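
If you want to sort a folder of downloads by quant, a small heuristic goes a long way. The sketch below is mine, not an official naming rule: the regex covers the common Q/IQ tags in dot- or dash-delimited filenames, and it will miss unquantized tags (F16, BF16) or unusually named files, including split multi-part downloads.

import re

# Heuristic: match Q/IQ quant tags like Q4_K_M, Q6_K, Q8_0, IQ4_XS, IQ3_M
# when they appear as a dot- or dash-delimited segment of the filename.
QUANT_TAG = re.compile(r"[.-]((?:IQ|Q)\d(?:_[A-Z0-9]+)*)(?=[.-]|$)", re.IGNORECASE)

def quant_of(filename):
    m = QUANT_TAG.search(filename)
    return m.group(1).upper() if m else None

for name in ["some-model-7b-instruct.Q4_K_M.gguf",
             "some-model-32b.Q6_K.gguf",
             "some-model-8b-instruct.IQ4_XS.gguf"]:
    print(name, "->", quant_of(name))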

A practical shortlist. The size ratios are approximate and depend on the specific model; treat them as rules of thumb:

  • Q8_0 (8-bit, ~50% of FP16 size): near-original quality; use when you have RAM/VRAM to spare.
  • Q6_K (6-bit, ~38%): very close to original; a safe upgrade from Q4.
  • Q5_K_M (5-bit, ~32%): subtle quality drop; a middle ground.
  • Q4_K_M (4-bit, ~25%): slight drop on most tasks; the common default for chat.
  • IQ4_XS (4-bit, ~22%): often comparable to Q4_K_M, model-dependent; use when tight on VRAM.
  • IQ3_M (3-bit, ~18%): noticeable drop; for fitting a big model on a small GPU.
  • Q3_K_S (3-bit, ~16%): real quality drop; for when you have no choice.
  • Q2_K (2-bit, ~10%): heavy quality drop, usually significantly degraded; almost never, but it does run.
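
A back-of-envelope size estimate is just parameter count times bits-per-weight. The per-quant averages in the sketch below are my own rough assumptions (nominal bits plus block-scale and mixed-precision overhead), so the results land a little above the nominal-bit ratios in the shortlist above; treat them as estimates, not published figures.

# Rough average bits-per-weight per quant family. Approximate and
# model-dependent: block scales and mixed-precision layers push the real
# average above the nominal bit count.
APPROX_BPW = {
    "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.9,
    "IQ4_XS": 4.3, "IQ3_M": 3.7, "Q3_K_S": 3.5, "Q2_K": 2.7,
}

def estimate_gb(params_billions, quant):
    # Weights only; the runtime still needs room for KV cache and activations.
    return params_billions * 1e9 * APPROX_BPW[quant] / 8 / 1e9

for params, quant in [(7, "Q4_K_M"), (7, "Q6_K"), (13, "Q4_K_M")]:
    print(f"{params}B at {quant}: ~{estimate_gb(params, quant):.1f} GB")

By this estimate a 7B model at Q4_K_M comes out a little over 4 GB of weights, which is in the right neighborhood for real files, before whatever the runtime adds for the KV cache and activations.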

What to download

Q4_K_M is a sensible default for chat on most models in 2026 — not a universal rule. If your task is quality-sensitive (code, math, structured reasoning) and you have the memory, Q5_K_M or Q6_K are common upgrades. Many current guides recommend testing Q4_K_M, Q5_K_M, and Q6_K against your specific eval set before committing.

For the "bigger model at Q4 vs smaller model at Q6" tradeoff: empirically it often favors the bigger model on chat-style tasks, but not universally. Some tasks and some model families behave differently — for example, very small quantized models can collapse on math while the same model at FP16 handles it fine. Test both on your actual workload rather than trusting a global rule.

Edit log

  • 2026-05-16 — original treated GGUF as a universal local-LLM standard, presented quantization size/quality percentages as exact, and made absolute claims ("bigger-at-Q4 almost always wins"). Rewrite (after Sonar Pro fact-check) scopes GGUF to llama.cpp-family runtimes, marks all size ratios as approximate and model-dependent, softens Q2_K from "often broken" to "heavy quality drop but still runs," and turns the bigger-vs-smaller tradeoff into a heuristic to test rather than a rule.