Quantization Formats Visual Guide
Byte-level structure, headers, and dequantization math for all 8 supported formats
- Q4_0
- Q4_1
- Q5_0
- Q5_1
- Q8_0
- Q4_K (Primary)
- Q6_K
- Q8_K (Activations)
Complete Format Reference
Key Concepts
Block Quantization
Weights are grouped into blocks of 32 (Q4_0/Q4_1/Q5_0/Q5_1/Q8_0) or 256 (K-quants). Each block shares one scale factor, amortizing metadata overhead while maintaining accuracy. Larger blocks compress better; smaller blocks track local weight ranges more closely.
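As a concrete check of the overhead math, here is the effective bits-per-weight calculation for a Q4_0-style block (a sketch; the 18-byte total matches ggml's block_q4_0 layout of one FP16 scale plus 16 bytes of nibbles):

```c
// Effective storage cost of a quantized block: shared scale bytes plus
// packed quant bytes, amortized over all weights in the block.
double bits_per_weight(int scale_bytes, int n_weights, int bits_per_quant) {
    const int quant_bytes = n_weights * bits_per_quant / 8;
    return (scale_bytes + quant_bytes) * 8.0 / n_weights;
}

// Q4_0: bits_per_weight(2, 32, 4) -> (2 + 16) * 8 / 32 = 4.5 bits/weight,
// i.e. the 4-bit quants cost an extra 0.5 bits each for the shared scale.
```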
Symmetric vs Asymmetric
Symmetric (Q4_0, Q5_0, Q8_0): values are centered on 0. Formula: weight = (q − center) × d, where the center is 8 for 4-bit and 16 for 5-bit quants; Q8_0 stores signed int8 directly, so weight = q × d.
Asymmetric (Q4_1, Q5_1): adds a per-block minimum offset. Formula: weight = q × d + m
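A minimal C sketch of the symmetric case. The nibble layout mirrors ggml's block_q4_0, but the scale is held as FP32 here for clarity and the struct name is illustrative:

```c
#include <stdint.h>

#define QK4_0 32

// Illustrative block: one shared scale plus 32 packed 4-bit quants.
// The on-disk ggml block_q4_0 stores d as FP16; FP32 is used here.
typedef struct {
    float   d;               // shared scale
    uint8_t qs[QK4_0 / 2];   // two 4-bit quants per byte
} block_q4_0_f32;

// Symmetric dequant: weight = (q - 8) * d. The low nibble of qs[i] is
// element i, the high nibble is element i + 16 (ggml's interleaving).
void dequant_q4_0(const block_q4_0_f32 *b, float *out) {
    for (int i = 0; i < QK4_0 / 2; i++) {
        const int q_lo = b->qs[i] & 0x0F;
        const int q_hi = b->qs[i] >> 4;
        out[i]             = (q_lo - 8) * b->d;
        out[i + QK4_0 / 2] = (q_hi - 8) * b->d;
    }
}
```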
Q4_K dmin: SUBTRACT not ADD!
Common Bug: Q4_K uses weight = q × (d × sc) − (dmin × mn)
The minus sign is critical: dmin encodes a positive offset that is subtracted to shift the range down. Using + instead of − shifts every weight up and silently produces wrong outputs.
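A one-line sketch of the correct per-element dequant, assuming the super-block scales d and dmin and the unpacked 6-bit sub-block values sc and mn are already available (the function name is illustrative):

```c
#include <stdint.h>

// Q4_K dequant for one element: the dmin term is SUBTRACTED.
// d and dmin are the super-block FP16 scales (FP32 here); sc and mn are
// the unpacked 6-bit per-sub-block scale and min.
float dequant_q4_k_elem(int q, float d, float dmin, uint8_t sc, uint8_t mn) {
    return q * (d * sc) - (dmin * mn);   // minus, not plus!
}
```

Flipping the minus to a plus shifts every weight in the sub-block up by 2 × dmin × mn, which is exactly the "output looks almost right but is wrong" failure mode described above.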
K-Quant Nested Scales
K-quant formats use two-level scaling: a super-block FP16 scale multiplied by per-sub-block 6-bit (Q4_K) or int8 (Q6_K) scales. This gives fine-grained control with minimal overhead.
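For the 6-bit case, Q4_K packs 8 scales and 8 mins into 12 bytes per super-block. This helper mirrors the unpacking done by llama.cpp's get_scale_min_k4, reproduced here as a sketch:

```c
#include <stdint.h>

// Unpack the j-th (0..7) 6-bit scale (d) and min (m) of a Q4_K
// super-block from the 12-byte packed scales array q.
void get_scale_min_k4(int j, const uint8_t *q, uint8_t *d, uint8_t *m) {
    if (j < 4) {
        // First 4 sub-blocks: low 6 bits of bytes 0..3 (scales) and
        // bytes 4..7 (mins).
        *d = q[j] & 63;
        *m = q[j + 4] & 63;
    } else {
        // Last 4 sub-blocks: low/high nibbles of bytes 8..11 plus the
        // two high bits spilled into the tops of bytes 0..7.
        *d = (q[j + 4] & 0x0F) | ((q[j - 4] >> 6) << 4);
        *m = (q[j + 4] >> 4)   | ((q[j]     >> 6) << 4);
    }
}
```

The unpacked sc/mn then feed the weight = q × (d × sc) − (dmin × mn) formula above.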
5-bit and 6-bit Packing
Q5 and Q6 formats split each quant across two byte arrays. Q5: 4 low bits in qs, 1 high bit in qh. Q6: 4 low bits in ql, 2 high bits in qh.
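A sketch of the Q5-style reassembly (the low-nibble and high-bit layout matches ggml's Q5_0; the function name is illustrative):

```c
#include <stdint.h>

#define QK5_0 32

// Rebuild the i-th 5-bit quant (0..31) of a Q5_0-style block from the
// packed low nibbles (qs, 16 bytes) and the 32-bit high-bit mask (qh).
// Low nibble of qs[i] is element i; high nibble is element i + 16.
int q5_quant(const uint8_t qs[QK5_0 / 2], uint32_t qh, int i) {
    const int lo = (i < 16) ? (qs[i] & 0x0F) : (qs[i - 16] >> 4);
    const int hi = (qh >> i) & 1;   // one high bit per element
    return (hi << 4) | lo;
    // Q5_0 then dequantizes symmetrically: weight = (q - 16) * d.
}
```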
Q8_K: FP32 Scale + bsums
Q8_K uses an FP32 scale (not FP16!) for higher precision. The bsums field holds precomputed sums over each group of 16 consecutive int8 quants, used by VNNI/AVX-512 dot-product kernels.
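The bsums contract can be sketched as follows (QK_K = 256 as in ggml; field names follow block_q8_K):

```c
#include <stdint.h>

#define QK_K 256

// Fill the Q8_K bsums: one int16 partial sum per group of 16 consecutive
// int8 quants, QK_K / 16 = 16 sums per super-block. Mixed-quant kernels
// such as gemv_q4_k_q8_k consume these to fold per-sub-block min terms
// into the dot product without re-reading the quants.
void q8_k_bsums(const int8_t qs[QK_K], int16_t bsums[QK_K / 16]) {
    for (int g = 0; g < QK_K / 16; g++) {
        int sum = 0;
        for (int j = 0; j < 16; j++) {
            sum += qs[g * 16 + j];
        }
        bsums[g] = (int16_t) sum;   // max |sum| = 16 * 128, fits in int16
    }
}
```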
Recent Fix: Q8_K SSE Parity
On 2026-03-09 we fixed a subtle Q8_K parity bug in quantize_row_q8_k_sse.c. The SSE path now preserves the same signed-max selection and bsums contract as llama.cpp/ref. See commit 224a4d30.
This class of bug is easy to miss: text generation can still look mostly normal while mixed-quant boundaries like quantize_row_q8_k -> gemv_q4_k_q8_k accumulate small parity drift, until a sensitive model family like Nanbeige exposes it.
Quant Summary (current v7 mixed-quant path)
These are the formats that matter in the current v7 inference path across Qwen, Gemma, and recent Llama/Nanbeige parity work. The practical rule is not just "pick a small dtype", but "pick the weight format and activation contract that match the runtime kernel actually used."
- Matmul path: quantize_row_q8_k -> gemv_q4_k_q8_k and gemv_q6_k_q8_k
- Activation quantization: quantize_row_q8_k (FP32 scale)
- Weight × activation pairing: Q4_K / Q6_K × Q8_K, sharing the q8_k activation contract
Related Documentation
- Quantization Deep Dive - Theory and implementation
- Bit Manipulation Visuals - Interactive SVG diagrams
- Kernel Reference - All GEMM kernels
- SIMD Architecture - AVX-512, VNNI, AMX