Quantization Formats Visual Guide
Byte-level structure, headers, and dequantization math for all 8 supported formats
- Q4_0
- Q4_1
- Q5_0
- Q5_1
- Q8_0
- Q4_K (Primary)
- Q6_K
- Q8_K (Activations)
Complete Format Reference
Key Concepts
Block Quantization
Weights are grouped into blocks of 32 (Q4_0/Q4_1/Q5_0/Q5_1/Q8_0) or 256 (K-quants). Each block shares one scale factor, amortizing metadata overhead while maintaining accuracy. Larger blocks compress better; smaller blocks track local weight ranges more closely.
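As a concrete check of the overhead math, here is the effective bits-per-weight calculation for a Q4_0-style block (a sketch; the 18-byte total matches ggml's block_q4_0 layout of one FP16 scale plus 16 bytes of nibbles):

```c
// Effective storage cost of a quantized block: shared scale bytes plus
// packed quant bytes, amortized over all weights in the block.
double bits_per_weight(int scale_bytes, int n_weights, int bits_per_quant) {
    const int quant_bytes = n_weights * bits_per_quant / 8;
    return (scale_bytes + quant_bytes) * 8.0 / n_weights;
}

// Q4_0: bits_per_weight(2, 32, 4) -> (2 + 16) * 8 / 32 = 4.5 bits/weight,
// i.e. the 4-bit quants cost an extra 0.5 bits each for the shared scale.
```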
Symmetric vs Asymmetric
Symmetric (Q4_0, Q5_0, Q8_0): values are centered on 0. Formula: weight = (q − center) × d, where the center is 8 for 4-bit and 16 for 5-bit quants; Q8_0 stores signed int8 directly, so weight = q × d.
Asymmetric (Q4_1, Q5_1): adds a per-block minimum offset. Formula: weight = q × d + m
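A minimal C sketch of the symmetric case. The nibble layout mirrors ggml's block_q4_0, but the scale is held as FP32 here for clarity and the struct name is illustrative:

```c
#include <stdint.h>

#define QK4_0 32

// Illustrative block: one shared scale plus 32 packed 4-bit quants.
// The on-disk ggml block_q4_0 stores d as FP16; FP32 is used here.
typedef struct {
    float   d;               // shared scale
    uint8_t qs[QK4_0 / 2];   // two 4-bit quants per byte
} block_q4_0_f32;

// Symmetric dequant: weight = (q - 8) * d. The low nibble of qs[i] is
// element i, the high nibble is element i + 16 (ggml's interleaving).
void dequant_q4_0(const block_q4_0_f32 *b, float *out) {
    for (int i = 0; i < QK4_0 / 2; i++) {
        const int q_lo = b->qs[i] & 0x0F;
        const int q_hi = b->qs[i] >> 4;
        out[i]             = (q_lo - 8) * b->d;
        out[i + QK4_0 / 2] = (q_hi - 8) * b->d;
    }
}
```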
Q4_K dmin: SUBTRACT not ADD!
Common Bug: Q4_K uses weight = q × (d × sc) − (dmin × mn)
The minus sign is critical: dmin encodes a positive offset that is subtracted to shift the range down. Using + instead of − shifts every weight up and silently produces wrong outputs.
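A one-line sketch of the correct per-element dequant, assuming the super-block scales d and dmin and the unpacked 6-bit sub-block values sc and mn are already available (the function name is illustrative):

```c
#include <stdint.h>

// Q4_K dequant for one element: the dmin term is SUBTRACTED.
// d and dmin are the super-block FP16 scales (FP32 here); sc and mn are
// the unpacked 6-bit per-sub-block scale and min.
float dequant_q4_k_elem(int q, float d, float dmin, uint8_t sc, uint8_t mn) {
    return q * (d * sc) - (dmin * mn);   // minus, not plus!
}
```

Flipping the minus to a plus shifts every weight in the sub-block up by 2 × dmin × mn, which is exactly the "output looks almost right but is wrong" failure mode described above.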
K-Quant Nested Scales
K-quant formats use two-level scaling: a super-block FP16 scale multiplied by per-sub-block 6-bit (Q4_K) or int8 (Q6_K) scales. This gives fine-grained control with minimal overhead.
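For the 6-bit case, Q4_K packs 8 scales and 8 mins into 12 bytes per super-block. This helper mirrors the unpacking done by llama.cpp's get_scale_min_k4, reproduced here as a sketch:

```c
#include <stdint.h>

// Unpack the j-th (0..7) 6-bit scale (d) and min (m) of a Q4_K
// super-block from the 12-byte packed scales array q.
void get_scale_min_k4(int j, const uint8_t *q, uint8_t *d, uint8_t *m) {
    if (j < 4) {
        // First 4 sub-blocks: low 6 bits of bytes 0..3 (scales) and
        // bytes 4..7 (mins).
        *d = q[j] & 63;
        *m = q[j + 4] & 63;
    } else {
        // Last 4 sub-blocks: low/high nibbles of bytes 8..11 plus the
        // two high bits spilled into the tops of bytes 0..7.
        *d = (q[j + 4] & 0x0F) | ((q[j - 4] >> 6) << 4);
        *m = (q[j + 4] >> 4)   | ((q[j]     >> 6) << 4);
    }
}
```

The unpacked sc/mn then feed the weight = q × (d × sc) − (dmin × mn) formula above.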
5-bit and 6-bit Packing
Q5 and Q6 formats split each quant across two byte arrays. Q5: 4 low bits in qs, 1 high bit in qh. Q6: 4 low bits in ql, 2 high bits in qh.
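A sketch of the Q5-style reassembly (the low-nibble and high-bit layout matches ggml's Q5_0; the function name is illustrative):

```c
#include <stdint.h>

#define QK5_0 32

// Rebuild the i-th 5-bit quant (0..31) of a Q5_0-style block from the
// packed low nibbles (qs, 16 bytes) and the 32-bit high-bit mask (qh).
// Low nibble of qs[i] is element i; high nibble is element i + 16.
int q5_quant(const uint8_t qs[QK5_0 / 2], uint32_t qh, int i) {
    const int lo = (i < 16) ? (qs[i] & 0x0F) : (qs[i - 16] >> 4);
    const int hi = (qh >> i) & 1;   // one high bit per element
    return (hi << 4) | lo;
    // Q5_0 then dequantizes symmetrically: weight = (q - 16) * d.
}
```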
Q8_K: FP32 Scale + bsums
Q8_K uses an FP32 scale (not FP16!) for higher precision. The bsums field holds precomputed sums over each group of 16 consecutive int8 quants, used by VNNI/AVX-512 dot-product kernels.
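The bsums contract can be sketched as follows (QK_K = 256 as in ggml; field names follow block_q8_K):

```c
#include <stdint.h>

#define QK_K 256

// Fill the Q8_K bsums: one int16 partial sum per group of 16 consecutive
// int8 quants, QK_K / 16 = 16 sums per super-block. Mixed-quant kernels
// such as gemv_q4_k_q8_k consume these to fold per-sub-block min terms
// into the dot product without re-reading the quants.
void q8_k_bsums(const int8_t qs[QK_K], int16_t bsums[QK_K / 16]) {
    for (int g = 0; g < QK_K / 16; g++) {
        int sum = 0;
        for (int j = 0; j < 16; j++) {
            sum += qs[g * 16 + j];
        }
        bsums[g] = (int16_t) sum;   // max |sum| = 16 * 128, fits in int16
    }
}
```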
Recent Fix: Q8_K SSE Parity
On 2026-03-09 we fixed a subtle Q8_K parity bug in quantize_row_q8_k_sse.c. The SSE path now preserves the same signed-max selection and bsums contract as llama.cpp/ref. See commit 224a4d30.
This class of bug is easy to miss: text generation can still look mostly normal while mixed-quant boundaries like quantize_row_q8_k -> gemv_q4_k_q8_k accumulate small parity drift, until a sensitive model family like Nanbeige exposes it.
Quant Summary (current v7 mixed-quant path)
These are the formats that matter in the current v7 inference path across Qwen, Gemma, and recent Llama/Nanbeige parity work. The practical rule is not just "pick a small dtype", but "pick the weight format and activation contract that match the runtime kernel actually used."
- Matmul path: quantize_row_q8_k -> gemv_q4_k_q8_k and gemv_q6_k_q8_k
- Activation quantization: quantize_row_q8_k (FP32 scale)
- Weight × activation pairing: Q4_K / Q6_K × Q8_K, sharing the q8_k activation contract
Related Documentation
- Quantization Deep Dive - Theory and implementation
- Bit Manipulation Visuals - Interactive SVG diagrams
- Kernel Reference - All GEMM kernels
- SIMD Architecture - AVX-512, VNNI, AMX