Quantization Math Deep Dive
Understanding Q5_0 and Q8_0 Formats
This page explains how 5-bit and 8-bit quantization works, the AVX-512 vectorization strategy, and the bugs we fixed during development.
Memory Layout: Block Structures
Q8_0 Block (34 bytes)
struct block_q8_0 {
    ggml_fp16_t d;      // 2 bytes: FP16 scale
    int8_t qs[32];      // 32 bytes: 32 quantized weights
};                      // Total: 34 bytes / 32 weights
// Dequantization:
weight[i] = qs[i] * d
Compression: 8.5 bits/weight (vs 32 for FP32)
Q5_0 Block (22 bytes)
struct block_q5_0 {
    ggml_fp16_t d;      // 2 bytes: FP16 scale
    uint8_t qh[4];      // 4 bytes: the 32 high bits, one per weight
    uint8_t qs[16];     // 16 bytes: packed nibbles (2 weights per byte)
};                      // Total: 22 bytes / 32 weights
// Each 5-bit weight = 4 bits from qs + 1 bit from qh
Compression: 5.5 bits/weight
Q8_0 Dequantization (Simple)
Straightforward FP16 Scale
Q8_0 is simple: each weight is just qs[i] * d where d is the FP16 scale.
// Q8_0 dequantization
for (int j = 0; j < 32; j++) {
    float weight = d * (float)block->qs[j];
}
Q5_0 Dequantization (Complex)
The 5-Bit Challenge
Q5_0 stores each weight in two pieces: 4 bits in qs[] (packed) and 1 high bit in qh[]. These must be combined before use.
Q5_0 Bit Mapping
| Weight Index | qs Source | qh Bit |
|---|---|---|
| 0-15 (first half) | qs[j] & 0x0F (low nibble) | bit j (0-15) |
| 16-31 (second half) | qs[j] >> 4 (high nibble) | bit j+16 (16-31) |
Q5_0 Scalar Reference
// Q5_0 dequantization - scalar reference
// qh is the 32-bit mask assembled from the block's 4 qh bytes
for (int j = 0; j < 16; j++) {
    // First 16 weights: low nibbles + qh bits 0-15
    const uint8_t xh_0 = ((qh >> j) & 1) << 4;
    const int32_t x0 = ((qs[j] & 0x0F) | xh_0) - 16;

    // Second 16 weights: high nibbles + qh bits 16-31
    const uint8_t xh_1 = ((qh >> (j + 16)) & 1) << 4;
    const int32_t x1 = ((qs[j] >> 4) | xh_1) - 16;

    weight[j]      = d * x0;
    weight[j + 16] = d * x1;
}
AVX-512 Bugs We Found & Fixed
Bug #1: 16x Multiplier
Result was exactly 16x too large!
// BUGGY:
acc = _mm512_add_ps(acc, _mm512_set1_ps(d * sumi));
*s = _mm512_reduce_add_ps(acc);
// _mm512_set1_ps broadcasts to 16 lanes, then sum = 16x!
Fix: Use scalar accumulator instead of vector broadcast.
Bug #2: Wrong qh Bits
Used bits 12-27 instead of bits 16-31!
// BUGGY - used wrong bit range
__m512i qh_hi = _mm512_set_epi32(
((qh >> 27) & 1) << 4, ..., // Wrong!
((qh >> 12) & 1) << 4 // Wrong!
);
// FIXED - correct bit range 16-31
__m512i qh_hi = _mm512_set_epi32(
((qh >> 31) & 1) << 4, ..., // Correct!
((qh >> 16) & 1) << 4 // Correct!
);
AVX-512 Vectorization Strategy
Processing 32 Weights in 2 Batches of 16
// AVX-512 vec_dot_q5_0_q8_0 (sketch)
for (int ib = 0; ib < nb; ib++) {
    // Load 16 packed bytes (32 nibbles)
    __m128i qs = _mm_loadu_si128((const __m128i *)x[ib].qs);

    // First 16 weights
    __m512i lo_nibbles = _mm512_cvtepu8_epi32(qs);
    lo_nibbles = _mm512_and_epi32(lo_nibbles, mask_0f);
    __m512i qh_lo = /* bits 0-15 of qh, each shifted up to bit 4 */;
    __m512i q5_lo = _mm512_sub_epi32(_mm512_or_epi32(lo_nibbles, qh_lo), sixteen);
    __m512i y32_lo = _mm512_cvtepi8_epi32(_mm_loadu_si128((const __m128i *)y[ib].qs));
    __m512i prod_lo = _mm512_mullo_epi32(q5_lo, y32_lo);

    // Second 16 weights (same pattern: high nibbles + qh bits 16-31)
    int sumi = _mm512_reduce_add_epi32(_mm512_add_epi32(prod_lo, prod_hi));
    sumf += d * (float)sumi;
}
Key AVX-512 Intrinsics Used
| Intrinsic | Operation |
|---|---|
| _mm512_cvtepu8_epi32 | Expand 16 bytes → 16 int32s (zero extension) |
| _mm512_cvtepi8_epi32 | Expand 16 int8s → 16 int32s (sign extension) |
| _mm512_mullo_epi32 | Multiply 16 pairs of int32s, keeping the low 32 bits (truncating, not saturating) |
| _mm512_reduce_add_epi32 | Horizontal sum of all 16 lanes |
| _mm512_set_epi32 | Build a vector from 16 scalar values |
Quick Reference
Q8_0 Formula
weight[i] = d × qs[i]
Q5_0 Formula
weight[j] = d × ((nibble | (qh_bit << 4)) - 16)
vec_dot Computation
result = Σ_blocks (d_x × d_y × Σ_i (q_x[i] × q_y[i]))
Sum over all blocks, where each block contributes: scale × integer dot product of the quantized values (the scales d_x and d_y are factored out of the inner sum).