Quantization Math Deep Dive
Understanding Q5_0 and Q8_0 Formats
This page explains how 5-bit and 8-bit quantization works, the AVX-512 vectorization strategy, and the bugs we fixed during development.
Memory Layout: Block Structures
Q8_0 Block (34 bytes)
struct block_q8_0 {
    ggml_fp16_t d;      // 2 bytes: FP16 scale
    int8_t qs[32];      // 32 bytes: 32 quantized weights
};                      // Total: 34 bytes / 32 weights
// Dequantization:
weight[i] = qs[i] * d
Compression: 8.5 bits/weight (vs 32 for FP32)
Q5_0 Block (22 bytes)
struct block_q5_0 {
    ggml_fp16_t d;      // 2 bytes: FP16 scale
    uint8_t qh[4];      // 4 bytes: the 32 high bits, one per weight
    uint8_t qs[16];     // 16 bytes: packed nibbles (2 weights per byte)
};                      // Total: 22 bytes / 32 weights
// Each 5-bit weight = 4 bits from qs + 1 bit from qh
Compression: 5.5 bits/weight
Q8_0 Dequantization (Simple)
Straightforward FP16 Scale
Q8_0 is simple: each weight is just qs[i] * d where d is the FP16 scale.
// Q8_0 dequantization
for (int j = 0; j < 32; j++) {
    float weight = d * (float)block->qs[j];
}
Q5_0 Dequantization (Complex)
The 5-Bit Challenge
Q5_0 stores each weight in two pieces: 4 bits in qs[] (packed) and 1 high bit in qh[]. These must be combined before use.
Q5_0 Bit Mapping
| Weight Index | qs Source | qh Bit |
|---|---|---|
| 0-15 (first half) | qs[j] & 0x0F (low nibble) | bit j (0-15) |
| 16-31 (second half) | qs[j] >> 4 (high nibble) | bit j+16 (16-31) |
Q5_0 Scalar Reference
// Q5_0 dequantization - scalar reference
// qh is the 32-bit mask assembled from the block's 4 qh bytes
for (int j = 0; j < 16; j++) {
    // First 16 weights: low nibbles + qh bits 0-15
    const uint8_t xh_0 = ((qh >> j) & 1) << 4;
    const int32_t x0 = ((qs[j] & 0x0F) | xh_0) - 16;

    // Second 16 weights: high nibbles + qh bits 16-31
    const uint8_t xh_1 = ((qh >> (j + 16)) & 1) << 4;
    const int32_t x1 = ((qs[j] >> 4) | xh_1) - 16;

    weight[j]      = d * x0;
    weight[j + 16] = d * x1;
}
AVX-512 Bugs We Found & Fixed
Bug #1: 16x Multiplier
Result was exactly 16x too large!
// BUGGY:
acc = _mm512_add_ps(acc, _mm512_set1_ps(d * sumi));
*s = _mm512_reduce_add_ps(acc);
// _mm512_set1_ps broadcasts to 16 lanes, then sum = 16x!
Fix: Use scalar accumulator instead of vector broadcast.
Bug #2: Wrong qh Bits
Used bits 12-27 instead of bits 16-31!
// BUGGY - used wrong bit range
__m512i qh_hi = _mm512_set_epi32(
((qh >> 27) & 1) << 4, ..., // Wrong!
((qh >> 12) & 1) << 4 // Wrong!
);
// FIXED - correct bit range 16-31
__m512i qh_hi = _mm512_set_epi32(
((qh >> 31) & 1) << 4, ..., // Correct!
((qh >> 16) & 1) << 4 // Correct!
);
AVX-512 Vectorization Strategy
Processing 32 Weights in 2 Batches of 16
// AVX-512 vec_dot_q5_0_q8_0 (sketch)
for (int ib = 0; ib < nb; ib++) {
    // Load 16 packed bytes (32 nibbles)
    __m128i qs = _mm_loadu_si128((const __m128i *)x[ib].qs);

    // First 16 weights
    __m512i lo_nibbles = _mm512_cvtepu8_epi32(qs);
    lo_nibbles = _mm512_and_epi32(lo_nibbles, mask_0f);
    __m512i qh_lo = /* bits 0-15 of qh, each shifted up to bit 4 */;
    __m512i q5_lo = _mm512_sub_epi32(_mm512_or_epi32(lo_nibbles, qh_lo), sixteen);
    __m512i y32_lo = _mm512_cvtepi8_epi32(_mm_loadu_si128((const __m128i *)y[ib].qs));
    __m512i prod_lo = _mm512_mullo_epi32(q5_lo, y32_lo);

    // Second 16 weights (same pattern: high nibbles + qh bits 16-31)
    int sumi = _mm512_reduce_add_epi32(_mm512_add_epi32(prod_lo, prod_hi));
    sumf += d * (float)sumi;
}
Key AVX-512 Intrinsics Used
| Intrinsic | Operation |
|---|---|
| _mm512_cvtepu8_epi32 | Expand 16 bytes → 16 int32s (zero extension) |
| _mm512_cvtepi8_epi32 | Expand 16 int8s → 16 int32s (sign extension) |
| _mm512_mullo_epi32 | Multiply 16 pairs of int32s, keeping the low 32 bits (truncating, not saturating) |
| _mm512_reduce_add_epi32 | Horizontal sum of all 16 lanes |
| _mm512_set_epi32 | Build a vector from 16 scalar values |
Quick Reference
Q8_0 Formula
weight[i] = d × qs[i]
Q5_0 Formula
weight[j] = d × ((nibble | (qh_bit << 4)) - 16)
vec_dot Computation
result = Σ_blocks (d_x × d_y × Σ_i (q_x[i] × q_y[i]))
Sum over all blocks, where each block contributes: scale × integer dot product of the quantized values (the scales d_x and d_y are factored out of the inner sum).