Quantization Bit Manipulation Visuals

Interactive guide to internalize the math behind Q5_0, Q4_K, and INT8 kernels

Goal: Make Bit Manipulation Visceral

These visuals explain why the code computes qh >> (j + 12) rather than qh >> (j + 16). Step through each diagram to see exactly how bits are packed into and extracted from the quantized formats.

Quantization Overview
Q5_0 Bit Layout
Q4_K Superblock
INT8 Dot Product

Quantization Overview

  • Where quantization happens in a transformer layer
  • Bits per weight: FP32 (32) → Q8_0 (8.5) → Q5_0 (5.5) → Q4_K (4.5)
  • Kernel selection: Which kernel to use for each weight type
  • INT8 vs FP32: Why quantized activations are 4x faster
Keyboard shortcuts: 1-4 switch diagrams | +/- zoom | 0 reset | F fullscreen

Related Quantization Documentation

Quantization Deep Dive

BF16 vs FP16, block grouping, kernel dispatch strategy

Format Reference

Byte-level layout of Q4_0, Q5_0, Q8_0, K-quants

SIMD Architecture

AVX-512, VNNI, AMX for quantized inference

Kernel Catalog

All available GEMM and quantized kernels
