Quantization Deep Dive
Reduce model size by 4-8x while maintaining accuracy. Learn how block quantization formats work, why grouping matters, and how to integrate quantized weights with our bump allocator.
This guide covers the fundamentals of weight quantization: from the basic math behind INT4/INT8 representation to advanced grouping strategies used by modern block quantization formats.
Visual Format Reference
See byte-level structure, bit packing, and dequantization math for all supported formats: Q4_0, Q4_1, Q5_0, Q5_1, Q5_K, Q8_0, Q4_K, Q6_K, Q8_K. Activations run in BF16/FP32 with Q8_0/Q8_K quantized paths.
Table of Contents
- Why Quantize?
- Floating Point Primer
- Basic Quantization Math
- The Problem with Single Scales
- Grouping: The Solution
- Quantization Formats
- Memory Layout in Bump Allocator
- Kernel Dispatch by Type
- Cache Line Access Patterns
- Dequantization in Registers
- AMX-INT8: The Hardware Reality
- Q4_K: K-Quants Deep Dive
- How Blockwise Dequant Works
- Bump Allocator Integration
- Practical Implementation
Why Quantize?
Large language models are memory-bound, not compute-bound. A 7B parameter model in FP32 requires 28 GB of memory just for weights. Quantization dramatically reduces this.
FP32 (32-bit)
28 GB for 7B params
Full precision, no loss
INT8 (8-bit)
7 GB for 7B params
4x reduction, ~0.1% loss
INT4 (4-bit)
3.5 GB for 7B params
8x reduction, ~0.5% loss
Memory Bandwidth is the Bottleneck
Modern CPUs can do billions of FLOPs/second, but memory bandwidth is limited:
| Hardware | Compute | Memory BW | Arithmetic Intensity Needed |
|---|---|---|---|
| Intel Xeon Gold 6542Y | ~4 TFLOPS (FP32) | ~300 GB/s | 13 FLOPS/byte |
| NVIDIA A100 | 19.5 TFLOPS (FP32) | 2 TB/s | 10 FLOPS/byte |
Matrix-vector multiply (token generation) performs ~2 FLOPs per weight, i.e. roughly 0.5 FLOPs per byte of FP32 weights -- far below the 10-13 FLOPs/byte the hardware needs to stay busy. We're memory-bound, so smaller weights = faster inference.
Floating Point Primer
Before diving into quantization, let's understand what we're compressing.
FP32 Format (32 bits)
Sign: 1 bit | Exponent: 8 bits | Mantissa: 23 bits

Value = (-1)^S x 2^(E-127) x 1.Mantissa

Example: 0.15625
  Sign:     0 (positive)
  Exponent: 01111100 (124 - 127 = -3)
  Mantissa: 01000000000000000000000 (1.25)
  = 1 x 2^(-3) x 1.25 = 0.15625
Precision: ~7 significant decimal digits (relative error ~1.2e-7)
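The decomposition above can be reproduced in a few lines of C. This is a sketch for illustration; `fp32_decode` and `fp32_fields` are names made up for this guide, not part of the runtime:

```c
#include <stdint.h>
#include <string.h>

/* Decode an FP32 value into sign, biased exponent, and mantissa bits,
   mirroring the 1/8/23 split described above. */
typedef struct { uint32_t sign, exponent, mantissa; } fp32_fields;

static fp32_fields fp32_decode(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);        /* safe type-pun */
    fp32_fields out;
    out.sign     = bits >> 31;             /* 1 bit   */
    out.exponent = (bits >> 23) & 0xFF;    /* 8 bits  */
    out.mantissa = bits & 0x7FFFFF;        /* 23 bits */
    return out;
}
```

For 0.15625 this recovers exponent field 124 and mantissa bits 0x200000, matching the worked example.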
BF16 Format (16 bits)
Sign: 1 bit | Exponent: 8 bits | Mantissa: 7 bits
Same exponent range as FP32, but reduced precision (7 bits = ~2 decimal digits).
Relative error: ~7.8e-3 (~0.78%). This is why our BF16 kernel tests use 1e-2 tolerance.
Basic Quantization Math
Quantization maps continuous floating-point values to discrete integers.
Symmetric Quantization
# Quantize: FP32 -> INT
scale = max(abs(weights)) / (2^(bits-1) - 1)
quantized = round(weights / scale)
# Dequantize: INT -> FP32
dequantized = quantized * scale
Example (INT8, range [-127, 127]):
weights = [-0.5, 0.2, 1.0, -0.3]
scale = 1.0 / 127 = 0.00787
quantized = [-64, 25, 127, -38]
dequantized = [-0.504, 0.197, 1.0, -0.299] <- small error
Asymmetric Quantization
# Also stores zero-point for non-symmetric distributions
scale = (max - min) / (2^bits - 1)
zero_point = round(-min / scale)
quantized = round(weights / scale) + zero_point
# Useful when weights are not centered around zero
# (e.g., after ReLU activations)
With a single scale for the entire weight matrix, small values get quantized to zero. This is catastrophic for neural networks where every parameter matters.
The Problem with Single Scales
Consider a weight matrix with values ranging from -2.0 to +2.0, where some small but important values exist around 0.01:
Single Scale Failure
Weight matrix: [-2.0, 0.01, 1.5, -0.005, 0.8]
With single scale (INT4, range [-8, 7]):
scale = 2.0 / 7 = 0.286
-2.0 -> round(-2.0 / 0.286) = -7 -> dequant: -2.0 OK
0.01 -> round(0.01 / 0.286) = 0 -> dequant: 0.0 LOST!
1.5 -> round(1.5 / 0.286) = 5 -> dequant: 1.43 ~ok
-0.005-> round(-0.005/0.286) = 0 -> dequant: 0.0 LOST!
0.8 -> round(0.8 / 0.286) = 3 -> dequant: 0.86 ~ok
Small weights (0.01, -0.005) become exactly zero. These might be critical for model behavior!
Grouping: The Solution
Instead of one scale for the entire matrix, use per-group scales. Each group gets its own scale, tailored to its local range.
Group 1: Large Values
Values: [-2.0, 1.5, 0.8, -1.2]
Local max: 2.0
Scale: 2.0 / 7 = 0.286
Each value uses full INT4 range
relative to THIS group's scale.
Group 2: Small Values
Values: [0.01, -0.005, 0.008, 0.003]
Local max: 0.01
Scale: 0.01 / 7 = 0.00143
Now small values use full range!
0.01 -> 7, -0.005 -> -4, etc.
Why Group Size Matters
| Group Size | Scale Overhead | Precision | Use Case |
|---|---|---|---|
| 32 | 1 FP16 per 32 weights | Best | Q4_0, Q8_0 (common) |
| 64 | 1 FP16 per 64 weights | Good | Some custom formats |
| 128 | 1 FP16 per 128 weights | Medium | Aggressive compression |
| 256 | Nested scales | Varies | Q4_K, Q5_K "k-quants" |
Trade-off: Smaller groups = better precision but more scale storage overhead.
Quantization Formats
The block formats below are widely used for weight-only quantized inference and are common in GGUF-style model files. We implement them to make the runtime compatible with those on-disk layouts.
Q4_0: The Simplest 4-bit Format
Structure
typedef struct {
ck_half d; // 2 bytes: scale (delta)
uint8_t qs[16]; // 16 bytes: 32 x 4-bit weights (2 per byte)
} block_q4_0; // Total: 18 bytes per 32 weights
Bits per weight: 18 * 8 / 32 = 4.5 bits
Memory Layout
[ d: 2B ][ qs: 16B ] = 18 bytes per block of 32 weights
Dequantization
// Extract 4-bit values (signed, range -8 to 7)
int8_t q0 = (qs[i] & 0xF) - 8; // Lower nibble
int8_t q1 = (qs[i] >> 4) - 8; // Upper nibble
// Dequantize
float w0 = q0 * scale;
float w1 = q1 * scale;
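The nibble extraction above can be checked with a small scalar reference. This is a sketch, not the runtime's kernel: it stores the scale as plain FP32 instead of `ck_half`, and assumes the llama.cpp-style layout where weight i comes from the low nibble of byte i and weight i+16 from the high nibble:

```c
#include <stdint.h>

/* Scalar reference dequant for one Q4_0-style block (FP32 scale for
   clarity; the on-disk format stores an FP16 scale). */
typedef struct {
    float   d;       /* scale */
    uint8_t qs[16];  /* 32 x 4-bit weights, two per byte */
} block_q4_0_f32;

static void dequant_block_q4_0(const block_q4_0_f32* b, float* out) {
    for (int i = 0; i < 16; i++) {
        int8_t q0 = (int8_t)(b->qs[i] & 0x0F) - 8;  /* lower nibble */
        int8_t q1 = (int8_t)(b->qs[i] >> 4)   - 8;  /* upper nibble */
        out[i]      = q0 * b->d;   /* low nibbles fill weights 0..15  */
        out[i + 16] = q1 * b->d;   /* high nibbles fill weights 16..31 */
    }
}
```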
Q4_1: 4-bit with Min Value
Structure
typedef struct {
ck_half d; // 2 bytes: scale
ck_half m; // 2 bytes: minimum value
uint8_t qs[16]; // 16 bytes: 32 x 4-bit weights
} block_q4_1; // Total: 20 bytes per 32 weights
Bits per weight: 20 * 8 / 32 = 5.0 bits
Why Store Min?
Asymmetric quantization: dequant = q * scale + min
Better for weight distributions not centered at zero (e.g., biased layers).
Q8_0: 8-bit Integer
Structure
typedef struct {
ck_half d; // 2 bytes: scale
int8_t qs[32]; // 32 bytes: 32 x 8-bit signed weights
} block_q8_0; // Total: 34 bytes per 32 weights
Bits per weight: 34 * 8 / 32 = 8.5 bits
Trade-off
Higher precision (256 levels vs 16 for Q4), but 2x larger than Q4_0.
Often used for activations or when accuracy is critical.
Q8_K: 8-bit K-Quant Activation Format
Structure (Superblock of 256 values)
typedef struct {
float d; // 4 bytes: FP32 scale
int8_t qs[256]; // 256 bytes: 256 x 8-bit signed values
int16_t bsums[16]; // 32 bytes: sums of each 16-value chunk
} block_q8_K; // Total: 292 bytes per 256 values
Bits per value: 292 * 8 / 256 = 9.125 bits
Why This Exists
Q8_K is the activation-side companion for mixed K-quant kernels like gemv_q4_k_q8_k, gemv_q6_k_q8_k, and gemm_nt_q4_k_q8_k. In the current v7 runtime, the relevant path is often FP32 hidden -> quantize_row_q8_k -> Q4_K/Q6_K matvec.
Recent Bug Fix (2026-03-09)
We fixed a subtle parity bug in the SSE implementation of quantize_row_q8_k. The fix restored the exact signed-max selection and bsums contract expected by llama.cpp/ref. See commit 224a4d30.
This kind of bug is easy to miss because outputs can remain plausible while parity drift grows at mixed-quant boundaries. It often only shows up under stricter per-op or model-family bring-up checks.
Q4_K: K-Quant with Nested Scales
Structure (Superblock of 256 weights)
typedef struct {
ck_half d; // 2 bytes: superblock scale
ck_half dmin; // 2 bytes: superblock min
uint8_t scales[12]; // 12 bytes: 8 sub-block scales + 8 mins (6-bit each, packed)
uint8_t qs[128]; // 128 bytes: 256 x 4-bit weights
} block_q4_K; // Total: 144 bytes per 256 weights
Bits per weight: 144 * 8 / 256 = 4.5 bits
Nested Scale Hierarchy
Superblock (256 weights)
|-- d (FP16): overall scale
|-- dmin (FP16): overall minimum
+-- 8 Sub-blocks (32 weights each)
|-- scales[i] (6-bit): local scale adjustment
+-- qs[...]: 4-bit quantized weights
Dequant: weight = q * (d * sub_scale) - dmin * sub_min
Best of both worlds: Fine-grained local scales without per-32-weight FP16 overhead.
Format Comparison
| Format | Bits/Weight | Block Size | Quality | Speed |
|---|---|---|---|---|
| FP32 | 32 | N/A | Perfect | Slow (memory-bound) |
| FP16 | 16 | N/A | Excellent | 2x faster |
| Q8_0 | 8.5 | 32 | Very Good | ~3.5x faster |
| Q4_0 | 4.5 | 32 | Good | ~6x faster |
| Q4_K | 4.5 | 256 | Very Good | ~6x faster |
Memory Layout in Bump Allocator
Our bump allocator manages quantized weights, activations, and scratch space. Key principle: never mix types within a region.
Region Separation
// Bump allocator layout for quantized inference
typedef struct {
// Region 1: Quantized weights (read-only after load)
uint8_t *weights_q4; // Q4_0 or Q4_K blocks
size_t weights_size;
// Region 2: Activations (FP32, read-write)
float *activations; // Input/output tensors
size_t act_size;
// Region 3: Scratch (FP32, temporary)
float *scratch; // Intermediate buffers
size_t scratch_size;
// Region 4: Dequantized cache (optional)
float *dequant_cache; // Hot weights kept dequantized
size_t cache_size;
} BumpAllocator;
Allocation Strategy
// Allocate quantized weight buffer
size_t num_blocks = (num_weights + 31) / 32; // Round up
size_t q4_0_size = num_blocks * sizeof(block_q4_0); // 18 bytes/block
void* weights = bump_alloc(allocator, q4_0_size, REGION_WEIGHTS);
// Allocate activation buffer (always FP32)
size_t act_size = batch * seq_len * hidden_dim * sizeof(float);
float* activations = bump_alloc(allocator, act_size, REGION_ACTIVATIONS);
1. Cache efficiency: Homogeneous data types within cache lines
2. Alignment: Each region can have optimal alignment for its type
3. Prefetching: Sequential access patterns per region
4. Safety: Type confusion bugs are impossible
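The `bump_alloc` calls above assume a region-aware bump allocator. Here is a minimal sketch of the core bump-pointer step for a single region, with hypothetical names (the real allocator adds region tags and type checks):

```c
#include <stddef.h>
#include <stdint.h>

/* One region of a bump allocator: a base pointer, a high-water mark,
   and a capacity. Allocation is pointer arithmetic; there is no free(). */
typedef struct {
    uint8_t* base;
    size_t   used;
    size_t   capacity;
} bump_region;

static void* bump_alloc_region(bump_region* r, size_t size, size_t align) {
    size_t offset = (r->used + align - 1) & ~(align - 1);  /* align up */
    if (offset + size > r->capacity) return NULL;          /* exhausted */
    r->used = offset + size;
    return r->base + offset;
}
```

Keeping one such region per data type is what makes the alignment and prefetching properties in the list above fall out naturally.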
Kernel Dispatch by Type
Type checking happens at tensor level, never in the hot path.
Type-Tagged Tensors
typedef enum {
DTYPE_FP32,
DTYPE_FP16,
DTYPE_BF16,
DTYPE_Q4_0,
DTYPE_Q4_1,
DTYPE_Q5_0,
DTYPE_Q5_1,
DTYPE_Q4_K,
DTYPE_Q6_K,
DTYPE_Q8_0,
DTYPE_Q8_K
} DType;
typedef struct {
void *data;
size_t shape[4];
DType dtype;
// ... other metadata
} Tensor;
Dispatch at Operation Level
const char *select_linear_kernel(DType weight, DType activation, bool prefill) {
if (!prefill && weight == DTYPE_Q4_K && activation == DTYPE_Q8_K) return "gemv_q4_k_q8_k";
if (!prefill && weight == DTYPE_Q6_K && activation == DTYPE_Q8_K) return "gemv_q6_k_q8_k";
if (prefill && weight == DTYPE_Q4_K && activation == DTYPE_Q8_K) return "gemm_nt_q4_k_q8_k";
if (prefill && weight == DTYPE_Q4_K && activation == DTYPE_FP32) return "gemm_nt_q4_k";
if (!prefill && weight == DTYPE_Q8_0 && activation == DTYPE_FP32) return "gemv_q8_0";
return "fallback_or_error";
}
void lower_mixed_quant_linear(...) {
quantize_row_q8_k(hidden_fp32, hidden_q8k, K);
gemv_q4_k_q8_k(out_fp32, weight_q4k, hidden_q8k, K);
}
Why This Bug Was Hard To Notice
The March 9 fix was not a flashy crash bug. The runtime could still generate text, and some top-1 logits remained unchanged. But the wrong signed-max choice or incorrect bsums bookkeeping perturbs internal mixed-quant math just enough to accumulate drift later.
That is why quantization bugs like this often survive casual manual testing and only show up in strict parity tooling or in model families that stress a specific kernel path.
Cache Line Access Patterns
Understanding cache behavior is critical for quantized kernel performance.
Cache Line Basics
Cache line size: 64 bytes (typical x86)
Q4_0 block: 18 bytes -> ~3.5 blocks per cache line
Q4_K block: 144 bytes -> 2.25 cache lines per block
Key insight: Scale and weights should be in SAME cache line
when possible for minimal memory traffic.
Good: Q4_0 Access
Cache line 1 (64B):
|-- Block 0: scale (2B) + qs (16B)
|-- Block 1: scale (2B) + qs (16B)
|-- Block 2: scale (2B) + qs (16B)
+-- Block 3: scale (2B) + first 8B of qs (rest in next cache line)
One fetch -> 3.5 blocks -> 112 weights!
Bad: Scattered Scales
If scales were separate from weights:
Cache line 1: scales[0..31]
Cache line 2: weights[0..127]
Cache line 3: weights[128..255]
3 fetches for same work!
SIMD-Aligned Processing
// Process one Q4_0 block's lower half: 16 weights -> 16 FP32 lanes
// (the upper nibbles are handled by an analogous _upper routine)
__m512 dequant_q4_0_avx512(const block_q4_0* block) {
    // Load scale (broadcast to all lanes)
    __m512 scale = _mm512_set1_ps(CK_FP16_TO_FP32(block->d));
    // Load 16 bytes = 32 packed 4-bit weights
    __m128i qs = _mm_loadu_si128((const __m128i*)block->qs);
    // Lower nibbles -> signed INT8 in [-8, 7]
    __m128i lo = _mm_sub_epi8(_mm_and_si128(qs, _mm_set1_epi8(0x0F)),
                              _mm_set1_epi8(8));
    // Widen INT8 -> INT32, convert to FP32, apply the scale
    __m512 weights = _mm512_cvtepi32_ps(_mm512_cvtepi8_epi32(lo));
    return _mm512_mul_ps(weights, scale);
}
Dequantization in Registers
Critical optimization: dequantize into CPU registers, never write back to memory.
The Pattern
// WRONG: Dequantize to memory, then use
float* temp = malloc(hidden_dim * sizeof(float));
dequantize_q4_0(weights_q4, temp, hidden_dim); // Write to RAM
matmul(out, input, temp, ...); // Read from RAM
free(temp);
// RIGHT: Dequantize in registers during computation
for (int block = 0; block < num_blocks; block++) {
// Dequantize 16 weights -> 16 floats in a ZMM register
// (the block's upper half is handled the same way)
__m512 w = dequant_q4_0_avx512(&weights_q4[block]);
// Immediately use in FMA (still in registers!)
acc = _mm512_fmadd_ps(input_vec, w, acc);
// w never touches RAM after dequantization
}
Register bandwidth: ~1 TB/s
L1 cache bandwidth: ~200 GB/s
RAM bandwidth: ~50 GB/s
By keeping dequantized values in registers, we get 20x better bandwidth than RAM.
Complete Quantized GEMV Example
// Quantized matrix-vector multiply: y = A_q4 @ x
void gemv_q4_0(float* y, const block_q4_0* A, const float* x,
int M, int K) {
const int blocks_per_row = K / 32;
for (int row = 0; row < M; row++) {
__m512 acc = _mm512_setzero_ps();
for (int b = 0; b < blocks_per_row; b++) {
// Load quantized block (18 bytes from RAM)
const block_q4_0* block = &A[row * blocks_per_row + b];
// Dequantize in registers (no memory write!)
__m512 w_lo = dequant_q4_0_lower(block); // weights 0-15
__m512 w_hi = dequant_q4_0_upper(block); // weights 16-31
// Load corresponding input (32 floats)
__m512 x_lo = _mm512_loadu_ps(&x[b * 32]);
__m512 x_hi = _mm512_loadu_ps(&x[b * 32 + 16]);
// FMA: accumulate dot product
acc = _mm512_fmadd_ps(w_lo, x_lo, acc);
acc = _mm512_fmadd_ps(w_hi, x_hi, acc);
}
// Horizontal sum and store single result
y[row] = _mm512_reduce_add_ps(acc);
}
}
AMX-INT8: The Hardware Reality
Intel's AMX (Advanced Matrix Extensions) offers significant speedups, but understanding its constraints is critical for correct implementation.
AMX-INT8 Constraint
_tile_dpbusd: UINT8 x INT8 -> INT32
----- ----
tile A tile B
Both tiles must be INT8. No exceptions.
There's no AMX instruction for:
BF16 x INT8 -> ??? // Doesn't exist
FP32 x INT8 -> ??? // Doesn't exist
Option 1: Full INT8 Pipeline
Quantize both weights and activations to INT8:
FP32 input
| quantize (AVX-512)
INT8 activations
|
AMX-INT8 (tile_dpbusd)
|
INT32 accumulator
| dequantize (AVX-512)
FP32 output
Requires:
- Calibration data for activation ranges
- Per-layer scales for activations
- ~1-2% accuracy loss (often acceptable for vision, not for LLMs)
Option 2: Weight-Only INT8
Keep activations in BF16/FP32, cannot use AMX-INT8:
INT8 weights --(dequant, AVX-512)--> BF16 weights
                                         |
BF16 activations ----------------------> AMX-BF16 --> FP32 output
Advantages:
- No calibration needed
- Preserves activation precision
- This is what most LLM deployments use
Most LLM deployments use INT4/INT8 weights with BF16/FP16 activations. AMX-INT8 is rarely used because quantizing activations hurts quality too much.
The Dequantization Cost
// Dequant INT8 -> BF16 using AVX-512
// This is "wasted" work before AMX compute
__m512i w_int8 = _mm512_loadu_si512(weights); // 64 INT8 values
__m512 scale = _mm512_set1_ps(group_scale);
// Unpack to 32-bit, convert to float, scale, convert to BF16
__m512i w_lo = _mm512_cvtepi8_epi32(_mm512_extracti32x4_epi32(w_int8, 0));
__m512 w_f32 = _mm512_mul_ps(_mm512_cvtepi32_ps(w_lo), scale);
__m256bh w_bf16 = _mm512_cvtneps_pbh(w_f32); // FP32 -> BF16 (AVX512-BF16)
// Now feed to AMX-BF16...
This dequant step eats cycles. The question is: does memory bandwidth savings outweigh dequant overhead?
The Math: When INT8 Wins
Memory bandwidth saved:
BF16 weights: 2 bytes/weight
INT8 weights: 1 byte/weight
Savings: ~50% bandwidth
Dequant cost:
~4-8 AVX-512 instructions per 64 weights
Maybe 10-20 cycles per 64 weights
Break-Even Analysis
Memory-bound (batch=1, token gen):
-> INT8 + dequant WINS
-> Bandwidth is bottleneck
Compute-bound (batch=64, prefill):
-> BF16 direct WINS
-> Dequant is pure overhead
Q4_K: K-Quants Deep Dive
The "K" in Q4_K refers to a family of block quantizers that use nested scales (a super-block scale plus per-sub-block adjustments). This format is popular because it achieves strong quality at ~4.5 bits/weight while keeping metadata compact.
Q4_K Structure Definition
// Q4_K block: 256 weights per block
typedef struct {
ck_half d; // 2 bytes: super-block scale (FP16)
ck_half dmin; // 2 bytes: super-block minimum (FP16)
uint8_t scales[12]; // 12 bytes: 8 sub-block scales + 8 mins (6-bit each, packed)
uint8_t qs[128]; // 128 bytes: 256 INT4 weights (4 bits x 256)
} block_q4_K; // Total: 144 bytes for 256 weights = 4.5 bpw
Two-Level Scaling Hierarchy
Super-block (256 weights)
|-- d (FP16): overall scale for the whole block
|-- dmin (FP16): overall minimum
+-- 8 Sub-blocks (32 weights each)
|-- scales[i] (6-bit): local scale adjustment
+-- qs[...]: 4-bit quantized weights
Dequantization:
w_fp32 = q * (d * sub_scale) - dmin * sub_min
Why nested scales? Best of both worlds: fine-grained local adaptation without per-32-weight FP16 overhead.
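The two-level formula can be sketched per 32-weight sub-block. This sketch assumes the 6-bit sub-scale and sub-min are already unpacked from the 12-byte scales[] field, and that the 4-bit codes are already split out one per byte; the real kernel does both unpackings with SIMD:

```c
#include <stdint.h>

/* Scalar sketch of Q4_K dequant for one 32-weight sub-block:
   w = q * (d * sub_scale) - dmin * sub_min */
static void dequant_q4_k_subblock(
        float d, float dmin,                /* super-block scale / min */
        uint8_t sub_scale, uint8_t sub_min, /* 6-bit values, 0..63     */
        const uint8_t* q,                   /* 32 x 4-bit codes, 0..15 */
        float* out) {
    const float sc = d * sub_scale;         /* effective local scale  */
    const float mn = dmin * sub_min;        /* effective local offset */
    for (int i = 0; i < 32; i++)
        out[i] = q[i] * sc - mn;
}
```

Note the codes are unsigned (0..15) and the `- dmin * sub_min` term supplies the offset, so no nibble-minus-8 step is needed as in Q4_0.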
K-Quant Variants
| Format | Description | Quality | Bits/Weight |
|---|---|---|---|
| Q4_K_S | Small - more aggressive quantization | Good | 4.5 |
| Q4_K_M | Medium - balanced (most common) | Very Good | 4.5 |
| Q5_K_M | 5-bit with k-quant structure | Excellent | 5.5 |
| Q6_K | 6-bit for near-FP16 quality | Near-perfect | 6.5 |
How Blockwise Dequant Works
For weight-only quantization, the key performance trick is to avoid a full dequantization pass. Instead, dequantize a small block in registers and immediately consume it in the dot-product/FMA loop.
Activations: Always FP32
Weight-only quantized inference (typical CPU path):
Activations: FP32
Weights: Q4_K, Q4_0, Q8_0, etc.
Compute: Dequant (registers) -> FMA accumulate
Fused Dequant + Compute
Critical insight: do not dequantize the whole tensor, then matmul. Dequantize-and-compute per block:
for each block (256 weights):
load (d, dmin, sub-scales)
unpack 4-bit nibbles
convert to fp32 and apply scales
fused multiply-add with fp32 activations
accumulate into fp32
Q4_K Path
Pure AVX-512 (dequant + FMA fused)
Load Q4_K block
|
Unpack nibbles (AVX-512)
|
Apply scales (AVX-512)
|
FMA with activations
|
Accumulate
No AMX used - all in one fused kernel.
BF16 Path
Direct AMX-BF16 (no dequant needed)
Load BF16 weights
|
Load BF16 activations
|
AMX tile matmul
|
FP32 accumulator
AMX only used when both operands are BF16.
The two-step AVX->AMX handoff you might imagine would actually be slower due to tile load/store overhead. Better to stay in one domain. For quantized formats, pure AVX-512 with fused dequant is the fastest path.
Bump Allocator Integration
Our bump allocator manages quantized weights, activations, and scratch space. The key principle: never mix types within a region.
Tensor Structure for Quantized Types
typedef enum {
CK_DT_FP32,
CK_DT_FP16,
CK_DT_BF16,
CK_DT_Q4_0, // Simple: 32 weights, 1 scale
CK_DT_Q4_K, // K-quant: 256 weights, hierarchical scales
CK_DT_Q8_0,
} CKDataType;
typedef struct {
CKDataType dtype;
uint32_t ne[4]; // Dimensions
size_t nb[4]; // Strides in bytes
size_t block_size; // Weights per block (32 for Q4_0, 256 for Q4_K)
size_t type_size; // Bytes per block
void* data; // Points into bump allocator
} tensor;
Size Calculation
size_t tensor_size_bytes(CKDataType dt, int64_t nelements) {
// See: include/ckernel_dtype.h (ck_dtype_row_bytes)
// Quantized types use "bytes per block" x "blocks per row".
...
}
// Example: 4096 x 4096 weight matrix
// Total weights: 16M
// Q4_K: 16M / 256 x 144 = 9.4 MB (vs 64 MB for FP32!)
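Under the "bytes per block x blocks per row" rule mentioned in the comment, the size math can be sketched as follows. `quantized_size_bytes` is a hypothetical helper for this guide, not the `ck_dtype_row_bytes` API:

```c
#include <stddef.h>
#include <stdint.h>

/* Size of a quantized tensor: round the element count up to whole
   blocks, then multiply by the block's on-disk size. Block constants
   match the structs in this guide (Q4_0: 32/18, Q4_K: 256/144, ...). */
static size_t quantized_size_bytes(int64_t nelements,
                                   int64_t block_size,  /* weights per block */
                                   size_t  type_size) { /* bytes per block   */
    size_t nblocks = (size_t)((nelements + block_size - 1) / block_size);
    return nblocks * type_size;
}
```

For the 4096 x 4096 example this gives 65536 blocks x 144 bytes = 9,437,184 bytes, i.e. the ~9.4 MB quoted above.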
Region Separation
Region 1: Tensor Headers
+-- Metadata only (pointers to data)
Region 2: Quantized Weights (Q4_K)
+-- Read-only after load
+-- Can be mmap'd
Region 3: Dequant Cache (BF16)
+-- Optional, for hot layers
Region 4: Activations (FP32)
+-- Always full precision
+-- Double-buffered
Region 5: Scratch (FP32)
+-- Temporary buffers
+-- Reused per layer
Why Separate Regions?
- Cache efficiency: Homogeneous types in cache lines
- Alignment: Each region optimally aligned for its type
- Prefetching: Sequential access patterns per region
- Safety: Type confusion bugs are impossible
- Sharing: Read-only weights can be mmap'd, shared across processes
Memory Savings: 7B Parameter Model
| Format | Weight Size | Savings |
|---|---|---|
| FP32 | 28 GB | Baseline |
| BF16 | 14 GB | 2x |
| Q8_0 | ~7.5 GB | 3.7x |
| Q4_K | ~4 GB | 7x |
+ Activations: ~500 MB per batch (always FP32)
Practical Implementation
Realistic LLM Inference on Xeon
void matmul_q4(float* out, const float* x, const tensor* W,
               int batch_size, int seq_len) {
    if (batch_size == 1 && seq_len <= 32) {
        // Memory-bound: token generation
        // INT4 weights + fused dequant + AVX-512 FP32 FMA
        // (AMX setup overhead not worth it for small tiles)
        matmul_q4_avx512(out, x, W);
    } else {
        // Compute-bound: prefill or batched
        // Dequant W once into a BF16 scratch buffer, then AMX-BF16
        dequant_q4_to_bf16(W_bf16, W);    // AVX-512
        amx_matmul_bf16(out, x, W_bf16);  // AMX
    }
}
Token Generation (batch=1)
- Memory-bound: loading 4GB of weights
- Pure AVX-512 with fused dequant
- AMX setup overhead not worth it
- Latency: ~50ms per token (7B model)
Prefill (batch=64)
- Compute-bound: weight reuse across batch
- Dequant weights to BF16 once
- AMX-BF16 for the big matmul
- Throughput: ~1000 tokens/second
Don't try to mix AVX-512 and AMX per operation. Two clean paths:
1. Q4 path: Fused dequant+compute in AVX-512
2. BF16 path: Direct AMX-BF16
Get the simple paths working first, profile, then optimize.
Summary: Key Takeaways
Grouping is Essential
- Single scale loses small weights
- Per-group scales preserve precision
- 32 elements per group is the sweet spot
- K-quants use nested scales for efficiency
Memory Layout Matters
- Keep scale adjacent to weights
- Separate regions by data type
- Align to cache lines
- Dequantize in registers, not RAM
Format Quick Reference
| Format | Bytes/32 Weights | Structure | Best For |
|---|---|---|---|
| Q4_0 | 18 | scale + 32x4bit | General use, fastest |
| Q4_1 | 20 | scale + min + 32x4bit | Asymmetric distributions |
| Q8_0 | 34 | scale + 32x8bit | High quality requirements |
| Q6_K | 210/256 | K-quant with int8 sub-scales | Sensitive MLP / output weights |
| Q8_K | 292/256 | FP32 scale + 256x8bit + bsums | Mixed K-quant activation path |
| Q4_K | 144/256 | Nested superblock | Best quality/size ratio |
Further Reading
- Memory Safety - Bump allocator details
- Kernel Reference - SIMD implementations
- GEMM Optimization - Microkernel, packing, 1.44x faster than MKL
- Deep Dive: LLM Concepts - Transformer architecture