Quantization Deep Dive
Reduce model size by 4-8x while maintaining accuracy. Learn how block quantization formats work, why grouping matters, and how to integrate quantized weights with our bump allocator.
This guide covers the fundamentals of weight quantization: from the basic math behind INT4/INT8 representation to advanced grouping strategies used by modern block quantization formats.
Visual Format Reference
See byte-level structure, bit packing, and dequantization math for all supported formats: Q4_0, Q4_1, Q5_0, Q5_1, Q5_K, Q8_0, Q4_K, Q6_K, Q8_K. Activations run in BF16/FP32 with Q8_0/Q8_K quantized paths.
Table of Contents
- Why Quantize?
- Floating Point Primer
- Basic Quantization Math
- The Problem with Single Scales
- Grouping: The Solution
- Quantization Formats
- Memory Layout in Bump Allocator
- Kernel Dispatch by Type
- Cache Line Access Patterns
- Dequantization in Registers
- AMX-INT8: The Hardware Reality
- Q4_K: K-Quants Deep Dive
- How Blockwise Dequant Works
- Bump Allocator Integration
- Practical Implementation
Why Quantize?
Large language models are memory-bound, not compute-bound. A 7B parameter model in FP32 requires 28 GB of memory just for weights. Quantization dramatically reduces this.
FP32 (32-bit)
28 GB for 7B params
Full precision, no loss
INT8 (8-bit)
7 GB for 7B params
4x reduction, ~0.1% loss
INT4 (4-bit)
3.5 GB for 7B params
8x reduction, ~0.5% loss
Memory Bandwidth is the Bottleneck
Modern CPUs can do billions of FLOPs/second, but memory bandwidth is limited:
| Hardware | Compute | Memory BW | Arithmetic Intensity Needed |
|---|---|---|---|
| Intel Xeon Gold 6542Y | ~4 TFLOPS (FP32) | ~300 GB/s | 13 FLOPS/byte |
| NVIDIA A100 | 19.5 TFLOPS (FP32) | 2 TB/s | 10 FLOPS/byte |
Matrix-vector multiply (token generation) performs ~2 FLOPs per weight, i.e. roughly 0.5 FLOPs per byte of FP32 weights -- far below the 10-13 FLOPs/byte the hardware needs to stay busy. We're memory-bound, so smaller weights = faster inference.
Floating Point Primer
Before diving into quantization, let's understand what we're compressing.
FP32 Format (32 bits)
Sign: 1 bit | Exponent: 8 bits | Mantissa: 23 bits

Value = (-1)^S x 2^(E-127) x 1.Mantissa

Example: 0.15625
  Sign:     0 (positive)
  Exponent: 01111100 (124 - 127 = -3)
  Mantissa: 01000000000000000000000 (1.25)
  = 1 x 2^(-3) x 1.25 = 0.15625
Precision: ~7 significant decimal digits (relative error ~1.2e-7)
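The decomposition above can be reproduced in a few lines of C. This is a sketch for illustration; `fp32_decode` and `fp32_fields` are names made up for this guide, not part of the runtime:

```c
#include <stdint.h>
#include <string.h>

/* Decode an FP32 value into sign, biased exponent, and mantissa bits,
   mirroring the 1/8/23 split described above. */
typedef struct { uint32_t sign, exponent, mantissa; } fp32_fields;

static fp32_fields fp32_decode(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);        /* safe type-pun */
    fp32_fields out;
    out.sign     = bits >> 31;             /* 1 bit   */
    out.exponent = (bits >> 23) & 0xFF;    /* 8 bits  */
    out.mantissa = bits & 0x7FFFFF;        /* 23 bits */
    return out;
}
```

For 0.15625 this recovers exponent field 124 and mantissa bits 0x200000, matching the worked example.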
BF16 Format (16 bits)
Sign: 1 bit | Exponent: 8 bits | Mantissa: 7 bits
Same exponent range as FP32, but reduced precision (7 bits = ~2 decimal digits).
Relative error: ~7.8e-3 (~0.78%). This is why our BF16 kernel tests use 1e-2 tolerance.
Basic Quantization Math
Quantization maps continuous floating-point values to discrete integers.
Symmetric Quantization
# Quantize: FP32 -> INT
scale = max(abs(weights)) / (2^(bits-1) - 1)
quantized = round(weights / scale)
# Dequantize: INT -> FP32
dequantized = quantized * scale
Example (INT8, range [-127, 127]):
weights = [-0.5, 0.2, 1.0, -0.3]
scale = 1.0 / 127 = 0.00787
quantized = [-64, 25, 127, -38]
dequantized = [-0.504, 0.197, 1.0, -0.299] <- small error
Asymmetric Quantization
# Also stores zero-point for non-symmetric distributions
scale = (max - min) / (2^bits - 1)
zero_point = round(-min / scale)
quantized = round(weights / scale) + zero_point
# Useful when weights are not centered around zero
# (e.g., after ReLU activations)
With a single scale for the entire weight matrix, small values get quantized to zero. This is catastrophic for neural networks where every parameter matters.
The Problem with Single Scales
Consider a weight matrix with values ranging from -2.0 to +2.0, where some small but important values exist around 0.01:
Single Scale Failure
Weight matrix: [-2.0, 0.01, 1.5, -0.005, 0.8]
With single scale (INT4, range [-8, 7]):
scale = 2.0 / 7 = 0.286
-2.0 -> round(-2.0 / 0.286) = -7 -> dequant: -2.0 OK
0.01 -> round(0.01 / 0.286) = 0 -> dequant: 0.0 LOST!
1.5 -> round(1.5 / 0.286) = 5 -> dequant: 1.43 ~ok
-0.005-> round(-0.005/0.286) = 0 -> dequant: 0.0 LOST!
0.8 -> round(0.8 / 0.286) = 3 -> dequant: 0.86 ~ok
Small weights (0.01, -0.005) become exactly zero. These might be critical for model behavior!
Grouping: The Solution
Instead of one scale for the entire matrix, use per-group scales. Each group gets its own scale, tailored to its local range.
Group 1: Large Values
Values: [-2.0, 1.5, 0.8, -1.2]
Local max: 2.0
Scale: 2.0 / 7 = 0.286
Each value uses full INT4 range
relative to THIS group's scale.
Group 2: Small Values
Values: [0.01, -0.005, 0.008, 0.003]
Local max: 0.01
Scale: 0.01 / 7 = 0.00143
Now small values use full range!
0.01 -> 7, -0.005 -> -4, etc.
Why Group Size Matters
| Group Size | Scale Overhead | Precision | Use Case |
|---|---|---|---|
| 32 | 1 FP16 per 32 weights | Best | Q4_0, Q8_0 (common) |
| 64 | 1 FP16 per 64 weights | Good | Some custom formats |
| 128 | 1 FP16 per 128 weights | Medium | Aggressive compression |
| 256 | Nested scales | Varies | Q4_K, Q5_K "k-quants" |
Trade-off: Smaller groups = better precision but more scale storage overhead.
Quantization Formats
The block formats below are widely used for weight-only quantized inference and are common in GGUF-style model files. We implement them to make the runtime compatible with those on-disk layouts.
Q4_0: The Simplest 4-bit Format
Structure
typedef struct {
ck_half d; // 2 bytes: scale (delta)
uint8_t qs[16]; // 16 bytes: 32 x 4-bit weights (2 per byte)
} block_q4_0; // Total: 18 bytes per 32 weights
Bits per weight: 18 * 8 / 32 = 4.5 bits
Memory Layout
[ d: 2B ][ qs: 16B ] = 18 bytes per block of 32 weights
Dequantization
// Extract 4-bit values (signed, range -8 to 7)
int8_t q0 = (qs[i] & 0xF) - 8; // Lower nibble
int8_t q1 = (qs[i] >> 4) - 8; // Upper nibble
// Dequantize
float w0 = q0 * scale;
float w1 = q1 * scale;
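The nibble extraction above can be checked with a small scalar reference. This is a sketch, not the runtime's kernel: it stores the scale as plain FP32 instead of `ck_half`, and assumes the llama.cpp-style layout where weight i comes from the low nibble of byte i and weight i+16 from the high nibble:

```c
#include <stdint.h>

/* Scalar reference dequant for one Q4_0-style block (FP32 scale for
   clarity; the on-disk format stores an FP16 scale). */
typedef struct {
    float   d;       /* scale */
    uint8_t qs[16];  /* 32 x 4-bit weights, two per byte */
} block_q4_0_f32;

static void dequant_block_q4_0(const block_q4_0_f32* b, float* out) {
    for (int i = 0; i < 16; i++) {
        int8_t q0 = (int8_t)(b->qs[i] & 0x0F) - 8;  /* lower nibble */
        int8_t q1 = (int8_t)(b->qs[i] >> 4)   - 8;  /* upper nibble */
        out[i]      = q0 * b->d;   /* low nibbles fill weights 0..15  */
        out[i + 16] = q1 * b->d;   /* high nibbles fill weights 16..31 */
    }
}
```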
Q4_1: 4-bit with Min Value
Structure
typedef struct {
ck_half d; // 2 bytes: scale
ck_half m; // 2 bytes: minimum value
uint8_t qs[16]; // 16 bytes: 32 x 4-bit weights
} block_q4_1; // Total: 20 bytes per 32 weights
Bits per weight: 20 * 8 / 32 = 5.0 bits
Why Store Min?
Asymmetric quantization: dequant = q * scale + min
Better for weight distributions not centered at zero (e.g., biased layers).
Q8_0: 8-bit Integer
Structure
typedef struct {
ck_half d; // 2 bytes: scale
int8_t qs[32]; // 32 bytes: 32 x 8-bit signed weights
} block_q8_0; // Total: 34 bytes per 32 weights
Bits per weight: 34 * 8 / 32 = 8.5 bits
Trade-off
Higher precision (256 levels vs 16 for Q4), but 2x larger than Q4_0.
Often used for activations or when accuracy is critical.
Q8_K: 8-bit K-Quant Activation Format
Structure (Superblock of 256 values)
typedef struct {
float d; // 4 bytes: FP32 scale
int8_t qs[256]; // 256 bytes: 256 x 8-bit signed values
int16_t bsums[16]; // 32 bytes: sums of each 16-value chunk
} block_q8_K; // Total: 292 bytes per 256 values
Bits per value: 292 * 8 / 256 = 9.125 bits
Why This Exists
Q8_K is the activation-side companion for mixed K-quant kernels like gemv_q4_k_q8_k, gemv_q6_k_q8_k, and gemm_nt_q4_k_q8_k. In the current v7 runtime, the relevant path is often FP32 hidden -> quantize_row_q8_k -> Q4_K/Q6_K matvec.
Recent Bug Fix (2026-03-09)
We fixed a subtle parity bug in the SSE implementation of quantize_row_q8_k. The fix restored the exact signed-max selection and bsums contract expected by llama.cpp/ref. See commit 224a4d30.
This kind of bug is easy to miss because outputs can remain plausible while parity drift grows at mixed-quant boundaries. It often only shows up under stricter per-op or model-family bring-up checks.
Q4_K: K-Quant with Nested Scales
Structure (Superblock of 256 weights)
typedef struct {
ck_half d; // 2 bytes: superblock scale
ck_half dmin; // 2 bytes: superblock min
uint8_t scales[12]; // 12 bytes: 8 sub-block scales + 8 mins (6-bit each, packed)
uint8_t qs[128]; // 128 bytes: 256 x 4-bit weights
} block_q4_K; // Total: 144 bytes per 256 weights
Bits per weight: 144 * 8 / 256 = 4.5 bits
Nested Scale Hierarchy
Superblock (256 weights)
|-- d (FP16): overall scale
|-- dmin (FP16): overall minimum
+-- 8 Sub-blocks (32 weights each)
|-- scales[i] (6-bit): local scale adjustment
+-- qs[...]: 4-bit quantized weights
Dequant: weight = q * (d * sub_scale) - dmin * sub_min
Best of both worlds: Fine-grained local scales without per-32-weight FP16 overhead.
Format Comparison
| Format | Bits/Weight | Block Size | Quality | Speed |
|---|---|---|---|---|
| FP32 | 32 | N/A | Perfect | Slow (memory-bound) |
| FP16 | 16 | N/A | Excellent | 2x faster |
| Q8_0 | 8.5 | 32 | Very Good | ~3.5x faster |
| Q4_0 | 4.5 | 32 | Good | ~6x faster |
| Q4_K | 4.5 | 256 | Very Good | ~6x faster |
Memory Layout in Bump Allocator
Our bump allocator manages quantized weights, activations, and scratch space. Key principle: never mix types within a region.
Region Separation
// Bump allocator layout for quantized inference
typedef struct {
// Region 1: Quantized weights (read-only after load)
uint8_t *weights_q4; // Q4_0 or Q4_K blocks
size_t weights_size;
// Region 2: Activations (FP32, read-write)
float *activations; // Input/output tensors
size_t act_size;
// Region 3: Scratch (FP32, temporary)
float *scratch; // Intermediate buffers
size_t scratch_size;
// Region 4: Dequantized cache (optional)
float *dequant_cache; // Hot weights kept dequantized
size_t cache_size;
} BumpAllocator;
Allocation Strategy
// Allocate quantized weight buffer
size_t num_blocks = (num_weights + 31) / 32; // Round up
size_t q4_0_size = num_blocks * sizeof(block_q4_0); // 18 bytes/block
void* weights = bump_alloc(allocator, q4_0_size, REGION_WEIGHTS);
// Allocate activation buffer (always FP32)
size_t act_size = batch * seq_len * hidden_dim * sizeof(float);
float* activations = bump_alloc(allocator, act_size, REGION_ACTIVATIONS);
1. Cache efficiency: Homogeneous data types within cache lines
2. Alignment: Each region can have optimal alignment for its type
3. Prefetching: Sequential access patterns per region
4. Safety: Type confusion bugs are impossible
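The `bump_alloc` calls above assume a region-aware bump allocator. Here is a minimal sketch of the core bump-pointer step for a single region, with hypothetical names (the real allocator adds region tags and type checks):

```c
#include <stddef.h>
#include <stdint.h>

/* One region of a bump allocator: a base pointer, a high-water mark,
   and a capacity. Allocation is pointer arithmetic; there is no free(). */
typedef struct {
    uint8_t* base;
    size_t   used;
    size_t   capacity;
} bump_region;

static void* bump_alloc_region(bump_region* r, size_t size, size_t align) {
    size_t offset = (r->used + align - 1) & ~(align - 1);  /* align up */
    if (offset + size > r->capacity) return NULL;          /* exhausted */
    r->used = offset + size;
    return r->base + offset;
}
```

Keeping one such region per data type is what makes the alignment and prefetching properties in the list above fall out naturally.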
Kernel Dispatch by Type
Type checking happens at tensor level, never in the hot path.
Type-Tagged Tensors
typedef enum {
DTYPE_FP32,
DTYPE_FP16,
DTYPE_BF16,
DTYPE_Q4_0,
DTYPE_Q4_1,
DTYPE_Q5_0,
DTYPE_Q5_1,
DTYPE_Q4_K,
DTYPE_Q6_K,
DTYPE_Q8_0,
DTYPE_Q8_K
} DType;
typedef struct {
void *data;
size_t shape[4];
DType dtype;
// ... other metadata
} Tensor;
Dispatch at Operation Level
const char *select_linear_kernel(DType weight, DType activation, bool prefill) {
if (!prefill && weight == DTYPE_Q4_K && activation == DTYPE_Q8_K) return "gemv_q4_k_q8_k";
if (!prefill && weight == DTYPE_Q6_K && activation == DTYPE_Q8_K) return "gemv_q6_k_q8_k";
if (prefill && weight == DTYPE_Q4_K && activation == DTYPE_Q8_K) return "gemm_nt_q4_k_q8_k";
if (prefill && weight == DTYPE_Q4_K && activation == DTYPE_FP32) return "gemm_nt_q4_k";
if (!prefill && weight == DTYPE_Q8_0 && activation == DTYPE_FP32) return "gemv_q8_0";
return "fallback_or_error";
}
void lower_mixed_quant_linear(...) {
quantize_row_q8_k(hidden_fp32, hidden_q8k, K);
gemv_q4_k_q8_k(out_fp32, weight_q4k, hidden_q8k, K);
}
Why This Bug Was Hard To Notice
The March 9 fix was not a flashy crash bug. The runtime could still generate text, and some top-1 logits remained unchanged. But the wrong signed-max choice or incorrect bsums bookkeeping perturbs internal mixed-quant math just enough to accumulate drift later.
That is why quantization bugs like this often survive casual manual testing and only show up in strict parity tooling or in model families that stress a specific kernel path.
Cache Line Access Patterns
Understanding cache behavior is critical for quantized kernel performance.
Cache Line Basics
Cache line size: 64 bytes (typical x86)
Q4_0 block: 18 bytes -> ~3.5 blocks per cache line
Q4_K block: 144 bytes -> 2.25 cache lines per block
Key insight: Scale and weights should be in SAME cache line
when possible for minimal memory traffic.
Good: Q4_0 Access
Cache line 1 (64B):
|-- Block 0: scale (2B) + qs (16B)
|-- Block 1: scale (2B) + qs (16B)
|-- Block 2: scale (2B) + qs (16B)
+-- Block 3: scale (2B) + first 8B of qs (rest in next cache line)
One fetch -> 3.5 blocks -> 112 weights!
Bad: Scattered Scales
If scales were separate from weights:
Cache line 1: scales[0..31]
Cache line 2: weights[0..127]
Cache line 3: weights[128..255]
3 fetches for same work!
SIMD-Aligned Processing
// Process one Q4_0 block's lower half: 16 weights -> 16 FP32 lanes
// (the upper nibbles are handled by an analogous _upper routine)
__m512 dequant_q4_0_avx512(const block_q4_0* block) {
    // Load scale (broadcast to all lanes)
    __m512 scale = _mm512_set1_ps(CK_FP16_TO_FP32(block->d));
    // Load 16 bytes = 32 packed 4-bit weights
    __m128i qs = _mm_loadu_si128((const __m128i*)block->qs);
    // Lower nibbles -> signed INT8 in [-8, 7]
    __m128i lo = _mm_sub_epi8(_mm_and_si128(qs, _mm_set1_epi8(0x0F)),
                              _mm_set1_epi8(8));
    // Widen INT8 -> INT32, convert to FP32, apply the scale
    __m512 weights = _mm512_cvtepi32_ps(_mm512_cvtepi8_epi32(lo));
    return _mm512_mul_ps(weights, scale);
}
Dequantization in Registers
Critical optimization: dequantize into CPU registers, never write back to memory.
The Pattern
// WRONG: Dequantize to memory, then use
float* temp = malloc(hidden_dim * sizeof(float));
dequantize_q4_0(weights_q4, temp, hidden_dim); // Write to RAM
matmul(out, input, temp, ...); // Read from RAM
free(temp);
// RIGHT: Dequantize in registers during computation
for (int block = 0; block < num_blocks; block++) {
// Dequantize 16 weights -> 16 floats in a ZMM register
// (the block's upper half is handled the same way)
__m512 w = dequant_q4_0_avx512(&weights_q4[block]);
// Immediately use in FMA (still in registers!)
acc = _mm512_fmadd_ps(input_vec, w, acc);
// w never touches RAM after dequantization
}
Register bandwidth: ~1 TB/s
L1 cache bandwidth: ~200 GB/s
RAM bandwidth: ~50 GB/s
By keeping dequantized values in registers, we get 20x better bandwidth than RAM.
Complete Quantized GEMV Example
// Quantized matrix-vector multiply: y = A_q4 @ x
void gemv_q4_0(float* y, const block_q4_0* A, const float* x,
int M, int K) {
const int blocks_per_row = K / 32;
for (int row = 0; row < M; row++) {
__m512 acc = _mm512_setzero_ps();
for (int b = 0; b < blocks_per_row; b++) {
// Load quantized block (18 bytes from RAM)
const block_q4_0* block = &A[row * blocks_per_row + b];
// Dequantize in registers (no memory write!)
__m512 w_lo = dequant_q4_0_lower(block); // weights 0-15
__m512 w_hi = dequant_q4_0_upper(block); // weights 16-31
// Load corresponding input (32 floats)
__m512 x_lo = _mm512_loadu_ps(&x[b * 32]);
__m512 x_hi = _mm512_loadu_ps(&x[b * 32 + 16]);
// FMA: accumulate dot product
acc = _mm512_fmadd_ps(w_lo, x_lo, acc);
acc = _mm512_fmadd_ps(w_hi, x_hi, acc);
}
// Horizontal sum and store single result
y[row] = _mm512_reduce_add_ps(acc);
}
}
AMX-INT8: The Hardware Reality
Intel's AMX (Advanced Matrix Extensions) offers significant speedups, but understanding its constraints is critical for correct implementation.
AMX-INT8 Constraint
_tile_dpbusd: UINT8 x INT8 -> INT32
----- ----
tile A tile B
Both tiles must be INT8. No exceptions.
There's no AMX instruction for:
BF16 x INT8 -> ??? // Doesn't exist
FP32 x INT8 -> ??? // Doesn't exist
Option 1: Full INT8 Pipeline
Quantize both weights and activations to INT8:
FP32 input
| quantize (AVX-512)
INT8 activations
|
AMX-INT8 (tile_dpbusd)
|
INT32 accumulator
| dequantize (AVX-512)
FP32 output
Requires:
- Calibration data for activation ranges
- Per-layer scales for activations
- ~1-2% accuracy loss (often acceptable for vision, not for LLMs)
Option 2: Weight-Only INT8
Keep activations in BF16/FP32, cannot use AMX-INT8:
INT8 weights --(dequant, AVX-512)--> BF16 weights
                                         |
BF16 activations ----------------------> AMX-BF16 --> FP32 output
Advantages:
- No calibration needed
- Preserves activation precision
- This is what most LLM deployments use
Most LLM deployments use INT4/INT8 weights with BF16/FP16 activations. AMX-INT8 is rarely used because quantizing activations hurts quality too much.
The Dequantization Cost
// Dequant INT8 -> BF16 using AVX-512
// This is "wasted" work before AMX compute
__m512i w_int8 = _mm512_loadu_si512(weights); // 64 INT8 values
__m512 scale = _mm512_set1_ps(group_scale);
// Unpack to 32-bit, convert to float, scale, convert to BF16
__m512i w_lo = _mm512_cvtepi8_epi32(_mm512_extracti32x4_epi32(w_int8, 0));
__m512 w_f32 = _mm512_mul_ps(_mm512_cvtepi32_ps(w_lo), scale);
__m256bh w_bf16 = _mm512_cvtneps_pbh(w_f32); // FP32 -> BF16 (AVX512-BF16)
// Now feed to AMX-BF16...
This dequant step eats cycles. The question is: does memory bandwidth savings outweigh dequant overhead?
The Math: When INT8 Wins
Memory bandwidth saved:
BF16 weights: 2 bytes/weight
INT8 weights: 1 byte/weight
Savings: ~50% bandwidth
Dequant cost:
~4-8 AVX-512 instructions per 64 weights
Maybe 10-20 cycles per 64 weights
Break-Even Analysis
Memory-bound (batch=1, token gen):
-> INT8 + dequant WINS
-> Bandwidth is bottleneck
Compute-bound (batch=64, prefill):
-> BF16 direct WINS
-> Dequant is pure overhead
Q4_K: K-Quants Deep Dive
The "K" in Q4_K refers to a family of block quantizers that use nested scales (a super-block scale plus per-sub-block adjustments). This format is popular because it achieves strong quality at ~4.5 bits/weight while keeping metadata compact.
Q4_K Structure Definition
// Q4_K block: 256 weights per block
typedef struct {
ck_half d; // 2 bytes: super-block scale (FP16)
ck_half dmin; // 2 bytes: super-block minimum (FP16)
uint8_t scales[12]; // 12 bytes: 8 sub-block scales + 8 mins (6-bit each, packed)
uint8_t qs[128]; // 128 bytes: 256 INT4 weights (4 bits x 256)
} block_q4_K; // Total: 144 bytes for 256 weights = 4.5 bpw
Two-Level Scaling Hierarchy
Super-block (256 weights)
|-- d (FP16): overall scale for the whole block
|-- dmin (FP16): overall minimum
+-- 8 Sub-blocks (32 weights each)
|-- scales[i] (6-bit): local scale adjustment
+-- qs[...]: 4-bit quantized weights
Dequantization:
w_fp32 = q * (d * sub_scale) - dmin * sub_min
Why nested scales? Best of both worlds: fine-grained local adaptation without per-32-weight FP16 overhead.
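The two-level formula can be sketched per 32-weight sub-block. This sketch assumes the 6-bit sub-scale and sub-min are already unpacked from the 12-byte scales[] field, and that the 4-bit codes are already split out one per byte; the real kernel does both unpackings with SIMD:

```c
#include <stdint.h>

/* Scalar sketch of Q4_K dequant for one 32-weight sub-block:
   w = q * (d * sub_scale) - dmin * sub_min */
static void dequant_q4_k_subblock(
        float d, float dmin,                /* super-block scale / min */
        uint8_t sub_scale, uint8_t sub_min, /* 6-bit values, 0..63     */
        const uint8_t* q,                   /* 32 x 4-bit codes, 0..15 */
        float* out) {
    const float sc = d * sub_scale;         /* effective local scale  */
    const float mn = dmin * sub_min;        /* effective local offset */
    for (int i = 0; i < 32; i++)
        out[i] = q[i] * sc - mn;
}
```

Note the codes are unsigned (0..15) and the `- dmin * sub_min` term supplies the offset, so no nibble-minus-8 step is needed as in Q4_0.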
K-Quant Variants
| Format | Description | Quality | Bits/Weight |
|---|---|---|---|
| Q4_K_S | Small - more aggressive quantization | Good | 4.5 |
| Q4_K_M | Medium - balanced (most common) | Very Good | 4.5 |
| Q5_K_M | 5-bit with k-quant structure | Excellent | 5.5 |
| Q6_K | 6-bit for near-FP16 quality | Near-perfect | 6.5 |
How Blockwise Dequant Works
For weight-only quantization, the key performance trick is to avoid a full dequantization pass. Instead, dequantize a small block in registers and immediately consume it in the dot-product/FMA loop.
Activations: Always FP32
Weight-only quantized inference (typical CPU path):
Activations: FP32
Weights: Q4_K, Q4_0, Q8_0, etc.
Compute: Dequant (registers) -> FMA accumulate
Fused Dequant + Compute
Critical insight: do not dequantize the whole tensor, then matmul. Dequantize-and-compute per block:
for each block (256 weights):
load (d, dmin, sub-scales)
unpack 4-bit nibbles
convert to fp32 and apply scales
fused multiply-add with fp32 activations
accumulate into fp32
Q4_K Path
Pure AVX-512 (dequant + FMA fused)
Load Q4_K block
|
Unpack nibbles (AVX-512)
|
Apply scales (AVX-512)
|
FMA with activations
|
Accumulate
No AMX used - all in one fused kernel.
BF16 Path
Direct AMX-BF16 (no dequant needed)
Load BF16 weights
|
Load BF16 activations
|
AMX tile matmul
|
FP32 accumulator
AMX only used when both operands are BF16.
The two-step AVX->AMX handoff you might imagine would actually be slower due to tile load/store overhead. Better to stay in one domain. For quantized formats, pure AVX-512 with fused dequant is the fastest path.
Bump Allocator Integration
Our bump allocator manages quantized weights, activations, and scratch space. The key principle: never mix types within a region.
Tensor Structure for Quantized Types
typedef enum {
CK_DT_FP32,
CK_DT_FP16,
CK_DT_BF16,
CK_DT_Q4_0, // Simple: 32 weights, 1 scale
CK_DT_Q4_K, // K-quant: 256 weights, hierarchical scales
CK_DT_Q8_0,
} CKDataType;
typedef struct {
CKDataType dtype;
uint32_t ne[4]; // Dimensions
size_t nb[4]; // Strides in bytes
size_t block_size; // Weights per block (32 for Q4_0, 256 for Q4_K)
size_t type_size; // Bytes per block
void* data; // Points into bump allocator
} tensor;
Size Calculation
size_t tensor_size_bytes(CKDataType dt, int64_t nelements) {
// See: include/ckernel_dtype.h (ck_dtype_row_bytes)
// Quantized types use "bytes per block" x "blocks per row".
...
}
// Example: 4096 x 4096 weight matrix
// Total weights: 16M
// Q4_K: 16M / 256 x 144 = 9.4 MB (vs 64 MB for FP32!)
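Under the "bytes per block x blocks per row" rule mentioned in the comment, the size math can be sketched as follows. `quantized_size_bytes` is a hypothetical helper for this guide, not the `ck_dtype_row_bytes` API:

```c
#include <stddef.h>
#include <stdint.h>

/* Size of a quantized tensor: round the element count up to whole
   blocks, then multiply by the block's on-disk size. Block constants
   match the structs in this guide (Q4_0: 32/18, Q4_K: 256/144, ...). */
static size_t quantized_size_bytes(int64_t nelements,
                                   int64_t block_size,  /* weights per block */
                                   size_t  type_size) { /* bytes per block   */
    size_t nblocks = (size_t)((nelements + block_size - 1) / block_size);
    return nblocks * type_size;
}
```

For the 4096 x 4096 example this gives 65536 blocks x 144 bytes = 9,437,184 bytes, i.e. the ~9.4 MB quoted above.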
Region Separation
Region 1: Tensor Headers
+-- Metadata only (pointers to data)
Region 2: Quantized Weights (Q4_K)
+-- Read-only after load
+-- Can be mmap'd
Region 3: Dequant Cache (BF16)
+-- Optional, for hot layers
Region 4: Activations (FP32)
+-- Always full precision
+-- Double-buffered
Region 5: Scratch (FP32)
+-- Temporary buffers
+-- Reused per layer
Why Separate Regions?
- Cache efficiency: Homogeneous types in cache lines
- Alignment: Each region optimally aligned for its type
- Prefetching: Sequential access patterns per region
- Safety: Type confusion bugs are impossible
- Sharing: Read-only weights can be mmap'd, shared across processes
Memory Savings: 7B Parameter Model
| Format | Weight Size | Savings |
|---|---|---|
| FP32 | 28 GB | Baseline |
| BF16 | 14 GB | 2x |
| Q8_0 | ~7.5 GB | 3.7x |
| Q4_K | ~4 GB | 7x |
+ Activations: ~500 MB per batch (always FP32)
Practical Implementation
Realistic LLM Inference on Xeon
void matmul_q4(float* out, const float* x, const tensor* W,
               int batch_size, int seq_len) {
    if (batch_size == 1 && seq_len <= 32) {
        // Memory-bound: token generation
        // INT4 weights + fused dequant + AVX-512 FP32 FMA
        // (AMX setup overhead not worth it for small tiles)
        matmul_q4_avx512(out, x, W);
    } else {
        // Compute-bound: prefill or batched
        // Dequant W once into a BF16 scratch buffer, then AMX-BF16
        dequant_q4_to_bf16(W_bf16, W);    // AVX-512
        amx_matmul_bf16(out, x, W_bf16);  // AMX
    }
}
Token Generation (batch=1)
- Memory-bound: loading 4GB of weights
- Pure AVX-512 with fused dequant
- AMX setup overhead not worth it
- Latency: ~50ms per token (7B model)
Prefill (batch=64)
- Compute-bound: weight reuse across batch
- Dequant weights to BF16 once
- AMX-BF16 for the big matmul
- Throughput: ~1000 tokens/second
Don't try to mix AVX-512 and AMX per operation. Two clean paths:
1. Q4 path: Fused dequant+compute in AVX-512
2. BF16 path: Direct AMX-BF16
Get the simple paths working first, profile, then optimize.
Summary: Key Takeaways
Grouping is Essential
- Single scale loses small weights
- Per-group scales preserve precision
- 32 elements per group is the sweet spot
- K-quants use nested scales for efficiency
Memory Layout Matters
- Keep scale adjacent to weights
- Separate regions by data type
- Align to cache lines
- Dequantize in registers, not RAM
Format Quick Reference
| Format | Bytes/32 Weights | Structure | Best For |
|---|---|---|---|
| Q4_0 | 18 | scale + 32x4bit | General use, fastest |
| Q4_1 | 20 | scale + min + 32x4bit | Asymmetric distributions |
| Q8_0 | 34 | scale + 32x8bit | High quality requirements |
| Q6_K | 210/256 | K-quant with int8 sub-scales | Sensitive MLP / output weights |
| Q8_K | 292/256 | FP32 scale + 256x8bit + bsums | Mixed K-quant activation path |
| Q4_K | 144/256 | Nested superblock | Best quality/size ratio |
Further Reading
- Memory Safety - Bump allocator details
- Kernel Reference - SIMD implementations
- GEMM Optimization - Microkernel, packing, 1.44x faster than MKL
- Deep Dive: LLM Concepts - Transformer architecture