Quantization Deep Dive

Reduce model size by 4-8x while maintaining accuracy. Learn how block quantization formats work, why grouping matters, and how to integrate quantized weights with our bump allocator.

What You'll Learn
This guide covers the fundamentals of weight quantization: from the basic math behind INT4/INT8 representation to advanced grouping strategies used by modern block quantization formats.

Visual Format Reference

See byte-level structure, bit packing, and dequantization math for all supported formats: Q4_0, Q4_1, Q5_0, Q5_1, Q5_K, Q8_0, Q4_K, Q6_K, Q8_K. Activations run in BF16/FP32 with Q8_0/Q8_K quantized paths.


Why Quantize?

Large language models are memory-bound, not compute-bound. A 7B parameter model in FP32 requires 28 GB of memory just for weights. Quantization dramatically reduces this.

FP32 (32-bit)

28 GB for 7B params

Full precision, no loss

INT8 (8-bit)

7 GB for 7B params

4x reduction, ~0.1% loss

INT4 (4-bit)

3.5 GB for 7B params

8x reduction, ~0.5% loss

Memory Bandwidth is the Bottleneck

Modern CPUs can do billions of FLOPs/second, but memory bandwidth is limited:

Hardware                Compute              Memory BW   Arithmetic Intensity Needed
Intel Xeon Gold 6542Y   ~4 TFLOPS (FP32)     ~300 GB/s   ~13 FLOPS/byte
NVIDIA A100             19.5 TFLOPS (FP32)   2 TB/s      ~10 FLOPS/byte

Matrix-vector multiply (inference) does roughly 2 FLOPs per weight loaded, which is well under 1 FLOP/byte with FP32 weights. We're memory-bound, so smaller weights = faster inference.


Floating Point Primer

Before diving into quantization, let's understand what we're compressing.

FP32 Format (32 bits)

S
1 bit
Exponent
8 bits
Mantissa (Fraction)
23 bits
Value = (-1)^S x 2^(E-127) x 1.Mantissa

Example: 0.15625
  Sign: 0 (positive)
  Exponent: 01111100 (124 - 127 = -3)
  Mantissa: 01000000000000000000000 (1.25)
  = 1 x 2^(-3) x 1.25 = 0.15625

Precision: ~7 significant decimal digits (relative error ~1.2e-7)

BF16 Format (16 bits)

S
1 bit
Exponent
8 bits
Mantissa
7 bits

Same exponent range as FP32, but reduced precision (7 bits = ~2 decimal digits).

Relative error: ~7.8e-3 (~0.78%). This is why our BF16 kernel tests use 1e-2 tolerance.


Basic Quantization Math

Quantization maps continuous floating-point values to discrete integers.

Symmetric Quantization

# Quantize: FP32 -> INT
scale = max(abs(weights)) / (2^(bits-1) - 1)
quantized = round(weights / scale)

# Dequantize: INT -> FP32
dequantized = quantized * scale

Example (INT8, range [-127, 127]):
  weights = [-0.5, 0.2, 1.0, -0.3]
  scale = 1.0 / 127 = 0.00787
  quantized = [-64, 25, 127, -38]
  dequantized = [-0.504, 0.197, 1.0, -0.299]  <- small error
    

Asymmetric Quantization

# Also stores zero-point for non-symmetric distributions
scale = (max - min) / (2^bits - 1)
zero_point = round(-min / scale)
quantized = round(weights / scale) + zero_point

# Useful when weights are not centered around zero
# (e.g., after ReLU activations)
    
The Catch
With a single scale for the entire weight matrix, small values get quantized to zero. This is catastrophic for neural networks where every parameter matters.

The Problem with Single Scales

Consider a weight matrix with values ranging from -2.0 to +2.0, where some small but important values exist around 0.01:

Quantization Infographic showing FP32 to INT4 conversion

Single Scale Failure

Weight matrix: [-2.0, 0.01, 1.5, -0.005, 0.8]

With single scale (INT4, range [-8, 7]):
  scale = 2.0 / 7 = 0.286

  -2.0  -> round(-2.0 / 0.286) = -7  -> dequant: -2.0   OK
   0.01 -> round(0.01 / 0.286) =  0  -> dequant:  0.0   LOST!
   1.5  -> round(1.5 / 0.286)  =  5  -> dequant:  1.43  ~ok
  -0.005-> round(-0.005/0.286) =  0  -> dequant:  0.0   LOST!
   0.8  -> round(0.8 / 0.286)  =  3  -> dequant:  0.86  ~ok
    

Small weights (0.01, -0.005) become exactly zero. These might be critical for model behavior!


Grouping: The Solution

Instead of one scale for the entire matrix, use per-group scales. Each group gets its own scale, tailored to its local range.

Detailed diagram of quantization grouping and memory layout

Group 1: Large Values

Values: [-2.0, 1.5, 0.8, -1.2]
Local max: 2.0
Scale: 2.0 / 7 = 0.286

Each value uses full INT4 range
relative to THIS group's scale.
        

Group 2: Small Values

Values: [0.01, -0.005, 0.008, 0.003]
Local max: 0.01
Scale: 0.01 / 7 = 0.00143

Now small values use full range!
0.01 -> 7, -0.005 -> -4, etc.
        

Why Group Size Matters

Group Size   Scale Overhead           Precision   Use Case
32           1 FP16 per 32 weights    Best        Q4_0, Q8_0 (common)
64           1 FP16 per 64 weights    Good        Some custom formats
128          1 FP16 per 128 weights   Medium      Aggressive compression
256          Nested scales            Varies      Q4_K, Q5_K "k-quants"

Trade-off: Smaller groups = better precision but more scale storage overhead.


Quantization Formats

The block formats below are widely used for weight-only quantized inference and are common in GGUF-style model files. We implement them to make the runtime compatible with those on-disk layouts.

Q4_0: The Simplest 4-bit Format

Structure

typedef struct {
    ck_half d;           // 2 bytes: scale (delta)
    uint8_t qs[16];      // 16 bytes: 32 x 4-bit weights (2 per byte)
} block_q4_0;            // Total: 18 bytes per 32 weights

Bits per weight: 18 * 8 / 32 = 4.5 bits
    

Memory Layout

scale
2B
w0,w1
w2,w3
w4,w5
...
w30,w31

Dequantization

// Extract 4-bit values (signed, range -8 to 7)
int8_t q0 = (qs[i] & 0xF) - 8;   // Lower nibble
int8_t q1 = (qs[i] >> 4) - 8;    // Upper nibble

// Dequantize
float w0 = q0 * scale;
float w1 = q1 * scale;
    

Q4_1: 4-bit with Min Value

Structure

typedef struct {
    ck_half d;           // 2 bytes: scale
    ck_half m;           // 2 bytes: minimum value
    uint8_t qs[16];      // 16 bytes: 32 x 4-bit weights
} block_q4_1;            // Total: 20 bytes per 32 weights

Bits per weight: 20 * 8 / 32 = 5.0 bits
    

Why Store Min?

Asymmetric quantization: dequant = q * scale + min

Better for weight distributions not centered at zero (e.g., biased layers).

Q8_0: 8-bit Integer

Structure

typedef struct {
    ck_half d;           // 2 bytes: scale
    int8_t qs[32];       // 32 bytes: 32 x 8-bit signed weights
} block_q8_0;            // Total: 34 bytes per 32 weights

Bits per weight: 34 * 8 / 32 = 8.5 bits
    

Trade-off

Higher precision (256 levels vs 16 for Q4), but 2x larger than Q4_0.

Often used for activations or when accuracy is critical.

Q8_K: 8-bit K-Quant Activation Format

Structure (Superblock of 256 values)

typedef struct {
    float d;                  // 4 bytes: FP32 scale
    int8_t qs[256];           // 256 bytes: 256 x 8-bit signed values
    int16_t bsums[16];        // 32 bytes: sums of each 16-value chunk
} block_q8_K;                 // Total: 292 bytes per 256 values

Bits per value: 292 * 8 / 256 = 9.125 bits
    

Why This Exists

Q8_K is the activation-side companion for mixed K-quant kernels like gemv_q4_k_q8_k, gemv_q6_k_q8_k, and gemm_nt_q4_k_q8_k. In the current v7 runtime, the relevant path is often FP32 hidden -> quantize_row_q8_k -> Q4_K/Q6_K matvec.

Recent Bug Fix (2026-03-09)

We fixed a subtle parity bug in the SSE implementation of quantize_row_q8_k. The fix restored the exact signed-max selection and bsums contract expected by llama.cpp/ref. See commit 224a4d30.

This kind of bug is easy to miss because outputs can remain plausible while parity drift grows at mixed-quant boundaries. It often only shows up under stricter per-op or model-family bring-up checks.

Q4_K: K-Quant with Nested Scales

Structure (Superblock of 256 weights)

typedef struct {
    ck_half d;                  // 2 bytes: superblock scale
    ck_half dmin;               // 2 bytes: superblock min
    uint8_t scales[12];         // 12 bytes: 8 sub-block scales + 8 mins (6-bit each, packed)
    uint8_t qs[128];            // 128 bytes: 256 x 4-bit weights
} block_q4_K;                   // Total: 144 bytes per 256 weights

Bits per weight: 144 * 8 / 256 = 4.5 bits
    

Nested Scale Hierarchy

Superblock (256 weights)
|-- d (FP16): overall scale
|-- dmin (FP16): overall minimum
+-- 8 Sub-blocks (32 weights each)
    |-- scales[i] (6-bit): local scale adjustment
    +-- qs[...]: 4-bit quantized weights

Dequant: weight = q * (d * sub_scale) - dmin * sub_min
    

Best of both worlds: Fine-grained local scales without per-32-weight FP16 overhead.

Format Comparison

Format   Bits/Weight   Block Size   Quality     Speed
FP32     32            N/A          Perfect     Slow (memory-bound)
FP16     16            N/A          Excellent   2x faster
Q8_0     8.5           32           Very Good   ~3.5x faster
Q4_0     4.5           32           Good        ~6x faster
Q4_K     4.5           256          Very Good   ~6x faster

Memory Layout in Bump Allocator

Our bump allocator manages quantized weights, activations, and scratch space. Key principle: never mix types within a region.

Region Separation

// Bump allocator layout for quantized inference
typedef struct {
    // Region 1: Quantized weights (read-only after load)
    uint8_t *weights_q4;      // Q4_0 or Q4_K blocks
    size_t weights_size;

    // Region 2: Activations (FP32, read-write)
    float *activations;       // Input/output tensors
    size_t act_size;

    // Region 3: Scratch (FP32, temporary)
    float *scratch;           // Intermediate buffers
    size_t scratch_size;

    // Region 4: Dequantized cache (optional)
    float *dequant_cache;     // Hot weights kept dequantized
    size_t cache_size;
} BumpAllocator;
    

Allocation Strategy

// Allocate quantized weight buffer
size_t num_blocks = (num_weights + 31) / 32;  // Round up
size_t q4_0_size = num_blocks * sizeof(block_q4_0);  // 18 bytes/block

void* weights = bump_alloc(allocator, q4_0_size, REGION_WEIGHTS);

// Allocate activation buffer (always FP32)
size_t act_size = batch * seq_len * hidden_dim * sizeof(float);
float* activations = bump_alloc(allocator, act_size, REGION_ACTIVATIONS);
    
Why Separate Regions?
1. Cache efficiency: Homogeneous data types within cache lines
2. Alignment: Each region can have optimal alignment for its type
3. Prefetching: Sequential access patterns per region
4. Safety: Type confusion bugs are impossible

Kernel Dispatch by Type

Type checking happens at tensor level, never in the hot path.

Type-Tagged Tensors

typedef enum {
    DTYPE_FP32,
    DTYPE_FP16,
    DTYPE_BF16,
    DTYPE_Q4_0,
    DTYPE_Q4_1,
    DTYPE_Q5_0,
    DTYPE_Q5_1,
    DTYPE_Q4_K,
    DTYPE_Q6_K,
    DTYPE_Q8_0,
    DTYPE_Q8_K
} DType;

typedef struct {
    void *data;
    size_t shape[4];
    DType dtype;
    // ... other metadata
} Tensor;
    

Dispatch at Operation Level

const char *select_linear_kernel(DType weight, DType activation, bool prefill) {
    if (!prefill && weight == DTYPE_Q4_K && activation == DTYPE_Q8_K) return "gemv_q4_k_q8_k";
    if (!prefill && weight == DTYPE_Q6_K && activation == DTYPE_Q8_K) return "gemv_q6_k_q8_k";
    if (prefill && weight == DTYPE_Q4_K && activation == DTYPE_Q8_K) return "gemm_nt_q4_k_q8_k";
    if (prefill && weight == DTYPE_Q4_K && activation == DTYPE_FP32) return "gemm_nt_q4_k";
    if (!prefill && weight == DTYPE_Q8_0 && activation == DTYPE_FP32) return "gemv_q8_0";
    return "fallback_or_error";
}

void lower_mixed_quant_linear(...) {
    quantize_row_q8_k(hidden_fp32, hidden_q8k, K);
    gemv_q4_k_q8_k(out_fp32, weight_q4k, hidden_q8k, K);
}
    

Why This Bug Was Hard To Notice

The March 9 change was not a fix for a flashy crash. The runtime could still generate text, and some top-1 logits remained unchanged. But the wrong signed-max choice or incorrect bsums bookkeeping perturbs internal mixed-quant math just enough to accumulate drift later.

That is why quantization bugs like this often survive casual manual testing and only show up in strict parity tooling or in model families that stress a specific kernel path.


Cache Line Access Patterns

Understanding cache behavior is critical for quantized kernel performance.

Cache Line Basics

Cache line size: 64 bytes (typical x86)

Q4_0 block: 18 bytes -> ~3.5 blocks per cache line
Q4_K block: 144 bytes -> 2.25 cache lines per block

Key insight: Scale and weights should be in SAME cache line
when possible for minimal memory traffic.
    

Good: Q4_0 Access

Cache line 1 (64B):
|-- Block 0: scale (2B) + qs (16B)
|-- Block 1: scale (2B) + qs (16B)
|-- Block 2: scale (2B) + qs (16B)
+-- Block 3: scale (2B) + first 8B of qs

One fetch -> 3.5 blocks -> 112 weights!
        

Bad: Scattered Scales

If scales were separate from weights:

Cache line 1: scales[0..31]
Cache line 2: weights[0..127]
Cache line 3: weights[128..255]

3 fetches for same work!
        

SIMD-Aligned Processing

// Dequantize half a Q4_0 block (16 weights) into one ZMM register;
// two calls cover the full block of 32

__m512 dequant_q4_0_avx512(const block_q4_0* block) {
    // Load scale (broadcast to all lanes)
    __m512 scale = _mm512_set1_ps(CK_FP16_TO_FP32(block->d));

    // Load 16 bytes of quantized weights (32 x 4-bit)
    __m128i qs = _mm_loadu_si128((const __m128i*)block->qs);

    // Unpack the lower nibbles to 16 signed 8-bit values (helper elsewhere)
    __m128i unpacked = unpack_4bit_to_8bit(qs);

    // Widen INT8 -> INT32, convert to FP32, apply scale
    __m512 weights = _mm512_cvtepi32_ps(_mm512_cvtepi8_epi32(unpacked));
    return _mm512_mul_ps(weights, scale);
}
    

Dequantization in Registers

Critical optimization: dequantize into CPU registers, never write back to memory.

The Pattern

// WRONG: Dequantize to memory, then use
float* temp = malloc(hidden_dim * sizeof(float));
dequantize_q4_0(weights_q4, temp, hidden_dim);  // Write to RAM
matmul(out, input, temp, ...);                   // Read from RAM
free(temp);

// RIGHT: Dequantize in registers during computation
for (int block = 0; block < num_blocks; block++) {
    // Dequantize 32 weights -> 32 floats in YMM/ZMM registers
    __m512 w = dequant_q4_0_avx512(&weights_q4[block]);

    // Immediately use in FMA (still in registers!)
    acc = _mm512_fmadd_ps(input_vec, w, acc);

    // w never touches RAM after dequantization
}
    
Why This Matters
Register bandwidth: ~1 TB/s
L1 cache bandwidth: ~200 GB/s
RAM bandwidth: ~50 GB/s

By keeping dequantized values in registers, we get 20x better bandwidth than RAM.

Complete Quantized GEMV Example

// Quantized matrix-vector multiply: y = A_q4 @ x
void gemv_q4_0(float* y, const block_q4_0* A, const float* x,
               int M, int K) {
    const int blocks_per_row = K / 32;

    for (int row = 0; row < M; row++) {
        __m512 acc = _mm512_setzero_ps();

        for (int b = 0; b < blocks_per_row; b++) {
            // Load quantized block (18 bytes from RAM)
            const block_q4_0* block = &A[row * blocks_per_row + b];

            // Dequantize in registers (no memory write!)
            __m512 w_lo = dequant_q4_0_lower(block);  // weights 0-15
            __m512 w_hi = dequant_q4_0_upper(block);  // weights 16-31

            // Load corresponding input (32 floats)
            __m512 x_lo = _mm512_loadu_ps(&x[b * 32]);
            __m512 x_hi = _mm512_loadu_ps(&x[b * 32 + 16]);

            // FMA: accumulate dot product
            acc = _mm512_fmadd_ps(w_lo, x_lo, acc);
            acc = _mm512_fmadd_ps(w_hi, x_hi, acc);
        }

        // Horizontal sum and store single result
        y[row] = _mm512_reduce_add_ps(acc);
    }
}
    

AMX-INT8: The Hardware Reality

Intel's AMX (Advanced Matrix Extensions) offers significant speedups, but understanding its constraints is critical for correct implementation.

AMX Pipeline Comparison

AMX-INT8 Constraint

_tile_dpbusd:  UINT8 x INT8 -> INT32
               -----   ----
               tile A  tile B

Both tiles must be INT8. No exceptions.

There's no AMX instruction for:
  BF16 x INT8 -> ???    // Doesn't exist
  FP32 x INT8 -> ???    // Doesn't exist
    

Option 1: Full INT8 Pipeline

Quantize both weights and activations to INT8:

FP32 input
    | quantize (AVX-512)
INT8 activations
    |
AMX-INT8 (tile_dpbusd)
    |
INT32 accumulator
    | dequantize (AVX-512)
FP32 output
        

Requires:

  • Calibration data for activation ranges
  • Per-layer scales for activations
  • ~1-2% accuracy loss (often acceptable for vision, not for LLMs)

Option 2: Weight-Only INT8

Keep activations in BF16/FP32, cannot use AMX-INT8:

BF16 activations
    |
    +---> AMX-BF16 <--- BF16 weights
              ^
INT8 weights -> Dequant (AVX-512)
        

Advantages:

  • No calibration needed
  • Preserves activation precision
  • This is what most LLM deployments use
LLM Reality
Most LLM deployments use INT4/INT8 weights with BF16/FP16 activations. AMX-INT8 is rarely used because quantizing activations hurts quality too much.

The Dequantization Cost

// Dequant INT8 -> BF16 using AVX-512
// This is "wasted" work before AMX compute

__m512i w_int8 = _mm512_loadu_si512(weights);     // 64 INT8 values
__m512 scale = _mm512_set1_ps(group_scale);

// Widen the lowest 16 of the 64 INT8 values, convert to float, scale, narrow to BF16
__m512i w_lo = _mm512_cvtepi8_epi32(_mm512_extracti32x4_epi32(w_int8, 0));
__m512 w_f32 = _mm512_mul_ps(_mm512_cvtepi32_ps(w_lo), scale);
__m256bh w_bf16 = _mm512_cvtneps_pbh(w_f32);  // FP32 -> BF16 (repeat for remaining lanes)

// Now feed to AMX-BF16...
    

This dequant step eats cycles. The question is: does memory bandwidth savings outweigh dequant overhead?

The Math: When INT8 Wins

Memory bandwidth saved:
  BF16 weights: 2 bytes/weight
  INT8 weights: 1 byte/weight
  Savings: ~50% bandwidth

Dequant cost:
  ~4-8 AVX-512 instructions per 64 weights
  Maybe 10-20 cycles per 64 weights
        

Break-Even Analysis

Memory-bound (batch=1, token gen):
  -> INT8 + dequant WINS
  -> Bandwidth is bottleneck

Compute-bound (batch=64, prefill):
  -> BF16 direct WINS
  -> Dequant is pure overhead
        

Q4_K: K-Quants Deep Dive

The "K" in Q4_K refers to a family of block quantizers that use nested scales (a super-block scale plus per-sub-block adjustments). This format is popular because it achieves strong quality at ~4.5 bits/weight while keeping metadata compact.

Q4_K Block Structure

Q4_K Structure Definition

// Q4_K block: 256 weights per block

typedef struct {
    ck_half d;             // 2 bytes: super-block scale (FP16)
    ck_half dmin;          // 2 bytes: super-block minimum (FP16)
    uint8_t scales[12];    // 12 bytes: 8 sub-block scales + 8 mins (6-bit each, packed)
    uint8_t qs[128];       // 128 bytes: 256 INT4 weights (4 bits x 256)
} block_q4_K;              // Total: 144 bytes for 256 weights = 4.5 bpw
    

Two-Level Scaling Hierarchy

Super-block (256 weights)
|-- d (FP16): overall scale for the whole block
|-- dmin (FP16): overall minimum
+-- 8 Sub-blocks (32 weights each)
    |-- scales[i] (6-bit): local scale adjustment
    +-- qs[...]: 4-bit quantized weights

Dequantization:
  w_fp32 = q * (d * sub_scale) - dmin * sub_min
    

Why nested scales? Best of both worlds: fine-grained local adaptation without per-32-weight FP16 overhead.

K-Quant Variants

Format   Description                           Quality        Bits/Weight
Q4_K_S   Small - more aggressive quantization  Good           4.5
Q4_K_M   Medium - balanced (most common)       Very Good      4.5
Q5_K_M   5-bit with k-quant structure          Excellent      5.5
Q6_K     6-bit for near-FP16 quality           Near-perfect   6.5

How Blockwise Dequant Works

For weight-only quantization, the key performance trick is to avoid a full dequantization pass. Instead, dequantize a small block in registers and immediately consume it in the dot-product/FMA loop.

Activations: Always FP32

Weight-only quantized inference (typical CPU path):

Activations: FP32
Weights:     Q4_K, Q4_0, Q8_0, etc.
Compute:     Dequant (registers) -> FMA accumulate
    

Fused Dequant + Compute

Critical insight: do not dequantize the whole tensor, then matmul. Dequantize-and-compute per block:

for each block (256 weights):
  load (d, dmin, sub-scales)
  unpack 4-bit nibbles
  convert to fp32 and apply scales
  fused multiply-add with fp32 activations
  accumulate into fp32
    

Q4_K Path

Pure AVX-512 (dequant + FMA fused)

Load Q4_K block
  |
Unpack nibbles (AVX-512)
  |
Apply scales (AVX-512)
  |
FMA with activations
  |
Accumulate
        

No AMX used - all in one fused kernel.

BF16 Path

Direct AMX-BF16 (no dequant needed)

Load BF16 weights
  |
Load BF16 activations
  |
AMX tile matmul
  |
FP32 accumulator
        

AMX only used when both operands are BF16.

Why Not an AVX-512 -> AMX Pipeline?
The two-step AVX->AMX handoff you might imagine would actually be slower due to tile load/store overhead. Better to stay in one domain. For quantized formats, pure AVX-512 with fused dequant is the fastest path.

Bump Allocator Integration

Our bump allocator manages quantized weights, activations, and scratch space. The key principle: never mix types within a region.

Bump Allocator Memory Layout

Tensor Structure for Quantized Types

typedef enum {
    CK_DT_FP32,
    CK_DT_FP16,
    CK_DT_BF16,
    CK_DT_Q4_0,    // Simple: 32 weights, 1 scale
    CK_DT_Q4_K,    // K-quant: 256 weights, hierarchical scales
    CK_DT_Q8_0,
} CKDataType;

typedef struct {
    CKDataType dtype;
    uint32_t ne[4];      // Dimensions
    size_t nb[4];        // Strides in bytes
    size_t block_size;   // Weights per block (32 for Q4_0, 256 for Q4_K)
    size_t type_size;    // Bytes per block
    void* data;          // Points into bump allocator
} tensor;
    

Size Calculation

size_t tensor_size_bytes(CKDataType dt, int64_t nelements) {
    // See: include/ckernel_dtype.h (ck_dtype_row_bytes)
    // Quantized types use "bytes per block" x "blocks per row".
    ...
}

// Example: 4096 x 4096 weight matrix
// Total weights: 16M
// Q4_K: 16M / 256 x 144 = 9.4 MB (vs 64 MB for FP32!)
    

Region Separation

Region 1: Tensor Headers
  +-- Metadata only (pointers to data)

Region 2: Quantized Weights (Q4_K)
  +-- Read-only after load
  +-- Can be mmap'd

Region 3: Dequant Cache (BF16)
  +-- Optional, for hot layers

Region 4: Activations (FP32)
  +-- Always full precision
  +-- Double-buffered

Region 5: Scratch (FP32)
  +-- Temporary buffers
  +-- Reused per layer
        

Why Separate Regions?

  • Cache efficiency: Homogeneous types in cache lines
  • Alignment: Each region optimally aligned for its type
  • Prefetching: Sequential access patterns per region
  • Safety: Type confusion bugs are impossible
  • Sharing: Read-only weights can be mmap'd, shared across processes

Memory Savings: 7B Parameter Model

Format   Weight Size   Savings
FP32     28 GB         Baseline
BF16     14 GB         2x
Q8_0     ~7.5 GB       3.7x
Q4_K     ~4 GB         7x

+ Activations: ~500 MB per batch (always FP32)


Practical Implementation

Realistic LLM Inference on Xeon

void matmul_q4(float* out, const float* x, const tensor* W,
               int batch_size, int seq_len) {
    if (batch_size == 1 && seq_len <= 32) {
        // Memory-bound: token generation
        // INT4 weights + dequant + AVX-512 FP32
        // (AMX setup overhead not worth it for small tiles)
        matmul_q4_avx512(out, x, W);
    }
    else {
        // Compute-bound: prefill or batched
        // Dequant to BF16 + AMX-BF16
        dequant_q4_to_bf16(W_bf16, W_q4);  // AVX-512
        amx_matmul_bf16(out, x, W_bf16);   // AMX
    }
}
    

Token Generation (batch=1)

  • Memory-bound: loading 4GB of weights
  • Pure AVX-512 with fused dequant
  • AMX setup overhead not worth it
  • Latency: ~50ms per token (7B model)

Prefill (batch=64)

  • Compute-bound: weight reuse across batch
  • Dequant weights to BF16 once
  • AMX-BF16 for the big matmul
  • Throughput: ~1000 tokens/second
Implementation Advice
Don't try to mix AVX-512 and AMX per operation. Two clean paths:
1. Q4 path: Fused dequant+compute in AVX-512
2. BF16 path: Direct AMX-BF16
Get the simple paths working first, profile, then optimize.

Summary: Key Takeaways

Grouping is Essential

  • Single scale loses small weights
  • Per-group scales preserve precision
  • 32 elements per group is the sweet spot
  • K-quants use nested scales for efficiency

Memory Layout Matters

  • Keep scale adjacent to weights
  • Separate regions by data type
  • Align to cache lines
  • Dequantize in registers, not RAM

Format Quick Reference

Format   Bytes/Block       Structure                       Best For
Q4_0     18 per 32 wts     scale + 32x4bit                 General use, fastest
Q4_1     20 per 32 wts     scale + min + 32x4bit           Asymmetric distributions
Q8_0     34 per 32 wts     scale + 32x8bit                 High quality requirements
Q6_K     210 per 256 wts   K-quant with int8 sub-scales    Sensitive MLP / output weights
Q8_K     292 per 256 wts   FP32 scale + 256x8bit + bsums   Mixed K-quant activation path
Q4_K     144 per 256 wts   Nested superblock               Best quality/size ratio

Further Reading
