SIMD & Quantization Architecture | C-Kernel-Engine

SIMD Instructions & Quantization for LLM Inference

1. SIMD Instruction Hierarchy

Intel x86 SIMD instructions evolved to provide increasingly wider vector operations and specialized instructions for neural network workloads.

CPU Generation Timeline

  • AVX (Sandy Bridge, 2011): 256-bit vectors, 8 x FP32 per instruction. Key intrinsics: _mm256_add_ps, _mm256_mul_ps.
  • AVX2 + FMA (Haswell, 2013): 256-bit vectors plus integer ops; FMA fuses a*b+c into one instruction. Key intrinsics: _mm256_fmadd_ps, _mm256_maddubs_epi16 (INT8 multiply-add).
  • AVX-512 (Skylake-X, 2017): 512-bit vectors, 16 x FP32 per instruction, 32 ZMM registers, masked operations. Key intrinsics: _mm512_fmadd_ps, _mm512_reduce_add_ps.
  • AVX-512 VNNI (Cascade Lake, 2019, 2nd Gen Xeon): Vector Neural Network Instructions, 4x INT8 ops/cycle. _mm512_dpbusd_epi32 does 64 INT8 multiplies + 16 INT32 accumulates per instruction. Best for Q4_K decode! No BF16.
  • AMX (Sapphire Rapids, 2023, 4th Gen Xeon): Advanced Matrix Extensions with 16x16 tile units and 8 tile registers (TMM0-7). _tile_dpbusd (INT8) and _tile_dpbf16ps (BF16!). Best for batch > 1.

INT8 dot-product throughput (ops/cycle): AVX 32 (via INT16), AVX2 64 (maddubs), AVX-512 128 (2x width), VNNI 256 (dpbusd, 4x faster!), AMX 512+ (tile ops, for batched work).

Key Insight: VNNI for Decode, AMX for Prefill

Single-token decode (M=1): Use _mm512_dpbusd_epi32 (VNNI). AMX tile setup overhead makes it slower for single rows.

Prefill (M=prompt_length): Use AMX tiles for 16x16 matrix blocks. Amortizes tile load/store overhead.
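This decode/prefill split reduces to a one-line dispatch. A sketch where the kernel enum and the M >= 16 cutoff are illustrative assumptions (the engine's real API and threshold may differ):

```c
#include <assert.h>

/* Illustrative kernel dispatch: VNNI GEMV for decode (small M),
 * AMX tiles once the row count amortizes tile load/store overhead.
 * KERNEL_* names and the threshold are assumptions, not measured values. */
typedef enum { KERNEL_VNNI_GEMV, KERNEL_AMX_TILE } kernel_t;

static kernel_t pick_matmul_kernel(int M) {
    /* Single-token decode: AMX tile setup overhead dominates a 1-row GEMV. */
    if (M < 16) return KERNEL_VNNI_GEMV;
    /* Prefill: 16x16 blocks map directly onto AMX TMM tile registers. */
    return KERNEL_AMX_TILE;
}
```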

Xeon Scalable Generations: Format Support Matrix

| Generation | Codename | Year | AVX-512 | VNNI (INT8) | BF16 | AMX | Best For |
|---|---|---|---|---|---|---|---|
| 1st Gen | Skylake-SP | 2017 | Yes | No | No | No | FP32 only |
| 2nd Gen | Cascade Lake | 2019 | Yes | Yes | No | No | Q4_K + VNNI decode |
| 3rd Gen | Ice Lake-SP / Cooper Lake | 2020-2021 | Yes | Yes | AVX-512 BF16 (Cooper Lake only) | No | BF16 training, Q4_K inference |
| 4th Gen | Sapphire Rapids | 2023 | Yes | Yes | AVX-512 + AMX | INT8 + BF16 | Full: Q4_K decode, AMX batch, BF16 train |
| 5th Gen | Emerald Rapids | 2024 | Yes | Yes | AVX-512 + AMX | INT8 + BF16 | Same as 4th Gen (improved cache) |

Note: 2nd Gen Xeon has VNNI but no BF16, so you can still get excellent Q4_K inference performance. BF16 (needed for training workloads) requires 3rd Gen or later.
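Because support varies by generation, the engine should probe CPUID at startup rather than trust compile-time flags. A sketch using GCC/Clang's <cpuid.h>, with feature bits from CPUID leaf 7, subleaf 0:

```c
#include <cpuid.h>   /* GCC/Clang CPUID helpers */

/* Feature bits in CPUID leaf 7, subleaf 0:
 *   EBX bit 16: AVX512F      ECX bit 11: AVX512_VNNI
 *   EDX bit 22: AMX-BF16     EDX bit 24: AMX-TILE   EDX bit 25: AMX-INT8 */
static int cpu_has_avx512_vnni(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return 0;
    return (ecx >> 11) & 1;
}

static int cpu_has_amx_int8(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return 0;
    return (edx >> 25) & 1;
}
```

Note that using AMX on Linux additionally requires requesting the tile state via arch_prctl before the first tile instruction.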

Weight Formats: Training vs Inference

Training

Master weights: FP32 (for optimizer state: Adam moments, gradients)

Forward/backward: BF16 preferred (same dynamic range as FP32, 2x memory savings)

Why not FP16: FP16 has a narrow dynamic range (max 65504, smallest normal ~6.1e-5), so gradients overflow or underflow

Why not INT8: Gradients need full precision, quantization noise accumulates
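BF16 keeps FP32's dynamic range because it is literally the top 16 bits of an FP32: same 8-bit exponent, shorter mantissa. A portable conversion sketch using round-to-nearest-even (the rounding the AVX-512 BF16 convert instructions use; NaN handling omitted for brevity):

```c
#include <stdint.h>
#include <string.h>

/* BF16 = high 16 bits of FP32, rounded to nearest even on the dropped bits.
 * Dynamic range matches FP32; only mantissa precision is lost. */
static uint16_t fp32_to_bf16(float f) {
    uint32_t u; memcpy(&u, &f, sizeof u);
    uint32_t rounding = 0x7FFF + ((u >> 16) & 1); /* round to nearest even */
    return (uint16_t)((u + rounding) >> 16);
}

static float bf16_to_fp32(uint16_t h) {
    uint32_t u = (uint32_t)h << 16;  /* widen: zero-fill the low mantissa bits */
    float f; memcpy(&f, &u, sizeof f);
    return f;
}
```

A value like 1e38 survives a BF16 round-trip with <0.4% relative error, whereas it would overflow FP16 outright.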

Inference

Weights: Q4_K (4.5 bits) - 7x smaller than FP32, minimal quality loss

Activations: Quantize to Q8 on-the-fly (fast, ~1KB per layer)

KV cache: FP16 or FP32 (quality-sensitive, stays in cache)

Compute kernels: VNNI INT8 for GEMV, FP32 for softmax/RMSNorm

Why Quantization Helps (Memory Bandwidth is the Bottleneck)

For single-token decode, the model is memory-bandwidth bound, not compute-bound:

  • Each token requires loading the entire model weights from DRAM
  • Qwen2-0.5B FP32: ~2GB weights × 1 token = 2GB DRAM read per token
  • Qwen2-0.5B Q4_K: ~350MB weights × 1 token = 350MB DRAM read per token
  • DDR5 bandwidth: ~200 GB/s sustained → Q4_K gives ~570 tok/s theoretical vs ~100 tok/s FP32
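The bullet-point arithmetic is just bytes over bandwidth; a one-liner makes the model explicit (the ~350MB and ~200 GB/s figures are the estimates from above):

```c
/* Theoretical decode throughput when memory-bandwidth bound:
 * every token must stream all weight bytes from DRAM once. */
static double theoretical_tok_per_s(double weight_bytes, double dram_gbps) {
    double bytes_per_s = dram_gbps * 1e9;
    return bytes_per_s / weight_bytes;  /* tokens/s upper bound */
}
```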

Bottom line: Quantization is NOT optional for competitive inference speed. Q4_K with VNNI is the sweet spot for quality vs performance on Xeon.

2. Quantization Formats: Q4_K and Q8_K

llama.cpp uses block-wise quantization where weights and activations are quantized in groups with per-block scale factors.

Q4_K block format (256 weights = QK_K, 144 bytes total):

  • Block header (16 bytes): d (FP16 super-block scale), dmin (FP16 min scale), scales[12] (8 sub-block scales and mins, 6-bit packed)
  • Quantized weights (128 bytes): qs[128], 256 x 4-bit weights packed two nibbles per byte
  • 8 sub-blocks of 32 weights each; sub-block i has a 6-bit scale[i] and a 6-bit min[i]
  • Dequant: w = d * scale[i] * q - dmin * min[i]

Q8_K block format (256 values, 292 bytes total):

  • Block header: d (FP32 scale), bsums[16] (INT16 sums of each 16-value group, used for the Q4_K min term)
  • Quantized values (256 bytes): qs[256], 256 x INT8
  • Dequant: x = d * qs[i]

Memory footprint per 256 weights:

  • FP32 (32-bit): 1024 bytes (100%)
  • BF16 (16-bit): 512 bytes (50%)
  • Q8_K (8-bit): ~292 bytes (~28%)
  • Q4_K (~4.5-bit): ~144 bytes (~14%) - 7x less memory bandwidth!

VNNI-friendly data layout: _mm512_dpbusd_epi32 expects unsigned INT8 (activations) times signed INT8 (weights) in groups of 4 bytes: [a0,a1,a2,a3] dot [w0,w1,w2,w3] -> INT32. Activations come from Q8_K; Q4_K weights must be unpacked to signed INT8 (or kept as packed nibbles with special shift/mask handling) before the accumulator computes sum += a*w in INT32.
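The block layouts above can be written down as C structs mirroring llama.cpp's block_q4_K and block_q8_K; the sizeof checks confirm the 144-byte and 292-byte footprints. The fiddly 6-bit scale unpacking is elided here: dequant_q4_k_one assumes the sub-block scale and min have already been decoded to FP32.

```c
#include <stdint.h>

#define QK_K 256

/* Mirrors llama.cpp's block_q4_K: 144 bytes per 256 weights = 4.5 bits/weight. */
typedef struct {
    uint16_t d;           /* FP16 super-block scale for sub-block scales */
    uint16_t dmin;        /* FP16 super-block scale for sub-block mins   */
    uint8_t  scales[12];  /* 8 x (6-bit scale + 6-bit min), packed       */
    uint8_t  qs[QK_K/2];  /* 256 x 4-bit weights, two nibbles per byte   */
} block_q4_K;

/* Mirrors llama.cpp's block_q8_K: 292 bytes per 256 values. */
typedef struct {
    float   d;              /* FP32 scale: x = d * qs[i]                   */
    int8_t  qs[QK_K];       /* 256 x INT8 values                           */
    int16_t bsums[QK_K/16]; /* sums of 16-value groups, for Q4_K min term  */
} block_q8_K;

/* Scalar dequant of one Q4_K weight, given already-decoded FP32 scales:
 * w = d*sc - dmin*m term-by-term, with q the raw 4-bit value. */
static float dequant_q4_k_one(float d, float dmin, float sc, float m, uint8_t q) {
    return d * sc * (float)q - dmin * m;
}
```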

3. Transformer Operations & SIMD Mapping

Different operations in the transformer benefit from different SIMD approaches.

Transformer layer, input x [1, D]:

  • RMSNorm (AVX-512 FP32, compute-bound, stays in FP32): x_norm = x * rsqrt(mean(x^2) + eps) * gamma
  • Q/K/V projections, 3x GEMV (VNNI Q4_K; AMX if M > 1): Q = x @ W_q ([1,D] @ [D,H*hd]), K = x @ W_k ([1,D] @ [D,Hkv*hd]), V = x @ W_v ([1,D] @ [D,Hkv*hd]). Memory-bound: Q4_K weights reduce bandwidth 7x!
  • RoPE, rotary position embedding (AVX-512 FP32): q_rot = q * cos + rotate(q) * sin
  • Attention, softmax(Q @ K^T / sqrt(d)) @ V: Q @ K^T scores in FP32 (or FP16 cache), softmax in AVX-512 FP32, attn @ V in FP32/FP16, then the output projection via VNNI Q4_K
  • Residual add + RMSNorm (AVX-512 FP32)
  • MLP, SwiGLU (silu(gate) * up, then down): gate = x @ W_gate and up = x @ W_up ([1,D] @ [D,4D]) via VNNI Q4_K; SiLU + multiply in AVX-512 FP32; down = swiglu @ W_down ([1,4D] @ [4D,D]) via VNNI Q4_K. The MLP is ~67% of compute, so Q4_K VNNI here is critical!

Legend: VNNI Q4_K = memory-bound GEMV; AMX = batch > 1 only; AVX-512 FP32 = compute-bound ops; RMSNorm is small and stays FP32.

Operation Breakdown by Time (Decode, M=1)

| Operation | % of Time | Bottleneck | Best SIMD | Best Format |
|---|---|---|---|---|
| Q/K/V Projections | ~25% | Memory BW | VNNI | Q4_K weights, Q8 activations |
| Output Projection | ~8% | Memory BW | VNNI | Q4_K weights |
| MLP (gate+up+down) | ~60% | Memory BW | VNNI | Q4_K weights |
| Attention (Q@K, softmax, @V) | ~5% | Compute | AVX-512 FP32 | FP32/FP16 KV cache |
| RMSNorm (x2) | ~1% | Compute | AVX-512 FP32 | FP32 |
| RoPE, Residuals | ~1% | Compute | AVX-512 FP32 | FP32 |

4. Cache Hierarchy & Data Flow

Understanding cache behavior is crucial for keeping activations "hot" and avoiding DRAM round-trips.

5th Gen Xeon memory hierarchy:

  • Registers (32 x 512-bit ZMM = 2KB, plus 8 x 1KB AMX tiles): ~0-cycle access. Keep: current VNNI accumulator, loop indices, scales.
  • L1 data cache (48KB per core): 4-5 cycles, ~2 TB/s. Keep: current activation row (896 floats = 3.5KB) and RMSNorm gamma (3.5KB). Activations fit!
  • L2 cache (2MB per core): 12-14 cycles, ~1 TB/s. Keep: Q8 activations (~1KB), KV cache for recent tokens (token_pos * 2 * Hkv * hd * 2 bytes), frequently used weight blocks.
  • L3 cache (shared, 105MB on 5th Gen Xeon): 40-50 cycles, ~500 GB/s. Partial weight tiles may stay hot across tokens, but full weights don't fit: Qwen2-0.5B is ~350MB in Q4_K and ~2GB in FP32.
  • DDR5 DRAM (8 channels @ 4800 MT/s): 100+ cycles, ~300 GB/s peak (~200 GB/s sustained). Model weights live here: this is the bottleneck, and weight loading dominates decode time.

Key: keep activations hot in L1/L2 and accept weight streaming from DRAM. Q4_K reduces DRAM traffic 7x vs FP32!
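The KV-cache formula in the L2 entry above is easy to sanity-check with a small helper. Qwen2-0.5B's GQA shape (2 KV heads, head_dim 64) is used only as an example value:

```c
#include <stddef.h>

/* Per-layer KV-cache footprint at sequence position `pos`:
 * pos * 2 (K and V) * n_kv_heads * head_dim * bytes_per_elem. */
static size_t kv_cache_bytes(size_t pos, size_t n_kv_heads,
                             size_t head_dim, size_t elem_size) {
    return pos * 2 * n_kv_heads * head_dim * elem_size;
}
```

At 1024 tokens in FP16 (elem_size 2), one layer's K+V is 512KB, so it sits comfortably inside the 2MB per-core L2.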

5. Activation Storage & Hot Data Path

How to keep activations hot in cache while streaming weights from DRAM.

Decode token data flow within one layer (keeping activations hot):

  1. RMSNorm: read x (L1), write x_norm (L1)
  2. Quantize to Q8: x_norm -> Q8 (~1KB, stays in L1)
  3. Q = x @ W_q: stream W_q from DRAM; x_q8 stays in L1; write Q to L2
  4. K = x @ W_k and V = x @ W_v: stream W_k, W_v; x_q8 is still in L1
  5. Attention: Q, K, V from L2, KV cache from L2/L3, output to L1
  6. Output projection: stream W_o
  7. MLP: stream W_gate, W_up, W_down; fused, so no intermediate DRAM round-trip

Key insight, the activation reuse pattern:

  1. Quantize x to Q8 ONCE after RMSNorm (stays in L1: ~1KB)
  2. Reuse the Q8 activations for the Q, K, V projections (3x reuse before eviction)
  3. Fused MLP: quantize once, use for gate+up+down (eliminates the intermediate DRAM write)

Weights stream through; activations stay hot! Qwen2-0.5B decode streams ~350MB of Q4_K weights per token (~15MB per layer x 24 layers, plus embeddings); at ~200 GB/s sustained that is ~1.75ms/token, or ~570 tok/s theoretical max. In FP32 it streams ~2GB per token, ~10ms, or ~100 tok/s.

6. Implementation Strategy

Phase 1: VNNI Q4_K Decode (Highest Impact)

  1. Implement quantize_row_q8_k() for activations
  2. Implement vec_dot_q4_k_q8_k_vnni() using _mm512_dpbusd_epi32
  3. Wire into decode path: after RMSNorm, quantize to Q8, use VNNI matvec
  4. Keep FP32 for: RMSNorm, RoPE, Attention scores, Softmax

Expected speedup: 5-10x for decode
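A scalar model pins down the integer core that vec_dot_q4_k_q8_k_vnni must reproduce: _mm512_dpbusd_epi32 multiplies unsigned INT8 activations by signed INT8 weights in groups of 4 and accumulates into INT32. The real kernel runs 16 such lanes per instruction and folds in the Q4_K scales and mins, which are omitted here:

```c
#include <stdint.h>

/* One INT32 lane of _mm512_dpbusd_epi32:
 * acc += a[0]*w[0] + a[1]*w[1] + a[2]*w[2] + a[3]*w[3],
 * with a unsigned INT8 (Q8 activations) and w signed INT8 (unpacked Q4_K). */
static int32_t dpbusd_ref(int32_t acc, const uint8_t a[4], const int8_t w[4]) {
    for (int k = 0; k < 4; k++)
        acc += (int32_t)a[k] * (int32_t)w[k];
    return acc;
}
```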

Phase 2: AMX for Prefill

  1. Implement weight repacking to VNNI tile format at load time
  2. Implement AMX tile kernel for Q4_K x Q8_K
  3. Use for M > 16 (prefill batch size)
  4. Keep VNNI for M = 1 (single token decode)

Expected speedup: 2-3x for prefill
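What one AMX tile multiply computes can also be written as a scalar reference: for fully populated tiles, _tile_dpbusd is logically C[16][16] (INT32) += A[16][64] (unsigned INT8) x B[64][16] (signed INT8). In the real tile register B is stored VNNI-interleaved; that repacking detail is ignored here:

```c
#include <stdint.h>
#include <string.h>

/* Architectural AMX tile shape: 16 rows x 64 bytes per input tile. */
#define TILE_M 16
#define TILE_N 16
#define TILE_K 64

/* Scalar reference for one _tile_dpbusd invocation on full tiles. */
static void tile_dpbusd_ref(int32_t C[TILE_M][TILE_N],
                            uint8_t A[TILE_M][TILE_K],
                            int8_t  B[TILE_K][TILE_N]) {
    for (int m = 0; m < TILE_M; m++)
        for (int n = 0; n < TILE_N; n++)
            for (int k = 0; k < TILE_K; k++)
                C[m][n] += (int32_t)A[m][k] * (int32_t)B[k][n];
}
```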

Phase 3: SubNUMA & Hugepages

  1. Detect SubNUMA clusters (SNC mode on Xeon)
  2. Allocate model weights per-SNC with mbind()
  3. Use 2MB hugepages for weight buffers
  4. Pin threads to SNC, access local weights

Competitive advantage over llama.cpp!
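The hugepage step of this phase can be sketched in a few lines. This is a Linux-only sketch with a graceful fallback; the mbind() NUMA step needs libnuma and is omitted, and alloc_weights is an illustrative name, not the engine's API:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Try to back a weight buffer with pre-reserved 2MB hugepages; if none are
 * configured, fall back to normal pages and hint transparent hugepages. */
static void *alloc_weights(size_t len) {
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED) return p;
    /* No hugetlb pool available: plain mapping plus a THP hint. */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return NULL;
    madvise(p, len, MADV_HUGEPAGE);
    return p;
}
```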

Phase 4: BF16 for Training

  1. Keep FP32 master weights for optimizer state
  2. Use BF16 for forward/backward activations
  3. Implement a BF16 GEMM kernel using _mm512_dpbf16_ps
  4. Optional: AMX-BF16 tiles for batch training

For training workloads

Summary: What Each SIMD Is Best For

| SIMD | Best Use Case | Data Types | Operations |
|---|---|---|---|
| AVX-512 FP32 | RMSNorm, Softmax, RoPE, Residuals | FP32 | _mm512_fmadd_ps, _mm512_exp_ps (SVML) |
| AVX-512 VNNI | Q4_K decode (M=1 GEMV) | INT8 x INT8 -> INT32 | _mm512_dpbusd_epi32 (4x faster!) |
| AMX INT8 | Prefill (M>1), batch inference | INT8 tiles | _tile_dpbusd (16x16 tiles) |
| AVX-512 BF16 | Training, high-precision inference | BF16 | _mm512_dpbf16_ps |
| AMX BF16 | Training batches, dense inference | BF16 tiles | _tile_dpbf16ps |