SIMD & Quantization Architecture | C-Kernel-Engine

SIMD Instructions & Quantization for LLM Inference

1. SIMD Instruction Hierarchy

Intel x86 SIMD instructions evolved to provide increasingly wider vector operations and specialized instructions for neural network workloads.

CPU Generation Timeline

  • AVX (Sandy Bridge, 2011): 256-bit vectors, 8 x FP32 per instruction. Key intrinsics: _mm256_add_ps, _mm256_mul_ps.
  • AVX2 + FMA (Haswell, 2013): 256-bit vectors plus integer ops; FMA fuses a*b+c into one instruction. Key intrinsics: _mm256_fmadd_ps, _mm256_maddubs_epi16 (INT8 multiply-add).
  • AVX-512 (Skylake-X, 2017): 512-bit vectors, 16 x FP32 per instruction, 32 ZMM registers, masked operations. Key intrinsics: _mm512_fmadd_ps, _mm512_reduce_add_ps.
  • AVX-512 VNNI (Cascade Lake, 2019, 2nd Gen Xeon): Vector Neural Network Instructions, 4x INT8 ops/cycle. _mm512_dpbusd_epi32 does 64 INT8 multiplies + 16 INT32 accumulates per instruction. Best for Q4_K decode! No BF16.
  • AMX (Sapphire Rapids, 2023, 4th Gen Xeon): Advanced Matrix Extensions with 16x16 tile units and 8 tile registers (TMM0-7). _tile_dpbusd (INT8) and _tile_dpbf16ps (BF16!). Best for batch > 1.

INT8 dot-product throughput (ops/cycle): AVX 32 (via INT16), AVX2 64 (maddubs), AVX-512 128 (2x width), VNNI 256 (dpbusd, 4x faster!), AMX 512+ (tile ops, for batched work).

Key Insight: VNNI for Decode, AMX for Prefill

Single-token decode (M=1): Use _mm512_dpbusd_epi32 (VNNI). AMX tile setup overhead makes it slower for single rows.

Prefill (M=prompt_length): Use AMX tiles for 16x16 matrix blocks. Amortizes tile load/store overhead.
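This decode/prefill split reduces to a one-line dispatch. A sketch where the kernel enum and the M >= 16 cutoff are illustrative assumptions (the engine's real API and threshold may differ):

```c
#include <assert.h>

/* Illustrative kernel dispatch: VNNI GEMV for decode (small M),
 * AMX tiles once the row count amortizes tile load/store overhead.
 * KERNEL_* names and the threshold are assumptions, not measured values. */
typedef enum { KERNEL_VNNI_GEMV, KERNEL_AMX_TILE } kernel_t;

static kernel_t pick_matmul_kernel(int M) {
    /* Single-token decode: AMX tile setup overhead dominates a 1-row GEMV. */
    if (M < 16) return KERNEL_VNNI_GEMV;
    /* Prefill: 16x16 blocks map directly onto AMX TMM tile registers. */
    return KERNEL_AMX_TILE;
}
```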

Xeon Scalable Generations: Format Support Matrix

| Generation | Codename | Year | AVX-512 | VNNI (INT8) | BF16 | AMX | Best For |
|---|---|---|---|---|---|---|---|
| 1st Gen | Skylake-SP | 2017 | Yes | No | No | No | FP32 only |
| 2nd Gen | Cascade Lake | 2019 | Yes | Yes | No | No | Q4_K + VNNI decode |
| 3rd Gen | Ice Lake-SP / Cooper Lake | 2020-2021 | Yes | Yes | AVX-512 BF16 (Cooper Lake only) | No | BF16 training, Q4_K inference |
| 4th Gen | Sapphire Rapids | 2023 | Yes | Yes | AVX-512 + AMX | INT8 + BF16 | Full: Q4_K decode, AMX batch, BF16 train |
| 5th Gen | Emerald Rapids | 2024 | Yes | Yes | AVX-512 + AMX | INT8 + BF16 | Same as 4th Gen (improved cache) |

Note: 2nd Gen Xeon has VNNI but no BF16, so you can still get excellent Q4_K inference performance. BF16 (needed for training workloads) requires 3rd Gen or later.
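Because support varies by generation, the engine should probe CPUID at startup rather than trust compile-time flags. A sketch using GCC/Clang's <cpuid.h>, with feature bits from CPUID leaf 7, subleaf 0:

```c
#include <cpuid.h>   /* GCC/Clang CPUID helpers */

/* Feature bits in CPUID leaf 7, subleaf 0:
 *   EBX bit 16: AVX512F      ECX bit 11: AVX512_VNNI
 *   EDX bit 22: AMX-BF16     EDX bit 24: AMX-TILE   EDX bit 25: AMX-INT8 */
static int cpu_has_avx512_vnni(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return 0;
    return (ecx >> 11) & 1;
}

static int cpu_has_amx_int8(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return 0;
    return (edx >> 25) & 1;
}
```

Note that using AMX on Linux additionally requires requesting the tile state via arch_prctl before the first tile instruction.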

Weight Formats: Training vs Inference

Training

Master weights: FP32 (for optimizer state: Adam moments, gradients)

Forward/backward: BF16 preferred (same dynamic range as FP32, 2x memory savings)

Why not FP16: FP16 has a narrow dynamic range (max 65504, smallest normal ~6.1e-5), so gradients overflow or underflow

Why not INT8: Gradients need full precision, quantization noise accumulates
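BF16 keeps FP32's dynamic range because it is literally the top 16 bits of an FP32: same 8-bit exponent, shorter mantissa. A portable conversion sketch using round-to-nearest-even (the rounding the AVX-512 BF16 convert instructions use; NaN handling omitted for brevity):

```c
#include <stdint.h>
#include <string.h>

/* BF16 = high 16 bits of FP32, rounded to nearest even on the dropped bits.
 * Dynamic range matches FP32; only mantissa precision is lost. */
static uint16_t fp32_to_bf16(float f) {
    uint32_t u; memcpy(&u, &f, sizeof u);
    uint32_t rounding = 0x7FFF + ((u >> 16) & 1); /* round to nearest even */
    return (uint16_t)((u + rounding) >> 16);
}

static float bf16_to_fp32(uint16_t h) {
    uint32_t u = (uint32_t)h << 16;  /* widen: zero-fill the low mantissa bits */
    float f; memcpy(&f, &u, sizeof f);
    return f;
}
```

A value like 1e38 survives a BF16 round-trip with <0.4% relative error, whereas it would overflow FP16 outright.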

Inference

Weights: Q4_K (4.5 bits) - 7x smaller than FP32, minimal quality loss

Activations: Quantize to Q8 on-the-fly (fast, ~1KB per layer)

KV cache: FP16 or FP32 (quality-sensitive, stays in cache)

Compute kernels: VNNI INT8 for GEMV, FP32 for softmax/RMSNorm

Why Quantization Helps (Memory Bandwidth is the Bottleneck)

For single-token decode, the model is memory-bandwidth bound, not compute-bound:

  • Each token requires loading the entire model weights from DRAM
  • Qwen2-0.5B FP32: ~2GB weights × 1 token = 2GB DRAM read per token
  • Qwen2-0.5B Q4_K: ~350MB weights × 1 token = 350MB DRAM read per token
  • DDR5 bandwidth: ~200 GB/s sustained → Q4_K gives ~570 tok/s theoretical vs ~100 tok/s FP32
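The bullet-point arithmetic is just bytes over bandwidth; a one-liner makes the model explicit (the ~350MB and ~200 GB/s figures are the estimates from above):

```c
/* Theoretical decode throughput when memory-bandwidth bound:
 * every token must stream all weight bytes from DRAM once. */
static double theoretical_tok_per_s(double weight_bytes, double dram_gbps) {
    double bytes_per_s = dram_gbps * 1e9;
    return bytes_per_s / weight_bytes;  /* tokens/s upper bound */
}
```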

Bottom line: Quantization is NOT optional for competitive inference speed. Q4_K with VNNI is the sweet spot for quality vs performance on Xeon.

2. Quantization Formats: Q4_K and Q8_K

llama.cpp uses block-wise quantization where weights and activations are quantized in groups with per-block scale factors.

Q4_K block format (256 weights = QK_K, 144 bytes total):

  • Block header (16 bytes): d (FP16 super-block scale), dmin (FP16 min scale), scales[12] (8 sub-block scales and mins, 6-bit packed)
  • Quantized weights (128 bytes): qs[128], 256 x 4-bit weights packed two nibbles per byte
  • 8 sub-blocks of 32 weights each; sub-block i has a 6-bit scale[i] and a 6-bit min[i]
  • Dequant: w = d * scale[i] * q - dmin * min[i]

Q8_K block format (256 values, 292 bytes total):

  • Block header: d (FP32 scale), bsums[16] (INT16 sums of each 16-value group, used for the Q4_K min term)
  • Quantized values (256 bytes): qs[256], 256 x INT8
  • Dequant: x = d * qs[i]

Memory footprint per 256 weights:

  • FP32 (32-bit): 1024 bytes (100%)
  • BF16 (16-bit): 512 bytes (50%)
  • Q8_K (8-bit): ~292 bytes (~28%)
  • Q4_K (~4.5-bit): ~144 bytes (~14%) - 7x less memory bandwidth!

VNNI-friendly data layout: _mm512_dpbusd_epi32 expects unsigned INT8 (activations) times signed INT8 (weights) in groups of 4 bytes: [a0,a1,a2,a3] dot [w0,w1,w2,w3] -> INT32. Activations come from Q8_K; Q4_K weights must be unpacked to signed INT8 (or kept as packed nibbles with special shift/mask handling) before the accumulator computes sum += a*w in INT32.
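The block layouts above can be written down as C structs mirroring llama.cpp's block_q4_K and block_q8_K; the sizeof checks confirm the 144-byte and 292-byte footprints. The fiddly 6-bit scale unpacking is elided here: dequant_q4_k_one assumes the sub-block scale and min have already been decoded to FP32.

```c
#include <stdint.h>

#define QK_K 256

/* Mirrors llama.cpp's block_q4_K: 144 bytes per 256 weights = 4.5 bits/weight. */
typedef struct {
    uint16_t d;           /* FP16 super-block scale for sub-block scales */
    uint16_t dmin;        /* FP16 super-block scale for sub-block mins   */
    uint8_t  scales[12];  /* 8 x (6-bit scale + 6-bit min), packed       */
    uint8_t  qs[QK_K/2];  /* 256 x 4-bit weights, two nibbles per byte   */
} block_q4_K;

/* Mirrors llama.cpp's block_q8_K: 292 bytes per 256 values. */
typedef struct {
    float   d;              /* FP32 scale: x = d * qs[i]                   */
    int8_t  qs[QK_K];       /* 256 x INT8 values                           */
    int16_t bsums[QK_K/16]; /* sums of 16-value groups, for Q4_K min term  */
} block_q8_K;

/* Scalar dequant of one Q4_K weight, given already-decoded FP32 scales:
 * w = d*sc - dmin*m term-by-term, with q the raw 4-bit value. */
static float dequant_q4_k_one(float d, float dmin, float sc, float m, uint8_t q) {
    return d * sc * (float)q - dmin * m;
}
```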

3. Transformer Operations & SIMD Mapping

Different operations in the transformer benefit from different SIMD approaches.

Transformer layer, input x [1, D]:

  • RMSNorm (AVX-512 FP32, compute-bound, stays in FP32): x_norm = x * rsqrt(mean(x^2) + eps) * gamma
  • Q/K/V projections, 3x GEMV (VNNI Q4_K; AMX if M > 1): Q = x @ W_q ([1,D] @ [D,H*hd]), K = x @ W_k ([1,D] @ [D,Hkv*hd]), V = x @ W_v ([1,D] @ [D,Hkv*hd]). Memory-bound: Q4_K weights reduce bandwidth 7x!
  • RoPE, rotary position embedding (AVX-512 FP32): q_rot = q * cos + rotate(q) * sin
  • Attention, softmax(Q @ K^T / sqrt(d)) @ V: Q @ K^T scores in FP32 (or FP16 cache), softmax in AVX-512 FP32, attn @ V in FP32/FP16, then the output projection via VNNI Q4_K
  • Residual add + RMSNorm (AVX-512 FP32)
  • MLP, SwiGLU (silu(gate) * up, then down): gate = x @ W_gate and up = x @ W_up ([1,D] @ [D,4D]) via VNNI Q4_K; SiLU + multiply in AVX-512 FP32; down = swiglu @ W_down ([1,4D] @ [4D,D]) via VNNI Q4_K. The MLP is ~67% of compute, so Q4_K VNNI here is critical!

Legend: VNNI Q4_K = memory-bound GEMV; AMX = batch > 1 only; AVX-512 FP32 = compute-bound ops; RMSNorm is small and stays FP32.

Operation Breakdown by Time (Decode, M=1)

| Operation | % of Time | Bottleneck | Best SIMD | Best Format |
|---|---|---|---|---|
| Q/K/V Projections | ~25% | Memory BW | VNNI | Q4_K weights, Q8 activations |
| Output Projection | ~8% | Memory BW | VNNI | Q4_K weights |
| MLP (gate+up+down) | ~60% | Memory BW | VNNI | Q4_K weights |
| Attention (Q@K, softmax, @V) | ~5% | Compute | AVX-512 FP32 | FP32/FP16 KV cache |
| RMSNorm (x2) | ~1% | Compute | AVX-512 FP32 | FP32 |
| RoPE, Residuals | ~1% | Compute | AVX-512 FP32 | FP32 |

4. Cache Hierarchy & Data Flow

Understanding cache behavior is crucial for keeping activations "hot" and avoiding DRAM round-trips.

5th Gen Xeon memory hierarchy:

  • Registers (32 x 512-bit ZMM = 2KB, plus 8 x 1KB AMX tiles): ~0-cycle access. Keep: current VNNI accumulator, loop indices, scales.
  • L1 data cache (48KB per core): 4-5 cycles, ~2 TB/s. Keep: current activation row (896 floats = 3.5KB) and RMSNorm gamma (3.5KB). Activations fit!
  • L2 cache (2MB per core): 12-14 cycles, ~1 TB/s. Keep: Q8 activations (~1KB), KV cache for recent tokens (token_pos * 2 * Hkv * hd * 2 bytes), frequently used weight blocks.
  • L3 cache (shared, 105MB on 5th Gen Xeon): 40-50 cycles, ~500 GB/s. Partial weight tiles may stay hot across tokens, but full weights don't fit: Qwen2-0.5B is ~350MB in Q4_K and ~2GB in FP32.
  • DDR5 DRAM (8 channels @ 4800 MT/s): 100+ cycles, ~300 GB/s peak (~200 GB/s sustained). Model weights live here: this is the bottleneck, and weight loading dominates decode time.

Key: keep activations hot in L1/L2 and accept weight streaming from DRAM. Q4_K reduces DRAM traffic 7x vs FP32!
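The KV-cache formula in the L2 entry above is easy to sanity-check with a small helper. Qwen2-0.5B's GQA shape (2 KV heads, head_dim 64) is used only as an example value:

```c
#include <stddef.h>

/* Per-layer KV-cache footprint at sequence position `pos`:
 * pos * 2 (K and V) * n_kv_heads * head_dim * bytes_per_elem. */
static size_t kv_cache_bytes(size_t pos, size_t n_kv_heads,
                             size_t head_dim, size_t elem_size) {
    return pos * 2 * n_kv_heads * head_dim * elem_size;
}
```

At 1024 tokens in FP16 (elem_size 2), one layer's K+V is 512KB, so it sits comfortably inside the 2MB per-core L2.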

5. Activation Storage & Hot Data Path

How to keep activations hot in cache while streaming weights from DRAM.

Decode token data flow within one layer (keeping activations hot):

  1. RMSNorm: read x (L1), write x_norm (L1)
  2. Quantize to Q8: x_norm -> Q8 (~1KB, stays in L1)
  3. Q = x @ W_q: stream W_q from DRAM; x_q8 stays in L1; write Q to L2
  4. K = x @ W_k and V = x @ W_v: stream W_k, W_v; x_q8 is still in L1
  5. Attention: Q, K, V from L2, KV cache from L2/L3, output to L1
  6. Output projection: stream W_o
  7. MLP: stream W_gate, W_up, W_down; fused, so no intermediate DRAM round-trip

Key insight, the activation reuse pattern:

  1. Quantize x to Q8 ONCE after RMSNorm (stays in L1: ~1KB)
  2. Reuse the Q8 activations for the Q, K, V projections (3x reuse before eviction)
  3. Fused MLP: quantize once, use for gate+up+down (eliminates the intermediate DRAM write)

Weights stream through; activations stay hot! Qwen2-0.5B decode streams ~350MB of Q4_K weights per token (~15MB per layer x 24 layers, plus embeddings); at ~200 GB/s sustained that is ~1.75ms/token, or ~570 tok/s theoretical max. In FP32 it streams ~2GB per token, ~10ms, or ~100 tok/s.

6. Implementation Strategy

Phase 1: VNNI Q4_K Decode (Highest Impact)

  1. Implement quantize_row_q8_k() for activations
  2. Implement vec_dot_q4_k_q8_k_vnni() using _mm512_dpbusd_epi32
  3. Wire into decode path: after RMSNorm, quantize to Q8, use VNNI matvec
  4. Keep FP32 for: RMSNorm, RoPE, Attention scores, Softmax

Expected speedup: 5-10x for decode
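A scalar model pins down the integer core that vec_dot_q4_k_q8_k_vnni must reproduce: _mm512_dpbusd_epi32 multiplies unsigned INT8 activations by signed INT8 weights in groups of 4 and accumulates into INT32. The real kernel runs 16 such lanes per instruction and folds in the Q4_K scales and mins, which are omitted here:

```c
#include <stdint.h>

/* One INT32 lane of _mm512_dpbusd_epi32:
 * acc += a[0]*w[0] + a[1]*w[1] + a[2]*w[2] + a[3]*w[3],
 * with a unsigned INT8 (Q8 activations) and w signed INT8 (unpacked Q4_K). */
static int32_t dpbusd_ref(int32_t acc, const uint8_t a[4], const int8_t w[4]) {
    for (int k = 0; k < 4; k++)
        acc += (int32_t)a[k] * (int32_t)w[k];
    return acc;
}
```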

Phase 2: AMX for Prefill

  1. Implement weight repacking to VNNI tile format at load time
  2. Implement AMX tile kernel for Q4_K x Q8_K
  3. Use for M > 16 (prefill batch size)
  4. Keep VNNI for M = 1 (single token decode)

Expected speedup: 2-3x for prefill
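What one AMX tile multiply computes can also be written as a scalar reference: for fully populated tiles, _tile_dpbusd is logically C[16][16] (INT32) += A[16][64] (unsigned INT8) x B[64][16] (signed INT8). In the real tile register B is stored VNNI-interleaved; that repacking detail is ignored here:

```c
#include <stdint.h>
#include <string.h>

/* Architectural AMX tile shape: 16 rows x 64 bytes per input tile. */
#define TILE_M 16
#define TILE_N 16
#define TILE_K 64

/* Scalar reference for one _tile_dpbusd invocation on full tiles. */
static void tile_dpbusd_ref(int32_t C[TILE_M][TILE_N],
                            uint8_t A[TILE_M][TILE_K],
                            int8_t  B[TILE_K][TILE_N]) {
    for (int m = 0; m < TILE_M; m++)
        for (int n = 0; n < TILE_N; n++)
            for (int k = 0; k < TILE_K; k++)
                C[m][n] += (int32_t)A[m][k] * (int32_t)B[k][n];
}
```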

Phase 3: SubNUMA & Hugepages

  1. Detect SubNUMA clusters (SNC mode on Xeon)
  2. Allocate model weights per-SNC with mbind()
  3. Use 2MB hugepages for weight buffers
  4. Pin threads to SNC, access local weights

Competitive advantage over llama.cpp!
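The hugepage step of this phase can be sketched in a few lines. This is a Linux-only sketch with a graceful fallback; the mbind() NUMA step needs libnuma and is omitted, and alloc_weights is an illustrative name, not the engine's API:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Try to back a weight buffer with pre-reserved 2MB hugepages; if none are
 * configured, fall back to normal pages and hint transparent hugepages. */
static void *alloc_weights(size_t len) {
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED) return p;
    /* No hugetlb pool available: plain mapping plus a THP hint. */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return NULL;
    madvise(p, len, MADV_HUGEPAGE);
    return p;
}
```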

Phase 4: BF16 for Training

  1. Keep FP32 master weights for optimizer state
  2. Use BF16 for forward/backward activations
  3. Implement a BF16 GEMM kernel using _mm512_dpbf16_ps
  4. Optional: AMX-BF16 tiles for batch training

For training workloads

Summary: What Each SIMD Is Best For

| SIMD | Best Use Case | Data Types | Operations |
|---|---|---|---|
| AVX-512 FP32 | RMSNorm, Softmax, RoPE, Residuals | FP32 | _mm512_fmadd_ps, _mm512_exp_ps (SVML) |
| AVX-512 VNNI | Q4_K decode (M=1 GEMV) | INT8 x INT8 -> INT32 | _mm512_dpbusd_epi32 (4x faster!) |
| AMX INT8 | Prefill (M>1), batch inference | INT8 tiles | _tile_dpbusd (16x16 tiles) |
| AVX-512 BF16 | Training, high-precision inference | BF16 | _mm512_dpbf16_ps |
| AMX BF16 | Training batches, dense inference | BF16 tiles | _tile_dpbf16ps |