SIMD Instructions & Quantization for LLM Inference
1. SIMD Instruction Hierarchy
Intel x86 SIMD has evolved to provide increasingly wider vector registers and specialized instructions for neural-network workloads.
Key Insight: VNNI for Decode, AMX for Prefill
Single-token decode (M=1): Use _mm512_dpbusd_epi32 (VNNI). AMX tile setup overhead makes it slower for single rows.
Prefill (M=prompt_length): Use AMX tiles for 16x16 matrix blocks. Amortizes tile load/store overhead.
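The dispatch rule above can be sketched as a tiny helper (hypothetical function and kernel names; the M > 16 threshold matches the prefill cutoff used in the implementation plan later in this document):

```python
def pick_gemm_kernel(m: int) -> str:
    """Hypothetical kernel dispatch on the batch dimension M.

    For M = 1 (single-token decode) the AMX tile-configuration overhead
    is not amortized, so a VNNI GEMV wins; for large M (prefill) AMX
    16x16 tiles amortize their setup cost across many rows.
    """
    if m == 1:
        return "vnni_gemv"   # _mm512_dpbusd_epi32 dot products
    if m > 16:
        return "amx_tiles"   # _tile_dpbusd on 16x16 tile blocks
    return "vnni_gemm"       # small batches: still VNNI, row by row
```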
Xeon Scalable Generations: Format Support Matrix
| Generation | Codename | Year | AVX-512 | VNNI (INT8) | BF16 | AMX | Best For |
|---|---|---|---|---|---|---|---|
| 1st Gen | Skylake-SP | 2017 | Yes | No | No | No | FP32 only |
| 2nd Gen | Cascade Lake | 2019 | Yes | Yes | No | No | Q4_K + VNNI decode |
| 3rd Gen | Ice Lake-SP | 2021 | Yes | Yes | AVX-512 BF16 | No | BF16 training, Q4_K inference |
| 4th Gen | Sapphire Rapids | 2023 | Yes | Yes | AVX-512 + AMX | INT8 + BF16 | Full: Q4_K decode, AMX batch, BF16 train |
| 5th Gen | Emerald Rapids | 2024 | Yes | Yes | AVX-512 + AMX | INT8 + BF16 | Same as 4th Gen (improved cache) |
Note: 2nd Gen Xeon has VNNI but no BF16, so it still delivers excellent Q4_K inference performance; BF16 training workloads require 3rd Gen or later.
Weight Formats: Training vs Inference
Training
Master weights: FP32 (for optimizer state: Adam moments, gradients)
Forward/backward: BF16 preferred (same dynamic range as FP32, 2x memory savings)
Why not FP16: FP16 has limited range (6e-5 to 65504), causes overflow in gradients
Why not INT8: Gradients need full precision, quantization noise accumulates
Inference
Weights: Q4_K (4.5 bits) - 7x smaller than FP32, minimal quality loss
Activations: Quantize to Q8 on-the-fly (fast, ~1KB per layer)
KV cache: FP16 or FP32 (quality-sensitive, stays in cache)
Compute kernels: VNNI INT8 for GEMV, FP32 for softmax/RMSNorm
Why Quantization Helps (Memory Bandwidth is the Bottleneck)
For single-token decode, the model is memory-bandwidth bound, not compute-bound:
- Each token requires loading the entire model weights from DRAM
- Qwen2-0.5B FP32: ~2GB weights × 1 token = 2GB DRAM read per token
- Qwen2-0.5B Q4_K: ~350MB weights × 1 token = 350MB DRAM read per token
- DDR5 bandwidth: ~200 GB/s sustained → Q4_K gives ~570 tok/s theoretical vs ~100 tok/s FP32
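The arithmetic behind those numbers is a one-liner; a quick check (figures are the rough estimates from the bullets above, not measurements):

```python
def peak_decode_tok_s(weight_bytes: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed when every token must stream
    the full weight set from DRAM (memory-bandwidth bound)."""
    return bandwidth_gb_s * 1e9 / weight_bytes

BW = 200.0                                   # GB/s sustained DDR5 (rough)
fp32_tok_s = peak_decode_tok_s(2.0e9, BW)    # ~100 tok/s for FP32 weights
q4k_tok_s = peak_decode_tok_s(0.35e9, BW)    # ~571 tok/s for Q4_K weights
```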
Bottom line: Quantization is NOT optional for competitive inference speed. Q4_K with VNNI is the sweet spot for quality vs performance on Xeon.
2. Quantization Formats: Q4_K and Q8_K
llama.cpp uses block-wise quantization where weights and activations are quantized in groups with per-block scale factors.
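A minimal NumPy sketch of the idea — one absmax scale per fixed-size block of values. This illustrates the scheme only, not the exact Q4_K/Q8_K bit layout (Q4_K additionally packs 256-value super-blocks with quantized sub-block scales and mins):

```python
import numpy as np

def quantize_blocks_q8(x: np.ndarray, block: int = 32):
    """Symmetric per-block INT8 quantization: one FP32 scale per block."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0                       # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_blocks(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Invert quantize_blocks_q8: multiply each block by its scale."""
    return (q.astype(np.float32) * scale).ravel()
```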
3. Transformer Operations & SIMD Mapping
Different operations in the transformer benefit from different SIMD approaches.
Operation Breakdown by Time (Decode, M=1)
| Operation | % of Time | Bottleneck | Best SIMD | Best Format |
|---|---|---|---|---|
| Q/K/V Projections | ~25% | Memory BW | VNNI | Q4_K weights, Q8 activations |
| Output Projection | ~8% | Memory BW | VNNI | Q4_K weights |
| MLP (gate+up+down) | ~60% | Memory BW | VNNI | Q4_K weights |
| Attention (Q@K, softmax, @V) | ~5% | Compute | AVX-512 FP32 | FP32/FP16 KV cache |
| RMSNorm (x2) | ~1% | Compute | AVX-512 FP32 | FP32 |
| RoPE, Residuals | ~1% | Compute | AVX-512 FP32 | FP32 |
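The compute-bound FP32 rows in the table are simple reductions and elementwise math. RMSNorm, for example, is just the following (NumPy sketch of the math an AVX-512 FP32 kernel would vectorize):

```python
import numpy as np

def rmsnorm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm: divide x by its root-mean-square, then scale by weight.
    The sum-of-squares reduction maps naturally onto _mm512_fmadd_ps."""
    rms = np.sqrt(np.mean(x * x) + eps)
    return (x / rms) * weight
```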
4. Cache Hierarchy & Data Flow
Understanding cache behavior is crucial for keeping activations "hot" and avoiding DRAM round-trips.
5. Activation Storage & Hot Data Path
How to keep activations hot in cache while streaming weights from DRAM.
6. Implementation Strategy
Phase 1: VNNI Q4_K Decode (Highest Impact)
- Implement quantize_row_q8_k() for activations
- Implement vec_dot_q4_k_q8_k_vnni() using _mm512_dpbusd_epi32
- Wire into decode path: after RMSNorm, quantize to Q8, use VNNI matvec
- Keep FP32 for: RMSNorm, RoPE, Attention scores, Softmax
Expected speedup: 5-10x for decode
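The heart of the VNNI kernel, _mm512_dpbusd_epi32, multiplies four unsigned bytes by four signed bytes and accumulates the products into each 32-bit lane (16 lanes at once; the saturating variant is _mm512_dpbusds_epi32). A scalar emulation of one lane, to show the semantics only:

```python
def dpbusd_lane(acc: int, a_u8: list[int], b_s8: list[int]) -> int:
    """Emulate one 32-bit lane of _mm512_dpbusd_epi32:
    acc += sum of four unsigned-byte x signed-byte products.
    The unsigned side maps nicely onto Q4_K nibbles (0..15);
    the signed side onto Q8-quantized activations.
    """
    assert len(a_u8) == 4 and len(b_s8) == 4
    assert all(0 <= a <= 255 for a in a_u8)
    assert all(-128 <= b <= 127 for b in b_s8)
    return acc + sum(a * b for a, b in zip(a_u8, b_s8))
```

The unsigned-by-signed asymmetry is why the 4-bit weights (unpacked as unsigned nibbles) go on one operand and the signed Q8 activations on the other.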
Phase 2: AMX for Prefill
- Implement weight repacking to VNNI tile format at load time
- Implement AMX tile kernel for Q4_K x Q8_K
- Use for M > 16 (prefill batch size)
- Keep VNNI for M = 1 (single token decode)
Expected speedup: 2-3x for prefill
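The loop structure AMX imposes can be sketched in NumPy, with a plain int32 tile matmul standing in for one _tile_dpbusd step (real AMX INT8 tiles are 16 rows x 64 bytes; simplified to 16x16 here):

```python
import numpy as np

TILE = 16  # accumulator tiles are 16x16 int32

def tiled_matmul_i32(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """C = A @ B computed tile-by-tile, the way an AMX kernel walks
    16x16 accumulator tiles. Shapes must be multiples of TILE."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % TILE == 0 and N % TILE == 0 and K % TILE == 0
    C = np.zeros((M, N), dtype=np.int32)
    for i in range(0, M, TILE):
        for j in range(0, N, TILE):
            acc = np.zeros((TILE, TILE), dtype=np.int32)  # one accumulator tile
            for k in range(0, K, TILE):
                # one _tile_dpbusd-style step: multiply-accumulate a tile pair
                acc += A[i:i+TILE, k:k+TILE].astype(np.int32) @ \
                       B[k:k+TILE, j:j+TILE].astype(np.int32)
            C[i:i+TILE, j:j+TILE] = acc
    return C
```

Keeping the accumulator tile resident across the whole K loop is exactly what amortizes the tile load/store overhead mentioned above.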
Phase 3: SubNUMA & Hugepages
- Detect SubNUMA clusters (SNC mode on Xeon)
- Allocate model weights per-SNC with mbind()
- Use 2MB hugepages for weight buffers
- Pin threads to SNC, access local weights
Competitive advantage over llama.cpp!
Phase 4: BF16 for Training
- Keep FP32 master weights for optimizer state
- Use BF16 for forward/backward activations
- Implement _mm512_dpbf16_ps for BF16 GEMM
- Optional: AMX-BF16 tiles for batch training
For training workloads
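The BF16 rationale above follows from the format itself: BF16 keeps FP32's 8 exponent bits and truncates the mantissa to 7 bits, so the dynamic range is unchanged and only precision is lost. A NumPy sketch of the FP32→BF16 conversion with round-to-nearest-even, the rounding mode the AVX-512 BF16 convert instruction uses (NaN and overflow edge cases ignored for brevity):

```python
import numpy as np

def fp32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """FP32 -> BF16 (stored as uint16) with round-to-nearest-even:
    keep the top 16 bits after adding half an ULP and a tie-break bit."""
    bits = x.astype(np.float32).view(np.uint32)
    lsb = (bits >> 16) & 1           # tie-breaking bit for round-to-even
    rounded = bits + 0x7FFF + lsb
    return (rounded >> 16).astype(np.uint16)

def bf16_bits_to_fp32(b: np.ndarray) -> np.ndarray:
    """BF16 -> FP32 is exact: shift back into the top half of the word."""
    return (b.astype(np.uint32) << 16).view(np.float32)
```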
Summary: What Each SIMD Is Best For
| SIMD | Best Use Case | Data Types | Operations |
|---|---|---|---|
| AVX-512 FP32 | RMSNorm, Softmax, RoPE, Residuals | FP32 | _mm512_fmadd_ps, _mm512_exp_ps |
| AVX-512 VNNI | Q4_K decode (M=1 GEMV) | UINT8 x INT8 → INT32 | _mm512_dpbusd_epi32 - 4x faster! |
| AMX INT8 | Prefill (M>1), batch inference | INT8 tiles | _tile_dpbusd - 16x16 tiles |
| AVX-512 BF16 | Training, high-precision inference | BF16 | _mm512_dpbf16_ps |
| AMX BF16 | Training batches, dense inference | BF16 tiles | _tile_dpbf16ps |