Testing Methodology
Multi-level numerical parity testing to ensure correctness against PyTorch.
No-root Quantized Performance Workflow
On shared Xeon systems where perf, VTune, or Advisor are unavailable,
use repeatable CK-vs-llama.cpp timing gates and kernel-scoped benchmarks instead
of hardware counters. Keep BF16/AMX training experiments separate from quantized
inference patches unless the model path actually uses BF16 tensors.
- LLAMA_CPP_DIR=/opt/app-root/src/Software/llama.cpp make test-gemv-comprehensive-quick: quick Q4/Q5/Q8 GEMV parity and speed against llama.cpp.
- make test-gemv-omp-quick: serial vs OpenMP GEMV parity, useful for proving when OpenMP is not the right production scheduler.
- make test-threadpool-parity-quick: production threadpool dispatch parity and speed for GEMV/GEMM decode/prefill shapes.
- LLAMA_CPP_DIR=/opt/app-root/src/Software/llama.cpp make test-head-major-q5-vs-llama-quick: head-major Q5/Q8 output projection against llama.cpp.
- make llamacpp-parity-nightly: nightly llama.cpp parity profile; includes the quantized GEMV quick lane, threadpool dispatch, DeltaNet ISA, and head-major Q5 checks.
- make llamacpp-parity-perf or make llamacpp-parity-perf-large: longer performance runs when changing shared quantized kernels.
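Without hardware counters, a timing gate comes down to comparing repeated wall-clock measurements. The sketch below shows one way to structure such a gate in Python. The names `median_seconds`, `passes_gate`, and `dummy_kernel`, and the 1.10 ratio, are illustrative assumptions and are not taken from the make targets listed here.

```python
import statistics
import time


def median_seconds(fn, repeats=7, warmup=2):
    """Return the median wall-clock time of fn() over `repeats` timed runs."""
    for _ in range(warmup):  # warm caches before timing
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def passes_gate(candidate_s, baseline_s, max_ratio=1.10):
    """True if the candidate is at most max_ratio times the baseline time."""
    return candidate_s <= baseline_s * max_ratio


def dummy_kernel():
    # Stand-in workload; a real gate would invoke the kernel benchmark here.
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    t = median_seconds(dummy_kernel)
    print(f"median: {t * 1e3:.3f} ms, gate vs itself: {passes_gate(t, t)}")
```

The median is used rather than the mean because a single noisy run on a shared machine would otherwise skew the result.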
Current optimization priority for inference is shared quantized kernels and orchestration:
gemv_q4_k_q8_k, gemv_q6_k_q8_k,
gemm_nt_q4_k_q8_k, Q8_K activation quantization, final logits /
head-major projection, and row-sweep dispatch through the persistent threadpool.
AMX BF16 work is valuable for BF16 training and BF16 serving, but it is not a substitute
for measuring the quantized decode path used by Qwen, Gemma, Nanbeige, and vision GGUFs.
Testing Pyramid
Tests are organized in levels, from individual kernels up to full training loops. If a higher-level test fails while all lower-level tests pass, the bug is most likely in integration or in a shared convention (see the RoPE case study), not in kernel arithmetic.
Level 1: Kernel Unit Tests
Each kernel is tested independently against a PyTorch reference.
What We Test
| Test File | Kernels Tested | Forward | Backward |
|---|---|---|---|
| test_rope.py | RoPE precompute, apply | ✓ | ✓ |
| test_rmsnorm.py | RMS normalization | ✓ | ✓ |
| test_attention.py | Causal attention (MHA, GQA) | ✓ | - |
| test_attention_backward.py | Attention gradients | - | ✓ |
| test_kv_cache_attention.py | Flash prefill + KV-cache decode attention | ✓ | - |
| test_gemm.py | Matrix multiplication | ✓ | - |
| test_mlp.py | Feed-forward layers | ✓ | ✓ |
| test_gelu.py | GELU activation | ✓ | ✓ |
| test_swiglu.py | SwiGLU activation | ✓ | ✓ |
| test_cross_entropy.py | Loss function | ✓ | ✓ |
| test_embedding.py | Token embeddings | ✓ | ✓ |
How It Works
    # Example: test_rope.py
    1. Load C library via ctypes
    2. Generate random input tensors
    3. Run C kernel
    4. Run PyTorch reference
    5. Compare: max_diff should be < 1e-5
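The compare step can be sketched without the C library or PyTorch. In the example below, `rmsnorm_f32_standin` is a hypothetical stand-in for a C kernel: it performs the same math with float32 rounding, and the double-precision function plays the role of the reference. All function names and the `eps` value are assumptions made for illustration; the real tests call the compiled kernel through ctypes.

```python
import math
import random
import struct


def to_f32(v):
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", v))[0]


def rmsnorm_reference(x, weight, eps=1e-6):
    """Double-precision reference: x * weight / sqrt(mean(x^2) + eps)."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def rmsnorm_f32_standin(x, weight, eps=1e-6):
    """Hypothetical stand-in for the C kernel: same math, float32 rounding."""
    acc = 0.0
    for v in x:
        acc = to_f32(acc + to_f32(v * v))
    scale = to_f32(1.0 / math.sqrt(to_f32(acc / len(x) + eps)))
    return [to_f32(to_f32(v * scale) * w) for v, w in zip(x, weight)]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two vectors."""
    return max(abs(p - q) for p, q in zip(a, b))


random.seed(0)  # fixed seed keeps the test repeatable
x = [to_f32(random.uniform(-1.0, 1.0)) for _ in range(64)]
w = [to_f32(random.uniform(0.5, 1.5)) for _ in range(64)]
diff = max_abs_diff(rmsnorm_reference(x, w), rmsnorm_f32_standin(x, w))
print(f"max_diff={diff:.3e}", "PASS" if diff < 1e-5 else "FAIL")
```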
Running
    make test                      # Run all kernel unit tests
    make test-bf16                 # Run BF16 tests (skips on unsupported CPUs)
    python unittest/test_rope.py   # Run specific test
Level 2: Full Layer Tests
Tests a complete transformer layer with all kernels combined.
What It Validates
- RMSNorm → QKV projection → RoPE → Attention → Output projection → Residual
- RMSNorm → MLP (SwiGLU) → Residual
- Forward AND backward through entire layer
    # unittest/test_orchestration_layer.py
    - Uses ck_layer_forward_rmsnorm_swiglu()
    - Uses ck_layer_backward_rmsnorm_swiglu()
    - Compares output and all intermediate buffers

    # unittest/test_kv_cache_layer_decode.py
    - Prefill with ck_layer_forward_rmsnorm_swiglu()
    - Decode with ck_layer_forward_rmsnorm_swiglu_decode()
    - Compares full-layer outputs (prefill+decode vs full forward)
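The prefill+decode vs full-forward comparison relies on one property: decoding token t against a cache of keys and values for tokens 0..t must give the same result as row t of full causal attention. The single-head sketch below illustrates that property in plain Python. It is a simplified stand-in with hypothetical function names, not the `ck_layer_*` code path.

```python
import math
import random


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Attention output for one query over the given keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    probs = softmax(scores)
    return [sum(p * v[d] for p, v in zip(probs, values))
            for d in range(len(values[0]))]


def full_causal(qs, ks, vs):
    """Full forward: token t attends to tokens 0..t."""
    return [attend(qs[t], ks[: t + 1], vs[: t + 1]) for t in range(len(qs))]


def prefill_then_decode(qs, ks, vs, prefill_len):
    """Prefill the first prefill_len tokens, then decode one token at a time."""
    out = full_causal(qs[:prefill_len], ks[:prefill_len], vs[:prefill_len])
    k_cache, v_cache = list(ks[:prefill_len]), list(vs[:prefill_len])
    for t in range(prefill_len, len(qs)):
        k_cache.append(ks[t])  # append the new token's K/V to the cache
        v_cache.append(vs[t])
        out.append(attend(qs[t], k_cache, v_cache))
    return out


def random_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(1)
T, D = 6, 8
qs, ks, vs = random_matrix(T, D), random_matrix(T, D), random_matrix(T, D)
ref = full_causal(qs, ks, vs)
got = prefill_then_decode(qs, ks, vs, prefill_len=4)
max_diff = max(abs(a - b) for r, g in zip(ref, got) for a, b in zip(r, g))
print(f"max_diff={max_diff:.3e}")
```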
Level 3: Per-Stage Layer Diff
Uses real model weights to test each stage within a layer.
Why This Matters
Kernel tests use random weights. Real model weights may expose edge cases (specific value ranges, correlations) that random tests miss.
What It Reports
    $ make smollm-layer-diff SMOLLM_LAYER=0 SMOLLM_STAGE_DUMP=1
    Stage-by-stage diffs for layer 0:
    ln1_out:       max_diff=1.53e-05
    q (post-rope): max_diff=2.10e-05
    k (post-rope): max_diff=1.89e-05
    scores:        max_diff=3.21e-05
    attn_out:      max_diff=2.87e-05
    proj_out:      max_diff=4.12e-05
    ln2_out:       max_diff=1.98e-05
    fc1_out:       max_diff=3.55e-05
    swiglu_out:    max_diff=2.44e-05
    output:        max_diff=5.02e-05
Debugging Pattern
    If q/k diverge after RoPE → RoPE bug
    If scores/attn_out diverge → Attention bug
    If proj diverges while attn_out clean → Output projection bug
    If mlp_out diverges → MLP/SwiGLU bug
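The pattern amounts to walking the stages in execution order and blaming the first one that exceeds a threshold. The sketch below encodes that idea using the stage names from the layer-0 dump. The 1e-4 threshold, the function name, and the component labels for `ln1_out`, `ln2_out`, and `output` are assumptions extrapolated from the pattern, not part of the tool.

```python
# Stages in execution order, paired with the component to suspect first.
STAGE_TO_COMPONENT = [
    ("ln1_out", "RMSNorm"),                    # assumption
    ("q (post-rope)", "RoPE"),
    ("k (post-rope)", "RoPE"),
    ("scores", "Attention"),
    ("attn_out", "Attention"),
    ("proj_out", "Output projection"),
    ("ln2_out", "RMSNorm"),                    # assumption
    ("fc1_out", "MLP/SwiGLU"),
    ("swiglu_out", "MLP/SwiGLU"),
    ("output", "MLP/SwiGLU or residual"),      # assumption
]


def first_divergence(diffs, threshold=1e-4):
    """Return (stage, component) for the first stage above threshold, else None."""
    for stage, component in STAGE_TO_COMPONENT:
        if diffs.get(stage, 0.0) > threshold:
            return stage, component
    return None


# Values copied from the layer-0 stage dump: every stage is within tolerance.
clean = {
    "ln1_out": 1.53e-05, "q (post-rope)": 2.10e-05, "k (post-rope)": 1.89e-05,
    "scores": 3.21e-05, "attn_out": 2.87e-05, "proj_out": 4.12e-05,
    "ln2_out": 1.98e-05, "fc1_out": 3.55e-05, "swiglu_out": 2.44e-05,
    "output": 5.02e-05,
}
print(first_divergence(clean))

# Hypothetical failure: proj_out diverges while attn_out stays clean.
broken = dict(clean, proj_out=0.5, output=0.6)
print(first_divergence(broken))
```

Reporting only the first diverging stage matters because every later stage inherits the error and would otherwise point at the wrong component.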
Level 4: Full Forward Parity
Runs all layers and compares final logits to PyTorch.
    $ make smollm-forward SMOLLM_TEXT="Hello world" SMOLLM_CONTEXT=5
    Logits diff: max_abs=3.05e-05 mean_abs=1.23e-05
    pos 0 top1: C=28   Torch=28   [OK]
    pos 1 top1: C=1715 Torch=1715 [OK]
    pos 2 top1: C=504  Torch=504  [OK]
    pos 3 top1: C=33   Torch=33   [OK]
    pos 4 top1: C=29   Torch=29   [OK]
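The report combines two checks: an element-wise logit difference and a per-position top-1 agreement. The sketch below computes both on toy data. The function names and the logit values are invented for illustration and do not come from the tool.

```python
def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)


def compare_logits(c_logits, ref_logits):
    """Return (max_abs, mean_abs, [(c_top1, ref_top1), ...]) over all positions."""
    diffs = [abs(a - b)
             for c_row, r_row in zip(c_logits, ref_logits)
             for a, b in zip(c_row, r_row)]
    top1 = [(argmax(c), argmax(r)) for c, r in zip(c_logits, ref_logits)]
    return max(diffs), sum(diffs) / len(diffs), top1


# Toy logits: 2 positions, vocabulary of 4 tokens (hypothetical values).
c_logits = [[0.1, 2.0, -1.0, 0.5], [1.5, 0.2, 0.3, 1.4]]
ref_logits = [[0.1, 2.00002, -1.0, 0.5], [1.5, 0.2, 0.30001, 1.4]]

max_abs, mean_abs, top1 = compare_logits(c_logits, ref_logits)
print(f"Logits diff: max_abs={max_abs:.2e} mean_abs={mean_abs:.2e}")
for pos, (c_tok, ref_tok) in enumerate(top1):
    status = "OK" if c_tok == ref_tok else "MISMATCH"
    print(f"pos {pos} top1: C={c_tok} Ref={ref_tok} [{status}]")
```

Checking top-1 separately is useful because a small logit difference can still flip the prediction when two candidate tokens are nearly tied.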
What It Catches
- Weight loading/conversion bugs (HF → bump format)
- Layer chaining bugs (output of layer N → input of layer N+1)
- Error accumulation across many layers
- Context-length dependent bugs (RoPE at different positions)
Level 5: Training Parity
The most comprehensive test: forward + backward + weight update.
What It Tests
1. Initialize identical weights in C and PyTorch
2. Run forward pass → compare loss
3. Run backward pass → compute gradients
4. Apply SGD update → compare updated weights
5. Repeat for N steps → check error accumulation
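The loop above can be illustrated on a tiny linear model. In this sketch a double-precision SGD step acts as the reference and the same step with float32 rounding stands in for the C implementation. The model, data, learning rate, and all names are assumptions chosen only to show the shape of the comparison.

```python
import struct


def f32(v):
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", v))[0]


def exact(v):
    """Identity rounding: keep full double precision."""
    return v


def sgd_step(w, xs, ys, lr, rnd):
    """One full-batch SGD step on mean squared error for y = w . x.

    `rnd` is applied to intermediate results to mimic a lower-precision path.
    Returns the updated weights and the loss measured before the update.
    """
    n = len(xs)
    grad = [0.0] * len(w)
    loss = 0.0
    for x, y in zip(xs, ys):
        pred = 0.0
        for wi, xi in zip(w, x):
            pred = rnd(pred + rnd(wi * xi))
        err = rnd(pred - y)
        loss = rnd(loss + err * err / n)
        for i, xi in enumerate(x):
            grad[i] = rnd(grad[i] + 2.0 * err * xi / n)
    return [rnd(wi - lr * g) for wi, g in zip(w, grad)], loss


xs = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.25], [2.0, 1.0]]
ys = [0.7 * a - 0.3 * b for a, b in xs]  # targets from a known weight vector
w_ref, w_f32 = [0.0, 0.0], [0.0, 0.0]    # identical initial weights
ref_losses, max_w_diffs = [], []
for step in range(5):
    w_ref, loss_ref = sgd_step(w_ref, xs, ys, 0.05, exact)
    w_f32, loss_f32 = sgd_step(w_f32, xs, ys, 0.05, f32)
    ref_losses.append(loss_ref)
    max_w_diffs.append(max(abs(a - b) for a, b in zip(w_ref, w_f32)))
    print(f"step {step} loss_ref={loss_ref:.6f} loss_f32={loss_f32:.6f} "
          f"max_w_diff={max_w_diffs[-1]:.3e}")
```

Recording the weight difference after every step, not only at the end, makes it possible to see whether the error is accumulating slowly or jumps at a specific step.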
Example Output (5 steps)
    $ make tiny-parity TINY_STEPS=5
    step 0 loss=5.567055
    step 1 loss=5.557948
    step 2 loss=5.548960
    step 3 loss=5.540207
    step 4 loss=5.531786
    C loss: 5.531786 | Torch loss: 5.520767
    Max weight diff: 8.535e-04
    layer.0.wq max_diff=2.235e-08
    layer.0.wk max_diff=2.980e-08
    layer.0.wv max_diff=4.891e-06
    ...
Expected Tolerances
| Steps | Max Weight Diff | Loss Diff | Status |
|---|---|---|---|
| 1 | < 1e-5 | < 1e-5 | Excellent |
| 5 | < 1e-3 | < 0.01 | Good |
| 20 | < 1e-2 | < 0.1 | Acceptable |
| 100+ | Compare curves | Same trend | Valid |
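The tolerance table can be applied mechanically. The sketch below treats each row's step count as an upper bound and falls back to curve comparison beyond 20 steps. That reading of the table, the function name, and the "FAIL" label are assumptions.

```python
# (max_steps, max_weight_diff, max_loss_diff, status), copied from the table.
TOLERANCES = [
    (1, 1e-5, 1e-5, "Excellent"),
    (5, 1e-3, 1e-2, "Good"),
    (20, 1e-2, 1e-1, "Acceptable"),
]


def check_training_parity(steps, weight_diff, loss_diff):
    """Classify a run against the first table row whose step bound covers it."""
    for max_steps, w_tol, l_tol, status in TOLERANCES:
        if steps <= max_steps:
            ok = weight_diff < w_tol and loss_diff < l_tol
            return status if ok else "FAIL"
    # Longer runs are judged by comparing loss curves, not point tolerances.
    return "Compare curves"


# Hypothetical measurements for illustration.
print(check_training_parity(1, 5e-6, 2e-6))
print(check_training_parity(5, 4e-4, 5e-3))
print(check_training_parity(5, 2e-3, 5e-3))
print(check_training_parity(200, 0.3, 0.2))
```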
Test Configurations
We test across multiple model configurations to catch edge cases.
| Config | Hidden | Heads | KV Heads | Layers | Purpose |
|---|---|---|---|---|---|
| tiny | 64 | 2 | 2 | 2 | Baseline |
| gqa | 128 | 8 | 2 | 2 | Grouped Query Attention |
| deep | 64 | 2 | 2 | 6 | Error accumulation |
| wide | 128 | 8 | 8 | 2 | Many heads |
| no_rope | 64 | 2 | 2 | 2 | Positional embeddings only |
| medium | 256 | 4 | 4 | 4 | Larger scale |
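The configurations in the table can be written as data and sanity-checked before any test runs. The dataclass below is a sketch: the class name, the `use_rope` field, and the validation rules (hidden size divisible by heads, heads divisible by KV heads) are assumptions based on how multi-head and grouped-query attention are usually laid out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    name: str
    hidden: int
    heads: int
    kv_heads: int
    layers: int
    use_rope: bool = True  # assumption: the no_rope config disables RoPE

    @property
    def head_dim(self):
        return self.hidden // self.heads

    def validate(self):
        if self.hidden % self.heads != 0:
            raise ValueError(f"{self.name}: hidden must be divisible by heads")
        if self.heads % self.kv_heads != 0:
            raise ValueError(f"{self.name}: heads must be divisible by kv_heads")


CONFIGS = [
    ModelConfig("tiny", 64, 2, 2, 2),
    ModelConfig("gqa", 128, 8, 2, 2),
    ModelConfig("deep", 64, 2, 2, 6),
    ModelConfig("wide", 128, 8, 8, 2),
    ModelConfig("no_rope", 64, 2, 2, 2, use_rope=False),
    ModelConfig("medium", 256, 4, 4, 4),
]

for cfg in CONFIGS:
    cfg.validate()
    group = cfg.heads // cfg.kv_heads  # query heads sharing one KV head
    print(f"{cfg.name}: head_dim={cfg.head_dim} gqa_group={group}")
```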
Running the Full Test Suite
    # Quick tests (< 1 minute)
    ./scripts/run_all_tests.sh quick
    # Full tests (5-10 minutes)
    ./scripts/run_all_tests.sh full
    # Stress tests (10+ minutes)
    ./scripts/run_all_tests.sh stress
v6.6 Contract + Matrix Gates
Before heavy E2E/parity runs, v6.6 uses two explicit preflight gates: a static tooling-contract checker and a dynamic 3-model matrix checker.
    # Primary v6.6 release gate (recommended)
    make v6.6-gate
    # Runtime-required parity matrix (when llama parity runtime is present)
    make v6.6-validate-parity-matrix-required
    # Optional standalone gates
    make v6.6-validate-contracts
    make v6.6-kernel-map-gate
    make v6.6-validate-matrix-smoke
    make v6.6-validate-longdecode
Reports are also written to:
version/v6.6/tools/contract_report_latest.json and
version/v6.6/tools/model_matrix_report_latest.json.
When kernel maps change, regenerate with make v6.6-kernel-map-regenerate.
Sliding-Window Regression Coverage
Sliding-window attention now has an explicit contract test that is wired into the regular test suite, the llama.cpp parity flow, and nightly reporting.
    # Run only sliding-window contract check
    make test-attention-sliding
    # Included in main kernel suite
    make test
    # Included in llama.cpp parity smoke/full flows
    make llamacpp-parity
    make llamacpp-parity-full
Implementation path:
unittest/test_attention_sliding_contract.py,
scripts/run_parity_smoketest.sh, and
scripts/nightly_runner.py.
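For readers unfamiliar with the property such a contract pins down, the sketch below builds a sliding-window causal mask. It assumes the window counts the current token, so query i sees keys i-window+1 through i. That convention and the function name are assumptions and may differ from what the contract test actually enforces.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j.

    Causal (j <= i) and limited to the most recent `window` keys,
    counting the current token.
    """
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]


mask = sliding_window_mask(4, 2)
for row in mask:
    print("".join("x" if visible else "." for visible in row))
```

With `window >= seq_len` the mask reduces to an ordinary causal mask, which gives a cheap cross-check against the existing causal attention tests.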
Case Study: Finding the RoPE Bug
This methodology found a subtle RoPE rotation convention bug that caused predictions to diverge at later context positions.
Symptoms
    - Kernel tests: PASSED (1e-5 precision)
    - Layer tests:  PASSED (1e-5 precision)
    - Full forward: FAILED (predictions diverge at position 3+)
Investigation
1. Kernel tests passed → kernels are correct for SOME convention
2. Layer tests passed → same story
3. But full model failed → bug is in convention, not math
4. Per-stage diff showed: q/k looked "correct" but were wrong convention
Root Cause
    Our RoPE:     (x[0], x[1]), (x[2], x[3]), ...   (interleaved pairs)
    Llama/SmolLM: (x[0..half], x[half..dim])        (rotate-half)

Both are mathematically valid RoPE, but weights are trained for one convention.
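The two conventions can be compared directly. The sketch below implements both for a single vector, using the common base of 10000 (an assumption, as are the function names). It shows that they agree at position 0, where every rotation is the identity, disagree at later positions, and are related by a fixed permutation of the input dimensions.

```python
import math


def rope_angles(pos, dim, base=10000.0):
    """Rotation angle for each of the dim/2 pairs at position pos."""
    return [pos * base ** (-2.0 * i / dim) for i in range(dim // 2)]


def rope_interleaved(x, pos):
    """Rotate adjacent pairs (x[2i], x[2i+1])."""
    out = list(x)
    for i, theta in enumerate(rope_angles(pos, len(x))):
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * math.cos(theta) - b * math.sin(theta)
        out[2 * i + 1] = a * math.sin(theta) + b * math.cos(theta)
    return out


def rope_rotate_half(x, pos):
    """Rotate split pairs (x[i], x[i + half])."""
    half = len(x) // 2
    out = list(x)
    for i, theta in enumerate(rope_angles(pos, len(x))):
        a, b = x[i], x[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out


def to_half_layout(x):
    """Permute interleaved layout to half layout: evens first, then odds."""
    return x[0::2] + x[1::2]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
for pos in (0, 3):
    a, b = rope_interleaved(x, pos), rope_rotate_half(x, pos)
    gap = max(abs(p - q) for p, q in zip(a, b))
    print(f"pos {pos}: max |interleaved - rotate_half| = {gap:.4f}")
```

This suggests one reason such a bug can stay hidden from tests that compare a kernel only against a reference sharing the same convention: each implementation is internally consistent, and the mismatch appears only when weights trained under the other layout are loaded.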
Fix
1. Update C kernel: src/kernels/rope_kernels.c
2. Update Python references: unittest/test_rope.py, scripts/smollm_layer_stage_diff.py
3. Re-run all tests: ALL PASS
Adding Tests for New Models
When adding support for a new model architecture, follow this checklist:
- Kernel tests: Add tests for any new kernels (e.g., sliding window attention)
- Layer tests: Test full layer if architecture differs
- Weight conversion: Run compare_bump_to_hf.py to verify weights
- Forward parity: Run forward pass and compare logits
- Training parity: Run 5-20 steps and compare weight updates
- Stress test: Run 100+ steps to check for numerical drift
Numerical Precision Guidelines
| Comparison | Float32 Threshold | Notes |
|---|---|---|
| Kernel output | 1e-5 | Within about 100× float32 machine epsilon (~1.2e-7) |
| Layer output | 1e-4 | Small accumulation acceptable |
| Full forward (logits) | 1e-3 | 30 layers × 1e-5 = 3e-4 |
| 1-step gradients | 1e-5 | Should match very closely |
| Multi-step weights | 1e-3 | Errors compound over steps |
| Token embeddings grad | 1e-2 | Aggregates from all layers |
- Any diff > 1.0 → Bug, not numerical precision
- Diffs growing exponentially over steps → Numerical instability
- One specific layer/kernel much worse than others → Bug in that component
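The first two warning signs can be checked automatically on a series of per-step diffs. In the sketch below, "growing exponentially" is approximated as every consecutive ratio being at least `growth_factor` across three or more transitions. That definition, the default factor of 2.0, and the function name are assumptions.

```python
def red_flags(step_diffs, growth_factor=2.0):
    """Return a list of warning strings for a sequence of per-step max diffs."""
    flags = []
    if any(d > 1.0 for d in step_diffs):
        flags.append("diff > 1.0: likely a bug, not numerical precision")
    # Ratio of each diff to the previous one; skip zero denominators.
    ratios = [b / a for a, b in zip(step_diffs, step_diffs[1:]) if a > 0]
    if len(ratios) >= 3 and all(r >= growth_factor for r in ratios):
        flags.append("diffs grow geometrically: possible numerical instability")
    return flags


# Hypothetical diff histories for illustration.
print(red_flags([1e-6, 2e-6, 3e-6, 3.5e-6]))  # slow growth, no flags
print(red_flags([1e-6, 1e-5, 1e-4, 1e-3]))    # 10x per step
print(red_flags([1e-6, 2.5]))                 # a single large jump
```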