# Testing Methodology

Multi-level numerical parity testing ensures correctness against PyTorch.
## Testing Pyramid
Tests are organized in levels, from individual kernels up to full training loops. If a higher-level test fails but lower-level tests pass, the bug is in integration, not kernels.
## Level 1: Kernel Unit Tests

Each kernel is tested independently against a PyTorch reference.
### What We Test
| Test File | Kernels Tested | Forward | Backward |
|---|---|---|---|
| `test_rope.py` | RoPE precompute, apply | ✓ | ✓ |
| `test_rmsnorm.py` | RMS normalization | ✓ | ✓ |
| `test_attention.py` | Causal attention (MHA, GQA) | ✓ | - |
| `test_attention_backward.py` | Attention gradients | - | ✓ |
| `test_kv_cache_attention.py` | Flash prefill + KV-cache decode attention | ✓ | - |
| `test_gemm.py` | Matrix multiplication | ✓ | - |
| `test_mlp.py` | Feed-forward layers | ✓ | ✓ |
| `test_gelu.py` | GELU activation | ✓ | ✓ |
| `test_swiglu.py` | SwiGLU activation | ✓ | ✓ |
| `test_cross_entropy.py` | Loss function | ✓ | ✓ |
| `test_embedding.py` | Token embeddings | ✓ | ✓ |
### How It Works

```
# Example: test_rope.py
1. Load the C library via ctypes
2. Generate random input tensors
3. Run the C kernel
4. Run the PyTorch reference
5. Compare: max_diff should be < 1e-5
```
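The steps above can be sketched in a self-contained way with NumPy standing in for both sides. This is a hedged illustration, not the project's actual test: `run_c_kernel` reuses the reference so the sketch runs anywhere, and the library name in the comment is hypothetical.

```python
import numpy as np

def rope_reference(x, theta=10000.0):
    """NumPy stand-in for the PyTorch reference: interleaved-pair RoPE."""
    seq, dim = x.shape
    half = dim // 2
    freqs = theta ** (-np.arange(half) / half)         # per-pair frequencies
    angles = np.arange(seq)[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    x0, x1 = x[:, 0::2], x[:, 1::2]                    # interleaved pairs
    out[:, 0::2] = x0 * cos - x1 * sin
    out[:, 1::2] = x0 * sin + x1 * cos
    return out

def run_c_kernel(x):
    # In the real test this would call the C library via ctypes, e.g.
    #   lib = ctypes.CDLL("./libckernels.so")  # hypothetical library name
    # Here the reference stands in so the sketch is self-contained.
    return rope_reference(x)

rng = np.random.default_rng(0)                      # 2. random inputs
x = rng.standard_normal((5, 8)).astype(np.float32)
c_out = run_c_kernel(x)                             # 3. run "C kernel"
ref_out = rope_reference(x)                         # 4. run reference
max_diff = np.abs(c_out - ref_out).max()            # 5. compare
assert max_diff < 1e-5
```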
### Running

```
make test                     # Run all kernel unit tests
make test-bf16                # Run BF16 tests (skips on unsupported CPUs)
python unittest/test_rope.py  # Run a specific test
```
## Level 2: Full Layer Tests

Tests a complete transformer layer with all kernels combined.
### What It Validates
- RMSNorm → QKV projection → RoPE → Attention → Output projection → Residual
- RMSNorm → MLP (SwiGLU) → Residual
- Forward AND backward through the entire layer
```
# unittest/test_orchestration_layer.py
- Uses ck_layer_forward_rmsnorm_swiglu()
- Uses ck_layer_backward_rmsnorm_swiglu()
- Compares output and all intermediate buffers
```
```
# unittest/test_kv_cache_layer_decode.py
- Prefill with ck_layer_forward_rmsnorm_swiglu()
- Decode with ck_layer_forward_rmsnorm_swiglu_decode()
- Compares full-layer outputs (prefill+decode vs full forward)
```
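The prefill+decode equivalence being checked can be illustrated with a minimal NumPy sketch (single head, no RoPE or projections; the function names are illustrative, not the project's API): causal attention over the first T positions must equal a prefill over T-1 positions plus one decode step against the full KV cache.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head causal attention over a full (seq, dim) sequence."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = (q @ k.T) * scale
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores[mask] = -np.inf                     # mask future positions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def decode_step(q_t, k_cache, v_cache):
    """One decode step: the new query attends over cached keys/values."""
    scale = 1.0 / np.sqrt(q_t.shape[-1])
    scores = (k_cache @ q_t) * scale
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ v_cache

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((6, 16)) for _ in range(3))

full = causal_attention(q, k, v)                  # full forward, 6 positions
prefill = causal_attention(q[:5], k[:5], v[:5])   # prefill first 5 positions
decoded = decode_step(q[5], k, v)                 # decode position 5 from cache

assert np.allclose(full[:5], prefill, atol=1e-6)  # prefill matches full forward
assert np.allclose(full[5], decoded, atol=1e-6)   # decode matches full forward
```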
## Level 3: Per-Stage Layer Diff

Uses real model weights to test each stage within a layer.
### Why This Matters
Kernel tests use random weights. Real model weights may expose edge cases (specific value ranges, correlations) that random tests miss.
### What It Reports

```
$ make smollm-layer-diff SMOLLM_LAYER=0 SMOLLM_STAGE_DUMP=1
Stage-by-stage diffs for layer 0:
ln1_out:       max_diff=1.53e-05
q (post-rope): max_diff=2.10e-05
k (post-rope): max_diff=1.89e-05
scores:        max_diff=3.21e-05
attn_out:      max_diff=2.87e-05
proj_out:      max_diff=4.12e-05
ln2_out:       max_diff=1.98e-05
fc1_out:       max_diff=3.55e-05
swiglu_out:    max_diff=2.44e-05
output:        max_diff=5.02e-05
```
### Debugging Pattern

- If q/k diverge after RoPE → RoPE bug
- If scores/attn_out diverge → attention bug
- If proj_out diverges while attn_out is clean → output projection bug
- If swiglu_out diverges → MLP/SwiGLU bug
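A minimal sketch of this stage-walk pattern: compare intermediate buffers in pipeline order and report the first one whose diff exceeds a threshold. The stage names mirror the report above, but `first_divergent_stage` and the buffer dictionaries are illustrative, not project functions.

```python
import numpy as np

def first_divergent_stage(c_stages, ref_stages, threshold=1e-4):
    """Walk stages in pipeline order; return the first exceeding threshold."""
    for name in c_stages:  # dicts preserve insertion (pipeline) order
        max_diff = np.abs(c_stages[name] - ref_stages[name]).max()
        print(f"{name}: max_diff={max_diff:.2e}")
        if max_diff > threshold:
            return name
    return None

rng = np.random.default_rng(0)
stages = ["ln1_out", "q", "k", "scores", "attn_out", "proj_out"]
ref = {s: rng.standard_normal((4, 8)) for s in stages}
# "C" buffers: reference plus tiny float noise, with one injected bug
c = {s: v + rng.standard_normal(v.shape) * 1e-6 for s, v in ref.items()}
c["scores"] += 0.5  # fake attention-scores bug

culprit = first_divergent_stage(c, ref)
assert culprit == "scores"  # everything before the bug stays within tolerance
```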
## Level 4: Full Forward Parity

Runs all layers and compares final logits to PyTorch.

```
$ make smollm-forward SMOLLM_TEXT="Hello world" SMOLLM_CONTEXT=5
Logits diff: max_abs=3.05e-05 mean_abs=1.23e-05
pos 0 top1: C=28   Torch=28   [OK]
pos 1 top1: C=1715 Torch=1715 [OK]
pos 2 top1: C=504  Torch=504  [OK]
pos 3 top1: C=33   Torch=33   [OK]
pos 4 top1: C=29   Torch=29   [OK]
```
### What It Catches
- Weight loading/conversion bugs (HF → bump format)
- Layer chaining bugs (output of layer N → input of layer N+1)
- Error accumulation across many layers
- Context-length dependent bugs (RoPE at different positions)
## Level 5: Training Parity

The most comprehensive test: forward + backward + weight update.
### What It Tests

1. Initialize identical weights in C and PyTorch
2. Run forward pass → compare loss
3. Run backward pass → compute gradients
4. Apply SGD update → compare updated weights
5. Repeat for N steps → check error accumulation
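The five steps can be sketched with a toy linear model in NumPy. This is a simplified stand-in, not the project's harness: `loss_and_grad` plays both the C and the PyTorch side, which in the real test are independent implementations compared against each other.

```python
import numpy as np

def loss_and_grad(w, x, y):
    """Mean squared error of a linear map, plus its gradient in w."""
    err = x @ w - y
    return (err ** 2).mean(), 2.0 * x.T @ err / err.size

rng = np.random.default_rng(0)
x, y = rng.standard_normal((8, 4)), rng.standard_normal((8, 2))
w_init = rng.standard_normal((4, 2))

# 1. both sides start from identical weights
w_c, w_ref = w_init.copy(), w_init.copy()
lr = 0.1
for step in range(5):                             # 5. repeat for N steps
    loss_c, g_c = loss_and_grad(w_c, x, y)        # 2-3. "C" forward + backward
    loss_ref, g_ref = loss_and_grad(w_ref, x, y)  # 2-3. reference side
    assert abs(loss_c - loss_ref) < 1e-12         # 2. compare loss
    w_c -= lr * g_c                               # 4. SGD update, both sides
    w_ref -= lr * g_ref

assert np.abs(w_c - w_ref).max() < 1e-5           # 4. compare updated weights
```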
### Example Output (5 steps)

```
$ make tiny-parity TINY_STEPS=5
step 0 loss=5.567055
step 1 loss=5.557948
step 2 loss=5.548960
step 3 loss=5.540207
step 4 loss=5.531786
C loss: 5.531786 | Torch loss: 5.520767
Max weight diff: 8.535e-04
layer.0.wq max_diff=2.235e-08
layer.0.wk max_diff=2.980e-08
layer.0.wv max_diff=4.891e-06
...
```
### Expected Tolerances
| Steps | Max Weight Diff | Loss Diff | Status |
|---|---|---|---|
| 1 | < 1e-5 | < 1e-5 | Excellent |
| 5 | < 1e-3 | < 0.01 | Good |
| 20 | < 1e-2 | < 0.1 | Acceptable |
| 100+ | Compare curves | Same trend | Valid |
## Test Configurations

We test across multiple model configurations to catch edge cases.
| Config | Hidden | Heads | KV Heads | Layers | Purpose |
|---|---|---|---|---|---|
| tiny | 64 | 2 | 2 | 2 | Baseline |
| gqa | 128 | 8 | 2 | 2 | Grouped Query Attention |
| deep | 64 | 2 | 2 | 6 | Error accumulation |
| wide | 128 | 8 | 8 | 2 | Many heads |
| no_rope | 64 | 2 | 2 | 2 | Positional embeddings only |
| medium | 256 | 4 | 4 | 4 | Larger scale |
## Running the Full Test Suite

```
# Quick tests (< 1 minute)
./scripts/run_all_tests.sh quick

# Full tests (5-10 minutes)
./scripts/run_all_tests.sh full

# Stress tests (10+ minutes)
./scripts/run_all_tests.sh stress
```
## v6.6 Contract + Matrix Gates
Before heavy E2E/parity runs, v6.6 uses two explicit preflight gates: a static tooling-contract checker and a dynamic 3-model matrix checker.
```
# Primary v6.6 release gate (recommended)
make v6.6-gate

# Runtime-required parity matrix (when the llama parity runtime is present)
make v6.6-validate-parity-matrix-required

# Optional standalone gates
make v6.6-validate-contracts
make v6.6-kernel-map-gate
make v6.6-validate-matrix-smoke
make v6.6-validate-longdecode
```
Reports are also written to `version/v6.6/tools/contract_report_latest.json` and `version/v6.6/tools/model_matrix_report_latest.json`. When kernel maps change, regenerate them with `make v6.6-kernel-map-regenerate`.
## Sliding-Window Regression Coverage

Sliding-window attention now has an explicit contract test wired into the regular test suite, the llama.cpp parity flow, and nightly reporting.
```
# Run only the sliding-window contract check
make test-attention-sliding

# Included in the main kernel suite
make test

# Included in llama.cpp parity smoke/full flows
make llamacpp-parity
make llamacpp-parity-full
```
Implementation paths: `unittest/test_attention_sliding_contract.py`, `scripts/run_parity_smoketest.sh`, and `scripts/nightly_runner.py`.
## Case Study: Finding the RoPE Bug

This methodology found a subtle RoPE rotation convention bug that caused predictions to diverge at later context positions.
### Symptoms

- Kernel tests: PASSED (1e-5 precision)
- Layer tests: PASSED (1e-5 precision)
- Full forward: FAILED (predictions diverge at position 3+)
### Investigation

1. Kernel tests passed → kernels are correct for SOME convention
2. Layer tests passed → same story
3. But the full model failed → the bug is in the convention, not the math
4. Per-stage diff showed: q/k looked "correct" but used the wrong convention
### Root Cause

- Our RoPE rotated interleaved pairs: `(x[0], x[1]), (x[2], x[3]), ...`
- Llama/SmolLM rotate "rotate-half" pairs: `(x[0..half], x[half..dim])`

Both are mathematically valid RoPE, but the weights are trained for one convention.
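A minimal NumPy illustration of the two conventions, simplified to a single shared rotation angle (real RoPE uses per-pair frequencies). Both are valid rotations, so both preserve norms and both pass a convention-agnostic kernel test, yet their outputs differ because they pair different elements.

```python
import numpy as np

def rope_interleaved(x, angle):
    """Rotate adjacent pairs (x[0], x[1]), (x[2], x[3]), ... by angle."""
    out = np.empty_like(x)
    x0, x1 = x[0::2], x[1::2]
    out[0::2] = x0 * np.cos(angle) - x1 * np.sin(angle)
    out[1::2] = x0 * np.sin(angle) + x1 * np.cos(angle)
    return out

def rope_rotate_half(x, angle):
    """Rotate (x[i], x[i + dim/2]) pairs, the Llama/HF rotate-half layout."""
    half = x.size // 2
    x0, x1 = x[:half], x[half:]
    return np.concatenate([x0 * np.cos(angle) - x1 * np.sin(angle),
                           x0 * np.sin(angle) + x1 * np.cos(angle)])

x = np.arange(8, dtype=np.float64)
a, b = rope_interleaved(x, 0.3), rope_rotate_half(x, 0.3)

# Both are valid rotations (norm-preserving) but pair different elements,
# so outputs differ: weights trained for one convention break under the other.
assert np.isclose(np.linalg.norm(a), np.linalg.norm(x))
assert np.isclose(np.linalg.norm(b), np.linalg.norm(x))
assert not np.allclose(a, b)
```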
### Fix

1. Update the C kernel: `src/kernels/rope_kernels.c`
2. Update the Python references: `unittest/test_rope.py`, `scripts/smollm_layer_stage_diff.py`
3. Re-run all tests: ALL PASS
## Adding Tests for New Models
When adding support for a new model architecture, follow this checklist:
- Kernel tests: Add tests for any new kernels (e.g., sliding-window attention)
- Layer tests: Test the full layer if the architecture differs
- Weight conversion: Run `compare_bump_to_hf.py` to verify weights
- Forward parity: Run a forward pass and compare logits
- Training parity: Run 5-20 steps and compare weight updates
- Stress test: Run 100+ steps to check for numerical drift
## Numerical Precision Guidelines
| Comparison | Float32 Threshold | Notes |
|---|---|---|
| Kernel output | 1e-5 | Should be near machine epsilon |
| Layer output | 1e-4 | Small accumulation acceptable |
| Full forward (logits) | 1e-3 | 30 layers × 1e-5 = 3e-4 |
| 1-step gradients | 1e-5 | Should match very closely |
| Multi-step weights | 1e-3 | Errors compound over steps |
| Token embeddings grad | 1e-2 | Aggregates from all layers |
- Any diff > 1.0 → Bug, not numerical precision
- Diffs growing exponentially over steps → Numerical instability
- One specific layer/kernel much worse than others → Bug in that component
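These guidelines can be sketched as a small triage helper. The thresholds come from the table above; the component keys, labels, and function name are hypothetical, not part of the project.

```python
# Float32 thresholds from the table above (max-abs diff per comparison type).
THRESHOLDS = {
    "kernel": 1e-5,
    "layer": 1e-4,
    "logits": 1e-3,
    "grad_1step": 1e-5,
    "weights_multistep": 1e-3,
    "embedding_grad": 1e-2,
}

def classify_diff(component, max_diff):
    """Triage a max-abs diff: definite bug, pass, or worth investigating."""
    if max_diff > 1.0:
        return "bug"          # any diff > 1.0 is a bug, not precision
    if max_diff <= THRESHOLDS[component]:
        return "pass"
    return "investigate"      # above tolerance but plausibly numerical

assert classify_diff("kernel", 3e-6) == "pass"
assert classify_diff("logits", 2.5) == "bug"
assert classify_diff("layer", 5e-3) == "investigate"
```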