---
layout: default
title: Testing Methodology
permalink: /testing/
---

# Testing Methodology

Multi-level numerical parity testing to ensure correctness against PyTorch.

**Philosophy:** Every kernel, layer, and model must match PyTorch to machine precision. Bugs in numerical code compound across layers and steps: a 1e-6 error per layer becomes 1e-4 after 100 layers. We test at every level to catch bugs where they originate.

**Vision encoder parity note:** the Qwen3-VL encoder parity closure and postmortem now live at Vision Encoder Parity. That page documents the actual failure mode, the debugging sequence, and the regressions to watch for on future encoder ports.

## Testing Pyramid

Tests are organized in levels, from individual kernels up to full training loops. If a higher-level test fails but lower-level tests pass, the bug is in integration, not kernels.

| Level | Test | Scope | Catches | Tolerance |
|---|---|---|---|---|
| 5 | Training Parity | Forward + Backward + Weight Updates | Gradient accumulation, optimizer bugs | < 1e-3 |
| 4 | Full Forward Parity | All Layers → Compare Logits to PyTorch | Layer chaining, weight loading bugs | < 1e-3 |
| 3 | Per-Stage Layer Diff | RoPE → Attention → Projection → MLP | Stage-specific bugs (RoPE, attention) | < 1e-4 |
| 2 | Full Layer Tests | Complete Transformer Layer (Forward + Backward) | Kernel integration, buffer handling | < 1e-4 |
| 1 | Kernel Unit Tests | RoPE, RMSNorm, Attention, GEMM, SwiGLU, Cross-Entropy | Individual kernel bugs, math errors | < 1e-5 |

Levels build upward from the foundation (kernel unit tests) through integration (layer and per-stage tests) to end-to-end parity (full forward and training).

## Level 1: Kernel Unit Tests

Each kernel is tested independently against a PyTorch reference.

### What We Test

| Test File | Kernels Tested | Forward | Backward |
|---|---|---|---|
| test_rope.py | RoPE precompute, apply | ✓ | ✓ |
| test_rmsnorm.py | RMS normalization | ✓ | ✓ |
| test_attention.py | Causal attention (MHA, GQA) | ✓ | - |
| test_attention_backward.py | Attention gradients | - | ✓ |
| test_kv_cache_attention.py | Flash prefill + KV-cache decode attention | ✓ | - |
| test_gemm.py | Matrix multiplication | ✓ | - |
| test_mlp.py | Feed-forward layers | ✓ | ✓ |
| test_gelu.py | GELU activation | ✓ | ✓ |
| test_swiglu.py | SwiGLU activation | ✓ | ✓ |
| test_cross_entropy.py | Loss function | ✓ | ✓ |
| test_embedding.py | Token embeddings | ✓ | ✓ |

### How It Works

```
# Example: test_rope.py
1. Load C library via ctypes
2. Generate random input tensors
3. Run C kernel
4. Run PyTorch reference
5. Compare: max_diff should be < 1e-5
```
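The steps above can be sketched as follows. The real tests call the C kernel through ctypes; since that library is not available here, a NumPy RMSNorm stands in for the kernel side, so what the sketch illustrates is the parity-comparison pattern itself, not the actual C API:

```python
import numpy as np

def rmsnorm_kernel(x, w, eps=1e-6):
    # Stand-in for the C kernel that the real tests call through ctypes
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * w

def rmsnorm_reference(x, w, eps=1e-6):
    # Reference computed in float64, playing the PyTorch-reference role
    x64 = x.astype(np.float64)
    rms = np.sqrt((x64 ** 2).mean(axis=-1, keepdims=True) + eps)
    return ((x64 / rms) * w).astype(np.float32)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
w = rng.standard_normal(64).astype(np.float32)

max_diff = np.abs(rmsnorm_kernel(x, w) - rmsnorm_reference(x, w)).max()
assert max_diff < 1e-5, f"FAIL max_diff={max_diff:.3e}"
```

The key design choice is the float64 reference: it makes the comparison a test of the float32 implementation rather than of two equally noisy computations.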

### Running

```
make test              # Run all kernel unit tests
make test-bf16         # Run BF16 tests (skips on unsupported CPUs)
python unittest/test_rope.py  # Run specific test
```

## Level 2: Full Layer Tests

Tests a complete transformer layer with all kernels combined.

### What It Validates

```
# unittest/test_orchestration_layer.py
- Uses ck_layer_forward_rmsnorm_swiglu()
- Uses ck_layer_backward_rmsnorm_swiglu()
- Compares output and all intermediate buffers

# unittest/test_kv_cache_layer_decode.py
- Prefill with ck_layer_forward_rmsnorm_swiglu()
- Decode with ck_layer_forward_rmsnorm_swiglu_decode()
- Compares full-layer outputs (prefill+decode vs full forward)
```
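The prefill+decode-vs-full-forward check can be demonstrated with a toy single-head causal attention in NumPy (the function names here are illustrative, not the C API): attention over the full sequence must equal prefill over the first T-1 positions followed by one decode step against the cached K/V.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(q, k, v):
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    scores[np.triu_indices(T, k=1)] = -np.inf  # mask future positions
    return softmax(scores) @ v

rng = np.random.default_rng(1)
T, d = 6, 8
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))

full = causal_attention(q, k, v)                    # full forward over all T positions

prefill = causal_attention(q[:-1], k[:-1], v[:-1])  # prefill first T-1 positions
dec_scores = q[-1] @ k.T / np.sqrt(d)               # decode step sees the whole K/V cache
decode = softmax(dec_scores[None, :]) @ v

max_diff = np.abs(np.vstack([prefill, decode]) - full).max()
assert max_diff < 1e-12, f"prefill+decode mismatch: {max_diff:.3e}"
```

Because the decode step attends to every cached position, no causal mask is needed there; the two paths agree to floating-point precision.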

## Level 3: Per-Stage Layer Diff

Uses real model weights to test each stage within a layer.

### Why This Matters

Kernel tests use random weights. Real model weights may expose edge cases (specific value ranges, correlations) that random tests miss.

### What It Reports

```
$ make smollm-layer-diff SMOLLM_LAYER=0 SMOLLM_STAGE_DUMP=1

Stage-by-stage diffs for layer 0:
  ln1_out:    max_diff=1.53e-05
  q (post-rope): max_diff=2.10e-05
  k (post-rope): max_diff=1.89e-05
  scores:     max_diff=3.21e-05
  attn_out:   max_diff=2.87e-05
  proj_out:   max_diff=4.12e-05
  ln2_out:    max_diff=1.98e-05
  fc1_out:    max_diff=3.55e-05
  swiglu_out: max_diff=2.44e-05
  output:     max_diff=5.02e-05
```
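A report like the one above boils down to a small helper. This is a self-contained sketch (dict keys and shapes are illustrative): given per-stage activations from the C side and the reference side, print the max diff per stage and return the worst offender.

```python
import numpy as np

def stage_diff_report(stages_c, stages_ref, tol=1e-4):
    # Compare matching stages and flag any that exceed the tolerance
    worst_name, worst_diff = "", 0.0
    for name, a in stages_c.items():
        d = float(np.abs(a - stages_ref[name]).max())
        flag = "" if d < tol else "  <-- over tolerance, investigate this stage"
        print(f"  {name}: max_diff={d:.2e}{flag}")
        if d > worst_diff:
            worst_name, worst_diff = name, d
    return worst_name, worst_diff

rng = np.random.default_rng(2)
ref = {s: rng.standard_normal((4, 16)) for s in ["ln1_out", "attn_out", "output"]}
c_side = {s: v + 1e-6 * rng.standard_normal(v.shape) for s, v in ref.items()}
name, diff = stage_diff_report(c_side, ref)
assert diff < 1e-4  # noise-level diffs only; a real bug would blow past tol
```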

### Debugging Pattern

```
If q/k diverge after RoPE → RoPE bug
If scores/attn_out diverge → Attention bug
If proj diverges while attn_out clean → Output projection bug
If mlp_out diverges → MLP/SwiGLU bug
```

## Level 4: Full Forward Parity

Runs all layers and compares final logits to PyTorch.

```
$ make smollm-forward SMOLLM_TEXT="Hello world" SMOLLM_CONTEXT=5

Logits diff: max_abs=3.05e-05 mean_abs=1.23e-05
pos 0 top1: C=28 Torch=28 [OK]
pos 1 top1: C=1715 Torch=1715 [OK]
pos 2 top1: C=504 Torch=504 [OK]
pos 3 top1: C=33 Torch=33 [OK]
pos 4 top1: C=29 Torch=29 [OK]
```
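The per-position top-1 check in the output above can be sketched as a small helper (a self-contained illustration; the real comparison runs inside the make target):

```python
import numpy as np

def compare_logits(logits_c, logits_ref):
    # Report absolute diffs, then check per-position argmax agreement
    diff = np.abs(logits_c - logits_ref)
    print(f"Logits diff: max_abs={diff.max():.2e} mean_abs={diff.mean():.2e}")
    all_match = True
    for pos in range(logits_c.shape[0]):
        a, b = int(logits_c[pos].argmax()), int(logits_ref[pos].argmax())
        status = "OK" if a == b else "MISMATCH"
        print(f"pos {pos} top1: C={a} Torch={b} [{status}]")
        all_match = all_match and (a == b)
    return all_match

rng = np.random.default_rng(3)
ref = rng.standard_normal((5, 100)).astype(np.float32)
assert compare_logits(ref + 1e-6, ref)  # a uniform offset never changes the argmax
```

Top-1 agreement is a coarser but stricter-in-practice check than the raw diff: a small logits diff that flips the argmax is still a bug from the user's point of view.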

### What It Catches

- Layer chaining bugs
- Weight loading bugs

## Level 5: Training Parity

The most comprehensive test: forward + backward + weight update.

### What It Tests

```
1. Initialize identical weights in C and PyTorch
2. Run forward pass → compare loss
3. Run backward pass → compute gradients
4. Apply SGD update → compare updated weights
5. Repeat for N steps → check error accumulation
```
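The loop above can be sketched with a deliberately tiny model. In this self-contained stand-in, a float32 SGD loop plays the C side and a float64 loop plays the PyTorch side; both start from identical weights and are compared after every update, which is exactly the structure of the real test:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((32, 8)).astype(np.float32)
y = rng.standard_normal(32).astype(np.float32)
w32 = np.zeros(8, dtype=np.float32)   # "C" weights (float32)
w64 = np.zeros(8, dtype=np.float64)   # "reference" weights (float64)
lr = 0.01

for step in range(5):
    # Identical forward/backward/update in both precisions (MSE gradient)
    g32 = 2 * X.T @ (X @ w32 - y) / len(y)
    g64 = 2 * X.astype(np.float64).T @ (X.astype(np.float64) @ w64 - y) / len(y)
    w32 -= (lr * g32).astype(np.float32)
    w64 -= lr * g64
    print(f"step {step} weight max_diff={np.abs(w32 - w64).max():.3e}")

# Drift stays tiny over a few steps but grows with step count,
# which is why the tolerances below loosen as N increases.
assert np.abs(w32 - w64).max() < 1e-5
```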

### Example Output (5 steps)

```
$ make tiny-parity TINY_STEPS=5

step 0 loss=5.567055
step 1 loss=5.557948
step 2 loss=5.548960
step 3 loss=5.540207
step 4 loss=5.531786

C loss: 5.531786 | Torch loss: 5.520767
Max weight diff: 8.535e-04

layer.0.wq   max_diff=2.235e-08
layer.0.wk   max_diff=2.980e-08
layer.0.wv   max_diff=4.891e-06
...
```

### Expected Tolerances

| Steps | Max Weight Diff | Loss Diff | Status |
|---|---|---|---|
| 1 | < 1e-5 | < 1e-5 | Excellent |
| 5 | < 1e-3 | < 0.01 | Good |
| 20 | < 1e-2 | < 0.1 | Acceptable |
| 100+ | Compare curves | Same trend | Valid |

## Test Configurations

We test across multiple model configurations to catch edge cases.

| Config | Hidden | Heads | KV Heads | Layers | Purpose |
|---|---|---|---|---|---|
| tiny | 64 | 2 | 2 | 2 | Baseline |
| gqa | 128 | 8 | 2 | 2 | Grouped Query Attention |
| deep | 64 | 2 | 2 | 6 | Error accumulation |
| wide | 128 | 8 | 8 | 2 | Many heads |
| no_rope | 64 | 2 | 2 | 2 | Positional embeddings only |
| medium | 256 | 4 | 4 | 4 | Larger scale |
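The matrix above can be expressed as data with its structural invariants. This is a sketch only (the real suite drives these configurations through make targets, not this dict):

```python
# Configuration matrix from the table, as plain data
CONFIGS = {
    "tiny":    dict(hidden=64,  heads=2, kv_heads=2, layers=2),  # baseline
    "gqa":     dict(hidden=128, heads=8, kv_heads=2, layers=2),  # grouped-query attention
    "deep":    dict(hidden=64,  heads=2, kv_heads=2, layers=6),  # error accumulation
    "wide":    dict(hidden=128, heads=8, kv_heads=8, layers=2),  # many heads
    "no_rope": dict(hidden=64,  heads=2, kv_heads=2, layers=2),  # positional embeddings only
    "medium":  dict(hidden=256, heads=4, kv_heads=4, layers=4),  # larger scale
}

for name, cfg in CONFIGS.items():
    assert cfg["hidden"] % cfg["heads"] == 0, name    # head_dim must be integral
    assert cfg["heads"] % cfg["kv_heads"] == 0, name  # GQA needs an even head grouping
```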

## Running the Full Test Suite

```
# Quick tests (< 1 minute)
./scripts/run_all_tests.sh quick

# Full tests (5-10 minutes)
./scripts/run_all_tests.sh full

# Stress tests (10+ minutes)
./scripts/run_all_tests.sh stress
```

## v6.6 Contract + Matrix Gates

Before heavy E2E/parity runs, v6.6 uses two explicit preflight gates: a static tooling-contract checker and a dynamic 3-model matrix checker.

```
# Primary v6.6 release gate (recommended)
make v6.6-gate

# Runtime-required parity matrix (when llama parity runtime is present)
make v6.6-validate-parity-matrix-required

# Optional standalone gates
make v6.6-validate-contracts
make v6.6-kernel-map-gate
make v6.6-validate-matrix-smoke
make v6.6-validate-longdecode
```

Reports are also written to `version/v6.6/tools/contract_report_latest.json` and `version/v6.6/tools/model_matrix_report_latest.json`. When kernel maps change, regenerate them with `make v6.6-kernel-map-regenerate`.


## Sliding-Window Regression Coverage

Sliding-window attention now has an explicit contract test that is wired into the regular `make test` suite, the llama.cpp parity flow, and nightly reporting.

```
# Run only sliding-window contract check
make test-attention-sliding

# Included in main kernel suite
make test

# Included in llama.cpp parity smoke/full flows
make llamacpp-parity
make llamacpp-parity-full
```

Implementation paths: `unittest/test_attention_sliding_contract.py`, `scripts/run_parity_smoketest.sh`, and `scripts/nightly_runner.py`.


## Case Study: Finding the RoPE Bug

This methodology found a subtle RoPE rotation convention bug that caused predictions to diverge at later context positions.

### Symptoms

```
- Kernel tests: PASSED (1e-5 precision)
- Layer tests: PASSED (1e-5 precision)
- Full forward: FAILED (predictions diverge at position 3+)
```

### Investigation

```
1. Kernel tests passed → kernels are correct for SOME convention
2. Layer tests passed → same story
3. But full model failed → bug is in convention, not math
4. Per-stage diff showed: q/k looked "correct" but were wrong convention
```

### Root Cause

```
Our RoPE:     (x[0], x[1]), (x[2], x[3]), ...  (interleaved pairs)
Llama/SmolLM: (x[0..half], x[half..dim])       (rotate-half)

Both are mathematically valid RoPE, but weights are trained for one convention.
```
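The two conventions can be written out side by side in NumPy. Both are valid rotations (they preserve the norm), but they pair dimensions differently, so a checkpoint trained under one convention produces wrong activations under the other:

```python
import numpy as np

def rope_interleaved(x, theta):
    # Rotates adjacent pairs: (x[0], x[1]), (x[2], x[3]), ...
    c, s = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * c - x[1::2] * s
    out[1::2] = x[0::2] * s + x[1::2] * c
    return out

def rope_rotate_half(x, theta):
    # Rotates split-half pairs: (x[i], x[i + dim/2]) -- Llama/SmolLM convention
    half = len(x) // 2
    c, s = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[:half] = x[:half] * c - x[half:] * s
    out[half:] = x[:half] * s + x[half:] * c
    return out

x = np.arange(8, dtype=np.float64)
theta = np.full(4, 0.5)  # one angle per pair (dim/2 angles)
a, b = rope_interleaved(x, theta), rope_rotate_half(x, theta)

# Both are rotations (norm preserved), yet the outputs differ
assert np.isclose(np.linalg.norm(a), np.linalg.norm(x))
assert np.isclose(np.linalg.norm(b), np.linalg.norm(x))
assert not np.allclose(a, b)
```

This is why kernel tests against a same-convention reference pass while the full model fails: the math is right, the pairing is wrong.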

### Fix

```
1. Update C kernel: src/kernels/rope_kernels.c
2. Update Python references: unittest/test_rope.py, scripts/smollm_layer_stage_diff.py
3. Re-run all tests: ALL PASS
```

## Adding Tests for New Models

When adding support for a new model architecture, follow this checklist:

  1. Kernel tests: Add tests for any new kernels (e.g., sliding window attention)
  2. Layer tests: Test full layer if architecture differs
  3. Weight conversion: Run compare_bump_to_hf.py to verify weights
  4. Forward parity: Run forward pass and compare logits
  5. Training parity: Run 5-20 steps and compare weight updates
  6. Stress test: Run 100+ steps to check for numerical drift

## Numerical Precision Guidelines

| Comparison | Float32 Threshold | Notes |
|---|---|---|
| Kernel output | 1e-5 | Should be near machine epsilon |
| Layer output | 1e-4 | Small accumulation acceptable |
| Full forward (logits) | 1e-3 | 30 layers × 1e-5 = 3e-4 |
| 1-step gradients | 1e-5 | Should match very closely |
| Multi-step weights | 1e-3 | Errors compound over steps |
| Token embeddings grad | 1e-2 | Aggregates from all layers |

**Red Flags:**

- Any diff > 1.0 → bug, not numerical precision
- Diffs growing exponentially over steps → numerical instability
- One specific layer/kernel much worse than others → bug in that component
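The red-flag rules above lend themselves to a tiny triage helper. This is an illustrative sketch, not part of the test suite; the growth-ratio threshold of 2.0 is an assumption:

```python
import numpy as np

def triage(diffs):
    """Classify a sequence of per-step max diffs: bug / instability / noise."""
    diffs = np.asarray(diffs, dtype=np.float64)
    if diffs.max() > 1.0:
        return "bug"  # far beyond anything numerical precision can explain
    # Exponential growth: every consecutive ratio exceeds the growth threshold
    ratios = diffs[1:] / np.maximum(diffs[:-1], 1e-30)
    if len(ratios) and np.all(ratios > 2.0):
        return "numerical instability"
    return "precision noise"

assert triage([1e-6, 2e-6, 3e-6]) == "precision noise"          # slow, linear drift
assert triage([1e-6, 1e-5, 1e-4, 1e-3]) == "numerical instability"  # 10x per step
assert triage([0.5, 2.0]) == "bug"                              # diff > 1.0
```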