---
layout: default
title: Testing Methodology
permalink: /testing/
---

Testing Methodology

Multi-level numerical parity testing to ensure correctness against PyTorch.

Philosophy: Every kernel, layer, and model must match PyTorch to within float32 rounding tolerance. Bugs in numerical code compound across layers and steps: a 1e-6 error per layer can reach 1e-4 after 100 layers if the errors add up linearly. We test at every level to catch bugs where they originate.
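The accumulation figure can be checked with a few lines of arithmetic. This is a minimal sketch, not project code: the function names are hypothetical, and the square-root estimate assumes per-layer errors are independent and zero-mean, which real kernels may not satisfy.

```python
import math


def worst_case_error(per_layer_error: float, num_layers: int) -> float:
    """Bound when every layer contributes an error of the same sign."""
    return per_layer_error * num_layers


def independent_error(per_layer_error: float, num_layers: int) -> float:
    """Typical size when per-layer errors are independent and zero-mean."""
    return per_layer_error * math.sqrt(num_layers)


# 100 layers at 1e-6 each: the linear bound is the figure quoted in the text.
print(f"worst case:  {worst_case_error(1e-6, 100):.1e}")
print(f"independent: {independent_error(1e-6, 100):.1e}")
```

The gap between the two estimates is why a diff that tracks the linear bound across layers is worth investigating as a systematic error rather than rounding noise.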

Vision encoder parity note: the Qwen3-VL encoder parity closure and postmortem now live at Vision Encoder Parity. That page documents the actual failure mode, the debugging sequence, and the regressions to watch for on future encoder ports.

Stitched divergence harness: backend-level parity now uses a staged harness that compares CK against llama.cpp/mtmd for GGUF paths and can be extended to PyTorch when a matching tensor-dump adapter exists. See Stitched Divergence Harness for the backend contract, dump format, and first-divergence workflow.

Quantized inference performance gate: BF16 kernels are primarily the training / BF16-serving path. Production CPU inference for GGUF models is usually dominated by quantized Q4_K/Q5_K/Q6_K/Q8_K GEMV/GEMM, activation quantization, head-major output projection, and threadpool orchestration. Kernel changes in those paths must be checked against llama.cpp for both parity and speed before they are considered production-ready.

No-root Quantized Performance Workflow

On shared Xeon systems where perf, VTune, or Advisor are unavailable, use repeatable CK-vs-llama.cpp timing gates and kernel-scoped benchmarks instead of hardware counters. Keep BF16/AMX training experiments separate from quantized inference patches unless the model path actually uses BF16 tensors.

LLAMA_CPP_DIR=/opt/app-root/src/Software/llama.cpp make test-gemv-comprehensive-quick - quick Q4/Q5/Q8 GEMV parity and speed against llama.cpp.
make test-gemv-omp-quick - serial vs OpenMP GEMV parity, useful for proving when OpenMP is not the right production scheduler.
make test-threadpool-parity-quick - production threadpool dispatch parity and speed for GEMV/GEMM decode/prefill shapes.
LLAMA_CPP_DIR=/opt/app-root/src/Software/llama.cpp make test-head-major-q5-vs-llama-quick - head-major Q5/Q8 output projection against llama.cpp.
make llamacpp-parity-nightly - nightly llama.cpp parity profile; includes the quantized GEMV quick lane, threadpool dispatch, DeltaNet ISA, and head-major Q5 checks.
make llamacpp-parity-perf or make llamacpp-parity-perf-large - longer performance runs when changing shared quantized kernels.
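When hardware counters are unavailable, a timing gate reduces to repeated wall-clock measurements and a ratio check. The sketch below illustrates that idea only; `median_time`, `passes_speed_gate`, the 5% slowdown budget, and the stand-in workload are all hypothetical and are not taken from the make targets above.

```python
import statistics
import time


def median_time(fn, repeats=7, warmup=2):
    """Median wall-clock seconds of fn() over several runs, after warm-up."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def passes_speed_gate(candidate_s, baseline_s, max_slowdown=1.05):
    """True when the candidate is at most max_slowdown times the baseline."""
    return candidate_s <= baseline_s * max_slowdown


def workload():
    # Hypothetical stand-in for a kernel call; replace with the real benchmark.
    return sum(i * i for i in range(50_000))


elapsed = median_time(workload)
print(f"median: {elapsed * 1e3:.3f} ms")
```

The median is used instead of the mean so that a single noisy run on a shared machine does not decide the gate.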

Current optimization priority for inference is shared quantized kernels and orchestration: gemv_q4_k_q8_k, gemv_q6_k_q8_k, gemm_nt_q4_k_q8_k, Q8_K activation quantization, final logits / head-major projection, and row-sweep dispatch through the persistent threadpool. AMX BF16 work is valuable for BF16 training and BF16 serving, but it is not a substitute for measuring the quantized decode path used by Qwen, Gemma, Nanbeige, and vision GGUFs.

Q6_K prefill performance note: the v8 Q6_K x Q8_K 2D prefill scheduler is shape-gated based on local roofline-style measurements. See Q6_K Prefill Roofline Notes for the measured shapes, commands, and dispatch policy. For the broader GEMM/GEMV production-kernel model, see GEMM vs GEMV in Production LLM Serving.

Testing Pyramid

Tests are organized in levels, from individual kernels up to full training loops. If a higher-level test fails but lower-level tests pass, the bug is in integration, not kernels.

Level	Test	Scope	Catches	Tolerance
5	Training Parity	Forward + Backward + Weight Updates	Gradient accumulation, optimizer bugs	< 1e-3
4	Full Forward Parity	All Layers → Compare Logits to PyTorch	Layer chaining, weight loading bugs	< 1e-3
3	Per-Stage Layer Diff	RoPE → Attention → Projection → MLP	Stage-specific bugs (RoPE, attention)	< 1e-4
2	Full Layer Tests	Complete Transformer Layer (Forward + Backward)	Kernel integration, buffer handling	< 1e-4
1	Kernel Unit Tests	RoPE, RMSNorm, Attention, GEMM, SwiGLU, Cross-Entropy	Individual kernel bugs, math errors	< 1e-5

Tiers, bottom to top: Foundation → Integration → End-to-End. Build up from the foundation.

Level 1: Kernel Unit Tests

Each kernel is tested independently against a PyTorch reference.

What We Test

Test File	Kernels Tested	Forward	Backward
`test_rope.py`	RoPE precompute, apply	✓	✓
`test_rmsnorm.py`	RMS normalization	✓	✓
`test_attention.py`	Causal attention (MHA, GQA)	✓	-
`test_attention_backward.py`	Attention gradients	-	✓
`test_kv_cache_attention.py`	Flash prefill + KV-cache decode attention	✓	-
`test_gemm.py`	Matrix multiplication	✓	-
`test_mlp.py`	Feed-forward layers	✓	✓
`test_gelu.py`	GELU activation	✓	✓
`test_swiglu.py`	SwiGLU activation	✓	✓
`test_cross_entropy.py`	Loss function	✓	✓
`test_embedding.py`	Token embeddings	✓	✓

How It Works

# Example: test_rope.py
1. Load C library via ctypes
2. Generate random input tensors
3. Run C kernel
4. Run PyTorch reference
5. Compare: max_diff should be < 1e-5
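The five steps can be sketched without the C library or PyTorch installed. In this illustration the "kernel" is a hypothetical float32-rounded stand-in and the "reference" is a float64 implementation; a real test would call the ctypes-loaded kernel and the PyTorch op instead. RMSNorm is used only as an example, and `eps=1e-6` is an assumed value.

```python
import math
import random
import struct


def to_f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def rmsnorm_reference(x, weight, eps=1e-6):
    """Float64 reference, standing in for the PyTorch implementation."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def rmsnorm_kernel_standin(x, weight, eps=1e-6):
    """Hypothetical stand-in for the C kernel: same math, float32 rounding."""
    acc = 0.0
    for v in x:
        acc = to_f32(acc + to_f32(v * v))
    scale = to_f32(1.0 / math.sqrt(to_f32(acc / len(x)) + eps))
    return [to_f32(to_f32(v * scale) * w) for v, w in zip(x, weight)]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two sequences."""
    return max(abs(p - q) for p, q in zip(a, b))


rng = random.Random(0)  # fixed seed so a failure is reproducible
x = [to_f32(rng.uniform(-1.0, 1.0)) for _ in range(64)]
w = [to_f32(rng.uniform(0.5, 1.5)) for _ in range(64)]

diff = max_abs_diff(rmsnorm_kernel_standin(x, w), rmsnorm_reference(x, w))
assert diff < 1e-5, f"parity failure: max_diff={diff:.3e}"
print(f"max_diff={diff:.3e}")
```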

Running

make test              # Run all kernel unit tests
make test-bf16         # Run BF16 tests (skips on unsupported CPUs)
python unittest/test_rope.py  # Run specific test

Level 2: Full Layer Tests

Tests a complete transformer layer with all kernels combined.

What It Validates

RMSNorm → QKV projection → RoPE → Attention → Output projection → Residual
RMSNorm → MLP (SwiGLU) → Residual
Forward AND backward through entire layer

# unittest/test_orchestration_layer.py
- Uses ck_layer_forward_rmsnorm_swiglu()
- Uses ck_layer_backward_rmsnorm_swiglu()
- Compares output and all intermediate buffers

# unittest/test_kv_cache_layer_decode.py
- Prefill with ck_layer_forward_rmsnorm_swiglu()
- Decode with ck_layer_forward_rmsnorm_swiglu_decode()
- Compares full-layer outputs (prefill+decode vs full forward)

Level 3: Per-Stage Layer Diff

Uses real model weights to test each stage within a layer.

Why This Matters

Kernel tests use random weights. Real model weights may expose edge cases (specific value ranges, correlations) that random tests miss.

What It Reports

$ make smollm-layer-diff SMOLLM_LAYER=0 SMOLLM_STAGE_DUMP=1

Stage-by-stage diffs for layer 0:
  ln1_out:    max_diff=1.53e-05
  q (post-rope): max_diff=2.10e-05
  k (post-rope): max_diff=1.89e-05
  scores:     max_diff=3.21e-05
  attn_out:   max_diff=2.87e-05
  proj_out:   max_diff=4.12e-05
  ln2_out:    max_diff=1.98e-05
  fc1_out:    max_diff=3.55e-05
  swiglu_out: max_diff=2.44e-05
  output:     max_diff=5.02e-05

Debugging Pattern

If q/k diverge after RoPE → RoPE bug
If scores/attn_out diverge → Attention bug
If proj_out diverges while attn_out is clean → Output projection bug
If fc1_out or swiglu_out diverges → MLP/SwiGLU bug
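The pattern amounts to walking the stages in execution order and stopping at the first one whose diff exceeds the tolerance. The sketch below assumes the stage names from the dump above; the `SUSPECTS` mapping (including the RMSNorm entries and the MLP stage names), the `1e-4` tolerance, and the `bad_run` values are illustrative assumptions, not tool output.

```python
STAGE_ORDER = [
    "ln1_out", "q (post-rope)", "k (post-rope)", "scores", "attn_out",
    "proj_out", "ln2_out", "fc1_out", "swiglu_out", "output",
]

# Assumed mapping from the first diverging stage to the likely culprit.
SUSPECTS = {
    "ln1_out": "RMSNorm",
    "q (post-rope)": "RoPE",
    "k (post-rope)": "RoPE",
    "scores": "Attention",
    "attn_out": "Attention",
    "proj_out": "Output projection",
    "ln2_out": "RMSNorm",
    "fc1_out": "MLP/SwiGLU",
    "swiglu_out": "MLP/SwiGLU",
    "output": "Residual / layer output",
}


def first_divergence(stage_diffs, tol=1e-4):
    """Return the earliest stage whose max_diff exceeds tol, or None."""
    for stage in STAGE_ORDER:
        if stage_diffs.get(stage, 0.0) > tol:
            return stage
    return None


# Values copied from the layer-0 dump shown above.
good_run = {
    "ln1_out": 1.53e-05, "q (post-rope)": 2.10e-05, "k (post-rope)": 1.89e-05,
    "scores": 3.21e-05, "attn_out": 2.87e-05, "proj_out": 4.12e-05,
    "ln2_out": 1.98e-05, "fc1_out": 3.55e-05, "swiglu_out": 2.44e-05,
    "output": 5.02e-05,
}

# Hypothetical failing run: everything from "scores" onward is inflated.
bad_run = dict(good_run)
for stage in STAGE_ORDER[STAGE_ORDER.index("scores"):]:
    bad_run[stage] = 3.0e-02

stage = first_divergence(bad_run)
print(stage, "->", SUSPECTS[stage])
```

Only the first diverging stage is reported because every later stage inherits its error.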

Level 4: Full Forward Parity

Runs all layers and compares final logits to PyTorch.

$ make smollm-forward SMOLLM_TEXT="Hello world" SMOLLM_CONTEXT=5

Logits diff: max_abs=3.05e-05 mean_abs=1.23e-05
pos 0 top1: C=28 Torch=28 [OK]
pos 1 top1: C=1715 Torch=1715 [OK]
pos 2 top1: C=504 Torch=504 [OK]
pos 3 top1: C=33 Torch=33 [OK]
pos 4 top1: C=29 Torch=29 [OK]
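A report of this shape can be produced from two logit matrices with a flat diff and a per-position argmax. This is a sketch under assumptions: `compare_logits` is a hypothetical helper, the logits are made-up toy values, and the `[MISMATCH]` label is invented for illustration since the output above only shows `[OK]`.

```python
def argmax(row):
    """Index of the largest value in a list of floats."""
    return max(range(len(row)), key=row.__getitem__)


def compare_logits(c_logits, ref_logits):
    """Summarize element-wise diffs and per-position top-1 agreement."""
    abs_diffs = [
        abs(c - r)
        for c_row, r_row in zip(c_logits, ref_logits)
        for c, r in zip(c_row, r_row)
    ]
    return {
        "max_abs": max(abs_diffs),
        "mean_abs": sum(abs_diffs) / len(abs_diffs),
        "top1": [(argmax(c), argmax(r)) for c, r in zip(c_logits, ref_logits)],
    }


# Toy logits: 2 positions, vocabulary of 3 (hypothetical values).
c_logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3]]
ref_logits = [[0.1, 2.00002, -1.0], [1.49999, 0.2, 0.3]]

report = compare_logits(c_logits, ref_logits)
print(f"Logits diff: max_abs={report['max_abs']:.2e} mean_abs={report['mean_abs']:.2e}")
for pos, (c_top, ref_top) in enumerate(report["top1"]):
    status = "OK" if c_top == ref_top else "MISMATCH"
    print(f"pos {pos} top1: C={c_top} Torch={ref_top} [{status}]")
```

Checking top-1 agreement alongside the raw diff matters because a small logit error can still flip the prediction when two candidates are nearly tied.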

What It Catches

Weight loading/conversion bugs (HF → bump format)
Layer chaining bugs (output of layer N → input of layer N+1)
Error accumulation across many layers
Context-length dependent bugs (RoPE at different positions)

Level 5: Training Parity

The most comprehensive test: forward + backward + weight update.

What It Tests

1. Initialize identical weights in C and PyTorch
2. Run forward pass → compare loss
3. Run backward pass → compare gradients
4. Apply SGD update → compare updated weights
5. Repeat for N steps → check error accumulation
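The loop can be illustrated on a two-weight linear model, with a float32-rounded implementation standing in for the C side and plain float64 standing in for PyTorch. Everything here (the model, the data, the learning rate, and the helper names) is a hypothetical sketch of the procedure, not the project's training harness.

```python
import struct


def f32(x):
    """Round to float32; simulates the lower-precision implementation."""
    return struct.unpack("f", struct.pack("f", x))[0]


def identity(x):
    """No rounding; the float64 reference path."""
    return x


def sgd_step(w, xs, ys, lr, rnd):
    """One full-batch SGD step on mean squared error for y = w . x.

    Returns (updated weights, loss before the update).
    """
    n = len(xs)
    grad = [0.0] * len(w)
    loss = 0.0
    for x, y in zip(xs, ys):
        pred = rnd(sum(rnd(wi * xi) for wi, xi in zip(w, x)))
        err = rnd(pred - y)
        loss = rnd(loss + rnd(err * err) / n)
        for j, xj in enumerate(x):
            grad[j] = rnd(grad[j] + rnd(2.0 * err * xj) / n)
    return [rnd(wi - lr * gi) for wi, gi in zip(w, grad)], loss


XS = [[1.0, 2.0], [2.0, 1.0], [3.0, -1.0]]
YS = [1.0, 0.5, -0.5]


def run_parity(steps, lr=0.05):
    """Train both implementations from identical weights; record the diffs."""
    w_c, w_ref = [0.5, -0.25], [0.5, -0.25]
    history = []
    for _ in range(steps):
        w_c, loss_c = sgd_step(w_c, XS, YS, lr, f32)
        w_ref, loss_ref = sgd_step(w_ref, XS, YS, lr, identity)
        max_w_diff = max(abs(a - b) for a, b in zip(w_c, w_ref))
        history.append((loss_c, loss_ref, max_w_diff))
    return history


history = run_parity(steps=5)
for step, (loss_c, loss_ref, max_w_diff) in enumerate(history):
    print(f"step {step} loss={loss_c:.6f} ref={loss_ref:.6f} max_weight_diff={max_w_diff:.3e}")
```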

Example Output (5 steps)

$ make tiny-parity TINY_STEPS=5

step 0 loss=5.567055
step 1 loss=5.557948
step 2 loss=5.548960
step 3 loss=5.540207
step 4 loss=5.531786

C loss: 5.531786 | Torch loss: 5.520767
Max weight diff: 8.535e-04

layer.0.wq   max_diff=2.235e-08
layer.0.wk   max_diff=2.980e-08
layer.0.wv   max_diff=4.891e-06
...

Expected Tolerances

Steps	Max Weight Diff	Loss Diff	Status
1	< 1e-5	< 1e-5	Excellent
5	< 1e-3	< 0.01	Good
20	< 1e-2	< 0.1	Acceptable
100+	Compare curves	Same trend	Valid
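The table can be encoded as a lookup so a parity script applies the right limits for its step count. This is a hedged sketch: the function names are hypothetical, and two choices are assumptions not stated in the table, namely that intermediate step counts use the next row up and that anything beyond 20 steps falls back to curve comparison.

```python
# (max steps, max weight diff, max loss diff), copied from the table above.
TRAINING_TOLERANCES = [
    (1, 1e-5, 1e-5),
    (5, 1e-3, 1e-2),
    (20, 1e-2, 1e-1),
]


def training_tolerance(steps):
    """Return (weight_limit, loss_limit), or None when curves are compared."""
    for max_steps, weight_limit, loss_limit in TRAINING_TOLERANCES:
        if steps <= max_steps:
            return weight_limit, loss_limit
    return None


def within_tolerance(steps, weight_diff, loss_diff):
    """Check measured diffs against the limits for this step count."""
    limits = training_tolerance(steps)
    if limits is None:
        raise ValueError("beyond 20 steps, compare loss curves instead")
    weight_limit, loss_limit = limits
    return weight_diff < weight_limit and loss_diff < loss_limit


print(training_tolerance(5))
print(training_tolerance(100))
```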

Test Configurations

We test across multiple model configurations to catch edge cases.

Config	Hidden	Heads	KV Heads	Layers	Purpose
tiny	64	2	2	2	Baseline
gqa	128	8	2	2	Grouped Query Attention
deep	64	2	2	6	Error accumulation
wide	128	8	8	2	Many heads
no_rope	64	2	2	2	Positional embeddings only
medium	256	4	4	4	Larger scale
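The configurations can be written down as data and sanity-checked before any test runs. The sketch below uses the values from the table; the `ModelConfig` class is hypothetical, and deriving the head dimension as `hidden // heads` is an assumption that holds for these configs but not for every architecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    name: str
    hidden: int
    heads: int
    kv_heads: int
    layers: int

    @property
    def head_dim(self) -> int:
        # Assumption: head dimension is hidden size divided by head count.
        return self.hidden // self.heads

    @property
    def queries_per_kv(self) -> int:
        """Query heads sharing each KV head (1 for MHA, >1 for GQA)."""
        return self.heads // self.kv_heads

    def validate(self) -> None:
        if self.hidden % self.heads:
            raise ValueError(f"{self.name}: hidden not divisible by heads")
        if self.heads % self.kv_heads:
            raise ValueError(f"{self.name}: heads not divisible by kv_heads")


CONFIGS = [
    ModelConfig("tiny", 64, 2, 2, 2),
    ModelConfig("gqa", 128, 8, 2, 2),
    ModelConfig("deep", 64, 2, 2, 6),
    ModelConfig("wide", 128, 8, 8, 2),
    ModelConfig("no_rope", 64, 2, 2, 2),
    ModelConfig("medium", 256, 4, 4, 4),
]

for cfg in CONFIGS:
    cfg.validate()
    print(f"{cfg.name}: head_dim={cfg.head_dim} queries_per_kv={cfg.queries_per_kv}")
```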

Running the Full Test Suite

# Quick tests (< 1 minute)
./scripts/run_all_tests.sh quick

# Full tests (5-10 minutes)
./scripts/run_all_tests.sh full

# Stress tests (10+ minutes)
./scripts/run_all_tests.sh stress

Current Version Gates

Active testing is split by runtime responsibility. v8 is the promoted inference/runtime lane, while v7 remains the promoted training/backprop lane. v6.6 is retained for historical compatibility checks, but it is no longer a default pre-push or nightly blocking gate.

# Local pre-push/pre-flight shape
git push

# v8 inference/runtime contract lane
make nightly-inference
make v8-validate-contracts
make v8-kernel-map-contracts
make v8-regression-fast

# v7 training/backprop lane
make v7-train-ir-smoke
make v7-kernel-parity-train
make regression-training-full

# Historical/manual v6.6 compatibility lane
make nightly-archive
make v6.6-gate

The v8 inference regression manifest currently covers Gemma3, Qwen2, Qwen3, Qwen3.5, and Nanbeige. Qwen3-VL is covered by the explicit vision smoke lane because the 8B decoder + mmproj path needs cached artifacts and more RAM than small CI runners usually provide. Gemma4, Nemotron, and GLM4 high-memory smokes stay in the inference lane but skip unless enough RAM is available. v7 training gates cover train IR, backward synthesis, layout/memory audit, and training-kernel parity. v6.6 reports still write under version/v6.6/tools/ when run manually.

Lane	Primary command	Purpose	Status
v8 inference	`make nightly-inference`	Runtime contracts, kernel-map contracts, model-family smoke/coherence, decoder matrix.	Default active lane
v7 training	`make regression-training-full`	Backprop family regression, long-epoch drift, training-kernel parity, visualizer training reports.	Default active lane
v6.6 archive	`make nightly-archive`	Legacy tooling contracts, kernel-map gate, model matrix compatibility.	Manual only

Sliding-Window Regression Coverage

Sliding-window attention now has an explicit contract test that is wired into the regular test suite, the llama.cpp parity flow, and nightly reporting.

# Run only sliding-window contract check
make test-attention-sliding

# Included in main kernel suite
make test

# Included in llama.cpp parity smoke/full flows
make llamacpp-parity
make llamacpp-parity-full

Implementation path: unittest/test_attention_sliding_contract.py, scripts/run_parity_smoketest.sh, and scripts/nightly_runner.py.

Case Study: Finding the RoPE Bug

This methodology found a subtle RoPE rotation convention bug that caused predictions to diverge at later context positions.

Symptoms

- Kernel tests: PASSED (1e-5 precision)
- Layer tests: PASSED (1e-5 precision)
- Full forward: FAILED (predictions diverge at position 3+)

Investigation

1. Kernel tests passed → kernels are correct for SOME convention
2. Layer tests passed → same story
3. But full model failed → bug is in convention, not math
4. Per-stage diff showed: q/k looked "correct" but were wrong convention

Root Cause

Our RoPE:     (x[0], x[1]), (x[2], x[3]), ...            (interleaved pairs)
Llama/SmolLM: (x[0], x[half]), (x[1], x[half+1]), ...    (rotate-half)

Both are mathematically valid RoPE, but weights are trained for one convention.
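The difference between the two conventions is easy to reproduce on a four-element vector. The sketch below is a pure-Python illustration, not the project's kernel: the function names are hypothetical and the frequency base of 10000 is an assumed default. It shows that the two conventions agree at position 0, disagree at later positions, and become equivalent once the input layout is permuted to match.

```python
import math


def rope_angles(pos, dim, base=10000.0):
    """Rotation angle for each of the dim/2 pairs at a given position."""
    return [pos * base ** (-2.0 * i / dim) for i in range(dim // 2)]


def rope_interleaved(x, pos):
    """Rotate adjacent pairs (x[2i], x[2i+1])."""
    out = list(x)
    for i, theta in enumerate(rope_angles(pos, len(x))):
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out


def rope_rotate_half(x, pos):
    """Rotate split pairs (x[i], x[i + half])."""
    half = len(x) // 2
    out = list(x)
    for i, theta in enumerate(rope_angles(pos, len(x))):
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


def to_split_layout(x):
    """Reorder [x0, x1, x2, x3, ...] into [evens..., odds...]."""
    return x[0::2] + x[1::2]


def max_abs_diff(a, b):
    return max(abs(p - q) for p, q in zip(a, b))


x = [1.0, 2.0, 3.0, 4.0]
print("pos 0 diff:", max_abs_diff(rope_interleaved(x, 0), rope_rotate_half(x, 0)))
print("pos 3 diff:", max_abs_diff(rope_interleaved(x, 3), rope_rotate_half(x, 3)))
print("pos 3 diff after permuting:", max_abs_diff(
    rope_rotate_half(to_split_layout(x), 3),
    to_split_layout(rope_interleaved(x, 3)),
))
```

Because the rotation is the identity at position 0, a test that only exercises the first position cannot distinguish the conventions.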

Fix

1. Update C kernel: src/kernels/rope_kernels.c
2. Update Python references: unittest/test_rope.py, scripts/smollm_layer_stage_diff.py
3. Re-run all tests: ALL PASS

Adding Tests for New Models

When adding support for a new model architecture, follow this checklist:

Kernel tests: Add tests for any new kernels (e.g., sliding window attention)
Layer tests: Test full layer if architecture differs
Weight conversion: Run compare_bump_to_hf.py to verify weights
Forward parity: Run forward pass and compare logits
Training parity: Run 5-20 steps and compare weight updates
Stress test: Run 100+ steps to check for numerical drift

Numerical Precision Guidelines

Comparison	Float32 Threshold	Notes
Kernel output	1e-5	Within about two orders of magnitude of float32 machine epsilon (~1.2e-7)
Layer output	1e-4	Small accumulation acceptable
Full forward (logits)	1e-3	30 layers × 1e-5 = 3e-4
1-step gradients	1e-5	Should match very closely
Multi-step weights	1e-3	Errors compound over steps
Token embeddings grad	1e-2	Aggregates from all layers

Red Flags:

Any diff > 1.0 → Bug, not numerical precision
Diffs growing exponentially over steps → Numerical instability
One specific layer/kernel much worse than others → Bug in that component
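These three checks can be automated over recorded diffs. The sketch below is illustrative only: the helper names, the growth ratio of 2.0, the outlier factor of 100, and all sample values are hypothetical choices, and a real gate would need thresholds tuned to the model.

```python
import statistics


def has_hard_failure(diffs):
    """Any diff above 1.0 points to a bug rather than rounding."""
    return any(d > 1.0 for d in diffs)


def grows_exponentially(diffs, min_ratio=2.0):
    """True when every diff is at least min_ratio times the previous one."""
    if len(diffs) < 3:
        return False
    return all(
        prev > 0 and cur / prev >= min_ratio
        for prev, cur in zip(diffs, diffs[1:])
    )


def outlier_components(component_diffs, factor=100.0):
    """Names whose diff exceeds factor times the median across components."""
    median = statistics.median(component_diffs.values())
    return [
        name for name, d in component_diffs.items()
        if median > 0 and d > factor * median
    ]


# Hypothetical per-step diffs and per-layer diffs.
stable = [1e-6, 2e-6, 3e-6, 4e-6]
unstable = [1e-6, 1e-5, 1e-4, 1e-3]
per_layer = {"layer.0": 2e-5, "layer.1": 3e-5, "layer.2": 4e-2}

print(grows_exponentially(stable), grows_exponentially(unstable))
print(outlier_components(per_layer))
```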

High-Memory Model Smoke

The nightly inference lane includes make test-v8-gemma4-highmem. It skips unless enough RAM is available.