Model + Kernel Matrix
A visual map of what the C Kernel Engine supports: IR templates, kernel families, quant contracts, and pipeline coverage (inference-only vs. inference + v7 training). Grounded in the actual templates, IR lowering passes, and the GGUF test corpus run locally.
template: qwen2.json
Qwen2
RMSNorm · RoPE split-half · SwiGLU · GQA
QKV bias from weights · BPE tokenizer
Quantized inference — GGUF, GEMM/GEMV dispatch
v7 training — IR backprop, AdamW, parity gates
Q4_K_M ★
Q5_K
Q6_K
Q8_0
FP32
template: qwen3.json
Qwen3
RMSNorm · RoPE split-half · QK‑Norm · SwiGLU
No QKV bias · BPE tokenizer · learned QK‑Norm
Quantized inference — GGUF, GEMM/GEMV dispatch
v7 training — IR backprop, AdamW, parity gates
Q4_K_M ★
Q5_K
Q6_K
Q8_0
FP32
template: qwen35.json
Qwen3.5
Hybrid recurrent + gated attention · RMSNorm · RoPE
QK‑Norm (full‑attn only) · SwiGLU · Gated DeltaNet · BPE tokenizer
Block pattern: 3×recurrent → 1×full_attention
Quantized inference — hybrid recurrent-attention decode, DeltaNet kernel parity
v7 training — IR backprop, DeltaNet backward kernels, AdamW, parity gates
Bring-up fix: preserved <think> / </think> markers on the built-in C tokenizer path and moved the visible vs. suppressed thinking choice into the exported chat contract, instead of relying on the Python-tokenizer fallback.
hybrid_recurrent_attention
gated_deltanet
0.8B dense compatible
Q4_K_M ★
Q8_0
FP32
template: gemma3.json
Gemma3
RMSNorm · RoPE split-half · QK‑Norm · GeGLU · Sliding Window
Post-norms · embed scale √dim · SentencePiece
Quantized inference — GGUF, sliding-window attn kernel, split-half RoPE parity restored
v7 training — IR backprop, AdamW, parity gates
Fix detail: Gemma was applying RoPE with the wrong dimension pairing. Gemma expects split-half RoPE, where the first half of Q/K is paired with the second half during the rotation step. The runtime could still choose the other common layout, so attention drifted from the reference path. The fix was to read exported rope_layout metadata and select the matching rope_qk kernel automatically.
first-divergence @ rope_qk
rope_layout contract honored
Q8_0 ★
Q4_K
BF16
FP32
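The two RoPE pairings behind this fix differ only in which elements are rotated together, yet they produce different Q/K values for any non-trivial head dimension. A minimal numpy sketch of both layouts (function names are illustrative, not the engine's rope_qk kernel API):

```python
import numpy as np

def rope_split_half(x, pos, theta=10000.0):
    # Split-half layout (Gemma/Qwen style): element i in the first half
    # is rotated against element i + d/2 in the second half.
    d = x.shape[-1]
    inv_freq = theta ** (-np.arange(d // 2) * 2.0 / d)
    ang = pos * inv_freq
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., : d // 2], x[..., d // 2:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def rope_pairwise(x, pos, theta=10000.0):
    # Pairwise (interleaved) layout (classic Llama): elements (2i, 2i+1)
    # form each rotated pair.
    d = x.shape[-1]
    inv_freq = theta ** (-np.arange(d // 2) * 2.0 / d)
    ang = pos * inv_freq
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Both are norm-preserving rotations, so nothing looks numerically wrong in isolation; only a parity comparison against the reference path exposes the mismatch. This is why selecting the kernel from the exported rope_layout metadata, rather than guessing, closes the bug.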
template: llama.json
Llama / Nanbeige
RMSNorm · RoPE pairwise · SwiGLU · GQA
Untied LM head supported · SentencePiece · ChatML markers via GGUF
Quantized inference — active GGUF bring-up lane on Nanbeige 4.1 3B
v7 training — not yet a documented preset; use template-audit + parity gates first
The current bring-up lane keeps SentencePiece + ChatML on the C-tokenizer path, preserves the untied output.weight, and treats Nanbeige's long coherent think traces as model behavior rather than as a kernel/parity failure.
untied output.weight
ChatML auto template
template-audit first
Q4_K_M
Q8_0
FP32
template: gpt2.json
GPT‑2
LayerNorm · Learned PosEmb · GELU
QKV bias · tied embeddings · BPE tokenizer
Quantized inference — GGUF baseline
v7 training — not yet in scope
Q4_0
Q4_1
Q5_0
Q8_0
FP32
Kernel Families by Model
| Model | Tokenizer | QKV Bias | QK‑Norm | RoPE / PosEmb | Attention | MLP | Post‑Norms | Logits |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT‑2 | BPE | Yes | No | Learned | Causal | GELU | No | Tied |
| Qwen2 | BPE | From weights | No | RoPE split-half | Causal | SwiGLU | No | Tied |
| Qwen3 | BPE | No | Yes | RoPE split-half | Causal | SwiGLU | No | LM head |
| Qwen3.5 | BPE | No | Full‑attn only | RoPE | Hybrid DeltaNet + Gated | SwiGLU | No | Tied |
| Gemma3 | SentencePiece | From weights | Yes | RoPE | Sliding | GeGLU | Yes | Tied |
| Llama / Nanbeige | SentencePiece | From weights | No | RoPE pairwise | Causal GQA | SwiGLU | No | Untied head |
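The MLP column above reduces to two gated variants that differ only in the activation applied to the gate branch. A sketch of both, assuming the standard SwiGLU/GeGLU formulations (the silu_mul / geglu names mirror the kernel names in the tables; weight shapes here are illustrative):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def gelu(x):
    # tanh approximation of GELU, as commonly used in GGUF runtimes
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def silu_mul(x, w_gate, w_up, w_down):
    # SwiGLU MLP (Qwen2/Qwen3/Llama): down( SiLU(gate(x)) * up(x) )
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def geglu(x, w_gate, w_up, w_down):
    # GeGLU MLP (Gemma3): down( GELU(gate(x)) * up(x) )
    return (gelu(x @ w_gate) * (x @ w_up)) @ w_down
```

Because the graph shape is identical, the two kernels can share a lowering path and differ only in the activation epilogue.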
IR + Quant Contracts
| Model | Embed Scale | Prefer Q8 Contract | Prefer FP32 Logits | Tokenizer I/O | Attention Kernel | MLP Kernel | Residual Save | LM Head Tie |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT‑2 | No | No | No | BPE + vocab | attn | gelu | Yes | Yes |
| Qwen2 | No | No | No | BPE + vocab | attn | silu_mul | Yes | Yes |
| Qwen3 | No | No | No | BPE + vocab | attn + qk_norm | silu_mul | Yes | No |
| Qwen3.5 | No | No | No | BPE + vocab | hybrid_recurrent + gated_attn | silu_mul | Yes | Yes |
| Gemma3 | sqrt(dim) | Yes | Yes | SentencePiece | attn_sliding | geglu | Yes | Yes |
| Llama / Nanbeige | No | No | No | SentencePiece + ChatML | attn | silu_mul | Yes | No |
Supported Quant + DType Coverage
Weight-only quantized GEMM/GEMV kernels with BF16/FP32 activations and Q8_* activation contracts.
Quantized kernels (Q4_K, Q5_K, Q6_K, Q8_0) are verified against both PyTorch and llama.cpp reference output — both references must agree before a kernel is considered validated.
Q4_0
Q4_1
Q4_K
Q5_0
Q5_1
Q5_K
Q6_K
Q8_0
Q8_K
BF16
FP32
GEMV: q4_k, q5_0, q5_1, q5_k, q6_k, q8_0, fp32
GEMM: q4_k, q5_0, q5_k, q6_k, q8_0, fp32
Activations: fp32, bf16, q8_0, q8_k
Fused: attention + outproj + MLP (prefill) variants
Recurrent: src/kernels/deltanet_kernels.c for FP32 Gated DeltaNet parity
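To make the Q8_0 contract concrete: it is a block format with 32 values per block, one per-block absmax scale, and int8 codes. A minimal numpy sketch of the round trip, assuming the standard GGUF Q8_0 layout (this mirrors the format, not the engine's actual C quantize routines):

```python
import numpy as np

QK8_0 = 32  # Q8_0 block size in GGUF

def quantize_q8_0(x):
    # Per-32-block absmax scale (stored as fp16), int8 codes.
    # x length must be a multiple of QK8_0.
    blocks = x.reshape(-1, QK8_0)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    d = amax / 127.0
    d[d == 0] = 1.0  # avoid divide-by-zero on all-zero blocks
    q = np.clip(np.round(blocks / d), -127, 127).astype(np.int8)
    return d.astype(np.float16), q

def dequantize_q8_0(d, q):
    return (q.astype(np.float32) * d.astype(np.float32)).reshape(-1)
```

The worst-case per-element error is about half a quantization step (amax / 254) plus a small fp16 scale-rounding term, which is the bound a parity gate against the FP32 reference should be checking.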
Hybrid Architecture Support
Qwen3.5 / qwen3next — Full Template
Full qwen35.json template with hybrid block pattern: 3×recurrent → 1×full_attention.
Recurrent blocks: x → [q,k,v,z] + [beta,alpha] → conv(q/k/v) → DeltaNet state update S_t → RMSNorm(h) * SiLU(z) → outproj.
Full-attention blocks: q_gate_proj → split_q_gate → k/v_proj → qk_norm → rope → gated_attention → sigmoid_mul → outproj.
Dedicated kernel parity in src/kernels/deltanet_kernels.c (forward + backward).
Compatible with Qwen3.5-0.8B dense variant.
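The recurrent-block chain above centers on the DeltaNet state update S_t. A minimal per-token sketch of the gated delta rule in FP32, assuming the standard formulation (the conv and projection steps from the chain are omitted; this is illustrative, not the deltanet_kernels.c implementation):

```python
import numpy as np

def gated_deltanet_step(S, q, k, v, beta, alpha):
    """One token of the gated delta rule (illustrative).

    S:     (d_v, d_k) recurrent state carried across tokens
    k:     assumed L2-normalized key
    beta:  write strength in [0, 1]
    alpha: decay gate in [0, 1]
    """
    # Decay the state, erase the old association along k, write the new one.
    S = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
    # Read out with the query; the engine then applies RMSNorm(h) * SiLU(z).
    return S, S @ q
```

The in-place state mutation is exactly why DeltaNet earns dedicated forward and backward kernels: unlike attention, each step rewrites S, so the backward pass has to replay or checkpoint that recurrence.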
Gated attention
Treat it as a composed graph: causal attention + sigmoid gate multiply + output projection. Keep it stitched in CK unless profiling proves a fused epilogue is worth the extra maintenance.
Interpretation: DeltaNet earns a dedicated kernel because it mutates recurrent state; gated attention usually does not, because the gate is just an elementwise epilogue on top of otherwise standard attention.
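That composed-graph view can be written out directly: standard causal attention, a sigmoid-gate elementwise epilogue, then the output projection. A single-head numpy sketch (shapes and names are illustrative, not the CK graph nodes):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention(q, k, v, gate, w_out):
    # Composed graph: causal attention -> sigmoid_mul epilogue -> outproj.
    T, d = q.shape
    scores = (q @ k.T) / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)   # causal mask
    attn = softmax(scores) @ v
    gated = attn * (1.0 / (1.0 + np.exp(-gate)))  # elementwise sigmoid gate
    return gated @ w_out
```

Nothing here touches recurrent state; the gate is a pure elementwise epilogue, which is the argument for keeping it stitched rather than fusing until profiling says otherwise.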
Why Build C Kernel Engine?
What this teaches
This stack forces us to understand the invariants that actually matter:
memory layout, kernel contracts, numerical stability, and graph scheduling.
It turns the model into something we can inspect — dump activations, trace
tensor ranges, validate gradients later — without guessing what’s happening
inside a framework. That clarity is the foundation for training on CPU.
Current GGUF Test Corpus
gpt2 (baseline)
Qwen--Qwen2-0.5B-Instruct-GGUF
Qwen--Qwen3-0.6B-GGUF
unsloth--gemma-3-270m-it-GGUF
mradermacher--Nanbeige4.1-3B-GGUF
Qwen--Qwen3.5-0.8B-GGUF
Bring-up note: if a Llama-family/Nanbeige first reply starts with <think> or echoes
<|im_start|>assistant, the runtime is using the wrong chat wrapper or stop policy.
Preserve GGUF-exported ChatML markers and keep output.weight untied. For Gemma, if first-token parity collapses after a clean qk_norm stage, check the exported rope_layout before suspecting tokenizer or stop-token bugs.
Recent Bring-up Fixes
Gemma 3: the bug was not in tokenization. We were applying RoPE with the wrong pairing style for Q/K. Gemma expects split-half RoPE, but the runtime could still use the other layout. Fix: read exported rope_layout and choose the matching RoPE kernel automatically.
Qwen3.5: hybrid recurrent-attention decode and Gated DeltaNet parity are now on the normal C-tokenizer route. <think> markers are preserved as special tokens, and visible vs. suppressed thinking is now a generic chat-contract mode instead of ad hoc runtime logic.
Llama / Nanbeige: SentencePiece + ChatML bring-up stabilized with untied output.weight preserved. Nanbeige remains an active inference lane; long coherent think traces are treated as model style, not as evidence of kernel breakage.
Validation Commands (v7 inference matrix)
These commands are the current inference readiness gates for Qwen2, Qwen3, Qwen3.5, Gemma3, and the active Nanbeige bring-up lane.
python3 version/v7/scripts/validate_model_matrix_v7.py
python3 version/v7/scripts/validate_parity_matrix_v7.py
python3 version/v7/scripts/validate_long_decode_stability_v7.py
# Nanbeige-specific stitched call-IR regression
python3 version/v7/scripts/test_kv_cache_batch_copy_call_ir_v7.py \
--run-dir ~/.cache/ck-engine-v7/models/mradermacher--Nanbeige4.1-3B-GGUF
Matrix validators honor cached GGUFs by default; add --allow-download when you want them to fetch missing models.
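The parity gates these validators apply reduce to a first-divergence scan: run both paths, capture activations per named stage, and report the first stage whose max-abs difference exceeds tolerance. A minimal sketch (stage names and the tolerance are illustrative, not the scripts' actual API):

```python
import numpy as np

def first_divergence(stages_a, stages_b, atol=1e-4):
    # stages_*: insertion-ordered dicts of stage name -> activation array,
    # one from the C path and one from the reference path. Returns the
    # first (name, max_abs_diff) exceeding atol, or None if all match.
    for name in stages_a:
        diff = float(np.max(np.abs(stages_a[name] - stages_b[name])))
        if diff > atol:
            return name, diff
    return None
```

A report like "first-divergence @ rope_qk" is produced exactly this way: the embed and qk_norm stages matched, and rope_qk was the first stage to drift.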
Growth path: inference coverage now spans Qwen2/Qwen3/Qwen3.5/Gemma3 plus the Llama/Nanbeige bring-up lane.
Qwen3.5 adds hybrid recurrent-attention coverage with Gated DeltaNet kernels (forward + backward), compatible with the 0.8B dense variant.
Gemma now runs on the correct split-half RoPE parity path, and Nanbeige has a documented SentencePiece/ChatML bring-up lane on the C tokenizer.
v7 still targets full training expansion: backward kernels, optimizer state, gradient reduction, and IR‑driven training schedules.
This page will keep evolving into a single matrix that shows inference + training coverage per model family.