Model + Kernel Matrix
A visual map of what the C Kernel Engine supports: IR templates, kernel families, quant contracts, and pipeline coverage (inference-only vs. inference + v7 training). Grounded in the actual templates, IR lowering passes, and the GGUF test corpus run locally.
template: qwen2.json
Qwen2
RMSNorm · RoPE split-half · SwiGLU · GQA
QKV bias from weights · BPE tokenizer
Quantized inference — GGUF, GEMM/GEMV dispatch
v7 training — IR backprop, AdamW, parity gates
Q4_K_M ★
Q5_K
Q6_K
Q8_0
FP32
template: qwen3.json
Qwen3
RMSNorm · RoPE split-half · QK‑Norm · SwiGLU
No QKV bias · BPE tokenizer · learned QK‑Norm
Quantized inference — GGUF, GEMM/GEMV dispatch
v7 training — IR backprop, AdamW, parity gates
Q4_K_M ★
Q5_K
Q6_K
Q8_0
FP32
template: qwen35.json
Qwen3.5
Hybrid recurrent + gated attention · RMSNorm · RoPE
QK‑Norm (full‑attn only) · SwiGLU · Gated DeltaNet · BPE tokenizer
Block pattern: 3×recurrent → 1×full_attention
Quantized inference — hybrid recurrent-attention decode, DeltaNet kernel parity
v7 training — IR backprop, DeltaNet backward kernels, AdamW, parity gates
Bring-up fix: preserved <think> / </think> markers on the built-in C tokenizer path and moved the visible vs. suppressed thinking choice into the exported chat contract, instead of relying on the Python-tokenizer fallback.
hybrid_recurrent_attention
gated_deltanet
0.8B dense compatible
Q4_K_M ★
Q8_0
FP32
template: gemma3.json
Gemma3
RMSNorm · RoPE split-half · QK‑Norm · GeGLU · Sliding Window
Post-norms · embed scale √dim · SentencePiece
Quantized inference — GGUF, sliding-window attn kernel, split-half RoPE parity restored
v7 training — IR backprop, AdamW, parity gates
Fix detail: Gemma was applying RoPE with the wrong dimension pairing. Gemma expects split-half RoPE, where the first half of Q/K is paired with the second half during the rotation step. The runtime could still choose the other common layout, so attention drifted from the reference path. The fix was to read exported rope_layout metadata and select the matching rope_qk kernel automatically.
first-divergence @ rope_qk
rope_layout contract honored
Q8_0 ★
Q4_K
BF16
FP32
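The two RoPE pairings behind this fix differ only in which elements are rotated together, yet they produce different Q/K values for any non-trivial head dimension. A minimal numpy sketch of both layouts (function names are illustrative, not the engine's rope_qk kernel API):

```python
import numpy as np

def rope_split_half(x, pos, theta=10000.0):
    # Split-half layout (Gemma/Qwen style): element i in the first half
    # is rotated against element i + d/2 in the second half.
    d = x.shape[-1]
    inv_freq = theta ** (-np.arange(d // 2) * 2.0 / d)
    ang = pos * inv_freq
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., : d // 2], x[..., d // 2:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def rope_pairwise(x, pos, theta=10000.0):
    # Pairwise (interleaved) layout (classic Llama): elements (2i, 2i+1)
    # form each rotated pair.
    d = x.shape[-1]
    inv_freq = theta ** (-np.arange(d // 2) * 2.0 / d)
    ang = pos * inv_freq
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Both are norm-preserving rotations, so nothing looks numerically wrong in isolation; only a parity comparison against the reference path exposes the mismatch. This is why selecting the kernel from the exported rope_layout metadata, rather than guessing, closes the bug.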
template: llama.json
Llama / Nanbeige
RMSNorm · RoPE pairwise · SwiGLU · GQA
Untied LM head supported · SentencePiece · ChatML markers via GGUF
Quantized inference — active GGUF bring-up lane on Nanbeige 4.1 3B
v7 training — not yet a documented preset; use template-audit + parity gates first
The current bring-up lane keeps SentencePiece + ChatML on the C-tokenizer path, preserves the untied output.weight, and treats Nanbeige's long coherent think traces as model behavior rather than as a kernel/parity failure.
untied output.weight
ChatML auto template
template-audit first
Q4_K_M
Q8_0
FP32
template: gpt2.json
GPT‑2
LayerNorm · Learned PosEmb · GELU
QKV bias · tied embeddings · BPE tokenizer
Quantized inference — GGUF baseline
v7 training — not yet in scope
Q4_0
Q4_1
Q5_0
Q8_0
FP32
Kernel Families by Model
| Model | Tokenizer | QKV Bias | QK‑Norm | RoPE / PosEmb | Attention | MLP | Post‑Norms | Logits |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT‑2 | BPE | Yes | No | Learned | Causal | GELU | No | Tied |
| Qwen2 | BPE | From weights | No | RoPE split-half | Causal | SwiGLU | No | Tied |
| Qwen3 | BPE | No | Yes | RoPE split-half | Causal | SwiGLU | No | LM head |
| Qwen3.5 | BPE | No | Full‑attn only | RoPE | Hybrid DeltaNet + Gated | SwiGLU | No | Tied |
| Gemma3 | SentencePiece | From weights | Yes | RoPE | Sliding | GeGLU | Yes | Tied |
| Llama / Nanbeige | SentencePiece | From weights | No | RoPE pairwise | Causal GQA | SwiGLU | No | Untied head |
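The MLP column above reduces to two gated variants that differ only in the activation applied to the gate branch. A sketch of both, assuming the standard SwiGLU/GeGLU formulations (the silu_mul / geglu names mirror the kernel names in the tables; weight shapes here are illustrative):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def gelu(x):
    # tanh approximation of GELU, as commonly used in GGUF runtimes
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def silu_mul(x, w_gate, w_up, w_down):
    # SwiGLU MLP (Qwen2/Qwen3/Llama): down( SiLU(gate(x)) * up(x) )
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def geglu(x, w_gate, w_up, w_down):
    # GeGLU MLP (Gemma3): down( GELU(gate(x)) * up(x) )
    return (gelu(x @ w_gate) * (x @ w_up)) @ w_down
```

Because the graph shape is identical, the two kernels can share a lowering path and differ only in the activation epilogue.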
IR + Quant Contracts
| Model | Embed Scale | Prefer Q8 Contract | Prefer FP32 Logits | Tokenizer I/O | Attention Kernel | MLP Kernel | Residual Save | LM Head Tie |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT‑2 | No | No | No | BPE + vocab | attn | gelu | Yes | Yes |
| Qwen2 | No | No | No | BPE + vocab | attn | silu_mul | Yes | Yes |
| Qwen3 | No | No | No | BPE + vocab | attn + qk_norm | silu_mul | Yes | No |
| Qwen3.5 | No | No | No | BPE + vocab | hybrid_recurrent + gated_attn | silu_mul | Yes | Yes |
| Gemma3 | sqrt(dim) | Yes | Yes | SentencePiece | attn_sliding | geglu | Yes | Yes |
| Llama / Nanbeige | No | No | No | SentencePiece + ChatML | attn | silu_mul | Yes | No |
Supported Quant + DType Coverage
Weight-only quantized GEMM/GEMV kernels with BF16/FP32 activations and Q8_* activation contracts.
Quantized kernels (Q4_K, Q5_K, Q6_K, Q8_0) are verified against both PyTorch and llama.cpp reference output — both references must agree before a kernel is considered validated.
Q4_0
Q4_1
Q4_K
Q5_0
Q5_1
Q5_K
Q6_K
Q8_0
Q8_K
BF16
FP32
GEMV: q4_k, q5_0, q5_1, q5_k, q6_k, q8_0, fp32
GEMM: q4_k, q5_0, q5_k, q6_k, q8_0, fp32
Activations: fp32, bf16, q8_0, q8_k
Fused: attention + outproj + MLP (prefill) variants
Recurrent: src/kernels/deltanet_kernels.c for FP32 Gated DeltaNet parity
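To make the Q8_0 contract concrete: it is a block format with 32 values per block, one per-block absmax scale, and int8 codes. A minimal numpy sketch of the round trip, assuming the standard GGUF Q8_0 layout (this mirrors the format, not the engine's actual C quantize routines):

```python
import numpy as np

QK8_0 = 32  # Q8_0 block size in GGUF

def quantize_q8_0(x):
    # Per-32-block absmax scale (stored as fp16), int8 codes.
    # x length must be a multiple of QK8_0.
    blocks = x.reshape(-1, QK8_0)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    d = amax / 127.0
    d[d == 0] = 1.0  # avoid divide-by-zero on all-zero blocks
    q = np.clip(np.round(blocks / d), -127, 127).astype(np.int8)
    return d.astype(np.float16), q

def dequantize_q8_0(d, q):
    return (q.astype(np.float32) * d.astype(np.float32)).reshape(-1)
```

The worst-case per-element error is about half a quantization step (amax / 254) plus a small fp16 scale-rounding term, which is the bound a parity gate against the FP32 reference should be checking.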
Hybrid Architecture Support
Qwen3.5 / qwen3next — Full Template
Full qwen35.json template with hybrid block pattern: 3×recurrent → 1×full_attention.
Recurrent blocks: x → [q,k,v,z] + [beta,alpha] → conv(q/k/v) → DeltaNet state update S_t → RMSNorm(h) * SiLU(z) → outproj.
Full-attention blocks: q_gate_proj → split_q_gate → k/v_proj → qk_norm → rope → gated_attention → sigmoid_mul → outproj.
Dedicated kernel parity in src/kernels/deltanet_kernels.c (forward + backward).
Compatible with Qwen3.5-0.8B dense variant.
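The recurrent-block chain above centers on the DeltaNet state update S_t. A minimal per-token sketch of the gated delta rule in FP32, assuming the standard formulation (the conv and projection steps from the chain are omitted; this is illustrative, not the deltanet_kernels.c implementation):

```python
import numpy as np

def gated_deltanet_step(S, q, k, v, beta, alpha):
    """One token of the gated delta rule (illustrative).

    S:     (d_v, d_k) recurrent state carried across tokens
    k:     assumed L2-normalized key
    beta:  write strength in [0, 1]
    alpha: decay gate in [0, 1]
    """
    # Decay the state, erase the old association along k, write the new one.
    S = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
    # Read out with the query; the engine then applies RMSNorm(h) * SiLU(z).
    return S, S @ q
```

The in-place state mutation is exactly why DeltaNet earns dedicated forward and backward kernels: unlike attention, each step rewrites S, so the backward pass has to replay or checkpoint that recurrence.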
Gated attention
Treat it as a composed graph: causal attention + sigmoid gate multiply + output projection. Keep it stitched in CK unless profiling proves a fused epilogue is worth the extra maintenance.
Interpretation: DeltaNet earns a dedicated kernel because it mutates recurrent state; gated attention usually does not, because the gate is just an elementwise epilogue on top of otherwise standard attention.
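That composed-graph view can be written out directly: standard causal attention, a sigmoid-gate elementwise epilogue, then the output projection. A single-head numpy sketch (shapes and names are illustrative, not the CK graph nodes):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention(q, k, v, gate, w_out):
    # Composed graph: causal attention -> sigmoid_mul epilogue -> outproj.
    T, d = q.shape
    scores = (q @ k.T) / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)   # causal mask
    attn = softmax(scores) @ v
    gated = attn * (1.0 / (1.0 + np.exp(-gate)))  # elementwise sigmoid gate
    return gated @ w_out
```

Nothing here touches recurrent state; the gate is a pure elementwise epilogue, which is the argument for keeping it stitched rather than fusing until profiling says otherwise.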
Why Build C Kernel Engine?
What this teaches
This stack forces us to understand the invariants that actually matter:
memory layout, kernel contracts, numerical stability, and graph scheduling.
It turns the model into something we can inspect — dump activations, trace
tensor ranges, validate gradients later — without guessing what’s happening
inside a framework. That clarity is the foundation for training on CPU.
Current GGUF Test Corpus
gpt2 (baseline)
Qwen--Qwen2-0.5B-Instruct-GGUF
Qwen--Qwen3-0.6B-GGUF
unsloth--gemma-3-270m-it-GGUF
mradermacher--Nanbeige4.1-3B-GGUF
Qwen--Qwen3.5-0.8B-GGUF
Bring-up note: if a Llama-family/Nanbeige first reply starts with <think> or echoes
<|im_start|>assistant, the runtime is using the wrong chat wrapper or stop policy.
Preserve GGUF-exported ChatML markers and keep output.weight untied. For Gemma, if first-token parity collapses after a clean qk_norm stage, check the exported rope_layout before suspecting tokenizer or stop-token bugs.
Recent Bring-up Fixes
Gemma 3: the bug was not in tokenization. We were applying RoPE with the wrong pairing style for Q/K. Gemma expects split-half RoPE, but the runtime could still use the other layout. Fix: read exported rope_layout and choose the matching RoPE kernel automatically.
Qwen3.5: hybrid recurrent-attention decode and Gated DeltaNet parity are now on the normal C-tokenizer route. <think> markers are preserved as special tokens, and visible vs. suppressed thinking is now a generic chat-contract mode instead of ad hoc runtime logic.
Llama / Nanbeige: SentencePiece + ChatML bring-up stabilized with untied output.weight preserved. Nanbeige remains an active inference lane; long coherent think traces are treated as model style, not as evidence of kernel breakage.
Validation Commands (v7 inference matrix)
These commands are the current inference readiness gates for Qwen2, Qwen3, Qwen3.5, Gemma3, and the active Nanbeige bring-up lane.
python3 version/v7/scripts/validate_model_matrix_v7.py
python3 version/v7/scripts/validate_parity_matrix_v7.py
python3 version/v7/scripts/validate_long_decode_stability_v7.py
# Nanbeige-specific stitched call-IR regression
python3 version/v7/scripts/test_kv_cache_batch_copy_call_ir_v7.py \
--run-dir ~/.cache/ck-engine-v7/models/mradermacher--Nanbeige4.1-3B-GGUF
Matrix validators honor cached GGUFs by default; add --allow-download when you want them to fetch missing models.
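The parity gates these validators apply reduce to a first-divergence scan: run both paths, capture activations per named stage, and report the first stage whose max-abs difference exceeds tolerance. A minimal sketch (stage names and the tolerance are illustrative, not the scripts' actual API):

```python
import numpy as np

def first_divergence(stages_a, stages_b, atol=1e-4):
    # stages_*: insertion-ordered dicts of stage name -> activation array,
    # one from the C path and one from the reference path. Returns the
    # first (name, max_abs_diff) exceeding atol, or None if all match.
    for name in stages_a:
        diff = float(np.max(np.abs(stages_a[name] - stages_b[name])))
        if diff > atol:
            return name, diff
    return None
```

A report like "first-divergence @ rope_qk" is produced exactly this way: the embed and qk_norm stages matched, and rope_qk was the first stage to drift.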
Growth path: inference coverage now spans Qwen2/Qwen3/Qwen3.5/Gemma3 plus the Llama/Nanbeige bring-up lane.
Qwen3.5 adds hybrid recurrent-attention coverage with Gated DeltaNet kernels (forward + backward), compatible with the 0.8B dense variant.
Gemma now runs on the correct split-half RoPE parity path, and Nanbeige has a documented SentencePiece/ChatML bring-up lane on the C tokenizer.
v7 still targets full training expansion: backward kernels, optimizer state, gradient reduction, and IR‑driven training schedules.
This page will keep evolving into a single matrix that shows inference + training coverage per model family.