# Kernel Catalog
This document lists the high-performance C kernels available in the engine. These kernels are designed to be "micro-libraries": you can copy `src/kernels/rope_kernels.c` into your own project without taking the rest of the engine.
## Naming Convention

- `_naive`: Reference implementation, easy to read, slow.
- `_parallel`: OpenMP-accelerated for multi-core CPUs.
- `_avx512`: Explicit intrinsics for AVX-512 (x86_64).
- `_head_major`: Optimized memory layout where the head dimension is outermost or stride-optimized.
## 1. Attention, RoPE & Recurrent State

| Kernel | Source File | Description |
|---|---|---|
| `attention_forward_causal_head_major` | `attention_kernels.c` | Standard Scaled Dot-Product Attention (SDPA) with causal masking. Expects `[Head, Token, Dim]` layout. |
| `rope_forward` | `rope_kernels.c` | Rotary Positional Embeddings. Rotates query/key vectors in-place. |
| `causal_softmax_head_major` | `softmax_kernels.c` | Softmax applied to attention scores, masking out future tokens (causal mask). |
| `gated_deltanet_autoregressive_forward` | `deltanet_kernels.c` | Single-token Gated DeltaNet recurrent update for Qwen3.5/qwen3next parity. Updates per-head state matrices `[Head, Dim, Dim]` and emits `[Head, Dim]` output. |
### Composed vs Fused Gates
Not every gated block needs its own standalone kernel file.
- Gated attention is still normal causal attention plus a gate epilogue: `attention -> sigmoid -> mul -> outproj`. CK can express that as stitched kernels without inventing a dedicated `gated_attention_kernels.c`.
- SwiGLU / GeGLU follow the same rule. We keep the atomic activation kernels and only use fused decode helpers where profiling justifies the extra complexity.
- Gated DeltaNet is different because the gate participates inside the recurrent state update itself. That makes `src/kernels/deltanet_kernels.c` a real kernel family, not just a wrapper around GEMM/GEMV.
Rule of thumb: keep CK graph-first and stitch atomic kernels by default; add fused wrappers only when measurements show the gate epilogue is a real hotspot.
### Qwen3.5 DeltaNet Block Flow
The Qwen3.5 / qwen3next linear-attention block is easier to reason about if you separate the three different gating roles instead of collapsing them into one "state-space" idea.
1. Project from the token input `x`. The block produces `[q, k, v, z]` from `ssm_in(x)` and `[beta, alpha]` from `ssm_beta_alpha(x)`.
2. Run the short convolution path. The local conv cache mixes the projected `q`/`k`/`v` lanes before the recurrent update. The `z`, `beta`, and `alpha` lanes do not go through that conv kernel.
3. Turn `alpha` into the recurrent decay gate. In llama.cpp this becomes a decay control `g` via `softplus(alpha + dt)` followed by a learned scale. That gate controls how much of the old recurrent state survives.
4. Update the per-head memory matrix `S_t`. DeltaNet keeps a recurrent state matrix `[Head, Dim, Dim]`, not just a hidden state vector.
5. Normalize and gate the recurrent output. After the DeltaNet core produces `h_t`, the block applies gated normalization: `RMSNorm(h_t) * SiLU(z)`.
6. Project back out. The gated normalized result goes through `ssm_out` to rejoin the residual stream.
For single-token decode, the recurrent update that CK matches in `src/kernels/deltanet_kernels.c` is:

```
q_hat   = norm(q) / sqrt(d)
k_hat   = norm(k)
beta_s  = sigmoid(beta)
gate    = exp(g)
S_decay = gate * S_prev
kv_mem  = S_decay^T * k_hat
delta   = beta_s * (v - kv_mem)
S_new   = S_decay + outer(k_hat, delta)
out     = S_new^T * q_hat
```
Interpretation:

- `g` controls forgetting / decay of old memory.
- `beta` controls how aggressively the block writes the correction term.
- `z` is the output gate applied after the recurrent core, during gated normalization.
This is why Qwen3.5 feels Mamba-like without being "just Mamba": it has a short convolution path and a recurrent cache, but the core memory update is a DeltaNet matrix-state rule, and Qwen3.5 alternates these DeltaNet blocks with separate full Gated Attention blocks.
### 📊 Flash Attention Analysis
For a detailed comparison of attention implementations and performance analysis (including benchmarks vs llama.cpp), see:
Flash Attention Analysis: Why llama.cpp is Faster
SIMD optimizations, threading strategies, and performance trade-offs.
### 🔁 Gated DeltaNet Deep Dive
Visual infographic of the single-token recurrent update, memory layout, ISA dispatch, and how DeltaNet compares to standard attention:
Gated DeltaNet: Recurrent Attention Deep Dive
SVG diagrams, SIMD tier breakdown, and CK-Engine kernel rules.
## 2. Elementwise & Activation

| Kernel | Source File | Description |
|---|---|---|
| `gelu_fast_inplace` | `gelu_kernels.c` | Gaussian Error Linear Unit. Uses the fast approximation (tanh). |
| `swiglu_forward` | `swiglu_kernels.c` | SwiGLU activation (Gated Linear Unit with Swish). Used in Llama/Mistral. Input size is `2 * dim`. |
| `sigmoid_forward` | `sigmoid_kernels.c` | Standard logistic sigmoid function. |
## 3. Normalization

| Kernel | Source File | Description |
|---|---|---|
| `rmsnorm_forward` | `rmsnorm_kernels.c` | Root Mean Square Normalization. Lighter than LayerNorm (no mean subtraction). |
| `layernorm_naive_serial` | `layernorm_kernels.c` | Standard LayerNorm. Subtracts mean, divides by variance. |
## 4. GEMM (Matrix Multiplication)

| Kernel | Source File | Description |
|---|---|---|
| `gemm_naive_parallel` | `gemm_kernels.c` | OpenMP-parallelized SGEMM (`C = A * B`). |
| `gemm_avx512_parallel` | `gemm_kernels.c` | Hand-tuned AVX-512 implementation for max throughput on modern Intel/AMD chips. |
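The naive parallel variant is essentially the textbook triple loop with the outer rows split across threads; a sketch (names and shapes are assumptions, not the `gemm_kernels.c` signature):

```c
/* Naive OpenMP SGEMM: C = A * B, with A [M,K], B [K,N], C [M,N],
 * all row-major. The pragma is a no-op when compiled without -fopenmp,
 * so the kernel degrades gracefully to the serial reference. */
static void gemm_naive_parallel_ref(float *C, const float *A, const float *B,
                                    int M, int N, int K) {
    #pragma omp parallel for
    for (int m = 0; m < M; m++) {
        for (int n = 0; n < N; n++) {
            float acc = 0.0f;
            for (int k = 0; k < K; k++)
                acc += A[m * K + k] * B[k * N + n];
            C[m * N + n] = acc;
        }
    }
}
```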
## 5. Loss & Training Kernels

| Kernel | Source File | Description |
|---|---|---|
| `softmax_cross_entropy_loss` | `loss_kernels.c` | Fused softmax + CE loss/gradient kernel used by the v7 training parity harness. |
| `softmax_cross_entropy_loss_ptref` | `loss_kernels.c` | Strict PyTorch-reference CE variant for parity experiments and drift diagnostics. |
### Cross-Entropy Deep Dive (v7)

For the full derivation (`p - one_hot`), PyTorch reduction semantics (`ignore_index`, denominator rules), and long-horizon drift-fix evidence, see:
## Memory Layouts

Most kernels assume row-major contiguous memory unless specified otherwise.

- Matrices: `[Rows, Cols]`
- Attention tensors: `[Heads, Tokens, Head_Dim]` (head-major) helps with cache locality during the attention loop.
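The head-major layout reduces to one flat-index rule, sketched below (the helper name is illustrative):

```c
#include <stddef.h>

/* Flat offset of element (h, t, d) in a head-major [Heads, Tokens, Head_Dim]
 * tensor stored row-major and contiguously. Each head's [Tokens, Head_Dim]
 * slab is contiguous, which is what gives the attention loop its
 * cache locality. */
static size_t head_major_index(int h, int t, int d, int tokens, int head_dim) {
    return ((size_t)h * tokens + t) * head_dim + d;
}
```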