Kernel Catalog

This document lists the high-performance C kernels available in the engine. These kernels are designed to be "micro-libraries"—you can copy src/kernels/rope_kernels.c into your own project without taking the rest of the engine.

Naming Convention

  • _naive: Reference implementation, easy to read, slow.
  • _parallel: OpenMP-accelerated for multi-core CPUs.
  • _avx512: Explicit intrinsics for AVX-512 (x86_64).
  • _head_major: Memory-layout variant where the head dimension is outermost (or stride-optimized).

1. Attention, RoPE & Recurrent State

  • attention_forward_causal_head_major (attention_kernels.c): Standard Scaled Dot-Product Attention (SDPA) with causal masking. Expects [Head, Token, Dim] layout.
  • rope_forward (rope_kernels.c): Rotary Positional Embeddings. Rotates query/key vectors in-place.
  • causal_softmax_head_major (softmax_kernels.c): Softmax applied to attention scores, masking out future tokens (causal mask).
  • gated_deltanet_autoregressive_forward (deltanet_kernels.c): Single-token Gated DeltaNet recurrent update for Qwen3.5/qwen3next parity. Updates per-head state matrices [Head, Dim, Dim] and emits [Head, Dim] output.

Composed vs Fused Gates

Not every gated block needs its own standalone kernel file.

Rule of thumb: keep CK graph-first and stitch atomic kernels by default; add fused wrappers only when measurements show the gate epilogue is a real hotspot.

Qwen3.5 DeltaNet Block Flow

The Qwen3.5 / qwen3next linear-attention block is easier to reason about if you separate the three different gating roles instead of collapsing them into one "state-space" idea.

  1. Project from the token input x. The block produces [q, k, v, z] from ssm_in(x) and [beta, alpha] from ssm_beta_alpha(x).
  2. Run the short convolution path. The local conv cache mixes the projected q/k/v lanes before the recurrent update. The z, beta, and alpha lanes do not go through that conv kernel.
  3. Turn alpha into the recurrent decay gate. In llama.cpp this becomes a decay control g via softplus(alpha + dt) followed by a learned scale. That gate controls how much of the old recurrent state survives.
  4. Update the per-head memory matrix S_t. DeltaNet keeps a recurrent state matrix [Head, Dim, Dim], not just a hidden state vector.
  5. Normalize and gate the recurrent output. After the DeltaNet core produces h_t, the block applies gated normalization: RMSNorm(h_t) * SiLU(z).
  6. Project back out. The gated normalized result goes through ssm_out to rejoin the residual stream.

For single-token decode, the recurrent update that CK matches in src/kernels/deltanet_kernels.c is:

q_hat   = norm(q) / sqrt(d)
k_hat   = norm(k)
beta_s  = sigmoid(beta)
gate    = exp(g)

S_decay = gate * S_prev
kv_mem  = S_decay^T * k_hat
delta   = beta_s * (v - kv_mem)
S_new   = S_decay + outer(k_hat, delta)
out     = S_new^T * q_hat

Interpretation:

This is why Qwen3.5 feels Mamba-like without being "just Mamba": it has a short convolution path and a recurrent cache, but the core memory update is a DeltaNet matrix-state rule, and Qwen3.5 alternates these DeltaNet blocks with separate full Gated Attention blocks.

📊 Flash Attention Analysis

For a detailed comparison of attention implementations and performance analysis (including benchmarks vs llama.cpp), see:

Flash Attention Analysis: Why llama.cpp is Faster

SIMD optimizations, threading strategies, and performance trade-offs.

🔁 Gated DeltaNet Deep Dive

Visual infographic of the single-token recurrent update, memory layout, ISA dispatch, and how DeltaNet compares to standard attention:

Gated DeltaNet: Recurrent Attention Deep Dive

SVG diagrams, SIMD tier breakdown, and CK-Engine kernel rules.

2. Elementwise & Activation

  • gelu_fast_inplace (gelu_kernels.c): Gaussian Error Linear Unit. Uses the fast approximation (tanh).
  • swiglu_forward (swiglu_kernels.c): SwiGLU activation (Gated Linear Unit with Swish). Used in Llama/Mistral. Input size is 2 * dim.
  • sigmoid_forward (sigmoid_kernels.c): Standard logistic sigmoid function.

3. Normalization

  • rmsnorm_forward (rmsnorm_kernels.c): Root Mean Square Normalization. Lighter than LayerNorm (no mean subtraction).
  • layernorm_naive_serial (layernorm_kernels.c): Standard LayerNorm. Subtracts the mean, divides by the standard deviation.

4. GEMM (Matrix Multiplication)

  • gemm_naive_parallel (gemm_kernels.c): OpenMP-parallelized SGEMM (C = A * B).
  • gemm_avx512_parallel (gemm_kernels.c): Hand-tuned AVX-512 implementation for max throughput on modern Intel/AMD chips.

5. Loss & Training Kernels

  • softmax_cross_entropy_loss (loss_kernels.c): Fused softmax + CE loss/gradient kernel used by v7 training parity harness.
  • softmax_cross_entropy_loss_ptref (loss_kernels.c): Strict PyTorch-reference CE variant for parity experiments and drift diagnostics.

Cross-Entropy Deep Dive (v7)

For full derivation (p - one_hot), PyTorch reduction semantics (ignore_index, denominator rules), and long-horizon drift-fix evidence, see:

v7 Cross-Entropy Parity

Memory Layouts

Most kernels assume Row-Major contiguous memory unless specified otherwise.
