Kernel Catalog

This document lists the high-performance C kernels available in the engine. These kernels are designed to be "micro-libraries"—you can copy src/kernels/rope_kernels.c into your own project without taking the rest of the engine.

Naming Convention

  • _naive: Reference implementation, easy to read, slow.
  • _parallel: OpenMP-accelerated for multi-core CPUs.
  • _avx512: Explicit intrinsics for AVX-512 (x86_64).
  • _head_major: Memory-layout variant where the head dimension is outermost (or stride-optimized).

1. Attention, RoPE & Recurrent State

  • attention_forward_causal_head_major (attention_kernels.c): Standard Scaled Dot-Product Attention (SDPA) with causal masking. Expects [Head, Token, Dim] layout.
  • rope_forward (rope_kernels.c): Rotary Positional Embeddings. Rotates query/key vectors in-place.
  • causal_softmax_head_major (softmax_kernels.c): Softmax applied to attention scores, masking out future tokens (causal mask).
  • gated_deltanet_autoregressive_forward (deltanet_kernels.c): Single-token Gated DeltaNet recurrent update for Qwen3.5/qwen3next parity. Updates per-head state matrices [Head, Dim, Dim] and emits [Head, Dim] output.

Composed vs Fused Gates

Not every gated block needs its own standalone kernel file.

Rule of thumb: keep CK graph-first and stitch atomic kernels by default; add fused wrappers only when measurements show the gate epilogue is a real hotspot.

Qwen3.5 DeltaNet Block Flow

The Qwen3.5 / qwen3next linear-attention block is easier to reason about if you separate the three different gating roles instead of collapsing them into one "state-space" idea.

  1. Project from the token input x. The block produces [q, k, v, z] from ssm_in(x) and [beta, alpha] from ssm_beta_alpha(x).
  2. Run the short convolution path. The local conv cache mixes the projected q/k/v lanes before the recurrent update. The z, beta, and alpha lanes do not go through that conv kernel.
  3. Turn alpha into the recurrent decay gate. In llama.cpp this becomes a decay control g via softplus(alpha + dt) followed by a learned scale. That gate controls how much of the old recurrent state survives.
  4. Update the per-head memory matrix S_t. DeltaNet keeps a recurrent state matrix [Head, Dim, Dim], not just a hidden state vector.
  5. Normalize and gate the recurrent output. After the DeltaNet core produces h_t, the block applies gated normalization: RMSNorm(h_t) * SiLU(z).
  6. Project back out. The gated normalized result goes through ssm_out to rejoin the residual stream.

For single-token decode, the recurrent update that CK matches in src/kernels/deltanet_kernels.c is:

q_hat   = norm(q) / sqrt(d)
k_hat   = norm(k)
beta_s  = sigmoid(beta)
gate    = exp(g)

S_decay = gate * S_prev
kv_mem  = S_decay^T * k_hat
delta   = beta_s * (v - kv_mem)
S_new   = S_decay + outer(k_hat, delta)
out     = S_new^T * q_hat

Interpretation:

This is why Qwen3.5 feels Mamba-like without being "just Mamba": it has a short convolution path and a recurrent cache, but the core memory update is a DeltaNet matrix-state rule, and Qwen3.5 alternates these DeltaNet blocks with separate full Gated Attention blocks.

📊 Flash Attention Analysis

For a detailed comparison of attention implementations and performance analysis (including benchmarks vs llama.cpp), see:

Flash Attention Analysis: Why llama.cpp is Faster

SIMD optimizations, threading strategies, and performance trade-offs.

🔁 Gated DeltaNet Deep Dive

Visual infographic of the single-token recurrent update, memory layout, ISA dispatch, and how DeltaNet compares to standard attention:

Gated DeltaNet: Recurrent Attention Deep Dive

SVG diagrams, SIMD tier breakdown, and CK-Engine kernel rules.

2. Elementwise & Activation

  • gelu_fast_inplace (gelu_kernels.c): Gaussian Error Linear Unit. Uses the fast approximation (tanh).
  • swiglu_forward (swiglu_kernels.c): SwiGLU activation (Gated Linear Unit with Swish). Used in Llama/Mistral. Input size is 2 * dim.
  • sigmoid_forward (sigmoid_kernels.c): Standard logistic sigmoid function.

3. Normalization

  • rmsnorm_forward (rmsnorm_kernels.c): Root Mean Square Normalization. Lighter than LayerNorm (no mean subtraction).
  • layernorm_naive_serial (layernorm_kernels.c): Standard LayerNorm. Subtracts the mean, divides by the standard deviation.

4. GEMM (Matrix Multiplication)

  • gemm_naive_parallel (gemm_kernels.c): OpenMP-parallelized SGEMM (C = A * B).
  • gemm_avx512_parallel (gemm_kernels.c): Hand-tuned AVX-512 implementation for max throughput on modern Intel/AMD chips.

5. Loss & Training Kernels

  • softmax_cross_entropy_loss (loss_kernels.c): Fused softmax + CE loss/gradient kernel used by v7 training parity harness.
  • softmax_cross_entropy_loss_ptref (loss_kernels.c): Strict PyTorch-reference CE variant for parity experiments and drift diagnostics.

Cross-Entropy Deep Dive (v7)

For full derivation (p - one_hot), PyTorch reduction semantics (ignore_index, denominator rules), and long-horizon drift-fix evidence, see:

v7 Cross-Entropy Parity

Memory Layouts

Most kernels assume Row-Major contiguous memory unless specified otherwise.
