Gated DeltaNet: Recurrent Attention Deep Dive
src/kernels/deltanet_kernels.c — FP32 Gated DeltaNet with REF, AVX, AVX2, and AVX-512 implementations.
Matches the single-token recurrent path used by Qwen3.5 / qwen3next in llama.cpp.
How One DeltaNet Step Works
The equations below walk through the complete per-head recurrent update that runs at every single-token decode step. Unlike standard attention (which recomputes over the full KV-cache), DeltaNet maintains a fixed-size state matrix S that is updated in place.
Mathematical Equations
Per head (h = 0 … num_heads-1):
q̂ = L2_normalize(q) / √state_dim ← scaled unit query
k̂ = L2_normalize(k) ← unit key
β_s = σ(β) = 1/(1+e^(-β)) ← write gate (0,1)
gate = exp(g) ← decay gate (0,∞)
S_decay = gate · S_prev ← element-wise forget
kv_mem = S_decayᵀ · k̂ ← what memory recalls for k̂
δ = β_s · (v − kv_mem) ← correction: new info minus old
S_new = S_decay + outer(k̂, δ) ← rank-1 write to memory
out = S_newᵀ · q̂ ← read from updated memory
Key insight: The delta rule computes the error between what the model wants to store (v) and what the state already recalls (kv_mem). Only the correction is written, gated by β_s. This makes DeltaNet a learned associative memory with selective forgetting.
Memory Layout
All arrays are flat row-major FP32 buffers. The head dimension is outermost for cache-friendly per-head iteration.
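As a concrete sketch of that layout, the flat offset for head h, row i, column j of a d×d state works out as below; the helper name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative index helper for a flat row-major [num_heads][d][d] FP32
 * buffer with the head dimension outermost. The name is hypothetical. */
static inline size_t state_index(size_t head, size_t row, size_t col,
                                 size_t d) {
    return (head * d + row) * d + col;
}
```

With the head outermost, each head's d×d block is contiguous, so the per-head inner loops stay within one cache-friendly region.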
ISA Dispatch & SIMD Tiers
The public entry point gated_deltanet_autoregressive_forward() selects the best compiled implementation at link time. When strict parity is enabled (via ck_strict_parity_enabled()), it always falls back to the scalar reference.
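A toy sketch of that selection pattern follows. The function names and return codes are invented for illustration; only the strict-parity fallback to the scalar reference mirrors the behavior described above.

```c
#include <assert.h>

/* Toy dispatch sketch: the widest tier the compiler enables wins,
 * but strict-parity mode always forces the scalar reference. */
typedef int (*deltanet_impl_fn)(void);
static int impl_ref(void)    { return 0; } /* scalar reference */
static int impl_avx2(void)   { return 2; } /* illustrative tier */
static int impl_avx512(void) { return 3; } /* illustrative tier */

static deltanet_impl_fn select_deltanet_impl(int strict_parity) {
    if (strict_parity) return impl_ref;    /* parity => bit-exact scalar */
#if defined(__AVX512F__)
    return impl_avx512;
#elif defined(__AVX2__)
    return impl_avx2;
#else
    return impl_ref;
#endif
}
```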
SIMD Optimization Strategy
Reference (_ref)
- Pure scalar C — no intrinsics
- Triple-nested loops: head → row → col
- Used for parity testing against llama.cpp
- Bit-exact reference for correctness
AVX (_avx)
- 256-bit SIMD — 8 floats per instruction
- Pre-normalizes q̂ / k̂ before the state sweep
- Single row walk per state-matrix pass
- Scalar tail loop for non-multiple-of-8 dims
AVX2 + FMA (_avx2)
- Same 256-bit width, adds FMA: a×b+c in 1 cycle
- 2-row unroll: processes rows in pairs to halve loop overhead
- Uses _mm256_fmadd_ps where available
- Falls back to mul+add without FMA flag
AVX-512 (_avx512)
- 512-bit SIMD — 16 floats per instruction
- Native FMA via
_mm512_fmadd_ps - Uses
_mm512_reduce_add_psfor hsum - Largest vector width = fewest iterations
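The FMA-versus-fallback split can be sketched with an 8-float dot product; the function name is illustrative, and the preprocessor guard stands in for the "falls back to mul+add without FMA flag" behavior described above.

```c
#include <assert.h>

#if defined(__FMA__) && defined(__AVX2__)
#include <immintrin.h>
/* FMA path: one fused multiply-add per 8-lane step, then horizontal sum. */
static float fma_dot8(const float *a, const float *b) {
    __m256 acc = _mm256_fmadd_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b),
                                 _mm256_setzero_ps()); /* a*b + 0, fused */
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
                          _mm256_extractf128_ps(acc, 1));
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}
#else
/* Fallback path: separate multiply and add, same result. */
static float fma_dot8(const float *a, const float *b) {
    float acc = 0.0f;
    for (int i = 0; i < 8; i++) acc += a[i] * b[i];
    return acc;
}
#endif
```

Both paths compute the same dot product; the guard only decides whether the multiply and add are fused into one instruction.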
DeltaNet vs Standard Attention
| Property | Standard Attention | Gated DeltaNet |
|---|---|---|
| Memory per head | O(T × d) — grows with context | O(d²) — fixed |
| Per-token cost | O(T × d) — scans full KV-cache | O(d²) — constant |
| State | KV-cache (append-only) | Matrix S (overwrite via delta rule) |
| Forgetting | None (or windowed) | Exponential decay via exp(g) |
| Write mechanism | Append new K,V rows | Rank-1 correction: outer(k̂, δ) |
| Best for | Precise long-range recall | Streaming / very long contexts |
CK-Engine Kernel Rules
Like all CK kernels, the DeltaNet implementation follows strict rules:
🚫 No malloc / free
Memory comes from a bump allocator; all pointers are passed in. Stack arrays (CK_DELTANET_MAX_STACK_DIM = 4096) are used for temporaries.
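A minimal sketch of the no-heap rule, reusing the CK_DELTANET_MAX_STACK_DIM cap named above; the guard function itself is hypothetical.

```c
#include <assert.h>

#define CK_DELTANET_MAX_STACK_DIM 4096 /* cap stated in the rule above */

/* Hypothetical guard: since the kernel never calls malloc, callers must
 * reject dims that would overflow the fixed-size stack temporaries. */
static int deltanet_dim_fits_stack(int state_dim) {
    return state_dim > 0 && state_dim <= CK_DELTANET_MAX_STACK_DIM;
}
```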
🚫 No OpenMP
Parallelization happens at the orchestrator/codegen layer. Kernels are single-threaded, deterministic units.
✅ Pure computation
No side effects, no global state. Given the same inputs, produces identical outputs. Essential for parity testing.
✅ Defined API contract
Every kernel declares: inputs, outputs, workspace requirements, and memory layouts. The dispatcher selects ISA at compile time.
📊 Back to Kernel Catalog
For the full list of CK-Engine kernels (GEMM, RoPE, Softmax, Loss, etc.), see: