Deep Dive: LLM Concepts

This page explains the key concepts, algorithms, and computational techniques used in modern Large Language Models. Understanding these fundamentals helps you see why each kernel exists and how the pieces fit together.

Reading Guide
Each section includes: conceptual explanation, where it fits in the architecture, the math, and our C implementation.

Transformer Architecture Overview

Before diving into individual components, let's see where everything fits in a decoder-only transformer (like Llama, GPT).


The Flow

  1. Token Embedding: Convert token IDs to vectors
  2. Position Encoding: Add position information (RoPE applied inside attention)
  3. N Decoder Layers, each containing:
    • RMSNorm → Self-Attention → Residual Add
    • RMSNorm → MLP (SwiGLU) → Residual Add
  4. Final RMSNorm
  5. LM Head: Project back to vocabulary (often weight-tied with embedding)

RoPE: Rotary Position Embedding

RoPE encodes position information by rotating query and key vectors based on their position in the sequence. Unlike absolute position embeddings, RoPE naturally captures relative position.

RoPE Detailed Explanation
Key Insight: RoPE Replaces Additive Position Embeddings
With RoPE, you go straight from token embedding to decoder layers. There's no separate position_embed to add. Position is injected via rotation inside each attention layer.

The Flow with RoPE

Traditional (GPT-2 style):
  token_embed + position_embed → layers

With RoPE:
  token_embed → layers (RoPE applied inside attention)

Token: "The"  "cat"  "sat"  "on"   "mat"
Pos:     0      1      2      3      4

Embedding lookup (NO position added):
  [vec0] [vec1] [vec2] [vec3] [vec4]
           |
      Decoder Layer
           |
      Q, K, V projection
           |
    +---------------------------+
    |  RoPE: rotate Q and K     |
    |  Q_0 rotated by 0*theta   |
    |  Q_1 rotated by 1*theta   |
    |  Q_2 rotated by 2*theta   |
    |  (same angles for K)      |
    +---------------------------+
           |
      Attention: Q @ K.T (now position-aware!)
    

What is rope_theta?

rope_theta (default: 10000) is the base frequency that controls how quickly positions rotate through the embedding space.

theta_i = rope_theta ^ (-2i / d)

For d=64, i=0:  theta = 10000^0 = 1
For d=64, i=16: theta = 10000^(-0.5) = 0.01
For d=64, i=31: theta = 10000^(-0.97) ~ 0.0001

Higher rope_theta = slower rotation = better for long contexts
(Llama 3.1 uses 500,000 for 128K context)

Position vs Dimension

Both are involved, but differently:

angle(pos, dim) = position * theta_i
Axis       What it controls
Position   How much to rotate (angle magnitude)
Dim pair   How fast to rotate (frequency)

Dim pair 0: rotates FAST (local patterns)
Dim pair 31: rotates SLOW (global patterns)

Same Rotation for Q and K?

YES! At the same position, both Q and K get the same rotation:

At position m:
  Q_m = rotate(Q_m, angle = m * theta)
  K_m = rotate(K_m, angle = m * theta)

At position n:
  K_n = rotate(K_n, angle = n * theta)

The magic happens in the dot product...

Why Rotation Works

When you compute Q_m · K_n after rotation:

Q_m · K_n = f(Q, K, m - n)

Only the angle difference matters!

  • Pos 5 → Pos 3: angle diff = 2*theta
  • Pos 100 → Pos 98: angle diff = 2*theta

Same relative distance = same attention behavior.

RoPE Math

For each pair of dimensions (2i, 2i+1) at position m:

Forward:
    theta_i = rope_theta ^ (-2i / head_dim)
    angle = m * theta_i

    x'[2i]   = x[2i] * cos(angle) - x[2i+1] * sin(angle)
    x'[2i+1] = x[2i] * sin(angle) + x[2i+1] * cos(angle)

Backward:
    Simply rotate by negative angle (transpose of rotation matrix)
    

Our implementation: rope_kernels.c - precomputes cos/sin cache, applies in-place to Q and K

RoPE Layouts: Pairwise vs Split-Half

One subtle but critical detail: RoPE is not only about the angle schedule. You also need the correct channel pairing layout. Two models can use the same rope_theta, the same cache, and the same attention code, yet still diverge badly if they pair channels differently.

Pairwise versus split-half RoPE layout comparison

Pairwise / Consecutive RoPE

This is the layout used by Llama-family checkpoints, including Nanbeige in our v7 work.

(0,1), (2,3), (4,5), ...

Each even/odd pair is treated as one 2D rotation plane. That makes the math read exactly like the textbook RoPE formula:

x'[2i]   = x[2i] * cos - x[2i+1] * sin
x'[2i+1] = x[2i] * sin + x[2i+1] * cos

Practical upside: the rotation is easy to reason about as adjacent feature pairs. Practical requirement: if a Llama checkpoint gets the split-half layout by mistake, parity fails immediately after RoPE.

Split-Half / NEOX RoPE

This layout is common in GPT-NeoX-style implementations. In this engine's current templates, Qwen and Gemma follow this family.

(0, d/2), (1, d/2+1), (2, d/2+2), ...

The vector is split into two halves, then each index in the first half rotates against the matching index in the second half.

Practical upside: it fits naturally with split-vector implementations and long-standing NEOX-style kernels. Practical requirement: it is only correct for checkpoints trained with that layout.

Important: This is a checkpoint semantic, not a tuning preference

Neither layout is inherently "better" in the abstract. The model family decides which one is correct. The right question is not "which RoPE is stronger?" but "which RoPE layout was this checkpoint trained with?"

Question                                    Correct framing
Do both use the same angles?                Yes, both use the same RoPE frequency schedule idea.
What changes?                               Which channels are paired into each 2D rotation plane.
What breaks if you choose the wrong one?    Q/K differ right after RoPE, then attention and logits drift downstream.
What should parity tooling check?           Embedding output, then per-layer hidden states, then the first op that diverges.

What is the actual difference?

The frequency schedule is the same idea in both cases. The actual difference is which channels form each rotation plane.

Pairwise:
  x[0] with x[1]
  x[2] with x[3]

Split-half:
  x[0] with x[d/2]
  x[1] with x[d/2+1]

That changes the rotated Q and K values immediately. If you get the layout wrong, attention will diverge even when embedding lookup, RMSNorm, and projection kernels are all correct.

Before training: how would an ML engineer choose?

If you are designing a model family from scratch, this choice is mostly about ecosystem fit and implementation coherence, not a proven universal quality advantage.

  • Choose pairwise if you want a Llama-style family that should match llama.cpp, common Llama checkpoints, and adjacent even/odd rotation math.
  • Choose split-half if you are building on a GPT-NeoX-style stack or existing kernels, loaders, and training code that already assume first-half/second-half pairing.
  • Choose the stack you can support end-to-end: training code, checkpoint export, inference kernels, parity tooling, and debugging all need to agree.

For most teams, the practical decision is: pick the RoPE layout of the ecosystem you want to join. Bigger modeling choices usually matter more for quality than pairwise-versus-split-half by itself.

Design-time note for ML engineers
If you are still before training, RoPE layout is a family-level architecture choice. It should be decided alongside tokenizer, attention variant, GQA/MQA policy, context strategy, and runtime targets. Once you train, the layout becomes part of the checkpoint contract and changing it later is effectively a different model.
Implementation note in CK right now
The older split-half path already has AVX/AVX-512 SIMD specialization. The new pairwise path was added to match Llama-family semantics first; in the current codebase it is the correctness path, and it is not yet vectorized like the split-half kernel.

Attention Mechanisms

Attention is the core of transformers: it lets each token "look at" all other tokens and decide what's relevant.

Scaled Dot-Product Attention

Attention Math

1. Project input to Q, K, V:
   Q = input @ W_q    # What am I looking for?
   K = input @ W_k    # What do I contain?
   V = input @ W_v    # What do I offer?

2. Compute attention scores:
   scores = Q @ K.T / sqrt(d_k)    # How relevant is each position?

3. Apply causal mask (decoder only):
   scores[i][j] = -inf if j > i    # Can't look at future tokens

4. Softmax to get weights:
   weights = softmax(scores)       # Normalize to probabilities

5. Weighted sum of values:
   output = weights @ V            # Aggregate information
    

Why Scale by sqrt(d_k)?

Without scaling, dot products grow with dimension size, pushing softmax into saturation (all attention on one token). Scaling keeps gradients healthy.

d_k = 64:  scale = 1/8 = 0.125
d_k = 128: scale = 1/11.3 ≈ 0.088
    

Flash Attention

Flash Attention is an algorithmic optimization that computes exact attention without materializing the full N×N attention matrix. It's about memory efficiency, not approximation.

Flash Attention Algorithm

The Problem

Standard attention materializes the full score matrix:

scores = Q @ K.T  # [N, N] matrix!

N = 2048:  16 MB
N = 8192:  256 MB
N = 32768: 4 GB
N = 131072: 64 GB  # Won't fit in GPU!

Memory is O(N²), which limits context length.

Flash Attention Solution

Process in tiles, never storing full N×N matrix:

  1. Load Q tile to fast SRAM
  2. Stream K, V tiles through
  3. Compute partial softmax with online algorithm
  4. Accumulate output incrementally

Memory: O(N) instead of O(N²)

Speed: 2-4x faster (memory-bound → compute-bound)

Online Softmax Trick

The key insight: you can compute softmax incrementally without seeing all values first.

Traditional: softmax(x) = exp(x) / sum(exp(x))  # Need all x first

Online (Flash):
    For each new block of scores:
        1. Update running max: m_new = max(m_old, max(block))
        2. Rescale previous sum: sum *= exp(m_old - m_new)
        3. Add new block contribution: sum += sum(exp(block - m_new))
        4. Update running output with correction factor
    

Key Insight: Flash Attention reduces memory traffic from O(N²) to O(N), making it valuable for any hardware where memory bandwidth is a bottleneck. The algorithmic improvement (online softmax + tiling) applies to both CPU and GPU.

📊 Performance Analysis: CPU Flash Attention

For a detailed comparison of CPU attention implementations (including benchmarks vs llama.cpp's ggml), see:

Flash Attention Analysis: Why llama.cpp is Faster

Covers SIMD optimizations, threading strategies, and performance trade-offs across different workloads (small vs large models, different context lengths).


Sliding-Window Attention

Sliding-window attention limits each query to the most recent W tokens. It is exact within the window, keeps compute linear in sequence length, and provides predictable latency for long contexts.

(Diagram: tokens 0-7 with a sliding window W=4; the allowed region for each token is itself plus the 3 previous tokens.)

Windowed Causal Mask

For token i:
  attend j in [max(0, i - W + 1), i]

scores[i][j] = -inf if j > i
scores[i][j] = -inf if i - j >= W
        

Compute: O(N * W) vs O(N²). Memory stays O(N * D).

Why It Matters

  • Long contexts without quadratic blow-up
  • Stable decode latency (only last W KV entries)
  • Works with Flash Attention (online softmax inside the window)

C-Kernel-Engine Implementation

We expose a dedicated op attention_sliding that uses the same head-major GQA layout as standard attention.

The only difference from causal attention is the window bounds. The math and softmax remain exact inside the window.


Grouped Query Attention (GQA)

GQA reduces memory and compute by sharing K, V heads across multiple Q heads.

Grouped Query Attention

Why GQA?

Type               Q Heads   K,V Heads   KV Cache Size   Models
MHA (Multi-Head)   32        32          100%            GPT-3, Llama 1
GQA                32        8           25%             Llama 2 70B, Llama 3
MQA (Multi-Query)  32        1           ~3%             Falcon, PaLM

GQA is the sweet spot: 4x smaller KV cache with minimal quality loss.


Normalization: RMSNorm vs LayerNorm

Normalization stabilizes training by keeping activations in a reasonable range.

RMSNorm

LayerNorm (Original)

mean = sum(x) / n
var = sum((x - mean)²) / n
y = gamma * (x - mean) / sqrt(var + eps) + beta
        

4 operations: mean, center, variance, normalize

Learnable: gamma (scale) and beta (shift)

RMSNorm (Simpler)

rms = sqrt(sum(x²) / n + eps)
y = gamma * x / rms
        

2 operations: RMS, normalize

Learnable: gamma only (no beta)

~15% faster, same quality

Why Remove Mean Centering?

Research found that the re-centering (subtracting mean) in LayerNorm isn't necessary for good performance. RMSNorm keeps just the scaling, which is what matters for gradient flow.

Used in: Llama, Mistral, Qwen, Gemma (most modern LLMs)


Activations: SwiGLU, GeGLU, GELU

Activation functions introduce non-linearity. Modern LLMs use gated activations (SwiGLU, GeGLU) for better gradient flow and expressivity.

SwiGLU Activation

GELU (GPT-2, BERT)

GELU(x) = x * Φ(x)
        ≈ x * sigmoid(1.702 * x)
        

Smooth approximation of ReLU that allows small negative values.

SwiGLU (Llama, Mistral)

Swish(x) = x * sigmoid(x)

SwiGLU(x) = Swish(x @ W_gate) * (x @ W_up)
        

Gated: One path controls how much of the other passes through.

Three projections (gate, up, down) instead of two, with intermediate_size typically shrunk to about 2/3 of the classic 4x hidden so the parameter count stays comparable. Better performance per FLOP at the same budget.

(Diagram: GeGLU splits x [tokens, 2*dim] into halves a and b; GELU gates a while b passes through; output [tokens, dim] = GELU(a) * b.)

GeGLU (Gemma, T5 variants)

Split input: x = [a | b]

GeGLU(x) = GELU(a) * b
GELU(a) = 0.5 * a * (1 + tanh(sqrt(2/pi) * (a + 0.044715 * a^3)))
    

GeGLU replaces the sigmoid/Swish gate with GELU, giving smoother gradients and better calibration for some model families.

Why Gating Works

The gate learns to selectively activate different features:

gate = activation(x @ W_gate)  # Swish (SwiGLU) or GELU (GeGLU)
value = x @ W_up            # The actual content

output = gate * value       # Gate controls information flow
    

This gives the network more expressive power: it can learn to completely shut off certain dimensions for certain inputs.


Weight Tying

Weight tying shares parameters between the input embedding and output projection, reducing model size with minimal quality loss.

Weight Tying

What Gets Shared

Input embedding:  token_id → vector    E[vocab, hidden]
Output projection: vector → logits     W[vocab, hidden]

With weight tying: W = E  (same matrix!)
    

Savings: For vocab=128K, hidden=4096: saves about 2 GB of parameters in fp32 (128256 × 4096 × 4 bytes)

Why It Works

Intuition: The embedding and the LM head learn closely related mappings:

Embedding:  token  → meaning vector
LM head:    vector → token scores

Gradient implication: During training, gradients from both the embedding lookup and the LM head accumulate into the same weight matrix.


Summary: What Each Config Field Controls

Config Field              What It Controls                           Typical Values
hidden_size               Main embedding dimension                   768, 2048, 4096, 8192
num_hidden_layers         Number of transformer blocks               12, 24, 32, 80
num_attention_heads       Q heads for attention                      12, 32, 64
num_key_value_heads       K,V heads (GQA)                            Same as Q heads, or 8, 4, 1
intermediate_size         MLP hidden dimension                       4x hidden (SwiGLU: ~2.67x)
rope_theta                RoPE base frequency                        10000, 500000, 1000000
sliding_window            Attention window size (0 = full causal)    0, 256, 512, 1024
rms_norm_eps              Numerical stability in norm                1e-5, 1e-6
vocab_size                Vocabulary size                            32000, 50257, 128256
max_position_embeddings   Maximum context length                     2048, 8192, 131072
tie_word_embeddings       Share embedding & LM head                  true / false

Further Reading
