Deep Dive: LLM Concepts
This page explains the key concepts, algorithms, and computational techniques used in modern Large Language Models. Understanding these fundamentals helps you see why each kernel exists and how the pieces fit together.
Each section includes: conceptual explanation, where it fits in the architecture, the math, and our C implementation.
Table of Contents
- Transformer Architecture Overview
- RoPE: Rotary Position Embedding
- RoPE Layouts: Pairwise vs Split-Half
- Attention Mechanisms
- Flash Attention
- Sliding-Window Attention
- Grouped Query Attention (GQA)
- Normalization: RMSNorm vs LayerNorm
- Activations: SwiGLU, GeGLU, GELU
- Weight Tying
Transformer Architecture Overview
Before diving into individual components, let's see where everything fits in a decoder-only transformer (like Llama, GPT).
The Flow
- Token Embedding: Convert token IDs to vectors
- Position Encoding: Add position information (RoPE applied inside attention)
- N Decoder Layers, each containing:
- RMSNorm → Self-Attention → Residual Add
- RMSNorm → MLP (SwiGLU) → Residual Add
- Final RMSNorm
- LM Head: Project back to vocabulary (often weight-tied with embedding)
RoPE: Rotary Position Embedding
RoPE encodes position information by rotating query and key vectors based on their position in the sequence. Unlike absolute position embeddings, RoPE naturally captures relative position.
With RoPE, you go straight from token embedding to decoder layers. There's no separate
position_embed to add. Position is injected via rotation inside each attention layer.
The Flow with RoPE
Traditional (GPT-2 style):
token_embed + position_embed → layers
With RoPE:
token_embed → layers (RoPE applied inside attention)
Token: "The" "cat" "sat" "on" "mat"
Pos: 0 1 2 3 4
Embedding lookup (NO position added):
[vec0] [vec1] [vec2] [vec3] [vec4]
|
Decoder Layer
|
Q, K, V projection
|
+---------------------------+
| RoPE: rotate Q and K |
| Q_0 rotated by 0*theta |
| Q_1 rotated by 1*theta |
| Q_2 rotated by 2*theta |
| (same angles for K) |
+---------------------------+
|
Attention: Q @ K.T (now position-aware!)
What is rope_theta?
rope_theta (default: 10000) is the base frequency that controls how quickly positions rotate through the embedding space.
theta_i = rope_theta ^ (-2i / d)

For d=64, i=0:  theta = 10000^0      = 1
For d=64, i=16: theta = 10000^(-0.5) = 0.01
For d=64, i=31: theta = 10000^(-0.97) ~ 0.0001
Higher rope_theta = slower rotation = better for long contexts
(Llama 3.1 uses 500,000 for 128K context)
Position vs Dimension
Both are involved, but differently:
angle(pos, dim) = position * theta_i
| Axis | What it controls |
|---|---|
| Position | How much to rotate (angle magnitude) |
| Dim pair | How fast to rotate (frequency) |
Dim pair 0: rotates FAST (local patterns)
Dim pair 31: rotates SLOW (global patterns)
Same Rotation for Q and K?
YES! At the same position, both Q and K get the same rotation:
At position m:
  Q_m = rotate(Q_m, angle = m * theta)
  K_m = rotate(K_m, angle = m * theta)

At position n:
  K_n = rotate(K_n, angle = n * theta)
The magic happens in the dot product...
Why Rotation Works
When you compute Q_m · K_n after rotation:
Q_m · K_n = f(Q, K, m - n)
Only the angle difference matters!
- Pos 5 → Pos 3: angle diff = 2*theta
- Pos 100 → Pos 98: angle diff = 2*theta
Same relative distance = same attention behavior.
RoPE Math
For each pair of dimensions (2i, 2i+1) at position m:
Forward:
theta_i = rope_theta ^ (-2i / head_dim)
angle = m * theta_i
x'[2i] = x[2i] * cos(angle) - x[2i+1] * sin(angle)
x'[2i+1] = x[2i] * sin(angle) + x[2i+1] * cos(angle)
Backward:
Simply rotate by negative angle (transpose of rotation matrix)
Our implementation: rope_kernels.c - precomputes cos/sin cache, applies in-place to Q and K
RoPE Layouts: Pairwise vs Split-Half
One subtle but critical detail: RoPE is not only about the angle schedule. You also need the correct channel pairing layout. Two models can use the same rope_theta, the same cache, and the same attention code, yet still diverge badly if they pair channels differently.
Pairwise / Consecutive RoPE
This is the layout used by Llama-family checkpoints, including Nanbeige in our v7 work.
(0,1), (2,3), (4,5), ...
Each even/odd pair is treated as one 2D rotation plane. That makes the math read exactly like the textbook RoPE formula:
x'[2i]   = x[2i] * cos - x[2i+1] * sin
x'[2i+1] = x[2i] * sin + x[2i+1] * cos
Practical upside: the rotation is easy to reason about as adjacent feature pairs. Practical requirement: if a Llama checkpoint gets the split-half layout by mistake, parity fails immediately after RoPE.
Split-Half / NEOX RoPE
This layout is common in GPT-NeoX-style implementations. In this engine's current templates, Qwen and Gemma follow this family.
(0, d/2), (1, d/2+1), (2, d/2+2), ...
The vector is split into two halves, then each index in the first half rotates against the matching index in the second half.
Practical upside: it fits naturally with split-vector implementations and long-standing NEOX-style kernels. Practical requirement: it is only correct for checkpoints trained with that layout.
Important: This is a checkpoint semantic, not a tuning preference
Neither layout is inherently "better" in the abstract. The model family decides which one is correct. The right question is not "which RoPE is stronger?" but "which RoPE layout was this checkpoint trained with?"
| Question | Correct framing |
|---|---|
| Do both use the same angles? | Yes, both use the same RoPE frequency schedule idea. |
| What changes? | Which channels are paired into each 2D rotation plane. |
| What breaks if you choose the wrong one? | Q/K differ right after RoPE, then attention and logits drift downstream. |
| What should parity tooling check? | Embedding output, then per-layer hidden states, then the first op that diverges. |
What is the actual difference?
The frequency schedule is the same idea in both cases. The actual difference is which channels form each rotation plane.
Pairwise:   x[0] with x[1],   x[2] with x[3]
Split-half: x[0] with x[d/2], x[1] with x[d/2+1]
That changes the rotated Q and K values immediately. If you get the layout wrong, attention will diverge even when embedding lookup, RMSNorm, and projection kernels are all correct.
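The pairing difference can be made concrete with two tiny index helpers (hypothetical names, for illustration only): given rotation plane i, each returns the two channel indices that plane couples.

```c
/* Pairwise / Llama-style: plane i couples adjacent channels (2i, 2i+1). */
static void plane_pairwise(int i, int head_dim, int *a, int *b) {
    (void)head_dim;            /* adjacent pairing ignores the head size */
    *a = 2 * i;
    *b = 2 * i + 1;
}

/* Split-half / NEOX-style: plane i couples channel i with channel i + d/2. */
static void plane_split_half(int i, int head_dim, int *a, int *b) {
    *a = i;
    *b = i + head_dim / 2;
}
```

Both helpers enumerate head_dim/2 planes with the same angle schedule; only the channel membership of each plane differs, which is exactly why the wrong choice corrupts Q and K immediately.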
Before training: how would an ML engineer choose?
If you are designing a model family from scratch, this choice is mostly about ecosystem fit and implementation coherence, not a proven universal quality advantage.
- Choose pairwise if you want a Llama-style family that should match llama.cpp, common Llama checkpoints, and adjacent even/odd rotation math.
- Choose split-half if you are building on a GPT-NeoX-style stack or existing kernels, loaders, and training code that already assume first-half/second-half pairing.
- Choose the stack you can support end-to-end: training code, checkpoint export, inference kernels, parity tooling, and debugging all need to agree.
For most teams, the practical decision is: pick the RoPE layout of the ecosystem you want to join. Bigger modeling choices usually matter more for quality than pairwise-versus-split-half by itself.
If you are still before training, RoPE layout is a family-level architecture choice. It should be decided alongside tokenizer, attention variant, GQA/MQA policy, context strategy, and runtime targets. Once you train, the layout becomes part of the checkpoint contract and changing it later is effectively a different model.
The older split-half path already has AVX/AVX-512 SIMD specialization. The new pairwise path was added to match Llama-family semantics first; in the current codebase it is the correctness path, and it is not yet vectorized like the split-half kernel.
Attention Mechanisms
Attention is the core of transformers: it lets each token "look at" all other tokens and decide what's relevant.
Attention Math
1. Project input to Q, K, V:
Q = input @ W_q # What am I looking for?
K = input @ W_k # What do I contain?
V = input @ W_v # What do I offer?
2. Compute attention scores:
scores = Q @ K.T / sqrt(d_k) # How relevant is each position?
3. Apply causal mask (decoder only):
scores[i][j] = -inf if j > i # Can't look at future tokens
4. Softmax to get weights:
weights = softmax(scores) # Normalize to probabilities
5. Weighted sum of values:
output = weights @ V # Aggregate information
Why Scale by sqrt(d_k)?
Without scaling, dot products grow with dimension size, pushing softmax into saturation (all attention on one token). Scaling keeps gradients healthy.
d_k = 64: scale = 1/8 = 0.125
d_k = 128: scale = 1/11.3 ≈ 0.088
Flash Attention
Flash Attention is an algorithmic optimization that computes exact attention without materializing the full N×N attention matrix. It's about memory efficiency, not approximation.
The Problem
Standard attention materializes the full score matrix:
scores = Q @ K.T   # [N, N] matrix!

N = 2048:   16 MB
N = 8192:   256 MB
N = 32768:  4 GB
N = 131072: 64 GB   # Won't fit in GPU!
Memory is O(N²), which limits context length.
Flash Attention Solution
Process in tiles, never storing full N×N matrix:
- Load Q tile to fast SRAM
- Stream K, V tiles through
- Compute partial softmax with online algorithm
- Accumulate output incrementally
Memory: O(N) instead of O(N²)
Speed: 2-4x faster (memory-bound → compute-bound)
Online Softmax Trick
The key insight: you can compute softmax incrementally without seeing all values first.
Traditional: softmax(x) = exp(x) / sum(exp(x)) # Need all x first
Online (Flash):
For each new block of scores:
1. Update running max: m_new = max(m_old, max(block))
2. Rescale previous sum: sum *= exp(m_old - m_new)
3. Add new block contribution: sum += sum(exp(block - m_new))
4. Update running output with correction factor
Key Insight: Flash Attention reduces memory traffic from O(N²) to O(N), making it valuable for any hardware where memory bandwidth is a bottleneck. The algorithmic improvement (online softmax + tiling) applies to both CPU and GPU.
📊 Performance Analysis: CPU Flash Attention
For a detailed comparison of CPU attention implementations (including benchmarks vs llama.cpp's ggml), see:
Flash Attention Analysis: Why llama.cpp is Faster
Covers SIMD optimizations, threading strategies, and performance trade-offs across different workloads (small vs large models, different context lengths).
Sliding-Window Attention
Sliding-window attention limits each query to the most recent W tokens. It is exact within the window, keeps compute linear in sequence length, and provides predictable latency for long contexts.
Windowed Causal Mask
For token i:
attend j in [max(0, i - W + 1), i]
scores[i][j] = -inf if j > i
scores[i][j] = -inf if i - j >= W
Compute: O(N * W) vs O(N²). Memory stays O(N * D).
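The window bounds reduce to two tiny helpers (hypothetical names, matching the mask above):

```c
/* First key index query i may attend to, with window size W. */
static int window_start(int i, int W) {
    int s = i - W + 1;
    return s > 0 ? s : 0;
}

/* 1 if query i may attend key j: causal (j <= i) and inside the window. */
static int can_attend(int i, int j, int W) {
    return j <= i && (i - j) < W;
}
```

Every query attends at most W keys, which is where the O(N * W) compute bound comes from.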
Why It Matters
- Long contexts without quadratic blow-up
- Stable decode latency (only last W KV entries)
- Works with Flash Attention (online softmax inside the window)
C-Kernel-Engine Implementation
We expose a dedicated op attention_sliding that uses the same head-major GQA layout as standard attention.
- attention_forward_causal_head_major_gqa_flash_strided_sliding (prefill)
- attention_forward_decode_head_major_gqa_flash_sliding (decode)
The only difference from causal attention is the window bounds. The math and softmax remain exact inside the window.
Grouped Query Attention (GQA)
GQA reduces memory and compute by sharing K, V heads across multiple Q heads.
Why GQA?
| Type | Q Heads | K,V Heads | KV Cache Size | Model |
|---|---|---|---|---|
| MHA (Multi-Head) | 32 | 32 | 100% | GPT-3, Llama 1 |
| GQA | 32 | 8 | 25% | Llama 2 70B, Llama 3 |
| MQA (Multi-Query) | 32 | 1 | 3% | Falcon, PaLM |
GQA is the sweet spot: 4x smaller KV cache with minimal quality loss.
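The sharing itself is just an index mapping from each Q head to its K/V head. A sketch (assumes n_q_heads is an exact multiple of n_kv_heads, as in the table above):

```c
/* Q heads are grouped evenly over KV heads: with 32 Q heads and 8 KV
   heads, Q heads 0-3 read KV head 0, Q heads 4-7 read KV head 1, etc. */
static int gqa_kv_head(int q_head, int n_q_heads, int n_kv_heads) {
    return q_head / (n_q_heads / n_kv_heads);
}
```

MHA and MQA fall out as the two extremes of the same mapping: n_kv_heads equal to n_q_heads (identity) or equal to 1 (everyone shares head 0).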
Normalization: RMSNorm vs LayerNorm
Normalization stabilizes training by keeping activations in a reasonable range.
LayerNorm (Original)
mean = sum(x) / n
var = sum((x - mean)²) / n
y = gamma * (x - mean) / sqrt(var + eps) + beta
4 operations: mean, center, variance, normalize
Learnable: gamma (scale) and beta (shift)
RMSNorm (Simpler)
rms = sqrt(sum(x²) / n + eps)
y = gamma * x / rms

(In Llama-style implementations, eps sits inside the sqrt for numerical stability.)
2 operations: RMS, normalize
Learnable: gamma only (no beta)
~15% faster, same quality
Why Remove Mean Centering?
Research found that the re-centering (subtracting mean) in LayerNorm isn't necessary for good performance. RMSNorm keeps just the scaling, which is what matters for gradient flow.
Used in: Llama, Mistral, Qwen, Gemma (most modern LLMs)
Activations: SwiGLU, GeGLU, GELU
Activation functions introduce non-linearity. Modern LLMs use gated activations (SwiGLU, GeGLU) for better gradient flow and expressivity.
GELU (GPT-2, BERT)
GELU(x) = x * Φ(x)
≈ x * sigmoid(1.702 * x)
Smooth approximation of ReLU that allows small negative values.
SwiGLU (Llama, Mistral)
Swish(x) = x * sigmoid(x)
SwiGLU(x) = Swish(x @ W_gate) * (x @ W_up)
Gated: One path controls how much of the other passes through.
Adds a third projection (the gate) to the MLP — roughly 1.5x the parameters at the same intermediate size, which is why SwiGLU models typically shrink intermediate_size to about 2.67x hidden. Better performance per FLOP.
GeGLU (Gemma, T5 variants)
Split input: x = [a | b]
GeGLU(x) = GELU(a) * b
GELU(a) = 0.5 * a * (1 + tanh(sqrt(2/pi) * (a + 0.044715 * a^3)))
GeGLU replaces the sigmoid/Swish gate with GELU, giving smoother gradients and better calibration for some model families.
Why Gating Works
The gate learns to selectively activate different features:
gate = activation(x @ W_gate) # Swish (SwiGLU) or GELU (GeGLU)
value = x @ W_up # The actual content
output = gate * value # Gate controls information flow
This gives the network more expressive power: it can learn to completely shut off certain dimensions for certain inputs.
Weight Tying
Weight tying shares parameters between the input embedding and output projection, reducing model size with minimal quality loss.
What Gets Shared
Input embedding: token_id → vector E[vocab, hidden]
Output projection: vector → logits W[vocab, hidden]
With weight tying: W = E (same matrix!)
Savings: For vocab=128K, hidden=4096: saves ~0.5B parameters (≈2 GB in fp32, 1 GB in bf16)
Why It Works
Intuition: The embedding and LM head learn similar things:
- Embedding: "What vector represents this token?"
- LM Head: "Which token does this vector represent?"
These are inverse operations, so sharing makes sense semantically.
Gradient implication: During training, gradients from both the embedding lookup and the LM head accumulate into the same weight matrix.
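With tying, the LM head is just a matmul against the embedding matrix: logit[t] is the dot product of the final hidden state with E's row for token t. A sketch (lm_head_tied is an illustrative name, not an engine function):

```c
/* Tied LM head: logits computed against the shared embedding matrix
   E[vocab][hidden], row-major. h is the final hidden state [hidden]. */
static void lm_head_tied(const float *E, const float *h, float *logits,
                         int vocab, int hidden) {
    for (int t = 0; t < vocab; t++) {
        float s = 0.0f;
        for (int k = 0; k < hidden; k++) s += E[t * hidden + k] * h[k];
        logits[t] = s;
    }
}
```

This is the same matrix the embedding lookup reads rows from, which is where the gradient accumulation described above comes from.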
Summary: What Each Config Field Controls
| Config Field | What It Controls | Typical Values |
|---|---|---|
| hidden_size | Main embedding dimension | 768, 2048, 4096, 8192 |
| num_hidden_layers | Number of transformer blocks | 12, 24, 32, 80 |
| num_attention_heads | Q heads for attention | 12, 32, 64 |
| num_key_value_heads | K,V heads (GQA) | Same as heads, or 8, 4, 1 |
| intermediate_size | MLP hidden dimension | 4x hidden (SwiGLU: 2.67x) |
| rope_theta | RoPE base frequency | 10000, 500000, 1000000 |
| sliding_window | Attention window size (0 = full causal) | 0, 256, 512, 1024 |
| rms_norm_eps | Numerical stability in norm | 1e-5, 1e-6 |
| vocab_size | Vocabulary size | 32000, 50257, 128256 |
| max_position_embeddings | Maximum context length | 2048, 8192, 131072 |
| tie_word_embeddings | Share embed & LM head | true/false |
Further Reading
- Kernel Reference - API documentation for each kernel
- Code Generation - How we generate C from configs
- Source Code - Browse the implementation