Deep Dive: LLM Concepts

This page explains the key concepts, algorithms, and computational techniques used in modern Large Language Models. Understanding these fundamentals helps you see why each kernel exists and how the pieces fit together.

Reading Guide
Each section includes: conceptual explanation, where it fits in the architecture, the math, and our C implementation.

Transformer Architecture Overview

Before diving into individual components, let's see where everything fits in a decoder-only transformer (like Llama, GPT).


The Flow

  1. Token Embedding: Convert token IDs to vectors
  2. Position Encoding: Add position information (RoPE applied inside attention)
  3. N Decoder Layers, each containing:
    • RMSNorm → Self-Attention → Residual Add
    • RMSNorm → MLP (SwiGLU) → Residual Add
  4. Final RMSNorm
  5. LM Head: Project back to vocabulary (often weight-tied with embedding)

RoPE: Rotary Position Embedding

RoPE encodes position information by rotating query and key vectors based on their position in the sequence. Unlike absolute position embeddings, RoPE naturally captures relative position.

RoPE Detailed Explanation
Key Insight: RoPE Replaces Additive Position Embeddings
With RoPE, you go straight from token embedding to decoder layers. There's no separate position_embed to add. Position is injected via rotation inside each attention layer.

The Flow with RoPE

Traditional (GPT-2 style):
  token_embed + position_embed → layers

With RoPE:
  token_embed → layers (RoPE applied inside attention)

Token: "The"  "cat"  "sat"  "on"   "mat"
Pos:     0      1      2      3      4

Embedding lookup (NO position added):
  [vec0] [vec1] [vec2] [vec3] [vec4]
           |
      Decoder Layer
           |
      Q, K, V projection
           |
    +---------------------------+
    |  RoPE: rotate Q and K     |
    |  Q_0 rotated by 0*theta   |
    |  Q_1 rotated by 1*theta   |
    |  Q_2 rotated by 2*theta   |
    |  (same angles for K)      |
    +---------------------------+
           |
      Attention: Q @ K.T (now position-aware!)
    

What is rope_theta?

rope_theta (default: 10000) is the base frequency that controls how quickly positions rotate through the embedding space.

theta_i = rope_theta ^ (-2i / d)

For d=64, i=0:  theta = 10000^0 = 1
For d=64, i=16: theta = 10000^(-0.5) = 0.01
For d=64, i=31: theta = 10000^(-0.97) ~ 0.0001

Higher rope_theta = slower rotation = better for long contexts
(Llama 3.1 uses 500,000 for 128K context)

Position vs Dimension

Both are involved, but differently:

angle(pos, dim) = position * theta_i
Axis       What it controls
Position   How much to rotate (angle magnitude)
Dim pair   How fast to rotate (frequency)

Dim pair 0: rotates FAST (local patterns)
Dim pair 31: rotates SLOW (global patterns)

Same Rotation for Q and K?

YES! At the same position, both Q and K get the same rotation:

At position m:
  Q_m = rotate(Q_m, angle = m * theta)
  K_m = rotate(K_m, angle = m * theta)

At position n:
  K_n = rotate(K_n, angle = n * theta)

The magic happens in the dot product...

Why Rotation Works

When you compute Q_m · K_n after rotation:

Q_m · K_n = f(Q, K, m - n)

Only the angle difference matters!

  • Pos 5 → Pos 3: angle diff = 2*theta
  • Pos 100 → Pos 98: angle diff = 2*theta

Same relative distance = same attention behavior.

RoPE Math

For each pair of dimensions (2i, 2i+1) at position m:

Forward:
    theta_i = rope_theta ^ (-2i / head_dim)
    angle = m * theta_i

    x'[2i]   = x[2i] * cos(angle) - x[2i+1] * sin(angle)
    x'[2i+1] = x[2i] * sin(angle) + x[2i+1] * cos(angle)

Backward:
    Simply rotate by negative angle (transpose of rotation matrix)
    

Our implementation: rope_kernels.c - precomputes cos/sin cache, applies in-place to Q and K

RoPE Layouts: Pairwise vs Split-Half

One subtle but critical detail: RoPE is not only about the angle schedule. You also need the correct channel pairing layout. Two models can use the same rope_theta, the same cache, and the same attention code, yet still diverge badly if they pair channels differently.

Pairwise versus split-half RoPE layout comparison

Pairwise / Consecutive RoPE

This is the layout used by Llama-family checkpoints, including Nanbeige in our v7 work.

(0,1), (2,3), (4,5), ...

Each even/odd pair is treated as one 2D rotation plane. That makes the math read exactly like the textbook RoPE formula:

x'[2i]   = x[2i] * cos - x[2i+1] * sin
x'[2i+1] = x[2i] * sin + x[2i+1] * cos

Practical upside: the rotation is easy to reason about as adjacent feature pairs. Practical requirement: if a Llama checkpoint gets the split-half layout by mistake, parity fails immediately after RoPE.

Split-Half / NEOX RoPE

This layout is common in GPT-NeoX-style implementations. In this engine's current templates, Qwen and Gemma follow this family.

(0, d/2), (1, d/2+1), (2, d/2+2), ...

The vector is split into two halves, then each index in the first half rotates against the matching index in the second half.

Practical upside: it fits naturally with split-vector implementations and long-standing NEOX-style kernels. Practical requirement: it is only correct for checkpoints trained with that layout.

Important: This is a checkpoint semantic, not a tuning preference

Neither layout is inherently "better" in the abstract. The model family decides which one is correct. The right question is not "which RoPE is stronger?" but "which RoPE layout was this checkpoint trained with?"

Question                                    Correct framing
Do both use the same angles?                Yes, both use the same RoPE frequency schedule idea.
What changes?                               Which channels are paired into each 2D rotation plane.
What breaks if you choose the wrong one?    Q/K differ right after RoPE, then attention and logits drift downstream.
What should parity tooling check?           Embedding output, then per-layer hidden states, then the first op that diverges.

What is the actual difference?

The frequency schedule is the same idea in both cases. The actual difference is which channels form each rotation plane.

Pairwise:
  x[0] with x[1]
  x[2] with x[3]

Split-half:
  x[0] with x[d/2]
  x[1] with x[d/2+1]

That changes the rotated Q and K values immediately. If you get the layout wrong, attention will diverge even when embedding lookup, RMSNorm, and projection kernels are all correct.

Before training: how would an ML engineer choose?

If you are designing a model family from scratch, this choice is mostly about ecosystem fit and implementation coherence, not a proven universal quality advantage.

  • Choose pairwise if you want a Llama-style family that should match llama.cpp, common Llama checkpoints, and adjacent even/odd rotation math.
  • Choose split-half if you are building on a GPT-NeoX-style stack or existing kernels, loaders, and training code that already assume first-half/second-half pairing.
  • Choose the stack you can support end-to-end: training code, checkpoint export, inference kernels, parity tooling, and debugging all need to agree.

For most teams, the practical decision is: pick the RoPE layout of the ecosystem you want to join. Bigger modeling choices usually matter more for quality than pairwise-versus-split-half by itself.

Design-time note for ML engineers
If you are still before training, RoPE layout is a family-level architecture choice. It should be decided alongside tokenizer, attention variant, GQA/MQA policy, context strategy, and runtime targets. Once you train, the layout becomes part of the checkpoint contract and changing it later is effectively a different model.
Implementation note in CK right now
The older split-half path already has AVX/AVX-512 SIMD specialization. The new pairwise path was added to match Llama-family semantics first; in the current codebase it is the correctness path, and it is not yet vectorized like the split-half kernel.

Attention Mechanisms

Attention is the core of transformers: it lets each token "look at" all other tokens and decide what's relevant.

Scaled Dot-Product Attention

Attention Math

1. Project input to Q, K, V:
   Q = input @ W_q    # What am I looking for?
   K = input @ W_k    # What do I contain?
   V = input @ W_v    # What do I offer?

2. Compute attention scores:
   scores = Q @ K.T / sqrt(d_k)    # How relevant is each position?

3. Apply causal mask (decoder only):
   scores[i][j] = -inf if j > i    # Can't look at future tokens

4. Softmax to get weights:
   weights = softmax(scores)       # Normalize to probabilities

5. Weighted sum of values:
   output = weights @ V            # Aggregate information
    

Why Scale by sqrt(d_k)?

Without scaling, dot products grow with dimension size, pushing softmax into saturation (all attention on one token). Scaling keeps gradients healthy.

d_k = 64:  scale = 1/8 = 0.125
d_k = 128: scale = 1/11.3 ≈ 0.088
    

Flash Attention

Flash Attention is an algorithmic optimization that computes exact attention without materializing the full N×N attention matrix. It's about memory efficiency, not approximation.

Flash Attention Algorithm

The Problem

Standard attention materializes the full score matrix:

scores = Q @ K.T  # [N, N] matrix!

N = 2048:  16 MB
N = 8192:  256 MB
N = 32768: 4 GB
N = 131072: 64 GB  # Won't fit in GPU!

Memory is O(N²), which limits context length.

Flash Attention Solution

Process in tiles, never storing full N×N matrix:

  1. Load Q tile to fast SRAM
  2. Stream K, V tiles through
  3. Compute partial softmax with online algorithm
  4. Accumulate output incrementally

Memory: O(N) instead of O(N²)

Speed: 2-4x faster (memory-bound → compute-bound)

Online Softmax Trick

The key insight: you can compute softmax incrementally without seeing all values first.

Traditional: softmax(x) = exp(x) / sum(exp(x))  # Need all x first

Online (Flash):
    For each new block of scores:
        1. Update running max: m_new = max(m_old, max(block))
        2. Rescale previous sum: sum *= exp(m_old - m_new)
        3. Add new block contribution: sum += sum(exp(block - m_new))
        4. Update running output with correction factor
    

Key Insight: Flash Attention reduces memory traffic from O(N²) to O(N), making it valuable for any hardware where memory bandwidth is a bottleneck. The algorithmic improvement (online softmax + tiling) applies to both CPU and GPU.

📊 Performance Analysis: CPU Flash Attention

For a detailed comparison of CPU attention implementations (including benchmarks vs llama.cpp's ggml), see:

Flash Attention Analysis: Why llama.cpp is Faster

Covers SIMD optimizations, threading strategies, and performance trade-offs across different workloads (small vs large models, different context lengths).


Sliding-Window Attention

Sliding-window attention limits each query to the most recent W tokens. It is exact within the window, keeps compute linear in sequence length, and provides predictable latency for long contexts.

(Diagram: tokens 0-7 with a sliding window W=4; the allowed region for each token is itself plus the 3 previous tokens.)

Windowed Causal Mask

For token i:
  attend j in [max(0, i - W + 1), i]

scores[i][j] = -inf if j > i
scores[i][j] = -inf if i - j >= W
        

Compute: O(N * W) vs O(N²). Memory stays O(N * D).

Why It Matters

  • Long contexts without quadratic blow-up
  • Stable decode latency (only last W KV entries)
  • Works with Flash Attention (online softmax inside the window)

C-Kernel-Engine Implementation

We expose a dedicated op attention_sliding that uses the same head-major GQA layout as standard attention.

The only difference from causal attention is the window bounds. The math and softmax remain exact inside the window.


Grouped Query Attention (GQA)

GQA reduces memory and compute by sharing K, V heads across multiple Q heads.

Grouped Query Attention

Why GQA?

Type               Q Heads   K,V Heads   KV Cache Size   Models
MHA (Multi-Head)   32        32          100%            GPT-3, Llama 1
GQA                32        8           25%             Llama 2 70B, Llama 3
MQA (Multi-Query)  32        1           ~3%             Falcon, PaLM

GQA is the sweet spot: 4x smaller KV cache with minimal quality loss.


Normalization: RMSNorm vs LayerNorm

Normalization stabilizes training by keeping activations in a reasonable range.

RMSNorm

LayerNorm (Original)

mean = sum(x) / n
var = sum((x - mean)²) / n
y = gamma * (x - mean) / sqrt(var + eps) + beta
        

4 operations: mean, center, variance, normalize

Learnable: gamma (scale) and beta (shift)

RMSNorm (Simpler)

rms = sqrt(sum(x²) / n + eps)
y = gamma * x / rms
        

2 operations: RMS, normalize

Learnable: gamma only (no beta)

~15% faster, same quality

Why Remove Mean Centering?

Research found that the re-centering (subtracting mean) in LayerNorm isn't necessary for good performance. RMSNorm keeps just the scaling, which is what matters for gradient flow.

Used in: Llama, Mistral, Qwen, Gemma (most modern LLMs)


Activations: SwiGLU, GeGLU, GELU

Activation functions introduce non-linearity. Modern LLMs use gated activations (SwiGLU, GeGLU) for better gradient flow and expressivity.

SwiGLU Activation

GELU (GPT-2, BERT)

GELU(x) = x * Φ(x)
        ≈ x * sigmoid(1.702 * x)
        

Smooth approximation of ReLU that allows small negative values.

SwiGLU (Llama, Mistral)

Swish(x) = x * sigmoid(x)

SwiGLU(x) = Swish(x @ W_gate) * (x @ W_up)
        

Gated: One path controls how much of the other passes through.

Three projections (gate, up, down) instead of two, with intermediate_size typically shrunk to about 2/3 of the classic 4x hidden so the parameter count stays comparable. Better performance per FLOP at the same budget.

(Diagram: GeGLU splits x [tokens, 2*dim] into halves a and b; GELU gates a while b passes through; output [tokens, dim] = GELU(a) * b.)

GeGLU (Gemma, T5 variants)

Split input: x = [a | b]

GeGLU(x) = GELU(a) * b
GELU(a) = 0.5 * a * (1 + tanh(sqrt(2/pi) * (a + 0.044715 * a^3)))
    

GeGLU replaces the sigmoid/Swish gate with GELU, giving smoother gradients and better calibration for some model families.

Why Gating Works

The gate learns to selectively activate different features:

gate = activation(x @ W_gate)  # Swish (SwiGLU) or GELU (GeGLU)
value = x @ W_up            # The actual content

output = gate * value       # Gate controls information flow
    

This gives the network more expressive power: it can learn to completely shut off certain dimensions for certain inputs.


Weight Tying

Weight tying shares parameters between the input embedding and output projection, reducing model size with minimal quality loss.

Weight Tying

What Gets Shared

Input embedding:  token_id → vector    E[vocab, hidden]
Output projection: vector → logits     W[vocab, hidden]

With weight tying: W = E  (same matrix!)
    

Savings: For vocab=128K, hidden=4096: saves about 2 GB of parameters in fp32 (128256 × 4096 × 4 bytes)

Why It Works

Intuition: The embedding and the LM head learn closely related mappings:

Embedding:  token  → meaning vector
LM head:    vector → token scores

Gradient implication: During training, gradients from both the embedding lookup and the LM head accumulate into the same weight matrix.


Summary: What Each Config Field Controls

Config Field              What It Controls                           Typical Values
hidden_size               Main embedding dimension                   768, 2048, 4096, 8192
num_hidden_layers         Number of transformer blocks               12, 24, 32, 80
num_attention_heads       Q heads for attention                      12, 32, 64
num_key_value_heads       K,V heads (GQA)                            Same as Q heads, or 8, 4, 1
intermediate_size         MLP hidden dimension                       4x hidden (SwiGLU: ~2.67x)
rope_theta                RoPE base frequency                        10000, 500000, 1000000
sliding_window            Attention window size (0 = full causal)    0, 256, 512, 1024
rms_norm_eps              Numerical stability in norm                1e-5, 1e-6
vocab_size                Vocabulary size                            32000, 50257, 128256
max_position_embeddings   Maximum context length                     2048, 8192, 131072
tie_word_embeddings       Share embedding & LM head                  true / false

Further Reading
