GEMM Memory Layout Deep Dive
Understanding Weight & Activation Memory Layout
This page explains how weights, activations, and quantized data are stored in memory - critical for understanding cache behavior and debugging performance issues.
The Big Picture: GEMM Data Flow
Core Operation: Y = W × X where output[M] = weights[M,K] × input[K]
Weight Matrix W [M × K]
Row-major order (C-style): elements in the same row are contiguous.
// Row 0
W[0,0], W[0,1], W[0,2], ..., W[0,K-1]
// Row 1
W[1,0], W[1,1], W[1,2], ..., W[1,K-1]
...
Input X [K], Output Y [M]
// Input vector
X[0], X[1], X[2], ..., X[K-1]
// Output vector
Y[0], Y[1], Y[2], ..., Y[M-1]
Key Insight: Inner Loop Over K
For cache-friendly access, the inner loop should iterate over K (columns) since elements in the same row are contiguous.
// OPTIMAL: Inner loop over K (cache-friendly)
for (m = 0; m < M; m++) {
    y[m] = 0;
    for (k = 0; k < K; k++) {   // ← K elements are contiguous in W[m]
        y[m] += W[m][k] * x[k]; // Sequential memory access!
    }
}
Quantized Block Layout
FP32 (Full Precision)
32 bits per weight = 4 bytes per element
// Memory layout (bytes shown low-to-high address; on little-endian
// machines the low-order byte [7:0] of each float comes first)
addr 0x00: [7:0] [15:8] [23:16] [31:24] // float w[0]
addr 0x04: [7:0] [15:8] [23:16] [31:24] // float w[1]
addr 0x08: [7:0] [15:8] [23:16] [31:24] // float w[2]
...
Q8_0 Format (8-bit)
34 bytes per 32 weights = 8.5 bits/weight
struct block_q8_0 {
    ggml_fp16_t d;    // Offset 0-1: 2 bytes FP16 scale
    int8_t qs[32];    // Offset 2-33: 32 int8 weights
};                    // Total: 34 bytes
// Memory layout
0x00: d_low, d_high
0x02: qs[0], qs[1], qs[2], ..., qs[31]
...
Q5_0 Format (5-bit)
22 bytes per 32 weights = 5.5 bits/weight
struct block_q5_0 {
    ggml_fp16_t d;    // Offset 0-1: 2 bytes FP16 scale
    uint8_t qh[4];    // Offset 2-5: 32 high bits (1 per weight)
    uint8_t qs[16];   // Offset 6-21: packed nibbles (2 per byte)
};                    // Total: 22 bytes
// How a Q5_0 quant q is reconstructed (then w = d * (q - 16)):
// Weight 0: q = (qs[0] & 0x0F) | (qh bit 0 << 4)
// Weight 16: q = (qs[0] >> 4) | (qh bit 16 << 4)
Q4_K Format (4-bit K-quant)
144 bytes per 256 weights = 4.5 bits/weight
// Q4_K super-block structure (256 weights = 8 sub-blocks of 32)
struct block_q4_K {
    ggml_fp16_t d;        // Offset 0-1: super-block scale for scales
    ggml_fp16_t dmin;     // Offset 2-3: super-block scale for mins
    uint8_t scales[12];   // Offset 4-15: packed 6-bit scales & mins, 8 sub-blocks
    uint8_t qs[128];      // Offset 16-143: packed 4-bit weights (2 per byte)
};                        // Total: 2 + 2 + 12 + 128 = 144 bytes
Multi-Dimensional Tensor Layout
Activation Tensor: [Batch, Sequence, Hidden]
For a BERT-like model: batch=2, seq_len=512, hidden=768
// Stride calculation
stride_H = 1 // Elements are contiguous
stride_S = 768 // Next sequence position = 768
stride_B = 393216 // 512 * 768 = next batch
// Address of tensor[b, s, h]:
addr = base + b * 393216 + s * 768 + h * 1
Memory Layout Matters!
Hidden dimension varies fastest (stride=1) in C-style row-major order. This is why transformers store activations as [batch, seq, hidden] - not [hidden, seq, batch].
Cache Blocking (Tiling) for Performance
Without Blocking
Every access to W triggers a DRAM fetch!
Result: ~10 GB/s memory bandwidth
With Blocking (32×32)
Data reused from L1/L2 cache!
Result: ~100 GB/s effective bandwidth
Blocked GEMM Pseudocode
// Cache-blocked GEMM: C[M,N] += W[M,K] × X[K,N], tiled MB×KB and KB×NB
for (int m0 = 0; m0 < M; m0 += MB) {
    for (int k0 = 0; k0 < K; k0 += KB) {
        for (int n0 = 0; n0 < N; n0 += NB) {
            // The W and X tiles now fit in L1/L2: each element is fetched
            // from DRAM once, then reused from cache.
            for (int i = m0; i < m0 + MB && i < M; i++) {
                for (int j = n0; j < n0 + NB && j < N; j++) {
                    float sum = 0.0f;
                    for (int k = k0; k < k0 + KB && k < K; k++)
                        sum += W[i * K + k] * X[k * N + j]; // row-major flat indexing
                    C[i * N + j] += sum;
                }
            }
        }
    }
}
KV Cache: Sliding Window in Memory
KV Cache Structure
For context length 4096, n_heads=32, head_dim=128:
// Key Cache: [B, n_heads, seq, head_dim]
// Value Cache: [B, n_heads, seq, head_dim]
// Stride per seq: 128 × 4 bytes = 512 bytes
// Total K+V cache: 2 × 4096 × 32 × 128 × 4B ≈ 128 MB
During Prefill
Full sequence, compute all attention scores
Q × K_all → Attention Scores
During Decode
Only the new token's Q is computed; K is read back from the cache (sliding window)
Q_new × K_cached → New Token
Summary
| Concept | Key Point | Performance Impact |
|---|---|---|
| Row-Major Order | Same-row elements are contiguous | Inner loop over K for cache locality |
| Quantization | Q8_0 = 34B/32w, Q5_0 = 22B/32w | 4-8x memory savings vs FP32 |
| Tensor Strides | Hidden varies fastest (stride=1) | Proper indexing avoids cache misses |
| Cache Blocking | Process 32×32 tiles for L1 reuse | 10x bandwidth improvement |
| KV Cache | Sequential write, random-ish read | Critical for decode throughput |