GEMM Memory Layout Deep Dive

Understanding Weight & Activation Memory Layout
This page explains how weights, activations, and quantized data are stored in memory, which is critical for understanding cache behavior and debugging performance issues.

The Big Picture: GEMM Data Flow

Core Operation: Y = W × X where output[M] = weights[M,K] × input[K]

Weight Matrix W [M × K]

Row-major order (C-style): elements in the same row are contiguous.

// Row 0
W[0,0], W[0,1], W[0,2], ..., W[0,K-1]
// Row 1
W[1,0], W[1,1], W[1,2], ..., W[1,K-1]
...

Input X [K], Output Y [M]

// Input vector
X[0], X[1], X[2], ..., X[K-1]

// Output vector
Y[0], Y[1], Y[2], ..., Y[M-1]

Key Insight: Inner Loop Over K

For cache-friendly access, the inner loop should iterate over K (columns) since elements in the same row are contiguous.

// OPTIMAL: Inner loop over K (cache-friendly)
for (m = 0; m < M; m++) {
    y[m] = 0;
    for (k = 0; k < K; k++) {       // ← K elements are contiguous in W[m]
        y[m] += W[m][k] * x[k];     // Sequential memory access!
    }
}
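The loop above can be wrapped into a complete function. A minimal sketch (the name matvec is ours, not from any library); W is row-major, so the inner k loop walks contiguous memory:

```c
#include <stddef.h>

// Matrix-vector product: y[M] = W[M,K] * x[K], W stored row-major.
void matvec(const float *W, const float *x, float *y, size_t M, size_t K) {
    for (size_t m = 0; m < M; m++) {
        float acc = 0.0f;
        for (size_t k = 0; k < K; k++) {
            acc += W[m * K + k] * x[k];  // sequential access within row m
        }
        y[m] = acc;
    }
}
```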

Quantized Block Layout

FP32 (Full Precision)

32 bits per weight = 4 bytes per element

// Memory layout
addr 0x00: [31:24] [23:16] [15:8] [7:0]   // float w[0]
addr 0x04: [31:24] [23:16] [15:8] [7:0]   // float w[1]
addr 0x08: [31:24] [23:16] [15:8] [7:0]   // float w[2]
...

Q8_0 Format (8-bit)

34 bytes per 32 weights = 8.5 bits/weight

struct block_q8_0 {
    ggml_fp16_t d;      // Offset 0-1: 2 bytes FP16 scale
    int8_t qs[32];      // Offset 2-33: 32 int8 weights
};  // Total: 34 bytes

// Memory layout
0x00: d_low, d_high
0x02: qs[0], qs[1], qs[2], ..., qs[31]
...
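Dequantizing a Q8_0 block is just a per-element multiply by the block scale. A self-contained sketch, using a plain float for the scale instead of ggml's FP16 type so it compiles on its own:

```c
#include <stdint.h>

// Illustrative Q8_0 block; the real ggml struct stores d as FP16,
// a float is used here so the sketch stands alone.
typedef struct {
    float  d;       // per-block scale
    int8_t qs[32];  // 32 quantized weights
} block_q8_0_demo;

// Dequantize one block: w[i] = d * qs[i]
static void dequant_q8_0(const block_q8_0_demo *b, float *out) {
    for (int i = 0; i < 32; i++) {
        out[i] = b->d * (float)b->qs[i];
    }
}
```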

Q5_0 Format (5-bit)

22 bytes per 32 weights = 5.5 bits/weight

struct block_q5_0 {
    ggml_fp16_t d;      // Offset 0-1: 2 bytes FP16 scale
    uint8_t qh[4];      // Offset 2-5: 32 high bits (1 per weight)
    uint8_t qs[16];     // Offset 6-21: packed nibbles (2 per byte)
};  // Total: 22 bytes

// How a Q5_0 weight is reconstructed (then offset by -16 to center):
// Weight 0:  ((qs[0] & 0x0F) | (qh bit 0  << 4)) - 16
// Weight 16: ((qs[0] >> 4)   | (qh bit 16 << 4)) - 16
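This reconstruction rule can be written out directly. A sketch assuming little-endian bit order in qh and the -16 offset that centers the 5-bit value into [-16, 15], as the reference kernels do:

```c
#include <stdint.h>

// Reconstruct weight i (0..31) of a Q5_0 block from the 16 packed
// nibble bytes and the 4-byte high-bit array. Weights 0-15 use the low
// nibbles, weights 16-31 the high nibbles of the same bytes.
static int32_t q5_0_weight(const uint8_t qs[16], const uint8_t qh[4], int i) {
    uint8_t nib  = (i < 16) ? (uint8_t)(qs[i] & 0x0F) : (uint8_t)(qs[i - 16] >> 4);
    uint8_t high = (qh[i / 8] >> (i % 8)) & 1;       // bit i of the 32-bit mask
    return (int32_t)(nib | (high << 4)) - 16;        // 5-bit value, centered
}
```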

Q4_K Format (4-bit K-quant)

144 bytes per 256 weights = 4.5 bits/weight

// Q4_K super-block structure (256 weights)
struct block_q4_K {
    ggml_fp16_t d;       // Offset 0-1: super-block scale (2 bytes)
    ggml_fp16_t dmin;    // Offset 2-3: super-block min scale (2 bytes)
    uint8_t scales[12];  // Offset 4-15: packed 6-bit scales & mins for 8 sub-blocks
    uint8_t qs[128];     // Offset 16-143: 256 packed 4-bit weights (2 per byte)
};  // Total: 144 bytes
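A quick sanity check of the 144-byte / 4.5-bits-per-weight figures, assuming the ggml accounting of 2 B scale + 2 B min + 12 B packed sub-block scales + 128 B quants per 256-weight super-block:

```c
// Bytes in one Q4_K super-block: d + dmin + scales + qs.
static int q4k_block_bytes(void) { return 2 + 2 + 12 + 128; }

// Effective bits per weight: bytes * 8 bits / 256 weights.
static double q4k_bits_per_weight(void) {
    return q4k_block_bytes() * 8.0 / 256.0;
}
```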

Multi-Dimensional Tensor Layout

Activation Tensor: [Batch, Sequence, Hidden]

For a BERT-like model: batch=2, seq_len=512, hidden=768

// Stride calculation (in elements)
stride_H = 1                    // Adjacent hidden elements are contiguous
stride_S = 768                  // One step in seq = hidden size elements
stride_B = 393216               // One step in batch = 512 * 768 elements

// Address of tensor[b, s, h]:
addr = base + b * 393216 + s * 768 + h * 1

Memory Layout Matters!
Hidden dimension varies fastest (stride=1) in C-style row-major order. This is why transformers store activations as [batch, seq, hidden], not [hidden, seq, batch].
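The stride arithmetic above generalizes to any [B, S, H] tensor. A small helper (the name act_index is ours) that computes the flat offset of tensor[b, s, h]:

```c
#include <stddef.h>

// Flat offset of tensor[b, s, h] in a row-major [B, S, H] tensor:
// stride_H = 1, stride_S = H, stride_B = S * H.
static size_t act_index(size_t b, size_t s, size_t h, size_t S, size_t H) {
    return b * (S * H) + s * H + h;
}
```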

Cache Blocking (Tiling) for Performance

Without Blocking

Every access to W triggers a DRAM fetch!

Result: ~10 GB/s effective bandwidth

With Blocking (32×32)

Data reused from L1/L2 cache!

Result: ~100 GB/s effective bandwidth

Blocked GEMM in C

// Cache-blocked GEMM: C[M,N] += W[M,K] * X[K,N], all row-major.
// MB, KB, NB are tile sizes chosen so a W tile and an X tile fit in cache.
for (int m0 = 0; m0 < M; m0 += MB) {
    for (int k0 = 0; k0 < K; k0 += KB) {
        for (int n0 = 0; n0 < N; n0 += NB) {

            // The MB×KB tile of W and the KB×NB tile of X stay hot in
            // cache and are reused across the whole inner computation.
            for (int i = m0; i < m0 + MB && i < M; i++) {
                for (int j = n0; j < n0 + NB && j < N; j++) {
                    float acc = C[i * N + j];
                    for (int k = k0; k < k0 + KB && k < K; k++) {
                        acc += W[i * K + k] * X[k * N + j];
                    }
                    C[i * N + j] = acc;
                }
            }
        }
    }
}

KV Cache: Sliding Window in Memory

KV Cache Structure

For context length 4096, n_heads=32, head_dim=128:

// Key Cache: [B, n_heads, seq, head_dim]
// Value Cache: [B, n_heads, seq, head_dim]

// Stride per token (one head, FP32): 128 × 4 bytes = 512 bytes
// Total K+V cache: 2 × 4096 × 32 × 128 × 4 B = 128 MiB
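The cache-size arithmetic is worth encoding directly, since it dominates memory planning at long context. A sketch (function name ours):

```c
#include <stdint.h>

// Total K+V cache bytes for one sequence:
// 2 tensors (K and V) * ctx_len * n_heads * head_dim * bytes per element.
static uint64_t kv_cache_bytes(uint64_t ctx, uint64_t n_heads,
                               uint64_t head_dim, uint64_t elem_size) {
    return 2 * ctx * n_heads * head_dim * elem_size;
}
```

With the figures above, kv_cache_bytes(4096, 32, 128, 4) gives exactly 128 MiB.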

During Prefill

Full sequence, compute all attention scores

Q × K_all → Attention Scores

During Decode

New Q, existing K (sliding window)

Q_new × K_existing → New Token
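For one head, the decode step reduces to dotting the single new query row against every cached key row. A minimal sketch (scaling by 1/sqrt(head_dim) and the softmax are omitted; names are ours):

```c
#include <stddef.h>

// Attention logits for one head during decode: the new query row q_new
// [head_dim] is dotted against seq_len cached key rows in k_cache.
static void decode_scores(const float *q_new, const float *k_cache,
                          float *scores, size_t seq_len, size_t head_dim) {
    for (size_t t = 0; t < seq_len; t++) {
        float acc = 0.0f;
        for (size_t d = 0; d < head_dim; d++) {
            acc += q_new[d] * k_cache[t * head_dim + d];
        }
        scores[t] = acc;  // scale and softmax happen downstream
    }
}
```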

Summary

| Concept         | Key Point                              | Performance Impact                   |
|-----------------|----------------------------------------|--------------------------------------|
| Row-major order | Same-row elements are contiguous       | Inner loop over K for cache locality |
| Quantization    | Q8_0 = 34 B / 32 w, Q5_0 = 22 B / 32 w | ~4-7x memory savings vs FP32         |
| Tensor strides  | Hidden varies fastest (stride = 1)     | Proper indexing avoids cache misses  |
| Cache blocking  | Process 32×32 tiles for L1 reuse       | ~10x bandwidth improvement           |
| KV cache        | Sequential write, random-ish read      | Critical for decode throughput       |