GEMM Optimization Deep Dive

How we built a high-performance matrix multiplication kernel that beats Intel MKL, inspired by oneDNN, BLIS, and decades of HPC research.

1.44x Faster than PyTorch/MKL

- Peak throughput: 52.95 GFLOPS
- 1.44x vs PyTorch/MKL
- 4.31x vs naive GEMM
- 8x8 register tile

Standing on the Shoulders of Giants

Our GEMM implementation draws from decades of high-performance computing research and industry-leading libraries:

[Diagram: lineage of the C-Kernel Engine (1.44x faster). Techniques drawn from oneDNN (Intel Deep Neural Network Library), BLIS (BLAS-like Library Instantiation Software), and Intel MKL (Math Kernel Library): JIT compilation, post-op fusion, microkernel design, register blocking, matrix packing, and cache optimization.]

The 8x8 Microkernel Architecture

The heart of our GEMM is an 8x8 microkernel that keeps all 64 accumulator values in AVX registers throughout the entire K-loop. This is the same strategy used by oneDNN and BLIS.

[Diagram: the 8x8 register-blocked microkernel computing C[8x8] = A[8xK] @ B[Kx8]. Eight YMM registers c0..c7 each hold one 8-float row of the C tile; one register holds b = B[k, 0:8], and a0..a7 are broadcasts of A[0:8, k]. Each K-loop iteration loads B[k, :], broadcasts the A column, and issues the FMAs c += a * b; C is stored to memory once, after the loop. Key insight: C values never leave registers during the entire K-loop, giving maximum arithmetic intensity.]
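The structure of that K-loop can be sketched in portable C. This is a simplified stand-in for the production kernel, which uses AVX intrinsics: here the 8x8 accumulator tile is a local array the compiler keeps in (YMM) registers, and `lda`/`ldb`/`ldc` are assumed row-major leading dimensions.

```c
#include <stddef.h>

/* 8x8 microkernel sketch: C[8x8] += A[8xK] * B[Kx8].
 * All 64 accumulators live in the local array `acc` for the entire
 * K-loop; C memory is touched exactly once, after the loop. */
static void microkernel_8x8(const float *A, const float *B, float *C,
                            size_t K, size_t lda, size_t ldb, size_t ldc) {
    float acc[8][8] = {{0}};
    for (size_t k = 0; k < K; k++) {
        for (size_t i = 0; i < 8; i++) {
            float a = A[i * lda + k];       /* broadcast of A[i, k]      */
            for (size_t j = 0; j < 8; j++)  /* one 8-wide FMA per row    */
                acc[i][j] += a * B[k * ldb + j];
        }
    }
    for (size_t i = 0; i < 8; i++)          /* single store of the tile  */
        for (size_t j = 0; j < 8; j++)
            C[i * ldc + j] += acc[i][j];
}
```

With `-mavx2 -mfma`, each `acc` row maps onto one YMM register and the inner loop becomes a `vfmadd` per row, which is exactly the register budget the diagram describes.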

Matrix Packing for Cache Efficiency

For large matrices, we pack A and B into contiguous memory layouts that maximize cache line utilization. This is the key technique that allowed us to beat MKL.

[Diagram: matrix packing, before vs after. Before: walking the A matrix [M x K] with a row stride misses cache on every row. After pack_a(), A is stored as contiguous MC x KC panels (Panel 0, Panel 1, Panel 2, ...) read sequentially. Likewise, pack_b() converts B's column stride into row-panel format, where each k step stores 8 contiguous floats (k=0: [8 floats], k=1: [8 floats], ...). Packing converts strided access to sequential access, so every fetched cache line is fully used.]
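A minimal sketch of the B-side packing, assuming row-major B and N a multiple of 8 (the production pack_b() also has to handle ragged edge tiles):

```c
#include <stddef.h>

/* Pack row-major B[K x N] into row-panel format: within each panel of
 * 8 columns, the 8 floats of step k are stored contiguously, so the
 * microkernel's load of B[k, 0:8] is a single sequential 32-byte read. */
static void pack_b_panels(const float *B, float *Bp, size_t K, size_t N) {
    size_t idx = 0;
    for (size_t j0 = 0; j0 < N; j0 += 8)     /* one 8-column panel      */
        for (size_t k = 0; k < K; k++)       /* walk K inside the panel */
            for (size_t j = 0; j < 8; j++)   /* 8 contiguous floats     */
                Bp[idx++] = B[k * N + (j0 + j)];
}
```

The A-side packing is symmetric: it copies MC x KC blocks of A so that the microkernel's broadcasts also read memory in order.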

Three-Level Cache Blocking

We tile the computation to fit each level of the memory hierarchy, minimizing data movement between DRAM and CPU.

[Diagram: memory hierarchy. Registers ~1 cycle; L1 cache (32KB) ~4 cycles; L2 cache (256KB) ~12 cycles; L3 cache (6-8MB) ~40 cycles; DRAM ~200 cycles.]

Our blocking parameters:

- A panel: MC x KC = 64 x 256. 64 × 256 × 4 bytes = 64KB, fits in L2; reused across all N tiles.
- B panel: KC x NC = 256 x 256. 256 × 256 × 4 bytes = 256KB, fits in L3; reused across all M tiles.
- Microkernel: MR x NR = 8 x 8. 8 × 8 × 4 bytes = 256 bytes, fits in 8 YMM registers; all accumulators stay in registers.

Loop nest structure: the outer k0 loop steps by KC (L3 blocking); the n0 loop steps by NC and packs a B panel (L2 blocking); the inner m0 loop steps by MC, packs an A panel, and calls the microkernel (L1 blocking, parallelized).
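The loop nest maps to C roughly as below. This is a structural sketch only: packing is elided and the microkernel is replaced by a plain inner loop, so the thing to read here is the blocking order, not the inner code.

```c
#include <stddef.h>

enum { MC = 64, NC = 256, KC = 256 };   /* blocking parameters from above */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Three-level blocked GEMM sketch: C[MxN] += A[MxK] * B[KxN], row-major.
 * Loop order matches the figure: k0 blocks for L3, n0 blocks the B panel
 * for L2, m0 blocks the A panel for L1 (pack_a/pack_b elided). */
static void gemm_blocked(const float *A, const float *B, float *C,
                         size_t M, size_t N, size_t K) {
    for (size_t k0 = 0; k0 < K; k0 += KC)
        for (size_t n0 = 0; n0 < N; n0 += NC)        /* pack_b() here  */
            for (size_t m0 = 0; m0 < M; m0 += MC) {  /* pack_a() here  */
                size_t kmax = min_sz(k0 + KC, K);
                size_t nmax = min_sz(n0 + NC, N);
                size_t mmax = min_sz(m0 + MC, M);
                for (size_t i = m0; i < mmax; i++)   /* microkernel in */
                    for (size_t k = k0; k < kmax; k++) {  /* real code */
                        float a = A[i * K + k];
                        for (size_t j = n0; j < nmax; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
            }
}
```

Because C is updated incrementally across k0 blocks, each block only needs its KC-slice of A and B resident in cache while it runs.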

Our Optimization Journey

Step 1: Naive Implementation
Triple nested loop with poor cache behavior. ~8 GFLOPS.
Step 2: OpenMP Parallelization
Added parallel for loops. Still cache-unfriendly.
Step 3: AVX Vectorization
8-wide SIMD using 256-bit YMM registers. ~15 GFLOPS.
Step 4: 8x8 Register Blocking
Keep 64 accumulators in registers across K-loop. ~20 GFLOPS.
Step 5: Cache Blocking (MC/NC/KC)
Tile for L1/L2/L3 cache hierarchy. ~25 GFLOPS.
Step 6: Matrix Packing
Pack A and B for contiguous access. ~30 GFLOPS.
Step 7: Software Prefetching + Loop Unrolling
Prefetch next cache line, unroll K by 4. 52.95 GFLOPS peak!
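For reference, the Step 1 baseline that all later numbers are measured against is just the textbook triple loop; the inner loop walks B down a column, which is what makes it cache-hostile for large K:

```c
#include <stddef.h>

/* Step 1 baseline: naive i-j-k triple loop, C = A * B, row-major.
 * For each output element the k-loop strides down a column of B,
 * touching a new cache line on nearly every access when K is large. */
static void gemm_naive(const float *A, const float *B, float *C,
                       size_t M, size_t N, size_t K) {
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < N; j++) {
            float s = 0.0f;
            for (size_t k = 0; k < K; k++)
                s += A[i * K + k] * B[k * N + j];  /* strided B access */
            C[i * N + j] = s;
        }
}
```

Every subsequent step in the journey attacks one specific weakness of this loop: parallelism, vector width, register reuse, cache reuse, and finally access pattern.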

Performance Results

| Matrix Size | PyTorch/MKL | Our Microkernel | Result |
| --- | --- | --- | --- |
| 32 x 32 x 32 | 5.17 GFLOPS | 10.19 GFLOPS | 2.0x faster |
| 64 x 64 x 64 | 14.03 GFLOPS | 22.94 GFLOPS | 1.6x faster |
| 128 x 128 x 128 | 19.75 GFLOPS | 27.75 GFLOPS | 1.4x faster |
| 256 x 256 x 256 | 22.41 GFLOPS | 32.76 GFLOPS | 1.5x faster |
| 512 x 512 x 512 | 22.50 GFLOPS | 24.22 GFLOPS | 1.1x faster |
| 1024 x 1024 x 1024 | 23.00 GFLOPS | 31.48 GFLOPS | 1.4x faster |
Our kernel beats MKL at every tested size.

v7 Training Threadpool Dispatch Playbook

Inference v6.6 and training v7 should both use lowered execution plans. For v7 training, IR2 defines gradient math, while IR3 must define parallel execution and reduction ownership.

IR1

Forward op graph from template + manifest (what to compute).

IR2

Backward synthesis and explicit gradient fanout/fanin accumulation (chain rule routing).

IR3 / Exec Plan

Memory layout + dispatch policy (split axis, tiles, threads, reduction order, barriers).

Dispatch policy by workload shape

| Workload | Preferred split | Reason |
| --- | --- | --- |
| Large-M GEMM (M = B*S) | Split M | Best cache locality and zero reduction overhead for independent rows. |
| Tiny-M GEMM (decode-like) | Split N | Improves core utilization when row-parallel work is too small. |
| Backward dW GEMM | Split K + explicit partial reduce | Good parallelism, but requires a deterministic reduction tree. |
| Elementwise ops | Contiguous range chunks | Simple scheduling and stable memory throughput. |
| Attention | Head split, then token-block split | Natural independence across heads, then balanced token work. |
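The table collapses into a small policy function. A sketch with hypothetical names (`split_axis_t`, `choose_split`) and an assumed rows-per-thread threshold; the engine's actual cutoffs may differ:

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { SPLIT_M, SPLIT_N, SPLIT_K_REDUCE } split_axis_t;

/* Illustrative GEMM dispatch policy: backward dW always splits K and
 * owns a deterministic reduce; forward GEMMs split M when there are at
 * least as many rows as threads, otherwise fall back to splitting N
 * (tiny-M, decode-like shapes). The threshold is an assumption. */
static split_axis_t choose_split(size_t m, bool is_backward_dw, int threads) {
    if (is_backward_dw)
        return SPLIT_K_REDUCE;   /* shared dW output: reduce partials */
    if (m >= (size_t)threads)
        return SPLIT_M;          /* enough independent rows per core  */
    return SPLIT_N;              /* split columns to fill the cores   */
}
```

Keeping this decision in the planner (IR3), not in the kernels, is what lets the JSON below pin the choice per op.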

Execution-plan JSON consumed by codegen

{
  "schema": "ck.train.exec.v1",
  "runtime": {"threads": 12, "simd": "avx2", "mode": "deterministic"},
  "ops": [
    {
      "op_id": 37,
      "phase": "forward",
      "kernel_id": "gemm_fwd_f32",
      "shape": {"m": 16, "n": 1024, "k": 1024},
      "dispatch": {"split_axis": "m", "tile_m": 4, "tile_n": 128, "threads": 12}
    },
    {
      "op_id": 109,
      "phase": "backward",
      "kernel_id": "gemm_backward_f32",
      "shape": {"m": 1024, "n": 1024, "k": 16},
      "dispatch": {"split_axis": "k", "threads": 12},
      "reduction": {"type": "sum", "order": "fixed_tree", "target": "grad.weight.layer.0.wq"}
    }
  ],
  "barriers": [{"after_op": 109, "reason": "grad_accum_boundary"}]
}
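The `"order": "fixed_tree"` field matters because float addition is not associative: summing per-thread dW partials in whatever order threads finish would give run-to-run jitter in gradients. A fixed pairwise tree, sketched below, combines partials in an order that depends only on the thread count:

```c
#include <stddef.h>

/* Deterministic pairwise-tree reduction: partials[t] holds thread t's
 * partial gradient (length n each). The combine order is fixed by
 * t_count alone, never by completion order, so the result is
 * bit-identical across runs. Reduces in place into partials[0]. */
static void reduce_fixed_tree(float **partials, size_t t_count, size_t n) {
    for (size_t stride = 1; stride < t_count; stride *= 2)
        for (size_t t = 0; t + stride < t_count; t += 2 * stride)
            for (size_t i = 0; i < n; i++)
                partials[t][i] += partials[t + stride][i];
}
```

A barrier after the producing op (as in `"barriers"` above) guarantees all partials exist before the tree runs.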

Rule for maintainability

Codegen should remain dumb. It should emit calls from train_exec_plan.json directly, not infer split/reduction policy ad hoc. This keeps parity, determinism, and performance tuning auditable.
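A dumb emitter is essentially a table walk over the parsed plan. This hypothetical sketch shows the shape: `exec_op_t` mirrors the JSON fields after parsing, and `emit_calls` is an illustrative name, not the engine's actual API.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical lowered form of one exec-plan op after JSON parsing. */
typedef struct {
    int op_id;
    const char *kernel_id;   /* e.g. "gemm_fwd_f32" */
    int m, n, k;
} exec_op_t;

/* Dumb codegen: emit one call per op, in plan order, using the plan's
 * own parameters. No split/reduction policy is inferred here. Returns
 * bytes written, or -1 on truncation. */
static int emit_calls(const exec_op_t *ops, int n_ops, char *out, size_t cap) {
    size_t used = 0;
    for (int i = 0; i < n_ops; i++) {
        int w = snprintf(out + used, cap - used, "%s(op_%d, %d, %d, %d);\n",
                         ops[i].kernel_id, ops[i].op_id,
                         ops[i].m, ops[i].n, ops[i].k);
        if (w < 0 || (size_t)w >= cap - used) return -1;
        used += (size_t)w;
    }
    return (int)used;
}
```

Because the emitter never makes decisions, a diff of two generated sources is exactly a diff of two plans, which is what keeps tuning auditable.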

Using the Microkernel

#include "ckernel_engine.h"

// Basic usage - automatically selects the best implementation
// (M, N, K are the problem dimensions)
float A[M * K], B[K * N], C[M * N];
gemm_microkernel(A, B, C, M, N, K, 0);  // B not transposed

// For neural network weights (B is [N, K] transposed)
gemm_microkernel(A, B, C, M, N, K, 1);  // B transposed

// Direct packed version for large matrices
gemm_microkernel_packed(A, B, C, M, N, K);

Source Files

gemm_microkernel.c

Main microkernel implementation with 8x8 register blocking, matrix packing, and cache blocking.

src/kernels/gemm_microkernel.c

gemm_fused_kernels.c

Fused GEMM operations: GEMM+ReLU, GEMM+GELU, GEMM+SiLU, and the dual-GEMM SwiGLU.

src/kernels/gemm_fused_kernels.c

test_gemm_microkernel.py

Unit test with accuracy verification and performance benchmarks vs PyTorch.

unittest/test_gemm_microkernel.py

References & Further Reading

BLIS: A Framework for Rapidly Instantiating BLAS Functionality

The foundational paper on microkernel-based GEMM design.

Field G. Van Zee, Robert A. van de Geijn

Anatomy of High-Performance Matrix Multiplication

Classic paper explaining cache blocking and register tiling.

Kazushige Goto, Robert A. van de Geijn

oneDNN Developer Guide

Intel's deep learning library with state-of-the-art GEMM kernels.

oneapi-src/oneDNN

How to Optimize GEMM

Practical tutorial on GEMM optimization techniques.

flame.cs.utexas.edu/~flame/web/
