CK-Engine Thread Pool

A plain-English deep dive into how persistent pthread thread pools work, why they exist, and how they compare to OpenMP and memory pools.

ck_threadpool.c · Sub-µs Dispatch · Spin + Condvar Hybrid

Contents

  1. What Is a "Pool"? (The Core Idea)
  2. The Problem: Why Not Just Use OpenMP?
  3. Anatomy of CK-Engine's Thread Pool
  4. Lifecycle: Create → Dispatch → Destroy
  5. Dispatch Deep Dive: What Actually Happens
  6. Spin-Wait vs. Sleep: The Hybrid Polling Design
  7. Atomics, Mutexes, and Memory Ordering
  8. Cache Lines & False Sharing
  9. The Barrier: How Threads Synchronize
  10. Thread Pool vs. OpenMP — Side-by-Side
  11. Thread Pool vs. Memory Pool — Same Idea, Different Resource
  12. Why Is It Faster?
  13. Real Numbers: Before & After
  14. How Parallelization Actually Works
  15. Per-Kernel Parallelization Analysis
  16. Testing & Validation

§1 What Is a "Pool"?

A pool is a set of expensive resources you create once and reuse over and over, instead of creating and destroying them every time you need one.

Analogy: Imagine you run a restaurant kitchen where every order takes 4 cooks: you plus 3 chefs.

No pool: Every time an order comes in, go out on the street, hire 3 chefs, cook the meal, then fire them all. Next order — hire 3 more chefs. (This is what OpenMP does.)

Pool: Hire 3 chefs on day one. They stand ready at their stations. Order comes in — you shout "Go!" and all 4 of you (including yourself) cook simultaneously. Between orders, chefs wait at their stations. (This is a thread pool.)

The word "pool" shows up in many places in systems programming:

Thread Pool

Pre-created threads waiting for work. Avoids spawn/join overhead per task.

Memory Pool

Pre-allocated memory blocks. Avoids malloc/free overhead per allocation.

Connection Pool

Pre-opened database/network connections. Avoids TCP handshake per query.

Object Pool

Pre-constructed objects (e.g. game entities). Avoids constructor/destructor overhead.

The common pattern: creating the resource is expensive; using it is cheap. A pool pays the creation cost once, then amortizes it over thousands of uses.

§2 The Problem: Why Not Just Use OpenMP?

OpenMP looks appealing — just add #pragma omp parallel for and you get parallelism. But there's a hidden cost.

During inference, CK-Engine runs the transformer decode loop. For each token generated, it calls kernel functions (matrix-vector multiplies, RMSNorm, SwiGLU, etc.) roughly 500+ times. Each call is a short burst of work (microseconds to low milliseconds).

OpenMP's Fork/Join Model

// Every single #pragma omp parallel does this:

Main thread
 |
 |—— fork ——> [spawn/wake N threads]    ← 50–200 µs overhead
 |    |      |      |      |
 |  work   work   work   work           ← actual compute (maybe 20 µs)
 |    |      |      |      |
 |<— join ——— [wait + tear down]        ← 50–200 µs overhead
 |
 |—— fork ——> [spawn/wake again]        ← overhead AGAIN
 |   ...
 v
500+ fork/joins per token!

With each fork/join costing 50–200µs and 500+ kernel calls per token:

The math: 500 calls × 100µs overhead = 50ms of pure overhead per token — just from spawning and joining threads, before any useful work is done. The actual compute per kernel might only be 20µs!

Real Measurement from CK-Engine (gemv_omp.c)

/*
 * Measured on i7-3630QM (4C/8T), Qwen 0.5B:
 *   Serial kernels:   170 ms/tok  (5.9 tok/s)
 *   OMP parallel:     327 ms/tok  (3.1 tok/s)  ← 1.9x SLOWER
 *
 * The math is correct (10/10 parity tests pass) but the
 * threading model is wrong for this workload.
 */

OpenMP parallelism made it almost 2x slower. The fork/join overhead completely ate the parallelism gains and then some.

§3 Anatomy of CK-Engine's Thread Pool

Let's look at the actual data structure. Every field exists for a specific reason.

struct ck_threadpool {
    /* --- Dispatch state (each on its own cache line) --- */
    _Alignas(64) atomic_int   n_dispatch;   // bumped to wake workers
    _Alignas(64) atomic_int   n_complete;   // workers signal "I'm done"
    _Alignas(64) ck_work_fn_t work_fn;      // pointer to current kernel
                 void        *work_args;    // arguments for the kernel

    /* --- Barrier for mid-kernel sync --- */
    ck_barrier_t barrier;

    /* --- Worker management --- */
    int          n_threads;    // total (including main)
    ck_worker_t  workers[64];  // thread handles

    /* --- Shutdown / pause --- */
    _Alignas(64) atomic_int stop;    // set to 1 to kill workers
    _Alignas(64) atomic_int paused;  // set to 1 for power saving

    /* --- Hybrid sleep mechanism --- */
    pthread_mutex_t mutex;
    pthread_cond_t  cond_dispatch; // workers sleep here
    pthread_cond_t  cond_done;     // main sleeps here
};

Why _Alignas(64) Everywhere?

Each atomic counter is placed on its own 64-byte cache line. This prevents "false sharing" — where two unrelated variables on the same cache line cause all CPU cores to fight over that cache line. More on this in §8.

The Worker Struct

typedef struct {
    pthread_t thread;            // OS thread handle
    int       id;               // 0 = main, 1..N-1 = workers
    struct ck_threadpool *pool; // back-pointer to the pool
} ck_worker_t;

Thread 0 is always the main thread — the one that calls ck_threadpool_dispatch(). It doesn't get a new pthread. It does its share of the work alongside the workers.

§4 Lifecycle: Create → Dispatch → Destroy

1. Create — ck_threadpool_create(n_threads)

Called once at engine startup. Allocates the pool struct (cache-line aligned), then spawns N−1 worker pthreads. Workers immediately enter the spin-wait loop. The main thread is thread 0 — no extra pthread is created for it.

2. Dispatch (called ~500x per token) — ck_threadpool_dispatch(pool, fn, args)

Sets the work function + args, bumps n_dispatch (an atomic counter), and signals sleeping workers via condvar. Main thread also runs its share. All threads do their chunk, then main waits for n_complete to hit N−1.

3. Pause/Resume (between batches)

Between user interactions (waiting for input), workers can be paused so they sleep on a condvar and use 0% CPU. Resume wakes them instantly.

4. Destroy — ck_threadpool_destroy(pool)

Sets stop = 1, broadcasts the condvar to wake sleeping workers, then pthread_joins every worker thread. Frees the pool memory.

Global Pool Convenience

For CK-Engine, there's a global singleton pool accessed via ck_threadpool_global(). It uses pthread_once to ensure thread-safe lazy initialization — no matter how many threads call it simultaneously, the pool is created exactly once.

static ck_threadpool_t *g_threadpool = NULL;
static pthread_once_t  g_threadpool_once = PTHREAD_ONCE_INIT;

ck_threadpool_t *ck_threadpool_global(void) {
    pthread_once(&g_threadpool_once, global_pool_init);
    return g_threadpool;
}

§5 Dispatch Deep Dive: What Actually Happens

Let's trace exactly what happens when CK-Engine needs to run one kernel across 4 threads.

ck_threadpool_dispatch(pool, gemv_q8_0, &args)

MAIN THREAD (ith=0)              WORKER 1        WORKER 2        WORKER 3
 |                               [spinning]      [spinning]      [spinning]
 |                               checking        checking        checking
 | pool->work_fn   = gemv_q8_0   n_dispatch      n_dispatch      n_dispatch
 | pool->work_args = &args          ...             ...             ...
 | n_complete = 0                   ...             ...             ...
 |                                  ...             ...             ...
 | atomic n_dispatch++ ——————>   WAKE!           WAKE!           WAKE!
 | condvar broadcast   ——————>   (if sleeping)   (if sleeping)   (if sleeping)
 |                                |               |               |
 | gemv(0, 4, args)              gemv(1,4,args)  gemv(2,4,args)  gemv(3,4,args)
 | rows 0..249                   rows 250..499   rows 500..749   rows 750..999
 | ...done                       ...done         ...done         ...done
 |                                |               |               |
 |                               n_complete++    n_complete++    n_complete++
 |                               (back to spin)  (back to spin)  last one signals
 | spin-wait: n_complete == 3?                                   cond_signal!
 | YES → return
 v
next kernel call...

Key Details

Inside the Work Function

Each kernel receives (ith, nth, args) and partitions work by thread index:

void gemv_parallel(int ith, int nth, void *args) {
    gemv_args_t *a = args;
    // Each thread computes its slice of output rows
    int rows_per_thread = a->M / nth;
    int start = ith * rows_per_thread;
    int end   = (ith == nth - 1) ? a->M : start + rows_per_thread;

    for (int row = start; row < end; row++) {
        vec_dot(a->K, &a->y[row], a->W + row * a->K, a->x);
    }
}

§6 Spin-Wait vs. Sleep: The Hybrid Polling Design

How do worker threads wait for work? There are two extremes, and CK-Engine uses both.

Pure Spin-Wait

Thread sits in a tight loop checking an atomic variable: while (no_work) { _mm_pause(); }

PRO: Instant wake-up (nanoseconds)
CON: Burns 100% CPU while waiting

Pure Condvar Sleep

Thread calls pthread_cond_wait() and the OS puts it to sleep.

PRO: Zero CPU while waiting
CON: Wake-up takes 5–50µs (kernel context switch)

CK-Engine's Hybrid Approach

Workers do both:

  1. First, spin for CK_THREADPOOL_SPIN_COUNT (1024) iterations, checking the atomic dispatch counter.
  2. If no work arrives during spin, fall back to pthread_cond_wait() (sleep).
  3. When new work arrives, spinning workers wake instantly; sleeping workers wake via condvar broadcast.
// worker_main() — the hybrid wait loop
for (;;) {
    int spins = 0;
    for (;;) {
        if (atomic_load(&pool->stop))  return NULL;  // shutdown?

        int current = atomic_load(&pool->n_dispatch);
        if (current != last_dispatch) {              // new work!
            last_dispatch = current;
            break;                                   // go execute
        }

        _mm_pause();    // x86 hint: "I'm spinning, save power"
        spins++;

        if (spins >= 1024) {                        // tired of spinning?
            pthread_mutex_lock(&pool->mutex);
            // Re-check under lock (avoid missed wakeup!)
            if (atomic_load(&pool->n_dispatch) == last_dispatch &&
                !atomic_load(&pool->stop))
                pthread_cond_wait(&pool->cond_dispatch, &pool->mutex);
            pthread_mutex_unlock(&pool->mutex);
            spins = 0;
        }
    }
    // execute work...
}
Why re-check under the lock? This avoids the "missed wakeup" race condition. Without it: main could bump n_dispatch and broadcast between the worker's last spin check and its cond_wait call. The worker would then sleep forever, never seeing the work. The lock + re-check makes this impossible.

What Is _mm_pause()?

It's an x86 intrinsic that emits the PAUSE instruction, telling the CPU: "I'm in a spin-wait loop." The CPU can then:

  1. Reduce power draw while the loop spins.
  2. Hand execution resources to the sibling hyperthread on the same core.
  3. Avoid the memory-order mis-speculation penalty when the loop finally exits, making the wake-up faster.

On non-x86 (ARM, etc.), CK-Engine defines CK_SPIN_PAUSE() as a no-op.

§7 Atomics, Mutexes, and Memory Ordering

These are the fundamental building blocks of any concurrent system. Let's understand each one from scratch.

What Is an Atomic Operation?

An atomic operation is one that completes in a single, indivisible step from the perspective of all other threads. No other thread can see a "half-done" state.

WITHOUT atomics (broken):              WITH atomics (correct):

Thread A: read counter  (= 5)          Thread A: atomic_fetch_add(&counter, 1)
Thread B: read counter  (= 5)            → hardware guarantees only one
Thread A: write counter = 6              thread touches it at a time
Thread B: write counter = 6            Thread B: atomic_fetch_add(&counter, 1)

Result: counter = 6 (should be 7!)     Result: counter = 7 (correct)

CK-Engine uses atomics for everything on the hot path:

The dispatch counter n_dispatch (main bumps it, workers poll it).
The completion counter n_complete (workers bump it, main polls it).
The stop and paused flags (checked on every spin iteration).
The barrier's n_arrived and n_phase counters.

Memory Ordering

CPUs and compilers can reorder memory operations for performance. Memory ordering tells them: "don't reorder this particular operation past certain points."

Ordering             | What It Means                                                 | Used In CK-Engine For
memory_order_relaxed | No ordering guarantee. Just make the read/write atomic.       | Resetting n_arrived (only writer, no dependency)
memory_order_acquire | All reads/writes after this load cannot be moved before it.   | Workers reading n_dispatch — must see work_fn/work_args written before
memory_order_release | All reads/writes before this store cannot be moved after it.  | Main bumping n_dispatch — ensures work_fn/work_args are visible
memory_order_acq_rel | Both acquire and release. Full fence in both directions.      | fetch_add on n_complete and n_arrived
Why does this matter? Without proper ordering, the main thread could bump n_dispatch but the worker might read the old work_fn pointer (from a previous dispatch) because the CPU reordered the stores. The release on the store and acquire on the load form a happens-before relationship that prevents this.

What Is a Mutex?

A mutex (mutual exclusion) is a lock. Only one thread can hold it at a time. Everyone else waits. CK-Engine uses mutexes only for the condvar sleep mechanism — the hot path (spin-wait) doesn't touch the mutex at all.

// Mutex usage — only in the cold sleep path
pthread_mutex_lock(&pool->mutex);
  // Re-check condition (avoid lost wakeup)
  if (still_no_work)
      pthread_cond_wait(&pool->cond_dispatch, &pool->mutex);
      //     ^^ releases mutex, sleeps, re-acquires on wake
pthread_mutex_unlock(&pool->mutex);

What Is a Condition Variable (Condvar)?

A condvar lets a thread say: "Wake me up when something changes." It works together with a mutex:

  1. Wait side: Lock mutex → check condition → if false, cond_wait (atomically releases mutex + sleeps) → on wake, re-acquires mutex
  2. Signal side: Lock mutex → change state → cond_broadcast (wakes all waiters) → unlock

CK-Engine has two condvars:

cond_dispatch: workers sleep here when no work arrives during their spin window; dispatch broadcasts it to wake them.
cond_done: the main thread can sleep here while waiting for completion; the last finishing worker signals it.

§8 Cache Lines & False Sharing

This is one of the most important performance concepts in multi-threaded code.

How CPU Caches Work (Simplified)

CPUs don't read individual bytes from RAM. They read in cache lines — 64-byte chunks. When one core writes to a cache line, all other cores that have a copy of that line must invalidate it (MESI protocol). This is called a cache coherency miss.

FALSE SHARING: Two unrelated variables on the SAME cache line

                     64-byte cache line
+------------------------------------------------------------+
| n_dispatch (4 bytes) | n_complete (4 bytes) |   padding     |
+------------------------------------------------------------+
          ^                      ^
     Core 0 writes          Core 1 writes
     (bumps dispatch)       (bumps complete)

PROBLEM: Every write by Core 0 invalidates Core 1's cache,
and vice versa. Both cores stall on every access!

FIXED: Each variable on its OWN cache line

  Cache line 1 (64 bytes)       Cache line 2 (64 bytes)
+--------------------------+  +--------------------------+
| n_dispatch |   padding   |  | n_complete |   padding   |
+--------------------------+  +--------------------------+
        ^                             ^
  Core 0: no conflict           Core 1: no conflict

This is why every atomic_int in the pool struct has _Alignas(64) — it forces each one onto its own cache line, eliminating false sharing entirely.

How bad is false sharing? On modern x86, a cache coherency miss costs ~40–100 nanoseconds. In a tight spin-wait loop executing billions of iterations, this can reduce throughput by 10–50x. It's one of the most common multi-threaded performance bugs.

§9 The Barrier: How Threads Synchronize

A barrier is a meeting point. All threads must arrive before any can proceed.

CK-Engine implements a custom spin-wait barrier using two atomic counters: n_arrived (how many threads reached the barrier) and n_phase (which "round" of the barrier we're on).

// barrier_wait() — ck_threadpool.c:89–111
static void barrier_wait(ck_barrier_t *b) {
    const int phase = atomic_load(&b->n_phase);

    if (atomic_fetch_add(&b->n_arrived, 1) == b->n_threads - 1) {
        // I'm the LAST thread to arrive
        atomic_store(&b->n_arrived, 0);       // reset for next use
        atomic_store(&b->n_phase, phase + 1); // advance phase
    } else {
        // NOT the last thread — spin until phase advances
        while (atomic_load(&b->n_phase) == phase) {
            _mm_pause();
        }
    }
}
4-thread barrier example:

Thread 0     Thread 1     Thread 2     Thread 3
   |            |            |            |
   | arrive     |            |            |         n_arrived = 1, phase = 0
   |            | arrive     |            |         n_arrived = 2, phase = 0
   | spin       | spin       |            |
   | spin       | spin       | arrive     |         n_arrived = 3, phase = 0
   | spin       | spin       | spin       |
   | spin       | spin       | spin       | arrive  n_arrived = 4 == N → LAST!
   |            |            |            | reset n_arrived = 0
   |            |            |            | phase = 1
   | GO!        | GO!        | GO!        | GO!     all see phase changed
   v            v            v            v

The phase counter trick allows the barrier to be reused without an explicit reset — each barrier use just increments the phase. After many spins without the last thread arriving, threads call sched_yield() to be nice to the OS on oversubscribed systems.

§10 Thread Pool vs. OpenMP — Side-by-Side

Aspect                    | OpenMP                                                            | CK Thread Pool
Thread lifetime           | Created per #pragma region (or pooled by runtime — inconsistent)  | Created once at startup, live until shutdown
Dispatch cost             | 50–200 µs (fork/join overhead)                                    | < 1 µs (atomic increment + optional condvar signal)
Control over scheduling   | Limited (schedule(static), dynamic)                               | Full — you write the partition logic
Idle CPU usage            | Implementation-dependent (some spin, some sleep)                  | Configurable: spin → condvar sleep → pause
Barrier                   | Implicit at end of parallel region                                | Explicit ck_threadpool_barrier()
Works with short kernels? | No — overhead dominates                                           | Yes — designed for thousands of tiny dispatches
Code complexity           | One line: #pragma omp parallel for                                | Must write work function, manage pool lifecycle
Portability               | Standard, compiler-supported everywhere                           | POSIX pthreads (Linux/macOS)
When to use what:
OpenMP is great for long-running parallel regions (scientific computing, image processing, big matrix operations). One fork/join overhead amortized over milliseconds of work is fine.

Thread pools are essential when you dispatch thousands of short tasks per second (inference, game engines, audio processing, network servers).

§11 Thread Pool vs. Memory Pool — Same Idea, Different Resource

Both are "pools" — they pre-create expensive resources. The key differences are about what they pool and what synchronization they need.

Aspect                   | Thread Pool                                  | Memory Pool
What is pooled?          | OS threads (pthreads)                        | Memory blocks (heap allocations)
What is expensive?       | pthread_create / pthread_join (~50–200 µs)   | malloc / free (~0.1–5 µs)
Pool initialized with... | N live threads, waiting for work             | One big block of memory, subdivided into slots
"Acquire" from pool      | dispatch(fn, args) — give threads work       | pool_alloc(size) — get a memory block
"Release" back to pool   | Implicit — threads go back to waiting        | pool_free(ptr) — return block to free list
Synchronization needed   | Atomics + condvar (dispatch/wait signaling)  | Atomics or mutex (free-list management)
Active resources...      | ...execute code concurrently                 | ...hold data passively

The Deeper Conceptual Link

The pool pattern in both cases:

     Thread Pool                     Memory Pool
+---------------------+        +---------------------+
| STARTUP (once)      |        | STARTUP (once)      |
| Create N threads    |        | malloc(big_block)   |
| All threads idle    |        | Carve into slots    |
+---------+-----------+        +---------+-----------+
          |                              |
          v                              v
+---------------------+        +---------------------+
| HOT PATH (fast)     |        | HOT PATH (fast)     |
| dispatch(fn, args)  |        | slot = pool_alloc() |
| < 1 us              |        | < 0.05 us           |
+---------+-----------+        +---------+-----------+
          |                              |
          v                              v
+---------------------+        +---------------------+
| RETURN TO POOL      |        | RETURN TO POOL      |
| Threads go idle     |        | pool_free(slot)     |
| (spin or sleep)     |        | (back to free list) |
+---------+-----------+        +---------+-----------+
          |                              |
          v                              v
+---------------------+        +---------------------+
| SHUTDOWN (once)     |        | SHUTDOWN (once)     |
| join all threads    |        | free(big_block)     |
+---------------------+        +---------------------+

Synchronization Comparison

Thread Pool Sync

atomic_int for dispatch/complete counters (hot path).
pthread_mutex + cond for sleep fallback (cold path).
_mm_pause() spin hint for low-latency wake.
memory_order_acquire/release to publish work descriptor.

Memory Pool Sync

Atomic CAS for lock-free free-list (typical).
Or a simple mutex around the free-list pointer.
Bump allocator variant needs zero sync (per-thread pools).
No condvars needed — memory doesn't need to "wake up."

§12 Why Is It Faster?

Cost Breakdown: OpenMP vs. Thread Pool

Per-kernel dispatch cost comparison (4 threads, x86-64):

OpenMP fork/join:
+------------------------------------------------------------------+
| pthread_create x3 | work  | pthread_join x3 | scheduler |
|   ~~~100 us~~~    | 20us  |   ~~~100 us~~~  |  ~20 us~  |
+------------------------------------------------------------------+
Total: ~240 us  (only 20 us is useful work!)

Thread pool dispatch:
+---------------------------------------+
| atomic++ | work  | spin check |
|  ~0.1us  | 20us  |   ~0.1us   |
+---------------------------------------+
Total: ~20.2 us  (99% is useful work!)

The Five Reasons It's Faster

1. No Thread Creation

pthread_create involves a system call, stack allocation (~8MB virtual), kernel scheduling setup, and TLS initialization. The thread pool does this once. Subsequent dispatches just flip an atomic counter — a single CPU instruction.

2. No Join Overhead

pthread_join involves a system call + scheduler interaction to reap the thread. The pool never joins during work — completion is tracked with a simple atomic counter.

3. Warm Caches

Persistent threads keep their L1/L2 caches warm across dispatches. A fresh thread starts with cold caches and must re-fetch everything from L3/RAM.

4. No Kernel Involvement

The spin-wait hot path is entirely in userspace. No system calls, no context switches, no kernel scheduler decisions. Just atomic loads in a tight loop.

5. Main Thread Participates

The main thread doesn't just wait — it does 1/N of the work. With OpenMP, the orchestration thread often sits idle during the parallel region. CK-Engine wastes zero cores.

Bonus: Power Management

Between inference batches (e.g., waiting for user input), workers sleep on a condvar using 0% CPU. No busy-waiting during idle periods. OpenMP's behavior here varies by implementation and is often wasteful.

§13 Real Numbers: Before & After

Measured on i7-3630QM (4 cores / 8 threads), Qwen2 0.5B model, Q4_K_M quantization.

Configuration                   | ms/token | Tokens/sec | Relative
Serial (single-threaded)        | 170      | 5.9        | baseline
OpenMP #pragma omp parallel for | 327      | 3.1        | 1.9x slower
Persistent pthread thread pool  | —        | —          | Expected: 2–4x faster than serial
The key insight: OpenMP didn't fail because parallelism is bad. It failed because the overhead of setting up parallelism exceeded the actual work being parallelized. The thread pool removes that overhead, letting the parallelism actually help.

Where Does Overhead Go?

500 kernel calls per token, 4 threads:

OpenMP overhead per token:
  500 calls x ~200us fork/join = 100,000 us = 100 ms   (wasted)
  + actual compute:                            ~70 ms  (useful)
  = total: 170+ ms  (but measured 327 ms due to cache
                     thrashing, scheduler overhead)

Thread pool overhead per token:
  500 calls x ~0.2us dispatch = 100 us = 0.1 ms        (wasted)
  + actual compute:                       ~70 ms       (useful)
  = total: ~70 ms

Dispatch overhead: 100 ms (OpenMP) vs 0.1 ms (pool) = 1000x less overhead

§14 How Parallelization Actually Works

Forget the abstractions for a moment. Here is exactly what happens, from physical hardware up.

Step 1: Pin Threads to Physical Cores

At startup, the thread pool creates N−1 persistent pthreads. Ideally, each thread is pinned (affined) to a distinct physical CPU core using sched_setaffinity() or pthread_setaffinity_np(). This means the OS scheduler will not migrate a thread to a different core — it stays put.

4-core CPU with hyperthreading (8 logical CPUs):

Physical Core 0   Physical Core 1   Physical Core 2   Physical Core 3
+-------+-------+ +-------+-------+ +-------+-------+ +-------+-------+
| CPU 0 | CPU 4 | | CPU 1 | CPU 5 | | CPU 2 | CPU 6 | | CPU 3 | CPU 7 |
+-------+-------+ +-------+-------+ +-------+-------+ +-------+-------+
    ^                 ^                 ^                 ^
 Thread 0          Thread 1          Thread 2          Thread 3
  (main)           (worker)          (worker)          (worker)

Each thread pinned to one physical core. We avoid hyperthreads
(CPU 4–7) because they share execution units with the primary
logical CPU on the same core — using both gives diminishing
returns for compute-bound work.

Why pin? Three reasons:

  1. Warm caches: a thread's L1/L2 working set stays valid because the thread never moves to another core.
  2. No migration stalls: the scheduler cannot bounce a thread between cores in the middle of a kernel.
  3. Predictable memory locality: on NUMA systems, a pinned thread keeps its memory accesses close to its own node.

CK-Engine's Current State

The system topology module (src/system_topology.c) can discover the CPU layout:

// topology_discover_affinity() reads the process CPU mask
cpu_set_t mask;
if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
    for (int i = 0; i < MAX_CPUS; i++) {
        if (CPU_ISSET(i, &mask)) {
            aff->affinity_cpus[aff->num_affinity_cpus++] = i;
        }
    }
}

The GGML layer has full NUMA-aware pinning with three strategies: DISTRIBUTE (round-robin across NUMA nodes), ISOLATE (all threads on same node), and NUMACTL (respect external numactl settings). Intel hybrid CPU detection (P-core vs E-core via CPUID) is also available to skip efficiency cores for compute-heavy work.

Step 2: Identify Embarrassingly Parallel Loops

An embarrassingly parallel loop is one where every iteration is completely independent — no iteration reads what another writes. These loops can be split across cores with zero synchronization.

The key question for every loop: Can two iterations step on each other?

If iteration i only writes to output[i] and only reads from shared inputs, then yes, it's embarrassingly parallel. Every neural network kernel in CK-Engine has this property along at least one dimension.

Here's what the thread pool does to every such loop:

// BEFORE: Serial loop over M rows
for (int row = 0; row < M; row++) {
    compute_row(row, ...);
}

// AFTER: Thread pool splits the loop across N threads
void work_fn(int ith, int nth, void *args) {
    int rows_per_thread = M / nth;
    int start = ith * rows_per_thread;
    int end   = (ith == nth - 1) ? M : start + rows_per_thread;

    for (int row = start; row < end; row++) {
        compute_row(row, ...);   // exact same computation, just a slice
    }
}
ck_threadpool_dispatch(pool, work_fn, &args);

Each thread computes a contiguous slice of the output. Thread 0 does rows 0..249, thread 1 does rows 250..499, etc. The total work is the same; it's just divided.

Step 3: Ensure No Two Threads Write to the Same Cache Line

This is the critical constraint. The rule is:

The Write-Exclusivity Rule:
Reading from the same cache line across threads is FREE — all cores get their own read-only copy (MESI "Shared" state).

Writing to the same cache line from two threads is EXPENSIVE — every write invalidates all other cores' copies (MESI "Invalid" state), forcing a 40–100ns stall on every access.

In practice, this means the output array partition must be cache-line aligned at the boundaries between threads. Since a cache line is 64 bytes = 16 floats:

Output array: 1000 floats, 4 threads

Thread 0 writes    Thread 1 writes    Thread 2 writes    Thread 3 writes
output[0..249]     output[250..499]   output[500..749]   output[750..999]

Cache lines (16 floats each):
|0..15|16..31|...|240..255|256..271|...|496..511|512..527|...|752..767|768..783|...|992..1007|
          ^                                 ^
    Thread boundaries fall on cache line boundaries
    250 floats = 15.625 cache lines → round up to 256 (16 lines)

Shared INPUT (all threads read, no one writes):
  Weight matrix W[M×K]: all threads read overlapping rows → FREE (Shared state)
  Input vector  x[K]:   all threads read the same vector  → FREE (Shared state)

For transformer kernels, this alignment happens naturally because:

Model dimensions (hidden size, FFN width, vocabulary rows) are typically multiples of 64, so equal per-thread slices begin on cache-line boundaries.
Each thread writes one contiguous run of rows, so at worst only the cache lines at the slice boundaries could ever be touched by two threads.

Step 4: What About Reads?

Reads are shared freely. This is what makes matrix operations so parallelizable:

GEMV: y = W · x

Shared read: Input vector x[K] — all threads read the same vector.
Shared read: Weight matrix W[M×K] — each thread reads its own rows, but the data is read-only.
Exclusive write: Output y[M] — each thread writes its own slice of y.

RMSNorm: y = x · γ / RMS(x)

Shared read: Gamma weights γ[D] — all threads read the same normalization weights.
Per-token read: Each thread reads its own token's input x[t].
Exclusive write: Output y[t] — each thread writes its own token.

The MESI cache coherency protocol makes shared reads essentially free. When multiple cores read the same cache line, each core gets its own copy in "Shared" (S) state. No invalidation occurs. The cache line stays resident in every core's L1/L2 simultaneously.

Step 5: Handle Reductions Carefully

Some kernels have reduction operations — where all threads must contribute to a single result (e.g., sum, max). These break the "no shared writes" rule and need special handling.

Reduction example: RMSNorm needs sum of squares across D elements

BAD: Multiple threads write to the same accumulator
  Thread 0: sum += x[0]²   + x[1]²   + ...   ↓
  Thread 1: sum += x[256]² + x[257]² + ...   ↓  RACE CONDITION on sum!
  Thread 2: sum += x[512]² + x[513]² + ...   ↓

GOOD: Each thread has a private partial sum, then combine
  Thread 0: partial[0] = x[0]²   + x[1]²   + ...   (private write)
  Thread 1: partial[1] = x[256]² + x[257]² + ...   (private write)
  Thread 2: partial[2] = x[512]² + x[513]² + ...   (private write)
            ———— barrier ————
  Thread 0: total = partial[0] + partial[1] + partial[2]  (single-thread reduce)

In CK-Engine's kernels, reductions are intra-token (within a single token's D-dimensional vector), not inter-token. Since we parallelize across tokens (not within a single token's dimension), each thread's reduction is completely private — no cross-thread communication needed.

The Complete Picture

One kernel dispatch, from hardware up:

1. POOL CREATED (once at startup)
   • N-1 pthreads created, each pinned to a physical core
   • Workers enter spin-wait loop

2. DISPATCH (per kernel call, ~500x per token)
   • Main sets work_fn = gemv_parallel, work_args = {W, x, y, M, K}
   • Main bumps n_dispatch (atomic, ~0.1µs)
   • Workers see counter change, read work_fn + args

3. PARTITION (each thread independently)
   • Thread i computes: start_row = i * (M/N), end_row = (i+1) * (M/N)
   • Write region: y[start_row .. end_row-1]   ← no overlap, cache-line aligned
   • Read region:  W[start_row*K .. end_row*K-1] + x[0..K-1]   ← shared, read-only

4. COMPUTE (all threads in parallel)
   • Each thread runs SIMD-vectorized inner loop over its rows
   • L1/L2 caches stay warm (thread pinned to same core)
   • No locks, no atomics, no barriers needed — pure data parallelism

5. COMPLETE
   • Each worker atomically increments n_complete
   • Main spin-waits until n_complete == N-1
   • Cost: ~0.1µs to synchronize
   • Workers go back to spin-wait for next dispatch

§15 Per-Kernel Parallelization Analysis

Every kernel in the transformer decode loop has a natural parallelization dimension. Here's exactly how each one splits across threads, what data is shared, what is exclusive, and where reductions happen.

GEMV — Matrix-Vector Multiply

File: src/kernels/gemv_omp.c
Operation: y[M] = W[M×K] · x[K]
Call frequency: ~37% of decode time (logits projection is the largest single GEMV)

GEMV parallelization: split rows of W across threads

  Weight matrix W [M rows × K cols]     Input x [K]        Output y [M]
+---------------------------+
| Thread 0: rows 0..249     |  ·  [shared by all]  →  y[0..249]
+---------------------------+
| Thread 1: rows 250..499   |  ·  [shared by all]  →  y[250..499]
+---------------------------+
| Thread 2: rows 500..749   |  ·  [shared by all]  →  y[500..749]
+---------------------------+
| Thread 3: rows 750..999   |  ·  [shared by all]  →  y[750..999]
+---------------------------+

READ (shared):  x[K] — all threads read the entire input vector
READ (private): W rows — each thread reads only its rows (no overlap)
WRITE:          y elements — each thread writes its own slice (no overlap)
REDUCTION:      NONE — each y[i] = dot(W[i], x) is independent
void gemv_q8_parallel(int ith, int nth, void *args) {
    gemv_args_t *a = args;
    int rows_per_thread = a->M / nth;
    int start = ith * rows_per_thread;
    int end   = (ith == nth - 1) ? a->M : start + rows_per_thread;

    for (int row = start; row < end; row++) {
        // Each thread: dot product of one W row with shared x
        vec_dot_q8_0_q8_0(a->K, &a->y[row],
                          &a->W_blocks[row * blocks_per_row],
                          a->x_blocks);
    }
}
Property           | Value
Parallel dimension | Rows (M) — embarrassingly parallel
Shared reads       | Input vector x[K], quantization lookup tables
Exclusive writes   | Output y[start..end] — contiguous, cache-aligned
Reduction          | None
Cache behavior     | x stays in L1 across all rows (K ≈ 896 ≈ 3.5 KB). W streams through L2.
Speedup potential  | Near-linear: 4 threads ≈ 3.5–3.8x (limited by memory bandwidth)

GEMM — Matrix-Matrix Multiply

File: src/kernels/gemm_kernels.c
Operation: C[M×N] = A[M×K] · B[K×N] + bias[N]
Used for: Prefill (processing entire prompt at once — M > 1)

GEMM parallelization: split output rows across threads

     A [M×K]           B [K×N]           C [M×N]
  +----------+      +----------+      +----------+
  | T0 rows  |      |          |      | T0 rows  |
  +----------+      |  shared  |      +----------+
  | T1 rows  |  ×   |    by    |  =   | T1 rows  |
  +----------+      |   all    |      +----------+
  | T2 rows  |      | threads  |      | T2 rows  |
  +----------+      |          |      +----------+
  | T3 rows  |      +----------+      | T3 rows  |
  +----------+                        +----------+

  Each thread computes complete rows of C independently.
  B matrix is shared read-only across all threads.
Property             Value
Parallel dimension   Output rows (M) — embarrassingly parallel
Shared reads         B[K×N] (entire matrix), bias[N]
Exclusive writes     C[start_row..end_row, 0..N-1] — complete rows per thread
Reduction            None — each C[i,j] = dot(A[i], B[:,j]) is independent
SIMD                 AVX-512 processes 16 floats per instruction; AVX2 processes 8
Special case         When M=1 (decode mode), falls back to GEMV — parallelizes over N instead
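To make the partitioning concrete, here is a minimal scalar sketch of one thread's share of the GEMM (illustrative only, not the production SIMD kernel): the thread computes its complete rows of C, reading all of B and the optional bias.

```c
#include <stddef.h>

/* Scalar float GEMM over one thread's row slice: C[i,j] = dot(A[i], B[:,j]) + bias[j].
 * B and bias are shared read-only; each thread writes only rows [row_start, row_end). */
static void gemm_rows_f32(const float *A, const float *B, const float *bias,
                          float *C, int row_start, int row_end, int K, int N) {
    for (int i = row_start; i < row_end; i++) {
        for (int j = 0; j < N; j++) {
            float acc = bias ? bias[j] : 0.0f;   /* NULL bias = no bias add */
            for (int k = 0; k < K; k++)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
    }
}
```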

RMSNorm — Root Mean Square Normalization

File: src/kernels/rmsnorm_kernels.c
Operation: y[t,d] = x[t,d] · γ[d] / sqrt(mean(x[t]²) + ε)
Called: Before every attention and FFN block (2x per layer ≈ 48 calls for 24 layers)

RMSNorm parallelization: split tokens across threads

  Input [T tokens × D dims]             Output [T tokens × D dims]
  +---------------------------+         +---------------------------+
  | Token 0: [d0, d1, .., dD] |    →    | Token 0: normalized       |  Thread 0
  | Token 1: [d0, d1, .., dD] |    →    | Token 1: normalized       |
  +---------------------------+         +---------------------------+
  | Token 2: [d0, d1, .., dD] |    →    | Token 2: normalized       |  Thread 1
  | Token 3: [d0, d1, .., dD] |    →    | Token 3: normalized       |
  +---------------------------+         +---------------------------+
  ...

  Shared read: γ[D] (normalization weights, same for all tokens)
  Each token has an INTERNAL reduction (sum of squares across D), but since
  we parallelize across TOKENS, each thread's reduction is private.

The reduction (sum of x² across the D dimension) happens within a single token. Since we split across tokens, not across dimensions within a token, each thread does its own reduction with zero cross-thread communication.

void rmsnorm_parallel(int ith, int nth, void *args) {
    rmsnorm_args_t *a = args;
    int tokens_per_thread = a->T / nth;
    int t_start = ith * tokens_per_thread;
    int t_end   = (ith == nth - 1) ? a->T : t_start + tokens_per_thread;

    for (int t = t_start; t < t_end; t++) {
        float *x = a->input + t * a->aligned_D;
        float *y = a->output + t * a->aligned_D;

        // Phase 1: sum of squares (PRIVATE reduction — within this token only)
        __m512 sum_sq = _mm512_setzero_ps();
        for (int d = 0; d < a->D; d += 16)
            sum_sq = _mm512_fmadd_ps(_mm512_loadu_ps(&x[d]),
                                      _mm512_loadu_ps(&x[d]), sum_sq);
        float rms = sqrtf(_mm512_reduce_add_ps(sum_sq) / a->D + 1e-6f);

        // Phase 2: normalize (element-wise, reads shared gamma)
        __m512 rstd = _mm512_set1_ps(1.0f / rms);
        for (int d = 0; d < a->D; d += 16)
            _mm512_storeu_ps(&y[d],
                _mm512_mul_ps(_mm512_mul_ps(_mm512_loadu_ps(&x[d]), rstd),
                              _mm512_loadu_ps(&a->gamma[d])));
    }
}
Property             Value
Parallel dimension   Tokens (T) — embarrassingly parallel
Shared reads         γ[D] normalization weights
Exclusive writes     output[t_start..t_end, 0..D-1] — complete token vectors
Reduction            Intra-token only (private to each thread, no cross-thread sync)
Decode mode          T=1 in decode → no token-level parallelism. Could split D dimension with barrier.

Attention — Q·Kᵀ/√d → Softmax → Attn·V

File: src/kernels/attention_kernels.c
Operation: Multi-head scaled dot-product attention with causal mask
Three phases: Score computation, softmax, weighted value summation

Attention parallelization: split heads across threads

  Q [H heads × T × d]       K [H × T × d]         V [H × T × d]
  +-------------------+     +-------------+       +-------------+
  | Head 0 (Thread 0) |     | Head 0      |       | Head 0      |
  | Head 1 (Thread 0) |     | Head 1      |       | Head 1      |
  +-------------------+     +-------------+       +-------------+
  | Head 2 (Thread 1) |     | Head 2      |       | Head 2      |
  | Head 3 (Thread 1) |     | Head 3      |       | Head 3      |
  +-------------------+     +-------------+       +-------------+
  | Head 4 (Thread 2) |     | Head 4      |       | Head 4      |
  | Head 5 (Thread 2) |     | Head 5      |       | Head 5      |
  +-------------------+     +-------------+       +-------------+
  | Head 6 (Thread 3) |     | Head 6      |       | Head 6      |
  | Head 7 (Thread 3) |     | Head 7      |       | Head 7      |
  +-------------------+     +-------------+       +-------------+

  Each head is completely independent. All three phases (score, softmax,
  value) for a given head can be computed by one thread with zero
  cross-thread communication.

  Scores [H × T × T] — intermediate, private per head
  Output [H × T × d] — exclusive write per thread

Attention is the most complex kernel but also the most naturally parallel. Each attention head is a self-contained computation:

  1. Phase 1 (Score): For each query position i, compute score[h,i,j] = Q[h,i] · K[h,j] / √d for all j ≤ i (causal mask). Triangle of dot products per head.
  2. Phase 2 (Softmax): Per row: find max, subtract, exp, sum, normalize. The reduction (max, sum) is within a single row — private to the thread owning that head.
  3. Phase 3 (Value): output[h,i] = ∑j score[h,i,j] · V[h,j]. Accumulates over j — but within a single head, so private to the owning thread.
Property             Value
Parallel dimension   Heads (H) — completely independent
Shared reads         None — each head has its own Q, K, V slices in head-major layout
Exclusive writes     scores[h, *, *] and output[h, *, *] — entire head is private
Reduction            Softmax max/sum is intra-row, within a single head (private)
Compute cost         O(T² · d) per head — quadratic in sequence length
Ideal threads        num_heads (typically 8–32, more than enough to saturate cores)

Softmax — With Causal Masking

File: src/kernels/softmax_kernels.c
Operation: softmax(row) = exp(row - max) / sum(exp(row - max)), applied to each row of the attention score matrix
Called: Once per attention layer, integrated into the attention computation

Softmax parallelization: split heads (or rows) across threads

  Score matrix per head [T × T], lower-triangular (causal):

  Row 0:   [s00]                         ← len=1
  Row 1:   [s10, s11]                    ← len=2
  Row 2:   [s20, s21, s22]               ← len=3
  Row 3:   [s30, s31, s32, s33]          ← len=4
  ...
  Row T-1: [s(T-1)0, ..., s(T-1)(T-1)]   ← len=T

  Each row: find max → subtract max → exp → sum → divide by sum
  Rows are independent! Each row's max/sum is a private reduction.
Property              Value
Parallel dimension    Heads (H) or rows (T) within a head — both embarrassingly parallel
Shared reads          None (in-place operation)
Exclusive writes      score[h, i, *] — each row written independently
Reduction             Row-internal max and sum — private per row, no cross-thread sync
Causal optimization   Row i only processes i+1 elements (lower triangle) + zeroes rest

SwiGLU — Gated Linear Unit Activation

File: src/kernels/swiglu_kernels.c
Operation: output[t,d] = SiLU(gate[t,d]) × value[t,d] where input is split into gate and value halves
Called: Once per FFN block (1x per layer)

SwiGLU parallelization: split tokens across threads

  Input [T × 2D]                                Output [T × D]
  +-------------------+-------------------+     +-------------------+
  | Token 0: gate[D]  | Token 0: value[D] |  →  | Token 0: SiLU×val |  Thread 0
  | Token 1: gate[D]  | Token 1: value[D] |  →  | Token 1: SiLU×val |
  +-------------------+-------------------+     +-------------------+
  | Token 2: gate[D]  | Token 2: value[D] |  →  | Token 2: SiLU×val |  Thread 1
  | Token 3: gate[D]  | Token 3: value[D] |  →  | Token 3: SiLU×val |
  +-------------------+-------------------+     +-------------------+

  Pure element-wise: sigmoid(gate) * gate * value
  No reduction. No shared writes. Perfect parallelism.
Property              Value
Parallel dimension    Tokens (T) — embarrassingly parallel, zero dependencies
Shared reads          None
Exclusive writes      output[t_start..t_end, 0..D-1]
Reduction             None — pure element-wise operation
SIMD efficiency       Excellent — straight FMA + sigmoid approximation over contiguous memory

RoPE — Rotary Position Embedding

File: src/kernels/rope_kernels.c
Operation: Rotate Q and K vectors using precomputed sin/cos tables based on token position
Called: Once per attention block (1x per layer, applied to Q and K separately)

RoPE parallelization: split heads × tokens across threads

  Q or K [H heads × T tokens × head_dim]
  +-----------------------------------------------+
  | Head 0, Token 0: rotate(q, cos[0], sin[0])    |  Thread 0
  | Head 0, Token 1: rotate(q, cos[1], sin[1])    |
  | Head 1, Token 0: rotate(q, cos[0], sin[0])    |
  +-----------------------------------------------+
  | Head 1, Token 1: rotate(q, cos[1], sin[1])    |  Thread 1
  | Head 2, Token 0: rotate(q, cos[0], sin[0])    |
  | Head 2, Token 1: rotate(q, cos[1], sin[1])    |
  +-----------------------------------------------+
  ...

  Rotation: x'[i]      = x[i]*cos[pos,i] - x[i+half]*sin[pos,i]
            x'[i+half] = x[i]*sin[pos,i] + x[i+half]*cos[pos,i]

  Shared reads: cos_cache[pos, half_dim], sin_cache[pos, half_dim]
  In-place write: x[h, t, 0..head_dim-1] — each (h,t) is independent
Property             Value
Parallel dimension   Heads × Tokens (H·T) — all (h,t) pairs independent
Shared reads         cos_cache[T × half_dim], sin_cache[T × half_dim]
Exclusive writes     x[h, t, *] — in-place rotation, one head-vector per (h,t)
Reduction            None — pure element-wise rotation
Best split           Flatten H×T loop, divide evenly across threads
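The rotation rule from the diagram, written out for one (head, token) vector. This is a sketch (the `rope_rotate_ref` name is illustrative): element i pairs with element i+half, and each pair is rotated by the position's precomputed angle.

```c
/* In-place RoPE rotation of one head-vector x[head_dim], given the
 * cos/sin values for this token position (length head_dim/2 each):
 *   x'[i]      = x[i]*cos[i] - x[i+half]*sin[i]
 *   x'[i+half] = x[i]*sin[i] + x[i+half]*cos[i]  */
static void rope_rotate_ref(float *x, const float *cosv, const float *sinv,
                            int head_dim) {
    int half = head_dim / 2;
    for (int i = 0; i < half; i++) {
        float a = x[i], b = x[i + half];
        x[i]        = a * cosv[i] - b * sinv[i];
        x[i + half] = a * sinv[i] + b * cosv[i];
    }
}
```

With cos = 1, sin = 0 this is the identity; with cos = 0, sin = 1 it is a quarter-turn of each (i, i+half) pair.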

Embedding Lookup

File: src/kernels/embedding_kernels.c
Operation: output[t] = token_embeddings[token_id[t]] + pos_embeddings[t]
Called: Once per forward pass (first operation)

Property             Value
Parallel dimension   Tokens (T) — each token lookup is independent
Shared reads         token_embeddings[vocab_size × D] (table), pos_embeddings[T × D]
Exclusive writes     output[t, 0..D-1] — one embedding vector per token
Reduction            None
Special pattern      Irregular memory access (gather) — token_ids determine which rows to fetch
Decode mode          T=1 → no parallelism possible. Prefill (T=512+) benefits greatly.

KV Cache Write

File: src/kernels/kv_cache_kernels.c
Operation: Copy new K and V vectors for current token into the cache at position token_index
Called: Once per layer (append new token's K,V to cache)

Property               Value
Parallel dimension     KV heads (num_kv_heads) — each head is independent
Shared reads           k_token[num_kv_heads × head_dim], v_token[num_kv_heads × head_dim]
Exclusive writes       k_cache[h, token_index, *], v_cache[h, token_index, *]
Reduction              None — pure memcpy per head
Worth parallelizing?   Small — only head_dim floats per head. May not justify dispatch overhead.

Summary: Parallelization Map

Kernel      Split Dimension          Shared Reads      Reduction           Priority
GEMV        Rows (M)                 x[K]              None                HIGH — dominates decode time
GEMM        Rows (M)                 B[K×N]            None                HIGH — dominates prefill time
Attention   Heads (H)                None              Private per head    HIGH — O(T²) cost
RMSNorm     Tokens (T)               γ[D]              Private per token   MEDIUM — called 48x/fwd
SwiGLU      Tokens (T)               None              None                MEDIUM — element-wise
Softmax     Heads (H) or Rows (T)    None              Private per row     MEDIUM
RoPE        H × T                    cos/sin tables    None                LOW — cheap per call
Embedding   Tokens (T)               Embedding table   None                LOW — prefill only
KV Cache    KV Heads                 Token vectors     None                LOW — small memcpy
Every single kernel in the transformer has embarrassingly parallel structure. The only reductions (RMSNorm sum-of-squares, Softmax max/sum) are intra-token or intra-row — private to whichever thread owns that token or head. No kernel requires cross-thread communication for its core computation. This is why a persistent thread pool with simple (ith, nth) partitioning is the perfect parallelization strategy for LLM inference.
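Because every kernel takes the same (ith, nth, args) signature, the whole map above reduces to one dispatch pattern. A sketch of that contract (the `run_serial` harness and `fill_task` demo are illustrative helpers, not codebase functions): any task written against this signature can be run serially by looping over thread ids, which is exactly how the §16 parity tests compare pooled output against a serial baseline.

```c
/* The kernel contract: a task receives its thread id and the thread count,
 * derives its own slice, and touches only that slice. */
typedef void (*ck_task_fn)(int ith, int nth, void *args);

/* Serial driver: running the task once per thread id must produce the same
 * result as a real pool running them concurrently, since slices never overlap. */
static void run_serial(ck_task_fn fn, int nth, void *args) {
    for (int ith = 0; ith < nth; ith++)
        fn(ith, nth, args);
}

/* Hypothetical demo task: each "thread" stamps its id into its slice,
 * using the same last-thread-takes-the-remainder rule as the kernels. */
typedef struct { int *buf; int n; } fill_args_t;

static void fill_task(int ith, int nth, void *argp) {
    fill_args_t *a = argp;
    int per = a->n / nth;
    int start = ith * per;
    int end = (ith == nth - 1) ? a->n : start + per;
    for (int i = start; i < end; i++)
        a->buf[i] = ith;
}
```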

§16 Testing & Validation

The thread pool dispatch layer is validated by a dedicated parity test suite that compares every dispatch wrapper against its serial counterpart. The tests verify both numerical correctness (parity with the serial output within a small tolerance) and performance (speedup from parallelization).

Test Suite: test_threadpool_parity.c

Location: tests/test_threadpool_parity.c

Test     Serial Kernel                     Thread Pool Dispatch                       What It Validates
Test 1   gemv_q8_0_q8_0()                  gemv_q8_0_q8_0_parallel_dispatch()         Q8_0 weights × Q8_0 input (logits, V proj)
Test 2   gemv_q5_0_q8_0()                  gemv_q5_0_q8_0_parallel_dispatch()         Q5_0 weights × Q8_0 input (Q/K proj, MLP, out proj)
Test 3   gemv_fused_q5_0_bias_dispatch()   gemv_fused_q5_0_bias_parallel_dispatch()   Fused: quantize FP32 → GEMV → add bias (production path)
Test 4   (n/a)                             Fused dispatch with NULL bias              Edge case: bias pointer is NULL (no bias add)
Test 5   (n/a)                             Small-M dispatch latency measurement       Overhead of packing args + waking threads on tiny workloads

Test Dimensions (Qwen2 0.5B shapes)

Tests run on real model dimensions to catch alignment and edge-case issues:

Config        M (output rows)   K (input dim)   Corresponds To
qkv_proj      896               896             Q/K/V projections (embed → embed)
mlp_gate_up   9,728             896             MLP gate+up projection (embed → 2× intermediate, gate and up fused)
mlp_down      896               4,864           MLP down projection (intermediate → embed)
logits        151,936           896             LM head (embed → vocab, largest GEMV)

How to Run

# Quick parity check (~5 seconds)
make test-threadpool-parity-quick

# Full parity + speed test (all model dimensions)
make test-threadpool-parity

# Verbose output (shows max_abs_diff, max_rel_diff for every config)
make test-threadpool-parity-verbose

Correctness Criteria

Tolerance: 1e-3 absolute. Thread pool dispatch splits rows across threads. Each thread calls the same _parallel_simd kernel on its row subset. Since each output row is computed from independent dot products, the thread pool result should be bit-identical to serial — the tolerance exists only to guard against floating-point non-determinism from different SIMD accumulation orders.
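The comparison itself is simple. A sketch of the check described above (the `max_abs_diff` helper is illustrative; the actual test also reports a relative diff): reduce the element-wise difference between serial and pooled outputs and compare against the 1e-3 absolute tolerance.

```c
#include <math.h>

/* Largest absolute element-wise difference between two output vectors.
 * For a parity test: assert max_abs_diff(serial_y, pool_y, M) <= 1e-3f. */
static float max_abs_diff(const float *a, const float *b, int n) {
    float m = 0.0f;
    for (int i = 0; i < n; i++) {
        float d = fabsf(a[i] - b[i]);
        if (d > m) m = d;
    }
    return m;
}
```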

CI Integration

The thread pool parity test runs automatically in three places:

Trigger                        What Runs                        Where Configured
make test                      test-threadpool-parity-quick     Makefile (test target)
make llamacpp-parity-full      test-threadpool-parity (full)    Makefile (llamacpp-parity-full target)
Nightly / Push to main / PR    test-threadpool-parity (full)    scripts/nightly_runner.py (MAKE_TARGETS["threadpool_parity"])

The nightly workflow (.github/workflows/nightly.yml) runs on every push to main, every pull request, and nightly at 2am UTC. If the thread pool dispatch diverges from serial output, the test exits with code 1 and the nightly report marks it as failed.

ADR: Thread Pool Replaced OpenMP for Parallelization

The codebase previously had an OpenMP parallelization pass (parallel_pass.py) that annotated IR operations with #pragma omp parallel for. This was superseded by the persistent pthread thread pool for the following reasons:

Concern                 OpenMP                                                                  Thread Pool
Thread creation         Fork/join per #pragma region (~15–50µs)                                 Created once at init, spin-wait between dispatches (~0.1µs wake)
Core oversubscription   OpenMP team + thread pool = 2N threads on N cores                       One pool, known thread count, no conflict
Codegen integration     parallel_pass.py emits pragmas, but codegen_v6_6.py never reads them    Macro redirects in ck_parallel_decode.h — zero codegen changes
Kernel interface        Implicit OpenMP runtime                                                 Explicit (ith, nth) args to _parallel_simd kernels

Files affected by this decision:

Related Test Suites

The thread pool parity test complements existing tests that cover other aspects of the parallelization stack:

Make Target              Test File                                       What It Tests
test-threadpool-parity   tests/test_threadpool_parity.c                  Serial vs thread pool dispatch (this page)
test-gemv-omp            tests/test_gemv_omp_parity.c                    Serial vs OpenMP parallel (legacy comparison)
test-fusion-gemv         unittest/fusion/test_gemv_fused_quant_bias.py   Fused vs unfused GEMV correctness + speedup
llamacpp-parity-full     scripts/run_parity_smoketest.sh                 CK-Engine kernels vs llama.cpp reference (includes thread pool + OMP)
e2e-v66                  version/v6.6/test/Makefile                      Full v6.6 pipeline: IR → codegen → compile → run (thread pool always enabled)