CK-Engine Thread Pool

A plain-English deep dive into how persistent pthread thread pools work, why they exist, and how they compare to OpenMP and memory pools.

ck_threadpool.c · Sub-µs Dispatch · Spin + Condvar Hybrid

Contents

  1. What Is a "Pool"? (The Core Idea)
  2. The Problem: Why Not Just Use OpenMP?
  3. Anatomy of CK-Engine's Thread Pool
  4. Lifecycle: Create → Dispatch → Destroy
  5. Dispatch Deep Dive: What Actually Happens
  6. Spin-Wait vs. Sleep: The Hybrid Polling Design
  7. Atomics, Mutexes, and Memory Ordering
  8. Cache Lines & False Sharing
  9. The Barrier: How Threads Synchronize
  10. Thread Pool vs. OpenMP — Side-by-Side
  11. Thread Pool vs. Memory Pool — Same Idea, Different Resource
  12. Why Is It Faster?
  13. Real Numbers: Before & After
  14. How Parallelization Actually Works
  15. Per-Kernel Parallelization Analysis
  16. Testing & Validation

§1 What Is a "Pool"?

A pool is a set of expensive resources you create once and reuse over and over, instead of creating and destroying them every time you need one.

Analogy: Imagine you run a restaurant kitchen where every order takes 4 cooks: you plus 3 chefs.

No pool: Every time an order comes in, go out on the street, hire 3 chefs, cook the meal, then fire them all. Next order — hire 3 more chefs. (This is what OpenMP does.)

Pool: Hire 3 chefs on day one. They stand ready at their stations. Order comes in — you shout "Go!" and all 4 of you (including yourself) cook simultaneously. Between orders, chefs wait at their stations. (This is a thread pool.)

The word "pool" shows up in many places in systems programming:

Thread Pool

Pre-created threads waiting for work. Avoids spawn/join overhead per task.

Memory Pool

Pre-allocated memory blocks. Avoids malloc/free overhead per allocation.

Connection Pool

Pre-opened database/network connections. Avoids TCP handshake per query.

Object Pool

Pre-constructed objects (e.g. game entities). Avoids constructor/destructor overhead.

The common pattern: creating the resource is expensive; using it is cheap. A pool pays the creation cost once, then amortizes it over thousands of uses.

§2 The Problem: Why Not Just Use OpenMP?

OpenMP looks appealing — just add #pragma omp parallel for and you get parallelism. But there's a hidden cost.

During inference, CK-Engine runs the transformer decode loop. For each token generated, it calls kernel functions (matrix-vector multiplies, RMSNorm, SwiGLU, etc.) roughly 500+ times. Each call is a short burst of work (microseconds to low milliseconds).

OpenMP's Fork/Join Model

// Every single #pragma omp parallel does this:

Main thread
 |
 |—— fork ——> [spawn/wake N threads]    ← 50–200 µs overhead
 |    |      |      |      |
 |  work   work   work   work           ← actual compute (maybe 20 µs)
 |    |      |      |      |
 |<— join ——— [wait + tear down]        ← 50–200 µs overhead
 |
 |—— fork ——> [spawn/wake again]        ← overhead AGAIN
 |   ...
 v
500+ fork/joins per token!

With each fork/join costing 50–200µs and 500+ kernel calls per token:

The math: 500 calls × 100µs overhead = 50ms of pure overhead per token — just from spawning and joining threads, before any useful work is done. The actual compute per kernel might only be 20µs!

Real Measurement from CK-Engine (gemv_omp.c)

/*
 * Measured on i7-3630QM (4C/8T), Qwen 0.5B:
 *   Serial kernels:   170 ms/tok  (5.9 tok/s)
 *   OMP parallel:     327 ms/tok  (3.1 tok/s)  ← 1.9x SLOWER
 *
 * The math is correct (10/10 parity tests pass) but the
 * threading model is wrong for this workload.
 */

OpenMP parallelism made it almost 2x slower. The fork/join overhead completely ate the parallelism gains and then some.

§3 Anatomy of CK-Engine's Thread Pool

Let's look at the actual data structure. Every field exists for a specific reason.

struct ck_threadpool {
    /* --- Dispatch state (each on its own cache line) --- */
    _Alignas(64) atomic_int   n_dispatch;   // bumped to wake workers
    _Alignas(64) atomic_int   n_complete;   // workers signal "I'm done"
    _Alignas(64) ck_work_fn_t work_fn;      // pointer to current kernel
                 void        *work_args;    // arguments for the kernel

    /* --- Barrier for mid-kernel sync --- */
    ck_barrier_t barrier;

    /* --- Worker management --- */
    int          n_threads;    // total (including main)
    ck_worker_t  workers[64];  // thread handles

    /* --- Shutdown / pause --- */
    _Alignas(64) atomic_int stop;    // set to 1 to kill workers
    _Alignas(64) atomic_int paused;  // set to 1 for power saving

    /* --- Hybrid sleep mechanism --- */
    pthread_mutex_t mutex;
    pthread_cond_t  cond_dispatch; // workers sleep here
    pthread_cond_t  cond_done;     // main sleeps here
};

Why _Alignas(64) Everywhere?

Each atomic counter is placed on its own 64-byte cache line. This prevents "false sharing" — where two unrelated variables on the same cache line cause all CPU cores to fight over that cache line. More on this in §8.

The Worker Struct

typedef struct {
    pthread_t thread;            // OS thread handle
    int       id;               // 0 = main, 1..N-1 = workers
    struct ck_threadpool *pool; // back-pointer to the pool
} ck_worker_t;

Thread 0 is always the main thread — the one that calls ck_threadpool_dispatch(). It doesn't get a new pthread. It does its share of the work alongside the workers.

§4 Lifecycle: Create → Dispatch → Destroy

1. Create — ck_threadpool_create(n_threads)

Called once at engine startup. Allocates the pool struct (cache-line aligned), then spawns N−1 worker pthreads. Workers immediately enter the spin-wait loop. The main thread is thread 0 — no extra pthread is created for it.

2. Dispatch (called ~500x per token) — ck_threadpool_dispatch(pool, fn, args)

Sets the work function + args, bumps n_dispatch (an atomic counter), and signals sleeping workers via condvar. Main thread also runs its share. All threads do their chunk, then main waits for n_complete to hit N−1.

3. Pause/Resume (between batches)

Between user interactions (waiting for input), workers can be paused so they sleep on a condvar and use 0% CPU. Resume wakes them instantly.

4. Destroy — ck_threadpool_destroy(pool)

Sets stop = 1, broadcasts the condvar to wake sleeping workers, then pthread_joins every worker thread. Frees the pool memory.

Global Pool Convenience

For CK-Engine, there's a global singleton pool accessed via ck_threadpool_global(). It uses pthread_once to ensure thread-safe lazy initialization — no matter how many threads call it simultaneously, the pool is created exactly once.

static ck_threadpool_t *g_threadpool = NULL;
static pthread_once_t  g_threadpool_once = PTHREAD_ONCE_INIT;

ck_threadpool_t *ck_threadpool_global(void) {
    pthread_once(&g_threadpool_once, global_pool_init);
    return g_threadpool;
}

§5 Dispatch Deep Dive: What Actually Happens

Let's trace exactly what happens when CK-Engine needs to run one kernel across 4 threads.

ck_threadpool_dispatch(pool, gemv_q8_0, &args)

MAIN THREAD (ith=0)              WORKER 1        WORKER 2        WORKER 3
 |                               [spinning]      [spinning]      [spinning]
 |                               checking        checking        checking
 | pool->work_fn   = gemv_q8_0   n_dispatch      n_dispatch      n_dispatch
 | pool->work_args = &args          ...             ...             ...
 | n_complete = 0                   ...             ...             ...
 |                                  ...             ...             ...
 | atomic n_dispatch++ ——————>   WAKE!           WAKE!           WAKE!
 | condvar broadcast   ——————>   (if sleeping)   (if sleeping)   (if sleeping)
 |                                |               |               |
 | gemv(0, 4, args)              gemv(1,4,args)  gemv(2,4,args)  gemv(3,4,args)
 | rows 0..249                   rows 250..499   rows 500..749   rows 750..999
 | ...done                       ...done         ...done         ...done
 |                                |               |               |
 |                               n_complete++    n_complete++    n_complete++
 |                               (back to spin)  (back to spin)  last one signals
 | spin-wait: n_complete == 3?                                   cond_signal!
 | YES → return
 v
next kernel call...

Key Details

Inside the Work Function

Each kernel receives (ith, nth, args) and partitions work by thread index:

void gemv_parallel(int ith, int nth, void *args) {
    gemv_args_t *a = args;
    // Each thread computes its slice of output rows
    int rows_per_thread = a->M / nth;
    int start = ith * rows_per_thread;
    int end   = (ith == nth - 1) ? a->M : start + rows_per_thread;

    for (int row = start; row < end; row++) {
        vec_dot(a->K, &a->y[row], a->W + row * a->K, a->x);
    }
}

§6 Spin-Wait vs. Sleep: The Hybrid Polling Design

How do worker threads wait for work? There are two extremes, and CK-Engine uses both.

Pure Spin-Wait

Thread sits in a tight loop checking an atomic variable: while (no_work) { _mm_pause(); }

PRO: Instant wake-up (nanoseconds)
CON: Burns 100% CPU while waiting

Pure Condvar Sleep

Thread calls pthread_cond_wait() and the OS puts it to sleep.

PRO: Zero CPU while waiting
CON: Wake-up takes 5–50µs (kernel context switch)

CK-Engine's Hybrid Approach

Workers do both:

  1. First, spin for CK_THREADPOOL_SPIN_COUNT (1024) iterations, checking the atomic dispatch counter.
  2. If no work arrives during spin, fall back to pthread_cond_wait() (sleep).
  3. When new work arrives, spinning workers wake instantly; sleeping workers wake via condvar broadcast.
// worker_main() — the hybrid wait loop
for (;;) {
    int spins = 0;
    for (;;) {
        if (atomic_load(&pool->stop))  return NULL;  // shutdown?

        int current = atomic_load(&pool->n_dispatch);
        if (current != last_dispatch) {              // new work!
            last_dispatch = current;
            break;                                   // go execute
        }

        _mm_pause();    // x86 hint: "I'm spinning, save power"
        spins++;

        if (spins >= 1024) {                        // tired of spinning?
            pthread_mutex_lock(&pool->mutex);
            // Re-check under lock (avoid missed wakeup!)
            if (atomic_load(&pool->n_dispatch) == last_dispatch &&
                !atomic_load(&pool->stop))
                pthread_cond_wait(&pool->cond_dispatch, &pool->mutex);
            pthread_mutex_unlock(&pool->mutex);
            spins = 0;
        }
    }
    // execute work...
}
Why re-check under the lock? This avoids the "missed wakeup" race condition. Without it: main could bump n_dispatch and broadcast between the worker's last spin check and its cond_wait call. The worker would then sleep forever, never seeing the work. The lock + re-check makes this impossible.

What Is _mm_pause()?

It's an x86 intrinsic that emits the PAUSE instruction, telling the CPU: "I'm in a spin-wait loop." The CPU can then:

  1. Reduce power draw while the loop spins.
  2. Hand execution resources to the sibling hyperthread on the same core.
  3. Avoid the memory-order mis-speculation penalty when the loop finally exits, making the wake-up faster.

On non-x86 (ARM, etc.), CK-Engine defines CK_SPIN_PAUSE() as a no-op.

§7 Atomics, Mutexes, and Memory Ordering

These are the fundamental building blocks of any concurrent system. Let's understand each one from scratch.

What Is an Atomic Operation?

An atomic operation is one that completes in a single, indivisible step from the perspective of all other threads. No other thread can see a "half-done" state.

WITHOUT atomics (broken):              WITH atomics (correct):

Thread A: read counter  (= 5)          Thread A: atomic_fetch_add(&counter, 1)
Thread B: read counter  (= 5)            → hardware guarantees only one
Thread A: write counter = 6              thread touches it at a time
Thread B: write counter = 6            Thread B: atomic_fetch_add(&counter, 1)

Result: counter = 6 (should be 7!)     Result: counter = 7 (correct)

CK-Engine uses atomics for everything on the hot path:

The dispatch counter n_dispatch (main bumps it, workers poll it).
The completion counter n_complete (workers bump it, main polls it).
The stop and paused flags (checked on every spin iteration).
The barrier's n_arrived and n_phase counters.

Memory Ordering

CPUs and compilers can reorder memory operations for performance. Memory ordering tells them: "don't reorder this particular operation past certain points."

Ordering             | What It Means                                                 | Used In CK-Engine For
memory_order_relaxed | No ordering guarantee. Just make the read/write atomic.       | Resetting n_arrived (only writer, no dependency)
memory_order_acquire | All reads/writes after this load cannot be moved before it.   | Workers reading n_dispatch — must see work_fn/work_args written before
memory_order_release | All reads/writes before this store cannot be moved after it.  | Main bumping n_dispatch — ensures work_fn/work_args are visible
memory_order_acq_rel | Both acquire and release. Full fence in both directions.      | fetch_add on n_complete and n_arrived
Why does this matter? Without proper ordering, the main thread could bump n_dispatch but the worker might read the old work_fn pointer (from a previous dispatch) because the CPU reordered the stores. The release on the store and acquire on the load form a happens-before relationship that prevents this.

What Is a Mutex?

A mutex (mutual exclusion) is a lock. Only one thread can hold it at a time. Everyone else waits. CK-Engine uses mutexes only for the condvar sleep mechanism — the hot path (spin-wait) doesn't touch the mutex at all.

// Mutex usage — only in the cold sleep path
pthread_mutex_lock(&pool->mutex);
  // Re-check condition (avoid lost wakeup)
  if (still_no_work)
      pthread_cond_wait(&pool->cond_dispatch, &pool->mutex);
      //     ^^ releases mutex, sleeps, re-acquires on wake
pthread_mutex_unlock(&pool->mutex);

What Is a Condition Variable (Condvar)?

A condvar lets a thread say: "Wake me up when something changes." It works together with a mutex:

  1. Wait side: Lock mutex → check condition → if false, cond_wait (atomically releases mutex + sleeps) → on wake, re-acquires mutex
  2. Signal side: Lock mutex → change state → cond_broadcast (wakes all waiters) → unlock

CK-Engine has two condvars:

cond_dispatch: workers sleep here when no work arrives during their spin window; dispatch broadcasts it to wake them.
cond_done: the main thread can sleep here while waiting for completion; the last finishing worker signals it.

§8 Cache Lines & False Sharing

This is one of the most important performance concepts in multi-threaded code.

How CPU Caches Work (Simplified)

CPUs don't read individual bytes from RAM. They read in cache lines — 64-byte chunks. When one core writes to a cache line, all other cores that have a copy of that line must invalidate it (MESI protocol). This is called a cache coherency miss.

FALSE SHARING: Two unrelated variables on the SAME cache line

                     64-byte cache line
+------------------------------------------------------------+
| n_dispatch (4 bytes) | n_complete (4 bytes) |   padding     |
+------------------------------------------------------------+
          ^                      ^
     Core 0 writes          Core 1 writes
     (bumps dispatch)       (bumps complete)

PROBLEM: Every write by Core 0 invalidates Core 1's cache,
and vice versa. Both cores stall on every access!

FIXED: Each variable on its OWN cache line

  Cache line 1 (64 bytes)       Cache line 2 (64 bytes)
+--------------------------+  +--------------------------+
| n_dispatch |   padding   |  | n_complete |   padding   |
+--------------------------+  +--------------------------+
        ^                             ^
  Core 0: no conflict           Core 1: no conflict

This is why every atomic_int in the pool struct has _Alignas(64) — it forces each one onto its own cache line, eliminating false sharing entirely.

How bad is false sharing? On modern x86, a cache coherency miss costs ~40–100 nanoseconds. In a tight spin-wait loop executing billions of iterations, this can reduce throughput by 10–50x. It's one of the most common multi-threaded performance bugs.

§9 The Barrier: How Threads Synchronize

A barrier is a meeting point. All threads must arrive before any can proceed.

CK-Engine implements a custom spin-wait barrier using two atomic counters: n_arrived (how many threads reached the barrier) and n_phase (which "round" of the barrier we're on).

// barrier_wait() — ck_threadpool.c:89–111
static void barrier_wait(ck_barrier_t *b) {
    const int phase = atomic_load(&b->n_phase);

    if (atomic_fetch_add(&b->n_arrived, 1) == b->n_threads - 1) {
        // I'm the LAST thread to arrive
        atomic_store(&b->n_arrived, 0);       // reset for next use
        atomic_store(&b->n_phase, phase + 1); // advance phase
    } else {
        // NOT the last thread — spin until phase advances
        while (atomic_load(&b->n_phase) == phase) {
            _mm_pause();
        }
    }
}
4-thread barrier example:

Thread 0     Thread 1     Thread 2     Thread 3
   |            |            |            |
   | arrive     |            |            |         n_arrived = 1, phase = 0
   |            | arrive     |            |         n_arrived = 2, phase = 0
   | spin       | spin       |            |
   | spin       | spin       | arrive     |         n_arrived = 3, phase = 0
   | spin       | spin       | spin       |
   | spin       | spin       | spin       | arrive  n_arrived = 4 == N → LAST!
   |            |            |            | reset n_arrived = 0
   |            |            |            | phase = 1
   | GO!        | GO!        | GO!        | GO!     all see phase changed
   v            v            v            v

The phase counter trick allows the barrier to be reused without an explicit reset — each barrier use just increments the phase. After many spins without the last thread arriving, threads call sched_yield() to be nice to the OS on oversubscribed systems.

§10 Thread Pool vs. OpenMP — Side-by-Side

Aspect                    | OpenMP                                                            | CK Thread Pool
Thread lifetime           | Created per #pragma region (or pooled by runtime — inconsistent)  | Created once at startup, live until shutdown
Dispatch cost             | 50–200 µs (fork/join overhead)                                    | < 1 µs (atomic increment + optional condvar signal)
Control over scheduling   | Limited (schedule(static), dynamic)                               | Full — you write the partition logic
Idle CPU usage            | Implementation-dependent (some spin, some sleep)                  | Configurable: spin → condvar sleep → pause
Barrier                   | Implicit at end of parallel region                                | Explicit ck_threadpool_barrier()
Works with short kernels? | No — overhead dominates                                           | Yes — designed for thousands of tiny dispatches
Code complexity           | One line: #pragma omp parallel for                                | Must write work function, manage pool lifecycle
Portability               | Standard, compiler-supported everywhere                           | POSIX pthreads (Linux/macOS)
When to use what:
OpenMP is great for long-running parallel regions (scientific computing, image processing, big matrix operations). One fork/join overhead amortized over milliseconds of work is fine.

Thread pools are essential when you dispatch thousands of short tasks per second (inference, game engines, audio processing, network servers).

§11 Thread Pool vs. Memory Pool — Same Idea, Different Resource

Both are "pools" — they pre-create expensive resources. The key differences are about what they pool and what synchronization they need.

Aspect                   | Thread Pool                                  | Memory Pool
What is pooled?          | OS threads (pthreads)                        | Memory blocks (heap allocations)
What is expensive?       | pthread_create / pthread_join (~50–200 µs)   | malloc / free (~0.1–5 µs)
Pool initialized with... | N live threads, waiting for work             | One big block of memory, subdivided into slots
"Acquire" from pool      | dispatch(fn, args) — give threads work       | pool_alloc(size) — get a memory block
"Release" back to pool   | Implicit — threads go back to waiting        | pool_free(ptr) — return block to free list
Synchronization needed   | Atomics + condvar (dispatch/wait signaling)  | Atomics or mutex (free-list management)
Active resources...      | ...execute code concurrently                 | ...hold data passively

The Deeper Conceptual Link

The pool pattern in both cases:

     Thread Pool                     Memory Pool
+---------------------+        +---------------------+
| STARTUP (once)      |        | STARTUP (once)      |
| Create N threads    |        | malloc(big_block)   |
| All threads idle    |        | Carve into slots    |
+---------+-----------+        +---------+-----------+
          |                              |
          v                              v
+---------------------+        +---------------------+
| HOT PATH (fast)     |        | HOT PATH (fast)     |
| dispatch(fn, args)  |        | slot = pool_alloc() |
| < 1 us              |        | < 0.05 us           |
+---------+-----------+        +---------+-----------+
          |                              |
          v                              v
+---------------------+        +---------------------+
| RETURN TO POOL      |        | RETURN TO POOL      |
| Threads go idle     |        | pool_free(slot)     |
| (spin or sleep)     |        | (back to free list) |
+---------+-----------+        +---------+-----------+
          |                              |
          v                              v
+---------------------+        +---------------------+
| SHUTDOWN (once)     |        | SHUTDOWN (once)     |
| join all threads    |        | free(big_block)     |
+---------------------+        +---------------------+

Synchronization Comparison

Thread Pool Sync

atomic_int for dispatch/complete counters (hot path).
pthread_mutex + cond for sleep fallback (cold path).
_mm_pause() spin hint for low-latency wake.
memory_order_acquire/release to publish work descriptor.

Memory Pool Sync

Atomic CAS for lock-free free-list (typical).
Or a simple mutex around the free-list pointer.
Bump allocator variant needs zero sync (per-thread pools).
No condvars needed — memory doesn't need to "wake up."

§12 Why Is It Faster?

Cost Breakdown: OpenMP vs. Thread Pool

Per-kernel dispatch cost comparison (4 threads, x86-64):

OpenMP fork/join:
+------------------------------------------------------------------+
| pthread_create x3 | work  | pthread_join x3 | scheduler |
|   ~~~100 us~~~    | 20us  |   ~~~100 us~~~  |  ~20 us~  |
+------------------------------------------------------------------+
Total: ~240 us  (only 20 us is useful work!)

Thread pool dispatch:
+---------------------------------------+
| atomic++ | work  | spin check |
|  ~0.1us  | 20us  |   ~0.1us   |
+---------------------------------------+
Total: ~20.2 us  (99% is useful work!)

The Five Reasons It's Faster

1. No Thread Creation

pthread_create involves a system call, stack allocation (~8MB virtual), kernel scheduling setup, and TLS initialization. The thread pool does this once. Subsequent dispatches just flip an atomic counter — a single CPU instruction.

2. No Join Overhead

pthread_join involves a system call + scheduler interaction to reap the thread. The pool never joins during work — completion is tracked with a simple atomic counter.

3. Warm Caches

Persistent threads keep their L1/L2 caches warm across dispatches. A fresh thread starts with cold caches and must re-fetch everything from L3/RAM.

4. No Kernel Involvement

The spin-wait hot path is entirely in userspace. No system calls, no context switches, no kernel scheduler decisions. Just atomic loads in a tight loop.

5. Main Thread Participates

The main thread doesn't just wait — it does 1/N of the work. With OpenMP, the orchestration thread often sits idle during the parallel region. CK-Engine wastes zero cores.

Bonus: Power Management

Between inference batches (e.g., waiting for user input), workers sleep on a condvar using 0% CPU. No busy-waiting during idle periods. OpenMP's behavior here varies by implementation and is often wasteful.

§13 Real Numbers: Before & After

Measured on i7-3630QM (4 cores / 8 threads), Qwen2 0.5B model, Q4_K_M quantization.

Configuration                   | ms/token | Tokens/sec | Relative
Serial (single-threaded)        | 170      | 5.9        | baseline
OpenMP #pragma omp parallel for | 327      | 3.1        | 1.9x slower
Persistent pthread thread pool  | —        | —          | Expected: 2–4x faster than serial
The key insight: OpenMP didn't fail because parallelism is bad. It failed because the overhead of setting up parallelism exceeded the actual work being parallelized. The thread pool removes that overhead, letting the parallelism actually help.

Where Does Overhead Go?

500 kernel calls per token, 4 threads:

OpenMP overhead per token:
  500 calls x ~200us fork/join = 100,000 us = 100 ms   (wasted)
  + actual compute:                            ~70 ms  (useful)
  = total: 170+ ms  (but measured 327 ms due to cache
                     thrashing, scheduler overhead)

Thread pool overhead per token:
  500 calls x ~0.2us dispatch = 100 us = 0.1 ms        (wasted)
  + actual compute:                       ~70 ms       (useful)
  = total: ~70 ms

Dispatch overhead: 100 ms (OpenMP) vs 0.1 ms (pool) = 1000x less overhead

§14 How Parallelization Actually Works

Forget the abstractions for a moment. Here is exactly what happens, from physical hardware up.

Step 1: Pin Threads to Physical Cores

At startup, the thread pool creates N−1 persistent pthreads. Ideally, each thread is pinned (affined) to a distinct physical CPU core using sched_setaffinity() or pthread_setaffinity_np(). This means the OS scheduler will not migrate a thread to a different core — it stays put.

4-core CPU with hyperthreading (8 logical CPUs):

Physical Core 0   Physical Core 1   Physical Core 2   Physical Core 3
+-------+-------+ +-------+-------+ +-------+-------+ +-------+-------+
| CPU 0 | CPU 4 | | CPU 1 | CPU 5 | | CPU 2 | CPU 6 | | CPU 3 | CPU 7 |
+-------+-------+ +-------+-------+ +-------+-------+ +-------+-------+
    ^                 ^                 ^                 ^
 Thread 0          Thread 1          Thread 2          Thread 3
  (main)           (worker)          (worker)          (worker)

Each thread pinned to one physical core. We avoid hyperthreads
(CPU 4–7) because they share execution units with the primary
logical CPU on the same core — using both gives diminishing
returns for compute-bound work.

Why pin? Three reasons:

  1. Warm caches: a thread's L1/L2 working set stays valid because the thread never moves to another core.
  2. No migration stalls: the scheduler cannot bounce a thread between cores in the middle of a kernel.
  3. Predictable memory locality: on NUMA systems, a pinned thread keeps its memory accesses close to its own node.

CK-Engine's Current State

The system topology module (src/system_topology.c) can discover the CPU layout:

// topology_discover_affinity() reads the process CPU mask
cpu_set_t mask;
if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
    for (int i = 0; i < MAX_CPUS; i++) {
        if (CPU_ISSET(i, &mask)) {
            aff->affinity_cpus[aff->num_affinity_cpus++] = i;
        }
    }
}

The GGML layer has full NUMA-aware pinning with three strategies: DISTRIBUTE (round-robin across NUMA nodes), ISOLATE (all threads on same node), and NUMACTL (respect external numactl settings). Intel hybrid CPU detection (P-core vs E-core via CPUID) is also available to skip efficiency cores for compute-heavy work.

Step 2: Identify Embarrassingly Parallel Loops

An embarrassingly parallel loop is one where every iteration is completely independent — no iteration reads what another writes. These loops can be split across cores with zero synchronization.

The key question for every loop: Can two iterations step on each other?

If iteration i only writes to output[i] and only reads from shared inputs, then yes, it's embarrassingly parallel. Every neural network kernel in CK-Engine has this property along at least one dimension.

Here's what the thread pool does to every such loop:

// BEFORE: Serial loop over M rows
for (int row = 0; row < M; row++) {
    compute_row(row, ...);
}

// AFTER: Thread pool splits the loop across N threads
void work_fn(int ith, int nth, void *args) {
    int rows_per_thread = M / nth;
    int start = ith * rows_per_thread;
    int end   = (ith == nth - 1) ? M : start + rows_per_thread;

    for (int row = start; row < end; row++) {
        compute_row(row, ...);   // exact same computation, just a slice
    }
}
ck_threadpool_dispatch(pool, work_fn, &args);

Each thread computes a contiguous slice of the output. Thread 0 does rows 0..249, thread 1 does rows 250..499, etc. The total work is the same; it's just divided.

Step 3: Ensure No Two Threads Write to the Same Cache Line

This is the critical constraint. The rule is:

The Write-Exclusivity Rule:
Reading from the same cache line across threads is FREE — all cores get their own read-only copy (MESI "Shared" state).

Writing to the same cache line from two threads is EXPENSIVE — every write invalidates all other cores' copies (MESI "Invalid" state), forcing a 40–100ns stall on every access.

In practice, this means the output array partition must be cache-line aligned at the boundaries between threads. Since a cache line is 64 bytes = 16 floats:

Output array: 1000 floats, 4 threads

Thread 0 writes    Thread 1 writes    Thread 2 writes    Thread 3 writes
output[0..249]     output[250..499]   output[500..749]   output[750..999]

Cache lines (16 floats each):
|0..15|16..31|...|240..255|256..271|...|496..511|512..527|...|752..767|768..783|...|992..1007|
          ^                                 ^
    Thread boundaries fall on cache line boundaries
    250 floats = 15.625 cache lines → round up to 256 (16 lines)

Shared INPUT (all threads read, no one writes):
  Weight matrix W[M×K]: all threads read overlapping rows → FREE (Shared state)
  Input vector  x[K]:   all threads read the same vector  → FREE (Shared state)

For transformer kernels, this alignment happens naturally because:

Model dimensions (hidden size, FFN width, vocabulary rows) are typically multiples of 64, so equal per-thread slices begin on cache-line boundaries.
Each thread writes one contiguous run of rows, so at worst only the cache lines at the slice boundaries could ever be touched by two threads.

Step 4: What About Reads?

Reads are shared freely. This is what makes matrix operations so parallelizable:

GEMV: y = W · x

Shared read: Input vector x[K] — all threads read the same vector.
Shared read: Weight matrix W[M×K] — each thread reads its own rows, but the data is read-only.
Exclusive write: Output y[M] — each thread writes its own slice of y.

RMSNorm: y = x · γ / RMS(x)

Shared read: Gamma weights γ[D] — all threads read the same normalization weights.
Per-token read: Each thread reads its own token's input x[t].
Exclusive write: Output y[t] — each thread writes its own token.

The MESI cache coherency protocol makes shared reads essentially free. When multiple cores read the same cache line, each core gets its own copy in "Shared" (S) state. No invalidation occurs. The cache line stays resident in every core's L1/L2 simultaneously.

Step 5: Handle Reductions Carefully

Some kernels have reduction operations — where all threads must contribute to a single result (e.g., sum, max). These break the "no shared writes" rule and need special handling.

Reduction example: RMSNorm needs sum of squares across D elements

BAD: Multiple threads write to the same accumulator
  Thread 0: sum += x[0]²   + x[1]²   + ...   ↓
  Thread 1: sum += x[256]² + x[257]² + ...   ↓  RACE CONDITION on sum!
  Thread 2: sum += x[512]² + x[513]² + ...   ↓

GOOD: Each thread has a private partial sum, then combine
  Thread 0: partial[0] = x[0]²   + x[1]²   + ...   (private write)
  Thread 1: partial[1] = x[256]² + x[257]² + ...   (private write)
  Thread 2: partial[2] = x[512]² + x[513]² + ...   (private write)
            ———— barrier ————
  Thread 0: total = partial[0] + partial[1] + partial[2]  (single-thread reduce)

In CK-Engine's kernels, reductions are intra-token (within a single token's D-dimensional vector), not inter-token. Since we parallelize across tokens (not within a single token's dimension), each thread's reduction is completely private — no cross-thread communication needed.

The Complete Picture

One kernel dispatch, from hardware up:

1. POOL CREATED (once at startup)
   • N-1 pthreads created, each pinned to a physical core
   • Workers enter spin-wait loop

2. DISPATCH (per kernel call, ~500x per token)
   • Main sets work_fn = gemv_parallel, work_args = {W, x, y, M, K}
   • Main bumps n_dispatch (atomic, ~0.1µs)
   • Workers see counter change, read work_fn + args

3. PARTITION (each thread independently)
   • Thread i computes: start_row = i * (M/N), end_row = (i+1) * (M/N)
   • Write region: y[start_row .. end_row-1]   ← no overlap, cache-line aligned
   • Read region:  W[start_row*K .. end_row*K-1] + x[0..K-1]   ← shared, read-only

4. COMPUTE (all threads in parallel)
   • Each thread runs SIMD-vectorized inner loop over its rows
   • L1/L2 caches stay warm (thread pinned to same core)
   • No locks, no atomics, no barriers needed — pure data parallelism

5. COMPLETE
   • Each worker atomically increments n_complete
   • Main spin-waits until n_complete == N-1
   • Cost: ~0.1µs to synchronize
   • Workers go back to spin-wait for next dispatch

§15 Per-Kernel Parallelization Analysis

Every kernel in the transformer decode loop has a natural parallelization dimension. Here's exactly how each one splits across threads, what data is shared, what is exclusive, and where reductions happen.

GEMV — Matrix-Vector Multiply

File: src/kernels/gemv_omp.c
Operation: y[M] = W[M×K] · x[K]
Call frequency: ~37% of decode time (logits projection is the largest single GEMV)

GEMV parallelization: split rows of W across threads

  Weight matrix W [M rows × K cols]     Input x [K]        Output y [M]
+---------------------------+
| Thread 0: rows 0..249     |  ·  [shared by all]  →  y[0..249]
+---------------------------+
| Thread 1: rows 250..499   |  ·  [shared by all]  →  y[250..499]
+---------------------------+
| Thread 2: rows 500..749   |  ·  [shared by all]  →  y[500..749]
+---------------------------+
| Thread 3: rows 750..999   |  ·  [shared by all]  →  y[750..999]
+---------------------------+

READ (shared):  x[K] — all threads read the entire input vector
READ (private): W rows — each thread reads only its rows (no overlap)
WRITE:          y elements — each thread writes its own slice (no overlap)
REDUCTION:      NONE — each y[i] = dot(W[i], x) is independent
void gemv_q8_parallel(int ith, int nth, void *args) {
    gemv_args_t *a = args;
    int rows_per_thread = a->M / nth;
    int start = ith * rows_per_thread;
    int end   = (ith == nth - 1) ? a->M : start + rows_per_thread;

    for (int row = start; row < end; row++) {
        // Each thread: dot product of one W row with shared x
        vec_dot_q8_0_q8_0(a->K, &a->y[row],
                          &a->W_blocks[row * blocks_per_row],
                          a->x_blocks);
    }
}
Property           | Value
Parallel dimension | Rows (M) — embarrassingly parallel
Shared reads       | Input vector x[K], quantization lookup tables
Exclusive writes   | Output y[start..end] — contiguous, cache-aligned
Reduction          | None
Cache behavior     | x stays in L1 across all rows (K ≈ 896 ≈ 3.5 KB). W streams through L2.
Speedup potential  | Near-linear: 4 threads ≈ 3.5–3.8x (limited by memory bandwidth)

GEMM — Matrix-Matrix Multiply

File: src/kernels/gemm_kernels.c
Operation: C[M×N] = A[M×K] · B[K×N] + bias[N]
Used for: Prefill (processing entire prompt at once — M > 1)

GEMM parallelization: split output rows across threads

     A [M×K]           B [K×N]           C [M×N]
  +----------+      +----------+      +----------+
  | T0 rows  |      |          |      | T0 rows  |
  +----------+      |  shared  |      +----------+
  | T1 rows  |  ×   |    by    |  =   | T1 rows  |
  +----------+      |   all    |      +----------+
  | T2 rows  |      | threads  |      | T2 rows  |
  +----------+      |          |      +----------+
  | T3 rows  |      +----------+      | T3 rows  |
  +----------+                        +----------+

  Each thread computes complete rows of C independently.
  B matrix is shared read-only across all threads.
Property             Value
Parallel dimension   Output rows (M) — embarrassingly parallel
Shared reads         B[K×N] (entire matrix), bias[N]
Exclusive writes     C[start_row..end_row, 0..N-1] — complete rows per thread
Reduction            None — each C[i,j] = dot(A[i], B[:,j]) is independent
SIMD                 AVX-512 processes 16 floats per instruction; AVX2 processes 8
Special case         When M=1 (decode mode), falls back to GEMV — parallelizes over N instead
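To make the partitioning concrete, here is a minimal scalar sketch of one thread's share of the GEMM (illustrative only, not the production SIMD kernel): the thread computes its complete rows of C, reading all of B and the optional bias.

```c
#include <stddef.h>

/* Scalar float GEMM over one thread's row slice: C[i,j] = dot(A[i], B[:,j]) + bias[j].
 * B and bias are shared read-only; each thread writes only rows [row_start, row_end). */
static void gemm_rows_f32(const float *A, const float *B, const float *bias,
                          float *C, int row_start, int row_end, int K, int N) {
    for (int i = row_start; i < row_end; i++) {
        for (int j = 0; j < N; j++) {
            float acc = bias ? bias[j] : 0.0f;   /* NULL bias = no bias add */
            for (int k = 0; k < K; k++)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
    }
}
```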

RMSNorm — Root Mean Square Normalization

File: src/kernels/rmsnorm_kernels.c
Operation: y[t,d] = x[t,d] · γ[d] / sqrt(mean(x[t]²) + ε)
Called: Before every attention and FFN block (2x per layer ≈ 48 calls for 24 layers)

RMSNorm parallelization: split tokens across threads

  Input [T tokens × D dims]             Output [T tokens × D dims]
  +---------------------------+         +---------------------------+
  | Token 0: [d0, d1, .., dD] |    →    | Token 0: normalized       |  Thread 0
  | Token 1: [d0, d1, .., dD] |    →    | Token 1: normalized       |
  +---------------------------+         +---------------------------+
  | Token 2: [d0, d1, .., dD] |    →    | Token 2: normalized       |  Thread 1
  | Token 3: [d0, d1, .., dD] |    →    | Token 3: normalized       |
  +---------------------------+         +---------------------------+
  ...

  Shared read: γ[D] (normalization weights, same for all tokens)
  Each token has an INTERNAL reduction (sum of squares across D), but since
  we parallelize across TOKENS, each thread's reduction is private.

The reduction (sum of x² across the D dimension) happens within a single token. Since we split across tokens, not across dimensions within a token, each thread does its own reduction with zero cross-thread communication.

void rmsnorm_parallel(int ith, int nth, void *args) {
    rmsnorm_args_t *a = args;
    int tokens_per_thread = a->T / nth;
    int t_start = ith * tokens_per_thread;
    int t_end   = (ith == nth - 1) ? a->T : t_start + tokens_per_thread;

    for (int t = t_start; t < t_end; t++) {
        float *x = a->input + t * a->aligned_D;
        float *y = a->output + t * a->aligned_D;

        // Phase 1: sum of squares (PRIVATE reduction — within this token only)
        __m512 sum_sq = _mm512_setzero_ps();
        for (int d = 0; d < a->D; d += 16)
            sum_sq = _mm512_fmadd_ps(_mm512_loadu_ps(&x[d]),
                                      _mm512_loadu_ps(&x[d]), sum_sq);
        float rms = sqrtf(_mm512_reduce_add_ps(sum_sq) / a->D + 1e-6f);

        // Phase 2: normalize (element-wise, reads shared gamma)
        __m512 rstd = _mm512_set1_ps(1.0f / rms);
        for (int d = 0; d < a->D; d += 16)
            _mm512_storeu_ps(&y[d],
                _mm512_mul_ps(_mm512_mul_ps(_mm512_loadu_ps(&x[d]), rstd),
                              _mm512_loadu_ps(&a->gamma[d])));
    }
}
Property             Value
Parallel dimension   Tokens (T) — embarrassingly parallel
Shared reads         γ[D] normalization weights
Exclusive writes     output[t_start..t_end, 0..D-1] — complete token vectors
Reduction            Intra-token only (private to each thread, no cross-thread sync)
Decode mode          T=1 in decode → no token-level parallelism. Could split D dimension with barrier.

Attention — Q·Kᵀ/√d → Softmax → Attn·V

File: src/kernels/attention_kernels.c
Operation: Multi-head scaled dot-product attention with causal mask
Three phases: Score computation, softmax, weighted value summation

Attention parallelization: split heads across threads

  Q [H heads × T × d]       K [H × T × d]         V [H × T × d]
  +-------------------+     +-------------+       +-------------+
  | Head 0 (Thread 0) |     | Head 0      |       | Head 0      |
  | Head 1 (Thread 0) |     | Head 1      |       | Head 1      |
  +-------------------+     +-------------+       +-------------+
  | Head 2 (Thread 1) |     | Head 2      |       | Head 2      |
  | Head 3 (Thread 1) |     | Head 3      |       | Head 3      |
  +-------------------+     +-------------+       +-------------+
  | Head 4 (Thread 2) |     | Head 4      |       | Head 4      |
  | Head 5 (Thread 2) |     | Head 5      |       | Head 5      |
  +-------------------+     +-------------+       +-------------+
  | Head 6 (Thread 3) |     | Head 6      |       | Head 6      |
  | Head 7 (Thread 3) |     | Head 7      |       | Head 7      |
  +-------------------+     +-------------+       +-------------+

  Each head is completely independent. All three phases (score, softmax,
  value) for a given head can be computed by one thread with zero
  cross-thread communication.

  Scores [H × T × T] — intermediate, private per head
  Output [H × T × d] — exclusive write per thread

Attention is the most complex kernel but also the most naturally parallel. Each attention head is a self-contained computation:

  1. Phase 1 (Score): For each query position i, compute score[h,i,j] = Q[h,i] · K[h,j] / √d for all j ≤ i (causal mask). Triangle of dot products per head.
  2. Phase 2 (Softmax): Per row: find max, subtract, exp, sum, normalize. The reduction (max, sum) is within a single row — private to the thread owning that head.
  3. Phase 3 (Value): output[h,i] = ∑j score[h,i,j] · V[h,j]. Accumulates over j — but within a single head, so private to the owning thread.
Property             Value
Parallel dimension   Heads (H) — completely independent
Shared reads         None — each head has its own Q, K, V slices in head-major layout
Exclusive writes     scores[h, *, *] and output[h, *, *] — entire head is private
Reduction            Softmax max/sum is intra-row, within a single head (private)
Compute cost         O(T² · d) per head — quadratic in sequence length
Ideal threads        num_heads (typically 8–32, more than enough to saturate cores)

Softmax — With Causal Masking

File: src/kernels/softmax_kernels.c
Operation: softmax(row) = exp(row - max) / sum(exp(row - max)), applied to each row of the attention score matrix
Called: Once per attention layer, integrated into the attention computation

Softmax parallelization: split heads (or rows) across threads

  Score matrix per head [T × T], lower-triangular (causal):

  Row 0:   [s00]                         ← len=1
  Row 1:   [s10, s11]                    ← len=2
  Row 2:   [s20, s21, s22]               ← len=3
  Row 3:   [s30, s31, s32, s33]          ← len=4
  ...
  Row T-1: [s(T-1)0, ..., s(T-1)(T-1)]   ← len=T

  Each row: find max → subtract max → exp → sum → divide by sum
  Rows are independent! Each row's max/sum is a private reduction.
Property              Value
Parallel dimension    Heads (H) or rows (T) within a head — both embarrassingly parallel
Shared reads          None (in-place operation)
Exclusive writes      score[h, i, *] — each row written independently
Reduction             Row-internal max and sum — private per row, no cross-thread sync
Causal optimization   Row i only processes i+1 elements (lower triangle) + zeroes rest

SwiGLU — Gated Linear Unit Activation

File: src/kernels/swiglu_kernels.c
Operation: output[t,d] = SiLU(gate[t,d]) × value[t,d] where input is split into gate and value halves
Called: Once per FFN block (1x per layer)

SwiGLU parallelization: split tokens across threads

  Input [T × 2D]                                Output [T × D]
  +-------------------+-------------------+     +-------------------+
  | Token 0: gate[D]  | Token 0: value[D] |  →  | Token 0: SiLU×val |  Thread 0
  | Token 1: gate[D]  | Token 1: value[D] |  →  | Token 1: SiLU×val |
  +-------------------+-------------------+     +-------------------+
  | Token 2: gate[D]  | Token 2: value[D] |  →  | Token 2: SiLU×val |  Thread 1
  | Token 3: gate[D]  | Token 3: value[D] |  →  | Token 3: SiLU×val |
  +-------------------+-------------------+     +-------------------+

  Pure element-wise: sigmoid(gate) * gate * value
  No reduction. No shared writes. Perfect parallelism.
Property              Value
Parallel dimension    Tokens (T) — embarrassingly parallel, zero dependencies
Shared reads          None
Exclusive writes      output[t_start..t_end, 0..D-1]
Reduction             None — pure element-wise operation
SIMD efficiency       Excellent — straight FMA + sigmoid approximation over contiguous memory

RoPE — Rotary Position Embedding

File: src/kernels/rope_kernels.c
Operation: Rotate Q and K vectors using precomputed sin/cos tables based on token position
Called: Once per attention block (1x per layer, applied to Q and K separately)

RoPE parallelization: split heads × tokens across threads

  Q or K [H heads × T tokens × head_dim]
  +-----------------------------------------------+
  | Head 0, Token 0: rotate(q, cos[0], sin[0])    |  Thread 0
  | Head 0, Token 1: rotate(q, cos[1], sin[1])    |
  | Head 1, Token 0: rotate(q, cos[0], sin[0])    |
  +-----------------------------------------------+
  | Head 1, Token 1: rotate(q, cos[1], sin[1])    |  Thread 1
  | Head 2, Token 0: rotate(q, cos[0], sin[0])    |
  | Head 2, Token 1: rotate(q, cos[1], sin[1])    |
  +-----------------------------------------------+
  ...

  Rotation: x'[i]      = x[i]*cos[pos,i] - x[i+half]*sin[pos,i]
            x'[i+half] = x[i]*sin[pos,i] + x[i+half]*cos[pos,i]

  Shared reads: cos_cache[pos, half_dim], sin_cache[pos, half_dim]
  In-place write: x[h, t, 0..head_dim-1] — each (h,t) is independent
Property             Value
Parallel dimension   Heads × Tokens (H·T) — all (h,t) pairs independent
Shared reads         cos_cache[T × half_dim], sin_cache[T × half_dim]
Exclusive writes     x[h, t, *] — in-place rotation, one head-vector per (h,t)
Reduction            None — pure element-wise rotation
Best split           Flatten H×T loop, divide evenly across threads
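The rotation rule from the diagram, written out for one (head, token) vector. This is a sketch (the `rope_rotate_ref` name is illustrative): element i pairs with element i+half, and each pair is rotated by the position's precomputed angle.

```c
/* In-place RoPE rotation of one head-vector x[head_dim], given the
 * cos/sin values for this token position (length head_dim/2 each):
 *   x'[i]      = x[i]*cos[i] - x[i+half]*sin[i]
 *   x'[i+half] = x[i]*sin[i] + x[i+half]*cos[i]  */
static void rope_rotate_ref(float *x, const float *cosv, const float *sinv,
                            int head_dim) {
    int half = head_dim / 2;
    for (int i = 0; i < half; i++) {
        float a = x[i], b = x[i + half];
        x[i]        = a * cosv[i] - b * sinv[i];
        x[i + half] = a * sinv[i] + b * cosv[i];
    }
}
```

With cos = 1, sin = 0 this is the identity; with cos = 0, sin = 1 it is a quarter-turn of each (i, i+half) pair.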

Embedding Lookup

File: src/kernels/embedding_kernels.c
Operation: output[t] = token_embeddings[token_id[t]] + pos_embeddings[t]
Called: Once per forward pass (first operation)

Property             Value
Parallel dimension   Tokens (T) — each token lookup is independent
Shared reads         token_embeddings[vocab_size × D] (table), pos_embeddings[T × D]
Exclusive writes     output[t, 0..D-1] — one embedding vector per token
Reduction            None
Special pattern      Irregular memory access (gather) — token_ids determine which rows to fetch
Decode mode          T=1 → no parallelism possible. Prefill (T=512+) benefits greatly.

KV Cache Write

File: src/kernels/kv_cache_kernels.c
Operation: Copy new K and V vectors for current token into the cache at position token_index
Called: Once per layer (append new token's K,V to cache)

Property               Value
Parallel dimension     KV heads (num_kv_heads) — each head is independent
Shared reads           k_token[num_kv_heads × head_dim], v_token[num_kv_heads × head_dim]
Exclusive writes       k_cache[h, token_index, *], v_cache[h, token_index, *]
Reduction              None — pure memcpy per head
Worth parallelizing?   Small — only head_dim floats per head. May not justify dispatch overhead.

Summary: Parallelization Map

Kernel      Split Dimension          Shared Reads      Reduction           Priority
GEMV        Rows (M)                 x[K]              None                HIGH — dominates decode time
GEMM        Rows (M)                 B[K×N]            None                HIGH — dominates prefill time
Attention   Heads (H)                None              Private per head    HIGH — O(T²) cost
RMSNorm     Tokens (T)               γ[D]              Private per token   MEDIUM — called 48x/fwd
SwiGLU      Tokens (T)               None              None                MEDIUM — element-wise
Softmax     Heads (H) or Rows (T)    None              Private per row     MEDIUM
RoPE        H × T                    cos/sin tables    None                LOW — cheap per call
Embedding   Tokens (T)               Embedding table   None                LOW — prefill only
KV Cache    KV Heads                 Token vectors     None                LOW — small memcpy
Every single kernel in the transformer has embarrassingly parallel structure. The only reductions (RMSNorm sum-of-squares, Softmax max/sum) are intra-token or intra-row — private to whichever thread owns that token or head. No kernel requires cross-thread communication for its core computation. This is why a persistent thread pool with simple (ith, nth) partitioning is the perfect parallelization strategy for LLM inference.
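Because every kernel takes the same (ith, nth, args) signature, the whole map above reduces to one dispatch pattern. A sketch of that contract (the `run_serial` harness and `fill_task` demo are illustrative helpers, not codebase functions): any task written against this signature can be run serially by looping over thread ids, which is exactly how the §16 parity tests compare pooled output against a serial baseline.

```c
/* The kernel contract: a task receives its thread id and the thread count,
 * derives its own slice, and touches only that slice. */
typedef void (*ck_task_fn)(int ith, int nth, void *args);

/* Serial driver: running the task once per thread id must produce the same
 * result as a real pool running them concurrently, since slices never overlap. */
static void run_serial(ck_task_fn fn, int nth, void *args) {
    for (int ith = 0; ith < nth; ith++)
        fn(ith, nth, args);
}

/* Hypothetical demo task: each "thread" stamps its id into its slice,
 * using the same last-thread-takes-the-remainder rule as the kernels. */
typedef struct { int *buf; int n; } fill_args_t;

static void fill_task(int ith, int nth, void *argp) {
    fill_args_t *a = argp;
    int per = a->n / nth;
    int start = ith * per;
    int end = (ith == nth - 1) ? a->n : start + per;
    for (int i = start; i < end; i++)
        a->buf[i] = ith;
}
```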

§16 Testing & Validation

The thread pool dispatch layer is validated by a dedicated parity test suite that compares every dispatch wrapper against its serial counterpart. The tests verify both numerical correctness (parity with the serial output within a small tolerance) and performance (speedup from parallelization).

Test Suite: test_threadpool_parity.c

Location: tests/test_threadpool_parity.c

Test     Serial Kernel                     Thread Pool Dispatch                       What It Validates
Test 1   gemv_q8_0_q8_0()                  gemv_q8_0_q8_0_parallel_dispatch()         Q8_0 weights × Q8_0 input (logits, V proj)
Test 2   gemv_q5_0_q8_0()                  gemv_q5_0_q8_0_parallel_dispatch()         Q5_0 weights × Q8_0 input (Q/K proj, MLP, out proj)
Test 3   gemv_fused_q5_0_bias_dispatch()   gemv_fused_q5_0_bias_parallel_dispatch()   Fused: quantize FP32 → GEMV → add bias (production path)
Test 4   (n/a)                             Fused dispatch with NULL bias              Edge case: bias pointer is NULL (no bias add)
Test 5   (n/a)                             Small-M dispatch latency measurement       Overhead of packing args + waking threads on tiny workloads

Test Dimensions (Qwen2 0.5B shapes)

Tests run on real model dimensions to catch alignment and edge-case issues:

Config        M (output rows)   K (input dim)   Corresponds To
qkv_proj      896               896             Q/K/V projections (embed → embed)
mlp_gate_up   9,728             896             MLP gate+up projection (embed → 2× intermediate, gate and up fused)
mlp_down      896               4,864           MLP down projection (intermediate → embed)
logits        151,936           896             LM head (embed → vocab, largest GEMV)

How to Run

# Quick parity check (~5 seconds)
make test-threadpool-parity-quick

# Full parity + speed test (all model dimensions)
make test-threadpool-parity

# Verbose output (shows max_abs_diff, max_rel_diff for every config)
make test-threadpool-parity-verbose

Correctness Criteria

Tolerance: 1e-3 absolute. Thread pool dispatch splits rows across threads. Each thread calls the same _parallel_simd kernel on its row subset. Since each output row is computed from independent dot products, the thread pool result should be bit-identical to serial — the tolerance exists only to guard against floating-point non-determinism from different SIMD accumulation orders.
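The comparison itself is simple. A sketch of the check described above (the `max_abs_diff` helper is illustrative; the actual test also reports a relative diff): reduce the element-wise difference between serial and pooled outputs and compare against the 1e-3 absolute tolerance.

```c
#include <math.h>

/* Largest absolute element-wise difference between two output vectors.
 * For a parity test: assert max_abs_diff(serial_y, pool_y, M) <= 1e-3f. */
static float max_abs_diff(const float *a, const float *b, int n) {
    float m = 0.0f;
    for (int i = 0; i < n; i++) {
        float d = fabsf(a[i] - b[i]);
        if (d > m) m = d;
    }
    return m;
}
```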

CI Integration

The thread pool parity test runs automatically in three places:

Trigger                        What Runs                        Where Configured
make test                      test-threadpool-parity-quick     Makefile (test target)
make llamacpp-parity-full      test-threadpool-parity (full)    Makefile (llamacpp-parity-full target)
Nightly / Push to main / PR    test-threadpool-parity (full)    scripts/nightly_runner.py (MAKE_TARGETS["threadpool_parity"])

The nightly workflow (.github/workflows/nightly.yml) runs on every push to main, every pull request, and nightly at 2am UTC. If the thread pool dispatch diverges from serial output, the test exits with code 1 and the nightly report marks it as failed.

ADR: Thread Pool Replaced OpenMP for Parallelization

The codebase previously had an OpenMP parallelization pass (parallel_pass.py) that annotated IR operations with #pragma omp parallel for. This was superseded by the persistent pthread thread pool for the following reasons:

Concern                 OpenMP                                                                  Thread Pool
Thread creation         Fork/join per #pragma region (~15–50µs)                                 Created once at init, spin-wait between dispatches (~0.1µs wake)
Core oversubscription   OpenMP team + thread pool = 2N threads on N cores                       One pool, known thread count, no conflict
Codegen integration     parallel_pass.py emits pragmas, but codegen_v6_6.py never reads them    Macro redirects in ck_parallel_decode.h — zero codegen changes
Kernel interface        Implicit OpenMP runtime                                                 Explicit (ith, nth) args to _parallel_simd kernels

Files affected by this decision:

Related Test Suites

The thread pool parity test complements existing tests that cover other aspects of the parallelization stack:

Make Target              Test File                                       What It Tests
test-threadpool-parity   tests/test_threadpool_parity.c                  Serial vs thread pool dispatch (this page)
test-gemv-omp            tests/test_gemv_omp_parity.c                    Serial vs OpenMP parallel (legacy comparison)
test-fusion-gemv         unittest/fusion/test_gemv_fused_quant_bias.py   Fused vs unfused GEMV correctness + speedup
llamacpp-parity-full     scripts/run_parity_smoketest.sh                 CK-Engine kernels vs llama.cpp reference (includes thread pool + OMP)
e2e-v66                  version/v6.6/test/Makefile                      Full v6.6 pipeline: IR → codegen → compile → run (thread pool always enabled)