CK-Engine Thread Pool
A plain-English deep dive into how persistent pthread thread pools work, why they exist, and how they compare to OpenMP and memory pools.
Contents
- What Is a "Pool"? (The Core Idea)
- The Problem: Why Not Just Use OpenMP?
- Anatomy of CK-Engine's Thread Pool
- Lifecycle: Create → Dispatch → Destroy
- Dispatch Deep Dive: What Actually Happens
- Spin-Wait vs. Sleep: The Hybrid Polling Design
- Atomics, Mutexes, and Memory Ordering
- Cache Lines & False Sharing
- The Barrier: How Threads Synchronize
- Thread Pool vs. OpenMP — Side-by-Side
- Thread Pool vs. Memory Pool — Same Idea, Different Resource
- Why Is It Faster?
- Real Numbers: Before & After
- How Parallelization Actually Works
- Per-Kernel Parallelization Analysis
- Testing & Validation
§1 What Is a "Pool"?
A pool is a set of expensive resources you create once and reuse over and over, instead of creating and destroying them every time you need one.
No pool: Every time an order comes in, go out on the street, hire 3 chefs, cook the meal, then fire them all. Next order — hire 3 more chefs. (This is what OpenMP does.)
Pool: Hire 3 chefs on day one. They stand ready at their stations. Order comes in — you shout "Go!" and all 4 of you (including yourself) cook simultaneously. Between orders, chefs wait at their stations. (This is a thread pool.)
The word "pool" shows up in many places in systems programming:
Thread Pool
Pre-created threads waiting for work. Avoids spawn/join overhead per task.
Memory Pool
Pre-allocated memory blocks. Avoids malloc/free overhead per allocation.
Connection Pool
Pre-opened database/network connections. Avoids TCP handshake per query.
Object Pool
Pre-constructed objects (e.g. game entities). Avoids constructor/destructor overhead.
The common pattern: creating the resource is expensive; using it is cheap. A pool pays the creation cost once, then amortizes it over thousands of uses.
§2 The Problem: Why Not Just Use OpenMP?
OpenMP looks appealing — just add #pragma omp parallel for and you get parallelism. But there's a hidden cost.
During inference, CK-Engine runs the transformer decode loop. For each token generated, it calls kernel functions (matrix-vector multiplies, RMSNorm, SwiGLU, etc.) roughly 500+ times. Each call is a short burst of work (microseconds to low milliseconds).
OpenMP's Fork/Join Model
With each fork/join costing 50–200µs and 500+ kernel calls per token, that's 25–100 ms of pure threading overhead per token — before any useful work is done:
Real Measurement from CK-Engine (gemv_omp.c)
/*
* Measured on i7-3630QM (4C/8T), Qwen 0.5B:
* Serial kernels: 170 ms/tok (5.9 tok/s)
* OMP parallel: 327 ms/tok (3.1 tok/s) ← 1.9x SLOWER
*
* The math is correct (10/10 parity tests pass) but the
* threading model is wrong for this workload.
*/
OpenMP parallelism made it almost 2x slower. The fork/join overhead completely ate the parallelism gains and then some.
§3 Anatomy of CK-Engine's Thread Pool
Let's look at the actual data structure. Every field exists for a specific reason.
struct ck_threadpool {
/* --- Dispatch state (each on its own cache line) --- */
_Alignas(64) atomic_int n_dispatch; // bumped to wake workers
_Alignas(64) atomic_int n_complete; // workers signal "I'm done"
_Alignas(64) ck_work_fn_t work_fn; // pointer to current kernel
void *work_args; // arguments for the kernel
/* --- Barrier for mid-kernel sync --- */
ck_barrier_t barrier;
/* --- Worker management --- */
int n_threads; // total (including main)
ck_worker_t workers[64]; // thread handles
/* --- Shutdown / pause --- */
_Alignas(64) atomic_int stop; // set to 1 to kill workers
_Alignas(64) atomic_int paused; // set to 1 for power saving
/* --- Hybrid sleep mechanism --- */
pthread_mutex_t mutex;
pthread_cond_t cond_dispatch; // workers sleep here
pthread_cond_t cond_done; // main sleeps here
};
Why _Alignas(64) Everywhere?
Each atomic counter is placed on its own 64-byte cache line. This prevents "false sharing" — where two unrelated variables on the same cache line cause all CPU cores to fight over that cache line. More on this in §8.
The Worker Struct
typedef struct {
pthread_t thread; // OS thread handle
int id; // 0 = main, 1..N-1 = workers
struct ck_threadpool *pool; // back-pointer to the pool
} ck_worker_t;
Thread 0 is always the main thread — the one that calls ck_threadpool_dispatch().
It doesn't get a new pthread. It does its share of the work alongside the workers.
§4 Lifecycle: Create → Dispatch → Destroy
1. Create — ck_threadpool_create(n_threads)
Called once at engine startup. Allocates the pool struct (cache-line aligned), then spawns N−1 worker pthreads. Workers immediately enter the spin-wait loop. The main thread is thread 0 — no extra pthread is created for it.
2. Dispatch (called ~500x per token) — ck_threadpool_dispatch(pool, fn, args)
Sets the work function + args, bumps n_dispatch (an atomic counter), and
signals sleeping workers via condvar. Main thread also runs its share. All threads do their chunk,
then main waits for n_complete to hit N−1.
3. Pause/Resume (between batches)
Between user interactions (waiting for input), workers can be paused so they sleep on a condvar and use 0% CPU. Resume wakes them instantly.
4. Destroy — ck_threadpool_destroy(pool)
Sets stop = 1, broadcasts the condvar to wake sleeping workers,
then pthread_joins every worker thread. Frees the pool memory.
Global Pool Convenience
For CK-Engine, there's a global singleton pool accessed via
ck_threadpool_global(). It uses pthread_once to ensure
thread-safe lazy initialization — no matter how many threads call it simultaneously,
the pool is created exactly once.
static ck_threadpool_t *g_threadpool = NULL;
static pthread_once_t g_threadpool_once = PTHREAD_ONCE_INIT;
ck_threadpool_t *ck_threadpool_global(void) {
pthread_once(&g_threadpool_once, global_pool_init);
return g_threadpool;
}
§5 Dispatch Deep Dive: What Actually Happens
Let's trace exactly what happens when CK-Engine needs to run one kernel across 4 threads.
Key Details
- No thread creation. Workers are already alive, spinning or sleeping. The only cost is bumping an atomic counter.
- Main thread participates. Thread 0 does 1/N of the work alongside the workers. No wasted CPU.
- Single-thread fast path. If n_threads == 1, dispatch just calls fn(0, 1, args) directly — zero overhead.
- Completion tracking. Each worker atomically increments n_complete. The last worker (the one whose increment brings the counter to N−1) signals the condvar to wake main if it fell asleep waiting.
Inside the Work Function
Each kernel receives (ith, nth, args) and partitions work by thread index:
void gemv_parallel(int ith, int nth, void *args) {
gemv_args_t *a = args;
// Each thread computes its slice of output rows
int rows_per_thread = a->M / nth;
int start = ith * rows_per_thread;
int end = (ith == nth - 1) ? a->M : start + rows_per_thread;
for (int row = start; row < end; row++) {
vec_dot(a->K, &a->y[row], a->W + row * a->K, a->x);
}
}
§6 Spin-Wait vs. Sleep: The Hybrid Polling Design
How do worker threads wait for work? There are two extremes, and CK-Engine uses both.
Pure Spin-Wait
Thread sits in a tight loop checking an atomic variable: while (no_work) { _mm_pause(); }
PRO: Instant wake-up (nanoseconds)
CON: Burns 100% CPU while waiting
Pure Condvar Sleep
Thread calls pthread_cond_wait() and the OS puts it to sleep.
PRO: Zero CPU while waiting
CON: Wake-up takes 5–50µs (kernel context switch)
CK-Engine's Hybrid Approach
Workers do both:
- First, spin for CK_THREADPOOL_SPIN_COUNT (1024) iterations, checking the atomic dispatch counter.
- If no work arrives during the spin, fall back to pthread_cond_wait() (sleep).
- When new work arrives, spinning workers wake instantly; sleeping workers wake via condvar broadcast.
// worker_main() — the hybrid wait loop
for (;;) {
int spins = 0;
for (;;) {
if (atomic_load(&pool->stop)) return NULL; // shutdown?
int current = atomic_load(&pool->n_dispatch);
if (current != last_dispatch) { // new work!
last_dispatch = current;
break; // go execute
}
_mm_pause(); // x86 hint: "I'm spinning, save power"
spins++;
if (spins >= 1024) { // tired of spinning?
pthread_mutex_lock(&pool->mutex);
// Re-check under lock (avoid missed wakeup!)
if (no_new_work && !stopping)
pthread_cond_wait(&pool->cond_dispatch, &pool->mutex);
pthread_mutex_unlock(&pool->mutex);
spins = 0;
}
}
// execute work...
}
Why re-check under the lock? Without it there is a lost-wakeup race: the main thread could bump n_dispatch and broadcast between the worker's last spin check and its cond_wait call. The worker would then sleep forever, never seeing the work. The lock + re-check makes this impossible.
What Is _mm_pause()?
It's an x86 intrinsic that tells the CPU: "I'm in a spin-wait loop." The CPU can then:
- Reduce power consumption of the spinning core
- Free up execution resources for the sibling hyper-thread
- Avoid memory-order violations that would require pipeline flushes
On non-x86 (ARM, etc.), CK-Engine defines CK_SPIN_PAUSE() as a no-op.
§7 Atomics, Mutexes, and Memory Ordering
These are the fundamental building blocks of any concurrent system. Let's understand each one from scratch.
What Is an Atomic Operation?
An atomic operation is one that completes in a single, indivisible step from the perspective of all other threads. No other thread can see a "half-done" state.
CK-Engine uses atomics for:
- n_dispatch — the "new work available" counter. Main increments it; workers read it.
- n_complete — workers increment when done. Main reads it to know when all are finished.
- n_arrived (barrier) — threads increment when reaching the barrier point.
- stop, paused — simple flags, but still need atomic reads/writes for visibility across cores.
Memory Ordering
CPUs and compilers can reorder memory operations for performance. Memory ordering tells them: "don't reorder this particular operation past certain points."
| Ordering | What It Means | Used In CK-Engine For |
|---|---|---|
| memory_order_relaxed | No ordering guarantee. Just make the read/write atomic. | Resetting n_arrived (only writer, no dependency) |
| memory_order_acquire | All reads/writes after this load cannot be moved before it. | Workers reading n_dispatch — must see work_fn/work_args written before |
| memory_order_release | All reads/writes before this store cannot be moved after it. | Main bumping n_dispatch — ensures work_fn/work_args are visible |
| memory_order_acq_rel | Both acquire and release. Full fence in both directions. | fetch_add on n_complete and n_arrived |
Why this matters: without acquire/release, main could bump n_dispatch but the worker might read the old work_fn pointer (from a previous dispatch) because the CPU reordered the stores. The release on the store and acquire on the load form a happens-before relationship that prevents this.
What Is a Mutex?
A mutex (mutual exclusion) is a lock. Only one thread can hold it at a time. Everyone else waits. CK-Engine uses mutexes only for the condvar sleep mechanism — the hot path (spin-wait) doesn't touch the mutex at all.
// Mutex usage — only in the cold sleep path
pthread_mutex_lock(&pool->mutex);
// Re-check condition (avoid lost wakeup)
if (still_no_work)
pthread_cond_wait(&pool->cond_dispatch, &pool->mutex);
// ^^ releases mutex, sleeps, re-acquires on wake
pthread_mutex_unlock(&pool->mutex);
What Is a Condition Variable (Condvar)?
A condvar lets a thread say: "Wake me up when something changes." It works together with a mutex:
- Wait side: Lock mutex → check condition → if false, cond_wait (atomically releases mutex + sleeps) → on wake, re-acquires mutex
- Signal side: Lock mutex → change state → cond_broadcast (wakes all waiters) → unlock
CK-Engine has two condvars:
- cond_dispatch — workers wait here when idle (no work to do)
- cond_done — main thread waits here when all workers are still busy
§8 Cache Lines & False Sharing
This is one of the most important performance concepts in multi-threaded code.
How CPU Caches Work (Simplified)
CPUs don't read individual bytes from RAM. They read in cache lines — 64-byte chunks. When one core writes to a cache line, all other cores that have a copy of that line must invalidate it (MESI protocol). This is called a cache coherency miss.
This is why every atomic_int in the pool struct has _Alignas(64) —
it forces each one onto its own cache line, eliminating false sharing entirely.
§9 The Barrier: How Threads Synchronize
A barrier is a meeting point. All threads must arrive before any can proceed.
CK-Engine implements a custom spin-wait barrier using two atomic counters:
n_arrived (how many threads reached the barrier) and n_phase
(which "round" of the barrier we're on).
// barrier_wait() — ck_threadpool.c:89–111
static void barrier_wait(ck_barrier_t *b) {
const int phase = atomic_load(&b->n_phase);
if (atomic_fetch_add(&b->n_arrived, 1) == b->n_threads - 1) {
// I'm the LAST thread to arrive
atomic_store(&b->n_arrived, 0); // reset for next use
atomic_store(&b->n_phase, phase + 1); // advance phase
} else {
// NOT the last thread — spin until phase advances
while (atomic_load(&b->n_phase) == phase) {
_mm_pause();
}
}
}
The phase counter trick allows the barrier to be reused without an explicit reset —
each barrier use just increments the phase. After many spins without the last thread arriving,
threads call sched_yield() to be nice to the OS on oversubscribed systems.
§10 Thread Pool vs. OpenMP — Side-by-Side
| Aspect | OpenMP | CK Thread Pool |
|---|---|---|
| Thread lifetime | Created per #pragma region (or pooled by runtime — inconsistent) | Created once at startup, live until shutdown |
| Dispatch cost | 50–200 µs (fork/join overhead) | < 1 µs (atomic increment + optional condvar signal) |
| Control over scheduling | Limited (schedule(static), dynamic) | Full — you write the partition logic |
| Idle CPU usage | Implementation-dependent (some spin, some sleep) | Configurable: spin → condvar sleep → pause |
| Barrier | Implicit at end of parallel region | Explicit ck_threadpool_barrier() |
| Works with short kernels? | No — overhead dominates | Yes — designed for thousands of tiny dispatches |
| Code complexity | One line: #pragma omp parallel for | Must write work function, manage pool lifecycle |
| Portability | Standard, compiler-supported everywhere | POSIX pthreads (Linux/macOS) |
OpenMP is great for long-running parallel regions (scientific computing, image processing, big matrix operations). One fork/join overhead amortized over milliseconds of work is fine.
Thread pools are essential when you dispatch thousands of short tasks per second (inference, game engines, audio processing, network servers).
§11 Thread Pool vs. Memory Pool — Same Idea, Different Resource
Both are "pools" — they pre-create expensive resources. The key differences are about what they pool and what synchronization they need.
| Aspect | Thread Pool | Memory Pool |
|---|---|---|
| What is pooled? | OS threads (pthreads) | Memory blocks (heap allocations) |
| What is expensive? | pthread_create / pthread_join (~50–200 µs) | malloc / free (~0.1–5 µs) |
| Pool initialized with... | N live threads, waiting for work | One big block of memory, subdivided into slots |
| "Acquire" from pool | dispatch(fn, args) — give threads work | pool_alloc(size) — get a memory block |
| "Release" back to pool | Implicit — threads go back to waiting | pool_free(ptr) — return block to free list |
| Synchronization needed | Atomics + condvar (dispatch/wait signaling) | Atomics or mutex (free-list management) |
| Active resources... | ...execute code concurrently | ...hold data passively |
The Deeper Conceptual Link
Synchronization Comparison
Thread Pool Sync
atomic_int for dispatch/complete counters (hot path).
pthread_mutex + cond for sleep fallback (cold path).
_mm_pause() spin hint for low-latency wake.
memory_order_acquire/release to publish work descriptor.
Memory Pool Sync
Atomic CAS for lock-free free-list (typical).
Or a simple mutex around the free-list pointer.
Bump allocator variant needs zero sync (per-thread pools).
No condvars needed — memory doesn't need to "wake up."
§12 Why Is It Faster?
Cost Breakdown: OpenMP vs. Thread Pool
The Five Reasons It's Faster
1. No Thread Creation
pthread_create involves a system call, stack allocation (~8MB virtual),
kernel scheduling setup, and TLS initialization. The thread pool does this once.
Subsequent dispatches just flip an atomic counter — a single CPU instruction.
2. No Join Overhead
pthread_join involves a system call + scheduler interaction to reap the thread.
The pool never joins during work — completion is tracked with a simple atomic counter.
3. Warm Caches
Persistent threads keep their L1/L2 caches warm across dispatches. A fresh thread starts with cold caches and must re-fetch everything from L3/RAM.
4. No Kernel Involvement
The spin-wait hot path is entirely in userspace. No system calls, no context switches, no kernel scheduler decisions. Just atomic loads in a tight loop.
5. Main Thread Participates
The main thread doesn't just wait — it does 1/N of the work. With OpenMP, the orchestration thread often sits idle during the parallel region. CK-Engine wastes zero cores.
Bonus: Power Management
Between inference batches (e.g., waiting for user input), workers sleep on a condvar using 0% CPU. No busy-waiting during idle periods. OpenMP's behavior here varies by implementation and is often wasteful.
§13 Real Numbers: Before & After
Measured on i7-3630QM (4 cores / 8 threads), Qwen2 0.5B model, Q4_K_M quantization.
| Configuration | ms/token | Tokens/sec | Relative |
|---|---|---|---|
| Serial (single-threaded) | 170 | 5.9 | baseline |
| OpenMP #pragma omp parallel for | 327 | 3.1 | 1.9x slower |
| Persistent pthread thread pool | — | — | Expected: 2–4x faster than serial |
Where Does Overhead Go?
§14 How Parallelization Actually Works
Forget the abstractions for a moment. Here is exactly what happens, from physical hardware up.
Step 1: Pin Threads to Physical Cores
At startup, the thread pool creates N−1 persistent pthreads. Ideally, each thread is
pinned (affined) to a distinct physical CPU core using sched_setaffinity()
or pthread_setaffinity_np(). This means the OS scheduler will not migrate a thread to a
different core — it stays put.
Why pin? Three reasons:
- Cache locality: A pinned thread keeps its L1/L2 caches warm. If the OS migrates a thread to a different core, the old core's cache is useless and the new core starts cold.
- No scheduler jitter: The OS scheduler makes decisions every ~1–4ms. Thread migration adds 5–20µs of latency. For sub-millisecond kernels, that's significant.
- NUMA awareness: On multi-socket systems, accessing memory attached to a remote socket costs 2–3x more than local memory. Pinning keeps threads close to their data.
CK-Engine's Current State
The system topology module (src/system_topology.c) can discover the CPU layout:
// topology_discover_affinity() reads the process CPU mask
cpu_set_t mask;
if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
for (int i = 0; i < MAX_CPUS; i++) {
if (CPU_ISSET(i, &mask)) {
aff->affinity_cpus[aff->num_affinity_cpus++] = i;
}
}
}
The GGML layer has full NUMA-aware pinning with three strategies:
DISTRIBUTE (round-robin across NUMA nodes),
ISOLATE (all threads on same node), and
NUMACTL (respect external numactl settings).
Intel hybrid CPU detection (P-core vs E-core via CPUID) is also available to skip
efficiency cores for compute-heavy work.
Step 2: Identify Embarrassingly Parallel Loops
An embarrassingly parallel loop is one where every iteration is completely independent — no iteration reads what another writes. These loops can be split across cores with zero synchronization.
If iteration i only writes to
output[i] and only reads from
shared inputs, then yes, it's embarrassingly parallel. Every neural network kernel
in CK-Engine has this property along at least one dimension.
Here's what the thread pool does to every such loop:
// BEFORE: Serial loop over M rows
for (int row = 0; row < M; row++) {
compute_row(row, ...);
}
// AFTER: Thread pool splits the loop across N threads
void work_fn(int ith, int nth, void *args) {
int rows_per_thread = M / nth;
int start = ith * rows_per_thread;
int end = (ith == nth - 1) ? M : start + rows_per_thread;
for (int row = start; row < end; row++) {
compute_row(row, ...); // exact same computation, just a slice
}
}
ck_threadpool_dispatch(pool, work_fn, &args);
Each thread computes a contiguous slice of the output. Thread 0 does rows 0..249, thread 1 does rows 250..499, etc. The total work is the same; it's just divided.
Step 3: Ensure No Two Threads Write to the Same Cache Line
This is the critical constraint. The rule is:
Reading from the same cache line across threads is FREE — all cores get their own read-only copy (MESI "Shared" state).
Writing to the same cache line from two threads is EXPENSIVE — every write invalidates all other cores' copies (MESI "Invalid" state), forcing a 40–100ns stall on every access.
In practice, this means the output array partition must be cache-line aligned at the boundaries between threads. Since a cache line is 64 bytes = 16 floats, thread boundaries should fall on multiples of 16 float elements.
For transformer kernels, this alignment happens naturally because:
- Each output row is typically 512–4096 floats (2KB–16KB) — many cache lines per row.
- We split on row boundaries, so each thread owns complete rows.
- Row addresses are aligned to CK_CACHE_LINE (64 bytes) in the memory allocator.
- The only "dangerous" case is the last few elements at a boundary, but with row-level splitting this never happens.
Step 4: What About Reads?
Reads are shared freely. This is what makes matrix operations so parallelizable:
GEMV: y = W · x
Shared read: Input vector x[K] — all threads read the same vector.
Shared read: Weight matrix W[M×K] — each thread reads its own rows, but the data is read-only.
Exclusive write: Output y[M] — each thread writes its own slice of y.
RMSNorm: y = x · γ / RMS(x)
Shared read: Gamma weights γ[D] — all threads read the same normalization weights.
Per-token read: Each thread reads its own token's input x[t].
Exclusive write: Output y[t] — each thread writes its own token.
The MESI cache coherency protocol makes shared reads essentially free. When multiple cores read the same cache line, each core gets its own copy in "Shared" (S) state. No invalidation occurs. The cache line stays resident in every core's L1/L2 simultaneously.
Step 5: Handle Reductions Carefully
Some kernels have reduction operations — where all threads must contribute to a single result (e.g., sum, max). These break the "no shared writes" rule and need special handling.
In CK-Engine's kernels, reductions are intra-token (within a single token's D-dimensional vector), not inter-token. Since we parallelize across tokens (not within a single token's dimension), each thread's reduction is completely private — no cross-thread communication needed.
The Complete Picture
§15 Per-Kernel Parallelization Analysis
Every kernel in the transformer decode loop has a natural parallelization dimension. Here's exactly how each one splits across threads, what data is shared, what is exclusive, and where reductions happen.
GEMV — Matrix-Vector Multiply
File: src/kernels/gemv_omp.c
Operation: y[M] = W[M×K] · x[K]
Call frequency: ~37% of decode time (logits projection is the largest single GEMV)
void gemv_q8_parallel(int ith, int nth, void *args) {
gemv_args_t *a = args;
int rows_per_thread = a->M / nth;
int start = ith * rows_per_thread;
int end = (ith == nth - 1) ? a->M : start + rows_per_thread;
for (int row = start; row < end; row++) {
// Each thread: dot product of one W row with shared x
vec_dot_q8_0_q8_0(a->K, &a->y[row],
&a->W_blocks[row * blocks_per_row],
a->x_blocks);
}
}
| Property | Value |
|---|---|
| Parallel dimension | Rows (M) — embarrassingly parallel |
| Shared reads | Input vector x[K], quantization lookup tables |
| Exclusive writes | Output y[start..end] — contiguous, cache-aligned |
| Reduction | None |
| Cache behavior | x stays in L1 across all rows (K ≈ 896 = 3.5KB). W streams through L2. |
| Speedup potential | Near-linear: 4 threads ≈ 3.5–3.8x (limited by memory bandwidth) |
GEMM — Matrix-Matrix Multiply
File: src/kernels/gemm_kernels.c
Operation: C[M×N] = A[M×K] · B[K×N] + bias[N]
Used for: Prefill (processing entire prompt at once — M > 1)
| Property | Value |
|---|---|
| Parallel dimension | Output rows (M) — embarrassingly parallel |
| Shared reads | B[K×N] (entire matrix), bias[N] |
| Exclusive writes | C[start_row..end_row, 0..N-1] — complete rows per thread |
| Reduction | None — each C[i,j] = dot(A[i], B[:,j]) is independent |
| SIMD | AVX-512 processes 16 floats per cycle, AVX2 does 8 |
| Special case | When M=1 (decode mode), falls back to GEMV — parallelizes over N instead |
RMSNorm — Root Mean Square Normalization
File: src/kernels/rmsnorm_kernels.c
Operation: y[t,d] = x[t,d] · γ[d] / sqrt(mean(x[t]²) + ε)
Called: Before every attention and FFN block (2x per layer ≈ 48 calls for 24 layers)
The reduction (sum of x² across the D dimension) happens within a single token. Since we split across tokens, not across dimensions within a token, each thread does its own reduction with zero cross-thread communication.
void rmsnorm_parallel(int ith, int nth, void *args) {
rmsnorm_args_t *a = args;
int tokens_per_thread = a->T / nth;
int t_start = ith * tokens_per_thread;
int t_end = (ith == nth - 1) ? a->T : t_start + tokens_per_thread;
for (int t = t_start; t < t_end; t++) {
float *x = a->input + t * a->aligned_D;
float *y = a->output + t * a->aligned_D;
// Phase 1: sum of squares (PRIVATE reduction — within this token only)
__m512 sum_sq = _mm512_setzero_ps();
for (int d = 0; d < a->D; d += 16)
sum_sq = _mm512_fmadd_ps(_mm512_loadu_ps(&x[d]),
_mm512_loadu_ps(&x[d]), sum_sq);
float rms = sqrtf(_mm512_reduce_add_ps(sum_sq) / a->D + 1e-6f);
// Phase 2: normalize (element-wise, reads shared gamma)
__m512 rstd = _mm512_set1_ps(1.0f / rms);
for (int d = 0; d < a->D; d += 16)
_mm512_storeu_ps(&y[d],
_mm512_mul_ps(_mm512_mul_ps(_mm512_loadu_ps(&x[d]), rstd),
_mm512_loadu_ps(&a->gamma[d])));
}
}
| Property | Value |
|---|---|
| Parallel dimension | Tokens (T) — embarrassingly parallel |
| Shared reads | γ[D] normalization weights |
| Exclusive writes | output[t_start..t_end, 0..D-1] — complete token vectors |
| Reduction | Intra-token only (private to each thread, no cross-thread sync) |
| Decode mode | T=1 in decode → no token-level parallelism. Could split D dimension with barrier. |
Attention — Q·KT/√d → Softmax → Attn·V
File: src/kernels/attention_kernels.c
Operation: Multi-head scaled dot-product attention with causal mask
Three phases: Score computation, softmax, weighted value summation
Attention is the most complex kernel but also the most naturally parallel. Each attention head is a self-contained computation:
- Phase 1 (Score): For each query position i, compute score[h,i,j] = Q[h,i] · K[h,j] / √d for all j ≤ i (causal mask). Triangle of dot products per head.
- Phase 2 (Softmax): Per row: find max, subtract, exp, sum, normalize. The reduction (max, sum) is within a single row — private to the thread owning that head.
- Phase 3 (Value): output[h,i] = ∑j score[h,i,j] · V[h,j]. Accumulates over j — but within a single head, so private to the owning thread.
| Property | Value |
|---|---|
| Parallel dimension | Heads (H) — completely independent |
| Shared reads | None — each head has its own Q, K, V slices in head-major layout |
| Exclusive writes | scores[h, *, *] and output[h, *, *] — entire head is private |
| Reduction | Softmax max/sum is intra-row, within a single head (private) |
| Compute cost | O(T² · d) per head — quadratic in sequence length |
| Ideal threads | num_heads (typically 8–32, more than enough to saturate cores) |
Softmax — With Causal Masking
File: src/kernels/softmax_kernels.c
Operation: softmax(row) = exp(row - max) / sum(exp(row - max)), applied to each row of the attention score matrix
Called: Once per attention layer, integrated into the attention computation
| Property | Value |
|---|---|
| Parallel dimension | Heads (H) or rows (T) within a head — both embarrassingly parallel |
| Shared reads | None (in-place operation) |
| Exclusive writes | score[h, i, *] — each row written independently |
| Reduction | Row-internal max and sum — private per row, no cross-thread sync |
| Causal optimization | Row i only processes i+1 elements (lower triangle) + zeroes rest |
SwiGLU — Gated Linear Unit Activation
File: src/kernels/swiglu_kernels.c
Operation: output[t,d] = SiLU(gate[t,d]) × value[t,d] where input is split into gate and value halves
Called: Once per FFN block (1x per layer)
| Property | Value |
|---|---|
| Parallel dimension | Tokens (T) — embarrassingly parallel, zero dependencies |
| Shared reads | None |
| Exclusive writes | output[t_start..t_end, 0..D-1] |
| Reduction | None — pure element-wise operation |
| SIMD efficiency | Excellent — straight FMA + sigmoid approximation over contiguous memory |
RoPE — Rotary Position Embedding
File: src/kernels/rope_kernels.c
Operation: Rotate Q and K vectors using precomputed sin/cos tables based on token position
Called: Once per attention block (1x per layer, applied to Q and K separately)
| Property | Value |
|---|---|
| Parallel dimension | Heads × Tokens (H·T) — all (h,t) pairs independent |
| Shared reads | cos_cache[T × half_dim], sin_cache[T × half_dim] |
| Exclusive writes | x[h, t, *] — in-place rotation, one head-vector per (h,t) |
| Reduction | None — pure element-wise rotation |
| Best split | Flatten H×T loop, divide evenly across threads |
Embedding Lookup
File: src/kernels/embedding_kernels.c
Operation: output[t] = token_embeddings[token_id[t]] + pos_embeddings[t]
Called: Once per forward pass (first operation)
| Property | Value |
|---|---|
| Parallel dimension | Tokens (T) — each token lookup is independent |
| Shared reads | token_embeddings[vocab_size × D] (table), pos_embeddings[T × D] |
| Exclusive writes | output[t, 0..D-1] — one embedding vector per token |
| Reduction | None |
| Special pattern | Irregular memory access (gather) — token_ids determine which rows to fetch |
| Decode mode | T=1 → no parallelism possible. Prefill (T=512+) benefits greatly. |
KV Cache Write
File: src/kernels/kv_cache_kernels.c
Operation: Copy new K and V vectors for current token into the cache at position token_index
Called: Once per layer (append new token's K,V to cache)
| Property | Value |
|---|---|
| Parallel dimension | KV heads (num_kv_heads) — each head is independent |
| Shared reads | k_token[num_kv_heads × head_dim], v_token[num_kv_heads × head_dim] |
| Exclusive writes | k_cache[h, token_index, *], v_cache[h, token_index, *] |
| Reduction | None — pure memcpy per head |
| Worth parallelizing? | Small — only head_dim floats per head. May not justify dispatch overhead. |
Summary: Parallelization Map
| Kernel | Split Dimension | Shared Reads | Reduction | Priority |
|---|---|---|---|---|
| GEMV | Rows (M) | x[K] | None | HIGH — dominates decode time |
| GEMM | Rows (M) | B[K×N] | None | HIGH — dominates prefill time |
| Attention | Heads (H) | None | Private per head | HIGH — O(T²) cost |
| RMSNorm | Tokens (T) | γ[D] | Private per token | MEDIUM — called 48x/fwd |
| SwiGLU | Tokens (T) | None | None | MEDIUM — element-wise |
| Softmax | Heads (H) or Rows (T) | None | Private per row | MEDIUM |
| RoPE | H × T | cos/sin tables | None | LOW — cheap per call |
| Embedding | Tokens (T) | Embedding table | None | LOW — prefill only |
| KV Cache | KV Heads | Token vectors | None | LOW — small memcpy |
The takeaway: every kernel splits cleanly along at least one independent dimension, which is why simple (ith, nth) partitioning is the perfect parallelization strategy for LLM inference.
§16 Testing & Validation
The thread pool dispatch layer is validated by a dedicated parity test suite that compares every dispatch wrapper against its serial counterpart. The tests verify both numerical correctness (bit-for-bit parity within tolerance) and performance (speedup from parallelization).
Test Suite: test_threadpool_parity.c
Location: tests/test_threadpool_parity.c
| Test | Serial Kernel | Thread Pool Dispatch | What It Validates |
|---|---|---|---|
| Test 1 | gemv_q8_0_q8_0() | gemv_q8_0_q8_0_parallel_dispatch() | Q8_0 weights × Q8_0 input (logits, V proj) |
| Test 2 | gemv_q5_0_q8_0() | gemv_q5_0_q8_0_parallel_dispatch() | Q5_0 weights × Q8_0 input (Q/K proj, MLP, out proj) |
| Test 3 | gemv_fused_q5_0_bias_dispatch() | gemv_fused_q5_0_bias_parallel_dispatch() | Fused: quantize FP32 → GEMV → add bias (production path) |
| Test 4 | — | Fused dispatch with NULL bias | Edge case: bias pointer is NULL (no bias add) |
| Test 5 | — | Small M dispatch latency measurement | Overhead of packing args + waking threads on tiny workloads |
Test Dimensions (Qwen2 0.5B shapes)
Tests run on real model dimensions to catch alignment and edge-case issues:
| Config | M (output rows) | K (input dim) | Corresponds To |
|---|---|---|---|
| `qkv_proj` | 896 | 896 | Q/K/V projections (embed → embed) |
| `mlp_gate_up` | 9,728 | 896 | MLP gate+up projection (embed → 4× intermediate) |
| `mlp_down` | 896 | 4,864 | MLP down projection (intermediate → embed) |
| `logits` | 151,936 | 896 | LM head (embed → vocab, largest GEMV) |
How to Run
```sh
# Quick parity check (~5 seconds)
make test-threadpool-parity-quick

# Full parity + speed test (all model dimensions)
make test-threadpool-parity

# Verbose output (shows max_abs_diff, max_rel_diff for every config)
make test-threadpool-parity-verbose
```
Correctness Criteria
Each test first computes the output with the serial kernel, then has every pool thread run the same _parallel_simd kernel on its row subset.
Since each output row is computed from independent dot products, the thread pool
result should be bit-identical to serial — the tolerance exists only
to guard against floating-point non-determinism from different SIMD accumulation orders.
CI Integration
The thread pool parity test runs automatically in three places:
| Trigger | What Runs | Where Configured |
|---|---|---|
| `make test` | `test-threadpool-parity-quick` | Makefile (`test` target) |
| `make llamacpp-parity-full` | `test-threadpool-parity` (full) | Makefile (`llamacpp-parity-full` target) |
| Nightly / Push to main / PR | `test-threadpool-parity` (full) | `scripts/nightly_runner.py` → `MAKE_TARGETS["threadpool_parity"]` |
The nightly workflow (.github/workflows/nightly.yml) runs on every push to
main, every pull request, and nightly at 2am UTC. If the thread pool
dispatch diverges from serial output, the test exits with code 1 and the nightly
report marks it as failed.
ADR: Thread Pool Replaced OpenMP for Parallelization
The codebase previously had an OpenMP parallelization pass (parallel_pass.py)
that annotated IR operations with #pragma omp parallel for. This was
superseded by the persistent pthread thread pool for the following reasons:
| Concern | OpenMP | Thread Pool |
|---|---|---|
| Thread creation | Fork/join per `#pragma` region (~15-50µs) | Created once at init, spin-wait between dispatches (~0.1µs wake) |
| Core oversubscription | OpenMP team + thread pool = 2N threads on N cores | One pool, known thread count, no conflict |
| Codegen integration | `parallel_pass.py` emits pragmas, but `codegen_v6_6.py` never reads them | Macro redirects in `ck_parallel_decode.h` — zero codegen changes |
| Kernel interface | Implicit OpenMP runtime | Explicit `(ith, nth)` args to `_parallel_simd` kernels |
Files affected by this decision:
- `version/v6.6/scripts/parallel_pass.py` — header documents "SUPERSEDED BY THREAD POOL DISPATCH"
- `version/v6.6/scripts/build_ir_v6_6.py` — `run_parallel_pass()` commented out with ADR
- `version/v6.6/scripts/ck_run_v6_6.py` — `--parallel-decode` flag deprecated (warns and continues)
Related Test Suites
The thread pool parity test complements existing tests that cover other aspects of the parallelization stack:
| Make Target | Test File | What It Tests |
|---|---|---|
| `test-threadpool-parity` | `tests/test_threadpool_parity.c` | Serial vs thread pool dispatch (this page) |
| `test-gemv-omp` | `tests/test_gemv_omp_parity.c` | Serial vs OpenMP parallel (legacy comparison) |
| `test-fusion-gemv` | `unittest/fusion/test_gemv_fused_quant_bias.py` | Fused vs unfused GEMV correctness + speedup |
| `llamacpp-parity-full` | `scripts/run_parity_smoketest.sh` | CK-Engine kernels vs llama.cpp reference (includes thread pool + OMP) |
| `e2e-v66` | `version/v6.6/test/Makefile` | Full v6.6 pipeline: IR → codegen → compile → run (thread pool always enabled) |