C-Kernel-Engine
A code generator and kernel library for LLM training and inference in pure C. Every kernel has a forward and backward pass, validated to < 1e-5 against PyTorch autograd. The IR pipeline makes every architectural decision explicit and auditable — from graph lowering to memory layout to the final byte of generated C.
Why this exists: build extremely lightweight AI systems without sacrificing performance or security, keep the full stack auditable end-to-end (math kernels -> training -> mechanistic interpretability), and stay CPU-first across embedded devices, commodity servers, and offline deployments.
What is C-Kernel-Engine?
The first version of C-Kernel-Engine tried to drive generation from HuggingFace config.json alone. That was not enough: config metadata does not fully capture quantized weight packing/layout details, nor all runtime stitching constraints needed for reliable kernel binding.
The current pipeline pulls GGUF as the weight/quantization source of truth, parses it, and combines that with C-Kernel-Engine templates and config hints. That merge produces deterministic weights.bump + sidecar metadata. From that point forward, the same IR chain can target either inference (v6.6) or training/backprop (v7).
The architecture is deliberately split into a smart front-end (template selection, manifest resolution, IR construction, backward synthesis, layout planning, fusion detection) and a dumb back-end (a codegen emitter that only writes what the lowered IR already decided). This makes correctness auditable at every stage.
Why the Bump Format Exists
- weights.bump is a contiguous binary blob optimized for deterministic C runtime access by offset.
- The sidecar (for example weights_manifest.json) records tensor names, offsets, shapes, dtypes, quantization metadata, and template bindings.
- bump + sidecar is the handoff contract that gives IR1 enough concrete information to lower, codegen, and run inference or backprop paths without hidden runtime guesses.
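The offset-based handoff can be sketched in a few lines of Python. This is a minimal illustration, not the actual manifest schema: the field names (`tensors`, `name`, `offset`, `shape`, `dtype`) are assumptions for the sake of the example.

```python
import struct

# Hypothetical sidecar: tensor name -> byte offset, shape, dtype.
# Field names are illustrative, not the real weights_manifest.json schema.
manifest = {
    "tensors": [
        {"name": "tok_embed", "offset": 0,  "shape": [2, 4], "dtype": "f32"},
        {"name": "lm_head",   "offset": 32, "shape": [4, 2], "dtype": "f32"},
    ]
}

# Toy weights.bump: two FP32 tensors packed back to back.
blob = struct.pack("8f", *range(8)) + struct.pack("8f", *range(8, 16))

def load_tensor(blob, manifest, name):
    """Resolve a tensor purely by manifest offset -- no searching, no guessing."""
    entry = next(t for t in manifest["tensors"] if t["name"] == name)
    n = entry["shape"][0] * entry["shape"][1]
    start = entry["offset"]
    return struct.unpack(f"{n}f", blob[start:start + 4 * n])

head = load_tensor(blob, manifest, "lm_head")
```

Because every access is a fixed offset into one contiguous blob, the generated C runtime can do the same thing with pointer arithmetic and zero copies.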
The IR Pipeline
Both inference and training share the same IR foundation. The inference path (v6.6) adds quantized dispatch and MEGA kernel fusion. The training extension (v7) adds backward synthesis, explicit gradient accumulation, and a separate training memory layout — built on top of the same IR1/IR2 lowering chain.
Inference path (v6.6):
- Quantized formats: Q4_0 · Q4_K · Q5_K · Q6_K · Q8_0 · BF16 · FP32
- Layout planning → layout.json (no architecture guessing in the emitter)
- GEMM / GEMV quantized dispatch
- MEGA fused kernels (prefill + decode)
- OpenMP thread pool

Training path (v7):
- Backward synthesis + accumulation windows → ir2_train_backward.json
- Training layout with canary sentinels between sections → layout_train.json
- Canary write + verify by phase
- AdamW · grad accumulation windows
- Cross-entropy (stable log-sum-exp)
- PyTorch oracle parity gates
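The layout planner's output can be pictured as a small JSON manifest. The fragment below is illustrative only; the field names are assumptions, not the actual layout.json schema:

```json
{
  "sections": [
    {"name": "weights",     "offset": 0,       "bytes": 4194304, "align": 64},
    {"name": "activations", "offset": 4194304, "bytes": 1048576, "align": 64}
  ]
}
```

Because every section carries an explicit offset, size, and alignment, the emitter never has to infer placement at codegen time.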
The kernel library is the shared foundation for both pipelines — every kernel used in inference is also the backward-pass primitive used in training.
Correctness
Every claim of correctness is backed by a specific, reproducible test. Parity is not a property we assert — it is something we measure at every kernel boundary, training step, and memory section.
Kernel Parity Gates
| Gate | Description |
|---|---|
| max diff < 1e-5 | All forward and backward kernels tested via Python ctypes against PyTorch autograd. Fails hard on violation. |
| llama.cpp parity | Quantized GEMM/GEMV kernels (Q4_K, Q5_K, Q6_K, Q8_0) verified against llama.cpp reference output — not just PyTorch. Both references must agree. |
| per-kernel test | Each kernel ships its own Python parity test. make test runs the full suite. |
| model matrix | make v6.6-validate-matrix runs the full kernel contract suite across GPT-2, Qwen2, Qwen3, Gemma3. |
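The real gates load libckernel_engine.so via ctypes and compare against PyTorch autograd; the pure-Python sketch below shows only the shape of the gate itself — two independent implementations of the same op, compared elementwise against the 1e-5 threshold.

```python
import math

def softmax_ref(xs):
    # Reference path: subtract the row max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def softmax_kernel(xs):
    # Stand-in for the kernel under test (same math, different summation path).
    m = max(xs)
    s = math.fsum(math.exp(x - m) for x in xs)
    return [math.exp(x - m) / s for x in xs]

def max_diff(a, b):
    # The parity gate: largest absolute elementwise difference.
    return max(abs(x - y) for x, y in zip(a, b))

row = [1.5, -2.0, 0.25, 3.0]
assert max_diff(softmax_kernel(row), softmax_ref(row)) < 1e-5
```

A kernel passes only if every element of its output lands within the threshold; a single violation fails the gate hard.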
Training Parity (v7)
| Check | Description |
|---|---|
| oracle cadence | Step-level and slot-level checks against a live PyTorch oracle. Run with --parity-on. |
| first-divergence | Operator + tensor level tracing. Reports the exact step, op, and tensor where CK diverges from PyTorch. |
| canary sentinels | Memory integrity markers between training phases (weights / activations / grads / optimizer state). Written and verified each phase. |
| 850-step repro | Deterministic parity run across all three backends (c / c_ptref / torch). Artifacts in train_runtime_parity_*_latest.json. |
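The canary mechanism can be sketched in a few lines. This is a minimal illustration under assumed names — the sentinel byte pattern and the `build_arena`/`verify_canaries` helpers are hypothetical, not the engine's actual implementation:

```python
CANARY = b"\xde\xad\xbe\xef" * 4  # illustrative sentinel; not the engine's actual pattern

def build_arena(sections):
    """Lay out named sections back to back, with a canary after each one."""
    arena = bytearray()
    spans = {}
    for name, size in sections:
        start = len(arena)
        arena += bytes(size)          # zero-initialized section
        spans[name] = (start, size)
        arena += CANARY               # sentinel guarding the section's end
    return arena, spans

def verify_canaries(arena, spans):
    """Re-check every sentinel; any overrun into it is caught by name."""
    for name, (start, size) in spans.items():
        got = bytes(arena[start + size : start + size + len(CANARY)])
        if got != CANARY:
            raise RuntimeError(f"canary clobbered after section {name!r}")

arena, spans = build_arena([("weights", 64), ("grads", 64)])
verify_canaries(arena, spans)  # passes on a clean arena
```

An out-of-bounds write that touches the sentinel region makes the next `verify_canaries` call raise, pinpointing which section overran and in which training phase.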
Performance
Profiling + SIMD
| Feature | Details |
|---|---|
| VTune + flamegraph | Hardware counter integration via the IR Visualizer. Load run artifacts and inspect per-op hotspots with full flamegraph data. |
| AVX-512 MEGA kernels | RMSNorm + QKV + RoPE fused. MLP + Residual fused. Fusion patterns detected automatically at IR2. (v6.6) |
| AMX | Intel Advanced Matrix Extensions support coming in a future version. |
| OpenMP thread pool | Parallel prefill and decode passes. Token-parallel MLP. (v6.6) |
| Bump allocator | Cache-aligned, contiguous weight layout. Zero-copy load from GGUF. No heap fragmentation in hot paths. |
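The arithmetic behind cache-aligned bump allocation is simple enough to show directly. A minimal sketch, assuming a 64-byte cache line; the helper names are illustrative, not the engine's API:

```python
def align_up(offset, align=64):
    # Round offset up to the next cache-line boundary.
    return (offset + align - 1) & ~(align - 1)

def plan_bump(sizes, align=64):
    """Plan contiguous, cache-aligned offsets for a list of (name, bytes)."""
    offsets, cursor = {}, 0
    for name, size in sizes:
        cursor = align_up(cursor, align)
        offsets[name] = cursor
        cursor += size
    return offsets, cursor  # per-tensor offsets + total arena bytes

offsets, total = plan_bump([("wq", 100), ("wk", 100), ("wv", 100)])
```

Every tensor starts on a cache-line boundary inside one contiguous arena, so hot-path loads never touch the heap and never straddle an allocation seam.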
Tokenizer — 16.5x Faster
Trie-based BPE/WordPiece tokenization. O(k) lookup vs O(n×k). No memcpy in hot path. Full UTF-8 multilingual support.
| Text Length | C-Kernel | PyTorch | Speedup |
|---|---|---|---|
| 200 chars | 20,941/s | 4,616/s | 4.5x |
| 3,000 chars | 95,923/s | 6,197/s | 15.5x |
| 15,000 chars | 114,463/s | 6,237/s | 18.4x |
Average: 16.56x faster than PyTorch/tiktoken
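The O(k) claim comes from walking a trie one character at a time instead of scanning the vocabulary per position. A minimal pure-Python sketch of greedy longest-match trie lookup — the actual tokenizer is C with no memcpy in the hot path, and this toy vocabulary is invented for illustration:

```python
class TrieTokenizer:
    """Greedy longest-match lookup: O(k) per token, k = token length."""
    def __init__(self, vocab):
        self.root = {}
        for tok_id, piece in enumerate(vocab):
            node = self.root
            for ch in piece:
                node = node.setdefault(ch, {})
            node["#"] = tok_id  # end-of-piece marker

    def encode(self, text):
        ids, i = [], 0
        while i < len(text):
            node, best, j = self.root, None, i
            # Walk the trie as far as the text allows, remembering
            # the longest complete piece seen along the way.
            while j < len(text) and text[j] in node:
                node = node[text[j]]
                j += 1
                if "#" in node:
                    best = (node["#"], j)
            if best is None:
                raise ValueError(f"no token covers {text[i]!r}")
            ids.append(best[0])
            i = best[1]
        return ids

tok = TrieTokenizer(["h", "e", "l", "o", "he", "ll", "hell", "hello"])
```

Each character is examined a bounded number of times per emitted token, so throughput stays flat as text length grows — consistent with the speedup widening from 4.5x at 200 chars to 18.4x at 15,000 chars above.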
Kernel Library
All kernels implement forward and backward passes. Each ships with a Python ctypes parity test that loads libckernel_engine.so and compares output against PyTorch autograd with max diff < 1e-5:
| Kernel | Forward | Backward | Notes |
|---|---|---|---|
| attention | Yes | Yes | Causal mask, GQA, sliding-window, head-major layout |
| rope | Yes | Yes | Rotary position embeddings, precomputed cache |
| rmsnorm | Yes | Yes | Fused normalization, rstd caching, learned scale |
| swiglu | Yes | Yes | Fused gate activation for Llama/Qwen-style MLP |
| softmax | Yes | Yes | Causal row-wise, numerically stable |
| layernorm | Yes | Yes | Rolled and unrolled variants |
| gelu | Yes | Yes | Exact and fast approximation variants |
| cross_entropy | Yes | Yes | Stable log-sum-exp, mean reduction, ignore-index |
| gemm | Yes | N/A | Blocked serial, AVX-512, parallel — Q4_K · Q5_K · Q6_K · Q8_0 · FP32 |
| mlp | Yes | Yes | FC1 + activation + FC2, token-parallel |
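The stable log-sum-exp form used by cross_entropy is worth spelling out, since naive exp-then-log overflows for large logits. A pure-Python restatement of the contract (stable log-sum-exp, mean reduction, ignore-index) — the C kernel itself is vectorized, and the `ignore_index` default here is an assumption:

```python
import math

def cross_entropy(logits, target, ignore_index=-1):
    """Stable per-row loss: logsumexp(logits) - logits[target]."""
    if target == ignore_index:
        return 0.0
    m = max(logits)  # shift by the max so every exp() argument is <= 0
    lse = m + math.log(math.fsum(math.exp(x - m) for x in logits))
    return lse - logits[target]

def mean_cross_entropy(batch_logits, targets, ignore_index=-1):
    """Mean reduction over rows, skipping ignore-index positions entirely."""
    losses = [cross_entropy(l, t, ignore_index)
              for l, t in zip(batch_logits, targets) if t != ignore_index]
    return math.fsum(losses) / len(losses)
```

Subtracting the row max before exponentiating keeps every `exp()` argument non-positive, so the sum can never overflow; the max is added back inside the log, leaving the result mathematically unchanged.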
Kernel Reference → Model × Kernel Matrix →
Quick Start
Build + Parity Test
```shell
git clone https://github.com/antshiv/C-Kernel-Engine.git
cd C-Kernel-Engine
make        # builds build/libckernel_engine.so
make test   # runs all kernel parity tests vs PyTorch
```
Requires Linux, GCC, Python 3 + PyTorch for parity tests. AVX2 minimum; AVX-512 recommended.
Inference — Generate C Runtime
```shell
make ck-emit \
    CONFIG=path/to/config.json \
    OUT=build/generated_model.c
```
Emits a complete C file from IR. Load GGUF quantized weights via make gguf-convert.
Training — v7 IR Runtime
```shell
. .venv/bin/activate
python version/v7/scripts/ck_run_v7.py init \
    --run /tmp/myrun \
    --template qwen3 \
    --generate-ir --generate-runtime --strict
python version/v7/scripts/ck_run_v7.py train \
    --run /tmp/myrun \
    --backend ck --parity-on
```
Produces IR1, IR2, layout, generated C runtime, and parity artifacts.
Operator Spectrum (1 → 8)
Raw Corpus → Tokenizer → Data Prep → Transformer IR → Forward → Backward → Loss → Inference.
Project Structure
```
src/kernels
`-- fused
version/v6.6
|-- docs
|-- include
|-- kernel_maps
|-- patches
|-- scripts
| `-- parity
|-- src
| |-- generated
| |-- kernel_config
| |-- scripts
| `-- test_generated
|-- templates
|-- test
|-- testing
|-- tests
|-- tools
`-- unittest
version/v7
|-- artifacts
| `-- svg_dsl
| |-- gen1_archive_2026-04-05
| `-- spec_archive_2026-04-08
|-- contracts
|-- data
| |-- eval_contracts
| |-- generated
| | |-- toy_svg_semantic_shapes_tokenizer
| | `-- toy_svg_structured_atoms_tokenizer
| |-- probe_contracts
| |-- spec03
| | |-- contracts
| | |-- holdout
| | |-- manifests
| | |-- midtrain
| | |-- normalized
| | |-- pretrain
| | |-- raw_assets
| | |-- sft
| | `-- tokenizer
| `-- spec04
| |-- contracts
| |-- holdout
| |-- manifests
| |-- midtrain
| |-- normalized
| |-- pretrain
| |-- raw_assets
| |-- sft
| `-- tokenizer
|-- docs
|-- examples
|-- experiments
| `-- svg_dsl
| |-- catalog
| |-- core
| |-- programs
| `-- renderers
|-- include
|-- kernel_maps
|-- regression
|-- reports
| |-- spec12_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec12_gold_mappings
| |-- spec13b_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec13b_gold_mappings
| |-- spec14a_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec14a_gold_mappings
| |-- spec14b_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec14b_gold_mappings
| |-- spec15a_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec15a_gold_mappings
| |-- spec15b_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec15b_gold_mappings
| |-- spec_broader_1_family_packs -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec_broader_1_family_packs
| `-- spec_broader_1_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec_broader_1_gold_mappings
|-- runs
| |-- logs
| `-- overnight_monitor
| `-- spec10
|-- scripts
| |-- dataset
| `-- parity
|-- src
|-- templates
|-- test
|-- tests
| |-- contracts
| `-- fixtures
`-- tools
`-- src
87 directories
```