CPU · Linux · x86-64 / ARM

C-Kernel-Engine

A CPU-native AI runtime and kernel compiler for auditable inference, training kernels, and distributed CPU execution. Python tooling imports models, builds the IR, generates C, and runs parity checks; the execution path is C-first and compiled with the best native CPU toolchain available.

Why this exists: build lightweight AI systems without sacrificing performance or auditability, keep the full stack visible end-to-end (model import -> IR -> memory layout -> generated C kernels), and scale CPU-first across embedded devices, commodity servers, and owned CPU clusters.

< 1e-5 parity threshold · PyTorch + llama.cpp verified · 16.5x faster tokenizer · 9+ fwd + bwd kernels · IR1 → IR2 → IR3 codegen · VTune / Advisor profiling · C-first execution

What is C-Kernel-Engine?

C-Kernel-Engine is split into two layers. The tooling layer uses Python for model import, GGUF/safetensors conversion, IR construction, code generation, and PyTorch/llama.cpp parity testing. The execution layer is generated C plus hand-written C kernels with explicit memory offsets, quantization layouts, SIMD dispatch, and native compiler optimization.

The intended deployment model is dependency-light execution, not dependency-free development. The generated runtime can be compiled with GCC or Clang, but the preferred path is to use the strongest native CPU toolchain for the target machine: Intel oneAPI/ICX on Intel CPUs, tuned GCC/Clang on AMD CPUs, ARM toolchains on ARM systems, and standard system libraries for cluster execution.

Generated C, Planned Memory

C-Kernel-Engine uses C as an auditable compiler target, not as a large hand-written pointer maze. The model graph, tensor shapes, lifetimes, kernel contracts, and memory sections are resolved before code emission. The generated C then reads and writes deterministic offsets from planned buffers instead of allocating or guessing inside the hot path.

This is the memory-safety model: avoid most ordinary C mistakes by not hand-authoring the model body in C in the first place. The compiler pipeline emits explicit offsets, optional bounds/canary checks, cache-aligned sections, and reproducible layout reports; ASAN, Valgrind, PyTorch parity, llama.cpp parity, perf, VTune, and Advisor validate the artifact from correctness down to memory behavior.

The result is still native C, but the important contract is visible: IR op -> tensor lifetime -> byte offset -> generated pointer -> kernel call. That makes the execution path deterministic, inspectable, and measurable for applications where memory layout is the application.
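As a minimal sketch of the "tensor lifetime -> byte offset" step, the Python below assigns fixed, cache-aligned offsets and lets tensors with non-overlapping lifetimes share bytes. The `Tensor` fields, the 64-byte alignment, and the first-fit strategy are assumptions for illustration, not the planner C-Kernel-Engine actually ships.

```python
from dataclasses import dataclass

ALIGN = 64  # assumed cache-line size in bytes


def align_up(n: int, align: int = ALIGN) -> int:
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align


@dataclass
class Tensor:
    name: str
    nbytes: int
    first_use: int  # index of the op that produces the tensor
    last_use: int   # index of the last op that reads it


def plan_offsets(tensors):
    """Assign each tensor a fixed byte offset. Tensors whose lifetimes do
    not overlap may share the same bytes. Returns (offsets, total_bytes)."""
    placed = []   # (offset, size, tensor) for tensors already planned
    offsets = {}
    for t in sorted(tensors, key=lambda t: (t.first_use, t.name)):
        size = align_up(t.nbytes)
        # Only tensors alive at the same time can conflict with t.
        live = sorted(
            (off, sz) for off, sz, other in placed
            if other.first_use <= t.last_use and t.first_use <= other.last_use
        )
        offset = 0
        for off, sz in live:  # first-fit scan over occupied ranges
            if offset + size <= off:
                break
            offset = max(offset, off + sz)
        offsets[t.name] = offset
        placed.append((offset, size, t))
    total = max((off + sz for off, sz, _ in placed), default=0)
    return offsets, total


demo = [
    Tensor("x", 100, 0, 1),
    Tensor("h", 200, 1, 2),
    Tensor("y", 100, 2, 3),
]
demo_offsets, demo_total = plan_offsets(demo)
```

In this toy run `y` reuses the bytes of `x` because `x` is dead by the time `y` is produced. Since every offset is decided before any code is emitted, the generated C only needs `base + offset` arithmetic.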

The north star is distributed CPU inference and training: model/layer ownership across CPU nodes, explicit activation and gradient movement, MPI/RDMA-style communication where available, and pipeline parallelism over hardware that can be bought, powered, audited, and scaled incrementally.

The first version of C-Kernel-Engine tried to drive generation from HuggingFace config.json alone. That was not enough: config metadata does not fully capture quantized weight packing/layout details, nor all runtime stitching constraints needed for reliable kernel binding.

The current pipeline pulls GGUF or safetensors metadata as the weight/layout source of truth, parses it, and combines that with C-Kernel-Engine templates and config hints. That merge produces deterministic weights.bump + sidecar metadata. From that point forward, the same IR chain can target v8 inference and v7 training/backprop paths.

The architecture is deliberately split into a smart front-end (template selection, manifest resolution, IR construction, backward synthesis, layout planning, fusion detection) and a dumb back-end (a codegen emitter that only writes what the lowered IR already decided). This makes correctness auditable at every stage.

Why the Bump Format Exists

weights.bump is a contiguous binary blob optimized for deterministic C runtime access by offset.
The sidecar (for example weights_manifest.json) records tensor names, offsets, shapes, dtypes, quantization metadata, and template bindings.
bump + sidecar is the handoff contract that gives IR1 enough concrete information to lower, codegen, and run inference or backprop paths without hidden runtime guesses.
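The handoff contract above can be illustrated with a small Python sketch: pack tensors into one aligned blob, record their offsets in a JSON sidecar, and read a tensor back purely by offset. The manifest field names, the 64-byte alignment, and the FP32-only payload are assumptions; the real weights.bump and weights_manifest.json layouts may differ.

```python
import json
import struct

ALIGN = 64  # assumed section alignment in bytes


def pack_bump(tensors):
    """Pack {name: list of floats} into one contiguous blob plus a manifest."""
    blob = bytearray()
    entries = []
    for name, values in tensors.items():
        blob.extend(b"\x00" * (-len(blob) % ALIGN))  # pad to alignment
        entries.append({"name": name, "offset": len(blob),
                        "shape": [len(values)], "dtype": "fp32"})
        blob.extend(struct.pack(f"<{len(values)}f", *values))
    return bytes(blob), json.dumps({"tensors": entries}, indent=2)


def read_tensor(blob, manifest_json, name):
    """Read one tensor by its recorded offset, with no searching in the blob."""
    entry = next(e for e in json.loads(manifest_json)["tensors"]
                 if e["name"] == name)
    count = entry["shape"][0]
    return list(struct.unpack_from(f"<{count}f", blob, entry["offset"]))


blob, manifest = pack_bump({"tok_emb": [1.0, 2.0, 3.0], "ln_w": [0.5, 0.25]})
```

The point of the split is that the blob stays opaque bytes while the sidecar carries everything a lowering pass needs to bind a kernel argument to `base + offset`.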

The IR Pipeline

Both inference and training share the same IR foundation. The inference path (v8) adds quantized dispatch, model-family contracts, multimodal coverage, and runtime profiling hooks. The training path (v7) adds backward synthesis, explicit gradient accumulation, and a separate training memory layout — built on top of the same IR1/IR2 lowering chain.
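To make "backward synthesis" and "explicit gradient accumulation" concrete, here is a hedged Python sketch that walks a forward op list in reverse and marks where a gradient must be accumulated rather than written. The record format and the `_backward` naming are hypothetical and only illustrate the idea of fanout accumulation.

```python
from collections import Counter


def synthesize_backward(forward_ops):
    """forward_ops: list of (op_name, input_tensors, output_tensor).
    Returns backward ops in execution order. A tensor read by more than one
    forward op receives several gradient contributions, so every write after
    the first is marked as an explicit accumulation."""
    fanout = Counter(t for _, inputs, _ in forward_ops for t in inputs)
    seen = set()
    backward = []
    for op_name, inputs, output in reversed(forward_ops):
        for t in inputs:
            mode = "accumulate" if t in seen else "write"
            seen.add(t)
            backward.append({
                "op": f"{op_name}_backward",
                "grad_in": f"d_{output}",
                "grad_out": f"d_{t}",
                "mode": mode,
                "fanout": fanout[t],
            })
    return backward


# A residual block: x feeds both the norm and the residual add.
residual_block = [
    ("rmsnorm", ["x"], "h"),
    ("mlp", ["h"], "m"),
    ("add", ["x", "m"], "y"),
]
backward_ops = synthesize_backward(residual_block)
```

Here `d_x` is written once by `add_backward` and then accumulated by `rmsnorm_backward`, which is the kind of decision that belongs in the IR rather than in the emitter.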

Shared IR Foundation

HF config + GGUF parse → CK template merge + tensor mapping → weights.bump + sidecar manifest → IR1 (fwd op graph, typed tensors, kernel IDs) → IR2 (fusion patterns, MEGA kernel detection)
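The IR2 fusion step at the end of this chain can be sketched as a pattern rewrite over an op list. The two patterns mirror the fusions named later on this page (RMSNorm + QKV + RoPE, MLP + Residual); the op names and kernel IDs are made up for the example, and a real pass over a graph would also have to check data dependencies, which this linear sketch ignores.

```python
# Hypothetical fusion patterns: op sequence -> fused kernel ID.
FUSION_PATTERNS = {
    ("rmsnorm", "qkv_proj", "rope"): "mega_rmsnorm_qkv_rope",
    ("mlp", "residual_add"): "mega_mlp_residual",
}


def fuse_ops(ops):
    """Greedy left-to-right rewrite of a linear op list.
    At each position the longest matching pattern wins; unmatched ops pass
    through unchanged."""
    patterns = sorted(FUSION_PATTERNS.items(), key=lambda kv: -len(kv[0]))
    fused, i = [], 0
    while i < len(ops):
        for pattern, kernel_id in patterns:
            if tuple(ops[i:i + len(pattern)]) == pattern:
                fused.append(kernel_id)
                i += len(pattern)
                break
        else:  # no pattern matched at position i
            fused.append(ops[i])
            i += 1
    return fused


layer = ["rmsnorm", "qkv_proj", "rope", "attention", "mlp", "residual_add"]
fused_layer = fuse_ops(layer)
```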

Inference Path — v6.6 + Quantization

↓

IR3 Lowering

quantized kernel dispatch, memory offsets
Q4_0 · Q4_K · Q5_K · Q6_K · Q8_0 · BF16 · FP32

↓

Memory Layout

contiguous bump allocator, cache-aligned sections
→ layout.json

↓

Lowered IR3

resolved offsets + kernel bindings

↓

Dumb Codegen

emits only what lowered IR decided
no architecture guessing in the emitter

→ generated_model.c
GEMM / GEMV quantized dispatch
MEGA fused kernels (prefill + decode)
OpenMP thread pool

Training Extension — v7 Backprop IR

↓

IR2 Backward Synthesis

grad edges, fanout accumulation, explicit accumulation windows
→ ir2_train_backward.json

↓

Training Memory Layout

weights · activations · grads · optimizer state
canary sentinels between sections
→ layout_train.json

↓

Layout Audit + Validation

invariant checks, memory audit report
canary write + verify by phase

↓

Dumb Codegen

→ generated_train_runtime_v7.c → libtrain.so

→ libtrain.so
AdamW · grad accumulation windows
cross-entropy (stable log-sum-exp)
PyTorch oracle parity gates
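The canary sentinels in the training memory layout can be sketched in a few lines of Python: place a known 8-byte marker after each section, then check that every marker is intact. The sentinel value, the section sizes, and the function names are assumptions for illustration only.

```python
import struct

CANARY = 0xC0FFEE42DEADBEEF  # hypothetical 64-bit sentinel value
CANARY_BYTES = struct.pack("<Q", CANARY)


def build_layout(section_sizes):
    """Lay out sections back to back with an 8-byte canary after each one.
    Returns (buffer, {name: (offset, size)}, [canary offsets])."""
    sections, canaries, cursor = {}, [], 0
    for name, size in section_sizes.items():
        sections[name] = (cursor, size)
        cursor += size
        canaries.append(cursor)
        cursor += len(CANARY_BYTES)
    buf = bytearray(cursor)
    for off in canaries:
        buf[off:off + 8] = CANARY_BYTES
    return buf, sections, canaries


def verify_canaries(buf, canaries):
    """Return the offsets of any canaries that were overwritten."""
    return [off for off in canaries
            if bytes(buf[off:off + 8]) != CANARY_BYTES]


buf, sections, canaries = build_layout(
    {"weights": 16, "activations": 16, "grads": 16, "optimizer": 16}
)
clean = verify_canaries(buf, canaries)

# Simulate a one-byte overrun past the end of the activations section.
act_off, act_size = sections["activations"]
buf[act_off + act_size] = 0x00
damaged = verify_canaries(buf, canaries)
```

A damaged canary identifies the section boundary that was crossed, which narrows an out-of-bounds write to one section and one phase.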

The emitter is intentionally dumb. All architecture decisions (fusion, quantized dispatch, memory placement, backward op ordering) live in the IR stages, not in the code generator. This is the core design principle: if generated behaviour is wrong, the root cause is upstream in IR or layout, never hidden in the emitter.

IR Pipeline v6.6 → v7 Backprop IR →
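A "dumb" emitter in this sense might look like the Python sketch below: it copies kernel names and byte offsets out of a lowered IR and makes no decisions of its own. The IR record shape, the kernel names, and the emitted function signature are hypothetical and not taken from the project's generator.

```python
def emit_c(lowered_ops, func_name="ck_forward"):
    """Emit one C function from lowered IR. Every kernel name and byte
    offset is copied from the IR; the emitter makes no decisions of its own."""
    lines = [f"void {func_name}(uint8_t *base) {{"]
    for op in lowered_ops:
        args = ", ".join(f"(float *)(base + {off})" for off in op["offsets"])
        lines.append(f"    {op['kernel']}({args});")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical lowered IR: kernels and offsets were resolved upstream.
lowered = [
    {"kernel": "rmsnorm_forward", "offsets": [0, 4096]},
    {"kernel": "gemm_fp32", "offsets": [4096, 8192, 12288]},
]
print(emit_c(lowered))
```

Because the emitter is a pure function of the lowered IR, two runs over the same IR produce byte-identical C, and any wrong call in the output traces back to a specific IR record.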

The kernel library is the shared foundation for both pipelines: the forward kernels that run inference are the same ones the training path calls, and their backward counterparts are the primitives for backprop:

C-Kernel-Engine: Kernel Library → IR Pipeline → Memory Layout → Dumb Codegen → Inference Runtime + Training Runtime v7

Featured Deep Dive: Gated DeltaNet

Qwen3.5/qwen3next-style recurrent attention uses a compact state matrix instead of scanning the full KV-cache in every recurrent layer. CK implements that path as a dedicated Gated DeltaNet kernel family with parity-focused docs.

Open Gated DeltaNet Deep Dive →
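For readers new to the idea, the sketch below shows one recurrent step of the gated delta rule as it is commonly stated in the literature, written in plain Python. It is an assumption that CK's kernels follow this exact formulation; the function name, argument order, and scalar gates are illustrative, and the real kernels operate per head on batched, vectorized data.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step of the gated delta rule.
    S: d_v x d_k state matrix (list of rows); q, k: length d_k; v: length d_v.
    alpha in [0, 1] decays the old state; beta in [0, 1] is the write strength.
    Computes S' = alpha * S @ (I - beta * k k^T) + beta * v k^T and o = S' @ q."""
    new_S = []
    for row, v_i in zip(S, v):
        row_dot_k = sum(s * k_j for s, k_j in zip(row, k))  # (S k)_i
        new_S.append([
            alpha * (s - beta * row_dot_k * k_j) + beta * v_i * k_j
            for s, k_j in zip(row, k)
        ])
    out = [sum(s * q_j for s, q_j in zip(row, q)) for row in new_S]
    return new_S, out


state = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]
state, out1 = gated_delta_step(state, key, key, [2.0, 3.0], 1.0, 1.0)
state, out2 = gated_delta_step(state, key, key, [5.0, 7.0], 1.0, 1.0)
```

The state stays the same size no matter how many tokens have been processed, which is the contrast with scanning a growing KV-cache. In the toy run, writing a second value under the same key replaces the first.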

Correctness

Every claim of correctness is backed by a specific, reproducible test. Parity is not a property we assert — it is something we measure at every kernel boundary, training step, and memory section.

Kernel Parity Gates

max diff < 1e-5	All forward and backward kernels tested via Python ctypes against PyTorch autograd. Fails hard on violation.
llama.cpp parity	Quantized GEMM/GEMV kernels (Q4_K, Q5_K, Q6_K, Q8_0) verified against llama.cpp reference output — not just PyTorch. Both references must agree.
per-kernel test	Each kernel ships its own Python parity test. `make test` runs the full suite.
model matrix	`make v6.6-validate-matrix` runs the full kernel contract suite across GPT-2, Qwen2, Qwen3, Gemma3.
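The shape of such a gate can be sketched in Python without any compiled library. In the sketch below two pure-Python softmax variants stand in for the kernel under test and the PyTorch oracle; the helper names are hypothetical, and the real tests call into the shared library via ctypes instead.

```python
import math


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two flat sequences."""
    if len(a) != len(b):
        raise ValueError("shape mismatch")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def assert_parity(name, candidate, reference, tol=1e-5):
    """Fail hard when a kernel output drifts from its reference."""
    if any(math.isnan(x) for x in candidate):
        raise AssertionError(f"{name}: NaN in candidate output")
    diff = max_abs_diff(candidate, reference)
    if diff >= tol:
        raise AssertionError(f"{name}: max diff {diff:.3e} >= {tol:.0e}")
    return diff


def softmax_reference(x):
    """Plain softmax, standing in for the PyTorch oracle."""
    exps = [math.exp(v) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_stable(x):
    """Max-subtracted softmax, standing in for the kernel under test."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


row = [1.0, 2.0, 3.0]
softmax_diff = assert_parity("softmax", softmax_stable(row), softmax_reference(row))
```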

Training Parity (v7)

oracle cadence	Step-level and slot-level checks against a live PyTorch oracle. Run with `--parity-on`.
first-divergence	Operator + tensor level tracing. Reports the exact step, op, and tensor where CK diverges from PyTorch.
canary sentinels	Memory integrity markers between training memory sections (weights / activations / grads / optimizer state). Written and verified each phase.
850-step repro	Deterministic parity run across all three backends (`c` / `c_ptref` / `torch`). Artifacts in `train_runtime_parity_*_latest.json`.
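First-divergence tracing can be illustrated with a short Python sketch that compares two execution traces record by record and reports the first step, op, and tensor that disagree. The trace record layout and the function name are assumptions; the project's own tracer and artifact format are not shown on this page.

```python
def first_divergence(ck_trace, ref_trace, tol=1e-5):
    """Each trace is a list of (step, op, tensor_name, values) records in
    execution order. Returns the first record where the two runs disagree by
    tol or more, or None when the traces agree throughout."""
    for (step, op, name, ck_vals), (_, _, _, ref_vals) in zip(ck_trace, ref_trace):
        diff = max((abs(a - b) for a, b in zip(ck_vals, ref_vals)), default=0.0)
        if diff >= tol:
            return {"step": step, "op": op, "tensor": name, "max_diff": diff}
    return None


ref_trace = [
    (0, "rmsnorm", "h", [1.0, 2.0]),
    (0, "mlp", "m", [0.5, 0.5]),
    (1, "rmsnorm", "h", [1.0, 2.0]),
]
ck_trace = [
    (0, "rmsnorm", "h", [1.0, 2.0]),
    (0, "mlp", "m", [0.5, 0.75]),   # first mismatch
    (1, "rmsnorm", "h", [1.5, 2.0]),  # downstream of the first mismatch
]
report = first_divergence(ck_trace, ref_trace)
```

Reporting only the first mismatch matters because every later tensor is usually wrong as a consequence, so the earliest record points at the op to debug.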

Performance

Profiling + SIMD

VTune + flamegraph	Hardware counter integration via the IR Visualizer. Load run artifacts and inspect per-op hotspots with full flamegraph data.
AVX-512 MEGA kernels	RMSNorm + QKV + RoPE fused. MLP + Residual fused. Fusion patterns detected automatically at IR2. (v6.6)
AMX	Intel Advanced Matrix Extensions support coming in a future version.
OpenMP thread pool	Parallel prefill and decode passes. Token-parallel MLP. (v6.6)
Bump allocator	Cache-aligned, contiguous weight layout. Zero-copy load from GGUF. No heap fragmentation in hot paths.

Profiling Guide → v7 Profiling →

Tokenizer — 16.5x Faster

Trie-based BPE/WordPiece tokenization. O(k) lookup vs O(n×k). No memcpy in hot path. Full UTF-8 multilingual support.

Text Length	C-Kernel	PyTorch	Speedup
200 chars	20,941/s	4,616/s	4.5x
3,000 chars	95,923/s	6,197/s	15.5x
15,000 chars	114,463/s	6,237/s	18.4x

Average: 16.56x faster than PyTorch/tiktoken
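The trie idea behind these numbers can be sketched as greedy longest-match lookup, which is how WordPiece-style matching works. The Python below is an illustration only: it does not model BPE merge ranks, the vocabulary and `unk_id` are invented, and reading O(k) as "proportional to the length of the matched token" is an assumption about the notation above.

```python
class TrieNode:
    __slots__ = ("children", "token_id")

    def __init__(self):
        self.children = {}
        self.token_id = None  # set when a vocabulary entry ends here


def build_trie(vocab):
    """vocab: {token string: token id}."""
    root = TrieNode()
    for token, token_id in vocab.items():
        node = root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.token_id = token_id
    return root


def encode(text, root, unk_id=0):
    """Greedy longest-match: at each position walk the trie as far as the
    text allows and emit the longest vocabulary entry seen on the way."""
    ids, i = [], 0
    while i < len(text):
        node, best_id, best_end = root, None, i
        j = i
        while j < len(text) and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.token_id is not None:
                best_id, best_end = node.token_id, j
        if best_id is None:  # no entry starts here: emit unk, skip one char
            ids.append(unk_id)
            i += 1
        else:
            ids.append(best_id)
            i = best_end
    return ids


vocab = {"h": 1, "he": 2, "hello": 3, "l": 4, "lo": 5, " ": 6, "w": 7}
trie = build_trie(vocab)
```

Each position costs one walk down the trie, so the work per token depends on the token's length rather than on the vocabulary size.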

Tokenizer Details →

Kernel Library

Every kernel implements a forward pass, and every kernel except `gemm` (marked N/A in the table) also implements a backward pass. Each ships with a Python ctypes parity test that loads libckernel_engine.so and compares output against PyTorch autograd with max diff < 1e-5:

Kernel	Forward	Backward	Notes
`attention`	Yes	Yes	Causal mask, GQA, sliding-window, head-major layout
`rope`	Yes	Yes	Rotary position embeddings, precomputed cache
`rmsnorm`	Yes	Yes	Fused normalization, rstd caching, learned scale
`swiglu`	Yes	Yes	Fused gate activation for Llama/Qwen-style MLP
`softmax`	Yes	Yes	Causal row-wise, numerically stable
`layernorm`	Yes	Yes	Rolled and unrolled variants
`gelu`	Yes	Yes	Exact and fast approximation variants
`cross_entropy`	Yes	Yes	Stable log-sum-exp, mean reduction, ignore-index
`gemm`	Yes	N/A	Blocked serial, AVX-512, parallel — Q4_K · Q5_K · Q6_K · Q8_0 · FP32
`mlp`	Yes	Yes	FC1 + activation + FC2, token-parallel
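As one worked instance of a forward/backward pair from the table, here is a pure-Python sketch of `cross_entropy` with a stable log-sum-exp, mean reduction, and an ignore index. It is a reference-style illustration, not the C kernel; the function names are hypothetical and the default `ignore_index=-100` follows PyTorch's convention.

```python
import math


def cross_entropy_forward(logits, targets, ignore_index=-100):
    """logits: list of rows; targets: one class index per row.
    Returns (mean loss over non-ignored rows, per-row softmax probabilities)."""
    total, count, probs = 0.0, 0, []
    for row, target in zip(logits, targets):
        m = max(row)  # subtracting the max keeps exp() from overflowing
        exps = [math.exp(x - m) for x in row]
        denom = sum(exps)
        probs.append([e / denom for e in exps])
        if target != ignore_index:
            total += m + math.log(denom) - row[target]  # logsumexp - logit
            count += 1
    return (total / count if count else 0.0), probs


def cross_entropy_backward(probs, targets, ignore_index=-100):
    """Gradient of the mean loss w.r.t. logits: (softmax - one_hot) / count."""
    count = sum(1 for t in targets if t != ignore_index)
    grads = []
    for p, target in zip(probs, targets):
        if target == ignore_index or count == 0:
            grads.append([0.0] * len(p))
        else:
            grads.append([(p_j - (j == target)) / count
                          for j, p_j in enumerate(p)])
    return grads


loss, probs = cross_entropy_forward([[0.0, 0.0], [5.0, 1.0]], [0, -100])
grads = cross_entropy_backward(probs, [0, -100])
big_loss, _ = cross_entropy_forward([[1000.0, 0.0]], [0])
```

The last call is the case the stable form exists for: a naive `exp(1000.0)` overflows, while the max-subtracted version evaluates cleanly.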

Kernel Reference → Model × Kernel Matrix →

Quick Start

Build + Parity Test

git clone https://github.com/antshiv/C-Kernel-Engine.git
cd C-Kernel-Engine
make                  # builds build/libckernel_engine.so
make test             # runs all kernel parity tests vs PyTorch

Requires Linux, GCC, Python 3 + PyTorch for parity tests. AVX2 minimum; AVX-512 recommended.

Inference — Generate C Runtime

make ck-emit \
  CONFIG=path/to/config.json \
  OUT=build/generated_model.c

Emits a complete C file from IR. Load GGUF quantized weights via make gguf-convert.

Full Quickstart →

Training — v7 IR Runtime

. .venv/bin/activate
python version/v7/scripts/ck_run_v7.py init \
  --run /tmp/myrun \
  --template qwen3 \
  --generate-ir --generate-runtime --strict
python version/v7/scripts/ck_run_v7.py train \
  --run /tmp/myrun \
  --backend ck --parity-on

Produces IR1, IR2, layout, generated C runtime, and parity artifacts.

v7 Guide → v7 Runbook →

Operator Spectrum (1 → 8)

Raw Corpus → Tokenizer → Data Prep → Transformer IR → Forward → Backward → Loss → Inference.

Open Spectrum → v7 Runbook →

Project Structure

Focused Source Tree: src/kernels · version/v6.6 · version/v7 (updated 2026-06-24 11:49)

src/kernels
`-- fused
version/v6.6
|-- docs
|-- include
|-- kernel_maps
|-- patches
|-- scripts
|   `-- parity
|-- src
|   |-- generated
|   |-- kernel_config
|   |-- scripts
|   `-- test_generated
|-- templates
|-- test
|-- testing
|-- tests
|-- tools
`-- unittest
version/v7
|-- artifacts
|   `-- svg_dsl
|       |-- gen1_archive_2026-04-05
|       `-- spec_archive_2026-04-08
|-- contracts
|-- data
|   |-- eval_contracts
|   |-- generated
|   |   |-- toy_svg_semantic_shapes_tokenizer
|   |   `-- toy_svg_structured_atoms_tokenizer
|   |-- probe_contracts
|   |-- spec03
|   |   |-- contracts
|   |   |-- holdout
|   |   |-- manifests
|   |   |-- midtrain
|   |   |-- normalized
|   |   |-- pretrain
|   |   |-- raw_assets
|   |   |-- sft
|   |   `-- tokenizer
|   `-- spec04
|       |-- contracts
|       |-- holdout
|       |-- manifests
|       |-- midtrain
|       |-- normalized
|       |-- pretrain
|       |-- raw_assets
|       |-- sft
|       `-- tokenizer
|-- docs
|-- examples
|-- experiments
|   `-- svg_dsl
|       |-- catalog
|       |-- core
|       |-- programs
|       `-- renderers
|-- include
|-- kernel_maps
|-- regression
|-- reports
|   |-- spec12_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec12_gold_mappings
|   |-- spec13b_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec13b_gold_mappings
|   |-- spec14a_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec14a_gold_mappings
|   |-- spec14b_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec14b_gold_mappings
|   |-- spec15a_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec15a_gold_mappings
|   |-- spec15b_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec15b_gold_mappings
|   |-- spec_broader_1_family_packs -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec_broader_1_family_packs
|   `-- spec_broader_1_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec_broader_1_gold_mappings
|-- runs
|   |-- logs
|   `-- overnight_monitor
|       `-- spec10
|-- scripts
|   |-- dataset
|   `-- parity
|-- src
|-- templates
|-- test
|-- tests
|   |-- contracts
|   `-- fixtures
`-- tools
    `-- src

87 directories