C-Kernel-Engine
A CPU-native AI runtime and kernel compiler for auditable inference, training kernels, and distributed CPU execution. Python tooling imports models, builds the IR, generates C, and runs parity checks; the execution path is C-first and compiled with the best native CPU toolchain available.
Why this exists: build lightweight AI systems without sacrificing performance or auditability, keep the full stack visible end-to-end (model import -> IR -> memory layout -> generated C kernels), and scale CPU-first across embedded devices, commodity servers, and owned CPU clusters.
What is C-Kernel-Engine?
C-Kernel-Engine is split into two layers. The tooling layer uses Python for model import, GGUF/safetensors conversion, IR construction, code generation, and PyTorch/llama.cpp parity testing. The execution layer is generated C plus hand-written C kernels with explicit memory offsets, quantization layouts, SIMD dispatch, and native compiler optimization.
The intended deployment model is dependency-light execution, not dependency-free development. The generated runtime can be compiled with GCC or Clang, but the preferred path is to use the strongest native CPU toolchain for the target machine: Intel oneAPI/ICX on Intel CPUs, tuned GCC/Clang on AMD CPUs, ARM toolchains on ARM systems, and standard system libraries for cluster execution.
Generated C, Planned Memory
C-Kernel-Engine uses C as an auditable compiler target, not as a large hand-written pointer maze. The model graph, tensor shapes, lifetimes, kernel contracts, and memory sections are resolved before code emission. The generated C then reads and writes deterministic offsets from planned buffers instead of allocating or guessing inside the hot path.
This is the memory-safety model: avoid most ordinary C mistakes by not hand-authoring the model body in C in the first place. The compiler pipeline emits explicit offsets, optional bounds/canary checks, cache-aligned sections, and reproducible layout reports; ASAN, Valgrind, PyTorch parity, llama.cpp parity, perf, VTune, and Advisor validate the artifact from correctness down to memory behavior.
The result is still native C, but the important contract is visible: IR op -> tensor lifetime -> byte offset -> generated pointer -> kernel call. That makes the execution path deterministic, inspectable, and measurable for applications where memory layout is the application.
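The offset-planning idea can be sketched in a few lines of Python. This is a minimal illustration, not the project's planner: the 64-byte cache line, the `Slot` record, `plan_layout`, and the tensor names are all assumptions made for the example, and tensor lifetimes (buffer reuse) are ignored.

```python
from dataclasses import dataclass

CACHE_LINE = 64  # bytes; assumed alignment for every tensor


def align_up(n: int, alignment: int = CACHE_LINE) -> int:
    """Round n up to the next multiple of alignment."""
    return (n + alignment - 1) // alignment * alignment


@dataclass(frozen=True)
class Slot:
    name: str
    offset: int
    nbytes: int


def plan_layout(tensors: list[tuple[str, int]]) -> tuple[list[Slot], int]:
    """Assign each (name, nbytes) a fixed, cache-aligned offset in one buffer."""
    slots, cursor = [], 0
    for name, nbytes in tensors:
        cursor = align_up(cursor)
        slots.append(Slot(name, cursor, nbytes))
        cursor += nbytes
    return slots, align_up(cursor)


# Hypothetical tensors: every offset is known before any code is emitted.
slots, total = plan_layout([("tok_emb", 1000), ("attn_q", 4096), ("attn_out", 130)])
for s in slots:
    print(f"{s.name:10s} offset={s.offset:6d} nbytes={s.nbytes}")
print("total bytes:", total)
```

Because the plan is a pure function of the tensor list, the same input always yields the same layout report, which is what makes the emitted pointers reproducible.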
The north star is distributed CPU inference and training: model/layer ownership across CPU nodes, explicit activation and gradient movement, MPI/RDMA-style communication where available, and pipeline parallelism over hardware that can be bought, powered, audited, and scaled incrementally.
The first version of C-Kernel-Engine tried to drive generation from HuggingFace config.json alone. That was not enough: config metadata does not fully capture quantized weight packing/layout details, nor all runtime stitching constraints needed for reliable kernel binding.
The current pipeline pulls GGUF or safetensors metadata as the weight/layout source of truth, parses it, and combines that with C-Kernel-Engine templates and config hints. That merge produces deterministic weights.bump + sidecar metadata. From that point forward, the same IR chain can target v8 inference and v7 training/backprop paths.
The architecture is deliberately split into a smart front-end (template selection, manifest resolution, IR construction, backward synthesis, layout planning, fusion detection) and a dumb back-end (a codegen emitter that only writes what the lowered IR already decided). This makes correctness auditable at every stage.
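A "dumb" emitter of this kind might look like the sketch below. The lowered-IR record shape (`kernel`, `offsets`, `n`) and the kernel names `ck_rmsnorm_f32` and `ck_gelu_f32` are hypothetical, chosen only to show that the emitter formats decisions that were already made upstream.

```python
def emit_c(lowered_ops: list[dict]) -> str:
    """Write one C call per lowered op; no decisions are made here."""
    lines = ["void model_forward(unsigned char *buf) {"]
    for op in lowered_ops:
        # Offsets were fixed by the layout planner; the emitter only prints them.
        args = ", ".join(f"(float *)(buf + {off})" for off in op["offsets"])
        lines.append(f"    {op['kernel']}({args}, {op['n']});")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical lowered IR: kernel choice, byte offsets, and sizes are already resolved.
lowered = [
    {"kernel": "ck_rmsnorm_f32", "offsets": [0, 4096], "n": 1024},
    {"kernel": "ck_gelu_f32", "offsets": [4096, 8192], "n": 1024},
]
print(emit_c(lowered))
```

With no branching on model architecture inside `emit_c`, any bug in the generated C can be traced back to a specific IR record rather than to emitter logic.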
Why the Bump Format Exists
- weights.bump is a contiguous binary blob optimized for deterministic C runtime access by offset.
- The sidecar (for example weights_manifest.json) records tensor names, offsets, shapes, dtypes, quantization metadata, and template bindings.
- bump + sidecar is the handoff contract that gives IR1 enough concrete information to lower, codegen, and run inference or backprop paths without hidden runtime guesses.
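The blob-plus-sidecar contract can be illustrated with a toy packer. This is a sketch under stated assumptions, not the real weights.bump format: the manifest fields (`name`, `offset`, `shape`, `dtype`), the 64-byte alignment, and the FP32-only payload are invented for the example.

```python
import json
import struct


def pack_bump(tensors: dict[str, list[float]], align: int = 64) -> tuple[bytes, str]:
    """Pack FP32 tensors into one blob and return (blob, manifest JSON)."""
    blob, entries = bytearray(), []
    for name, values in tensors.items():
        blob.extend(b"\0" * (-len(blob) % align))  # pad to the next aligned offset
        entries.append({"name": name, "offset": len(blob),
                        "shape": [len(values)], "dtype": "fp32"})
        blob.extend(struct.pack(f"<{len(values)}f", *values))
    return bytes(blob), json.dumps({"tensors": entries})


def read_tensor(blob: bytes, manifest: str, name: str) -> list[float]:
    """Look up a tensor by name and decode it directly at its recorded offset."""
    entry = next(e for e in json.loads(manifest)["tensors"] if e["name"] == name)
    count = entry["shape"][0]
    return list(struct.unpack_from(f"<{count}f", blob, entry["offset"]))


blob, manifest = pack_bump({"wq": [1.0, 2.0, 3.0], "wk": [0.5, -0.5]})
print(manifest)
print(read_tensor(blob, manifest, "wk"))
```

The reader never scans or parses the blob; everything it needs is in the sidecar, which is the property the bullets above describe.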
The IR Pipeline
Both inference and training share the same IR foundation. The inference path (v8) adds quantized dispatch, model-family contracts, multimodal coverage, and runtime profiling hooks. The training path (v7) adds backward synthesis, explicit gradient accumulation, and a separate training memory layout — built on top of the same IR1/IR2 lowering chain.
[Figure: IR pipeline diagram]
- Inference (v8): Q4_0 · Q4_K · Q5_K · Q6_K · Q8_0 · BF16 · FP32; layout → layout.json; no architecture guessing in the emitter; GEMM / GEMV quantized dispatch; MEGA fused kernels (prefill + decode); OpenMP thread pool.
- Training (v7): accumulation windows → ir2_train_backward.json; canary sentinels between sections → layout_train.json; canary write + verify by phase; AdamW · grad accumulation windows; cross-entropy (stable log-sum-exp); PyTorch oracle parity gates.
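The canary idea from the training path can be sketched as follows. The 8-byte marker value, the section names and sizes, and the helper names are hypothetical; the point is only that a fixed pattern between sections turns an out-of-bounds write into a detectable event.

```python
CANARY = bytes.fromhex("deadbeefcafef00d")  # hypothetical 8-byte marker


def build_arena(section_sizes: dict[str, int]):
    """Lay out sections back to back, writing a canary after each one."""
    arena, offsets, canaries = bytearray(), {}, []
    for name, size in section_sizes.items():
        offsets[name] = len(arena)
        arena.extend(bytes(size))      # zero-filled section payload
        canaries.append(len(arena))    # remember where the canary lives
        arena.extend(CANARY)
    return arena, offsets, canaries


def verify_canaries(arena: bytearray, canaries: list[int]) -> list[int]:
    """Return the offsets of any canaries that were overwritten."""
    return [c for c in canaries if arena[c:c + len(CANARY)] != CANARY]


arena, offsets, canaries = build_arena({"weights": 16, "activations": 8, "grads": 16})
assert verify_canaries(arena, canaries) == []  # intact right after layout

# Simulate a kernel writing two bytes past the end of the 8-byte "activations" section.
start = offsets["activations"]
arena[start:start + 10] = b"\x01" * 10
print("corrupted canaries at:", verify_canaries(arena, canaries))
```

Running the verify step after each phase narrows a corruption down to the phase that caused it and the section boundary it crossed.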
The kernel library is the shared foundation for both pipelines: the kernels used in inference also serve as the primitives for the training backward pass.
Featured Deep Dive: Gated DeltaNet
Qwen3.5/qwen3next-style recurrent attention uses a compact state matrix instead of scanning the full KV-cache in every recurrent layer. CK implements that path as a dedicated Gated DeltaNet kernel family with parity-focused docs.
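To make the "compact state matrix" concrete, here is a pure-Python sketch of one recurrent step using the gated delta rule as it is commonly written, S_t = α_t · S_{t-1}(I − β_t k_t k_tᵀ) + β_t v_t k_tᵀ with output S_t q_t. Whether CK's kernels use exactly this parametrization, normalization, or memory layout is an assumption; the function name and argument shapes are illustrative only.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step over a d_v x d_k state matrix S (list of rows).

    S_new = alpha * S @ (I - beta * k k^T) + beta * v k^T
    out   = S_new @ q
    """
    d_v, d_k = len(S), len(k)
    S_new = []
    for i in range(d_v):
        sk = sum(S[i][j] * k[j] for j in range(d_k))  # (S k)_i: what the state currently recalls for key k
        S_new.append([alpha * (S[i][j] - beta * sk * k[j]) + beta * v[i] * k[j]
                      for j in range(d_k)])
    out = [sum(S_new[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S_new, out


S = [[0.0, 0.0], [0.0, 0.0]]  # d_v = 2, d_k = 2; size does not grow with sequence length
k = [1.0, 0.0]
S, out1 = gated_delta_step(S, q=k, k=k, v=[3.0, 4.0], alpha=1.0, beta=1.0)
S, out2 = gated_delta_step(S, q=k, k=k, v=[5.0, 6.0], alpha=1.0, beta=1.0)
print(out1, out2)
```

With α = β = 1, the second step replaces the value stored under key `k` instead of adding to it, and the per-token cost depends only on the state size, not on how many tokens came before.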
Correctness
Every claim of correctness is backed by a specific, reproducible test. Parity is not a property we assert — it is something we measure at every kernel boundary, training step, and memory section.
Kernel Parity Gates
| Gate | What it checks |
|---|---|
| max diff < 1e-5 | All forward and backward kernels tested via Python ctypes against PyTorch autograd. Fails hard on violation. |
| llama.cpp parity | Quantized GEMM/GEMV kernels (Q4_K, Q5_K, Q6_K, Q8_0) verified against llama.cpp reference output, not just PyTorch. Both references must agree. |
| per-kernel test | Each kernel ships its own Python parity test. `make test` runs the full suite. |
| model matrix | `make v6.6-validate-matrix` runs the full kernel contract suite across GPT-2, Qwen2, Qwen3, Gemma3. |
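A parity gate of the "max diff < 1e-5, fail hard" kind can be sketched without any native library. The helper names below are hypothetical, and the tanh GELU approximation is used only as a stand-in for a candidate that is close but not within tolerance; it is not claimed to be the project's fast variant.

```python
import math

TOLERANCE = 1e-5  # threshold quoted in the table above


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("shape mismatch")
    return max(abs(x - y) for x, y in zip(a, b))


def parity_gate(name, candidate, reference, tol=TOLERANCE):
    """Raise if the candidate is not within tol of the reference (NaN also fails)."""
    diff = max_abs_diff(candidate, reference)
    if not diff < tol:
        raise AssertionError(f"{name}: max diff {diff:.3e} is not below {tol:.0e}")
    return diff


def gelu_exact(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


xs = [i / 4 for i in range(-12, 13)]
reference = [gelu_exact(x) for x in xs]
print("exact vs exact:", parity_gate("gelu_exact", [gelu_exact(x) for x in xs], reference))
try:
    parity_gate("gelu_tanh", [gelu_tanh(x) for x in xs], reference)
except AssertionError as err:
    print("gate failed:", err)
```

Writing the check as `not diff < tol` rather than `diff >= tol` means a NaN in either output fails the gate instead of slipping through.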
Training Parity (v7)
| Gate | What it checks |
|---|---|
| oracle cadence | Step-level and slot-level checks against a live PyTorch oracle. Run with `--parity-on`. |
| first-divergence | Operator + tensor level tracing. Reports the exact step, op, and tensor where CK diverges from PyTorch. |
| canary sentinels | Memory integrity markers between training phases (weights / activations / grads / optimizer state). Written and verified each phase. |
| 850-step repro | Deterministic parity run across all three backends (c / c_ptref / torch). Artifacts in train_runtime_parity_*_latest.json. |
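First-divergence tracing amounts to walking two aligned traces and stopping at the first tensor that exceeds tolerance. The record shape (`step`, `op`, `tensor`, `values`) and the sample values below are hypothetical, not the project's artifact format.

```python
def first_divergence(ck_trace, ref_trace, tol=1e-5):
    """Return (step, op, tensor, diff) for the first mismatch, or None if all agree."""
    for ck, ref in zip(ck_trace, ref_trace):
        key = (ck["step"], ck["op"], ck["tensor"])
        if key != (ref["step"], ref["op"], ref["tensor"]):
            raise ValueError(f"traces are misaligned at {key}")
        diff = max(abs(a - b) for a, b in zip(ck["values"], ref["values"]))
        if not diff < tol:
            return (*key, diff)
    return None


def rec(step, op, tensor, values):
    return {"step": step, "op": op, "tensor": tensor, "values": values}


ref_trace = [rec(0, "rmsnorm", "x_norm", [0.5, 0.25]),
             rec(0, "attention", "attn_out", [0.1, 0.2]),
             rec(0, "mlp", "mlp_out", [1.0, 2.0])]
ck_trace = [rec(0, "rmsnorm", "x_norm", [0.5, 0.25]),
            rec(0, "attention", "attn_out", [0.1, 0.2004]),
            rec(0, "mlp", "mlp_out", [1.0, 2.5])]
print(first_divergence(ck_trace, ref_trace))
```

The later `mlp_out` mismatch is larger, but it is downstream of the attention error; reporting only the earliest divergence points at the op that most likely introduced it.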
Performance
Profiling + SIMD
| Feature | Details |
|---|---|
| VTune + flamegraph | Hardware counter integration via the IR Visualizer. Load run artifacts and inspect per-op hotspots with full flamegraph data. |
| AVX-512 MEGA kernels | RMSNorm + QKV + RoPE fused. MLP + Residual fused. Fusion patterns detected automatically at IR2. (v6.6) |
| AMX | Intel Advanced Matrix Extensions support coming in a future version. |
| OpenMP thread pool | Parallel prefill and decode passes. Token-parallel MLP. (v6.6) |
| Bump allocator | Cache-aligned, contiguous weight layout. Zero-copy load from GGUF. No heap fragmentation in hot paths. |
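Automatic fusion detection can be pictured as pattern matching over the op sequence. The sketch below is a deliberate simplification: it assumes a linear list of op names, whereas a real IR pass would also have to check dataflow between the ops. The op names and fused-kernel names are hypothetical.

```python
# Hypothetical patterns mirroring the two fusions named in the table above.
FUSION_PATTERNS = {
    ("rmsnorm", "qkv_proj", "rope"): "mega_rmsnorm_qkv_rope",
    ("mlp", "residual_add"): "mega_mlp_residual",
}


def fuse(ops: list[str]) -> list[str]:
    """Greedy left-to-right rewrite: replace each matched run with its fused op."""
    out, i = [], 0
    while i < len(ops):
        for pattern, fused in FUSION_PATTERNS.items():
            if tuple(ops[i:i + len(pattern)]) == pattern:
                out.append(fused)
                i += len(pattern)
                break
        else:  # no pattern starts at position i; keep the op unchanged
            out.append(ops[i])
            i += 1
    return out


layer = ["rmsnorm", "qkv_proj", "rope", "attention", "out_proj", "residual_add",
         "rmsnorm", "mlp", "residual_add"]
print(fuse(layer))
```

Because the rewrite happens in the IR, the emitter only ever sees the fused op name and never has to decide whether fusion applies.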
Tokenizer — 16.5x Faster
Trie-based BPE/WordPiece tokenization. O(k) lookup vs O(n×k). No memcpy in hot path. Full UTF-8 multilingual support.
| Text Length | C-Kernel | PyTorch | Speedup |
|---|---|---|---|
| 200 chars | 20,941/s | 4,616/s | 4.5x |
| 3,000 chars | 95,923/s | 6,197/s | 15.5x |
| 15,000 chars | 114,463/s | 6,237/s | 18.4x |
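Average: 16.56x faster than PyTorch/tiktoken as reported. The three rows shown above average about 12.8x on their own, so the reported figure presumably includes measurements not listed here.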
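The O(k) trie lookup can be illustrated with a greedy longest-match encoder. This sketch is WordPiece-like and leaves out BPE merge ranks, byte fallback, and pre-tokenization; the vocabulary, the `"id"` terminal key, and the function names are assumptions for the example, not the project's tokenizer.

```python
def build_trie(vocab: dict[str, int]) -> dict:
    """Nested-dict trie; a node holding key "id" marks the end of a token."""
    root: dict = {}
    for token, token_id in vocab.items():
        node = root
        for ch in token:
            node = node.setdefault(ch, {})
        node["id"] = token_id  # two-character key, so it cannot collide with a one-character edge
    return root


def encode(text: str, trie: dict, unk_id: int = 0) -> list[int]:
    """Greedy longest-match: each token costs one walk of at most k characters."""
    ids, i = [], 0
    while i < len(text):
        node, best_id, best_len = trie, unk_id, 1
        for j in range(i, len(text)):
            node = node.get(text[j])
            if node is None:
                break
            if "id" in node:
                best_id, best_len = node["id"], j - i + 1
        ids.append(best_id)
        i += best_len
    return ids


vocab = {"<unk>": 0, "un": 1, "believ": 2, "able": 3, "a": 4, "b": 5, "unbeliev": 6}
trie = build_trie(vocab)
print(encode("unbelievable", trie))
print(encode("unx", trie))
```

Each position is resolved by a single descent through the trie, so the work per token is bounded by the longest vocabulary entry rather than by the vocabulary size.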
Kernel Library
All kernels implement forward and backward passes. Each ships with a Python ctypes parity test that loads libckernel_engine.so and compares output against PyTorch autograd with max diff < 1e-5:
| Kernel | Forward | Backward | Notes |
|---|---|---|---|
| attention | Yes | Yes | Causal mask, GQA, sliding-window, head-major layout |
| rope | Yes | Yes | Rotary position embeddings, precomputed cache |
| rmsnorm | Yes | Yes | Fused normalization, rstd caching, learned scale |
| swiglu | Yes | Yes | Fused gate activation for Llama/Qwen-style MLP |
| softmax | Yes | Yes | Causal row-wise, numerically stable |
| layernorm | Yes | Yes | Rolled and unrolled variants |
| gelu | Yes | Yes | Exact and fast approximation variants |
| cross_entropy | Yes | Yes | Stable log-sum-exp, mean reduction, ignore-index |
| gemm | Yes | N/A | Blocked serial, AVX-512, parallel; Q4_K · Q5_K · Q6_K · Q8_0 · FP32 |
| mlp | Yes | Yes | FC1 + activation + FC2, token-parallel |
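As one worked example from the table, the cross_entropy row (stable log-sum-exp, mean reduction, ignore-index) can be written as a short reference function. This is a plain-Python sketch, not the C kernel: the `ignore_index=-100` default follows PyTorch's convention, and returning 0.0 when every target is ignored is a choice made here for simplicity.

```python
import math


def cross_entropy(logits, targets, ignore_index=-100):
    """Mean cross-entropy over the rows whose target is not ignore_index."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue
        m = max(row)  # subtract the row max so exp() cannot overflow
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse - row[target]  # -log softmax(row)[target]
        count += 1
    return total / count if count else 0.0


logits = [[1000.0, 0.0],   # a naive exp(1000.0) would overflow
          [0.0, 0.0],
          [5.0, 1.0]]      # this row is skipped by ignore_index
targets = [0, 1, -100]
loss = cross_entropy(logits, targets)
print(loss)
```

The first row contributes zero loss and the second contributes log 2, so the mean over the two counted rows is log(2) / 2; the ignored row changes neither the sum nor the divisor.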
Quick Start
Build + Parity Test
```sh
git clone https://github.com/antshiv/C-Kernel-Engine.git
cd C-Kernel-Engine
make        # builds build/libckernel_engine.so
make test   # runs all kernel parity tests vs PyTorch
```
Requires Linux and GCC; Python 3 and PyTorch are needed for the parity tests. AVX2 is the minimum; AVX-512 is recommended.
Inference — Generate C Runtime
```sh
make ck-emit \
  CONFIG=path/to/config.json \
  OUT=build/generated_model.c
```
Emits a complete C file from IR. Load GGUF quantized weights via make gguf-convert.
Training — v7 IR Runtime
```sh
. .venv/bin/activate
python version/v7/scripts/ck_run_v7.py init \
  --run /tmp/myrun \
  --template qwen3 \
  --generate-ir --generate-runtime --strict
python version/v7/scripts/ck_run_v7.py train \
  --run /tmp/myrun \
  --backend ck --parity-on
```
Produces IR1, IR2, layout, generated C runtime, and parity artifacts.
Operator Spectrum (1 → 8)
Raw Corpus → Tokenizer → Data Prep → Transformer IR → Forward → Backward → Loss → Inference.
Project Structure
```
src/kernels
`-- fused
version/v6.6
|-- docs
|-- include
|-- kernel_maps
|-- patches
|-- scripts
|   `-- parity
|-- src
|   |-- generated
|   |-- kernel_config
|   |-- scripts
|   `-- test_generated
|-- templates
|-- test
|-- testing
|-- tests
|-- tools
`-- unittest
version/v7
|-- artifacts
|   `-- svg_dsl
|       |-- gen1_archive_2026-04-05
|       `-- spec_archive_2026-04-08
|-- contracts
|-- data
|   |-- eval_contracts
|   |-- generated
|   |   |-- toy_svg_semantic_shapes_tokenizer
|   |   `-- toy_svg_structured_atoms_tokenizer
|   |-- probe_contracts
|   |-- spec03
|   |   |-- contracts
|   |   |-- holdout
|   |   |-- manifests
|   |   |-- midtrain
|   |   |-- normalized
|   |   |-- pretrain
|   |   |-- raw_assets
|   |   |-- sft
|   |   `-- tokenizer
|   `-- spec04
|       |-- contracts
|       |-- holdout
|       |-- manifests
|       |-- midtrain
|       |-- normalized
|       |-- pretrain
|       |-- raw_assets
|       |-- sft
|       `-- tokenizer
|-- docs
|-- examples
|-- experiments
|   `-- svg_dsl
|       |-- catalog
|       |-- core
|       |-- programs
|       `-- renderers
|-- include
|-- kernel_maps
|-- regression
|-- reports
|   |-- spec12_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec12_gold_mappings
|   |-- spec13b_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec13b_gold_mappings
|   |-- spec14a_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec14a_gold_mappings
|   |-- spec14b_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec14b_gold_mappings
|   |-- spec15a_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec15a_gold_mappings
|   |-- spec15b_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec15b_gold_mappings
|   |-- spec_broader_1_family_packs -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec_broader_1_family_packs
|   `-- spec_broader_1_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec_broader_1_gold_mappings
|-- runs
|   |-- logs
|   `-- overnight_monitor
|       `-- spec10
|-- scripts
|   |-- dataset
|   `-- parity
|-- src
|-- templates
|-- test
|-- tests
|   |-- contracts
|   `-- fixtures
`-- tools
    `-- src

87 directories
```