Version History & Roadmap
C-Kernel-Engine development roadmap. Every version builds on the IR graph and kernel foundation, enabling new capabilities without breaking existing code.
New: CK-Native Training Curriculum
Use the curriculum page to see how v7 training foundations feed the rest of the roadmap:
v8-v11 multimodal training, v12-v13 sparse scale, v14 adapters,
v15 embedded runtime, and v16-v17 interpretability plus control integration.
Architecture Evolution
Click through each era to explore the journey from C orchestrator to pure IR-driven code generation.
IR Evolution: v6.x Orchestrator -> v6.6 Inference IR -> v7 Training IR
The progression is explicit: v6.x hardcoded orchestration, v6.5 generated-hybrid paths, v6.6 pure IR-driven inference, and v7 IR-driven training lowering with backward synthesis, contiguous training layout, and runtime safety diagnostics.
v6.x used a C orchestrator that dispatched to kernels based on operation type. No code generation - the orchestrator called kernels directly.
Code Path
- ✗ Code generation
- ✓ C orchestrator
- ✓ Kernel dispatch
- ✗ IR graph
Kernels
- gemm_nt_q4_k
- gemv_q8_0
- attention_decode
- rope_forward_qk
Limitations
- No fusion patterns
- Fixed execution order
- Manual kernel dispatch
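The v6.x dispatch style described above can be sketched as a plain C switch over an op tag. This is an illustrative reconstruction, not the real engine API; the enum, function names, and return values are assumptions standing in for kernels like gemm_nt_q4_k and rope_forward_qk.

```c
#include <assert.h>

/* Hypothetical sketch of v6.x-era dispatch: a C orchestrator branches on an
 * op tag and calls a kernel directly. No IR, no codegen, no fusion. */
typedef enum { OP_GEMM, OP_GEMV, OP_ATTN_DECODE, OP_ROPE } op_type;

/* Stub kernels standing in for gemm_nt_q4_k, gemv_q8_0, attention_decode,
 * and rope_forward_qk; return values are markers for the example only. */
static int gemm_kernel(void)  { return 1; }
static int gemv_kernel(void)  { return 2; }
static int attn_kernel(void)  { return 3; }
static int rope_kernel(void)  { return 4; }

/* Fixed execution order, manual kernel dispatch: the limitations listed
 * above fall directly out of this structure. */
static int dispatch(op_type op) {
    switch (op) {
        case OP_GEMM:        return gemm_kernel();
        case OP_GEMV:        return gemv_kernel();
        case OP_ATTN_DECODE: return attn_kernel();
        case OP_ROPE:        return rope_kernel();
    }
    return -1;
}
```

Because the switch is hardcoded, adding a fusion pattern or reordering execution means editing the orchestrator itself, which is exactly what the IR versions remove.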
v6.5 introduced code generation from IR, but the generated C code contained hardcoded logic paths. Templates determined structure, not the IR.
Code Generation
- ✗ Pure IR-driven
- ✓ Template-based gen
- ✓ IR v1 defined
- ✗ Fusion detection
IR Structure
- ops: ["layer_norm", "matmul"]
- attrs: {eps, n_heads}
- No fusion annotations
- Templates hardcoded
New in v6.5
- Memory layout defined
- Weight manifest
- BUMP format
- Template codegen
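The IR v1 card above implies a shape roughly like the following. This is an illustrative reconstruction from the fields listed (ops, attrs), not a real artifact; the eps and n_heads values are placeholders, and the absence of any fusion annotations reflects the v6.5 limitation noted above.

```json
{
  "ops": ["layer_norm", "matmul"],
  "attrs": { "eps": 1e-5, "n_heads": 32 }
}
```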
v6.6 achieves pure IR-driven code generation for the core inference flow. Ops are emitted from IR, kernels selected automatically, memory from layout.json. Model dimensions like attn_out_dim are now part of the config contract (parsed from weights), reducing hardcoded paths.
See the IR Pipeline v6.6 visual guide for templates → IR1/IR2 → lowering → memory layout → codegen.
IR-Driven (Correct)
- ✓ Ops from IR
- ✓ Kernel selection auto
- ✓ Memory from layout.json
- ✓ Fusion patterns
- ✓ Config contract (attn_out_dim)
MEGA Kernels
- mega_fused_attention_prefill
- mega_fused_outproj_mlp
- Norm + QKV + RoPE fuse
- MLP + Residual fuse
Tech Debt (Qwen2-specific)
- RoPE scaling type
- KV cache layout
- Activation function
- Chat template selection
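The "kernel selection auto" point can be made concrete with a small table lookup: instead of hardcoded call sites, the emitter maps an IR op name to a kernel function pointer. This is a hedged sketch, not the real codegen output; the table contents and return values are assumptions loosely modeled on the MEGA kernel names above.

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>

/* Illustrative sketch of v6.6-style automatic kernel selection: the IR op
 * name is looked up in a table rather than wired into the orchestrator. */
typedef int (*kernel_fn)(void);

/* Stubs standing in for mega_fused_attention_prefill / mega_fused_outproj_mlp. */
static int mega_attn_prefill(void) { return 10; }
static int mega_outproj_mlp(void)  { return 20; }

static const struct { const char *ir_op; kernel_fn fn; } kernel_table[] = {
    { "fused_attention_prefill", mega_attn_prefill },
    { "fused_outproj_mlp",       mega_outproj_mlp  },
};

/* Select a kernel from an IR op name; NULL when no kernel matches, which a
 * real pipeline would surface as a lowering error. */
static kernel_fn select_kernel(const char *ir_op) {
    for (size_t i = 0; i < sizeof kernel_table / sizeof kernel_table[0]; i++)
        if (strcmp(kernel_table[i].ir_op, ir_op) == 0)
            return kernel_table[i].fn;
    return NULL;
}
```

New fusion patterns then become new table rows driven by IR annotations, not edits to control flow.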
v7 keeps the same IR-first architecture and adds training-specific lowering: IR1 train-forward, IR2 backward synthesis with explicit gradient accumulation, contiguous training memory layout, generated C runtime, and canary/layout diagnostics.
See v7 Backprop IR Pipeline for the full visual walk-through.
Lowered Artifacts
- ✓ ir1_train_forward.json
- ✓ ir2_train_backward.json
- ✓ layout_train.json
- ✓ generated_train_runtime_v7.c
Safety + Diagnostics
- ✓ canary checks by phase
- ✓ layout audit report
- ✓ first-divergence tracing
- ✓ CK vs PyTorch parity gates
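The canary checks mentioned above can be sketched as sentinel words bracketing each tensor region, verified per phase. This is a hedged illustration of the idea, assuming nothing about the real v7 diagnostics: the sentinel value, struct layout, and function names are all invented for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of a canary-style buffer guard: sentinel words before and after
 * the payload catch overruns when checked between phases. Values and
 * layout are illustrative, not the shipped diagnostics. */
#define CANARY 0xC0FFEE42u

typedef struct {
    uint32_t head;       /* canary before the payload */
    float    data[64];   /* tensor payload */
    uint32_t tail;       /* canary after the payload */
} guarded_buf;

static void guard_init(guarded_buf *b) {
    b->head = CANARY;
    b->tail = CANARY;
    memset(b->data, 0, sizeof b->data);
}

/* Returns 1 if both canaries are intact, 0 if an overrun clobbered one. */
static int guard_check(const guarded_buf *b) {
    return b->head == CANARY && b->tail == CANARY;
}
```

Checking guards at phase boundaries (forward, backward, optimizer step) is what lets a layout bug be localized to the phase that first corrupts a canary.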
Next Lowering Steps
- Execution-plan JSON (train_exec_plan.json)
- Threaded GEMM split policy (M/N/K)
- Deterministic reduction contracts
- One-command operator train gate
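The "deterministic reduction contracts" item deserves a one-function illustration: each thread writes its partial sum to a fixed slot, and a single pass combines the slots in index order, so the floating-point rounding sequence is identical run to run regardless of thread completion order. The function name is an assumption, not the engine's API.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of a deterministic reduction: partials land in per-thread slots,
 * then are combined in slot order. Because float addition is not
 * associative, fixing the combine order is what makes the result
 * bit-reproducible across runs. */
static float reduce_in_order(const float *partials, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)   /* fixed order: slot 0, 1, 2, ... */
        acc += partials[i];
    return acc;
}
```

An atomic or work-stealing reduction would be faster but order-dependent; the contract trades a little throughput for replayable training runs.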
Detailed Roadmap
| Version | Focus | Key Features | Depends On | Status |
|---|---|---|---|---|
| v6.6 | IR Fusion + Parallel | IR-driven fusion kernels, OpenMP parallelization, MEGA kernels | Base | Released |
| v7.0 | Training IR + Data Pipeline Foundation | IR1 train-forward + IR2 backward synthesis, Stage A/Stage B dataset preparation (ASCII cleanup + tokenizer roundtrip gates), v7 init/train/parity CLI flow | v6.6 | Building |
| v7.1 | Runtime Safety + Diagnostics | Contiguous train memory layout, canary diagnostics, CK-vs-PyTorch parity regimen (epoch + grad-accum sweeps), deterministic replay + drift triage artifacts, IR visualizer training-dashboard hardening (artifact ingestion + bug-fix cadence), and GGUF inference contract stabilization across Gemma / Qwen3.5 / Nanbeige bring-up | v7.0 | Building |
| v7.2 | Threaded Training Runtime | IR3 execution plan for dispatch, threaded GEMM policy (split M/N/K), deterministic reductions | v7.1 | Planned |
| v8.0 | Vision Encoder | Patch embedding lowering, image positional encoding, ViT support, and block-local multimodal stitching groundwork | v7.x training | Building |
| v9.0 | Vision Training | Backward through vision encoder, image classification fine-tuning | v8.0 | Planned |
| v10.0 | Audio Encoder | Conformer/Whisper-style encoder, spectrogram preprocessing | v7.x training | Planned |
| v11.0 | Audio Training | Backward through audio encoder, speech recognition training | v10.0 | Planned |
| v12.0 | MoE - Mixture of Experts | Expert selection kernels, router computation, load balancing loss | v7.x training | Planned |
| v13.0 | MoE Backward | Backward pass through MoE gates and experts | v12.0 | Planned |
| v14.0 | LoRA / QLoRA | Adapter injection, low-rank decomposition, 4-bit LoRA support | v7.x training | Planned |
| v15.0 | Embedded AI Inference | Bring LLM/AI inference to constrained flight-control and robotics hardware with deterministic runtime contracts and portable HAL integration | v7.x training sign-off + v1-v14 kernel families | Planned |
| v16.0 | Mechanistic Interpretability | Train sparse autoencoders on traced activations, stitch features back to IR ops/kernels, and add causal intervention tooling for explainable model behavior | v15.0 | Planned |
| v17.0 | LLM + Vision Policy Integration | Integrate embedded LLM + vision policy outputs into real-time control loops and validate end-to-end autonomy behavior on robotics-class systems | v16.0 | Planned |
| v18.0 | Distributed HPC Training | Multi-node CPU training with RDMA (InfiniBand/RoCE), gradient compression, and HPC cluster orchestration for large-scale model training (10-100B parameters) on commodity hardware | v17.0 | Planned |
Version Timeline
"Training IR Runtime + Operator Cockpit" - From data preparation and parity-first training to fully lowered train execution
- PR1: v7 CLI contracts + strict validation + inference smoke gate
- PR2: Generated train runtime path wired (ck backend callable)
- PR3: Contiguous train layout + canary diagnostics + strict first-divergence tracing
- PR4 (current): Oracle cadence + drift triage artifacts + deterministic replay hardening + training telemetry ingestion for IR visualizer
- PR5: Threaded GEMM dispatch lowered from IR3 execution plan
- PR6: Run-dir artifact unification + one-command v7 train gate
- IR Visualizer (ongoing): training tabs, parity cockpit, data/tokenizer transparency, and robustness bug fixes (path loading, mode wiring, runtime JS errors).
Operator history highlights (training stack):
- Data preparation: Stage 0.5/0.55 corpus path (docs SVG assets -> UTF-8/ASCII cleanup -> manifests + tokenizer map).
- Stage A / Stage B flow: pretrain/midtrain dataset tracks with checkpoint promotion and run-scoped reports.
- PyTorch parity analysis: first-divergence localizer, 10-epoch cadence, grad-accum sweeps, replay determinism checks.
- IR visualizer training cockpit: training dashboard, parity tracker, data/tokenizer transparency, and operator runbook command blocks.
Recent inference/runtime fixes on the same branch:
- Gemma 3 parity fix: RoPE was being applied with the wrong Q/K pairing style. Gemma expects split-half RoPE; the runtime now reads the exported `rope_layout` and selects the matching `rope_qk` kernel automatically.
- Qwen3.5 recurrent bring-up: hybrid recurrent-attention decode, Gated DeltaNet parity, and contract-driven visible/suppressed thinking now run on the built-in C tokenizer path.
- Nanbeige bring-up lane: SentencePiece + ChatML contracts stabilized on the C tokenizer path, with untied `output.weight` preserved and long-think output treated as model behavior.
Recent v7 milestone commits:
- `524791c4` fix(v7/regression): inference-family regression gate hardened, contract audit added, and `make regression-fast` promoted into nightly CI.
- `af99d411` fix(v7/inference): chat contracts unified, Gemma split-half RoPE selected from exported metadata, Qwen families kept on the C tokenizer path, and Nanbeige contract bring-up stabilized.
- `80d88071` docs(v7): model-kernel matrix, runbook, and version-history pages updated to record the family bring-up fixes and canonical cache-backed bring-up flow.
Execution Order (Current Priority: Training First)
Methodical completion path: finish v7.x training sign-off end-to-end before opening active v15 embedded implementation. v15-v17 stay visible as downstream roadmap, but are gated by training completion criteria.
| Phase | Objective | Exit Criteria (must pass) | Status |
|---|---|---|---|
| T0 | Runbook Lock | Single canonical Stage A/B command path + deterministic env contract documented and reproducible on clean checkout | Building |
| T1 | Data + Tokenizer Integrity | ASCII/SVG cleanup gates, tokenizer roundtrip gates, manifest invariants all passing | Building |
| T2 | Train Correctness | CK-vs-PyTorch parity cadence, first-divergence triage closure, deterministic replay stability | Building |
| T3 | Runtime + Throughput Stability | IR3 threaded dispatch, deterministic reductions, no regressions in train contracts/smokes | Planned |
| T4 | Release Gate + Handoff | One-command v7 train gate green, run-dir artifacts unified, operator docs frozen for release handoff | Planned |
v15 Entry Criteria (Hard Gate)
- Stage A and Stage B runs reproducible from the runbook on a clean machine.
- Parity/drift gates pass at declared cadence with stable replay artifacts.
- IR visualizer training cockpit stable for run-dir artifacts (data, tokenizer, parity, profile).
- v7 train release checklist and docs sign-off completed.
Embedded + Interpretability Arc (v15 -> v17)
v1-v14 expands model/training architecture options first; v15 lands embedded deployment targets with hard realtime constraints; v16 adds mechanistic interpretability to de-risk policy behavior before v17 autonomy integration.
| v15 Component | Target | Description |
|---|---|---|
| Memory footprint | <= 512KB runtime core | Sub-512KB inference/runtime core for MCU-class deployment (model weights budget tracked separately). |
| Latency targets | < 10ms control-inference step | Cortex-M baseline loop target for control inference; higher-throughput token targets tracked for Cortex-A/NPU tiers. |
| Determinism | WCET + jitter budgets | Worst-case execution-time guarantees and bounded jitter suitable for real-time control loops. |
| HAL abstraction | Portable I/O interfaces | Unified hooks for IMUs, motor control (PWM), and sensor buses (I2C/SPI/UART/CAN). |
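The HAL row can be made concrete with the usual C embedded pattern: device specifics live behind a struct of function pointers, so the runtime core calls one interface whether it runs on STM32, NXP, or a host-side mock. Every name below is an assumption for illustration, not a shipped v15 API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical portable HAL: the runtime core depends only on this struct,
 * and each board (or a test harness) supplies the implementations. */
typedef struct {
    int (*imu_read)(float out_accel[3], float out_gyro[3]);
    int (*motor_set_pwm)(int channel, uint16_t duty);
    int (*bus_write)(uint8_t addr, const uint8_t *buf, int len);
} ck_hal;

/* Host-side mock implementation for testing without hardware. */
static int mock_imu(float a[3], float g[3]) {
    for (int i = 0; i < 3; i++) { a[i] = 0.0f; g[i] = 0.0f; }
    return 0;
}
static int mock_pwm(int ch, uint16_t duty) { (void)ch; (void)duty; return 0; }
static int mock_bus(uint8_t addr, const uint8_t *b, int n) {
    (void)addr; (void)b; (void)n; return 0;
}

static const ck_hal mock_hal = { mock_imu, mock_pwm, mock_bus };
```

Swapping the struct at init time is also what makes WCET analysis tractable: the core's call graph is fixed, and only the leaf drivers change per target.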
v15.0 Timeline (Q2-Q4 2026)
- Phase 1: Core runtime with deterministic memory and loop contracts.
- Phase 2: ARM Cortex-M4/M7 baseline (STM32 and NXP target track).
- Phase 3: Sensor/actuator HAL hooks (IMU, motor PWM, serial buses).
- Phase 4: Benchmarks for memory, latency, and control quality vs cloud/offboard baselines.
v16.0 Timeline (2027): Mechanistic Interpretability
- PR16.1: Activation tap points in lowered IR (decode/prefill) with stable op/layer IDs and deterministic dump contracts.
- PR16.2: Sparse autoencoder (SAE) training pipeline on traced activations with feature dictionary/version artifacts.
- PR16.3: IR-to-feature stitching layer to map SAE features to kernel paths, sections, and memory ranges.
- PR16.4: Causal intervention harness (feature ablate/patch) with impact reports on logits and control-relevant metrics.
- PR16.5: Interpretability artifacts integrated into ir_visualizer and operator reports.
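The PR16.1 tap-point idea can be sketched as a record with stable (layer, op) IDs plus a checksum of the activation bytes, giving a deterministic dump contract without shipping full tensors. The record layout and the choice of FNV-1a are illustrative assumptions, not the planned format.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical activation tap record: stable IDs let downstream SAE and
 * stitching tooling address the same site across runs. */
typedef struct {
    uint32_t layer_id;
    uint32_t op_id;
    uint64_t checksum;
} tap_record;

/* FNV-1a over the raw activation bytes: same input, same checksum, so two
 * runs can be compared without storing the activations themselves. */
static uint64_t fnv1a(const void *buf, size_t len) {
    const uint8_t *p = (const uint8_t *)buf;
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

static tap_record tap_activation(uint32_t layer, uint32_t op,
                                 const float *act, size_t n) {
    tap_record r = { layer, op, fnv1a(act, n * sizeof *act) };
    return r;
}
```

A full pipeline would dump the tensors themselves for SAE training; checksums cover the cheaper determinism-audit path.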
v17 Timeline (2027+)
- Embedded LLM + vision multimodal policy execution.
- Policy adapters tied into real-time control loops.
- Field validation for drones and broader robotics deployments.
v18 Timeline (2028+)
- RDMA transport layer (InfiniBand/RoCE) for gradient synchronization.
- Distributed optimizer state sharding (ZeRO-2/3 style for CPU memory).
- Gradient compression (top-k, quantization) to reduce bandwidth requirements.
- Activation checkpointing (recompute forward activations to trade compute for memory).
- Pipeline parallelism for model-layer distribution across nodes.
- HPC cluster job scheduler integration (SLURM, PBS, Flux).
- Fault tolerance with checkpoint restart and node failure recovery.
- Validation on 64+ node cluster with 10-100B parameter models.
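Of the v18 items, top-k gradient compression is simple enough to sketch: keep only the k largest-magnitude gradient entries as index/value pairs and drop the rest, trading accuracy for bandwidth. A real implementation would also carry an error-feedback residual for the dropped mass; that and the function names are assumptions here.

```c
#include <assert.h>
#include <stddef.h>
#include <math.h>

/* Compressed gradient entry: position plus value. */
typedef struct { size_t idx; float val; } grad_entry;

/* Select the k largest-|grad| entries into out[]. O(n*k) selection kept
 * deliberately simple for clarity; assumes k <= n and n <= 256 (demo-sized
 * mask). Production code would use a heap and a residual buffer. */
static void topk_compress(const float *grad, size_t n,
                          grad_entry *out, size_t k) {
    unsigned char taken[256] = {0};
    for (size_t j = 0; j < k; j++) {
        size_t best = n;
        for (size_t i = 0; i < n; i++)
            if (!taken[i] && (best == n || fabsf(grad[i]) > fabsf(grad[best])))
                best = i;
        taken[best] = 1;
        out[j].idx = best;
        out[j].val = grad[best];
    }
}
```

Sending k index/value pairs instead of n floats is where the bandwidth savings over RDMA come from; the compression ratio is a tunable n/k.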
v7 PR Progress (As of 2026-02-15)
Snapshot of current v7 workstream in this repository. Status is based on implemented PR slices and current gate behavior.
| PR | Scope | Status | Notes |
|---|---|---|---|
| PR1 | v7 CLI contracts + strict preflight + inference smoke | Done | Foundation for safe training iteration |
| PR2 | Generated runtime callable via `--backend ck` | Done | CK runtime path executes train step loop |
| PR3 | Contiguous train layout, canary diagnostics, layout audit, drift localization | Done | Major safety/stability milestone for v7 runtime |
| PR4 | Oracle cadence + drift triage + replay determinism | Building | Tighten parity and first-divergence operator diagnostics |
| PR5 | IR3-lowered threaded GEMM dispatch | Planned | Primary performance milestone (split M/N/K policy) |
| PR6 | Run-dir unification + single v7 train gate command | Planned | Operator-grade end-to-end workflow |
v7 Commit Progression (Recent)
Recent v7 commits on this branch that map to the PR progression above:
| Commit | Summary |
|---|---|
| 40a3b221 | land training-oracle runtime updates + regenerated docs |
| 20b7fcbe | runtime checkpoints + PR4.5 roadmap/docs |
| c3041b1a | train IR/layout/codegen runtime + parity-visible tooling |
| db570fe5 | train layout audit + default training kernel parity gates |
| f2e30342 | execute generated C runtime for --backend ck |
| e7b81c56 | strict train CLI + inference smoke gate |
| 115143a6 | run-dir training viewer + PyTorch oracle parity scaffolding |
| 9b98139d | initial v7 backprop foundation |
Full log command: `git log --oneline -- version/v7`
"IR Fusion + Operator Tooling" - shipped with profiler-integrated diagnostics
- Fusion patterns in IR (RMSNorm+QKV+RoPE fuse)
- MEGA kernels: mega_fused_attention_prefill
- Thread pool for parallel prefill/decode
- IR visualizer with perf gate artifacts (perf/flamegraph/VTune integration)
- v6.6 tooling contracts + 3-model matrix validation targets
- Sliding-window attention contract test integrated into make/nightly parity flow
- llama.cpp parity flow completed in prepared environments (torch + llama parity lib available)
"IR Foundation" - Complete IR-based code generation
- IR v1 → IR v2 → IR v3 lowering pipeline
- Full FP32 inference engine
- Q4_K + Q5 + Q6_K quantization support
- GGUF → BUMP weight format
Design Philosophy
Why This Roadmap?
Every version builds on the IR graph + kernel foundation. This means:
- IR continuity: v7 training reuses the same IR-first architecture, with train-specific layout and execution artifacts
- Fusion patterns apply to forward and backward passes alike, so each fusion win is realized twice
- Multi-modal is just adding new kernel types to IR
- Capability-first sequencing: v1-v14 expands architecture/training options before v15 embedded hardening
- Execution discipline: close v7.x training gates before active v15 implementation
- Graph-based = easy optimization passes, validation, debugging
Key Principles
- IR First: All capabilities expressed in IR, generated to C
- Kernel Fusion: Every version expands fusion patterns
- No Premature Optimization: Explore fast-moving architectures first, then harden for embedded constraints
- No GPU Dependency: Pure CPU execution, any x86_64 machine
- Transparency: See every operation via ir_visualizer.html
Version Numbering
Version numbers follow this convention:
- Current priority (2026): Complete v7.x training sign-off while actively bringing up v8.x vision inference foundations before opening v15 implementation
- v6.x: Inference-only (forward pass)
- v7.x: Training foundation (forward + backward)
- v8.x: Vision encoding and image training foundations, now in active bring-up
- v10.x: Audio encoding and audio training foundations
- v12.x: MoE architecture (efficient large models)
- v14.x: Parameter-efficient fine-tuning (LoRA)
- v15.x: Embedded AI inference on constrained hardware (enabled by v1-v14 groundwork)
- v16.x: Mechanistic interpretability (SAE feature learning, IR feature mapping, causal intervention tooling)
- v17.x: LLM + vision policy integration for realtime autonomy loops
- v18.x: Distributed HPC training on CPUs with RDMA (InfiniBand/RoCE), gradient compression, and multi-node orchestration for 10-100B parameter models