Version History & Roadmap

C-Kernel-Engine development roadmap. Every version builds on the IR graph and kernel foundation, enabling new capabilities without breaking existing code.

New: CK-Native Training Curriculum

Use the curriculum page to see how v7 training foundations feed the rest of the roadmap: v8-v11 multimodal training, v12-v13 sparse scale, v14 adapters, v15 embedded runtime, and v16-v17 interpretability plus control integration.

Architecture Evolution

Every version builds on the IR graph + kernel foundation. Click to explore the journey from C orchestrator to pure IR-driven code generation.

  • v6.6: IR + Tooling
  • v7: Training Build
  • v8-9: Vision
  • v10-11: Audio
  • v12-13: MoE
  • v14: LoRA
  • v15: Embedded Compat
  • v16: Mech Interp
  • v17: LLM+Vision Ctrl
  • v18: Dist HPC Train

IR Evolution: v6.x Orchestrator -> v6.6 Inference IR -> v7 Training IR

The progression is explicit: v6.x hardcoded orchestration, v6.5 generated-hybrid paths, v6.6 pure IR-driven inference, and v7 IR-driven training lowering with backward synthesis, contiguous training layout, and runtime safety diagnostics.

v6.x C Orchestrator: Direct Kernel Dispatch

v6.x used a C orchestrator that dispatched to kernels based on operation type. There was no code generation; the orchestrator called kernels directly.

Code Path

  • C orchestrator
  • Kernel dispatch
  • No code generation
  • No IR graph

Kernels

  • gemm_nt_q4_k
  • gemv_q8_0
  • attention_decode
  • rope_forward_qk

Limitations

  • No fusion patterns
  • Fixed execution order
  • Manual kernel dispatch
// v6.x style: C orchestrator dispatches kernels directly
ckernel_orchestrator_run(ctx, CK_MODE_DECODE); // dispatcher calls kernels based on op_type
v6.5 Generated C: IR with Hardcoded Paths

v6.5 introduced code generation from IR, but the generated C code still contained hardcoded logic paths: templates determined the structure, not the IR.

Code Generation

  • Template-based gen
  • IR v1 defined
  • Not yet pure IR-driven
  • No fusion detection

IR Structure

  • ops: ["layer_norm", "matmul"]
  • attrs: {eps, n_heads}
  • No fusion annotations
  • Templates hardcoded

New in v6.5

  • Memory layout defined
  • Weight manifest
  • BUMP format
  • Template codegen
// v6.5 style: template generates C with hardcoded paths
generate_code(ir, template_decode); // output has if (mode == DECODE) { ... hardcoded ... }
v6.6 Pure IR-Driven Inference (Released)

v6.6 achieves pure IR-driven code generation for the core inference flow: ops are emitted from the IR, kernels are selected automatically, and memory comes from layout.json. Model dimensions such as attn_out_dim are now part of the config contract (parsed from weights), reducing hardcoded paths.

See the IR Pipeline v6.6 visual guide for templates → IR1/IR2 → lowering → memory layout → codegen.

IR-Driven (Correct)

  • Ops from IR
  • Kernel selection auto
  • Memory from layout.json
  • Fusion patterns
  • Config contract (attn_out_dim)

MEGA Kernels

  • mega_fused_attention_prefill
  • mega_fused_outproj_mlp
  • Norm + QKV + RoPE fuse
  • MLP + Residual fuse

Tech Debt (Qwen2-specific)

  • RoPE scaling type
  • KV cache layout
  • Activation function
  • Chat template selection
Stats: 100% IR coverage · 7 MEGA kernels · ~4 tech debt items · 1 IR→C pass
// v6.6 style: IR-driven with config contract
ir = ir_from_model(model)          // build IR from model
fused = fuse_patterns(ir)          // detect fusion patterns
code = generate_c(fused, registry) // generate C
// core flow is IR-driven; remaining model-specific defaults are isolated
v7 IR-Driven Training Runtime (Current Build)

v7 keeps the same IR-first architecture and adds training-specific lowering: IR1 train-forward, IR2 backward synthesis with explicit gradient accumulation, contiguous training memory layout, generated C runtime, and canary/layout diagnostics.

See v7 Backprop IR Pipeline for the full visual walk-through.

Lowered Artifacts

  • ir1_train_forward.json
  • ir2_train_backward.json
  • layout_train.json
  • generated_train_runtime_v7.c

Safety + Diagnostics

  • canary checks by phase
  • layout audit report
  • first-divergence tracing
  • CK vs PyTorch parity gates

Next Lowering Steps

  • Execution-plan JSON (train_exec_plan.json)
  • Threaded GEMM split policy (M/N/K)
  • Deterministic reduction contracts
  • One-command operator train gate
Stats: IR1+IR2 training graph · contiguous train memory · canary runtime safety · oracle parity tracking
// v7 style: training lowering pipeline
ir1 = build_ir_train_forward(template, manifest)
ir2 = synthesize_backward(ir1, grad_rules)
layout = plan_train_memory(ir2)            // layout_train.json
code = generate_train_runtime(ir2, layout) // generated_train_runtime_v7.c
// diagnostics + parity gates validate execution

Detailed Roadmap

| Version | Focus | Key Features | Depends On | Status |
|---------|-------|--------------|------------|--------|
| v6.6 | IR Fusion + Parallel | IR-driven fusion kernels, OpenMP parallelization, MEGA kernels | Base | Released |
| v7.0 | Training IR + Data Pipeline Foundation | IR1 train-forward + IR2 backward synthesis, Stage A/Stage B dataset preparation (ASCII cleanup + tokenizer roundtrip gates), v7 init/train/parity CLI flow | v6.6 | Building |
| v7.1 | Runtime Safety + Diagnostics | Contiguous train memory layout, canary diagnostics, CK-vs-PyTorch parity regimen (epoch + grad-accum sweeps), deterministic replay + drift triage artifacts, IR visualizer training-dashboard hardening (artifact ingestion + bug-fix cadence), and GGUF inference contract stabilization across Gemma / Qwen3.5 / Nanbeige bring-up | v7.0 | Building |
| v7.2 | Threaded Training Runtime | IR3 execution plan for dispatch, threaded GEMM policy (split M/N/K), deterministic reductions | v7.1 | Planned |
| v8.0 | Vision Encoder | Patch embedding lowering, image positional encoding, ViT support, and block-local multimodal stitching groundwork | v7.x training | Building |
| v9.0 | Vision Training | Backward through vision encoder, image classification fine-tuning | v8.0 | Planned |
| v10.0 | Audio Encoder | Conformer/Whisper-style encoder, spectrogram preprocessing | v7.x training | Planned |
| v11.0 | Audio Training | Backward through audio encoder, speech recognition training | v10.0 | Planned |
| v12.0 | MoE - Mixture of Experts | Expert selection kernels, router computation, load balancing loss | v7.x training | Planned |
| v13.0 | MoE Backward | Backward pass through MoE gates and experts | v12.0 | Planned |
| v14.0 | LoRA / QLoRA | Adapter injection, low-rank decomposition, 4-bit LoRA support | v7.x training | Planned |
| v15.0 | Embedded AI Inference | Bring LLM/AI inference to constrained flight-control and robotics hardware with deterministic runtime contracts and portable HAL integration | v7.x training sign-off + v1-v14 kernel families | Planned |
| v16.0 | Mechanistic Interpretability | Train sparse autoencoders on traced activations, stitch features back to IR ops/kernels, and add causal intervention tooling for explainable model behavior | v15.0 | Planned |
| v17.0 | LLM + Vision Policy Integration | Integrate embedded LLM + vision policy outputs into real-time control loops and validate end-to-end autonomy behavior on robotics-class systems | v16.0 | Planned |
| v18.0 | Distributed HPC Training | Multi-node CPU training with RDMA (InfiniBand/RoCE), gradient compression, and HPC cluster orchestration for large-scale model training (10-100B parameters) on commodity hardware | v17.0 | Planned |

Version Timeline

v7.0
Building Current sprint track

"Training IR Runtime + Operator Cockpit" - From data preparation and parity-first training to fully lowered train execution

  • PR1: v7 CLI contracts + strict validation + inference smoke gate
  • PR2: Generated train runtime path wired (ck backend callable)
  • PR3: Contiguous train layout + canary diagnostics + strict first-divergence tracing
  • PR4 (current): Oracle cadence + drift triage artifacts + deterministic replay hardening + training telemetry ingestion for IR visualizer
  • PR5: Threaded GEMM dispatch lowered from IR3 execution plan
  • PR6: Run-dir artifact unification + one-command v7 train gate
  • IR Visualizer (ongoing): training tabs, parity cockpit, data/tokenizer transparency, and robustness bug fixes (path loading, mode wiring, runtime JS errors).

Operator history highlights (training stack):

  • Data preparation: Stage 0.5/0.55 corpus path (docs SVG assets -> UTF-8/ASCII cleanup -> manifests + tokenizer map).
  • Stage A / Stage B flow: pretrain/midtrain dataset tracks with checkpoint promotion and run-scoped reports.
  • PyTorch parity analysis: first-divergence localizer, 10-epoch cadence, grad-accum sweeps, replay determinism checks.
  • IR visualizer training cockpit: training dashboard, parity tracker, data/tokenizer transparency, and operator runbook command blocks.

Recent inference/runtime fixes on the same branch:

  • Gemma 3 parity fix: RoPE was being applied with the wrong Q/K pairing style. Gemma expects split-half RoPE; the runtime now reads exported rope_layout and selects the matching rope_qk kernel automatically.
  • Qwen3.5 recurrent bring-up: hybrid recurrent-attention decode, Gated DeltaNet parity, and contract-driven visible/suppressed thinking now run on the built-in C tokenizer path.
  • Nanbeige bring-up lane: SentencePiece + ChatML contracts stabilized on the C tokenizer path with untied output.weight preserved and long-think output treated as model behavior.

Recent v7 milestone commits:

  • 524791c4 fix(v7/regression): inference-family regression gate hardened, contract audit added, and make regression-fast promoted into nightly CI.
  • af99d411 fix(v7/inference): chat contracts unified, Gemma split-half RoPE selected from exported metadata, Qwen families kept on the C tokenizer path, and Nanbeige contract bring-up stabilized.
  • 80d88071 docs(v7): model-kernel matrix, runbook, and version-history pages updated to record the family bring-up fixes and canonical cache-backed bring-up flow.

Execution Order (Current Priority: Training First)

Methodical completion path: finish v7.x training sign-off end-to-end before opening active v15 embedded implementation. v15-v17 stay visible as downstream roadmap, but are gated by training completion criteria.

| Phase | Objective | Exit Criteria (must pass) | Status |
|-------|-----------|---------------------------|--------|
| T0 | Runbook Lock | Single canonical Stage A/B command path + deterministic env contract documented and reproducible on clean checkout | Building |
| T1 | Data + Tokenizer Integrity | ASCII/SVG cleanup gates, tokenizer roundtrip gates, manifest invariants all passing | Building |
| T2 | Train Correctness | CK-vs-PyTorch parity cadence, first-divergence triage closure, deterministic replay stability | Building |
| T3 | Runtime + Throughput Stability | IR3 threaded dispatch, deterministic reductions, no regressions in train contracts/smokes | Planned |
| T4 | Release Gate + Handoff | One-command v7 train gate green, run-dir artifacts unified, operator docs frozen for release handoff | Planned |

v15 Entry Criteria (Hard Gate)

Embedded + Interpretability Arc (v15 -> v17)

v1-v14 expands model/training architecture options first; v15 lands embedded deployment targets with hard realtime constraints; v16 adds mechanistic interpretability to de-risk policy behavior before v17 autonomy integration.

| v15 Component | Target | Description |
|---------------|--------|-------------|
| Memory footprint | <= 512KB runtime core | Sub-512KB inference/runtime core for MCU-class deployment (model weights budget tracked separately). |
| Latency targets | < 10ms control-inference step | Cortex-M baseline loop target for control inference; higher-throughput token targets tracked for Cortex-A/NPU tiers. |
| Determinism | WCET + jitter budgets | Worst-case execution-time guarantees and bounded jitter suitable for real-time control loops. |
| HAL abstraction | Portable I/O interfaces | Unified hooks for IMUs, motor control (PWM), and sensor buses (I2C/SPI/UART/CAN). |

v15.0 Timeline (Q2-Q4 2026)

v16.0 Timeline (2027): Mechanistic Interpretability

v17 Timeline (2027+)

v18 Timeline (2028+)

v7 PR Progress (As of 2026-02-15)

Snapshot of current v7 workstream in this repository. Status is based on implemented PR slices and current gate behavior.

| PR | Scope | Status | Notes |
|----|-------|--------|-------|
| PR1 | v7 CLI contracts + strict preflight + inference smoke | Done | Foundation for safe training iteration |
| PR2 | Generated runtime callable via --backend ck | Done | CK runtime path executes train step loop |
| PR3 | Contiguous train layout, canary diagnostics, layout audit, drift localization | Done | Major safety/stability milestone for v7 runtime |
| PR4 | Oracle cadence + drift triage + replay determinism | Building | Tighten parity and first-divergence operator diagnostics |
| PR5 | IR3-lowered threaded GEMM dispatch | Planned | Primary performance milestone (split M/N/K policy) |
| PR6 | Run-dir unification + single v7 train gate command | Planned | Operator-grade end-to-end workflow |

v7 Commit Progression (Recent)

Recent v7 commits on this branch that map to the PR progression above:

| Commit | Summary |
|--------|---------|
| 40a3b221 | land training-oracle runtime updates + regenerated docs |
| 20b7fcbe | runtime checkpoints + PR4.5 roadmap/docs |
| c3041b1a | train IR/layout/codegen runtime + parity-visible tooling |
| db570fe5 | train layout audit + default training kernel parity gates |
| f2e30342 | execute generated C runtime for --backend ck |
| e7b81c56 | strict train CLI + inference smoke gate |
| 115143a6 | run-dir training viewer + PyTorch oracle parity scaffolding |
| 9b98139d | initial v7 backprop foundation |

Full log command: git log --oneline -- version/v7

v6.6
Released 2026-01-15

"IR Fusion + Operator Tooling" - shipped with profiler-integrated diagnostics

  • Fusion patterns in IR (RMSNorm+QKV+RoPE fuse)
  • MEGA kernels: mega_fused_attention_prefill
  • Thread pool for parallel prefill/decode
  • IR visualizer with perf gate artifacts (perf/flamegraph/VTune integration)
  • v6.6 tooling contracts + 3-model matrix validation targets
  • Sliding-window attention contract test integrated into make/nightly parity flow
  • llama.cpp parity flow completed in prepared environments (torch + llama parity lib available)
v6.5
Released 2026-01-01

"IR Foundation" - Complete IR-based code generation

  • IR v1 → IR v2 → IR v3 lowering pipeline
  • Full FP32 inference engine
  • Q4_K + Q5 + Q6_K quantization support
  • GGUF → BUMP weight format

Design Philosophy

Why This Roadmap?

Every version builds on the IR graph + kernel foundation. This means:

  • IR continuity: v7 training reuses the same IR-first architecture, with train-specific layout and execution artifacts
  • Fusion works for both forward and backward passes, roughly doubling its benefit
  • Multimodal support is just adding new kernel types to the IR
  • Capability-first sequencing: v1-v14 expands architecture/training options before v15 embedded hardening
  • Execution discipline: close v7.x training gates before active v15 implementation
  • Graph-based = easy optimization passes, validation, debugging

Key Principles

  • IR First: All capabilities expressed in IR, generated to C
  • Kernel Fusion: Every version expands fusion patterns
  • No Premature Optimization: Explore fast-moving architectures first, then harden for embedded constraints
  • No GPU Dependency: Pure CPU execution, any x86_64 machine
  • Transparency: See every operation via ir_visualizer.html

Version Numbering

Version numbers follow this convention:

💡
Track Progress: See Test Report for nightly results or GitHub Issues for tasks.