Version History & Roadmap
C-Kernel-Engine development roadmap. Every version builds on the IR graph and kernel foundation, enabling new capabilities without breaking existing code.
New: CK-Native Training Curriculum
Use the curriculum page to see how v7 training foundations feed the rest of the roadmap:
v8-v11 multimodal training, v12-v13 sparse scale, v14 adapters,
v15 embedded runtime, and v16-v17 interpretability plus control integration.
Architecture Evolution
Click through each era to explore the journey from C orchestrator to pure IR-driven code generation.
IR Evolution: v6.x Orchestrator -> v6.6 Inference IR -> v7 Training IR
The progression is explicit: v6.x hardcoded orchestration, v6.5 generated-hybrid paths, v6.6 pure IR-driven inference, and v7 IR-driven training lowering with backward synthesis, contiguous training layout, and runtime safety diagnostics.
v6.x used a C orchestrator that dispatched to kernels based on operation type. No code generation - the orchestrator called kernels directly.
Code Path
- ✗ Code generation
- ✓ C orchestrator
- ✓ Kernel dispatch
- ✗ IR graph
Kernels
- gemm_nt_q4_k
- gemv_q8_0
- attention_decode
- rope_forward_qk
Limitations
- No fusion patterns
- Fixed execution order
- Manual kernel dispatch
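The v6.x dispatch style described above can be sketched as a plain C switch over an op tag. This is an illustrative reconstruction, not the real engine API; the enum, function names, and return values are assumptions standing in for kernels like gemm_nt_q4_k and rope_forward_qk.

```c
#include <assert.h>

/* Hypothetical sketch of v6.x-era dispatch: a C orchestrator branches on an
 * op tag and calls a kernel directly. No IR, no codegen, no fusion. */
typedef enum { OP_GEMM, OP_GEMV, OP_ATTN_DECODE, OP_ROPE } op_type;

/* Stub kernels standing in for gemm_nt_q4_k, gemv_q8_0, attention_decode,
 * and rope_forward_qk; return values are markers for the example only. */
static int gemm_kernel(void)  { return 1; }
static int gemv_kernel(void)  { return 2; }
static int attn_kernel(void)  { return 3; }
static int rope_kernel(void)  { return 4; }

/* Fixed execution order, manual kernel dispatch: the limitations listed
 * above fall directly out of this structure. */
static int dispatch(op_type op) {
    switch (op) {
        case OP_GEMM:        return gemm_kernel();
        case OP_GEMV:        return gemv_kernel();
        case OP_ATTN_DECODE: return attn_kernel();
        case OP_ROPE:        return rope_kernel();
    }
    return -1;
}
```

Because the switch is hardcoded, adding a fusion pattern or reordering execution means editing the orchestrator itself, which is exactly what the IR versions remove.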
v6.5 introduced code generation from IR, but the generated C code contained hardcoded logic paths. Templates determined structure, not the IR.
Code Generation
- ✗ Pure IR-driven
- ✓ Template-based gen
- ✓ IR v1 defined
- ✗ Fusion detection
IR Structure
- ops: ["layer_norm", "matmul"]
- attrs: {eps, n_heads}
- No fusion annotations
- Templates hardcoded
New in v6.5
- Memory layout defined
- Weight manifest
- BUMP format
- Template codegen
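The IR v1 card above implies a shape roughly like the following. This is an illustrative reconstruction from the fields listed (ops, attrs), not a real artifact; the eps and n_heads values are placeholders, and the absence of any fusion annotations reflects the v6.5 limitation noted above.

```json
{
  "ops": ["layer_norm", "matmul"],
  "attrs": { "eps": 1e-5, "n_heads": 32 }
}
```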
v6.6 achieves pure IR-driven code generation for the core inference flow. Ops are emitted from IR, kernels selected automatically, memory from layout.json. Model dimensions like attn_out_dim are now part of the config contract (parsed from weights), reducing hardcoded paths.
See the IR Pipeline v6.6 visual guide for templates → IR1/IR2 → lowering → memory layout → codegen.
IR-Driven (Correct)
- ✓ Ops from IR
- ✓ Kernel selection auto
- ✓ Memory from layout.json
- ✓ Fusion patterns
- ✓ Config contract (attn_out_dim)
MEGA Kernels
- mega_fused_attention_prefill
- mega_fused_outproj_mlp
- Norm + QKV + RoPE fuse
- MLP + Residual fuse
Tech Debt (Qwen2-specific)
- RoPE scaling type
- KV cache layout
- Activation function
- Chat template selection
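The "kernel selection auto" point can be made concrete with a small table lookup: instead of hardcoded call sites, the emitter maps an IR op name to a kernel function pointer. This is a hedged sketch, not the real codegen output; the table contents and return values are assumptions loosely modeled on the MEGA kernel names above.

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>

/* Illustrative sketch of v6.6-style automatic kernel selection: the IR op
 * name is looked up in a table rather than wired into the orchestrator. */
typedef int (*kernel_fn)(void);

/* Stubs standing in for mega_fused_attention_prefill / mega_fused_outproj_mlp. */
static int mega_attn_prefill(void) { return 10; }
static int mega_outproj_mlp(void)  { return 20; }

static const struct { const char *ir_op; kernel_fn fn; } kernel_table[] = {
    { "fused_attention_prefill", mega_attn_prefill },
    { "fused_outproj_mlp",       mega_outproj_mlp  },
};

/* Select a kernel from an IR op name; NULL when no kernel matches, which a
 * real pipeline would surface as a lowering error. */
static kernel_fn select_kernel(const char *ir_op) {
    for (size_t i = 0; i < sizeof kernel_table / sizeof kernel_table[0]; i++)
        if (strcmp(kernel_table[i].ir_op, ir_op) == 0)
            return kernel_table[i].fn;
    return NULL;
}
```

New fusion patterns then become new table rows driven by IR annotations, not edits to control flow.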
v7 keeps the same IR-first architecture and adds training-specific lowering: IR1 train-forward, IR2 backward synthesis with explicit gradient accumulation, contiguous training memory layout, generated C runtime, and canary/layout diagnostics.
See v7 Backprop IR Pipeline for the full visual walk-through.
Lowered Artifacts
- ✓ ir1_train_forward.json
- ✓ ir2_train_backward.json
- ✓ layout_train.json
- ✓ generated_train_runtime_v7.c
Safety + Diagnostics
- ✓ canary checks by phase
- ✓ layout audit report
- ✓ first-divergence tracing
- ✓ CK vs PyTorch parity gates
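The canary checks mentioned above can be sketched as sentinel words bracketing each tensor region, verified per phase. This is a hedged illustration of the idea, assuming nothing about the real v7 diagnostics: the sentinel value, struct layout, and function names are all invented for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of a canary-style buffer guard: sentinel words before and after
 * the payload catch overruns when checked between phases. Values and
 * layout are illustrative, not the shipped diagnostics. */
#define CANARY 0xC0FFEE42u

typedef struct {
    uint32_t head;       /* canary before the payload */
    float    data[64];   /* tensor payload */
    uint32_t tail;       /* canary after the payload */
} guarded_buf;

static void guard_init(guarded_buf *b) {
    b->head = CANARY;
    b->tail = CANARY;
    memset(b->data, 0, sizeof b->data);
}

/* Returns 1 if both canaries are intact, 0 if an overrun clobbered one. */
static int guard_check(const guarded_buf *b) {
    return b->head == CANARY && b->tail == CANARY;
}
```

Checking guards at phase boundaries (forward, backward, optimizer step) is what lets a layout bug be localized to the phase that first corrupts a canary.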
Next Lowering Steps
- Execution-plan JSON (train_exec_plan.json)
- Threaded GEMM split policy (M/N/K)
- Deterministic reduction contracts
- One-command operator train gate
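The "deterministic reduction contracts" item deserves a one-function illustration: each thread writes its partial sum to a fixed slot, and a single pass combines the slots in index order, so the floating-point rounding sequence is identical run to run regardless of thread completion order. The function name is an assumption, not the engine's API.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of a deterministic reduction: partials land in per-thread slots,
 * then are combined in slot order. Because float addition is not
 * associative, fixing the combine order is what makes the result
 * bit-reproducible across runs. */
static float reduce_in_order(const float *partials, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)   /* fixed order: slot 0, 1, 2, ... */
        acc += partials[i];
    return acc;
}
```

An atomic or work-stealing reduction would be faster but order-dependent; the contract trades a little throughput for replayable training runs.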
Detailed Roadmap
| Version | Focus | Key Features | Depends On | Status |
|---|---|---|---|---|
| v6.6 | IR Fusion + Parallel | IR-driven fusion kernels, OpenMP parallelization, MEGA kernels | Base | Released |
| v7.0 | Training IR + Data Pipeline Foundation | IR1 train-forward + IR2 backward synthesis, Stage A/Stage B dataset preparation (ASCII cleanup + tokenizer roundtrip gates), v7 init/train/parity CLI flow | v6.6 | Building |
| v7.1 | Runtime Safety + Diagnostics | Contiguous train memory layout, canary diagnostics, CK-vs-PyTorch parity regimen (epoch + grad-accum sweeps), deterministic replay + drift triage artifacts, IR visualizer training-dashboard hardening (artifact ingestion + bug-fix cadence), and GGUF inference contract stabilization across Gemma / Qwen3.5 / Nanbeige bring-up | v7.0 | Building |
| v7.2 | Threaded Training Runtime | IR3 execution plan for dispatch, threaded GEMM policy (split M/N/K), deterministic reductions | v7.1 | Planned |
| v8.0 | Vision Encoder | Patch embedding lowering, image positional encoding, ViT support, and block-local multimodal stitching groundwork | v7.x training | Building |
| v9.0 | Vision Training | Backward through vision encoder, image classification fine-tuning | v8.0 | Planned |
| v10.0 | Audio Encoder | Conformer/Whisper-style encoder, spectrogram preprocessing | v7.x training | Planned |
| v11.0 | Audio Training | Backward through audio encoder, speech recognition training | v10.0 | Planned |
| v12.0 | MoE - Mixture of Experts | Expert selection kernels, router computation, load balancing loss | v7.x training | Planned |
| v13.0 | MoE Backward | Backward pass through MoE gates and experts | v12.0 | Planned |
| v14.0 | LoRA / QLoRA | Adapter injection, low-rank decomposition, 4-bit LoRA support | v7.x training | Planned |
| v15.0 | Embedded AI Inference | Bring LLM/AI inference to constrained flight-control and robotics hardware with deterministic runtime contracts and portable HAL integration | v7.x training sign-off + v1-v14 kernel families | Planned |
| v16.0 | Mechanistic Interpretability | Train sparse autoencoders on traced activations, stitch features back to IR ops/kernels, and add causal intervention tooling for explainable model behavior | v15.0 | Planned |
| v17.0 | LLM + Vision Policy Integration | Integrate embedded LLM + vision policy outputs into real-time control loops and validate end-to-end autonomy behavior on robotics-class systems | v16.0 | Planned |
| v18.0 | Distributed HPC Training | Multi-node CPU training with RDMA (InfiniBand/RoCE), gradient compression, and HPC cluster orchestration for large-scale model training (10-100B parameters) on commodity hardware | v17.0 | Planned |
Version Timeline
"Training IR Runtime + Operator Cockpit" - From data preparation and parity-first training to fully lowered train execution
- PR1: v7 CLI contracts + strict validation + inference smoke gate
- PR2: Generated train runtime path wired (ck backend callable)
- PR3: Contiguous train layout + canary diagnostics + strict first-divergence tracing
- PR4 (current): Oracle cadence + drift triage artifacts + deterministic replay hardening + training telemetry ingestion for IR visualizer
- PR5: Threaded GEMM dispatch lowered from IR3 execution plan
- PR6: Run-dir artifact unification + one-command v7 train gate
- IR Visualizer (ongoing): training tabs, parity cockpit, data/tokenizer transparency, and robustness bug fixes (path loading, mode wiring, runtime JS errors).
Operator history highlights (training stack):
- Data preparation: Stage 0.5/0.55 corpus path (docs SVG assets -> UTF-8/ASCII cleanup -> manifests + tokenizer map).
- Stage A / Stage B flow: pretrain/midtrain dataset tracks with checkpoint promotion and run-scoped reports.
- PyTorch parity analysis: first-divergence localizer, 10-epoch cadence, grad-accum sweeps, replay determinism checks.
- IR visualizer training cockpit: training dashboard, parity tracker, data/tokenizer transparency, and operator runbook command blocks.
Recent inference/runtime fixes on the same branch:
- Gemma 3 parity fix: RoPE was being applied with the wrong Q/K pairing style. Gemma expects split-half RoPE; the runtime now reads the exported `rope_layout` and selects the matching `rope_qk` kernel automatically.
- Qwen3.5 recurrent bring-up: hybrid recurrent-attention decode, Gated DeltaNet parity, and contract-driven visible/suppressed thinking now run on the built-in C tokenizer path.
- Nanbeige bring-up lane: SentencePiece + ChatML contracts stabilized on the C tokenizer path, with untied `output.weight` preserved and long-think output treated as model behavior.
Recent v7 milestone commits:
- `524791c4` fix(v7/regression): inference-family regression gate hardened, contract audit added, and `make regression-fast` promoted into nightly CI.
- `af99d411` fix(v7/inference): chat contracts unified, Gemma split-half RoPE selected from exported metadata, Qwen families kept on the C tokenizer path, and Nanbeige contract bring-up stabilized.
- `80d88071` docs(v7): model-kernel matrix, runbook, and version-history pages updated to record the family bring-up fixes and canonical cache-backed bring-up flow.
Execution Order (Current Priority: Training First)
Methodical completion path: finish v7.x training sign-off end-to-end before opening active v15 embedded implementation. v15-v17 stay visible as downstream roadmap, but are gated by training completion criteria.
| Phase | Objective | Exit Criteria (must pass) | Status |
|---|---|---|---|
| T0 | Runbook Lock | Single canonical Stage A/B command path + deterministic env contract documented and reproducible on clean checkout | Building |
| T1 | Data + Tokenizer Integrity | ASCII/SVG cleanup gates, tokenizer roundtrip gates, manifest invariants all passing | Building |
| T2 | Train Correctness | CK-vs-PyTorch parity cadence, first-divergence triage closure, deterministic replay stability | Building |
| T3 | Runtime + Throughput Stability | IR3 threaded dispatch, deterministic reductions, no regressions in train contracts/smokes | Planned |
| T4 | Release Gate + Handoff | One-command v7 train gate green, run-dir artifacts unified, operator docs frozen for release handoff | Planned |
v15 Entry Criteria (Hard Gate)
- Stage A and Stage B runs reproducible from the runbook on a clean machine.
- Parity/drift gates pass at declared cadence with stable replay artifacts.
- IR visualizer training cockpit stable for run-dir artifacts (data, tokenizer, parity, profile).
- v7 train release checklist and docs sign-off completed.
Embedded + Interpretability Arc (v15 -> v17)
v1-v14 expands model/training architecture options first; v15 lands embedded deployment targets with hard realtime constraints; v16 adds mechanistic interpretability to de-risk policy behavior before v17 autonomy integration.
| v15 Component | Target | Description |
|---|---|---|
| Memory footprint | <= 512KB runtime core | Sub-512KB inference/runtime core for MCU-class deployment (model weights budget tracked separately). |
| Latency targets | < 10ms control-inference step | Cortex-M baseline loop target for control inference; higher-throughput token targets tracked for Cortex-A/NPU tiers. |
| Determinism | WCET + jitter budgets | Worst-case execution-time guarantees and bounded jitter suitable for real-time control loops. |
| HAL abstraction | Portable I/O interfaces | Unified hooks for IMUs, motor control (PWM), and sensor buses (I2C/SPI/UART/CAN). |
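The HAL row can be made concrete with the usual C embedded pattern: device specifics live behind a struct of function pointers, so the runtime core calls one interface whether it runs on STM32, NXP, or a host-side mock. Every name below is an assumption for illustration, not a shipped v15 API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical portable HAL: the runtime core depends only on this struct,
 * and each board (or a test harness) supplies the implementations. */
typedef struct {
    int (*imu_read)(float out_accel[3], float out_gyro[3]);
    int (*motor_set_pwm)(int channel, uint16_t duty);
    int (*bus_write)(uint8_t addr, const uint8_t *buf, int len);
} ck_hal;

/* Host-side mock implementation for testing without hardware. */
static int mock_imu(float a[3], float g[3]) {
    for (int i = 0; i < 3; i++) { a[i] = 0.0f; g[i] = 0.0f; }
    return 0;
}
static int mock_pwm(int ch, uint16_t duty) { (void)ch; (void)duty; return 0; }
static int mock_bus(uint8_t addr, const uint8_t *b, int n) {
    (void)addr; (void)b; (void)n; return 0;
}

static const ck_hal mock_hal = { mock_imu, mock_pwm, mock_bus };
```

Swapping the struct at init time is also what makes WCET analysis tractable: the core's call graph is fixed, and only the leaf drivers change per target.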
v15.0 Timeline (Q2-Q4 2026)
- Phase 1: Core runtime with deterministic memory and loop contracts.
- Phase 2: ARM Cortex-M4/M7 baseline (STM32 and NXP target track).
- Phase 3: Sensor/actuator HAL hooks (IMU, motor PWM, serial buses).
- Phase 4: Benchmarks for memory, latency, and control quality vs cloud/offboard baselines.
v16.0 Timeline (2027): Mechanistic Interpretability
- PR16.1: Activation tap points in lowered IR (decode/prefill) with stable op/layer IDs and deterministic dump contracts.
- PR16.2: Sparse autoencoder (SAE) training pipeline on traced activations with feature dictionary/version artifacts.
- PR16.3: IR-to-feature stitching layer to map SAE features to kernel paths, sections, and memory ranges.
- PR16.4: Causal intervention harness (feature ablate/patch) with impact reports on logits and control-relevant metrics.
- PR16.5: Interpretability artifacts integrated into ir_visualizer and operator reports.
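The PR16.1 tap-point idea can be sketched as a record with stable (layer, op) IDs plus a checksum of the activation bytes, giving a deterministic dump contract without shipping full tensors. The record layout and the choice of FNV-1a are illustrative assumptions, not the planned format.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical activation tap record: stable IDs let downstream SAE and
 * stitching tooling address the same site across runs. */
typedef struct {
    uint32_t layer_id;
    uint32_t op_id;
    uint64_t checksum;
} tap_record;

/* FNV-1a over the raw activation bytes: same input, same checksum, so two
 * runs can be compared without storing the activations themselves. */
static uint64_t fnv1a(const void *buf, size_t len) {
    const uint8_t *p = (const uint8_t *)buf;
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

static tap_record tap_activation(uint32_t layer, uint32_t op,
                                 const float *act, size_t n) {
    tap_record r = { layer, op, fnv1a(act, n * sizeof *act) };
    return r;
}
```

A full pipeline would dump the tensors themselves for SAE training; checksums cover the cheaper determinism-audit path.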
v17 Timeline (2027+)
- Embedded LLM + vision multimodal policy execution.
- Policy adapters tied into real-time control loops.
- Field validation for drones and broader robotics deployments.
v18 Timeline (2028+)
- RDMA transport layer (InfiniBand/RoCE) for gradient synchronization.
- Distributed optimizer state sharding (ZeRO-2/3 style for CPU memory).
- Gradient compression (top-k, quantization) to reduce bandwidth requirements.
- Activation checkpointing (recompute forward activations to trade compute for memory).
- Pipeline parallelism for model-layer distribution across nodes.
- HPC cluster job scheduler integration (SLURM, PBS, Flux).
- Fault tolerance with checkpoint restart and node failure recovery.
- Validation on 64+ node cluster with 10-100B parameter models.
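Of the v18 items, top-k gradient compression is simple enough to sketch: keep only the k largest-magnitude gradient entries as index/value pairs and drop the rest, trading accuracy for bandwidth. A real implementation would also carry an error-feedback residual for the dropped mass; that and the function names are assumptions here.

```c
#include <assert.h>
#include <stddef.h>
#include <math.h>

/* Compressed gradient entry: position plus value. */
typedef struct { size_t idx; float val; } grad_entry;

/* Select the k largest-|grad| entries into out[]. O(n*k) selection kept
 * deliberately simple for clarity; assumes k <= n and n <= 256 (demo-sized
 * mask). Production code would use a heap and a residual buffer. */
static void topk_compress(const float *grad, size_t n,
                          grad_entry *out, size_t k) {
    unsigned char taken[256] = {0};
    for (size_t j = 0; j < k; j++) {
        size_t best = n;
        for (size_t i = 0; i < n; i++)
            if (!taken[i] && (best == n || fabsf(grad[i]) > fabsf(grad[best])))
                best = i;
        taken[best] = 1;
        out[j].idx = best;
        out[j].val = grad[best];
    }
}
```

Sending k index/value pairs instead of n floats is where the bandwidth savings over RDMA come from; the compression ratio is a tunable n/k.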
v7 PR Progress (As of 2026-02-15)
Snapshot of current v7 workstream in this repository. Status is based on implemented PR slices and current gate behavior.
| PR | Scope | Status | Notes |
|---|---|---|---|
| PR1 | v7 CLI contracts + strict preflight + inference smoke | Done | Foundation for safe training iteration |
| PR2 | Generated runtime callable via `--backend ck` | Done | CK runtime path executes train step loop |
| PR3 | Contiguous train layout, canary diagnostics, layout audit, drift localization | Done | Major safety/stability milestone for v7 runtime |
| PR4 | Oracle cadence + drift triage + replay determinism | Building | Tighten parity and first-divergence operator diagnostics |
| PR5 | IR3-lowered threaded GEMM dispatch | Planned | Primary performance milestone (split M/N/K policy) |
| PR6 | Run-dir unification + single v7 train gate command | Planned | Operator-grade end-to-end workflow |
v7 Commit Progression (Recent)
Recent v7 commits on this branch that map to the PR progression above:
| Commit | Summary |
|---|---|
| 40a3b221 | land training-oracle runtime updates + regenerated docs |
| 20b7fcbe | runtime checkpoints + PR4.5 roadmap/docs |
| c3041b1a | train IR/layout/codegen runtime + parity-visible tooling |
| db570fe5 | train layout audit + default training kernel parity gates |
| f2e30342 | execute generated C runtime for --backend ck |
| e7b81c56 | strict train CLI + inference smoke gate |
| 115143a6 | run-dir training viewer + PyTorch oracle parity scaffolding |
| 9b98139d | initial v7 backprop foundation |
Full log command: `git log --oneline -- version/v7`
"IR Fusion + Operator Tooling" - shipped with profiler-integrated diagnostics
- Fusion patterns in IR (RMSNorm+QKV+RoPE fuse)
- MEGA kernels: mega_fused_attention_prefill
- Thread pool for parallel prefill/decode
- IR visualizer with perf gate artifacts (perf/flamegraph/VTune integration)
- v6.6 tooling contracts + 3-model matrix validation targets
- Sliding-window attention contract test integrated into make/nightly parity flow
- llama.cpp parity flow completed in prepared environments (torch + llama parity lib available)
"IR Foundation" - Complete IR-based code generation
- IR v1 → IR v2 → IR v3 lowering pipeline
- Full FP32 inference engine
- Q4_K + Q5 + Q6_K quantization support
- GGUF → BUMP weight format
Design Philosophy
Why This Roadmap?
Every version builds on the IR graph + kernel foundation. This means:
- IR continuity: v7 training reuses the same IR-first architecture, with train-specific layout and execution artifacts
- Fusion patterns apply to forward and backward passes alike, so each fusion win is realized twice
- Multi-modal is just adding new kernel types to IR
- Capability-first sequencing: v1-v14 expands architecture/training options before v15 embedded hardening
- Execution discipline: close v7.x training gates before active v15 implementation
- Graph-based = easy optimization passes, validation, debugging
Key Principles
- IR First: All capabilities expressed in IR, generated to C
- Kernel Fusion: Every version expands fusion patterns
- No Premature Optimization: Explore fast-moving architectures first, then harden for embedded constraints
- No GPU Dependency: Pure CPU execution, any x86_64 machine
- Transparency: See every operation via ir_visualizer.html
Version Numbering
Version numbers follow this convention:
- Current priority (2026): Complete v7.x training sign-off while actively bringing up v8.x vision inference foundations before opening v15 implementation
- v6.x: Inference-only (forward pass)
- v7.x: Training foundation (forward + backward)
- v8.x: Vision encoding and image training foundations, now in active bring-up
- v10.x: Audio encoding and audio training foundations
- v12.x: MoE architecture (efficient large models)
- v14.x: Parameter-efficient fine-tuning (LoRA)
- v15.x: Embedded AI inference on constrained hardware (enabled by v1-v14 groundwork)
- v16.x: Mechanistic interpretability (SAE feature learning, IR feature mapping, causal intervention tooling)
- v17.x: LLM + vision policy integration for realtime autonomy loops
- v18.x: Distributed HPC training on CPUs with RDMA (InfiniBand/RoCE), gradient compression, and multi-node orchestration for 10-100B parameter models