v7 Backprop IR Pipeline (Operator Guide)
This page explains how v7 training works end to end, with a focus on what is generated from IR, what is layout/codegen output, where logic lives, and how to debug correctness quickly.
There is no `ck_run_v7.c`. The operator entrypoint is `version/v7/scripts/ck_run_v7.py` (and the wrapper `version/v7/scripts/cks-v7-run`). It orchestrates generation and execution of the generated C runtimes.
What Runs in C vs Python
| Layer | Primary Runtime | Why It Exists | Current Operator Command |
|---|---|---|---|
| Orchestration and run management | Python | Build artifacts, launch gates, collect reports, drive strict checks | cks-v7-run init|train|sanity|parity|profile |
| Forward and backward kernel math | C kernels | Deterministic numeric core, SIMD paths, parity-tested primitives | make test, make v7-gate-train |
| Generated training runtime | C (generated) | Dumb emitter executes IR-lowered plan and memory offsets | generated_train_runtime_v7.c, libtrain.so |
| Oracle parity and drift localization | Python + PyTorch | Reference implementation for step-level and slot-level checks | --backend both, --parity-on |
| IR report rendering | Python + HTML/JS | Load run artifacts and present explainable diagnostics | open_ir_visualizer.py --generate --run ... |
Mental Model: What Is Smart and What Is Dumb
Smart Stages
- Template selection: architecture op sequence and flags
- IR1: forward op graph with typed tensors and kernel IDs
- IR2: backward synthesis, fanout accumulation, grad edges
- Layout generation: contiguous offsets, sections, canary map
- Validation: invariants, memory audit, parity, drift checks
Dumb Stage
- Codegen only emits what lowered IR and layout already decided
- No architecture guessing in emitter
- No hidden model-family conditionals in runtime glue
- If behavior is wrong, root cause is usually upstream in IR/lowering/layout
From Init to Train
- Initialize run: choose init policy, dimensions, dtype policy, and template.
- Emit weights: write `weights.bump` and `weights_manifest.json`.
- Build IR1: forward graph and typed tensor registry.
- Lower IR2: backward graph from IR1 + grad rules, with explicit accumulations.
- Generate layout: contiguous memory map with sections and canary ranges.
- Codegen runtime: emit the train-runtime C and compile it to `libtrain.so`.
- Execute train: run the CK path, optional PyTorch oracle checks, write reports.
Drift Timeline: Early Schedule Bug to CE Numeric Tightening
Yes, execution order changed in an important way. The generated train runtime now enforces accumulation-window semantics so optimizer updates only happen at the true boundary of CK_GRAD_ACCUM_STEPS micro-steps.
- Before: optimizer timing could drift relative to the intended accumulation window.
- Now: each window calls `zero_grad` once, accumulates grads over micro-steps, then applies one optimizer step at the boundary.
- Impact: the early drift signal (~step 65 in stress) was removed; remaining drift moved to much later horizons and required CE semantic/numeric tightening.
Current Numeric Status
- Historical baseline showed late drift in full-C stress; CE parity fixes now remove that trigger in the 850-step repro.
- All three loss backends (`c`, `c_ptref`, `torch`) pass the same deterministic 850-step parity run.
- Training-side production gates (drift smoke, realistic long-horizon, optimizer parity, FD-grad, replay) all pass in this validation pass.
- Interpretation rule: treat generated-runtime parity artifacts (`train_runtime_parity_*_latest.json`) as release truth; older step-65/step-800 harness failures are historical diagnostics.
Schedule Pseudocode (Expected)
```
if g_accum_step == 0: zero_grad()
forward()
backward()
g_accum_step += 1
if g_accum_step >= CK_GRAD_ACCUM_STEPS:
    scale_grads(1 / g_accum_step)  # average over the accumulation window
    optimizer_step()
    g_accum_step = 0
```
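This window schedule can be sanity-checked in isolation. The sketch below is plain Python, independent of the project; the linear toy loss and the `K` value standing in for `CK_GRAD_ACCUM_STEPS` are illustrative. It verifies that accumulating K micro-step gradients and scaling by 1/K yields the same update as averaging the per-micro-batch gradients directly:

```python
# Minimal simulation of the accumulation-window schedule (illustrative toy loss).
K = 4                                    # stands in for CK_GRAD_ACCUM_STEPS
micro_batches = [[1.0, 2.0], [3.0, 4.0], [0.5, 1.5], [2.5, 3.5]]
lr = 0.1

def grad(w, batch):
    # d/dw of mean((w*x - 1)^2) over the micro-batch
    return sum(2.0 * (w * x - 1.0) * x for x in batch) / len(batch)

# Window schedule: zero once, accumulate K micro-grads, one optimizer step.
w = 0.0
g_accum, g_accum_step = 0.0, 0
for batch in micro_batches:
    g_accum += grad(w, batch)            # backward() accumulates into grads
    g_accum_step += 1
    if g_accum_step >= K:
        w -= lr * (g_accum / g_accum_step)   # scale_grads(1/K) + optimizer_step
        g_accum, g_accum_step = 0.0, 0

# Reference: one step on the average of the per-micro-batch gradients.
w_ref = 0.0 - lr * (sum(grad(0.0, b) for b in micro_batches) / K)
print(abs(w - w_ref) < 1e-12)  # True: boundary semantics match
```

If the optimizer step fires inside the window instead of at the boundary, the later micro-grads are evaluated at a moved `w` and the two results diverge, which is exactly the drift signature described above.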
Grad-Accum Math (N vs K)
For an exact explanation of CE token normalization (/N) versus accumulation-window normalization (/K), plus CPU effective-batch examples and variable-length caveats, see v7-grad-accum-windows.html.
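As a toy illustration of /N versus /K (the numbers below are made-up per-token loss sums, not real CE values), per-micro-step token means averaged over the window match the global token mean only when every micro-step has the same token count N:

```python
# Toy per-token losses: K=2 micro-steps, N=2 tokens each (illustrative values).
micro_steps = [[2.0, 4.0], [1.0, 3.0]]
K = len(micro_steps)

# /N then /K: per-micro-step token mean, averaged over the window.
per_step_means = [sum(s) / len(s) for s in micro_steps]
window_loss = sum(per_step_means) / K

# Global token mean over all K*N tokens.
all_tokens = [t for s in micro_steps for t in s]
global_mean = sum(all_tokens) / len(all_tokens)
print(window_loss, global_mean)  # 2.5 2.5 -- equal because N is constant

# With unequal N the two normalizations diverge.
micro_steps = [[2.0, 4.0, 6.0], [1.0]]
per_step_means = [sum(s) / len(s) for s in micro_steps]
window_loss = sum(per_step_means) / len(micro_steps)   # (4.0 + 1.0) / 2 = 2.5
all_tokens = [t for s in micro_steps for t in s]
global_mean = sum(all_tokens) / len(all_tokens)        # 13.0 / 4 = 3.25
print(window_loss, global_mean)  # 2.5 3.25
```

This is the variable-length caveat in miniature: equal-length micro-steps make the two conventions agree exactly; ragged micro-steps do not.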
Cross-Entropy Semantics
C cross-entropy uses stable log-sum-exp loss math and PyTorch-aligned index-target mean reduction semantics. For the full CE derivation (p - one_hot), ignore-index denominator details, and long-horizon repro data, read v7-cross-entropy-parity.html.
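A self-contained scalar version of this math (pure Python, not the project's C kernel; mean reduction and ignore-index handling are omitted) shows both the stable log-sum-exp trick and the `p - one_hot` gradient:

```python
import math

def ce_loss_and_dlogits(logits, target):
    """Stable cross-entropy for one token: subtract the max before exponentiating."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))  # log-sum-exp
    loss = lse - logits[target]
    probs = [math.exp(z - lse) for z in logits]               # softmax
    dlogits = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return loss, dlogits

# A naive exp() over these logits would overflow; the shifted form is exact.
loss, d = ce_loss_and_dlogits([2.0, 1.0, -1.0, 1000.0], target=3)
print(loss)                # ~0.0: the huge correct logit dominates
print(abs(sum(d)) < 1e-9)  # True: p - one_hot always sums to zero
```

The zero-sum property of `dlogits` is a cheap invariant worth asserting in any CE backward parity check.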
Production Readiness Guidance
For v7 training backprop, keep production safety enabled (`--max-grad-norm`, enforced LR guardrails) and retain long-horizon parity gates as blockers. This gives a production-grade path for the validated v7 harness profile while preserving reproducibility checks.
Template Flexibility
You can use built-ins (`qwen3`, `qwen2`, `gemma3`) or a custom template file. If a template op has kernel coverage and bindings, it can be stitched automatically.
```sh
# Built-in
version/v7/scripts/cks-v7-run init \
  --run /tmp/v7_exp1 \
  --template qwen3 \
  --init xavier_uniform \
  --generate-ir --generate-runtime --strict

# Custom template
version/v7/scripts/cks-v7-run init \
  --run /tmp/v7_exp_custom \
  --template my_arch \
  --template-file /absolute/path/to/my_template.json \
  --init xavier_uniform \
  --generate-ir --generate-runtime --strict
```
IR1, IR2, Layout, Codegen Responsibilities
| Stage | Input | Output | Core Responsibility |
|---|---|---|---|
| IR1 (train-forward) | template + manifest + kernel registry + grad rules | `ir1_train_forward.json` | Build forward graph, classify weight vs activation tensors, attach save-for-backward contract |
| IR2 (backward) | IR1 + grad rules + bindings | `ir2_train_backward.json` | Synthesize backward ops, add gradient fanout accumulation, preserve producer-consumer chain |
| Layout + Audit (memory-lowered) | IR2 + manifest | `layout_train.json`, `layout_train_audit.json` | Finalize contiguous offsets, section ownership, canary layout, bounds/overlap validation |
| Codegen | IR2 + layout | `generated_train_runtime_v7.c`, `generated_train_runtime_summary_v7.json` | Emit deterministic C calls and memory references only |
Why Train Builder Is Separate From Inference Builder (For Now)
- Inference IR is decode/prefill-oriented and optimized around quantized runtime contracts.
- Training IR adds backward-only semantics: grad fanout merges, loss seed ops, optimizer state, clip/update boundaries.
- Training layout needs persistent regions inference does not own (`grad.weight.*`, `optimizer.m.*`, `optimizer.v.*`).
- Current policy: keep the inference pipeline stable, iterate the training pipeline quickly, then unify once parity + diagnostics are robust.
So today: inference uses `build_ir_v7.py`; training uses `build_ir_train_v7.py` + `lower_ir2_backward_v7.py`. Both share IR types and kernel registry contracts.
How Residual, GQA, and Splits Stitch Backward
The forward graph can branch. IR2 makes the reverse merge explicit by emitting accumulation ops where gradients meet.
Key Rule
If one forward tensor feeds multiple consumers, IR2 inserts explicit gradient accumulation in backward. This is where residual and attention branch merges stay correct.
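A minimal numeric instance of this rule (illustrative Python, not IR2 output): `x` feeds two consumers whose outputs merge additively, so the backward pass must sum the branch gradients, and a central-difference check confirms the merged value:

```python
# Forward: x feeds two consumers, outputs merge additively (residual-style).
def f(x): return 3.0 * x           # branch 1
def g(x): return x * x             # branch 2
def forward(x): return f(x) + g(x)

x = 2.0
# Backward: gradients from each consumer must be ACCUMULATED, not overwritten.
grad_from_f = 3.0                  # df/dx
grad_from_g = 2.0 * x              # dg/dx
grad_x = grad_from_f + grad_from_g # the explicit accumulation op IR2 inserts

# Central-difference check of the merged gradient.
eps = 1e-6
fd = (forward(x + eps) - forward(x - eps)) / (2 * eps)
print(abs(grad_x - fd) < 1e-6)  # True
```

If the second branch's gradient overwrote the first instead of accumulating, the finite-difference check would fail, which is the classic symptom of a missing fanout merge.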
Memory Layout and Canary Diagnostics
Training layout is contiguous and sectioned. Canary guards and runtime checks detect out-of-bounds writes and readonly violations.
| Section | Examples | Written In |
|---|---|---|
| weights | `weight.layer.*` | optimizer step (not forward) |
| activations + saved | `act.*`, `saved.*` | forward |
| grad_activations | `grad.act.*` | backward |
| grad_weights | `grad.weight.*`, `tmp.grad.weight.*` | backward |
| optimizer state | AdamW `m`, `v` | optimizer step |
| temporaries + aux | scratch, loss buffers, diagnostics | multi-phase |
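The canary mechanism amounts to sentinel bytes around each section that are armed once and re-verified after each phase. A minimal sketch of the idea (plain Python over a bytearray; the real sentinel values, guard sizes, and ranges come from the layout artifacts):

```python
CANARY = 0xA5                 # illustrative sentinel byte
GUARD = 8                     # guard bytes on each side of a section

buf = bytearray(GUARD + 32 + GUARD)   # [canary][32-byte section][canary]
for i in range(GUARD):                # arm both guard ranges
    buf[i] = CANARY
    buf[len(buf) - 1 - i] = CANARY

def canaries_intact(b):
    lo = all(b[i] == CANARY for i in range(GUARD))
    hi = all(b[len(b) - 1 - i] == CANARY for i in range(GUARD))
    return lo and hi

# In-bounds write: guards stay intact.
buf[GUARD + 5] = 0xFF
print(canaries_intact(buf))   # True

# Off-by-one past the section end: the high guard catches it.
buf[GUARD + 32] = 0xFF
print(canaries_intact(buf))   # False
```

The value of the scheme is that an out-of-bounds write is localized to a specific section boundary at check time, rather than surfacing later as unexplained numeric drift.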
Runbook: End-to-End Commands
1) Initialize and generate IR/runtime
```sh
version/v7/scripts/cks-v7-run init \
  --run /tmp/v7_exp1 \
  --template qwen3 \
  --init xavier_uniform \
  --layers 2 --embed-dim 128 --hidden-dim 256 \
  --num-heads 8 --num-kv-heads 4 \
  --generate-ir --generate-runtime --strict
```
2) CK-only training (generated C runtime path)
```sh
version/v7/scripts/cks-v7-run train \
  --run /tmp/v7_exp1 \
  --backend ck \
  --prompt "hello" \
  --train-epochs 3 --train-seq-len 16 --train-total-tokens 1024 --train-grad-accum 8 \
  --train-strict
```
3) CK + PyTorch oracle parity
```sh
version/v7/scripts/cks-v7-run train \
  --run /tmp/v7_exp1 \
  --backend both \
  --prompt "hello" \
  --train-epochs 1 --train-seq-len 16 --train-total-tokens 1024 --train-grad-accum 8 \
  --parity-on --parity-profile balanced --dump-on-drift --drift-topk 8
```
4) Generate visual report from run directory
```sh
python3 version/v7/tools/open_ir_visualizer.py \
  --generate --run /tmp/v7_exp1 --html-only
```
Open `/tmp/v7_exp1/ir_report.html`.
New: One-Command Data -> Tokenizer -> Train Pipeline
For operator workflows, use the pipeline wrapper to avoid manual wiring errors such as missing data files or token/vocab mismatches.
Byte tokenizer pipeline (with torch reference)
```sh
make v7-train-data-pipeline \
  RUN=/tmp/v7_pipeline_byte \
  V7_PIPELINE_TOKENIZER=byte \
  V7_PIPELINE_WITH_TORCH=1
```
BPE pipeline
```sh
make v7-train-data-pipeline \
  RUN=/tmp/v7_pipeline_bpe \
  V7_PIPELINE_TOKENIZER=bpe \
  V7_PIPELINE_WITH_TORCH=0
```
| Knob | Purpose | Typical values |
|---|---|---|
| `V7_PIPELINE_TOKENIZER` | Select tokenization path | `byte`, `bpe` |
| `V7_PIPELINE_EPOCHS`, `V7_PIPELINE_SEQ_LEN`, `V7_PIPELINE_TOTAL_TOKENS` | Train budget | `10`, `32`, `1024` |
| `V7_PIPELINE_GRAD_ACCUM`, `V7_PIPELINE_LR` | Update scale | `1..8`, `5e-4` |
| `V7_PIPELINE_OPEN_VIS` | Generate visualizer report automatically | `0`, `1` |
Token-file training now fails fast if token IDs exceed run-dir vocab size. This prevents NaN runs from silent token/vocab mismatch.
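The fail-fast check is conceptually a pre-flight scan like this (illustrative Python; the real check lives in the pipeline wrapper and reads the run-dir vocab size):

```python
def validate_token_ids(token_ids, vocab_size):
    """Reject out-of-vocab IDs before training instead of producing NaN runs."""
    bad = [(i, t) for i, t in enumerate(token_ids) if not 0 <= t < vocab_size]
    if bad:
        pos, tok = bad[0]
        raise ValueError(
            f"token id {tok} at position {pos} exceeds vocab size {vocab_size}"
        )

validate_token_ids([0, 17, 255], vocab_size=256)   # ok: all IDs in range
try:
    validate_token_ids([0, 300], vocab_size=256)   # 300 >= 256 -> fail fast
except ValueError as e:
    print(e)  # token id 300 at position 1 exceeds vocab size 256
```

Failing at load time turns a silent embedding-table out-of-range read into an actionable error message naming the offending token and position.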
New: Runtime Stitch Graph in IR Visualizer
The Training tab now includes a function-level runtime graph fed by `train_exec_plan.json` with explicit stages:
- forward ops
- backward ops
- gradient accumulation ops
- optimizer/update transition
How to inspect stitching
- Generate a report from the run-dir: `python3 version/v7/tools/open_ir_visualizer.py --generate --run /tmp/v7_pipeline_byte --html-only`
- Open `/tmp/v7_pipeline_byte/ir_report.html`
- Go to the Training tab
- Use the Runtime Stitch section filters: `Stage=forward|backward|accum|optimizer` and `Layer=all|Lx`
- Click a node to inspect dispatch/reduction metadata and phase stitching details
Use this graph when first loss is correct but later updates diverge. It makes boundary mistakes visible at function level without opening generated C manually.
Verification Matrix: Epoch 1 to 10 + Checkpoints + Oracle
Use `.venv/bin/python` for the strict PyTorch snapshot oracle. If you use system `python3` without torch, training falls back to tiny reference telemetry instead of strict slot-level snapshot checks.
```sh
# 0) init
.venv/bin/python version/v7/scripts/ck_run_v7.py init \
  --run /tmp/v7_oracle_10ep \
  --init xavier_uniform \
  --layers 2 --vocab-size 256 --embed-dim 128 --hidden-dim 256 \
  --num-heads 8 --num-kv-heads 4 --context-len 128 \
  --generate-ir --generate-runtime --strict

# 1) epoch-1 strict oracle (every step) + replay + periodic checkpoints
.venv/bin/python version/v7/scripts/ck_run_v7.py train \
  --run /tmp/v7_oracle_10ep \
  --backend ck \
  --train-epochs 1 --train-seq-len 8 --train-total-tokens 64 --train-grad-accum 2 \
  --train-vocab 256 --train-d-model 64 --train-hidden 128 \
  --parity-on --oracle pytorch --parity-profile debug --parity-every 1 \
  --parity-replay-on-check --train-save-every 4

# 2) epoch-10 sweep strict oracle + replay + checkpoint cadence
.venv/bin/python version/v7/scripts/ck_run_v7.py train \
  --run /tmp/v7_oracle_10ep \
  --backend ck \
  --train-epochs 10 --train-seq-len 8 --train-total-tokens 64 --train-grad-accum 2 \
  --train-vocab 256 --train-d-model 64 --train-hidden 128 \
  --parity-on --oracle pytorch --parity-profile balanced --parity-every 1 \
  --parity-replay-on-check --train-save-every 20

# 3) memory verification suite (canary toggle + fault injection + ASan agreement)
.venv/bin/python version/v7/scripts/ck_run_v7.py train \
  --run /tmp/v7_oracle_10ep \
  --backend ck \
  --train-epochs 1 --train-seq-len 8 --train-total-tokens 64 --train-grad-accum 2 \
  --train-vocab 256 --train-d-model 64 --train-hidden 128 \
  --train-verify-memory --train-verify-steps 4
```
| Artifact | Expected | Location |
|---|---|---|
| train summary | `pass_parity: true`, strict oracle source + replay checks | `/tmp/v7_oracle_10ep/train_e2e_latest.json` |
| checkpoints | `weights_step_*.bump` + per-step manifest | `/tmp/v7_oracle_10ep/checkpoints/` |
| memory verification | `ok: true` (toggle diff, intentional +1 catch, ASan agreement, bounds) | `/tmp/v7_oracle_10ep/memory_verification_latest.json` |
| viewer report | training tabs populated from run-dir artifacts | `/tmp/v7_oracle_10ep/ir_report.html` |
Backprop + PyTorch Parity Test Suite
This is the concrete v7 inventory used to keep backprop parity green. By default, artifacts write to `version/v7/.cache/reports/` (override with `V7_REPORT_DIR`).
Canonical Regression Ledger
Use this as the single source of truth for historical failures and fixes so new model architectures do not repeat old bugs:
- `version/v7/reports/REGRESSION_LEDGER.md` (human-readable)
- `version/v7/reports/REGRESSION_LEDGER.json` (IR visualizer/operator-readable)
| Script | Primary Entry | What It Verifies | Main Artifact |
|---|---|---|---|
| `run_parity_1token_v7.py` (kernel) | `make v7-parity-1tok` | T=1 parity for RMSNorm (fwd/bwd), SwiGLU (fwd/bwd), CE (loss + dlogits), and QK-norm backward hook. | `parity_1token_latest.json` |
| `check_qk_norm_backward_parity_v7.py` (kernel) | `make v7-qk-norm-backward-parity*` | QK-norm backward parity, ISA matrix sweep (scalar/avx/avx2/avx_vnni), and strict no-fallback mode. | `qk_norm_backward_parity*_latest.json` |
| `check_rms_swiglu_backward_parity_v7.py` (kernel) | `make v7-rms-swiglu-backward-parity` | RMSNorm backward default+strict paths, SwiGLU exact path, and optional fast-path gate. | `rms_swiglu_backward_parity_latest.json` |
| `check_gemm_backward_parity_v7.py` (kernel) | `make v7-kernel-parity-train` | GEMM backward shape-sweep parity for v7 training kernel bindings. | `gemm_backward_shape_sweep_latest.json` |
| `check_optimizer_parity_v7.py` (kernel) | `make v7-kernel-parity-train` | Optimizer kernels: `adamw_update_f32`, `gradient_clip_norm_f32`, `adamw_clip_update_multi_f32`, `gradient_accumulate_f32`. | `optimizer_parity_latest.json` |
| `check_fd_gradients_v7.py` (audit) | `make v7-grad-fd` | Finite-difference sanity against analytical gradients in the tiny training stack. | `fd_gradients_latest.json` |
| `check_replay_determinism_v7.py` (runtime) | `make v7-replay` | Deterministic replay across identical training runs (same losses + parameter deltas). | `replay_determinism_latest.json` |
| `check_runtime_replay_accum_v7.py` (runtime) | `make v7-replay-accum` | Replay-on-check correctness with `grad_accum > 1`, including accumulation snapshot restoration. | `replay_accum_latest.json` |
| `train_parity_epochs_v7.py` (drift) | `make v7-train-parity-3/5` and long-horizon targets | Epoch parity, drift smoke, drift localization, and long-horizon CK-vs-PyTorch trajectory checks. | `train_parity_*_latest.json` |
| `check_backprop_plumbing_v7.py` (audit) | `make v7-backprop-plumbing` | Static IR2/layout/manifest audit: grad coverage, writer coverage, per-layer flow, saved-tensor + layout integrity. | `backprop_plumbing_latest.json` |
| `check_backprop_stitch_runtime_v7.py` (runtime) | `make v7-backprop-stitch-runtime*` | Runtime stitch smoke: manifest-dim wiring, first checked parity step cleanliness, and all checked-step consistency. | `backprop_stitch_runtime*_latest.json` |
| `test_bpe_train_parity_v7.py` (runtime) | `make test-v7-bpe-train-parity` | BPE trainer parity: HF tokenizer JSON vs CK binary tokenizer artifacts on deterministic text probes. | `v7_bpe_train_parity_latest.json` |
| `test_svg_overfit_regression_v7.py` (runtime) | `make test-v7-svg-overfit-regression` | CK runtime vs PyTorch reference overfit regression on a deterministic SVG-derived dataset. | `svg_overfit_regression_latest.json` |
| `ck_run_v7.py` (runtime) | `make v7-train-runtime-parity-*` | Generated runtime long-horizon parity checks (stress + realistic) with `--parity-on`. | `train_runtime_parity_*_latest.json` |
Exhaustive Test-Case Matrix (Backprop + Parity)
| Case Family | Exact Command(s) | Pass Condition | Artifact(s) |
|---|---|---|---|
| Single-token parity baseline | `make v7-parity-1tok` | All T=1 kernel checks pass under strict tolerances. | `parity_1token_latest.json` |
| QK-norm backward (default + ISA + strict ISA) | `make v7-qk-norm-backward-parity`<br>`make v7-qk-norm-backward-parity-isa`<br>`make v7-qk-norm-backward-parity-isa-strict` | Parity holds across ISA modes; strict mode fails on any fallback mismatch. | `qk_norm_backward_parity_latest.json`<br>`qk_norm_backward_parity_isa_latest.json`<br>`qk_norm_backward_parity_isa_strict_latest.json` |
| RMSNorm/SwiGLU backward parity | `make v7-rms-swiglu-backward-parity` | RMSNorm default+strict and SwiGLU exact/fast checks all stay within tolerance. | `rms_swiglu_backward_parity_latest.json` |
| Kernel parity bundle | `make v7-kernel-parity-train` | Optimizer + QK-norm ISA + RMS/SwiGLU + GEMM-backward sweeps all pass. | `optimizer_parity_latest.json`<br>`qk_norm_backward_parity_isa*_latest.json`<br>`rms_swiglu_backward_parity_latest.json`<br>`gemm_backward_shape_sweep_latest.json` |
| Finite-difference gradients | `make v7-grad-fd` | Central-difference numerical grads agree with analytical grads. | `fd_gradients_latest.json` |
| Deterministic replay | `make v7-replay` | Two identical runs produce identical replay outputs. | `replay_determinism_latest.json` |
| Replay with grad-accum snapshots | `make v7-replay-accum` | Replay-on-check is enabled and accumulation state restores within tolerance. | `replay_accum_latest.json` |
| Backprop plumbing static audit | `make v7-backprop-plumbing` | IR2 graph/dataflow, grad writers, saved tensors, layout, and manifest wiring all pass audit checks. | `backprop_plumbing_latest.json` |
| Runtime stitch smoke (step-1) | `make v7-backprop-stitch-runtime`<br>`make v7-backprop-stitch-runtime-accum` | Manifest dims override conflicting requests; first checked parity step is clean; checked steps stay clean. | `backprop_stitch_runtime_latest.json`<br>`backprop_stitch_runtime_accum_latest.json` |
| Epoch parity sweeps | `make v7-train-parity-3`<br>`make v7-train-parity-5` | Multi-epoch CK-vs-PyTorch parity stays within configured tolerances. | `train_parity_epochs_3_latest.json`<br>`train_parity_epochs_5_latest.json` |
| Drift smoke and localization | `make v7-train-parity-drift-smoke`<br>`make v7-train-parity-drift-localize` | Smoke run reports bounded drift; localization emits same-state stage diagnostics near the target step. | `train_parity_drift_smoke_latest.json`<br>`train_parity_drift_localize_latest.json` |
| Long-horizon parity (stress + realistic) | `make v7-train-parity-long-horizon`<br>`make v7-train-parity-long-horizon-realistic` | Long-run loss and parameter parity stay within blocker thresholds. | `train_parity_long_horizon_latest.json`<br>`train_parity_realistic_long_horizon_latest.json` |
| Generated-runtime long-horizon parity | `make v7-train-runtime-parity-stress`<br>`make v7-train-runtime-parity-realistic`<br>`make v7-train-runtime-parity-long-horizon` | Generated C runtime with `--parity-on` stays aligned with the oracle over stress and realistic prompts. | `train_runtime_parity_stress_latest.json`<br>`train_runtime_parity_realistic_latest.json` |
| BPE train parity gate | `make test-v7-bpe-train-parity` | Tokenizer IDs match between HF tokenizer JSON and the CK true-BPE binary path. | `v7_bpe_train_parity_latest.json` |
| SVG overfit regression gate | `make test-v7-svg-overfit-regression` | CK and PyTorch overfit curves satisfy configured first/min/final loss thresholds. | `svg_overfit_regression_latest.json` |
| Full train gate bundle | `make v7-gate-train` | Contract checks, IR/layout/codegen smoke, parity kernels, FD, replay, and enabled optional gates all pass. | `contract_report_latest.json` + parity/runtime gate artifacts above |
| Production-ready nightly bundle | `make v7-backprop-production-ready` | `v7-gate-train` passes with production safety, then the realistic long-horizon blocker + stress monitor run. | `train_parity_realistic_long_horizon_latest.json` + gate artifacts |
PR4.5 Throughput Track
- Add explicit threaded GEMM dispatch for v7 training runtime (not serial-only blocked calls).
- Compile runtime with consistent parallel flags and deterministic thread control.
- Prepack/plan GEMM weight access to reduce cache misses.
- Reduce backward buffer traffic hotspots (`grad_accumulate`, `memset`, `memmove`).
- Keep strict parity tolerances unchanged while optimizing.
Performance Workflow (After Parity Is Green)
- Run a CK baseline without the oracle: `--backend ck --parity-on=false`.
- Run reference harness timing: `--backend both` (captures CK vs torch step metrics).
- Use `make profile-v7-full` or `make v7-perf-gate` for perf + flamegraph artifacts (see v7-profiling.html).
- Use VTune for microarchitecture + memory bottlenecks (`hotspots`, `uarch-exploration`, `memory-access`), and Advisor for roofline analysis.
- Use ASan for bounds/UB; use Valgrind when you need slower but deeper leak/use-after-free analysis.
Threading Sanity
Before optimization work, confirm CK runtime is saturating cores and not running in an accidental single-thread path. Use train step timing plus profiler traces to validate threadpool/OpenMP utilization, then optimize kernel hotspots (usually GEMM backward and attention backward first).
How You Know It Is Actually Training
- `train_e2e_latest.json` shows a loss curve with a decreasing trend
- `training_loss_curve.json` and `training_grad_norms.json` are populated
- `training_parity.json` reports PASS for oracle checkpoints (when enabled)
- `layout_train_audit.json` and memory diagnostics show no bounds/overlap/canary failures
- The generated runtime summary includes the expected forward/backward/optimizer kernel calls
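These signals can be checked mechanically. A sketch of such a health check follows; the JSON shape (a `loss` field per curve point) is an assumption, so adjust field names to the actual artifact schema:

```python
import json
import pathlib

def loss_is_decreasing(losses, tail=5):
    """Crude trend check: mean of the last few losses is below the first few."""
    if len(losses) < 2 * tail:
        return losses[-1] < losses[0]
    head = sum(losses[:tail]) / tail
    recent = sum(losses[-tail:]) / tail
    return recent < head

def check_run(run_dir):
    """Assert the basic 'it is actually training' signals for a run directory."""
    run = pathlib.Path(run_dir)
    curve = json.loads((run / "training_loss_curve.json").read_text())
    losses = [point["loss"] for point in curve]   # assumed field name
    assert losses, "empty loss curve"
    assert loss_is_decreasing(losses), "loss is not trending down"
    return True

print(loss_is_decreasing([4.0, 3.5, 3.1, 2.8, 2.2]))  # True
```

A trend check on a tail window is deliberately crude: it tolerates step-to-step noise while still catching a flat or diverging run early.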
Debug Fast: First-Divergence and Memory Safety
| Symptom | First Artifact | Likely Root |
|---|---|---|
| Parity drift appears after a few steps | `drift_report.json` | Specific op mismatch, accumulation edge, or dimension binding issue |
| Crash or invalid pointer in CK backend | `memory_diagnostic_latest.json` | Out-of-bounds write, incorrect slot size/offset, or bad kernel args |
| Layout audit fails | `layout_train_audit.json` | Overlap, section bounds error, alignment or ownership mismatch |
| Generated runtime compiles but output is unstable | `generated_train_runtime_summary_v7.json` | Wrong kernel binding or call-order mismatch from upstream IR |
What Changes for Backprop vs Inference
- Inference complexity is dominated by quantized weight paths and decode-time KV behavior.
- Backprop complexity is dominated by memory pressure and graph bookkeeping: saved activations, gradient routing, accumulation, optimizer state.
- Training usually runs float paths (fp32/bf16 policy) rather than quantized weight formats.
- Keeping training pipeline partially separated while contracts stabilize is a pragmatic step, not architectural drift.
Artifact Checklist in a Healthy Run Directory
```
/tmp/v7_exp1/
  weights.bump
  weights_manifest.json
  ir1_train_forward.json
  ir2_train_backward.json
  ir_train_invariants.json
  layout_train.json
  layout_train_audit.json
  generated_train_runtime_v7.c
  generated_train_runtime_summary_v7.json
  libtrain.so
  train_e2e_latest.json
  training_loss_curve.json
  training_grad_norms.json
  training_parity.json
  memory_diagnostic_latest.json
  ir_report.html
```
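A short presence check can automate this checklist (plain Python; the `EXPECTED` list here is a subset of the files above and can be extended for runs with extra gates enabled):

```python
import pathlib

# Subset of the healthy-run checklist; extend as needed for your run profile.
EXPECTED = [
    "weights.bump", "weights_manifest.json",
    "ir1_train_forward.json", "ir2_train_backward.json",
    "layout_train.json", "layout_train_audit.json",
    "generated_train_runtime_v7.c", "libtrain.so",
    "train_e2e_latest.json", "ir_report.html",
]

def missing_artifacts(run_dir):
    """Return the expected artifact names that are absent from run_dir."""
    run = pathlib.Path(run_dir)
    return [name for name in EXPECTED if not (run / name).exists()]

# Usage: an empty list means the run directory is fully stitched.
missing = missing_artifacts("/tmp/v7_exp1")
print(missing or "all artifacts present")
```

Wiring this into CI before the parity gates gives a fast, unambiguous failure mode for broken generation steps instead of a confusing downstream error.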
Use this page as the operator baseline: if these artifacts line up and the gates pass, your v7 backprop pipeline is stitched correctly.