v7 Backprop IR Pipeline (Operator Guide)
This page explains how v7 training works end to end, with a focus on what is generated from IR, what is layout/codegen output, where logic lives, and how to debug correctness quickly.
There is no `ck_run_v7.c`. The operator entrypoint is `version/v7/scripts/ck_run_v7.py` (and the wrapper `version/v7/scripts/cks-v7-run`). It orchestrates generation and execution of the generated C runtimes.
What Runs in C vs Python
| Layer | Primary Runtime | Why It Exists | Current Operator Command |
|---|---|---|---|
| Orchestration and run management | Python | Build artifacts, launch gates, collect reports, drive strict checks | cks-v7-run init|train|sanity|parity|profile |
| Forward and backward kernel math | C kernels | Deterministic numeric core, SIMD paths, parity-tested primitives | make test, make v7-gate-train |
| Generated training runtime | C (generated) | Dumb emitter executes IR-lowered plan and memory offsets | generated_train_runtime_v7.c, libtrain.so |
| Oracle parity and drift localization | Python + PyTorch | Reference implementation for step-level and slot-level checks | --backend both, --parity-on |
| IR report rendering | Python + HTML/JS | Load run artifacts and present explainable diagnostics | open_ir_visualizer.py --generate --run ... |
Mental Model: What Is Smart and What Is Dumb
Smart Stages
- Template selection: architecture op sequence and flags
- IR1: forward op graph with typed tensors and kernel IDs
- IR2: backward synthesis, fanout accumulation, grad edges
- Layout generation: contiguous offsets, sections, canary map
- Validation: invariants, memory audit, parity, drift checks
Dumb Stage
- Codegen only emits what lowered IR and layout already decided
- No architecture guessing in emitter
- No hidden model-family conditionals in runtime glue
- If behavior is wrong, root cause is usually upstream in IR/lowering/layout
From Init to Train
- Initialize run: choose init policy, dimensions, dtype policy, and template.
- Emit weights: write `weights.bump` and `weights_manifest.json`.
- Build IR1: forward graph and typed tensor registry.
- Lower IR2: backward graph from IR1 + grad rules, with explicit accumulations.
- Generate layout: contiguous memory map with sections and canary ranges.
- Codegen runtime: emit the train-runtime C and compile it to `libtrain.so`.
- Execute train: run the CK path, optional PyTorch oracle checks, write reports.
Drift Timeline: Early Schedule Bug to CE Numeric Tightening
Yes, execution order changed in an important way. The generated train runtime now enforces accumulation-window semantics so optimizer updates only happen at the true boundary of CK_GRAD_ACCUM_STEPS micro-steps.
- Before: optimizer timing could drift relative to the intended accumulation window.
- Now: each window calls `zero_grad` once, accumulates grads over micro-steps, then applies one optimizer step at the boundary.
- Impact: the early drift signal (~step 65 in stress) was removed; remaining drift moved to much later horizons and required CE semantic/numeric tightening.
Current Numeric Status
- Historical baseline showed late drift in full-C stress; CE parity fixes now remove that trigger in the 850-step repro.
- All three loss backends (`c`, `c_ptref`, `torch`) pass the same deterministic 850-step parity run.
- Training-side production gates (drift smoke, realistic long-horizon, optimizer parity, FD-grad, replay) all pass in this validation pass.
- Interpretation rule: treat generated-runtime parity artifacts (`train_runtime_parity_*_latest.json`) as release truth; older step-65/step-800 harness failures are historical diagnostics.
Schedule Pseudocode (Expected)
```
if g_accum_step == 0: zero_grad()
forward()
backward()
g_accum_step += 1
if g_accum_step >= CK_GRAD_ACCUM_STEPS:
    scale_grads(1 / g_accum_step)  # average over the accumulation window
    optimizer_step()
    g_accum_step = 0
```
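This window schedule can be sanity-checked in isolation. The sketch below is plain Python, independent of the project; the linear toy loss and the `K` value standing in for `CK_GRAD_ACCUM_STEPS` are illustrative. It verifies that accumulating K micro-step gradients and scaling by 1/K yields the same update as averaging the per-micro-batch gradients directly:

```python
# Minimal simulation of the accumulation-window schedule (illustrative toy loss).
K = 4                                    # stands in for CK_GRAD_ACCUM_STEPS
micro_batches = [[1.0, 2.0], [3.0, 4.0], [0.5, 1.5], [2.5, 3.5]]
lr = 0.1

def grad(w, batch):
    # d/dw of mean((w*x - 1)^2) over the micro-batch
    return sum(2.0 * (w * x - 1.0) * x for x in batch) / len(batch)

# Window schedule: zero once, accumulate K micro-grads, one optimizer step.
w = 0.0
g_accum, g_accum_step = 0.0, 0
for batch in micro_batches:
    g_accum += grad(w, batch)            # backward() accumulates into grads
    g_accum_step += 1
    if g_accum_step >= K:
        w -= lr * (g_accum / g_accum_step)   # scale_grads(1/K) + optimizer_step
        g_accum, g_accum_step = 0.0, 0

# Reference: one step on the average of the per-micro-batch gradients.
w_ref = 0.0 - lr * (sum(grad(0.0, b) for b in micro_batches) / K)
print(abs(w - w_ref) < 1e-12)  # True: boundary semantics match
```

If the optimizer step fires inside the window instead of at the boundary, the later micro-grads are evaluated at a moved `w` and the two results diverge, which is exactly the drift signature described above.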
Grad-Accum Math (N vs K)
For an exact explanation of CE token normalization (/N) versus accumulation-window normalization (/K), plus CPU effective-batch examples and variable-length caveats, see v7-grad-accum-windows.html.
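As a toy illustration of /N versus /K (the numbers below are made-up per-token loss sums, not real CE values), per-micro-step token means averaged over the window match the global token mean only when every micro-step has the same token count N:

```python
# Toy per-token losses: K=2 micro-steps, N=2 tokens each (illustrative values).
micro_steps = [[2.0, 4.0], [1.0, 3.0]]
K = len(micro_steps)

# /N then /K: per-micro-step token mean, averaged over the window.
per_step_means = [sum(s) / len(s) for s in micro_steps]
window_loss = sum(per_step_means) / K

# Global token mean over all K*N tokens.
all_tokens = [t for s in micro_steps for t in s]
global_mean = sum(all_tokens) / len(all_tokens)
print(window_loss, global_mean)  # 2.5 2.5 -- equal because N is constant

# With unequal N the two normalizations diverge.
micro_steps = [[2.0, 4.0, 6.0], [1.0]]
per_step_means = [sum(s) / len(s) for s in micro_steps]
window_loss = sum(per_step_means) / len(micro_steps)   # (4.0 + 1.0) / 2 = 2.5
all_tokens = [t for s in micro_steps for t in s]
global_mean = sum(all_tokens) / len(all_tokens)        # 13.0 / 4 = 3.25
print(window_loss, global_mean)  # 2.5 3.25
```

This is the variable-length caveat in miniature: equal-length micro-steps make the two conventions agree exactly; ragged micro-steps do not.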
Cross-Entropy Semantics
C cross-entropy uses stable log-sum-exp loss math and PyTorch-aligned index-target mean reduction semantics. For the full CE derivation (p - one_hot), ignore-index denominator details, and long-horizon repro data, read v7-cross-entropy-parity.html.
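A self-contained scalar version of this math (pure Python, not the project's C kernel; mean reduction and ignore-index handling are omitted) shows both the stable log-sum-exp trick and the `p - one_hot` gradient:

```python
import math

def ce_loss_and_dlogits(logits, target):
    """Stable cross-entropy for one token: subtract the max before exponentiating."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))  # log-sum-exp
    loss = lse - logits[target]
    probs = [math.exp(z - lse) for z in logits]               # softmax
    dlogits = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return loss, dlogits

# A naive exp() over these logits would overflow; the shifted form is exact.
loss, d = ce_loss_and_dlogits([2.0, 1.0, -1.0, 1000.0], target=3)
print(loss)                # ~0.0: the huge correct logit dominates
print(abs(sum(d)) < 1e-9)  # True: p - one_hot always sums to zero
```

The zero-sum property of `dlogits` is a cheap invariant worth asserting in any CE backward parity check.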
Production Readiness Guidance
For v7 training backprop, keep production safety enabled (`--max-grad-norm`, enforced LR guardrails) and retain long-horizon parity gates as blockers. This gives a production-grade path for the validated v7 harness profile while preserving reproducibility checks.
Template Flexibility
You can use built-ins (`qwen3`, `qwen2`, `gemma3`) or a custom template file. If a template op has kernel coverage and bindings, it can be stitched automatically.
```sh
# Built-in
version/v7/scripts/cks-v7-run init \
  --run /tmp/v7_exp1 \
  --template qwen3 \
  --init xavier_uniform \
  --generate-ir --generate-runtime --strict

# Custom template
version/v7/scripts/cks-v7-run init \
  --run /tmp/v7_exp_custom \
  --template my_arch \
  --template-file /absolute/path/to/my_template.json \
  --init xavier_uniform \
  --generate-ir --generate-runtime --strict
```
IR1, IR2, Layout, Codegen Responsibilities
| Stage | Input | Output | Core Responsibility |
|---|---|---|---|
| IR1 (train-forward) | template + manifest + kernel registry + grad rules | `ir1_train_forward.json` | Build forward graph, classify weight vs activation tensors, attach save-for-backward contract |
| IR2 (backward) | IR1 + grad rules + bindings | `ir2_train_backward.json` | Synthesize backward ops, add gradient fanout accumulation, preserve producer-consumer chain |
| Layout + Audit (memory-lowered) | IR2 + manifest | `layout_train.json`, `layout_train_audit.json` | Finalize contiguous offsets, section ownership, canary layout, bounds/overlap validation |
| Codegen | IR2 + layout | `generated_train_runtime_v7.c`, `generated_train_runtime_summary_v7.json` | Emit deterministic C calls and memory references only |
Why Train Builder Is Separate From Inference Builder (For Now)
- Inference IR is decode/prefill-oriented and optimized around quantized runtime contracts.
- Training IR adds backward-only semantics: grad fanout merges, loss seed ops, optimizer state, clip/update boundaries.
- Training layout needs persistent regions inference does not own (`grad.weight.*`, `optimizer.m.*`, `optimizer.v.*`).
- Current policy: keep the inference pipeline stable, iterate the training pipeline quickly, then unify once parity + diagnostics are robust.
So today: inference uses `build_ir_v7.py`; training uses `build_ir_train_v7.py` + `lower_ir2_backward_v7.py`. Both share IR types and kernel registry contracts.
How Residual, GQA, and Splits Stitch Backward
The forward graph can branch. IR2 makes the reverse merge explicit by emitting accumulation ops where gradients meet.
Key Rule
If one forward tensor feeds multiple consumers, IR2 inserts explicit gradient accumulation in backward. This is where residual and attention branch merges stay correct.
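A minimal numeric instance of this rule (illustrative Python, not IR2 output): `x` feeds two consumers whose outputs merge additively, so the backward pass must sum the branch gradients, and a central-difference check confirms the merged value:

```python
# Forward: x feeds two consumers, outputs merge additively (residual-style).
def f(x): return 3.0 * x           # branch 1
def g(x): return x * x             # branch 2
def forward(x): return f(x) + g(x)

x = 2.0
# Backward: gradients from each consumer must be ACCUMULATED, not overwritten.
grad_from_f = 3.0                  # df/dx
grad_from_g = 2.0 * x              # dg/dx
grad_x = grad_from_f + grad_from_g # the explicit accumulation op IR2 inserts

# Central-difference check of the merged gradient.
eps = 1e-6
fd = (forward(x + eps) - forward(x - eps)) / (2 * eps)
print(abs(grad_x - fd) < 1e-6)  # True
```

If the second branch's gradient overwrote the first instead of accumulating, the finite-difference check would fail, which is the classic symptom of a missing fanout merge.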
Memory Layout and Canary Diagnostics
Training layout is contiguous and sectioned. Canary guards and runtime checks detect out-of-bounds writes and readonly violations.
| Section | Examples | Written In |
|---|---|---|
| weights | `weight.layer.*` | optimizer step (not forward) |
| activations + saved | `act.*`, `saved.*` | forward |
| grad_activations | `grad.act.*` | backward |
| grad_weights | `grad.weight.*`, `tmp.grad.weight.*` | backward |
| optimizer state | AdamW `m`, `v` | optimizer step |
| temporaries + aux | scratch, loss buffers, diagnostics | multi-phase |
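The canary mechanism amounts to sentinel bytes around each section that are armed once and re-verified after each phase. A minimal sketch of the idea (plain Python over a bytearray; the real sentinel values, guard sizes, and ranges come from the layout artifacts):

```python
CANARY = 0xA5                 # illustrative sentinel byte
GUARD = 8                     # guard bytes on each side of a section

buf = bytearray(GUARD + 32 + GUARD)   # [canary][32-byte section][canary]
for i in range(GUARD):                # arm both guard ranges
    buf[i] = CANARY
    buf[len(buf) - 1 - i] = CANARY

def canaries_intact(b):
    lo = all(b[i] == CANARY for i in range(GUARD))
    hi = all(b[len(b) - 1 - i] == CANARY for i in range(GUARD))
    return lo and hi

# In-bounds write: guards stay intact.
buf[GUARD + 5] = 0xFF
print(canaries_intact(buf))   # True

# Off-by-one past the section end: the high guard catches it.
buf[GUARD + 32] = 0xFF
print(canaries_intact(buf))   # False
```

The value of the scheme is that an out-of-bounds write is localized to a specific section boundary at check time, rather than surfacing later as unexplained numeric drift.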
Runbook: End-to-End Commands
1) Initialize and generate IR/runtime
```sh
version/v7/scripts/cks-v7-run init \
  --run /tmp/v7_exp1 \
  --template qwen3 \
  --init xavier_uniform \
  --layers 2 --embed-dim 128 --hidden-dim 256 \
  --num-heads 8 --num-kv-heads 4 \
  --generate-ir --generate-runtime --strict
```
2) CK-only training (generated C runtime path)
```sh
version/v7/scripts/cks-v7-run train \
  --run /tmp/v7_exp1 \
  --backend ck \
  --prompt "hello" \
  --train-epochs 3 --train-seq-len 16 --train-total-tokens 1024 --train-grad-accum 8 \
  --train-strict
```
3) CK + PyTorch oracle parity
```sh
version/v7/scripts/cks-v7-run train \
  --run /tmp/v7_exp1 \
  --backend both \
  --prompt "hello" \
  --train-epochs 1 --train-seq-len 16 --train-total-tokens 1024 --train-grad-accum 8 \
  --parity-on --parity-profile balanced --dump-on-drift --drift-topk 8
```
4) Generate visual report from run directory
```sh
python3 version/v7/tools/open_ir_visualizer.py \
  --generate --run /tmp/v7_exp1 --html-only
```
Open `/tmp/v7_exp1/ir_report.html`.
New: One-Command Data -> Tokenizer -> Train Pipeline
For operator workflows, use the pipeline wrapper to avoid manual wiring errors such as missing data files or token/vocab mismatches.
Byte tokenizer pipeline (with torch reference)
```sh
make v7-train-data-pipeline \
  RUN=/tmp/v7_pipeline_byte \
  V7_PIPELINE_TOKENIZER=byte \
  V7_PIPELINE_WITH_TORCH=1
```
BPE pipeline
```sh
make v7-train-data-pipeline \
  RUN=/tmp/v7_pipeline_bpe \
  V7_PIPELINE_TOKENIZER=bpe \
  V7_PIPELINE_WITH_TORCH=0
```
| Knob | Purpose | Typical values |
|---|---|---|
| `V7_PIPELINE_TOKENIZER` | Select tokenization path | `byte`, `bpe` |
| `V7_PIPELINE_EPOCHS`, `V7_PIPELINE_SEQ_LEN`, `V7_PIPELINE_TOTAL_TOKENS` | Train budget | `10`, `32`, `1024` |
| `V7_PIPELINE_GRAD_ACCUM`, `V7_PIPELINE_LR` | Update scale | `1..8`, `5e-4` |
| `V7_PIPELINE_OPEN_VIS` | Generate visualizer report automatically | `0`, `1` |
Token-file training now fails fast if token IDs exceed run-dir vocab size. This prevents NaN runs from silent token/vocab mismatch.
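The fail-fast check is conceptually a pre-flight scan like this (illustrative Python; the real check lives in the pipeline wrapper and reads the run-dir vocab size):

```python
def validate_token_ids(token_ids, vocab_size):
    """Reject out-of-vocab IDs before training instead of producing NaN runs."""
    bad = [(i, t) for i, t in enumerate(token_ids) if not 0 <= t < vocab_size]
    if bad:
        pos, tok = bad[0]
        raise ValueError(
            f"token id {tok} at position {pos} exceeds vocab size {vocab_size}"
        )

validate_token_ids([0, 17, 255], vocab_size=256)   # ok: all IDs in range
try:
    validate_token_ids([0, 300], vocab_size=256)   # 300 >= 256 -> fail fast
except ValueError as e:
    print(e)  # token id 300 at position 1 exceeds vocab size 256
```

Failing at load time turns a silent embedding-table out-of-range read into an actionable error message naming the offending token and position.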
New: Runtime Stitch Graph in IR Visualizer
The Training tab now includes a function-level runtime graph fed by `train_exec_plan.json` with explicit stages:
- forward ops
- backward ops
- gradient accumulation ops
- optimizer/update transition
How to inspect stitching
- Generate a report from the run-dir: `python3 version/v7/tools/open_ir_visualizer.py --generate --run /tmp/v7_pipeline_byte --html-only`
- Open `/tmp/v7_pipeline_byte/ir_report.html`
- Go to the Training tab
- Use the Runtime Stitch section filters: `Stage=forward|backward|accum|optimizer` and `Layer=all|Lx`
- Click a node to inspect dispatch/reduction metadata and phase stitching details
Use this graph when first loss is correct but later updates diverge. It makes boundary mistakes visible at function level without opening generated C manually.
Verification Matrix: Epoch 1 to 10 + Checkpoints + Oracle
Use `.venv/bin/python` for the strict PyTorch snapshot oracle. If you use system `python3` without torch, training falls back to tiny reference telemetry instead of strict slot-level snapshot checks.
```sh
# 0) init
.venv/bin/python version/v7/scripts/ck_run_v7.py init \
  --run /tmp/v7_oracle_10ep \
  --init xavier_uniform \
  --layers 2 --vocab-size 256 --embed-dim 128 --hidden-dim 256 \
  --num-heads 8 --num-kv-heads 4 --context-len 128 \
  --generate-ir --generate-runtime --strict

# 1) epoch-1 strict oracle (every step) + replay + periodic checkpoints
.venv/bin/python version/v7/scripts/ck_run_v7.py train \
  --run /tmp/v7_oracle_10ep \
  --backend ck \
  --train-epochs 1 --train-seq-len 8 --train-total-tokens 64 --train-grad-accum 2 \
  --train-vocab 256 --train-d-model 64 --train-hidden 128 \
  --parity-on --oracle pytorch --parity-profile debug --parity-every 1 \
  --parity-replay-on-check --train-save-every 4

# 2) epoch-10 sweep strict oracle + replay + checkpoint cadence
.venv/bin/python version/v7/scripts/ck_run_v7.py train \
  --run /tmp/v7_oracle_10ep \
  --backend ck \
  --train-epochs 10 --train-seq-len 8 --train-total-tokens 64 --train-grad-accum 2 \
  --train-vocab 256 --train-d-model 64 --train-hidden 128 \
  --parity-on --oracle pytorch --parity-profile balanced --parity-every 1 \
  --parity-replay-on-check --train-save-every 20

# 3) memory verification suite (canary toggle + fault injection + ASan agreement)
.venv/bin/python version/v7/scripts/ck_run_v7.py train \
  --run /tmp/v7_oracle_10ep \
  --backend ck \
  --train-epochs 1 --train-seq-len 8 --train-total-tokens 64 --train-grad-accum 2 \
  --train-vocab 256 --train-d-model 64 --train-hidden 128 \
  --train-verify-memory --train-verify-steps 4
```
| Artifact | Expected | Location |
|---|---|---|
| train summary | `pass_parity: true`, strict oracle source + replay checks | `/tmp/v7_oracle_10ep/train_e2e_latest.json` |
| checkpoints | `weights_step_*.bump` + per-step manifest | `/tmp/v7_oracle_10ep/checkpoints/` |
| memory verification | `ok: true` (toggle diff, intentional +1 catch, ASan agreement, bounds) | `/tmp/v7_oracle_10ep/memory_verification_latest.json` |
| viewer report | training tabs populated from run-dir artifacts | `/tmp/v7_oracle_10ep/ir_report.html` |
Backprop + PyTorch Parity Test Suite
This is the concrete v7 inventory used to keep backprop parity green. By default, artifacts write to `version/v7/.cache/reports/` (override with `V7_REPORT_DIR`).
Canonical Regression Ledger
Use this as the single source of truth for historical failures and fixes so new model architectures do not repeat old bugs:
- `version/v7/reports/REGRESSION_LEDGER.md` (human-readable)
- `version/v7/reports/REGRESSION_LEDGER.json` (IR visualizer/operator-readable)
| Script | Primary Entry | What It Verifies | Main Artifact |
|---|---|---|---|
| `run_parity_1token_v7.py` (kernel) | `make v7-parity-1tok` | T=1 parity for RMSNorm (fwd/bwd), SwiGLU (fwd/bwd), CE (loss + dlogits), and QK-norm backward hook. | `parity_1token_latest.json` |
| `check_qk_norm_backward_parity_v7.py` (kernel) | `make v7-qk-norm-backward-parity*` | QK-norm backward parity, ISA matrix sweep (scalar/avx/avx2/avx_vnni), and strict no-fallback mode. | `qk_norm_backward_parity*_latest.json` |
| `check_rms_swiglu_backward_parity_v7.py` (kernel) | `make v7-rms-swiglu-backward-parity` | RMSNorm backward default+strict paths, SwiGLU exact path, and optional fast-path gate. | `rms_swiglu_backward_parity_latest.json` |
| `check_gemm_backward_parity_v7.py` (kernel) | `make v7-kernel-parity-train` | GEMM backward shape-sweep parity for v7 training kernel bindings. | `gemm_backward_shape_sweep_latest.json` |
| `check_optimizer_parity_v7.py` (kernel) | `make v7-kernel-parity-train` | Optimizer kernels: `adamw_update_f32`, `gradient_clip_norm_f32`, `adamw_clip_update_multi_f32`, `gradient_accumulate_f32`. | `optimizer_parity_latest.json` |
| `check_fd_gradients_v7.py` (audit) | `make v7-grad-fd` | Finite-difference sanity against analytical gradients in the tiny training stack. | `fd_gradients_latest.json` |
| `check_replay_determinism_v7.py` (runtime) | `make v7-replay` | Deterministic replay across identical training runs (same losses + parameter deltas). | `replay_determinism_latest.json` |
| `check_runtime_replay_accum_v7.py` (runtime) | `make v7-replay-accum` | Replay-on-check correctness with `grad_accum > 1`, including accumulation snapshot restoration. | `replay_accum_latest.json` |
| `train_parity_epochs_v7.py` (drift) | `make v7-train-parity-3/5` and long-horizon targets | Epoch parity, drift smoke, drift localization, and long-horizon CK-vs-PyTorch trajectory checks. | `train_parity_*_latest.json` |
| `check_backprop_plumbing_v7.py` (audit) | `make v7-backprop-plumbing` | Static IR2/layout/manifest audit: grad coverage, writer coverage, per-layer flow, saved-tensor + layout integrity. | `backprop_plumbing_latest.json` |
| `check_backprop_stitch_runtime_v7.py` (runtime) | `make v7-backprop-stitch-runtime*` | Runtime stitch smoke: manifest-dim wiring, first checked parity step cleanliness, and all checked-step consistency. | `backprop_stitch_runtime*_latest.json` |
| `test_bpe_train_parity_v7.py` (runtime) | `make test-v7-bpe-train-parity` | BPE trainer parity: HF tokenizer JSON vs CK binary tokenizer artifacts on deterministic text probes. | `v7_bpe_train_parity_latest.json` |
| `test_svg_overfit_regression_v7.py` (runtime) | `make test-v7-svg-overfit-regression` | CK runtime vs PyTorch reference overfit regression on a deterministic SVG-derived dataset. | `svg_overfit_regression_latest.json` |
| `ck_run_v7.py` (runtime) | `make v7-train-runtime-parity-*` | Generated runtime long-horizon parity checks (stress + realistic) with `--parity-on`. | `train_runtime_parity_*_latest.json` |
Exhaustive Test-Case Matrix (Backprop + Parity)
| Case Family | Exact Command(s) | Pass Condition | Artifact(s) |
|---|---|---|---|
| Single-token parity baseline | `make v7-parity-1tok` | All T=1 kernel checks pass under strict tolerances. | `parity_1token_latest.json` |
| QK-norm backward (default + ISA + strict ISA) | `make v7-qk-norm-backward-parity`<br>`make v7-qk-norm-backward-parity-isa`<br>`make v7-qk-norm-backward-parity-isa-strict` | Parity holds across ISA modes; strict mode fails on any fallback mismatch. | `qk_norm_backward_parity_latest.json`<br>`qk_norm_backward_parity_isa_latest.json`<br>`qk_norm_backward_parity_isa_strict_latest.json` |
| RMSNorm/SwiGLU backward parity | `make v7-rms-swiglu-backward-parity` | RMSNorm default+strict and SwiGLU exact/fast checks all stay within tolerance. | `rms_swiglu_backward_parity_latest.json` |
| Kernel parity bundle | `make v7-kernel-parity-train` | Optimizer + QK-norm ISA + RMS/SwiGLU + GEMM-backward sweeps all pass. | `optimizer_parity_latest.json`<br>`qk_norm_backward_parity_isa*_latest.json`<br>`rms_swiglu_backward_parity_latest.json`<br>`gemm_backward_shape_sweep_latest.json` |
| Finite-difference gradients | `make v7-grad-fd` | Central-difference numerical grads agree with analytical grads. | `fd_gradients_latest.json` |
| Deterministic replay | `make v7-replay` | Two identical runs produce identical replay outputs. | `replay_determinism_latest.json` |
| Replay with grad-accum snapshots | `make v7-replay-accum` | Replay-on-check is enabled and accumulation state restores within tolerance. | `replay_accum_latest.json` |
| Backprop plumbing static audit | `make v7-backprop-plumbing` | IR2 graph/dataflow, grad writers, saved tensors, layout, and manifest wiring all pass audit checks. | `backprop_plumbing_latest.json` |
| Runtime stitch smoke (step-1) | `make v7-backprop-stitch-runtime`<br>`make v7-backprop-stitch-runtime-accum` | Manifest dims override conflicting requests; first checked parity step is clean; checked steps stay clean. | `backprop_stitch_runtime_latest.json`<br>`backprop_stitch_runtime_accum_latest.json` |
| Epoch parity sweeps | `make v7-train-parity-3`<br>`make v7-train-parity-5` | Multi-epoch CK-vs-PyTorch parity stays within configured tolerances. | `train_parity_epochs_3_latest.json`<br>`train_parity_epochs_5_latest.json` |
| Drift smoke and localization | `make v7-train-parity-drift-smoke`<br>`make v7-train-parity-drift-localize` | Smoke run reports bounded drift; localization emits same-state stage diagnostics near the target step. | `train_parity_drift_smoke_latest.json`<br>`train_parity_drift_localize_latest.json` |
| Long-horizon parity (stress + realistic) | `make v7-train-parity-long-horizon`<br>`make v7-train-parity-long-horizon-realistic` | Long-run loss and parameter parity stay within blocker thresholds. | `train_parity_long_horizon_latest.json`<br>`train_parity_realistic_long_horizon_latest.json` |
| Generated-runtime long-horizon parity | `make v7-train-runtime-parity-stress`<br>`make v7-train-runtime-parity-realistic`<br>`make v7-train-runtime-parity-long-horizon` | Generated C runtime with `--parity-on` stays aligned with the oracle over stress and realistic prompts. | `train_runtime_parity_stress_latest.json`<br>`train_runtime_parity_realistic_latest.json` |
| BPE train parity gate | `make test-v7-bpe-train-parity` | Tokenizer IDs match between HF tokenizer JSON and the CK true-BPE binary path. | `v7_bpe_train_parity_latest.json` |
| SVG overfit regression gate | `make test-v7-svg-overfit-regression` | CK and PyTorch overfit curves satisfy configured first/min/final loss thresholds. | `svg_overfit_regression_latest.json` |
| Full train gate bundle | `make v7-gate-train` | Contract checks, IR/layout/codegen smoke, parity kernels, FD, replay, and enabled optional gates all pass. | `contract_report_latest.json` + parity/runtime gate artifacts above |
| Production-ready nightly bundle | `make v7-backprop-production-ready` | `v7-gate-train` passes with production safety, then the realistic long-horizon blocker + stress monitor run. | `train_parity_realistic_long_horizon_latest.json` + gate artifacts |
PR4.5 Throughput Track
- Add explicit threaded GEMM dispatch for v7 training runtime (not serial-only blocked calls).
- Compile runtime with consistent parallel flags and deterministic thread control.
- Prepack/plan GEMM weight access to reduce cache misses.
- Reduce backward buffer traffic hotspots (`grad_accumulate`, `memset`, `memmove`).
- Keep strict parity tolerances unchanged while optimizing.
Performance Workflow (After Parity Is Green)
- Run a CK baseline without the oracle: `--backend ck --parity-on=false`.
- Run reference harness timing: `--backend both` (captures CK vs torch step metrics).
- Use `make profile-v7-full` or `make v7-perf-gate` for perf + flamegraph artifacts (see v7-profiling.html).
- Use VTune for microarchitecture + memory bottlenecks (`hotspots`, `uarch-exploration`, `memory-access`), and Advisor for roofline analysis.
- Use ASan for bounds/UB; use Valgrind when you need slower but deeper leak/use-after-free analysis.
Threading Sanity
Before optimization work, confirm CK runtime is saturating cores and not running in an accidental single-thread path. Use train step timing plus profiler traces to validate threadpool/OpenMP utilization, then optimize kernel hotspots (usually GEMM backward and attention backward first).
How You Know It Is Actually Training
- `train_e2e_latest.json` shows a loss curve with a decreasing trend
- `training_loss_curve.json` and `training_grad_norms.json` are populated
- `training_parity.json` reports PASS for oracle checkpoints (when enabled)
- `layout_train_audit.json` and memory diagnostics show no bounds/overlap/canary failures
- The generated runtime summary includes the expected forward/backward/optimizer kernel calls
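These signals can be checked mechanically. A sketch of such a health check follows; the JSON shape (a `loss` field per curve point) is an assumption, so adjust field names to the actual artifact schema:

```python
import json
import pathlib

def loss_is_decreasing(losses, tail=5):
    """Crude trend check: mean of the last few losses is below the first few."""
    if len(losses) < 2 * tail:
        return losses[-1] < losses[0]
    head = sum(losses[:tail]) / tail
    recent = sum(losses[-tail:]) / tail
    return recent < head

def check_run(run_dir):
    """Assert the basic 'it is actually training' signals for a run directory."""
    run = pathlib.Path(run_dir)
    curve = json.loads((run / "training_loss_curve.json").read_text())
    losses = [point["loss"] for point in curve]   # assumed field name
    assert losses, "empty loss curve"
    assert loss_is_decreasing(losses), "loss is not trending down"
    return True

print(loss_is_decreasing([4.0, 3.5, 3.1, 2.8, 2.2]))  # True
```

A trend check on a tail window is deliberately crude: it tolerates step-to-step noise while still catching a flat or diverging run early.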
Debug Fast: First-Divergence and Memory Safety
| Symptom | First Artifact | Likely Root |
|---|---|---|
| Parity drift appears after a few steps | `drift_report.json` | Specific op mismatch, accumulation edge, or dimension binding issue |
| Crash or invalid pointer in CK backend | `memory_diagnostic_latest.json` | Out-of-bounds write, incorrect slot size/offset, or bad kernel args |
| Layout audit fails | `layout_train_audit.json` | Overlap, section bounds error, alignment or ownership mismatch |
| Generated runtime compiles but output is unstable | `generated_train_runtime_summary_v7.json` | Wrong kernel binding or call-order mismatch from upstream IR |
What Changes for Backprop vs Inference
- Inference complexity is dominated by quantized weight paths and decode-time KV behavior.
- Backprop complexity is dominated by memory pressure and graph bookkeeping: saved activations, gradient routing, accumulation, optimizer state.
- Training usually runs float paths (fp32/bf16 policy) rather than quantized weight formats.
- Keeping training pipeline partially separated while contracts stabilize is a pragmatic step, not architectural drift.
Artifact Checklist in a Healthy Run Directory
```
/tmp/v7_exp1/
  weights.bump
  weights_manifest.json
  ir1_train_forward.json
  ir2_train_backward.json
  ir_train_invariants.json
  layout_train.json
  layout_train_audit.json
  generated_train_runtime_v7.c
  generated_train_runtime_summary_v7.json
  libtrain.so
  train_e2e_latest.json
  training_loss_curve.json
  training_grad_norms.json
  training_parity.json
  memory_diagnostic_latest.json
  ir_report.html
```
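A short presence check can automate this checklist (plain Python; the `EXPECTED` list here is a subset of the files above and can be extended for runs with extra gates enabled):

```python
import pathlib

# Subset of the healthy-run checklist; extend as needed for your run profile.
EXPECTED = [
    "weights.bump", "weights_manifest.json",
    "ir1_train_forward.json", "ir2_train_backward.json",
    "layout_train.json", "layout_train_audit.json",
    "generated_train_runtime_v7.c", "libtrain.so",
    "train_e2e_latest.json", "ir_report.html",
]

def missing_artifacts(run_dir):
    """Return the expected artifact names that are absent from run_dir."""
    run = pathlib.Path(run_dir)
    return [name for name in EXPECTED if not (run / name).exists()]

# Usage: an empty list means the run directory is fully stitched.
missing = missing_artifacts("/tmp/v7_exp1")
print(missing or "all artifacts present")
```

Wiring this into CI before the parity gates gives a fast, unambiguous failure mode for broken generation steps instead of a confusing downstream error.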
Use this page as the operator baseline: if these artifacts line up and the gates pass, your v7 backprop pipeline is stitched correctly.