v7 Backprop IR Pipeline (Operator Guide)

This page explains how v7 training works end to end, with a focus on what is generated from IR, what is layout/codegen output, where logic lives, and how to debug correctness quickly.

One important clarification
There is no ck_run_v7.c. The operator entrypoint is version/v7/scripts/ck_run_v7.py (wrapped by version/v7/scripts/cks-v7-run), which orchestrates generation and execution of the generated C runtimes.
v7 training flow from init to IR1 to IR2 to layout and codegen

What Runs in C vs Python

  • Orchestration and run management (Python): builds artifacts, launches gates, collects reports, drives strict checks. Commands: cks-v7-run init|train|sanity|parity|profile
  • Forward and backward kernel math (C kernels): deterministic numeric core, SIMD paths, parity-tested primitives. Commands: make test, make v7-gate-train
  • Generated training runtime (C, generated): executes the IR-lowered plan and memory offsets decided upstream; the emitter itself stays dumb. Artifacts: generated_train_runtime_v7.c, libtrain.so
  • Oracle parity and drift localization (Python + PyTorch): reference implementation for step-level and slot-level checks. Flags: --backend both, --parity-on
  • IR report rendering (Python + HTML/JS): loads run artifacts and presents explainable diagnostics. Command: open_ir_visualizer.py --generate --run ...

Mental Model: What Is Smart and What Is Dumb

Smart Stages

  • Template selection: architecture op sequence and flags
  • IR1: forward op graph with typed tensors and kernel IDs
  • IR2: backward synthesis, fanout accumulation, grad edges
  • Layout generation: contiguous offsets, sections, canary map
  • Validation: invariants, memory audit, parity, drift checks

Dumb Stage

  • Codegen only emits what lowered IR and layout already decided
  • No architecture guessing in emitter
  • No hidden model-family conditionals in runtime glue
  • If behavior is wrong, root cause is usually upstream in IR/lowering/layout
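To make "dumb" concrete, here is a minimal sketch (hypothetical plan schema, not the real emitter) of an emitter that only walks a lowered plan and prints one C call per op, with no model-family branching anywhere:

```python
# Illustrative "dumb" emitter: every decision (kernel name, argument offsets)
# was already made upstream in IR lowering and layout; this loop just transcribes.
def emit_runtime(plan):
    lines = []
    for op in plan:
        args = ", ".join(f"mem + {off}" for off in op["offsets"])
        lines.append(f"{op['kernel']}({args});")  # no architecture conditionals here
    return "\n".join(lines)

plan = [
    {"kernel": "rmsnorm_fwd_f32", "offsets": [0, 4096]},
    {"kernel": "gemm_fwd_f32", "offsets": [4096, 8192, 16384]},
]
print(emit_runtime(plan))
```

If the emitted calls are wrong, the fix belongs in the plan, not in the loop.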

From Init to Train

  1. Initialize run: choose init policy, dimensions, dtype policy, and template.
  2. Emit weights: write weights.bump and weights_manifest.json.
  3. Build IR1: forward graph and typed tensor registry.
  4. Lower IR2: backward graph from IR1 + grad rules, with explicit accumulations.
  5. Generate layout: contiguous memory map with sections and canary ranges.
  6. Codegen runtime: emit train runtime C and compile to libtrain.so.
  7. Execute train: run CK path, optional PyTorch oracle checks, write reports.

Drift Timeline: Early Schedule Bug to CE Numeric Tightening

Execution order changed in an important way: the generated train runtime now enforces accumulation-window semantics, so optimizer updates happen only at the true boundary of CK_GRAD_ACCUM_STEPS micro-steps.

Before and after schedule for gradient accumulation and optimizer boundary

Current Numeric Status

Schedule Pseudocode (Expected)

if g_accum_step == 0: zero_grad()
forward()
backward()
g_accum_step += 1
if g_accum_step >= CK_GRAD_ACCUM_STEPS:
    scale_grads(1 / g_accum_step)  # avg over accumulation window
    optimizer_step()
    g_accum_step = 0
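The same schedule as a runnable sketch (toy scalar parameter, illustrative names): the optimizer steps exactly once per CK_GRAD_ACCUM_STEPS micro-steps, on window-averaged gradients.

```python
CK_GRAD_ACCUM_STEPS = 4

def train_micro_steps(micro_grads, lr=0.1):
    """Toy scalar parameter; returns (param, number_of_optimizer_steps)."""
    param, grad_sum, accum_step, opt_steps = 0.0, 0.0, 0, 0
    for g in micro_grads:
        if accum_step == 0:
            grad_sum = 0.0              # zero_grad() at window start
        grad_sum += g                   # forward() + backward() contribute g
        accum_step += 1
        if accum_step >= CK_GRAD_ACCUM_STEPS:
            avg = grad_sum / accum_step # scale_grads: average over the window
            param -= lr * avg           # optimizer_step() at the true boundary
            opt_steps += 1
            accum_step = 0
    return param, opt_steps

# 8 micro-steps with K=4 -> exactly 2 optimizer updates
param, steps = train_micro_steps([1.0] * 8)
assert steps == 2
```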

Grad-Accum Math (N vs K)

For an exact explanation of CE token normalization (/N) versus accumulation-window normalization (/K), plus CPU effective-batch examples and variable-length caveats, see v7-grad-accum-windows.html.
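As a tiny numeric illustration of the distinction (the linked page has the full treatment): each micro-batch mean-reduces over its own N tokens, then the window averages over K micro-batches. With equal-length micro-batches this matches the one-big-batch mean; with variable lengths it does not.

```python
# Mean-of-means (/N then /K) equals the global token mean only when all
# micro-batches carry the same token count N.
def window_grad(micro_token_losses):
    per_micro = [sum(t) / len(t) for t in micro_token_losses]  # /N per micro-batch
    return sum(per_micro) / len(per_micro)                     # /K over the window

equal = [[1.0, 3.0], [5.0, 7.0]]            # N=2 each, K=2
assert window_grad(equal) == sum(sum(equal, [])) / 4   # matches global mean

ragged = [[1.0, 3.0, 5.0], [7.0]]           # variable-length caveat
assert window_grad(ragged) != sum(sum(ragged, [])) / 4
```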

Cross-Entropy Semantics

C cross-entropy uses stable log-sum-exp loss math and PyTorch-aligned index-target mean reduction semantics. For the full CE derivation (p - one_hot), ignore-index denominator details, and long-horizon repro data, read v7-cross-entropy-parity.html.
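A minimal pure-Python sketch of those semantics (not the C kernel): log-sum-exp with max subtraction for stability, and the softmax-minus-one-hot gradient.

```python
import math

def ce_loss_and_dlogits(logits, target):
    """Stable CE for one token: loss = lse(logits) - logits[target],
    dlogits = softmax(logits) - one_hot(target)."""
    m = max(logits)                                   # max-subtraction for stability
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    loss = lse - logits[target]
    probs = [math.exp(x - lse) for x in logits]       # softmax via the same lse
    dlogits = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return loss, dlogits

loss, d = ce_loss_and_dlogits([2.0, 1.0, 0.1], target=0)
assert abs(sum(d)) < 1e-12          # p - one_hot sums to zero
assert d[0] < 0 < min(d[1], d[2])   # target grad negative, others positive
```

Batch reduction in the real kernel then mean-reduces these per-token losses over the non-ignored targets.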

Production Readiness Guidance

For v7 training backprop, keep production safety enabled (--max-grad-norm, enforced LR guardrails) and retain long-horizon parity gates as blockers. This now gives a production-grade path for the validated v7 harness profile while preserving reproducibility checks.
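For intuition on what --max-grad-norm enforces, here is a scalar sketch assuming standard clip-by-global-norm semantics (an assumption; consult the optimizer kernel, e.g. gradient_clip_norm_f32, for the authoritative behavior):

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all grads by max_norm/||g|| when the global norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return grads                      # within budget: leave gradients untouched
    scale = max_norm / norm
    return [g * scale for g in grads]

g = clip_by_global_norm([3.0, 4.0], max_norm=1.0)   # ||[3, 4]|| = 5 -> scale 0.2
assert all(abs(a - b) < 1e-12 for a, b in zip(g, [0.6, 0.8]))
```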

Template Flexibility

You can use built-ins (qwen3, qwen2, gemma3) or a custom template file. If a template op has kernel coverage and bindings, it can be stitched automatically.

# Built-in
version/v7/scripts/cks-v7-run init \
  --run /tmp/v7_exp1 \
  --template qwen3 \
  --init xavier_uniform \
  --generate-ir --generate-runtime --strict

# Custom template
version/v7/scripts/cks-v7-run init \
  --run /tmp/v7_exp_custom \
  --template my_arch \
  --template-file /absolute/path/to/my_template.json \
  --init xavier_uniform \
  --generate-ir --generate-runtime --strict

IR1, IR2, Layout, Codegen Responsibilities

  • IR1 (train-forward). Input: template + manifest + kernel registry + grad rules. Output: ir1_train_forward.json. Builds the forward graph, classifies weight vs activation tensors, attaches the save-for-backward contract.
  • IR2 (backward). Input: IR1 + grad rules + bindings. Output: ir2_train_backward.json. Synthesizes backward ops, adds gradient fanout accumulation, preserves the producer-consumer chain.
  • Layout + Audit (memory-lowered). Input: IR2 + manifest. Outputs: layout_train.json, layout_train_audit.json. Finalizes contiguous offsets, section ownership, canary layout, and bounds/overlap validation.
  • Codegen. Input: IR2 + layout. Outputs: generated_train_runtime_v7.c, generated_train_runtime_summary_v7.json. Emits deterministic C calls and memory references only.

Why Train Builder Is Separate From Inference Builder (For Now)

Today, inference uses build_ir_v7.py, while training uses build_ir_train_v7.py + lower_ir2_backward_v7.py. Both share IR types and kernel-registry contracts.

How Residual, GQA, and Splits Stitch Backward

The forward graph can branch. IR2 makes the reverse merge explicit by emitting accumulation ops where gradients meet.

Forward residual and GQA split with backward gradient accumulation routing

Key Rule

If one forward tensor feeds multiple consumers, IR2 inserts explicit gradient accumulation in backward. This is where residual and attention branch merges stay correct.
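A scalar sketch of the rule: y feeds two consumers, so dL/dy must be the explicit sum of both branch gradients. Dropping either accumulation term silently corrupts residual and attention merges.

```python
# Forward: y = 2*x; branch a = 3*y; branch b = y*y; L = a + b.
# Fanout rule: dL/dy = dL/da * da/dy + dL/db * db/dy (explicit accumulation).
def backward(x):
    y = 2.0 * x
    grad_y = 0.0
    grad_y += 3.0        # contribution from branch a = 3*y
    grad_y += 2.0 * y    # contribution from branch b = y*y
    return grad_y * 2.0  # chain back through y = 2*x

x = 1.5
# analytic check: L(x) = 6x + 4x^2, so dL/dx = 6 + 8x
assert abs(backward(x) - (6 + 8 * x)) < 1e-9
```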

Memory Layout and Canary Diagnostics

Training layout is contiguous and sectioned. Canary guards and runtime checks detect out-of-bounds writes and readonly violations.

Contiguous train memory layout with canary guards and diagnostics phases
  • weights (weight.layer.*): written in the optimizer step (not forward)
  • activations + saved (act.*, saved.*): written in forward
  • grad_activations (grad.act.*): written in backward
  • grad_weights (grad.weight.*, tmp.grad.weight.*): written in backward
  • optimizer state (AdamW m, v): written in the optimizer step
  • temporaries + aux (scratch, loss buffers, diagnostics): written in multiple phases
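A minimal sketch of the canary idea (guard-range fields here are hypothetical, not the real layout schema): known sentinel bytes are placed between sections at init, and a post-phase sweep flags any guard range whose sentinel was overwritten.

```python
CANARY = 0x7F  # sentinel byte written into every guard range at init

def check_canaries(mem, guard_ranges):
    """Return the names of guard ranges whose sentinel bytes were clobbered."""
    violations = []
    for name, start, end in guard_ranges:
        if any(mem[i] != CANARY for i in range(start, end)):
            violations.append(name)
    return violations

mem = bytearray([CANARY] * 16)
guards = [("weights/activations", 4, 8), ("activations/grads", 12, 16)]
assert check_canaries(mem, guards) == []
mem[13] = 0  # simulate an out-of-bounds write past the activations section
assert check_canaries(mem, guards) == ["activations/grads"]
```

Because sections are contiguous, the first tripped guard localizes which section's writer overran its bounds.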

Runbook: End-to-End Commands

1) Initialize and generate IR/runtime

version/v7/scripts/cks-v7-run init \
  --run /tmp/v7_exp1 \
  --template qwen3 \
  --init xavier_uniform \
  --layers 2 --embed-dim 128 --hidden-dim 256 \
  --num-heads 8 --num-kv-heads 4 \
  --generate-ir --generate-runtime --strict

2) CK-only training (generated C runtime path)

version/v7/scripts/cks-v7-run train \
  --run /tmp/v7_exp1 \
  --backend ck \
  --prompt "hello" \
  --train-epochs 3 --train-seq-len 16 --train-total-tokens 1024 --train-grad-accum 8 \
  --train-strict

3) CK + PyTorch oracle parity

version/v7/scripts/cks-v7-run train \
  --run /tmp/v7_exp1 \
  --backend both \
  --prompt "hello" \
  --train-epochs 1 --train-seq-len 16 --train-total-tokens 1024 --train-grad-accum 8 \
  --parity-on --parity-profile balanced --dump-on-drift --drift-topk 8

4) Generate visual report from run directory

python3 version/v7/tools/open_ir_visualizer.py \
  --generate --run /tmp/v7_exp1 --html-only

Open /tmp/v7_exp1/ir_report.html.

New: One-Command Data -> Tokenizer -> Train Pipeline

For operator workflows, use the pipeline wrapper to avoid manual wiring errors such as missing data files or token/vocab mismatches.

Byte tokenizer pipeline (with torch reference)

make v7-train-data-pipeline \
  RUN=/tmp/v7_pipeline_byte \
  V7_PIPELINE_TOKENIZER=byte \
  V7_PIPELINE_WITH_TORCH=1

BPE pipeline

make v7-train-data-pipeline \
  RUN=/tmp/v7_pipeline_bpe \
  V7_PIPELINE_TOKENIZER=bpe \
  V7_PIPELINE_WITH_TORCH=0
  • V7_PIPELINE_TOKENIZER: selects the tokenization path. Typical values: byte, bpe
  • V7_PIPELINE_EPOCHS, V7_PIPELINE_SEQ_LEN, V7_PIPELINE_TOTAL_TOKENS: train budget. Typical values: 10, 32, 1024
  • V7_PIPELINE_GRAD_ACCUM, V7_PIPELINE_LR: update scale. Typical values: 1..8, 5e-4
  • V7_PIPELINE_OPEN_VIS: generates the visualizer report automatically. Typical values: 0, 1
Safety guard now built in
Token-file training now fails fast if any token ID exceeds the run-dir vocab size. This prevents NaN runs caused by a silent token/vocab mismatch.
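The guard amounts to a pre-flight check like the following sketch (the real check lives in the pipeline wrapper; the function name here is illustrative):

```python
def validate_token_file(token_ids, vocab_size):
    """Fail fast before training if any token ID is outside the run-dir vocab."""
    bad = [t for t in token_ids if not (0 <= t < vocab_size)]
    if bad:
        raise ValueError(
            f"{len(bad)} token IDs exceed vocab_size={vocab_size} "
            f"(first offender: {bad[0]}); re-tokenize or fix the run dir."
        )

validate_token_file([0, 5, 255], vocab_size=256)   # in range: passes silently
try:
    validate_token_file([0, 300], vocab_size=256)  # mismatch: raises before training
except ValueError:
    pass
```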

New: Runtime Stitch Graph in IR Visualizer

The Training tab now includes a function-level runtime graph fed by train_exec_plan.json, with explicit forward, backward, accumulation, and optimizer stages.

How to inspect stitching

  1. Generate report from run-dir: python3 version/v7/tools/open_ir_visualizer.py --generate --run /tmp/v7_pipeline_byte --html-only
  2. Open /tmp/v7_pipeline_byte/ir_report.html
  3. Go to Training tab
  4. Use Runtime Stitch section filters: Stage=forward|backward|accum|optimizer and Layer=all|Lx
  5. Click a node to inspect dispatch/reduction metadata and phase stitching details

Use this graph when the first loss is correct but later updates diverge. It makes boundary mistakes visible at the function level without opening the generated C manually.

Verification Matrix: Epoch 1 to 10 + Checkpoints + Oracle

Use .venv/bin/python for the strict PyTorch snapshot oracle.
If you use the system python3 without torch installed, training falls back to tiny reference telemetry instead of strict slot-level snapshot checks.
# 0) init
.venv/bin/python version/v7/scripts/ck_run_v7.py init \
  --run /tmp/v7_oracle_10ep \
  --init xavier_uniform \
  --layers 2 --vocab-size 256 --embed-dim 128 --hidden-dim 256 \
  --num-heads 8 --num-kv-heads 4 --context-len 128 \
  --generate-ir --generate-runtime --strict

# 1) epoch-1 strict oracle (every step) + replay + periodic checkpoints
.venv/bin/python version/v7/scripts/ck_run_v7.py train \
  --run /tmp/v7_oracle_10ep \
  --backend ck \
  --train-epochs 1 --train-seq-len 8 --train-total-tokens 64 --train-grad-accum 2 \
  --train-vocab 256 --train-d-model 64 --train-hidden 128 \
  --parity-on --oracle pytorch --parity-profile debug --parity-every 1 \
  --parity-replay-on-check --train-save-every 4

# 2) epoch-10 sweep strict oracle + replay + checkpoint cadence
.venv/bin/python version/v7/scripts/ck_run_v7.py train \
  --run /tmp/v7_oracle_10ep \
  --backend ck \
  --train-epochs 10 --train-seq-len 8 --train-total-tokens 64 --train-grad-accum 2 \
  --train-vocab 256 --train-d-model 64 --train-hidden 128 \
  --parity-on --oracle pytorch --parity-profile balanced --parity-every 1 \
  --parity-replay-on-check --train-save-every 20

# 3) memory verification suite (canary toggle + fault injection + ASan agreement)
.venv/bin/python version/v7/scripts/ck_run_v7.py train \
  --run /tmp/v7_oracle_10ep \
  --backend ck \
  --train-epochs 1 --train-seq-len 8 --train-total-tokens 64 --train-grad-accum 2 \
  --train-vocab 256 --train-d-model 64 --train-hidden 128 \
  --train-verify-memory --train-verify-steps 4
  • train summary: pass_parity: true plus strict oracle source and replay checks, at /tmp/v7_oracle_10ep/train_e2e_latest.json
  • checkpoints: weights_step_*.bump + per-step manifest, at /tmp/v7_oracle_10ep/checkpoints/
  • memory verification: ok: true (toggle diff, intentional +1 catch, ASan agreement, bounds), at /tmp/v7_oracle_10ep/memory_verification_latest.json
  • viewer report: training tabs populated from run-dir artifacts, at /tmp/v7_oracle_10ep/ir_report.html

Backprop + PyTorch Parity Test Suite

This is the concrete v7 inventory used to keep backprop parity green. By default, artifacts write to version/v7/.cache/reports/ (override with V7_REPORT_DIR).

Canonical Regression Ledger

Use this as the single source of truth for historical failures and fixes so new model architectures do not repeat old bugs:

  • run_parity_1token_v7.py (kernel; make v7-parity-1tok): T=1 parity for RMSNorm (fwd/bwd), SwiGLU (fwd/bwd), CE (loss + dlogits), and the QK-norm backward hook. Artifact: parity_1token_latest.json
  • check_qk_norm_backward_parity_v7.py (kernel; make v7-qk-norm-backward-parity*): QK-norm backward parity, ISA matrix sweep (scalar/avx/avx2/avx_vnni), and strict no-fallback mode. Artifact: qk_norm_backward_parity*_latest.json
  • check_rms_swiglu_backward_parity_v7.py (kernel; make v7-rms-swiglu-backward-parity): RMSNorm backward default+strict paths, SwiGLU exact path, and optional fast-path gate. Artifact: rms_swiglu_backward_parity_latest.json
  • check_gemm_backward_parity_v7.py (kernel; make v7-kernel-parity-train): GEMM backward shape-sweep parity for v7 training kernel bindings. Artifact: gemm_backward_shape_sweep_latest.json
  • check_optimizer_parity_v7.py (kernel; make v7-kernel-parity-train): optimizer kernels adamw_update_f32, gradient_clip_norm_f32, adamw_clip_update_multi_f32, gradient_accumulate_f32. Artifact: optimizer_parity_latest.json
  • check_fd_gradients_v7.py (audit; make v7-grad-fd): finite-difference sanity against analytical gradients in the tiny training stack. Artifact: fd_gradients_latest.json
  • check_replay_determinism_v7.py (runtime; make v7-replay): deterministic replay across identical training runs (same losses + parameter deltas). Artifact: replay_determinism_latest.json
  • check_runtime_replay_accum_v7.py (runtime; make v7-replay-accum): replay-on-check correctness with grad_accum > 1, including accumulation snapshot restoration. Artifact: replay_accum_latest.json
  • train_parity_epochs_v7.py (drift; make v7-train-parity-3/5 and long-horizon targets): epoch parity, drift smoke, drift localization, and long-horizon CK-vs-PyTorch trajectory checks. Artifact: train_parity_*_latest.json
  • check_backprop_plumbing_v7.py (audit; make v7-backprop-plumbing): static IR2/layout/manifest audit covering grad coverage, writer coverage, per-layer flow, and saved-tensor + layout integrity. Artifact: backprop_plumbing_latest.json
  • check_backprop_stitch_runtime_v7.py (runtime; make v7-backprop-stitch-runtime*): runtime stitch smoke covering manifest-dim wiring, first checked parity step cleanliness, and all checked-step consistency. Artifact: backprop_stitch_runtime*_latest.json
  • test_bpe_train_parity_v7.py (runtime; make test-v7-bpe-train-parity): BPE trainer parity, HF tokenizer JSON vs CK binary tokenizer artifacts on deterministic text probes. Artifact: v7_bpe_train_parity_latest.json
  • test_svg_overfit_regression_v7.py (runtime; make test-v7-svg-overfit-regression): CK runtime vs PyTorch reference overfit regression on a deterministic SVG-derived dataset. Artifact: svg_overfit_regression_latest.json
  • ck_run_v7.py (runtime; make v7-train-runtime-parity-*): generated runtime long-horizon parity checks (stress + realistic) with --parity-on. Artifact: train_runtime_parity_*_latest.json

Exhaustive Test-Case Matrix (Backprop + Parity)

  • Single-token parity baseline (make v7-parity-1tok): all T=1 kernel checks pass under strict tolerances. Artifact: parity_1token_latest.json
  • QK-norm backward, default + ISA + strict ISA (make v7-qk-norm-backward-parity, make v7-qk-norm-backward-parity-isa, make v7-qk-norm-backward-parity-isa-strict): parity holds across ISA modes; strict mode fails any fallback mismatch. Artifacts: qk_norm_backward_parity_latest.json, qk_norm_backward_parity_isa_latest.json, qk_norm_backward_parity_isa_strict_latest.json
  • RMSNorm/SwiGLU backward parity (make v7-rms-swiglu-backward-parity): RMSNorm default+strict and SwiGLU exact/fast checks all stay within tolerance. Artifact: rms_swiglu_backward_parity_latest.json
  • Kernel parity bundle (make v7-kernel-parity-train): optimizer + QK-norm ISA + RMS/SwiGLU + GEMM-backward sweeps all pass. Artifacts: optimizer_parity_latest.json, qk_norm_backward_parity_isa*_latest.json, rms_swiglu_backward_parity_latest.json, gemm_backward_shape_sweep_latest.json
  • Finite-difference gradients (make v7-grad-fd): central-difference numerical grads agree with analytical grads. Artifact: fd_gradients_latest.json
  • Deterministic replay (make v7-replay): two identical runs produce identical replay outputs. Artifact: replay_determinism_latest.json
  • Replay with grad-accum snapshots (make v7-replay-accum): replay-on-check is enabled and accumulation state restores within tolerance. Artifact: replay_accum_latest.json
  • Backprop plumbing static audit (make v7-backprop-plumbing): IR2 graph/dataflow, grad writers, saved tensors, layout, and manifest wiring all pass audit checks. Artifact: backprop_plumbing_latest.json
  • Runtime stitch smoke, step-1 (make v7-backprop-stitch-runtime, make v7-backprop-stitch-runtime-accum): manifest dims override conflicting requests; the first checked parity step is clean; checked steps stay clean. Artifacts: backprop_stitch_runtime_latest.json, backprop_stitch_runtime_accum_latest.json
  • Epoch parity sweeps (make v7-train-parity-3, make v7-train-parity-5): multi-epoch CK-vs-PyTorch parity stays within configured tolerances. Artifacts: train_parity_epochs_3_latest.json, train_parity_epochs_5_latest.json
  • Drift smoke and localization (make v7-train-parity-drift-smoke, make v7-train-parity-drift-localize): the smoke run reports bounded drift; localization emits same-state stage diagnostics near the target step. Artifacts: train_parity_drift_smoke_latest.json, train_parity_drift_localize_latest.json
  • Long-horizon parity, stress + realistic (make v7-train-parity-long-horizon, make v7-train-parity-long-horizon-realistic): long-run loss and parameter parity stay within blocker thresholds. Artifacts: train_parity_long_horizon_latest.json, train_parity_realistic_long_horizon_latest.json
  • Generated-runtime long-horizon parity (make v7-train-runtime-parity-stress, make v7-train-runtime-parity-realistic, make v7-train-runtime-parity-long-horizon): the generated C runtime with --parity-on stays aligned with the oracle over stress and realistic prompts. Artifacts: train_runtime_parity_stress_latest.json, train_runtime_parity_realistic_latest.json
  • BPE train parity gate (make test-v7-bpe-train-parity): tokenizer IDs match between HF tokenizer JSON and the CK true-BPE binary path. Artifact: v7_bpe_train_parity_latest.json
  • SVG overfit regression gate (make test-v7-svg-overfit-regression): CK and PyTorch overfit curves satisfy configured first/min/final loss thresholds. Artifact: svg_overfit_regression_latest.json
  • Full train gate bundle (make v7-gate-train): contract checks, IR/layout/codegen smoke, parity kernels, FD, replay, and enabled optional gates all pass. Artifacts: contract_report_latest.json + the parity/runtime gate artifacts above
  • Production-ready nightly bundle (make v7-backprop-production-ready): v7-gate-train passes with production safety, then the realistic long-horizon blocker and stress monitor run. Artifacts: train_parity_realistic_long_horizon_latest.json + gate artifacts

PR4.5 Throughput Track

Performance Workflow (After Parity Is Green)

  1. Run CK baseline without oracle: --backend ck --parity-on=false.
  2. Run reference harness timing: --backend both (captures CK vs torch step metrics).
  3. Use make profile-v7-full or make v7-perf-gate for perf + flamegraph artifacts (see v7-profiling.html).
  4. Use VTune for microarchitecture + memory bottlenecks (hotspots, uarch-exploration, memory-access), and Advisor for roofline analysis.
  5. Use ASan for bounds/UB; use Valgrind when you need slower but deeper leak/use-after-free analysis.

Threading Sanity

Before optimization work, confirm CK runtime is saturating cores and not running in an accidental single-thread path. Use train step timing plus profiler traces to validate threadpool/OpenMP utilization, then optimize kernel hotspots (usually GEMM backward and attention backward first).

How You Know It Is Actually Training

Debug Fast: First-Divergence and Memory Safety

  • Parity drift appears after a few steps: check drift_report.json first; likely root is a specific op mismatch, an accumulation edge, or a dimension binding issue.
  • Crash or invalid pointer in the CK backend: check memory_diagnostic_latest.json first; likely root is an out-of-bounds write, an incorrect slot size/offset, or bad kernel args.
  • Layout audit fails: check layout_train_audit.json first; likely root is an overlap, a section bounds error, or an alignment/ownership mismatch.
  • Generated runtime compiles but output is unstable: check generated_train_runtime_summary_v7.json first; likely root is a wrong kernel binding or a call-order mismatch from upstream IR.

What Changes for Backprop vs Inference

Artifact Checklist in a Healthy Run Directory

/tmp/v7_exp1/
  weights.bump
  weights_manifest.json
  ir1_train_forward.json
  ir2_train_backward.json
  ir_train_invariants.json
  layout_train.json
  layout_train_audit.json
  generated_train_runtime_v7.c
  generated_train_runtime_summary_v7.json
  libtrain.so
  train_e2e_latest.json
  training_loss_curve.json
  training_grad_norms.json
  training_parity.json
  memory_diagnostic_latest.json
  ir_report.html

Use this page as the operator baseline: if these artifacts line up and the gates pass, your v7 backprop pipeline is stitched correctly.
