
Deep Training Intuition Playbook

A practical roadmap for moving from pass/fail parity gates to gradient-level intuition: what to log, what to inspect, what to adjust, and how to learn from recurring failure patterns.

Companion Guide

For full operator details on v7 init, IR1/IR2 lowering, memory layout/canary diagnostics, and CK-vs-oracle parity commands, open v7-backprop-ir.html.

Infographic Snapshot

Deep training intuition infographic: visibility stack, checkpoint strategy, curriculum, and triage loops
Core idea: keep full model checkpoints rare for resume, but write dense analysis checkpoints (statistics + sampled grids) for insight.

Visibility Stack

Loss Curves Surface behavior
Per-Param Grad Norms Tensor-level magnitude
IR2 Backward Graph Op-by-op gradient path
Generated C + Kernel ID Exact compute path
SIMD + Memory Layout Hardware-level truth

This stack is your advantage: anomalies can be traced from metric -> tensor -> op -> kernel -> runtime.

Correctness vs Intuition

Already Strong (Correctness)

  • Parity and deterministic replay gates
  • Contract checks and kernel coverage
  • Single-run scalar dashboards
  • IR and codegen traceability

Next Layer (Intuition)

  • Activation histograms per layer over time
  • Weight and gradient heatmaps (spatial)
  • Gradient flow waterfall (layer boundary view)
  • Attention entropy + sampled QK pattern evolution
  • Run-to-run diff views, not only single-run views

Checkpoint Strategy That Fits Real Disk

  • Raw full dump (weights + grads + opt + activations): ~5.7 GB per checkpoint
  • Analysis checkpoint (stats + sampled grids): ~200 KB per checkpoint
  • Rare resume checkpoint (weights + optimizer state): ~3 GB each; keep a small rolling window

Size values are planning estimates for a Qwen3-0.6B-like setup and vary with precision, sequence length, optimizer state, and which tensors you retain.

Logarithmic Cadence + Event Triggers

  • Steps 1-100: analysis checkpoint every 10 steps
  • Steps 100-1000: every 50 steps
  • Steps 1000-10000: every 200 steps
  • Steps 10000+: every 1000 steps

Event triggers (write an extra analysis checkpoint immediately when any fires):

  • Loss spike > 2x moving average
  • Global grad norm threshold exceeded
  • Dead-parameter alert (near-zero grad)
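
The cadence and trigger rules above are simple to encode. This is an illustrative sketch only; the function names (`should_write_analysis_ckpt`, `grad_norm_limit`) and thresholds are assumptions, not v7 APIs.

```python
def cadence_interval(step: int) -> int:
    """Analysis-checkpoint interval for the logarithmic schedule above."""
    if step < 100:
        return 10
    if step < 1000:
        return 50
    if step < 10000:
        return 200
    return 1000

def should_write_analysis_ckpt(step, loss, loss_moving_avg,
                               grad_norm, grad_norm_limit,
                               min_param_grad_norm, dead_eps=1e-10):
    """True on the logarithmic schedule or when any event trigger fires."""
    if step % cadence_interval(step) == 0:
        return True
    if loss > 2.0 * loss_moving_avg:       # loss spike > 2x moving average
        return True
    if grad_norm > grad_norm_limit:        # global grad norm threshold
        return True
    if min_param_grad_norm < dead_eps:     # dead-parameter alert
        return True
    return False
```

A run at step 55 with healthy metrics writes nothing; the same step with a loss spike writes an extra checkpoint regardless of cadence.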

Analysis Checkpoint Data Contract

{
  "schema_version": "ck.analysis.v1",
  "model_scale_hint": "qwen3_0.6b_like",
  "step": 500,
  "loss": 2.341,
  "weights": {
    "layer.0.wq": {
      "mean": 0.0012,
      "std": 0.045,
      "min": -0.23,
      "max": 0.19,
      "percentiles": [0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99],
      "histogram": { "bins": [...], "counts": [...] },
      "sampled_grid_32x32": [[...], ...]
    }
  },
  "gradients": { "...": { "norm": 0.045, "histogram": {...} } },
  "activations": { "...": { "mean": 0.12, "std": 0.45, "sparsity": 0.32 } },
  "attention": {
    "layer.0": {
      "entropy_per_head": [...],
      "max_attn_per_head": [...],
      "sampled_qk_grid_32x32_per_head": [...]
    }
  }
}

Keep this contract versioned. Dashboards should reject incompatible schema versions early rather than silently misrendering.
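
One way to produce a single `weights` entry of this shape is sketched below with NumPy. This is a hedged illustration: `tensor_stats`, the bin count, and the stride-sampling scheme are assumptions, not the actual v7 writer.

```python
import numpy as np

def tensor_stats(w: np.ndarray, grid: int = 32, bins: int = 16) -> dict:
    """Summarize one weight tensor into an analysis-checkpoint entry."""
    counts, edges = np.histogram(w, bins=bins)
    flat = w.reshape(w.shape[0], -1) if w.ndim > 1 else w.reshape(1, -1)
    # Stride-sample a fixed grid so the payload stays ~KB regardless of size.
    r = np.linspace(0, flat.shape[0] - 1, min(grid, flat.shape[0])).astype(int)
    c = np.linspace(0, flat.shape[1] - 1, min(grid, flat.shape[1])).astype(int)
    return {
        "mean": float(w.mean()),
        "std": float(w.std()),
        "min": float(w.min()),
        "max": float(w.max()),
        "percentiles": [float(np.percentile(w, p))
                        for p in (1, 5, 25, 50, 75, 95, 99)],
        "histogram": {"bins": edges.tolist(), "counts": counts.tolist()},
        f"sampled_grid_{grid}x{grid}": flat[np.ix_(r, c)].tolist(),
    }
```

Everything is converted to plain Python floats/lists so the result serializes directly with `json.dumps`.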

12-Experiment Learning Curriculum

Phase 1: Learn Healthy Baselines

  1. Run a stable baseline and memorize normal gradients.
  2. Vary init scale (tiny to huge) and inspect layerwise drift.

Phase 2: Break Gradient Flow On Purpose

  1. Sweep LR too high and too low.
  2. Zero one projection path and observe blocked upstream grads.
  3. Scale weights 5x, then recover with clipping.

Phase 3: Understand Optimizer Dynamics

  1. Compare SGD vs Adam using weight-delta statistics.
  2. Add warmup and inspect early-step stability.

Phase 4: Read Attention Behavior

  1. Track head entropy over time.
  2. Visualize sampled QK heatmaps and head redundancy.

Phase 5: Scale To Real Models

  1. Build layer gradient waterfalls on Qwen-scale runs.
  2. Rank params by relative movement from step 0.
  3. Perturb QK-norm and RoPE to identify load-bearing paths.

Execution Rule: One Change, Full Observability

  1. Change one variable per run.
  2. Compare against baseline at fixed steps.
  3. Record diagnosis as a repeatable failure pattern.

Gradient Triage Cheat Sheet

| What You See | Likely Cause | First Knob To Adjust | Where To Inspect |
|---|---|---|---|
| Layer waterfall pinches mid-depth | Vanishing path or norm bottleneck | Init scale, residual path, norm params | activation_stats + per-layer grad norms |
| Loss spikes after smooth phase | LR too high for local curvature | LR decay or longer warmup | Weight delta magnitudes vs grad norms |
| Frequent clipping | Single tensor dominates global norm | Find offending tensor; adjust LR/regularization | Per-param gradient histogram outliers |
| Attention entropy collapses too early | Saturated logits / unstable attention scale | Check QK-norm path, init, LR | Head entropy timeline + sampled QK grids |
| Some params barely move across many steps | Dead or weakly-coupled path | Inspect upstream gradients and mask logic | Relative movement ranking + grad flow graph |
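
For the "frequent clipping" row, a minimal way to find the offending tensor is to check each parameter's share of the squared global gradient norm. `grad_norm_report` and the 0.5 dominance threshold are illustrative choices, not part of the toolchain.

```python
import math

def grad_norm_report(named_grad_norms: dict, dominance: float = 0.5):
    """Return (global_norm, offenders): an offender contributes more than
    `dominance` of the squared global gradient norm."""
    total_sq = sum(n * n for n in named_grad_norms.values())
    global_norm = math.sqrt(total_sq)
    offenders = [name for name, n in named_grad_norms.items()
                 if total_sq > 0 and (n * n) / total_sq > dominance]
    return global_norm, offenders
```

If clipping fires every step and this report names the same tensor each time, tune that tensor's LR or regularization rather than the global clip threshold.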

Implementation Order (Minimal, High Impact)

Step 1-3: Stats Instrumentation

  • Buffer stats for weights, grads, activations
  • Attention entropy and sparsity per head
  • Weight-delta stats after optimizer step
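
The per-head attention entropy in the list above can be computed directly from attention probabilities. This NumPy sketch assumes a `[heads, query, key]` layout (an assumption about the buffer shape) and reports the mean row entropy per head in nats.

```python
import numpy as np

def head_entropy(attn: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """attn: [heads, query, key] probabilities (each query row sums to 1).
    Returns mean row entropy per head, in nats."""
    p = np.clip(attn, eps, 1.0)               # guard log(0)
    ent = -(p * np.log(p)).sum(axis=-1)       # entropy of each query row
    return ent.mean(axis=-1)                  # average over queries per head
```

Uniform attention over K keys gives ln(K) per head; a head locked onto a single key gives near-zero entropy, which is the early-collapse signal in the triage table.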

Step 4-6: Visual Layers

  • Weight and gradient heatmaps (sampled grid)
  • Gradient waterfall and run-diff overlays
  • Attention inspector with entropy timeline

train loop -> write analysis_checkpoint_*.json
          -> open_ir_visualizer.py embeds data
          -> ir_report training tabs render:
             [gradient flow] [weights + activations] [attention]

Run This Now (v7)

1) Generate deterministic training reports

Creates core parity/replay artifacts in version/v7/reports/.

make v7-train-parity-3
make v7-replay

2) Generate visual report HTML

Build an IR report from your latest artifacts without rerunning probes.

python3 version/v7/tools/open_ir_visualizer.py --generate --html-only

3) Iterate with one change at a time

Change exactly one knob (LR, init scale, clipping, warmup), rerun, and compare snapshots at fixed steps.

artifacts: version/v7/reports/*.json
report:    version/v7/tools/ir_visualizer.html (or generated ir_report*.html)

Full Diagnostic Matrix (Phases 1-7)

This section is meant to be operational. Keep it open during training and use it as a live checklist.

Phase 1: Is It Even Working?
| Question | What to look at | Good | Bad |
|---|---|---|---|
| Is it training at all? | Loss curve, first 10 steps | Loss drops from random baseline (for 151K vocab, ln(vocab_size) is about 11.9) | Loss flat or NaN on step 1 |
| Are gradients flowing to every layer? | Gradient waterfall (layer 0 -> layer N) | All layers nonzero, typically within 10x range | Early layers near 1e-12 while late layers are around 1e-1 |
| Is any parameter dead? | Gradient Health tab, sort by norm | All trainable params above ~1e-7 | Repeated exact 0.0 or near-zero (<1e-10) norms |
| Is implementation correct? | Parity tracker (CK vs PyTorch) | loss_diff < 1e-6, param_diff < 1e-5 | Diverges after a few steps |
| Are weights updating? | Weight delta statistics, ‖w_new - w_old‖_2 | Nonzero deltas for expected trainable tensors | Near-zero deltas despite nonzero gradients |
| Is it deterministic? | Replay determinism gate | Two identical runs produce identical losses/metrics | Any unexplained drift indicates race/uninitialized state |
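
The dead-parameter and waterfall rows reduce to two quick checks on per-layer gradient norms. The sketch below is illustrative; the function name and the 1e-10 dead threshold are assumptions taken from the table, not fixed toolchain values.

```python
def phase1_grad_checks(layer_grad_norms, dead_eps=1e-10):
    """layer_grad_norms: per-layer gradient norms, layer 0 first.
    Returns (dead_layers, spread): indices of dead layers, plus the
    max/min ratio over live layers (healthy runs stay within ~10x)."""
    dead = [i for i, n in enumerate(layer_grad_norms) if n < dead_eps]
    live = [n for n in layer_grad_norms if n >= dead_eps]
    spread = (max(live) / min(live)) if live else float("inf")
    return dead, spread
```

A spread of 100x or more between late and early layers is the vanishing-gradient signature from Phase 4.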
Phase 2: Is The Model Learning?
| Question | What to look at | Good | Bad |
|---|---|---|---|
| Is loss decreasing over real horizon? | Loss curve over 100+ steps | Clear downward trend (allowing noise) | Immediate plateau or pure oscillation |
| Is it memorizing training data? | Inference on training samples, training perplexity | Perplexity steadily drops toward 1.0 on repeated data | Perplexity remains high on same seen samples |
| How fast is memorization? | Early epoch loss slope | Steep initial drop | Shallow slope points to LR/capacity/data bottleneck |
| Has train set been mostly absorbed? | Training loss vs data entropy floor | Plateau near expected entropy floor | Flat but still high means stuck optimization |
| Is it learning tokens or patterns? | Attention heatmaps on seen examples | Positional + semantic structure appears | Uniform attention or always position-0 attention |
| Are all layers contributing? | Per-layer movement, ‖w_t - w_0‖_2 / ‖w_0‖_2 | Broad movement across multiple layers | Only small subset moves while others remain frozen |
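
The per-layer movement metric in the last row can be sketched directly. `relative_movement` and `rank_by_movement` are illustrative helper names, not part of the toolchain.

```python
import numpy as np

def relative_movement(w0: np.ndarray, wt: np.ndarray,
                      eps: float = 1e-12) -> float:
    """||w_t - w_0||_2 / ||w_0||_2, guarded against a zero-norm init."""
    return float(np.linalg.norm(wt - w0) / (np.linalg.norm(w0) + eps))

def rank_by_movement(init: dict, now: dict) -> list:
    """init/now: name -> array at step 0 and step t. Most-moved first."""
    moves = {k: relative_movement(init[k], now[k]) for k in init}
    return sorted(moves, key=moves.get, reverse=True)
```

If only embeddings and norm parameters top this ranking while core projections sit near zero, the "frozen subset" failure mode in the table applies.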
Phase 3: What Is It Learning?
| Question | What to look at | Good | Bad |
|---|---|---|---|
| What are attention heads doing? | Per-head heatmaps across layers | Head specialization (positional, semantic, induction-like) | All heads uniform or all heads nearly identical |
| Which heads are redundant? | Head similarity matrix (cosine similarity) | Diversity within a layer | Many heads with similarity above 0.9 |
| Content vs position dependence? | Same tokens at different positions | Attention follows content as well as position | Purely diagonal regardless of content changes |
| Which weights changed most? | Global ranking by relative movement | Core projections and MLP weights move significantly | Only embeddings/norms move while core blocks stay static |
| Are MLP neurons specializing? | MLP hidden activation distributions | Distributions become structured/multimodal | No shape evolution over training |
| Is embedding space organizing? | Embedding heatmap step 0 vs step N | Visible structure/clustering emerges | Matrix remains random-looking with little change |
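
The head-redundancy check is a cosine-similarity matrix over flattened per-head attention maps. A hedged NumPy sketch, again assuming a `[heads, query, key]` layout:

```python
import numpy as np

def head_similarity(attn: np.ndarray) -> np.ndarray:
    """attn: [heads, query, key]. Returns [heads, heads] cosine matrix."""
    flat = attn.reshape(attn.shape[0], -1)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.clip(norms, 1e-12, None)
    return unit @ unit.T

def redundant_pairs(sim: np.ndarray, thresh: float = 0.9) -> list:
    """Head pairs above the redundancy threshold from the table."""
    h = sim.shape[0]
    return [(i, j) for i in range(h) for j in range(i + 1, h)
            if sim[i, j] > thresh]
```

Many pairs above 0.9 within one layer is the redundancy signal; an empty list with varied similarities indicates healthy head diversity.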
Phase 4: Is Something Going Wrong?
| Question | What to look at | Good | Bad |
|---|---|---|---|
| Why did loss spike? | Checkpoint at spike step: grads, deltas, activations | Transient spike that recovers quickly | No recovery; basin ejection requiring earlier restart |
| Vanishing gradient? | Layer waterfall at fixed step | Roughly stable magnitude across depth (about 2-3x variance) | 100x+ collapse from late to early layers |
| Exploding gradient? | Global + per-param norm trends | Stable or slowly decaying norm | Exponential norm growth across steps |
| Is clipping hiding root cause? | Per-param norms when clipping active | Rare clipping on few tensors | Clipping every step from same offenders |
| Are activations healthy? | Per-layer min/max/mean/std | Stable mean/std with bounded ranges | Drift, exploding std, or extreme outliers |
| Is LR schedule right? | Loss, LR, and delta magnitude overlay | Warmup stable, peak controlled, decay helpful | Spike at LR peak or no gain during decay phase |
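
For the "exploding gradient?" row, sustained exponential growth shows up as a clearly positive slope of log(norm) over steps, while a stable run stays near zero. A minimal pure-Python least-squares sketch (the function name is illustrative):

```python
import math

def log_growth_rate(norms):
    """Least-squares slope of log(norm) vs step index.
    Doubling each step gives ln(2) ~ 0.693; stable norms give ~0."""
    ys = [math.log(n) for n in norms]
    n = len(ys)
    xbar = (n - 1) / 2
    ybar = sum(ys) / n
    num = sum((i - xbar) * (y - ybar) for i, y in enumerate(ys))
    den = sum((i - xbar) ** 2 for i in range(n))
    return num / den
```

Fitting in log space separates exponential blow-up from a one-step transient spike, which the first table row says should simply recover.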
Phase 5: Memorization -> Generalization
| Question | What to look at | Good | Bad |
|---|---|---|---|
| Is it overfitting? | Training loss vs held-out validation loss | Both decrease with a small and stable gap | Train improves while validation stalls/rises |
| When to stop pretraining? | Validation loss plateau window | Minimal improvement for long horizon (for example 100+ steps) | Still meaningful downward trend, or clear overfit trend |
| Is model capacity sufficient? | Final train loss vs entropy estimate | Approaches expected floor | Plateaus far above floor |
| Has it seen enough data? | Tokens processed + slope decay | Diminishing returns clearly visible | Still in steep drop phase |
| Any catastrophic forgetting? | Perplexity on early training probes over time | Remains low at late steps | Rises while training on newer slices |
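
The overfitting row can be turned into a simple signal: train still improving over a recent window while validation has stalled. `overfit_signal`, the window size, and the tolerance below are illustrative assumptions.

```python
def overfit_signal(train_losses, val_losses, window=3, tol=1e-3):
    """True if train loss dropped over the last `window` recorded points
    but validation loss did not (the stall/rise pattern in the table)."""
    train_drop = train_losses[-window] - train_losses[-1]
    val_drop = val_losses[-window] - val_losses[-1]
    return train_drop > tol and val_drop <= tol
```

Evaluate this at the same fixed steps used for run-to-run comparison so the signal is comparable across experiments.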
Phase 6: Transition Readiness (Pretrain -> SFT -> RLHF/GRPO)
| Question | What to look at | Good signal to transition | Not ready |
|---|---|---|---|
| When is pretraining done? | Validation slope over last 1K steps | Very low slope and coherent next-token quality on unseen text | Validation still dropping clearly |
| Should SFT start? | Instruction-like generation probes | Fluent language but weak instruction following | Still produces unstable or broken language |
| How to verify SFT works? | Instruction-format loss + held-out instruction eval | Format adherence and response quality improve | Loss drops but behavior still ignores instruction format |
| When move SFT -> preference tuning? | Consistency and quality profile | Follows format reliably, needs preference shaping | Still fails basic instruction compliance |
| How evaluate pretraining quality before SFT? | Half-prompt continuation on seen and held-out samples | Strong continuation on train + reasonable held-out perplexity | Cannot reliably continue seen samples |
Phase 7: Memorization Test Protocol

Step 1: Reserve probes before training

Set aside 10-20 fixed probe samples from training data for longitudinal memorization checks.

Step 2: Evaluate at resume checkpoints

Feed first half of each probe, generate continuation, and score exact-match, overlap, and perplexity on the true continuation.

Step 3: Plot memorization curve

Track progression from near-random accuracy toward high completion accuracy as steps grow.

Step 4: Compare with held-out curve

Training-probe vs held-out accuracy gap is your best generalization indicator for transition decisions.
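
Step 2's scoring can be sketched as below. Whitespace tokenization and the positional token-overlap metric are simplifying assumptions for illustration; a real harness would score with the model's own tokenizer and also record perplexity.

```python
def score_probe(generated: str, reference: str) -> dict:
    """Score a generated continuation against the true continuation."""
    gen, ref = generated.split(), reference.split()
    match = sum(g == r for g, r in zip(gen, ref))  # position-wise matches
    return {
        "exact_match": gen == ref,
        "overlap": match / max(len(ref), 1),
    }
```

Averaging `overlap` over the fixed probe set at each resume checkpoint produces the memorization curve tabulated below.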

| Step | Probe accuracy (rough) | Interpretation |
|---|---|---|
| 0 | ~0% | Untrained baseline |
| 100 | 5-15% | Learning token statistics |
| 500 | 30-50% | Pattern acquisition phase |
| 2000 | 70-90% | Strong memorization of seen structures |
| 5000+ | 90%+ | Train probes mostly memorized |

The Map: Question -> Tool

| Question category | Primary tool | Secondary tool |
|---|---|---|
| Is it training? | Loss curve + parity tracker | Gradient Health |
| Are gradients correct? | Gradient waterfall | Finite-difference gradient checks |
| What is it learning? | Attention heatmaps | Weight movement ranking |
| Is it memorizing? | Inference on probe set | Loss slope + perplexity trends |
| Is something wrong? | Event checkpoint autopsy | Triage cheat sheet |
| Is pretraining done? | Validation plateau | Held-out quality probes |
| Ready for SFT? | Instruction-following eval | Format compliance tracking |

UI framing rule: each dashboard tab should be phrased as a question, not just a metric name.

Outcome: this shifts the workflow from "did it pass?" to "why did it move this way?" with fast, affordable, and repeatable evidence.