v7 Gradient Accumulation Windows on CPU
This page explains one of the most common training confusions in v7: CE token normalization (/N) versus gradient-accumulation window normalization (/K), and how they combine to set the true optimizer update scale.
N and K are different: N is the number of valid targets inside one micro-step; K is the number of micro-steps accumulated before one optimizer update.
Terminology
| Term | Meaning | Typical v7 knob |
|---|---|---|
| Micro-batch | One forward+backward pass worth of tokens | seq_len (and token stream chunking) |
| N (CE denominator) | Valid targets in one micro-step (ignored rows excluded) | Derived at runtime |
| K (accumulation window) | Micro-steps before one optimizer update | --train-grad-accum |
| Effective tokens per optimizer update | Total tokens represented by one update | micro_tokens * K |
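How N might be derived at runtime can be sketched as follows; the function name and the ignore marker value are hypothetical illustrations, not v7 identifiers:

```python
IGNORE_INDEX = -1  # hypothetical marker for excluded target rows

def count_valid_targets(targets):
    """N for one micro-step: target rows not flagged as ignored."""
    return sum(1 for t in targets if t != IGNORE_INDEX)

# One micro-step with 8 positions, 2 of them excluded:
N = count_valid_targets([5, 12, -1, 7, 3, -1, 9, 1])  # N == 6
```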
The Two Normalizations You Need
Per micro-step CE with mean reduction gives:
\[ \nabla z = \frac{p - \text{one\_hot}}{N} \]
After accumulating K micro-steps, the optimizer must consume the average gradient:
\[ g_{\text{step}} = \frac{1}{K}\sum_{m=1}^{K} g_m \]
If each micro-step has a similar N, the update scale is effectively proportional to 1 / (N * K).
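A minimal numeric sketch of how the two normalizations compose (made-up gradient values; scalars stand in for gradient tensors):

```python
N, K = 8, 4
raw_sums = [2.0, 2.0, 2.0, 2.0]           # un-normalized gradient sums, one per micro-step
per_step = [g / N for g in raw_sums]      # /N from mean-reduced CE inside each micro-step
g_step = sum(per_step) / K                # /K window average fed to the optimizer
assert g_step == sum(raw_sums) / (N * K)  # each raw sum enters at weight 1/(N*K)
```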
What Was Missing in the Drift Repro
The CE /N normalization was already present. The missing piece was the /K window averaging before AdamW in the generated runtime when grad_accum > 1.
Concrete CPU Examples
| Micro tokens | K (grad_accum) | Effective tokens / optimizer step |
|---|---|---|
| 8 | 4 | 32 |
| 8 | 8 | 64 |
| 16 | 8 | 128 |
```python
# Example: simulate a larger batch on a CPU memory budget
# micro-step tokens = 8; grad_accum K = 4; effective tokens/update = 32
for step in range(1, total_micro_steps + 1):
    forward()
    backward()                # accumulate gradients
    if step % K == 0:         # accumulation boundary
        grads *= 1.0 / K      # or equivalently, backward(loss / K) each micro-step
        adamw_step()
        zero_grad()
```
Why This Mattered at lr = 1e-3
- If accumulated gradients are not divided by K, the optimizer input is approximately K times too large.
- K=8 with lr=1e-3 behaves closer to an effective step size of ~8e-3 in magnitude.
- This can look fine early, then destabilize later as AdamW moments integrate the oversized updates.
- Restoring /K moved the long-horizon run from late blow-up to stable descent in the same repro setup.
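The scale blow-up is easiest to see with plain-SGD arithmetic (AdamW's moment estimates complicate the exact factor, but the optimizer input is still ~K times too large); the numbers below are illustrative:

```python
lr, K = 1e-3, 8
mean_grad = 0.5                    # made-up window-mean gradient component
correct = lr * mean_grad           # with /K averaging
missing_k = lr * (K * mean_grad)   # without /K: the K gradients are effectively summed
assert missing_k == K * correct    # effective step size ~ lr * K = 8e-3
```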
Where It Lives in v7
- Train runtime codegen source: version/v7/scripts/codegen_train_runtime_v7.py
- Generated optimizer boundary in run-dir: generated_train_runtime_v7.c
- Related deep dives: v7-backprop-ir.html, v7-cross-entropy-parity.html
Variable Context Length Caveat (Important on CPU)
When sequence lengths vary per micro-step, N can vary too. Naive fixed scaling then biases updates: weighting every micro-step equally overweights tokens from short micro-steps.
Robust Strategy for Variable-Length Windows
- Track valid-token counts per micro-step (post-ignore filtering).
- Accumulate gradients in numerator form or keep per-step weighted scaling.
- Normalize by total valid targets in the full accumulation window, not a fixed constant.
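The steps above can be sketched as a window-level normalizer; scalar gradients stand in for tensors, and the names are illustrative rather than v7's:

```python
def window_mean_gradient(micro_steps):
    """micro_steps: (numerator_grad, n_valid) pairs, where numerator_grad is the
    gradient summed over valid tokens with no /N applied yet (numerator form).
    Normalizes once by the total valid-target count across the whole window."""
    numerator = sum(g for g, _ in micro_steps)      # accumulate in numerator form
    total_valid = sum(n for _, n in micro_steps)    # total valid targets in the window
    return numerator / max(total_valid, 1)          # guard against an all-ignored window

# Variable-length window: micro-steps with 8, 3, and 5 valid targets.
g_step = window_mean_gradient([(4.0, 8), (1.5, 3), (2.5, 5)])  # 8.0 / 16 = 0.5
```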
Operator Checklist
- Set grad_accum from memory budget and desired effective batch.
- Confirm one optimizer update per full window.
- Confirm accumulated gradient averaging (/K) before AdamW.
- Use long-horizon parity checks with a fixed token stream for drift triage.
Bottom Line
CE math correctness (p - one_hot, /N) is necessary but not sufficient. Production-stable backprop also requires window-correct optimizer scaling (/K) so CPU batch simulation behaves like true larger-batch training.