v7 Gradient Accumulation Windows on CPU

This page explains one of the most common training confusions in v7: CE token normalization (/N) versus gradient-accumulation window normalization (/K), and how they combine to set the true optimizer update scale.

Key Point
N and K are different: N is valid targets inside one micro-step; K is the number of micro-steps accumulated before one optimizer update.
Figure: Gradient accumulation window and optimizer boundary in v7

Terminology

| Term | Meaning | Typical v7 knob |
|---|---|---|
| Micro-batch | One forward+backward pass worth of tokens | seq_len (and token-stream chunking) |
| N (CE denominator) | Valid targets in one micro-step (ignore-index rows excluded) | Derived at runtime |
| K (accumulation window) | Micro-steps accumulated before one optimizer update | --train-grad-accum |
| Effective tokens per optimizer update | Total tokens represented by one update | micro_tokens * K |

The Two Normalizations You Need

Per micro-step, CE with mean reduction gives the gradient with respect to the logits:

\[ \nabla z = \frac{p - \text{one\_hot}}{N} \]

After accumulating K micro-steps, the optimizer must consume the average gradient:

\[ g_{\text{step}} = \frac{1}{K}\sum_{m=1}^{K} g_m \]

If each micro-step has similar N, update scale is effectively proportional to 1 / (N * K).
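The combined 1/(N*K) scaling can be checked with a small pure-Python sketch; the values of N, K, and the per-token gradients here are illustrative, not v7 defaults:

```python
import random

random.seed(0)
N, K = 8, 4  # illustrative targets per micro-step and window size

# Per-token raw gradient contributions (p - one_hot) for K micro-steps.
micro_grads = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(K)]

# CE with mean reduction: each micro-step gradient is the per-token mean (/N).
g_micro = [sum(g) / N for g in micro_grads]

# Window averaging before the optimizer: divide the accumulated sum by K.
g_step = sum(g_micro) / K

# Net effect: every token in the window is weighted by 1 / (N * K),
# i.e. g_step equals the mean over all N*K tokens in the window.
flat = [g for micro in micro_grads for g in micro]
assert abs(g_step - sum(flat) / (N * K)) < 1e-12
```

This equality is what makes accumulation a faithful simulation of one larger batch when N is constant across micro-steps.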

What Was Missing in the Drift Repro

CE /N was already present. The missing piece was window averaging (/K) before AdamW in the generated runtime when grad_accum > 1.

Concrete CPU Examples

| Micro tokens | K (grad_accum) | Effective tokens / optimizer step |
|---|---|---|
| 8 | 4 | 32 |
| 8 | 8 | 64 |
| 16 | 8 | 128 |
# Example: simulate larger batch on CPU memory budget
# micro-step tokens = 8
# grad_accum K      = 4
# effective tokens/update = 32

for each micro-step:
    forward()
    backward()                 # accumulate gradients

at accumulation boundary:
    grad *= 1/K               # or equivalently, backward(loss/K) each micro-step
    adamw_step()
    zero_grad()

Why This Mattered at lr = 1e-3

If accumulated gradients are not divided by K, the optimizer's input is approximately K times too large, roughly equivalent to multiplying the learning rate by K.
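A toy calculation makes the blow-up concrete. The numbers are hypothetical, and the scaling shown is the plain SGD-style intuition; AdamW's moment normalization absorbs part of the effect but still sees a biased gradient:

```python
K, lr = 4, 1e-3  # hypothetical window size and the learning rate in question

g_micro = 0.25           # same gradient each micro-step (toy value)
g_accum = g_micro * K    # accumulated sum held before the optimizer runs

update_correct = lr * (g_accum / K)  # with /K window averaging
update_buggy = lr * g_accum          # missing /K

# The buggy update is K times larger: lr = 1e-3 behaves like lr = K * 1e-3.
assert abs(update_buggy / update_correct - K) < 1e-12
```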

Where It Lives in v7

Variable Context Length Caveat (Important on CPU)

When sequence lengths vary per micro-step, N varies too. A fixed /N-then-/K scaling then weights each token by 1/(N_m * K), so tokens from short micro-steps count more than tokens from long ones, biasing updates.

Robust Strategy for Variable-Length Windows
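One robust approach, shown here as a sketch rather than the v7 implementation (the function and variable names are hypothetical), is to accumulate sum-reduced per-token gradients together with per-step valid-target counts, then divide once by the window's total token count:

```python
def window_average(micro_grad_sums, token_counts):
    """micro_grad_sums[m] is the SUM (not mean) of per-token gradients in
    micro-step m; token_counts[m] is that step's valid-target count N_m."""
    total_tokens = sum(token_counts)
    # Each token in the window is weighted by 1 / sum(N_m), regardless of
    # which micro-step it came from.
    return sum(micro_grad_sums) / total_tokens

# With equal N_m this reduces to the usual /N-then-/K scaling:
assert abs(window_average([8 * 0.5, 8 * 0.5], [8, 8]) - 0.5) < 1e-12
# With unequal lengths, tokens from short micro-steps get no extra weight:
assert abs(window_average([3 * 1.0, 5 * 1.0], [3, 5]) - 1.0) < 1e-12
```

The design choice is to defer all normalization to the optimizer boundary, so variable-length micro-steps never distort per-token weighting.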

Operator Checklist

  1. Set grad_accum from memory budget and desired effective batch.
  2. Confirm one optimizer update per full window.
  3. Confirm accumulated gradient averaging (/K) before AdamW.
  4. Use long-horizon parity checks with fixed token stream for drift triage.
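The checklist's parity check can be exercised with a minimal single-step version, using toy scalar "gradients" over a fixed stream (all names and values here are illustrative):

```python
import random

random.seed(42)
K, N = 4, 8  # illustrative window size and targets per micro-step
stream = [random.uniform(-1, 1) for _ in range(K * N)]  # fixed token stream

# Path 1: full-batch mean gradient over all tokens at once.
full = sum(stream) / len(stream)

# Path 2: K micro-steps of mean-reduced gradients, averaged (/K) at the
# accumulation boundary.
micros = [sum(stream[m * N:(m + 1) * N]) / N for m in range(K)]
accum = sum(micros) / K

# Long-horizon drift triage starts from this single-update parity.
assert abs(full - accum) < 1e-12
```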

Bottom Line

CE math correctness (p - one_hot, /N) is necessary but not sufficient. Production-stable backprop also requires window-correct optimizer scaling (/K) so CPU batch simulation behaves like true larger-batch training.
