v7 Gradient Accumulation Windows on CPU

This page explains one of the most common training confusions in v7: CE token normalization (/N) versus gradient-accumulation window normalization (/K), and how they combine to set the true optimizer update scale.

Key Point
N and K are different: N is valid targets inside one micro-step; K is the number of micro-steps accumulated before one optimizer update.
Figure: Gradient accumulation window and optimizer boundary in v7

Terminology

| Term | Meaning | Typical v7 knob |
|---|---|---|
| Micro-batch | One forward+backward pass worth of tokens | seq_len (and token-stream chunking) |
| N (CE denominator) | Valid targets in one micro-step (ignore-index rows excluded) | Derived at runtime |
| K (accumulation window) | Micro-steps accumulated before one optimizer update | --train-grad-accum |
| Effective tokens per optimizer update | Total tokens represented by one update | micro_tokens * K |

The Two Normalizations You Need

Per micro-step, CE with mean reduction gives the gradient with respect to the logits:

\[ \nabla z = \frac{p - \text{one\_hot}}{N} \]

After accumulating K micro-steps, the optimizer must consume the average gradient:

\[ g_{\text{step}} = \frac{1}{K}\sum_{m=1}^{K} g_m \]

If each micro-step has similar N, update scale is effectively proportional to 1 / (N * K).
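The combined 1/(N*K) scaling can be checked with a small pure-Python sketch; the values of N, K, and the per-token gradients here are illustrative, not v7 defaults:

```python
import random

random.seed(0)
N, K = 8, 4  # illustrative targets per micro-step and window size

# Per-token raw gradient contributions (p - one_hot) for K micro-steps.
micro_grads = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(K)]

# CE with mean reduction: each micro-step gradient is the per-token mean (/N).
g_micro = [sum(g) / N for g in micro_grads]

# Window averaging before the optimizer: divide the accumulated sum by K.
g_step = sum(g_micro) / K

# Net effect: every token in the window is weighted by 1 / (N * K),
# i.e. g_step equals the mean over all N*K tokens in the window.
flat = [g for micro in micro_grads for g in micro]
assert abs(g_step - sum(flat) / (N * K)) < 1e-12
```

This equality is what makes accumulation a faithful simulation of one larger batch when N is constant across micro-steps.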

What Was Missing in the Drift Repro

CE /N was already present. The missing piece was window averaging (/K) before AdamW in the generated runtime when grad_accum > 1.

Concrete CPU Examples

| Micro tokens | K (grad_accum) | Effective tokens / optimizer step |
|---|---|---|
| 8 | 4 | 32 |
| 8 | 8 | 64 |
| 16 | 8 | 128 |
# Example: simulate larger batch on CPU memory budget
# micro-step tokens = 8
# grad_accum K      = 4
# effective tokens/update = 32

for each micro-step:
    forward()
    backward()                 # accumulate gradients

at accumulation boundary:
    grad *= 1/K               # or equivalently, backward(loss/K) each micro-step
    adamw_step()
    zero_grad()

Why This Mattered at lr = 1e-3

If accumulated gradients are not divided by K, the optimizer's input is approximately K times too large, roughly equivalent to multiplying the learning rate by K.
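A toy calculation makes the blow-up concrete. The numbers are hypothetical, and the scaling shown is the plain SGD-style intuition; AdamW's moment normalization absorbs part of the effect but still sees a biased gradient:

```python
K, lr = 4, 1e-3  # hypothetical window size and the learning rate in question

g_micro = 0.25           # same gradient each micro-step (toy value)
g_accum = g_micro * K    # accumulated sum held before the optimizer runs

update_correct = lr * (g_accum / K)  # with /K window averaging
update_buggy = lr * g_accum          # missing /K

# The buggy update is K times larger: lr = 1e-3 behaves like lr = K * 1e-3.
assert abs(update_buggy / update_correct - K) < 1e-12
```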

Where It Lives in v7

Variable Context Length Caveat (Important on CPU)

When sequence lengths vary per micro-step, N varies too. A fixed /N-then-/K scaling then weights each token by 1/(N_m * K), so tokens from short micro-steps count more than tokens from long ones, biasing updates.

Robust Strategy for Variable-Length Windows
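One robust approach, shown here as a sketch rather than the v7 implementation (the function and variable names are hypothetical), is to accumulate sum-reduced per-token gradients together with per-step valid-target counts, then divide once by the window's total token count:

```python
def window_average(micro_grad_sums, token_counts):
    """micro_grad_sums[m] is the SUM (not mean) of per-token gradients in
    micro-step m; token_counts[m] is that step's valid-target count N_m."""
    total_tokens = sum(token_counts)
    # Each token in the window is weighted by 1 / sum(N_m), regardless of
    # which micro-step it came from.
    return sum(micro_grad_sums) / total_tokens

# With equal N_m this reduces to the usual /N-then-/K scaling:
assert abs(window_average([8 * 0.5, 8 * 0.5], [8, 8]) - 0.5) < 1e-12
# With unequal lengths, tokens from short micro-steps get no extra weight:
assert abs(window_average([3 * 1.0, 5 * 1.0], [3, 5]) - 1.0) < 1e-12
```

The design choice is to defer all normalization to the optimizer boundary, so variable-length micro-steps never distort per-token weighting.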

Operator Checklist

  1. Set grad_accum from memory budget and desired effective batch.
  2. Confirm one optimizer update per full window.
  3. Confirm accumulated gradient averaging (/K) before AdamW.
  4. Use long-horizon parity checks with fixed token stream for drift triage.
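The checklist's parity check can be exercised with a minimal single-step version, using toy scalar "gradients" over a fixed stream (all names and values here are illustrative):

```python
import random

random.seed(42)
K, N = 4, 8  # illustrative window size and targets per micro-step
stream = [random.uniform(-1, 1) for _ in range(K * N)]  # fixed token stream

# Path 1: full-batch mean gradient over all tokens at once.
full = sum(stream) / len(stream)

# Path 2: K micro-steps of mean-reduced gradients, averaged (/K) at the
# accumulation boundary.
micros = [sum(stream[m * N:(m + 1) * N]) / N for m in range(K)]
accum = sum(micros) / K

# Long-horizon drift triage starts from this single-update parity.
assert abs(full - accum) < 1e-12
```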

Bottom Line

CE math correctness (p - one_hot, /N) is necessary but not sufficient. Production-stable backprop also requires window-correct optimizer scaling (/K) so CPU batch simulation behaves like true larger-batch training.
