Operator Spectrum
The full pipeline from raw data to inference: a continuous refinement loop. Each stage below explains what it does, which docs pages cover it, and what to tune next.
Data (SVG, text, JSON) → Tokenizer (encode ↔ decode) → Data Prep (train/val split) → Model (IR → C kernels) → Forward (layer-by-layer) → Backward (weight updates) → Loss (convergence signal) → Inference (stop at </svg> or <eos>)
📦 Stage 1 — Data
Everything starts here. The quality and structure of your raw corpus determines the ceiling for everything downstream.
For SVG training: validate ASCII-only, deduplicate near-identical rows, add explicit <eos> boundaries between samples.
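The hygiene pass above can be sketched in a few lines. This is a minimal illustration, not the project's actual data loader: `clean_corpus` and its whitespace-based dedup key are assumptions.

```python
# Minimal sketch of the Stage 1 hygiene pass: ASCII validation, near-duplicate
# removal, and explicit <eos> boundaries between samples.

def clean_corpus(rows):
    seen = set()
    out = []
    for row in rows:
        if not row.isascii():           # reject rows with non-ASCII bytes
            continue
        key = " ".join(row.split())     # crude normalization for near-dup dedup
        if key in seen:
            continue
        seen.add(key)
        out.append(row)
    return "<eos>".join(out) + "<eos>"  # explicit sample boundaries

corpus = clean_corpus(['<svg w="1"/>', '<svg  w="1"/>', '<svg h="2"/>\u00e9'])
# second row collapses onto the first; third row fails the ASCII check
```

A real dedup pass would likely use a stronger similarity key (e.g. hashing after attribute normalization), but the gate order is the point: validate, dedup, then mark boundaries.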
🔤 Stage 2 — Tokenizer
Converts raw text into integer token sequences the model can process. BPE learns merge rules from the corpus; vocab size directly affects model capacity. The encode→decode roundtrip must be lossless — any artifact here propagates through the entire pipeline.
Special tokens (<eos>, <pad>) · roundtrip parity test
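The roundtrip parity gate can be demonstrated with a stand-in tokenizer. The `Tokenizer` class here is a character-level sketch for illustration, not the project's real BPE implementation; only the encode→decode == identity property carries over.

```python
# Sketch of the roundtrip parity test: decode(encode(x)) must equal x
# byte-for-byte, or the artifact propagates through the whole pipeline.

class Tokenizer:
    def __init__(self, corpus, specials=("<eos>", "<pad>")):
        vocab = sorted(set(corpus))
        self.itos = list(specials) + vocab           # specials get low ids
        self.stoi = {s: i for i, s in enumerate(self.itos)}

    def encode(self, text):
        return [self.stoi[ch] for ch in text]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)

tok = Tokenizer('<svg width="10"/>')
sample = '<svg width="10"/>'
assert tok.decode(tok.encode(sample)) == sample      # parity must be lossless
```

In a BPE tokenizer the same assertion holds; it just runs over merge rules instead of single characters.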
⚙️ Stage 3 — Data Preparation
Token sequences are chunked into fixed-length windows, batched, and optionally shuffled. Sequence length determines how much context the model sees per step; batch size × grad accumulation steps = effective batch. A train/val split lets you detect memorization vs. generalization.
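The chunking step can be sketched as follows. Window length and split ratio are illustrative; `make_windows` is a hypothetical name, not the project's API.

```python
# Stage 3 sketch: chunk a token stream into fixed-length (input, target)
# windows, where targets are the inputs shifted by one for next-token
# prediction, then split train/val.

def make_windows(tokens, seq_len):
    n = (len(tokens) - 1) // seq_len    # -1 leaves room for the shifted target
    return [(tokens[i*seq_len : i*seq_len + seq_len],
             tokens[i*seq_len + 1 : i*seq_len + seq_len + 1])
            for i in range(n)]

tokens = list(range(100))
windows = make_windows(tokens, seq_len=16)   # 6 windows from 100 tokens
split = int(0.9 * len(windows))
train, val = windows[:split], windows[split:]
```

The effective-batch identity from the text then reads: if each step consumes `batch_size` windows and gradients are accumulated over `accum_steps` steps, the optimizer sees `batch_size * accum_steps` windows per update.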
🧠 Stage 4 — AI Algorithm (Model Architecture)
The Transformer architecture — attention heads, feed-forward layers, embeddings. The IR defines the computational graph; codegen compiles it to C kernels. This is where you choose how many layers, heads, embed dimensions, and what quantization format weights live in.
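The knobs named above can be gathered into a config object. The field names below mirror common GPT-style conventions and are assumptions, not necessarily this project's IR schema.

```python
# Hypothetical Stage 4 architecture config: layers, heads, embed width,
# vocab size, context length, and the numeric format weights live in.
from dataclasses import dataclass

@dataclass
class ModelConfig:
    n_layer: int = 4        # transformer blocks
    n_head: int = 4         # attention heads per block
    n_embd: int = 128       # embedding width; must be divisible by n_head
    vocab_size: int = 512   # set by the tokenizer, not chosen freely
    seq_len: int = 256      # context window from Stage 3
    dtype: str = "f32"      # quantization format for stored weights

cfg = ModelConfig()
assert cfg.n_embd % cfg.n_head == 0   # head_dim must be an integer
```

The divisibility assert is the kind of cheap invariant worth checking before codegen, since a bad `head_dim` fails much later and less legibly in the C kernels.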
▶️ Stage 5 — Forward Pass
Input tokens flow through embedding → attention → FFN → output logits. Each kernel (matmul, softmax, layernorm, RoPE) runs as optimized C with SIMD intrinsics. Memory layout (row-major vs. transposed) and tiling strategy determine throughput.
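As a reference for what one of those kernel chains computes, here is a naive single-head causal attention in pure Python. This is a clarity sketch under row-major layout, not the SIMD C implementation the text describes; no batching, no RoPE.

```python
# Naive forward reference: scaled dot-product attention with a causal mask.
# q, k, v are lists of d-dim vectors, one row per position (row-major).
import math

def softmax(row):
    m = max(row)                        # subtract max for numerical stability
    e = [math.exp(x - m) for x in row]
    s = sum(e)
    return [x / s for x in e]

def attention(q, k, v):
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
                  for kj in k[:i + 1]]  # causal: position i sees only <= i
        w = softmax(scores)
        out.append([sum(wj * v[j][c] for j, wj in enumerate(w))
                    for c in range(d)])
    return out

out = attention([[1.0, 0.0]], [[1.0, 0.0]], [[0.5, 0.5]])
```

An optimized kernel computes the same values; tiling and transposed layouts only change memory traffic, never the math, which is what makes a reference like this useful for parity testing.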
◀️ Stage 6 — Backward Pass
Gradients flow backward through the chain rule. Each forward kernel has a corresponding backward kernel that computes dL/dW and dL/dx.
Gradient accumulation lets you simulate larger batches on limited memory. Watch for numerical drift — the cross-entropy parity test catches it.
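The parity idea can be shown concretely: the analytic gradient of cross-entropy-with-softmax is `softmax(logits) - onehot(target)`, and a finite-difference check catches drift in either implementation. A framework-free sketch; the tolerance is an assumption.

```python
# Cross-entropy parity sketch: compare the analytic gradient against a
# one-sided finite difference of the loss.
import math

def ce_loss(logits, target):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[target]         # -log softmax(logits)[target]

def ce_grad(logits, target):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    s = sum(e)
    return [e[i] / s - (1.0 if i == target else 0.0)
            for i in range(len(logits))]

logits, target, eps = [2.0, 0.5, -1.0], 0, 1e-5
analytic = ce_grad(logits, target)
for i in range(len(logits)):
    bumped = list(logits)
    bumped[i] += eps
    numeric = (ce_loss(bumped, target) - ce_loss(logits, target)) / eps
    assert abs(numeric - analytic[i]) < 1e-4   # parity within tolerance
```

The same pattern scales up: run the C kernel and the PyTorch reference on identical inputs and compare elementwise, keeping accumulation order fixed so the comparison is apples-to-apples.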
📉 Stage 7 — Loss & Evaluation
Cross-entropy loss measures how far the model's predictions are from ground truth. Decreasing loss ≠ good model; you need eval metrics: valid SVG rate, exact-match rate, prompt-to-SVG success. Compare train loss vs. val loss to detect overfitting.
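One of those eval metrics, valid-SVG rate, can be computed by actually parsing each generated sample. The metric name follows the text; using `xml.etree` as the validity check is an assumption (it tests well-formed XML, not full SVG semantics).

```python
# Stage 7 eval sketch: fraction of generated samples that parse as
# well-formed XML with a single root element.
import xml.etree.ElementTree as ET

def valid_svg_rate(samples):
    ok = 0
    for s in samples:
        try:
            ET.fromstring(s)            # raises ParseError on malformed markup
            ok += 1
        except ET.ParseError:
            pass
    return ok / len(samples)

rate = valid_svg_rate(['<svg><rect/></svg>', '<svg><rect></svg>'])
# second sample has a mismatched tag, so rate is 0.5
```

Exact-match and prompt-to-SVG success follow the same shape: a boolean predicate per sample, averaged over the eval set, tracked alongside train/val loss.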
✨ Stage 8 — Inference
The payoff. Feed a prompt, generate tokens auto-regressively until </svg> or <eos>.
Temperature and top-k control creativity vs. coherence. If the output loops or generates artifacts, the fix is usually upstream: better data boundaries, more diverse training samples, or explicit stop tokens.
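The two sampling knobs can be sketched in one function. Pure Python for clarity; `sample_next` and the call pattern are illustrative, not the project's inference API.

```python
# Inference sketch: top-k filtering, then temperature-scaled softmax, then
# a categorical draw. Lower temperature sharpens; smaller k restricts choice.
import math, random

def sample_next(logits, temperature=1.0, top_k=None):
    if top_k is not None:
        kth = sorted(logits, reverse=True)[top_k - 1]
        logits = [x if x >= kth else float("-inf") for x in logits]
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    probs = [math.exp(x - m) for x in scaled]   # exp(-inf) == 0.0
    s = sum(probs)
    probs = [p / s for p in probs]
    return random.choices(range(len(logits)), weights=probs)[0]

random.seed(0)
tok = sample_next([3.0, 1.0, -2.0, 0.5], temperature=0.8, top_k=2)
assert tok in (0, 1)    # top_k=2 zeroes out everything but the two best
```

In the full loop this runs once per step, appending `tok` until the stop condition (`</svg>` or `<eos>`) fires, which is why missing stop tokens upstream show up here as runaway generation.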
Refinement Checklist
Each iteration through the loop targets a different part of the spectrum. The operator's job is to identify which stage is the bottleneck, fix it, and re-enter the loop.
| Symptom | Bottleneck Stage | Action |
|---|---|---|
| UTF artifacts in output | ② Tokenizer | Validate ASCII corpus, check roundtrip parity |
| Output is random noise | ④ Model / ⑥ Backward | Check weight init, verify gradient flow, reduce LR |
| Loops / repeats templates | ① Data | Add <eos> boundaries, deduplicate rows |
| Boundary glitches (/svg>, 00") | ③ Data Prep | Explicit end markers, stop at first </svg> |
| Loss plateaus | ⑦ Loss | LR schedule, more diverse data, increase seq-len |
| Memorizes but won't generalize | ① Data / ③ Data Prep | Train/val split, more SVG variety, check overfitting |
| Inference is slow | ⑤ Forward | Profile hotspots, check SIMD paths, quantize weights |
| Numerical drift (C vs PyTorch) | ⑥ Backward | Cross-entropy parity test, check accumulation order |
Controlled Recipe Sweep
Once correctness gates pass, systematically sweep these knobs (one at a time):