Operator Spectrum
The full pipeline from raw data to inference: a continuous refinement loop. Each stage below explains what it does, which docs pages cover it, and what to tune next.
Data (SVG, text, JSON) → Tokenizer (encode ↔ decode) → Data Prep (train/val split) → Model (IR → C kernels) → Forward (layer-by-layer) → Backward (weight updates) → Loss (convergence signal) → Inference (stop at </svg> or <eos>)
📦 Stage 1 — Data
Everything starts here. The quality and structure of your raw corpus determines the ceiling for everything downstream.
For SVG training: validate ASCII-only, deduplicate near-identical rows, add explicit <eos> boundaries between samples.
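The hygiene pass above can be sketched in a few lines. This is a minimal illustration, not the project's actual data loader: `clean_corpus` and its whitespace-based dedup key are assumptions.

```python
# Minimal sketch of the Stage 1 hygiene pass: ASCII validation, near-duplicate
# removal, and explicit <eos> boundaries between samples.

def clean_corpus(rows):
    seen = set()
    out = []
    for row in rows:
        if not row.isascii():           # reject rows with non-ASCII bytes
            continue
        key = " ".join(row.split())     # crude normalization for near-dup dedup
        if key in seen:
            continue
        seen.add(key)
        out.append(row)
    return "<eos>".join(out) + "<eos>"  # explicit sample boundaries

corpus = clean_corpus(['<svg w="1"/>', '<svg  w="1"/>', '<svg h="2"/>\u00e9'])
# second row collapses onto the first; third row fails the ASCII check
```

A real dedup pass would likely use a stronger similarity key (e.g. hashing after attribute normalization), but the gate order is the point: validate, dedup, then mark boundaries.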
🔤 Stage 2 — Tokenizer
Converts raw text into integer token sequences the model can process. BPE learns merge rules from the corpus; vocab size directly affects model capacity. The encode→decode roundtrip must be lossless — any artifact here propagates through the entire pipeline.
Special tokens (<eos>, <pad>) · roundtrip parity test
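The roundtrip parity gate can be demonstrated with a stand-in tokenizer. The `Tokenizer` class here is a character-level sketch for illustration, not the project's real BPE implementation; only the encode→decode == identity property carries over.

```python
# Sketch of the roundtrip parity test: decode(encode(x)) must equal x
# byte-for-byte, or the artifact propagates through the whole pipeline.

class Tokenizer:
    def __init__(self, corpus, specials=("<eos>", "<pad>")):
        vocab = sorted(set(corpus))
        self.itos = list(specials) + vocab           # specials get low ids
        self.stoi = {s: i for i, s in enumerate(self.itos)}

    def encode(self, text):
        return [self.stoi[ch] for ch in text]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)

tok = Tokenizer('<svg width="10"/>')
sample = '<svg width="10"/>'
assert tok.decode(tok.encode(sample)) == sample      # parity must be lossless
```

In a BPE tokenizer the same assertion holds; it just runs over merge rules instead of single characters.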
⚙️ Stage 3 — Data Preparation
Token sequences are chunked into fixed-length windows, batched, and optionally shuffled. Sequence length determines how much context the model sees per step; batch size × grad accumulation steps = effective batch. A train/val split lets you detect memorization vs. generalization.
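The chunking step can be sketched as follows. Window length and split ratio are illustrative; `make_windows` is a hypothetical name, not the project's API.

```python
# Stage 3 sketch: chunk a token stream into fixed-length (input, target)
# windows, where targets are the inputs shifted by one for next-token
# prediction, then split train/val.

def make_windows(tokens, seq_len):
    n = (len(tokens) - 1) // seq_len    # -1 leaves room for the shifted target
    return [(tokens[i*seq_len : i*seq_len + seq_len],
             tokens[i*seq_len + 1 : i*seq_len + seq_len + 1])
            for i in range(n)]

tokens = list(range(100))
windows = make_windows(tokens, seq_len=16)   # 6 windows from 100 tokens
split = int(0.9 * len(windows))
train, val = windows[:split], windows[split:]
```

The effective-batch identity from the text then reads: if each step consumes `batch_size` windows and gradients are accumulated over `accum_steps` steps, the optimizer sees `batch_size * accum_steps` windows per update.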
🧠 Stage 4 — AI Algorithm (Model Architecture)
The Transformer architecture — attention heads, feed-forward layers, embeddings. The IR defines the computational graph; codegen compiles it to C kernels. This is where you choose how many layers, heads, embed dimensions, and what quantization format weights live in.
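The knobs named above can be gathered into a config object. The field names below mirror common GPT-style conventions and are assumptions, not necessarily this project's IR schema.

```python
# Hypothetical Stage 4 architecture config: layers, heads, embed width,
# vocab size, context length, and the numeric format weights live in.
from dataclasses import dataclass

@dataclass
class ModelConfig:
    n_layer: int = 4        # transformer blocks
    n_head: int = 4         # attention heads per block
    n_embd: int = 128       # embedding width; must be divisible by n_head
    vocab_size: int = 512   # set by the tokenizer, not chosen freely
    seq_len: int = 256      # context window from Stage 3
    dtype: str = "f32"      # quantization format for stored weights

cfg = ModelConfig()
assert cfg.n_embd % cfg.n_head == 0   # head_dim must be an integer
```

The divisibility assert is the kind of cheap invariant worth checking before codegen, since a bad `head_dim` fails much later and less legibly in the C kernels.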
▶️ Stage 5 — Forward Pass
Input tokens flow through embedding → attention → FFN → output logits. Each kernel (matmul, softmax, layernorm, RoPE) runs as optimized C with SIMD intrinsics. Memory layout (row-major vs. transposed) and tiling strategy determine throughput.
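As a reference for what one of those kernel chains computes, here is a naive single-head causal attention in pure Python. This is a clarity sketch under row-major layout, not the SIMD C implementation the text describes; no batching, no RoPE.

```python
# Naive forward reference: scaled dot-product attention with a causal mask.
# q, k, v are lists of d-dim vectors, one row per position (row-major).
import math

def softmax(row):
    m = max(row)                        # subtract max for numerical stability
    e = [math.exp(x - m) for x in row]
    s = sum(e)
    return [x / s for x in e]

def attention(q, k, v):
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
                  for kj in k[:i + 1]]  # causal: position i sees only <= i
        w = softmax(scores)
        out.append([sum(wj * v[j][c] for j, wj in enumerate(w))
                    for c in range(d)])
    return out

out = attention([[1.0, 0.0]], [[1.0, 0.0]], [[0.5, 0.5]])
```

An optimized kernel computes the same values; tiling and transposed layouts only change memory traffic, never the math, which is what makes a reference like this useful for parity testing.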
◀️ Stage 6 — Backward Pass
Gradients flow backward through the chain rule. Each forward kernel has a corresponding backward kernel that computes dL/dW and dL/dx.
Gradient accumulation lets you simulate larger batches on limited memory. Watch for numerical drift — the cross-entropy parity test catches it.
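The parity idea can be shown concretely: the analytic gradient of cross-entropy-with-softmax is `softmax(logits) - onehot(target)`, and a finite-difference check catches drift in either implementation. A framework-free sketch; the tolerance is an assumption.

```python
# Cross-entropy parity sketch: compare the analytic gradient against a
# one-sided finite difference of the loss.
import math

def ce_loss(logits, target):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[target]         # -log softmax(logits)[target]

def ce_grad(logits, target):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    s = sum(e)
    return [e[i] / s - (1.0 if i == target else 0.0)
            for i in range(len(logits))]

logits, target, eps = [2.0, 0.5, -1.0], 0, 1e-5
analytic = ce_grad(logits, target)
for i in range(len(logits)):
    bumped = list(logits)
    bumped[i] += eps
    numeric = (ce_loss(bumped, target) - ce_loss(logits, target)) / eps
    assert abs(numeric - analytic[i]) < 1e-4   # parity within tolerance
```

The same pattern scales up: run the C kernel and the PyTorch reference on identical inputs and compare elementwise, keeping accumulation order fixed so the comparison is apples-to-apples.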
📉 Stage 7 — Loss & Evaluation
Cross-entropy loss measures how far the model's predictions are from ground truth. Decreasing loss ≠ good model; you need eval metrics: valid SVG rate, exact-match rate, prompt-to-SVG success. Compare train loss vs. val loss to detect overfitting.
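One of those eval metrics, valid-SVG rate, can be computed by actually parsing each generated sample. The metric name follows the text; using `xml.etree` as the validity check is an assumption (it tests well-formed XML, not full SVG semantics).

```python
# Stage 7 eval sketch: fraction of generated samples that parse as
# well-formed XML with a single root element.
import xml.etree.ElementTree as ET

def valid_svg_rate(samples):
    ok = 0
    for s in samples:
        try:
            ET.fromstring(s)            # raises ParseError on malformed markup
            ok += 1
        except ET.ParseError:
            pass
    return ok / len(samples)

rate = valid_svg_rate(['<svg><rect/></svg>', '<svg><rect></svg>'])
# second sample has a mismatched tag, so rate is 0.5
```

Exact-match and prompt-to-SVG success follow the same shape: a boolean predicate per sample, averaged over the eval set, tracked alongside train/val loss.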
✨ Stage 8 — Inference
The payoff. Feed a prompt, generate tokens auto-regressively until </svg> or <eos>.
Temperature and top-k control creativity vs. coherence. If the output loops or generates artifacts, the fix is usually upstream: better data boundaries, more diverse training samples, or explicit stop tokens.
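The two sampling knobs can be sketched in one function. Pure Python for clarity; `sample_next` and the call pattern are illustrative, not the project's inference API.

```python
# Inference sketch: top-k filtering, then temperature-scaled softmax, then
# a categorical draw. Lower temperature sharpens; smaller k restricts choice.
import math, random

def sample_next(logits, temperature=1.0, top_k=None):
    if top_k is not None:
        kth = sorted(logits, reverse=True)[top_k - 1]
        logits = [x if x >= kth else float("-inf") for x in logits]
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    probs = [math.exp(x - m) for x in scaled]   # exp(-inf) == 0.0
    s = sum(probs)
    probs = [p / s for p in probs]
    return random.choices(range(len(logits)), weights=probs)[0]

random.seed(0)
tok = sample_next([3.0, 1.0, -2.0, 0.5], temperature=0.8, top_k=2)
assert tok in (0, 1)    # top_k=2 zeroes out everything but the two best
```

In the full loop this runs once per step, appending `tok` until the stop condition (`</svg>` or `<eos>`) fires, which is why missing stop tokens upstream show up here as runaway generation.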
Refinement Checklist
Each iteration through the loop targets a different part of the spectrum. The operator's job is to identify which stage is the bottleneck, fix it, and re-enter the loop.
| Symptom | Bottleneck Stage | Action |
|---|---|---|
| UTF artifacts in output | ② Tokenizer | Validate ASCII corpus, check roundtrip parity |
| Output is random noise | ④ Model / ⑥ Backward | Check weight init, verify gradient flow, reduce LR |
| Loops / repeats templates | ① Data | Add <eos> boundaries, deduplicate rows |
| Boundary glitches (/svg>, 00") | ③ Data Prep | Explicit end markers, stop at first </svg> |
| Loss plateaus | ⑦ Loss | LR schedule, more diverse data, increase seq-len |
| Memorizes but won't generalize | ① Data / ③ Data Prep | Train/val split, more SVG variety, check overfitting |
| Inference is slow | ⑤ Forward | Profile hotspots, check SIMD paths, quantize weights |
| Numerical drift (C vs PyTorch) | ⑥ Backward | Cross-entropy parity test, check accumulation order |
Controlled Recipe Sweep
Once correctness gates pass, systematically sweep these knobs (one at a time):