Operator Spectrum

The full pipeline from raw data to inference — a continuous refinement loop. Each stage below describes what it does and what to tune next.

1. 📦 Data: raw corpus (SVG, text, JSON)
2. 🔤 Tokenizer: BPE / WordPiece (encode ↔ decode)
3. ⚙️ Data Prep: batching, seq-len, train/val split
4. 🧠 AI Algorithm: Transformer / GPT (IR → C kernels)
5. ▶️ Forward: matmul, attention, layer-by-layer
6. ◀️ Backward: gradients, chain rule, weight updates
7. 📉 Loss: cross-entropy, convergence signal
8. ✨ Inference: prompt → output, stop at </svg> or <eos>
← refine data, retrain, evaluate — the loop keeps getting tighter →

📦 Stage 1 — Data

Everything starts here. The quality and structure of your raw corpus determine the ceiling for everything downstream. For SVG training: validate that the corpus is ASCII-only, deduplicate near-identical rows, and add explicit <eos> boundaries between samples.

Tune next: ASCII validation gate · row deduplication · explicit end markers · train/val split ratio
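Those three gates fit in a few lines. A minimal sketch (the function name and the literal `<eos>` marker are illustrative, not the project's actual API):

```python
def clean_corpus(rows, eos="<eos>"):
    """Validate, deduplicate, and join raw samples with explicit boundaries."""
    seen, kept = set(), []
    for row in rows:
        row = row.strip()
        if not row or not row.isascii():  # ASCII validation gate
            continue
        if row in seen:                   # row deduplication
            continue
        seen.add(row)
        kept.append(row)
    return eos.join(kept) + eos           # every sample is followed by an end marker

corpus = clean_corpus(['<svg/>', '<svg/>', 'café', '<svg width="8"/>'])
# → '<svg/><eos><svg width="8"/><eos>'
```

Exact-duplicate filtering is shown here; near-duplicate detection would need a similarity pass on top.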

🔤 Stage 2 — Tokenizer

Converts raw text into integer token sequences the model can process. BPE learns merge rules from the corpus; vocab size directly affects model capacity. The encode→decode roundtrip must be lossless — any artifact here propagates through the entire pipeline.

Tune next: vocab size (640/1024) · merge count · special tokens (<eos>, <pad>) · roundtrip parity test
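The parity gate is easy to state even against a stand-in tokenizer. `CharTokenizer` below is a hypothetical char-level substitute for the real BPE, used only to show the shape of the roundtrip test:

```python
class CharTokenizer:
    """Char-level stand-in for the BPE tokenizer; lossless by construction."""
    def __init__(self, corpus, specials=("<eos>", "<pad>")):
        self.vocab = list(specials) + sorted(set(corpus))
        self.stoi = {s: i for i, s in enumerate(self.vocab)}

    def encode(self, text):
        return [self.stoi[ch] for ch in text]

    def decode(self, ids):
        return "".join(self.vocab[i] for i in ids)

tok = CharTokenizer('<svg width="8"/>')
sample = '<svg/>'
assert tok.decode(tok.encode(sample)) == sample  # roundtrip parity gate
```

With a real BPE the same assertion must hold for every corpus row; a single failure means artifacts will propagate downstream.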

⚙️ Stage 3 — Data Preparation

Token sequences are chunked into fixed-length windows, batched, and optionally shuffled. Sequence length determines how much context the model sees per step; batch size × grad accumulation steps = effective batch. A train/val split lets you detect memorization vs. generalization.

Tune next: seq-len (96/128) · batch size · grad accumulation (4) · shuffle seed · train/val split (90/10)
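A sketch of the windowing and split, assuming next-token targets shifted by one (names and the tail-holdout split are illustrative):

```python
def make_windows(ids, seq_len):
    """Chunk a token stream into fixed-length (input, target) windows."""
    windows = []
    for i in range(0, len(ids) - seq_len, seq_len):
        x = ids[i : i + seq_len]
        y = ids[i + 1 : i + seq_len + 1]  # next-token targets, shifted by one
        windows.append((x, y))
    return windows

def train_val_split(windows, val_frac=0.10):
    """Hold out the tail as validation (shuffle with a fixed seed in practice)."""
    n_val = max(1, int(len(windows) * val_frac))
    return windows[:-n_val], windows[-n_val:]

wins = make_windows(list(range(100)), seq_len=8)
train, val = train_val_split(wins)
# effective batch = batch size x grad accumulation steps, e.g. 8 x 4 = 32
```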

🧠 Stage 4 — AI Algorithm (Model Architecture)

The Transformer architecture — attention heads, feed-forward layers, embeddings. The IR defines the computational graph; codegen compiles it to C kernels. This is where you choose how many layers, heads, and embedding dimensions to use, and which quantization format the weights live in.

IR Visualizer: Load your model's IR JSON → Architecture card, Dataflow tab, Weight Dtype Audit
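Those architecture choices fix the parameter budget. A back-of-envelope count, assuming a GPT-style decoder with tied embeddings and no bias terms (the function and dims are illustrative, not this project's real config, which lives in the IR JSON):

```python
def param_count(vocab, d_model, n_layers, d_ff=None):
    """Weights-only count for a GPT-style decoder: tied embeddings, no biases."""
    d_ff = d_ff or 4 * d_model
    emb = vocab * d_model            # token embeddings, tied with the output head
    attn = 4 * d_model * d_model     # Wq, Wk, Wv, Wo (heads split d_model: no extra params)
    ffn = 2 * d_model * d_ff         # up projection + down projection
    return emb + n_layers * (attn + ffn)

n = param_count(vocab=1024, d_model=128, n_layers=4)  # ~0.9M weights
```

Multiply by the bytes-per-weight of your quantization format to estimate the model's memory footprint.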

▶️ Stage 5 — Forward Pass

Input tokens flow through embedding → attention → FFN → output logits. Each kernel (matmul, softmax, layernorm, RoPE) runs as optimized C with SIMD intrinsics. Memory layout (row-major vs. transposed) and tiling strategy determine throughput.

IR Visualizer: Dataflow tab (kernel graph), Weight Dtype Audit (quant math hover popups)
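The real kernels are C with SIMD, but the layout concern is visible even in a naive Python matmul sketch: choosing a loop order that walks both operands row-major is the same decision the transposed-layout and tiling strategies refine further (illustrative code, not the project's kernel):

```python
def matmul(A, B):
    """Naive row-major matmul; the C kernels layer SIMD and tiling on top of this."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for p in range(k):
            a = A[i][p]
            for j in range(m):      # i-p-j order streams B's rows: cache-friendly
                C[i][j] += a * B[p][j]
    return C

assert matmul([[1, 2]], [[3], [4]]) == [[11.0]]
```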

◀️ Stage 6 — Backward Pass

Gradients flow backward through the chain rule. Each forward kernel has a corresponding backward kernel that computes dL/dW and dL/dx. Gradient accumulation lets you simulate larger batches on limited memory. Watch for numerical drift — the cross-entropy parity test catches it.

Tune next: learning rate (2e-4 → 1e-4) · grad clipping · accumulation steps (4) · weight decay
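Accumulation and clipping compose into one update step. A sketch on a single scalar weight (real training applies this per-tensor, and typically clips by global norm rather than per-value):

```python
def accum_step(w, micro_grads, lr=2e-4, clip=1.0):
    """One optimizer step from accumulated micro-batch gradients (scalar weight)."""
    g = sum(micro_grads) / len(micro_grads)  # mean over accumulation steps
    g = max(-clip, min(clip, g))             # gradient clipping
    return w - lr * g

w = accum_step(0.5, [0.2, 0.4, 0.1, 0.3])    # accumulation steps = 4
```

Averaging over micro-batches is what makes 4 accumulation steps behave like a 4x larger batch at the same memory cost.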

📉 Stage 7 — Loss & Evaluation

Cross-entropy loss measures how far the model's predictions are from ground truth. Decreasing loss ≠ good model; you need eval metrics: valid SVG rate, exact-match rate, prompt-to-SVG success. Compare train loss vs. val loss to detect overfitting.

Track: train loss · val loss · valid SVG rate · exact-match rate · prompt→SVG success rate
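The loss itself is small enough to write out, and it gives you a free sanity baseline: a model that predicts uniformly over V classes scores exactly ln(V), so anything at or above that is still random noise. A minimal stable implementation:

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one position: -log softmax(logits)[target]."""
    m = max(logits)                              # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return -math.log(exps[target] / sum(exps))

# uniform over 4 classes scores exactly ln(4): the random-noise baseline
assert abs(cross_entropy([0.0] * 4, 2) - math.log(4)) < 1e-9
```

This is also the reference used by the C-vs-PyTorch parity test mentioned under the backward pass.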

✨ Stage 8 — Inference

The payoff. Feed a prompt, generate tokens auto-regressively until </svg> or <eos>. Temperature and top-k control creativity vs. coherence. If the output loops or generates artifacts, the fix is usually upstream: better data boundaries, more diverse training samples, or explicit stop tokens.

Tune next: temperature · top-k · stop tokens · repetition penalty · max-gen-len
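Temperature, top-k, and stop tokens all live in the decode loop. A sketch, assuming the model forward pass is abstracted behind a `step` callable (names are illustrative):

```python
import math, random

def sample_next(logits, temperature=0.9, top_k=40, rng=random):
    """Temperature + top-k sampling over raw logits."""
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    mx = max(logits[i] for i in idx) / temperature
    ps = [math.exp(logits[i] / temperature - mx) for i in idx]  # stable softmax weights
    r = rng.random() * sum(ps)
    for i, p in zip(idx, ps):
        r -= p
        if r <= 0:
            return i
    return idx[-1]

def generate(step, prompt_ids, stop_ids, max_gen_len=256):
    """Autoregressive loop: emit until a stop token (</svg> / <eos> id) or max length."""
    out = list(prompt_ids)
    for _ in range(max_gen_len):
        nxt = step(out)              # step = model forward + sample_next
        out.append(nxt)
        if nxt in stop_ids:
            break
    return out

assert sample_next([0.0, 5.0, 1.0], top_k=1) == 1   # top-k = 1 is greedy argmax
```

Note that `max_gen_len` is the safety net: if the model never emits a stop token, that is the upstream data-boundary problem the paragraph above describes.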

🔬 Pipeline Mastery — where the refinement loop tightens

Random noise → Coherent tokens → Structured output → Correct SVGs → Generalizes

Refinement Checklist

Each iteration through the loop targets a different part of the spectrum. The operator's job is to identify which stage is the bottleneck, fix it, and re-enter the loop.

Symptom                        │ Bottleneck Stage      │ Action
───────────────────────────────┼───────────────────────┼─────────────────────────────────────────────────────
UTF artifacts in output        │ ② Tokenizer           │ Validate ASCII corpus, check roundtrip parity
Output is random noise         │ ④ Model / ⑥ Backward  │ Check weight init, verify gradient flow, reduce LR
Loops / repeats templates      │ ① Data                │ Add <eos> boundaries, deduplicate rows
Boundary glitches (/svg>, 00") │ ③ Data Prep           │ Explicit end markers, stop at first </svg>
Loss plateaus                  │ ⑦ Loss                │ LR schedule, more diverse data, increase seq-len
Memorizes but won't generalize │ ① Data / ③ Data Prep  │ Train/val split, more SVG variety, check overfitting
Inference is slow              │ ⑤ Forward             │ Profile hotspots, check SIMD paths, quantize weights
Numerical drift (C vs PyTorch) │ ⑥ Backward            │ Cross-entropy parity test, check accumulation order

Controlled Recipe Sweep

Once correctness gates pass, systematically sweep these knobs (one at a time):

Dimension   │ Values to try   │ Measured by
────────────┼─────────────────┼──────────────────────────
vocab       │ 640 / 1024      │ roundtrip parity, loss
seq-len     │ 96 / 128        │ loss curve, SVG validity
epochs      │ 3 / 5           │ train vs val loss gap
lr          │ 2e-4 → 1e-4     │ convergence speed
grad-accum  │ 4               │ effective batch stability
temperature │ 0.7 / 0.9 / 1.0 │ output diversity
top-k       │ 10 / 40 / 80    │ coherence vs novelty
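"One knob at a time" is easy to enforce mechanically. A sketch that enumerates the discrete sweep dimensions (the lr schedule and fixed grad-accum are left out since they are not discrete alternatives; values mirror the table above):

```python
baseline = {"vocab": 640, "seq_len": 96, "epochs": 3,
            "temperature": 0.9, "top_k": 40}
sweep = {"vocab": [640, 1024], "seq_len": [96, 128], "epochs": [3, 5],
         "temperature": [0.7, 0.9, 1.0], "top_k": [10, 40, 80]}

def one_at_a_time(baseline, sweep):
    """Yield configs that change exactly one knob from the baseline."""
    for knob, values in sweep.items():
        for v in values:
            if v != baseline[knob]:
                yield {**baseline, knob: v}

runs = list(one_at_a_time(baseline, sweep))  # 7 runs, each one knob from baseline
```

Attribution stays clean this way: any metric shift in a run is explained by its single changed knob.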