Methodical DSL Training Methods
A spec is not just a dataset or a model run. It is a contract between an asset library,
a scene DSL, and a compiler. The model should learn structure against a stable contract, not absorb
random changes in vocabulary, content, and rendering all at once.
Across spec02 through spec16, broad spec-level progress has been real,
but once a line already has a strong raw winner, post-hoc raw repair rungs have regressed more often than they have helped.
The default policy is now: predict likely failure frontiers before training, encode them as clean positive coverage in the
initial curriculum, freeze strong raw winners, and route syntax-only residue to deterministic decode/repair instead of
reopening the same brittle CE surface.
gen1 method update: the first true broad-contract scene-DSL run fit the widened surface far more effectively than the older narrow spec/rung loop. That does not prove every scaling-law claim, but it does justify the current policy shift: broaden the clean compiler-backed corpus first, then harden the eval, then change model size only after the frozen broad contract has been tested under recombination pressure.
Why This Method Exists
This curriculum is not just local preference. It follows the same broad logic seen in scaling-law work, compute-optimal training research, and capability-predictability discussions: use smaller and cheaper experiments to validate the contract, the data, the tokenizer surface, and the evaluation loop before spending serious compute on larger runs.
Scaling Laws
OpenAI's scaling-laws work is the clearest public reference for why small and medium runs are useful first: they help forecast how loss changes with model size, data size, and compute before larger training runs.
Outcome: larger trends can be forecast from smaller controlled experiments.
Applied here: use cheap spec runs to validate DSL, tokenizer, and eval contracts before spending serious training compute.
Compute-Optimal Training
DeepMind's Chinchilla result is the classic argument against blindly scaling parameters without checking data and token budgets. It supports the habit of using many smaller runs to find the recipe before the expensive run.
Outcome: better results often come from the right model/data/token balance, not just a bigger model.
Applied here: compute packed token budgets and effective epochs per spec instead of copying old totals into new DSLs.
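The budget arithmetic above can be sketched directly. This is a minimal estimate, not the project's actual pipeline; the function names and the 0.85 fill target are assumptions taken from the fill-rate rule mentioned later in this document:

```python
def packed_token_budget(row_token_counts, ctx_len, fill_target=0.85):
    """Estimate packed windows and fill rate for a spec.

    row_token_counts: tokens per training row (prompt + output).
    ctx_len: packed window length, e.g. 512.
    fill_target: minimum acceptable fill rate for packed windows.
    """
    total_tokens = sum(row_token_counts)
    windows = max(1, -(-total_tokens // ctx_len))  # ceiling division
    fill_rate = total_tokens / (windows * ctx_len)
    return {
        "total_tokens": total_tokens,
        "windows": windows,
        "fill_rate": round(fill_rate, 3),
        "fill_ok": fill_rate >= fill_target,
    }

def effective_epochs(train_token_budget, total_tokens):
    """How many times the corpus is effectively seen under a fixed token budget."""
    return train_token_budget / total_tokens
```

Computing these per spec, instead of copying old totals into new DSLs, is exactly the Chinchilla-style habit the section describes.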
Predictability vs Surprise
Anthropic's paper is a useful reminder that broad trends can be predictable even while specific capabilities and outputs remain surprising. That is why this project insists on probes, canaries, and run reports rather than loss alone.
Outcome: aggregate scaling behavior can be smooth while concrete behaviors still surprise operators.
Applied here: every run needs split-aware probes, failure galleries, and promotion rules instead of trusting train loss.
Reference: Anthropic — Predictability and Surprise in Large Generative Models.
Large-Scale Recipe Practice
Meta's Llama 3 paper is a useful public example of a model family trained with scaling-law thinking across multiple sizes, rather than one single blind flagship run.
Outcome: a family of models is often used to refine the recipe, not just to ship different sizes.
Applied here: keep `spec` lines interpretable at small scale first, then widen to mixed training and larger CK-native runs later.
Project Curriculum
The CK-native version ladder explains how the current visual DSL work fits into the broader path from compiler-backed training to page DSLs, code/data/tool IR, and eventually mixed coding/scientific tasks.
Outcome: visual DSL work is treated as foundation-building, not the final destination.
Applied here: move from narrow-family SVG work toward page DSLs, code/data/tool IR, and broader mixed models.
Failure Visibility
The local intuition page explains why this method emphasizes canaries, parity gates, checkpoints, and structured post-run reports instead of random trial-and-error.
Outcome: visibility turns failed runs into reusable knowledge instead of wasted compute.
Applied here: every spec should leave behind a fixed HTML report with hypothesis, deltas, failures, and a decision.
Three Separate Contracts
1. Asset/Data Contract
The asset library is the visual reference set and the context library is the semantic reference set.
- Source from `docs/site/assets/*.svg` or, later, public asset libraries.
- Strip literal text into placeholders first.
- Track layout family, composition, theme, color system, shapes, connectors, charts, and text roles.
- Do not let real copy hide layout mistakes.
2. DSL Contract
The DSL is the model boundary. It should describe composition and roles, not raw SVG bookkeeping.
- Emit scene choices such as layout, theme, rail, density, gap, background, and components.
- Use placeholders like `heading_1`, `paragraph_1`, `badge_1`.
- Keep canonical token order and fixed arity.
- Remove inferable fields from the model surface.
3. Compiler Contract
The compiler owns geometry, defs, gradients, markers, wrapping, and final render semantics.
- Compile scene DSL into SVG with deterministic layout rules.
- Guarantee round-trip reconstruction before any serious training run.
- Keep rendering policy stable across runs unless the run is explicitly a compiler experiment.
- Treat compiler regressions separately from model regressions.
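The round-trip guarantee above can be enforced as a mechanical gate before any run launches. This is a sketch only: `compile_scene` and `parse_svg_to_scene` are hypothetical project hooks, and the gate logic is the generic part:

```python
def round_trip_gate(gold_scenes, compile_scene, parse_svg_to_scene):
    """Compiler round-trip gate: every gold scene must survive
    scene -> SVG -> scene reconstruction before training starts.

    compile_scene / parse_svg_to_scene are project-specific hooks
    (hypothetical names); the pass/fail accounting is generic.
    """
    failures = []
    for name, scene in gold_scenes.items():
        svg = compile_scene(scene)
        recovered = parse_svg_to_scene(svg)
        if recovered != scene:
            failures.append(name)
    return {"passed": not failures, "failures": failures}
```

Treating this as a hard preflight check keeps compiler regressions visibly separate from model regressions, as the contract requires.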
These are real assets from the project library — the same asset families the training pipeline learns to plan and compose.
What We Are Actually Training
In this line, the model is not being trained to be a free-form SVG artist. It is being trained to choose structured visual intent against a stable compiler contract.
Composition
Composition means the overall information shape: compare two systems, show a pipeline, build a poster stack, or arrange dashboard cards.
Examples: poster_stack, comparison_span_chart, pipeline_lane.
Layout Family
Layout is the reusable geometry family inside the composition. It decides where headers, panels, tables, bars, and connectors belong.
Examples: one tall poster, two compare panels, three-column dashboard, staged pipeline lane.
Theme And Tone
Theme is the visual language. Tone is the accent family inside that language.
Examples: theme:infra_dark, theme:paper_editorial, tone:amber, tone:green.
Content Binding
The content is not the same thing as the scene structure. The model should choose where content goes; a separate payload should provide what the text and values actually are.
This is why later lines should move from literal prose inside scene tokens to keyed refs plus content.json.
| Layer | Example | Who Owns It |
|---|---|---|
| Composition | `comparison_span_chart` | Model |
| Theme | `infra_dark` | Model |
| Tone | `amber` | Model |
| Content role | `@section_card.0.title` | Model chooses the slot; external content provides the value |
| Exact SVG path, gradient, shadow, marker, wrap | `<path ...>`, `<linearGradient ...>` | Compiler |
Worked Example
A concrete keyed-scene example should look like this:
Request prompt
[task:svg]
[layout:poster_stack]
[topic:memory_reality]
[theme:infra_dark]
[tone:green]
[density:compact]
[OUT]
Scene DSL emitted by the model
[scene]
[canvas:tall]
[layout:poster_stack]
[theme:infra_dark]
[tone:green]
[frame:card]
[density:compact]
[inset:md]
[gap:sm]
[hero:center]
[columns:1]
[emphasis:top]
[rail:accent]
[background:rings]
[connector:line]
[topic:memory_reality]
[header_band:@header_band.0.kicker|@header_band.0.headline|@header_band.0.subtitle]
[section_card:@section_card.0.title|@section_card.0.value|@section_card.0.caption|variant=hero|accent=amber]
[compare_bar:@compare_bar.0.label|@compare_bar.0.value|@compare_bar.0.caption|accent=red]
[table_row:@table_row.0.column_1|@table_row.0.column_2|@table_row.0.column_3|state=highlight|accent=amber]
[/scene]
content.json bound by the compiler
{
"header_band": [
{
"kicker": "First Principle",
"headline": "LLM Memory Reality",
"subtitle": "The math marketing will not show you"
}
],
"section_card": [
{
"title": "Memory Capacity",
"value": "25x more memory capacity",
"caption": "Capacity sets the real context boundary"
}
],
"compare_bar": [
{
"label": "GPU VRAM",
"value": "80 GB",
"caption": "single device"
}
],
"table_row": [
{
"column_1": "128K",
"column_2": "3x GPUs",
"column_3": "Fits"
}
]
}
The training target here is the scene decision, not the literal prose. The compiler can now render the same scene with different content payloads without retraining the model.
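The keyed-ref binding in the worked example can be sketched as a small resolver. The `@component.index.field` grammar comes from the scene above; the function name itself is an assumption:

```python
def resolve_ref(ref, content):
    """Resolve a keyed content ref like '@section_card.0.title'
    against a content.json-style payload dict."""
    if not ref.startswith("@"):
        return ref  # literal value, not a ref
    component, index, field = ref[1:].split(".")
    return content[component][int(index)][field]

# Payload shaped like the content.json in the worked example.
content = {
    "section_card": [
        {"title": "Memory Capacity",
         "value": "25x more memory capacity",
         "caption": "Capacity sets the real context boundary"}
    ],
}
```

Because binding is this mechanical, swapping in a different payload dict re-renders the same scene with no model involvement at all.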
Topology Refs vs Content Refs
Some richer layout families need more than content slots. Trees, maps, and graph-style layouts also need stable structural handles so the compiler knows what connects to what.
| Ref Type | Example | Used For |
|---|---|---|
| Topology ref | `[node_id:start]`, `[from_ref:start]`, `[to_ref:l2]` | Identity, routing, graph layout, branch placement, segment attachment |
| Content ref | `[title_ref:nodes.start.title]`, `[branch_label_ref:edges.start_l2]` | Visible text and numeric payload from `content.json` |
[decision_node] [node_id:start] [title_ref:nodes.start.title] [/decision_node]
[decision_edge] [from_ref:start] [to_ref:l2] [branch_label_ref:edges.start_l2] [/decision_edge]
This dual-reference pattern is not a bug. It is the right compiler boundary for tree, map, and topology layouts.
Why Explicit Tokens Instead Of Generic BPE Right Now?
The short answer is: because this output language is small, formal, and brittle, and the current training line is optimizing for control and interpretability before token-efficiency.
Why not plain BPE first?
If the tokenizer is free to break tags arbitrarily, a tiny model must learn both the scene language and the spelling of the scene language at the same time.
What explicit reserved tokens buy
They make the contract visible. A probe miss can be read as the wrong scene decision instead of arbitrary subword drift across brackets, separators, and role markers.
Why this is not a universal rule
Frontier models absolutely use learned tokenizers like BPE or SentencePiece at base-model scale. The explicit token surface here is a local engineering choice for a narrow formal DSL.
What should happen later
Once the DSL is stable, revisit the tokenizer boundary as its own spec question: smaller reserved surface, learned merges, or mixed strategy. Do that after the contract is proven, not before.
A useful rule is:
Early line: prefer explicit structural tokens so the contract is obvious. Later line: shrink or relax the reserved surface only after the model, compiler, and evals agree on the language.
One mistake to avoid is treating a whole component row as the permanent atomic token surface.
| Shape | What It Buys | What It Risks |
|---|---|---|
| `[compare_bar:@compare_bar.0.label\|@compare_bar.0.value\|@compare_bar.0.caption\|accent=amber\|note=@compare_bar.0.note]` | Short sequences and strong format control. | Too brittle if kept forever. Similar component variants do not share enough structure, so the vocabulary becomes overly specific. |
| `[compare_bar][label_ref:compare_bar.0.label][value_ref:compare_bar.0.value][caption_ref:compare_bar.0.caption][accent:amber][note_ref:compare_bar.0.note][/compare_bar]` | More compositional reuse across fields and component variants. | Longer sequences and a slightly harder grammar, but much better long-term generalization. |
The first shape can be acceptable as a transition step. It should not be treated as the final production tokenizer boundary.
A related failure is packing rows like `[compare_bar:@compare_bar.0.label|@compare_bar.0.value|@compare_bar.0.caption|accent=amber]` back into
single atomic tokens. This is not training-token packing in the data pipeline. It is tokenizer-boundary
packing, and it makes the DSL brittle.
| Failure Mode | Why It Is Wrong | Corrective Direction |
|---|---|---|
| Whole bracketed component rows become reserved tokens because the tokenizer harvests every whitespace-delimited `[...]` chunk. | The model no longer learns a compositional scene language. It learns oversized one-off control ids. | Keep only small structural tokens reserved and stop reserving payload-bearing ref tokens like `@compare_bar.0.label`. |
| Dense packing is confused with normal training-token packing. | The packed-window step is not the issue here. The issue is the token boundary chosen before packing. | Debug tokenizer boundaries first, then debug total-token budgets and effective epochs second. |
| Keyed refs still sit inside one monolithic component token. | The content moved out of literal prose, but the structure is still too atomic. | Move to block-style components such as `[compare_bar]`, field-role tokens, and separate ref tokens or unreserved ref payloads. |
Bad tokenizer boundary:
[compare_bar:@compare_bar.0.label|@compare_bar.0.value|@compare_bar.0.caption|accent=amber]
Better boundary:
[compare_bar] [field:label] [@compare_bar.0.label] [field:value] [@compare_bar.0.value] [field:caption] [@compare_bar.0.caption] [accent:amber] [/compare_bar]
The rule is simple: reserve structure, not payload. If a token contains too much scene-specific data or too many keyed refs, it is probably the wrong token boundary.
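The "reserve structure, not payload" rule can itself be checked mechanically. This sketch assumes keyed refs are marked with `@` and structural tokens are bare bracketed identifiers, as in the examples above; the regex is an illustrative approximation, not the project's actual reservation logic:

```python
import re

# Purely structural tokens: [name], [/name], or [name:short_value].
STRUCTURAL = re.compile(r"^\[/?[a-z_]+(:[a-z_0-9]+)?\]$")

def is_reservable(token):
    """A token may be reserved only if it is purely structural:
    no keyed '@' refs and no payload-bearing field data."""
    if "@" in token:
        return False  # payload-bearing ref: never reserve
    return bool(STRUCTURAL.match(token))
```

Running a filter like this over a candidate vocabulary catches whole-component-row tokens before they quietly become permanent.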
Spec Lifecycle
State the hypothesis
Write down exactly what the next spec is trying to prove: richer DSL, better compiler, stronger asset family, or narrower repair.
Snapshot the assets
Copy the chosen asset family into the cache-backed run workspace so the training contract is frozen and reproducible.
Replace literal text
Strip shipped copy and replace it with placeholders. Train structure first, then bind content from a separate library.
Extract vocabulary
Build the scene vocabulary from layouts, components, composition patterns, style families, and semantic text roles.
Prove the compiler
Compile the DSL with dummy text and verify that the output is acceptably close to the reference asset family before training.
Freeze a gold asset gate
Require a real gold pack, ideally 5 to 10 assets spanning comparison, poster, table, and pipeline families. Treat compiler fidelity as a gate, not a note.
Freeze the tokenizer surface
Tokenize the DSL vocabulary plus placeholder text roles. Do not let arbitrary literal text dominate the tokenizer early, and do not keep whole component rows as permanent atomic tokens once the structure-content split is proven.
Train with one primary axis
Change one main thing per run: DSL, compiler, data mix, tokenizer budget, or capacity. Hold the rest stable.
Write the decision
After every run, decide: promote, reject, repair, or branch. Every run ends with a next action, not just a score.
Operational Training Process
The practical mistake is to let a rung answer too many questions at once. The better process is narrower: freeze measurement, name one failure class, teach family structure, and let validation close the last deterministic syntax gaps.
Freeze measurement first
Before changing the curriculum, freeze the probe prompt lists and keep them balanced across cases, forms, and prompt surfaces.
Rule: if the probe contract changes, version it. Do not compare old and new runs as if they were the same measurement.
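Versioning the probe contract can be as simple as hashing the frozen prompt list, so any drift changes the version id. This is a minimal sketch; how the id is stored and compared across runs is left to the project:

```python
import hashlib
import json

def probe_contract_version(probe_prompts):
    """Derive a stable version id from the frozen probe prompt list.

    Sorting makes the id order-insensitive; any added, removed, or
    edited prompt changes the id, so old and new runs can never be
    silently compared under different measurements."""
    canonical = json.dumps(sorted(probe_prompts), ensure_ascii=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]
```

Recording this id in every run report makes "same measurement" a checkable claim rather than an assumption.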
One rung, one question
Every rung should target one named failure class: missing terminal tail, wrong family choice, wrong style bundle, stop-boundary spill, or broken exactness.
Rule: if a rung changes grammar, probe shape, and repair mix together, it is not a clean experiment.
Train structure, not boilerplate
The model should learn the family program: layout, components, refs, style controls, and counts. Deterministic tails and canonical ordering should be pushed into the compiler or repair layer whenever possible.
Rule: do not waste multiple rungs trying to make SGD rediscover a fixed footer, terminal block, or closing sequence.
Add repair early
Validation and repair are part of the system, not an admission of failure. Use them as soon as the family structure is mostly right and only a small deterministic residual remains.
Rule: always report raw and repaired probe scores separately.
| Process gate | Required rule | Why it exists |
|---|---|---|
| Probe freeze | Keep balanced prompt lists on disk and version the probe contract when they change. | Otherwise a probe bug or skewed selector can make a healthy line look broken. |
| Rung scope | Declare one failure class target and one allowed intervention family per rung. | Clean attribution matters more than squeezing one more point from a noisy run. |
| Repair budget | Keep broad meta-repair rows capped. Default budget: 10-15% unless the rung is explicitly a syntax rung. | Too much repair prose teaches the warning language itself instead of the underlying scene contract. |
| Compiler-first closure | If a missing region is mechanically implied by form and counts, prefer validation or repair before adding another broad training rung. | This keeps the model focused on family structure rather than deterministic boilerplate. |
| Raw vs repaired reporting | Publish both raw probe metrics and repaired probe metrics. | This separates training quality from system quality and prevents false conclusions. |
| Seed and probe integrity | Fail preflight if seed staging, tokenizer sidecars, or probe-path hashes drift unexpectedly. | Many apparent curriculum regressions are really measurement or staging bugs. |
A useful default wording for the rung brief is: "Fix one named failure class while preserving the last good baseline on the frozen probe." That sentence is narrow enough to govern the materializer, the probe, and the post-run decision.
Freeze the raw winner
Keep the current raw champion as the comparison anchor. New runs compete against that run, not against train loss or operator intuition.
Rule: promotion gates should reference one frozen probe report and one frozen best rung.
Pilot before full rung
If a repair idea touches shared-family behavior, run a small pilot first. Scale the token budget down and treat the result as a gate, not an automatic successor rung.
Rule: a full rung stays blocked until the pilot improves the target family with no family or hidden-split regression.
Decode first after strong raw runs
Once the raw line is mostly correct, push stop-boundary, prompt-spill, and deterministic cleanup into decode, validation, and repair before spending another broad training cycle.
Rule: if the miss set is mostly repairable, train less and validate more.
Ban literal prompt junk repair rows
Do not teach the model the exact wrapper junk you want it to avoid. That turns the warning language itself into the target distribution.
Rule: use control-agnostic clean-stop prompts rather than rows that spell out [OUT], duplicated prompt blocks, or schema-noise tokens.
Context Budget Is A Gate
Do not guess the next context length. Measure the gold scenes with the current tokenizer family before training.
| Rule | Interpretation |
|---|---|
| `p95(prompt + output) < 400` | ctx=512 is probably still enough. |
| `400-600` | Move to ctx=768 only if the gold pack actually lives here. |
| `> 768` | Use ctx=1024 only when the gold scenes really need it. |
The important lesson from the keyed-scene line was that a layout can fail because of decode or context budget even when the model already learned the structure correctly. Budget diagnostics should therefore be part of the probe contract, not an afterthought.
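The gate above can be computed directly from measured token lengths. A sketch, assuming the nearest-rank p95 definition; the table's gap between 600 and 768 is resolved conservatively upward here, which is an assumption, not a rule from the source:

```python
import math

def p95(values):
    """Nearest-rank 95th percentile of a list of token lengths."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

def pick_context_length(prompt_plus_output_lens):
    """Map measured p95(prompt + output) to a context length,
    following the gate table above."""
    cut = p95(prompt_plus_output_lens)
    if cut < 400:
        return 512
    if cut <= 600:
        return 768
    return 1024
```

Measuring the gold scenes with the current tokenizer family and feeding the lengths through a gate like this replaces guesswork with a checkable number.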
How To Build Frontier-Style Intuition
No serious engineer would laugh at the core idea here. The separation of structure, content, compiler, and eval is exactly the kind of thinking that makes systems reliable. What strong teams would challenge is not the direction, but the discipline of the method.
Good instinct
Separating asset library, scene DSL, content, compiler, and probe contracts is a strong instinct. It reduces ambiguity and makes failures legible.
What frontier teams would push harder on
Cleaner ablations, stricter gates, more automatic validation, less manual drift between specs, and fewer simultaneous changes per run.
What to copy from them
Treat every run like an experiment with a written hypothesis, a fixed baseline, a canary, a non-regression gate, and a clear promotion or rejection rule.
What not to imitate blindly
Do not jump to giant-scale training intuition too early. On narrow compiler-backed tasks, clarity of interface and eval quality matter more than trying to mimic a frontier pretraining stack.
| If a frontier engineer reviewed this | Likely reaction | What to improve |
|---|---|---|
| Separating DSL and content | Good systems instinct | Push it fully through the dataset, compiler, and probe path |
| Compiler-first validation | Correct | Keep the gold asset round-trip gate strict |
| Custom explicit tokens | Reasonable for a narrow formal language | Revisit only after the DSL stabilizes |
| Many specs in sequence | Fine if each run is interpretable | Make the hypothesis and held-constant set even more explicit |
| Weak or drifting evals | Unacceptable | Keep the report contract and canary gate mandatory |
One Primary Axis Per Run
The model can only teach intuition if each run is interpretable. Every run should declare one primary axis and list everything held constant.
| Primary Axis | What Changes | What Stays Fixed | Question Answered |
|---|---|---|---|
| DSL run | Scene grammar, token order, component vocabulary, canonicalization rules | Asset set, compiler, tokenizer family, model size, training budgets | Did the representation become easier and cleaner to learn? |
| Compiler run | Layout engine, gradients, wrappers, markers, defs, text wrapping, style packs | DSL, prompts, tokenizer, model config | Did the rendering surface become richer without widening the model surface? |
| Data run | Asset coverage, placeholder catalog, contrast pairs, holdout balance, context library | DSL, compiler, model size | Is the model underfit on coverage, or was the contract itself weak? |
| Budget run | Packed token budgets, effective epochs, stage mix | DSL, compiler, dataset rows, model config | Was failure caused by under-training or over-training rather than representation? |
| Capacity run | Layers, embed dim, hidden dim, context, optimizer scaling | DSL, compiler, dataset mix, evaluation contract | Has the representation stabilized enough that more capacity is the next lever? |
| Repair run | Narrow contrast slices for specific misses | Everything else, especially the winning baseline | Can a specific failure cluster be closed without global regression? |
A run is only methodical when it can be summarized as one sentence: “This run changed X, held Y constant, and answered Z.”
Restart-Safe Agent Handoff
Yes, this page is the right umbrella documentation to keep around for restarts. It now has a companion handoff file so a future agent does not need to reconstruct the current plan from scattered reports or stale run logs.
Why this helps
It preserves the method baseline after a reboot: keep the last good spec[x] rung[y] as the training-method champion, move to a compiler-first successor DSL, and add one new family at a time.
What to copy after restart
Use the markdown handoff file below as the first message to the next agent. It points directly at the live contract and the autopilot policy.
What it prevents
It prevents blind resumption of stale runs, tokenizer guesswork, and launching training before the compiler and tokenizer contract are ready.
| Reference | Purpose |
|---|---|
| `docs/site/_pages/spec-training-method.html` | Human-readable umbrella method page. |
| `docs/site/_pages/agent-handoff-template.md` | Copy-paste restart prompt for future agents. |
| `version/v7/reports/SPEC[X]_EXECUTION_CONTRACT_YYYY-MM-DD.md` | Example shape of the current internal execution contract. |
| `version/v7/reports/spec_family_autopilot_policy.json` | Machine-readable rule for when autonomy is allowed and when it must stop. |
Suggested restart prompt:

Read these first and treat them as the live source of truth:

1. docs/site/_pages/spec-training-method.html
2. docs/site/_pages/agent-handoff-template.md
3. version/v7/reports/SPEC[X]_EXECUTION_CONTRACT_YYYY-MM-DD.md
4. version/v7/reports/spec_family_autopilot_policy.json

Current intent:

- keep the last good spec[x] rung[y] as the training-method baseline, not the tokenizer ceiling
- build the successor DSL/compiler/tokenizer path explicitly
- keep payload/content external to the model DSL
- add one capability at a time
- make the next family[z] the active family-construction line
- do not launch training until the compiler, tokenizer corpus, launcher, and rung policy checklist are complete

Then continue from the current execution contract checklist and update the repo artifacts before starting any background training or autopilot.
Recommended Asset-to-DSL Workflow for SVG
Asset intake
Track every reference SVG in the public asset set and copy it into the run-local cache workspace.
- Keep original assets immutable.
- Record family, source file, and intended holdout split.
- Use run-local copies for extraction and experiments.
Dummy-text normalization
Replace all literal text with placeholders before vocabulary design.
- Use placeholders such as `heading_1`, `heading_2`, `paragraph_1`, `callout_1`.
- Keep text roles, not the original prose.
- Make layout errors visible by removing content noise.
Vocabulary extraction
Infer the reusable scene vocabulary from the normalized assets.
- layout family
- component families such as panels, charts, connectors, legend, band, poster stack
- theme, tone, spacing, background motif, frame, emphasis, hero alignment
Compiler validation
Compile the new DSL back into placeholder SVG and compare it against the normalized asset.
- Require deterministic output.
- Require acceptable visual round-trip on the gold asset subset.
- Reject the spec if the compiler cannot express the asset family cleanly.
Training corpus
Train on the DSL and placeholder roles, not on raw shipped copy.
- Tokenize scene vocabulary plus placeholder text roles.
- Add content/context libraries later as a separate layer.
- Keep placeholder slot resolution external and deterministic when possible.
Probe discipline
Score both DSL exactness and compiled render exactness.
- A wrong DSL with right render means hygiene is bad but the contract may still be close.
- A right DSL with wrong render means the compiler is the problem.
- Do not read train loss as the final answer.
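The dual-scoring discipline above can be sketched as one scoring pass that keeps DSL exactness and render exactness separate. `compile_scene` is a hypothetical project hook; the assumption here is that it raises on DSL the compiler cannot express:

```python
def probe_scores(cases, compile_scene):
    """Score DSL exactness and compiled-render exactness separately,
    so a compiler bug is never misread as a model regression.

    cases: dicts with 'predicted_dsl' and 'expected_dsl'.
    compile_scene: project compile hook (hypothetical name); it is
    assumed to raise on invalid DSL, which counts as render failure.
    """
    dsl_exact = render_ok = 0
    for case in cases:
        if case["predicted_dsl"] == case["expected_dsl"]:
            dsl_exact += 1
        try:
            compile_scene(case["predicted_dsl"])
            render_ok += 1
        except Exception:
            pass
    n = len(cases)
    return {"dsl_exact": dsl_exact / n, "renderable": render_ok / n}
```

Reading the two numbers side by side gives exactly the diagnosis the bullets describe: wrong DSL with right render points at hygiene, right DSL with wrong render points at the compiler.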
HTML Report Contract After Every Spec or Run
Every run should leave behind a readable HTML report that answers the same questions in the same order. The point is not decoration. The point is to make the next decision obvious.
| Report Section | Question Answered | Recommended Content |
|---|---|---|
| Run card | What is this run? | Spec name, run id, date, model config, dataset version, primary axis changed, baseline compared against. |
| Hypothesis | Why was the run executed? | One paragraph stating what changed, why it should help, and what would count as success or rejection. |
| Held constant | What was intentionally not changed? | DSL/compiler/data/tokenizer/capacity matrix with one primary axis highlighted. |
| Data + tokenizer | What did the model actually see? | Row counts, packed token counts, effective epochs, tokenizer size, reserved tokens, placeholder coverage. |
| Compiler gate | Could the compiler express the target family? | Round-trip gallery on gold assets, render diffs, unsupported primitives, fallback paths. |
| Probe summary | Did the run work? | Exact, renderable, materialized exact, split metrics, per-layout metrics, delta vs baseline. |
| Product scorecard | Did it get closer to usable output? | Content-binding success, gold-asset parity status, family non-regression, and whether the output is visibly in-family. |
| Failure gallery | Where did it fail? | Show the remaining misses with prompt, expected DSL, actual DSL, content JSON or refs, compiled SVG, and a short diagnosis. |
| Lessons | What was learned? | Two or three concrete takeaways about representation, compiler, data, or budgeting. |
| Decision | What happens next? | Promote, reject, repair, or branch. Include the next run shape and the intended axis change. |
Beautiful means legible
A good report should show the metric headline, the delta from baseline, and the actual failure cards above the fold.
Use color sparingly: green for proven gain, amber for ambiguity, red for rejection, blue for structural notes.
Show the evidence
Every claim should anchor to an artifact: probe JSON, tested prompts report, compiler validation gallery, dataset profile, tokenizer stats.
End with a decision
The report is incomplete if it only says what happened. It must say what the next run should do and what should stay frozen.
Run summary template:

Spec:
Run:
Primary axis changed:
Held constant:
Baseline:
Hypothesis:
Data delta:
DSL delta:
Compiler delta:
Tokenizer delta:
Budget delta:
Capacity delta:
Preflight result:
Compiler round-trip result:
Probe result:
Failure clusters:
Decision:
Next run:
What The Experiments Proved
These conclusions did not come from theory alone. They came from repeated runs that failed, partially worked, regressed, recovered, and exposed the actual pressure points in the system. Across the full run history, the stable wins came from better contracts and better full curricula. Narrow raw repair churn became less reliable once a line was already mostly learned.
1. Evaluation bugs can masquerade as model collapse
We learned that decode/stop-marker mistakes and bad probe contracts can make a healthy model look broken. Evaluation infrastructure is part of the system, not an afterthought.
2. Freeze the winning baseline
Once a run becomes strong on the current contract, freeze it. Do not erase the baseline with broad speculative follow-ups.
3. Front-load predictable failures
Family drift, sibling-form confusion, style attractors, topology/count defaults, and stop leakage are usually visible in the contract before the first run. Teach those boundaries as clean positive coverage from day one instead of relying on post-hoc warning-language rows.
4. Use decode/repair before raw churn
When the remaining miss set is mostly syntax hygiene or mechanically repairable near-misses, deterministic decode/repair is the default path. Reopen raw training only for a true new branch such as a redesigned curriculum or a capacity test.
5. Separate asset, DSL, compiler, and content work
Runs became interpretable only after we started isolating which layer changed: richer scene vocabulary, richer compiler, broader asset family, or better structure-content separation.
6. Token granularity is its own spec axis
Whole-component tokens were useful as training wheels, but they should not remain the long-term boundary. Once keyed structure works, shrinking token granularity deserves a dedicated run.
How We Reached The Current Design
| Observed pattern | What it taught us | Design conclusion |
|---|---|---|
| Raw or flat structural targets could render, but were brittle and hard to steer. | The model was spending too much capacity on low-level surface decisions. | Move upward into a scene DSL. |
| A richer compiler improved visual output without requiring the model to emit raw gradients, paths, or markers. | Beauty and fidelity can often be improved at the compiler layer first. | Keep low-level SVG machinery compiler-owned. |
| Some runs got stronger exact-match by baking visible text directly into component tokens. | That can help short-term contract accuracy, but it mixes structure and content in the wrong place. | Separate scene structure from content payloads. |
| Narrow fixes sometimes improved the target slice and damaged solved slices elsewhere. | Repair runs need strong anchor replay and non-regression checks by family, not just aggregate score. | Gate every repair against the frozen baseline. |
| Loss often looked healthy even when contract behavior regressed. | Train loss is not the product metric. | Use probe, canary, compiler round-trip, and failure galleries as the real decision surface. |
How To Decide What Is Next
Diagnostic Matrix
After a run finishes, use the probe report, loss curve, and per-layout breakdown to decide what to do next. Do not guess — match the symptom pattern to the action.
| Symptom | Likely Cause | Action | Example |
|---|---|---|---|
| Low exact across all layouts | Undertrained (too few epochs or too little curriculum pressure) | Raise midtrain epochs from 1 to 2–3. Keep everything else frozen. | Example: 2/35 exact after 1 midtrain epoch, with all 5 layouts weak |
| Low exact on some layouts, others strong | Curriculum imbalance — weak families got less edit pressure | Add targeted negative/repair rows for the weak families. Anchor the strong families with replay. | Example: one over-weighted slice caused cascading failure across the weak families |
| High exact match, but rendered_svg_ok: null | Probe/compiler plumbing bug — model output is right but the render path fails silently | Fix the probe pipeline first. Do not retrain until the eval is trustworthy. | Example: exact=true but rendered_svg_ok=null because the accounting path was wrong |
| Loss drops below 0.1 but exact match stays low | The model memorized training distribution but not the DSL contract | Check tokenizer coverage, holdout prompt diversity, and whether edits cover all layout × topic combinations. | Example: loss converged but capability stayed near 0% because the representation was wrong |
| Loss stuck above 0.5 after full midtrain | Capacity wall, LR too low, or data packing issue | Check grad norms for saturation. Try LR sweep. Verify token packing fill rate (>0.85). | Example: 81% loss reduction but final loss stayed at 0.905, signaling a capacity or token-budget wall |
| Good exact on train prompts, bad on dev/test | Overfitting to train distribution | Widen holdout prompt coverage. Add more topic × layout combinations. Reduce epoch count if loss is very low. | Example: 100% train, 91.7% dev, 83.3% test, showing a visible but still manageable overfit gradient |
| Good exact on one spec, regression on next | DSL contract changed silently, or tokenizer/compiler mismatch | Diff the DSL grammar, tokenizer vocab, and compiler output between specs. Use canary probes from the previous best. | Example: new layout families were introduced and legacy families still passed, but the tokenizer/compiler contract drifted |
| 100% syntax valid, 0% semantic match | Model learned token order but not composition choices | Richer curriculum with edit pairs (wrong→right), not just direct generation examples. | Example: 100% syntactically valid DSL, 0% correct scene composition |
| Stage labels wrong in loss curve | Telemetry bookkeeping bug in pipeline — stage transitions not recorded | Fix the pipeline stage labeling. Do not trust source_stage field until verified. Use step counts to infer boundaries. | Example: all 665 steps were labeled pretrain even though midtrain steps were present |
| training_plan.json says "active" after run completed | Pipeline did not finalize stage status | Fix the plan update logic. Probe report should not depend on plan status for results. | Example: training_plan.json still reported midtrain after the run finished |
| One layout perfect until truncated — missing closing tags | Output exceeds context window — model learned the content but runs out of tokens | Measure output token count vs context length. Either compress the DSL (remove inferable fields) or raise context. Do not add more training data — the model already knows it. | Example: one family matched 2208/2606 characters perfectly, then truncated at the same point on every failure |
| Gold mapping token budget far exceeds context window | DSL is too verbose — carrying fields the compiler could infer | Run a DSL compression pass before defining the tokenizer. Target: output tokens < 80% of context length. Measure with a gold-budget report tool. | Example gold output: 2793 tokens at ctx = 512, or 5.5x over budget |
Severity key: Critical = fix before retraining; Warning = likely cause of weak results; Info = plumbing or telemetry issue.
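The symptom-to-action matching above can be sketched as a small triage function. This is an illustrative sketch only: the field names (`exact_rate`, `per_layout`, `loss_final`, `truncation_rate`, `rendered_svg_ok`) are assumptions, not the project's real probe-report schema.

```python
def triage(report: dict) -> str:
    """Map a probe report to the most likely next action (illustrative thresholds)."""
    per_layout = report.get("per_layout", {})
    exact = report.get("exact_rate", 0.0)
    loss = report.get("loss_final", 1.0)

    # Plumbing first: a broken eval makes every other metric untrustworthy.
    if report.get("rendered_svg_ok") is None:
        return "fix probe pipeline before retraining"
    # Budget failures masquerade as learning failures; separate them early.
    if report.get("truncation_rate", 0.0) > 0.1:
        return "measure output tokens vs context; compress DSL or raise context"
    # All layouts weak -> undertraining, not imbalance.
    if exact < 0.2 and all(v < 0.2 for v in per_layout.values()):
        return "undertrained: raise midtrain epochs, keep everything else frozen"
    # Some layouts strong, some collapsed -> curriculum imbalance.
    if per_layout and min(per_layout.values()) < 0.2 <= max(per_layout.values()):
        return "curriculum imbalance: targeted repair rows + anchor replay"
    # Converged loss with low exact -> representation/data problem.
    if loss < 0.1 and exact < 0.2:
        return "representation problem: check tokenizer coverage and holdout diversity"
    return "no matching symptom: inspect failure gallery manually"
```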
Reading the Probe Report
The probe report is the real decision surface, not the loss curve. A healthy loss curve with bad probe results means the representation or curriculum is wrong. A noisy loss curve with good probe results means the model learned despite training-signal noise.
| Metric | What It Tells You | Healthy Range | Red Flag |
|---|---|---|---|
| exact_match | Model output matches expected DSL token-for-token | >70% overall, >50% per layout | <20% after full curriculum = representation or data problem |
| renderable | Output parses and compiles to valid SVG | 100% (compiler contract must not break) | <90% = compiler or tokenizer bug, not model bug |
| rendered_svg_ok | Compiled SVG matches expected visual | Should track exact_match closely | null = probe pipeline broken, fix before interpreting other metrics |
| Per-layout breakdown | Which families the model learned vs. missed | All families above 50% | One family at 0% with others at 80%+ = curriculum imbalance |
| Train vs dev vs test split | Generalization gradient | Train≥dev≥test, gap <20pp | Train 100%, test 0% = severe overfitting |
Reading the Loss Curve
| Pattern | What It Means | Action |
|---|---|---|
| Smooth descent, final <0.05 | Model is memorizing the training set well | Check probe to confirm generalization, not just memorization |
| Descent plateaus at 0.1–0.3 | Model hit a capacity or data ceiling | If probe is also stuck, try more epochs, richer edits, or LR restart |
| Sudden spike at stage boundary | New data distribution — expected at pretrain→midtrain transition | Not a bug. Watch if it recovers within 50–100 steps |
| CK loss ≠ PT loss | Numerical parity broken | Stop training. Fix kernel parity before continuing. This should never happen. |
| Grad norms spike or collapse | Training instability — exploding or vanishing gradients | Check max_grad_norm clipping. If persistent, reduce LR or check data packing for degenerate sequences. |
| Loss oscillates without converging | LR too high, or data has conflicting signals (e.g., same prompt → different targets) | Verify dataset uniqueness. Try lower LR. Check for duplicate or contradictory training rows. |
Decision Actions
Use deterministic decode/repair first when the contract is almost closed
If the current representation is strong and the remaining misses are syntax-only or mechanically repairable, keep the raw winner frozen and improve the decode/repair layer first. Do not default to another raw repair rung on the same brittle surface.
Use a compiler step when visuals are still weak
If the generated infographic is structurally correct but still visually simple, improve the compiler and asset vocabulary before spending more model compute.
Use a tokenizer step when outputs are too brittle
If the model only behaves well when whole component rows are reserved atomically, run a token-granularity spec next instead of adding more layouts or model size.
Use a DSL step when the representation is still too literal or too loose
If the model surface still carries inferable fields, arbitrary prose, or non-canonical structure, fix the DSL before scaling the model.
Use a structure-content split when generalization is the problem
If the scene language is still carrying visible prose or one-off values, branch to keyed structure plus separate content.json. The model should emit refs like @section_card.0.title, not literal asset prose inside the component token.
Use clean contrast coverage when planning is still wrong
If a failure is truly still in training, add compiler-valid contrast sets and boundary cases, not warning-language rows about [OUT], wrappers, or singleton tags. The curriculum should teach the right choice, not narrate what not to do.
Promotion Rules
- Promote a run only if it improves the primary target metric without regressing the already-solved slices.
- Reject a run if it changes too many axes to interpret, even if the final number looks better.
- When a frozen raw winner exists, default to decode/repair and block more same-surface raw repair rungs until the next training idea is a true new branch.
- Branch a new spec only when the current spec ceiling is real and well measured.
- Make compiler fidelity a hard gate: use a gold pack, not just a couple of hand-picked examples.
- Scale model size only after the DSL and compiler contracts are stable.
- Keep run artifacts in ~/.cache/ck-engine-v7/models/train/; keep curated project docs in the repo.
- Measure output token budget against context window before training. If the longest gold mapping exceeds 80% of context length, compress the DSL or raise context first.
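The promotion rules above can be condensed into a single gate. A minimal sketch, assuming hypothetical report fields `primary_metric` and `per_family`; the tolerance parameter is an illustrative knob, not a project value.

```python
def promote(candidate: dict, baseline: dict,
            solved_families: list[str], regression_tol: float = 0.0) -> bool:
    """Accept a run only if the primary metric improves AND no already-solved
    family regresses past the tolerance (non-regression is a hard gate)."""
    if candidate["primary_metric"] <= baseline["primary_metric"]:
        return False  # no improvement on the primary target: reject
    for fam in solved_families:
        if candidate["per_family"][fam] + regression_tol < baseline["per_family"][fam]:
            return False  # regression on a solved slice blocks promotion
    return True
```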
Spec Versus Run
The project uses spec[x] and r[y] for a reason. A spec is a new training contract. A run revision is a controlled iteration inside the same contract.
| Level | Meaning | What May Change | What Should Stay Fixed |
|---|---|---|---|
| spec[x] | A new experiment question and a new learning boundary | DSL, prompt surface, compiler ownership, token granularity, output family, structure/content split | The project goal and the evaluation discipline |
| r[y] | A revision inside one spec | Repair rows, replay ratios, token budgets, epochs, balance, decode hygiene, capacity after the contract is stable | The DSL contract, the probe target, and the main question |
Start a new spec when the current ceiling is real and measured. Stay inside the same spec when the failure is narrow and the representation still matches the intended product boundary.
Working Definitions
| Term | Meaning Here | Example |
|---|---|---|
| Trainable contract | The exact input/output behavior the model is being taught, with a form that can be evaluated reliably. | prompt -> scene.dsl with exact-match and render checks. |
| Representation | The model-facing format for the task: the grammar, fields, token shapes, and structure it must predict. | Flat SVG atoms versus a scene DSL with layout, theme, and component blocks. |
| Data | The actual examples used to teach the contract, including prompts, targets, repairs, anchors, and holdouts. | Direct gold rows, topic-swap rows, close-tag continuation rows, or negative contrast rows. |
| Compiler | The deterministic layer that turns structured outputs into the final product artifact. | scene.dsl + content.json -> SVG. |
| Eval | The measurement layer that decides whether the run actually improved the desired behavior. | Exact match, renderability, materialized exactness, and per-family breakdowns. |
| Canary | A tiny, cheap, high-signal test slice run before a full job to catch obvious format or compiler failures. | 12 prompt cases that must parse and render before the real training run starts. |
| Eval gate | A pass/fail condition that must hold before a run is accepted or promoted. | 100% renderability on the canary or no regression on solved families. |
| Ablation | A controlled experiment where one factor is changed while the rest stays fixed, so the effect can be interpreted. | Keep the DSL fixed and change only midtrain epochs from 1 to 3. |
| Data mixture control | Deliberately choosing how much of each example type the model sees. | 40% anchor replay, 40% direct rows, 20% repair rows. |
| Failure taxonomy | A named breakdown of failure classes so the next action is chosen from evidence instead of guesswork. | Grammar failure versus compiler failure versus capacity failure. |
| Repair run | A narrow run that targets a known failure slice without redefining the overall contract. | Add transition rows to fix one broken family in the current baseline line. |
| Replay / anchor rows | Stable rows from already-solved behavior that are kept in the curriculum to prevent regressions. | Keep strong decision_tree rows present while repairing table_matrix. |
| Prompt surface | The information exposed on the input side of the task. | Explicit [layout:...] prompts versus intent prompts with only topic + goal + audience. |
| System boundary | The line between what the model must learn and what deterministic systems should own. | The model chooses scene structure; the compiler owns exact geometry and gradients. |
| Canonicalization | Forcing one stable legal form for equivalent outputs so exact-match metrics mean something. | One legal field order for scene attributes, not many interchangeable spellings. |
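The canonicalization definition above is the one most often implemented as code: force one legal attribute order so equivalent outputs collapse to a single exact-match form. A minimal sketch, assuming a made-up attribute list; the real canonical order belongs to the DSL contract.

```python
# Hypothetical canonical field order for the scene header (illustrative only).
CANONICAL_ORDER = ["layout", "theme", "tone", "density"]

def canonicalize_header(attrs: dict) -> str:
    """Emit scene-header attrs in the single legal order.

    Iterating the canonical list (not the input dict) fixes ordering and
    silently drops unknown keys, so exact-match comparisons stay meaningful.
    """
    parts = [f"[{k}:{attrs[k]}]" for k in CANONICAL_ORDER if k in attrs]
    return " ".join(parts)
```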
Use a new spec when the question changed
Examples: move from flat atoms to scene DSL, move from explicit layout prompts to intent prompts, or split visible content out into content.json.
Use a new run when the contract is right but weak
Examples: add closure repairs, strengthen replay, rebalance families, or raise effective epochs for the same grammar.
Reject blurry runs
If a run changes grammar, curriculum, and capacity at once, it may still improve a metric, but it will not teach much. That is wasted research signal.
Asset Scaling Strategy
Do not try to train on a large asset library immediately. Prove each capability level first, then widen. The pattern is: memorize → generalize → expand.
| Phase | Gold Assets | Goal | Pass Criteria | What Failure Means |
|---|---|---|---|---|
| Memorization | 3 hand-mapped | Can the model reproduce 3 exact gold scenes from their prompts? | 100% exact on train, compiler round-trips all 3 | DSL, tokenizer, or context window is wrong — fix before adding more data |
| Held-out generalization | 3 gold + 3 synthetic variants | Can the model produce correct scenes for unseen topic × layout combinations? | >70% exact on dev/test splits | Curriculum needs more edit diversity — add topic swaps, density changes, theme variations |
| Family expansion | 7–10 gold across 5+ families | Does adding new layout families break already-learned ones? | >70% exact overall, no per-family regression below 50% | Anchor replay too weak — increase replay ratio for stable families |
| Compositional generalization | 10+ gold, compositional tokens | Can the model compose components it has seen in new combinations? | >60% exact on novel layout × component combinations | Token granularity too coarse — break monolithic tokens into compositional pieces |
| Open planning | 20+ gold, underspecified prompts | Can the model choose layout family from an ambiguous request? | Reasonable family choice >80%, compiler renders successfully >90% | Model needs more prompt diversity and possibly larger capacity |
Each phase should be a separate spec or run. Do not skip phases — a failure at phase 2 means phase 3 data will be wasted compute. The gold asset count is a guide, not a rule. What matters is that each phase answers its specific question before the next one starts.
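The phase ladder above can be expressed as explicit gates so "which phase are we in?" is answered by the report, not by memory. A sketch under assumptions: the report field names and the first three gates mirror the table, but are not the project's real schema.

```python
# Each phase gate must pass before the next phase's data is worth spending.
PHASE_GATES = [
    ("memorization",      lambda r: r["train_exact"] == 1.0 and r["roundtrip_ok"]),
    ("generalization",    lambda r: r["dev_exact"] > 0.7),
    ("family_expansion",  lambda r: r["overall_exact"] > 0.7 and r["min_family"] >= 0.5),
]

def next_phase(report: dict) -> str:
    """Return the first phase whose gate fails; that is where work continues."""
    for name, gate in PHASE_GATES:
        if not gate(report):
            return name
    return "compositional_generalization"
```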
How To Build Training Intuition Without Frontier Compute
The useful lesson from frontier work is not that every internal circuit is understood. The useful lesson is that model behavior can still be shaped in predictable directions through disciplined control of data, interfaces, budgets, and evaluation.
| Layer | Question | What To Learn |
|---|---|---|
| Distribution | What experiences is the model compressing? | Data mix, replay pressure, repair rows, holdouts, contradiction checks |
| Interface | What problem is the model actually being asked to solve? | Prompt contract, DSL scope, structure/content split, canonical ordering |
| Budget | Did the model see enough clean signal to learn the task? | Effective epochs, packed token budgets, context usage, canary gates |
| Capacity | Is the model too small, or is the task still badly shaped? | Only scale after grammar, compiler, data, and probe paths are stable; large rung-to-rung swings on one small model usually mean recipe trouble before capacity trouble |
| Evaluation | Are the right things being measured? | Probe exactness, renderability, materialized exactness, per-family breakdowns, non-regression |
| System boundary | What belongs in the model versus the compiler or content system? | Keep deterministic rendering and data retrieval out of the model when possible |
In other words, practical intuition comes from asking: what changed, why did behavior change, and which layer actually caused it? This is why the spec/run method matters. It turns training into a ledger of answers instead of a collection of lucky outcomes.
What Frontier Labs Usually Know
The claim that "no one knows how LLMs work" is too broad to be useful. A better version is this: full mechanistic theory is still incomplete, but empirical control is strong, and product/system control is often stronger still.
In practice, serious teams may not be able to explain every internal circuit, but they can still learn, with real discipline, that a particular data mixture, architecture, scale, objective, post-training recipe, and evaluation set tends to produce particular kinds of behavior. That is not total understanding, but it is real engineering knowledge.
Why Data Curation Still Matters
Scaling does not erase the training distribution. The model still compresses what it sees. Bad data creates bad priors, noisy mixtures create unstable behavior, narrow data creates narrow generalization, and well-shaped data creates cleaner abstractions.
This is why data curation remains central even without a complete theory of generalization. The practical loop is still: define the contract, shape the data to teach that contract, measure the right behavior, and repair the actual failure layer.
The Failure-To-Repair Loop
In practice, much of the progress comes from a simple discipline: observe the weakness, classify the weakness, teach that weakness directly, and rerun from the best relevant checkpoint when the spec has not changed.
This is not the same as "keep adding more data." Before adding rows, decide whether the failure is mainly in the data, DSL, compiler, tokenizer, token budget, decode hygiene, or capacity. Then fix the right layer.
- If the model misses [/scene], add closure and continuation rows and tighten stop hygiene.
- If the model confuses two families, add contrast rows and family-local anchors.
- If the DSL is too verbose or ambiguous, start a new spec instead of piling on more data.
- If the compiler cannot express the target asset, fix the compiler before retraining.
- If a seeded checkpoint suddenly probes as all-zero before any new training, stop and verify seed/probe integrity before blaming the curriculum.
- If "negative" cleanup rows are added under ordinary cross-entropy supervision, remember they are still positive targets for the model; treat contamination rows as hazardous, not automatically corrective.
A practical rule is: same spec -> usually continue or rerun inside the same family; new spec -> usually start a new line. This is how failures become supervision instead of wasted compute.
Run Failure Matrix
The most useful next step is not "train more." It is to name the failure class correctly. The table below is the default matrix for interpreting a run before deciding whether to repair the data, change the DSL, fix the compiler, or scale capacity.
| Failure Class | What It Looks Like | What To Track | Likely Cause | Typical Fix |
|---|---|---|---|---|
| scene_prefix_failure | Missing [scene], missing [layout:...], duplicated top-level attrs | start-valid rate, missing-layout rate, duplicate-attr rate | weak canonical anchors, too much fragment training | add full-scene canonical rows and prefix-only repair rows |
| scene_suffix_failure | Missing [/scene] | close-tag miss rate | weak termination training, decode stop/budget issues | add close-tag rows, verify stop markers, verify decode budget |
| block_nesting_failure | wrong closing tag, invalid nesting, repeated block open | nesting error rate by block type | weak block grammar, fragment-heavy repair rows | add balanced block rows, transition rows, canonical block-order rows |
| budget_truncation | Output is a correct prefix but cut off | truncated_at_budget rate, prompt/output token counts | decode budget too small or context too small | raise decode budget first, then context only if needed |
| special_token_leak | <\|bos\|>, <\|eos\|>, or prompt tokens leaking inside scene output | special-token leak rate | tokenizer boundary contamination, bad row boundaries | strip/control special tokens and strengthen scene-only targets |
| contamination_supervision | cleanup or restart rows teach junk surfaces directly; outputs collapse to empty strings, stray single tokens, or contaminated prefixes after a repair push | empty-response rate, missing-scene-start rate, contamination-token frequency in training rows, before/after repair-row deltas | synthetic corruption was mixed into ordinary CE targets and learned as part of the distribution | remove contamination rows, protect full-scene replay anchors, and rerun a tiny canary before any broader repair pass |
| layout_drift | Wrong family or empty layout | per-layout confusion matrix | family overlap, weak family anchors | more direct family rows and layout-class repair rows |
| theme_tone_drift | wrong or duplicated theme/tone attrs | theme/tone confusion matrix, duplicate-attr rate | weak top-level canonicalization, over-repair | canonical scene-header rows and dedupe rules |
| renderable_but_not_exact | SVG compiles but scene DSL is off | exact vs renderable gap | semantic drift, ordering drift | targeted exactness repair rows |
| exact_but_not_materialized | scene matches but final SVG differs | materialized-exact gap | compiler or content-binding bug | fix renderer/probe path, not training |
| family_imbalance | one family learns, one collapses | per-family exact/renderable/materialized | data imbalance or family-specific grammar difficulty | family-specific anchors and family-weight tuning |
| undertraining | high loss, broad failure everywhere | loss curve, steps per epoch, token budget | too little budget for the grammar difficulty | raise epochs or total tokens |
| over_repair_fragmentation | valid local fragments but corrupted full scenes after a repair push | renderable drop after repair-row increase, local grammar error counts | too many fragment rows relative to clean full scenes | reduce fragment ratio and add more clean full-scene anchors |
| compiler_parity_gap | the model may be fine but the target family still looks weak | gold asset parity score | compiler not expressive enough | do a compiler pass before more training |
| probe_accounting_bug | obviously good outputs score wrong | mismatch between exact, renderable, and materialized evidence | reporting or probe bug | fix probe/report path first |
| seed_probe_integrity_gap | a copied "good" checkpoint probes as broken before any new training | seed-only probe result, artifact hash equality, tokenizer/template sidecar parity, live-vs-historic baseline repro | seed staging bug, probe-runner drift, or decode/runtime drift rather than new training damage | add a seed-only probe gate and repair staging/probe integrity before launching another rung |
| capacity_misdiagnosis | operators want a bigger model because a few repair runs failed, but the same architecture already moved from near-zero to strong probe scores on cleaner recipes | best-vs-worst rung spread on the same architecture, family stability under recipe changes, plateau across multiple clean canaries | recipe or evaluation instability is being mistaken for a hard parameter limit | stabilize the curriculum and probes first; scale only after several clean recipes plateau on the same failure class |
Minimum Run Scoreboard
Every run should publish the same small scoreboard. This keeps comparisons honest and makes failure classes visible without reading raw prompt dumps first.
| Metric | Why It Matters |
|---|---|
| exact_rate | scene contract fidelity |
| renderable_rate | structural validity |
| materialized_exact_rate | final compiler truth |
| budget_truncation_rate | separates learning failure from budget failure |
| missing_scene_start_rate | top-level grammar health |
| missing_scene_end_rate | termination health |
| duplicate_attr_rate | canonicalization health |
| block_nesting_error_rate | nested grammar health |
| special_token_leak_rate | tokenizer/output contamination |
| per-layout exact/renderable/materialized | family-specific diagnosis |
| train/dev/test split rates | overfit detection |
| gold_asset_parity_score | compiler readiness |
A simple interpretation rule helps keep the read honest: low exact with high renderable usually means semantic or ordering drift; low renderable with low truncation usually means grammar corruption; high truncation is a budget problem; high exact with low materialized exact is a compiler or probe bug.
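That interpretation rule is mechanical enough to encode. A sketch with illustrative thresholds (the cut-offs are assumptions, not calibrated project values); the branch order matters, since a budget problem should be ruled out before blaming the grammar.

```python
def read_scoreboard(exact: float, renderable: float,
                    materialized: float, truncation: float) -> str:
    """Apply the scoreboard interpretation rule with illustrative thresholds."""
    if truncation > 0.2:
        return "budget problem"                      # high truncation dominates
    if exact < 0.3 and renderable > 0.9:
        return "semantic or ordering drift"          # valid but wrong scenes
    if renderable < 0.5 and truncation < 0.1:
        return "grammar corruption"                  # broken outputs, not cut off
    if exact > 0.7 and materialized < 0.3:
        return "compiler or probe bug"               # model fine, pipeline suspect
    return "mixed signal: check per-family breakdown"
```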
Repair Ladder Lessons
A recurring mistake in small-to-medium DSL runs is to treat every bad output as a direct template for the next repair row. That is too naive. The model does not understand that a row was meant as a warning; under ordinary supervised training it only sees another target distribution to imitate.
- Do not treat synthetic corruption rows as automatically "negative data." Under cross-entropy, they are still positive supervision unless a different objective is used.
- Protect full-scene replay anchors from the last good rung. Local continuation repairs should stay subordinate to full canonical scenes.
- Separate three gates before training: probe-path integrity, seed-copy integrity, and then curriculum quality. A broken probe can look exactly like a broken rung.
- If one small architecture already spans a wide quality range across runs, the next bottleneck is usually recipe shape, not model size.
- Use tiny seeded canaries after a failure. If a 0.1x or 0.2x corrective run still destabilizes the same contract, pause rung-chasing and rethink the materializer itself.
In other words: do not use bigger models to paper over a broken supervision surface. Scale capacity when the contract is stable and the same clean failure persists across multiple disciplined canaries. Until then, the higher-signal move is to improve the curriculum, replay anchors, and measurement path.
What To Copy From Frontier Practice
- Use proxy runs before expensive runs.
- Design evals aggressively instead of trusting loss alone.
- Run controlled ablations instead of relying on folklore.
- Keep explicit failure taxonomies.
- Control data mixtures carefully.
- Separate base learning, post-training, and system scaffolding.
- Do not expect one run to answer five different questions at once.
Ethical Scaling Path
Going from something simple to something bigger should not mean removing constraints faster than understanding improves. The safer path is to widen capability in stages while keeping the boundaries of responsibility clear.
| Stage | Ethical Rule | Why It Matters |
|---|---|---|
| Bounded contract | Start with a narrow, measurable task. | Prevents vague demos from being mistaken for robust capability. |
| Honest boundaries | State clearly what the model does, what the compiler does, and what external data does. | Keeps the system understandable and avoids false claims about generality. |
| Bridge prompts | Move from explicit prompts to weaker prompts gradually. | Avoids turning one experiment into language learning, planning, retrieval, and rendering all at once. |
| Hard gates | Require non-regression, canaries, and probe honesty before widening scope. | Stops capability drift from being hidden behind bigger compute. |
| Human oversight | Keep humans in the loop for high-stakes domains and claims. | Capability should expand with accountability, not just with confidence. |
Ethical scaling is not only about policy. It is also about research honesty: know what the model is really doing, know what the compiler is doing, know what the content system is doing, and report those boundaries clearly.
Pre-Training Checklist
Before launching any training run in a new spec, verify all of these gates. If any fails, fix it before spending compute.
| Gate | Check | Tool |
|---|---|---|
| Compiler parity | Every gold scene.dsl + content.json compiles to valid SVG that matches the reference asset | Compiler round-trip test against gold pack |
| Token budget | Longest gold output fits within 80% of context window after tokenization | Gold-budget report tool or equivalent |
| Tokenizer coverage | All DSL tokens in the gold pack are in the tokenizer vocabulary — no <unk> fallbacks | Tokenize each gold scene and check for unknown tokens |
| Parity regimen | CK ↔ PyTorch forward/backward/optimizer parity passes at the target model size | training_parity_regimen_latest.json |
| Dataset QC | All training rows parse, no duplicates, holdout prompts are disjoint from train | Dataset QC step in training pipeline |
| Failure-frontier forecast | The curriculum blueprint names the predictable failure classes and shows which surfaces teach each one before the first run starts | spec17_curriculum_blueprint.json + audit_curriculum_blueprint_v7.py |
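The token-budget gate from the checklist can be sketched in a few lines. Here `tokenize` is a stand-in for the project's real tokenizer (an assumption for illustration); the 80% fill threshold comes from the promotion rules above.

```python
from typing import Callable, Iterable

def gold_budget_ok(gold_outputs: Iterable[str],
                   tokenize: Callable[[str], list],
                   ctx_len: int, max_fill: float = 0.8) -> tuple[bool, int]:
    """Check that the longest gold mapping fits within max_fill of context.

    Returns (ok, worst_token_count) so a failing gate reports how far over
    budget the worst gold output is, e.g. the 2793-token vs ctx=512 case above.
    """
    worst = max(len(tokenize(g)) for g in gold_outputs)
    return worst <= int(ctx_len * max_fill), worst
```

In practice this runs before the tokenizer and curriculum are frozen; a failing gate means "compress the DSL or raise context", never "train anyway".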
What A Solved Structured-DSL Run Actually Proves
A solved spec[x] rung[y] does not mean the model learned open-ended infographic design. It means something narrower and more useful: the model can reliably map a bounded prompt contract to a compiler-facing scene DSL, and that DSL can then be combined with external content.json to produce exact SVG.
- The model is responsible for the scene program: layout family, scene components, refs, and top-level style fields.
- The compiler is responsible for deterministic layout and SVG assembly.
- The external content payload is responsible for visible copy, data values, and topic-specific facts.
Keep those responsibilities separate. When adding new DSL families, do not blend asset identity into the output grammar unless the structure truly requires it. Case ids, topic-specific facts, and payload-specific copy should stay in prompt text, routing metadata, and external content.json wherever possible. The output DSL should stay as generic as the renderer contract allows.
This boundary matters. A strong spec[x] rung[y] result proves the training method can stabilize structured program generation. It does not yet prove arbitrary topic generalization, arbitrary tree depth, free-written infographic copy, or unconstrained scene planning.
Prompt-Surface And Decode-Boundary Discipline
Hidden eval should cover at least two different generalization surfaces:
- Prompt-surface robustness: paraphrases, wording shifts, and bridge prompts that preserve the same underlying task.
- Semantic breadth: new topics, new assets, or new content payloads that test whether the contract scales beyond the first few gold mappings.
Solving prompt paraphrases is not the same as solving new semantics. Treat those as different milestones and report them separately.
Also keep decode-boundary hygiene separate from model quality. In raw CLI inference, a scene DSL model may generate a correct [scene] ... [/scene] block and then continue into the next training-style prompt if no structural stop marker is supplied. That is a decode configuration issue, not automatically a training failure. Measure the first valid scene block, then fix stop hygiene at the inference boundary.
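The "measure the first valid scene block" step can be sketched with a non-greedy regex. The `[scene]`/`[/scene]` tags follow the DSL shown in this document; the regex approach itself is an assumption, not the project's real decode path.

```python
import re

# Non-greedy match so we capture only the FIRST [scene]...[/scene] block,
# ignoring any training-style continuation the model emits afterwards.
SCENE_RE = re.compile(r"\[scene\].*?\[/scene\]", re.DOTALL)

def first_scene_block(raw_output: str) -> "str | None":
    """Extract the first complete scene block from raw CLI inference output."""
    m = SCENE_RE.search(raw_output)
    return m.group(0) if m else None
```

Scoring the extracted block, rather than the raw output, keeps decode-boundary hygiene out of the model-quality metrics, as the paragraph above argues.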
Next Bridge After A Solved Explicit-Layout Line
The next step after a solved explicit-layout line is not open-ended prose. It is a bounded intent bridge where the prompt still names the topic and goal, but stops prescribing the scene directly.
The short rule is:
- keep content.json external
- keep the compiler deterministic
- let the model infer layout, theme, tone, density, and related scene-planning fields from topic + goal + audience
- measure planning quality separately from syntax and renderability
Example intent-bridge prompt:
[task:svg] [topic:topic_id] [goal:goal_id] [audience:audience_id] [OUT]
This keeps the task bounded. The model is not being asked to learn open-domain language or free-written infographic copy. It is being asked to choose a good scene plan under weaker prompt control.
spec17 established the bounded-intent bridge shape and spec18 tested a routing-first curriculum on the same frozen stack, but held-out exactness still stayed at zero. The next public recommendation is therefore spec19: a textbook-routing branch with named mixture buckets, denser minimal-pair routing coverage, and capacity held back as a separate fallback lever. See spec19-textbook-routing-mixture.html.
Progress Goal After The Last Good Baseline
The immediate goal after the last good baseline is not “teach the model English.” It is narrower and more useful: preserve compiler-backed precision while adding more SVG families and better within-family generalization.
- Keep the renderer deterministic and visually strong.
- Keep semantic payloads external until the family contract is stable.
- Add new families one at a time so failure causes stay legible.
- Judge progress by exact/materialized SVG quality and held-out family transfer, not by conversational fluency.
Related Pages
v7 Runbook
The concrete operator commands, parity gates, and train/infer workflow.
Training Intuition
The deeper failure-analysis and checkpointing page with the Phase 1–7 diagnostic matrix for training dynamics (gradients, attention, weights).
Training Curriculum
The long-range CK-native learning ladder from v7 through later versions.
Spec17 Curriculum
The bounded intent-bridge blueprint, failure-frontier map, and stage mix planned after the frozen spec16 winner.
Spec19 Routing Mixture
The current recommended next branch: named mixture buckets, textbook-style routing rows, and a clean capacity fallback if held-out exactness stays flat.
Version History
The public roadmap showing how each version track feeds the next capability layer.
Spec / Run Discipline
The versioned internal note explaining why the project uses spec[x] and r[y], and how that method builds training intuition.
version/v7/reports/SPEC_RUN_DISCIPLINE_AND_TRAINING_INTUITION_2026-03-18.md