Methodical DSL Training Methods
A spec is not just a dataset or a model run. It is a contract between an asset library,
a scene DSL, and a compiler. The model should learn structure against a stable contract, not absorb
random changes in vocabulary, content, and rendering all at once.
Across spec02 through spec16, broad spec-level progress has been real,
but once a line already has a strong raw winner, post-hoc raw repair rungs have regressed more often than they have helped.
The default policy is now: predict likely failure frontiers before training, encode them as clean positive coverage in the
initial curriculum, freeze strong raw winners, and route syntax-only residue to deterministic decode/repair instead of
reopening the same brittle CE surface.
gen1 method update: the first true broad-contract scene-DSL run fit the widened surface far more effectively than the older narrow spec/rung loop. That does not prove every scaling-law claim, but it does justify the current policy shift: broaden the clean compiler-backed corpus first, then harden the eval, then change model size only after the frozen broad contract has been tested under recombination pressure.
Why This Method Exists
This curriculum is not just local preference. It follows the same broad logic seen in scaling-law work, compute-optimal training research, and capability-predictability discussions: use smaller and cheaper experiments to validate the contract, the data, the tokenizer surface, and the evaluation loop before spending serious compute on larger runs.
Scaling Laws
OpenAI's scaling-laws work is the clearest public reference for why small and medium runs are useful first: they help forecast how loss changes with model size, data size, and compute before larger training runs.
Outcome: larger trends can be forecast from smaller controlled experiments.
Applied here: use cheap spec runs to validate DSL, tokenizer, and eval contracts before spending serious training compute.
Compute-Optimal Training
DeepMind's Chinchilla result is the classic argument against blindly scaling parameters without checking data and token budgets. It supports the habit of using many smaller runs to find the recipe before the expensive run.
Outcome: better results often come from the right model/data/token balance, not just a bigger model.
Applied here: compute packed token budgets and effective epochs per spec instead of copying old totals into new DSLs.
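The budget arithmetic above can be sketched directly. This is a minimal estimate, not the project's actual pipeline; the function names and the 0.85 fill target are assumptions taken from the fill-rate rule mentioned later in this document:

```python
def packed_token_budget(row_token_counts, ctx_len, fill_target=0.85):
    """Estimate packed windows and fill rate for a spec.

    row_token_counts: tokens per training row (prompt + output).
    ctx_len: packed window length, e.g. 512.
    fill_target: minimum acceptable fill rate for packed windows.
    """
    total_tokens = sum(row_token_counts)
    windows = max(1, -(-total_tokens // ctx_len))  # ceiling division
    fill_rate = total_tokens / (windows * ctx_len)
    return {
        "total_tokens": total_tokens,
        "windows": windows,
        "fill_rate": round(fill_rate, 3),
        "fill_ok": fill_rate >= fill_target,
    }

def effective_epochs(train_token_budget, total_tokens):
    """How many times the corpus is effectively seen under a fixed token budget."""
    return train_token_budget / total_tokens
```

Computing these per spec, instead of copying old totals into new DSLs, is exactly the Chinchilla-style habit the section describes.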
Predictability vs Surprise
Anthropic's paper is a useful reminder that broad trends can be predictable even while specific capabilities and outputs remain surprising. That is why this project insists on probes, canaries, and run reports rather than loss alone.
Outcome: aggregate scaling behavior can be smooth while concrete behaviors still surprise operators.
Applied here: every run needs split-aware probes, failure galleries, and promotion rules instead of trusting train loss.
Reference: Anthropic — Predictability and Surprise in Large Generative Models.
Large-Scale Recipe Practice
Meta's Llama 3 paper is a useful public example of a model family trained with scaling-law thinking across multiple sizes, rather than one single blind flagship run.
Outcome: a family of models is often used to refine the recipe, not just to ship different sizes.
Applied here: keep `spec` lines interpretable at small scale first, then widen to mixed training and larger CK-native runs later.
Project Curriculum
The CK-native version ladder explains how the current visual DSL work fits into the broader path from compiler-backed training to page DSLs, code/data/tool IR, and eventually mixed coding/scientific tasks.
Outcome: visual DSL work is treated as foundation-building, not the final destination.
Applied here: move from narrow-family SVG work toward page DSLs, code/data/tool IR, and broader mixed models.
Failure Visibility
The local intuition page explains why this method emphasizes canaries, parity gates, checkpoints, and structured post-run reports instead of random trial-and-error.
Outcome: visibility turns failed runs into reusable knowledge instead of wasted compute.
Applied here: every spec should leave behind a fixed HTML report with hypothesis, deltas, failures, and a decision.
Three Separate Contracts
1. Asset/Data Contract
The asset library is the visual reference set and the context library is the semantic reference set.
- Source from `docs/site/assets/*.svg` or, later, public asset libraries.
- Strip literal text into placeholders first.
- Track layout family, composition, theme, color system, shapes, connectors, charts, and text roles.
- Do not let real copy hide layout mistakes.
2. DSL Contract
The DSL is the model boundary. It should describe composition and roles, not raw SVG bookkeeping.
- Emit scene choices such as layout, theme, rail, density, gap, background, and components.
- Use placeholders like `heading_1`, `paragraph_1`, `badge_1`.
- Keep canonical token order and fixed arity.
- Remove inferable fields from the model surface.
3. Compiler Contract
The compiler owns geometry, defs, gradients, markers, wrapping, and final render semantics.
- Compile scene DSL into SVG with deterministic layout rules.
- Guarantee round-trip reconstruction before any serious training run.
- Keep rendering policy stable across runs unless the run is explicitly a compiler experiment.
- Treat compiler regressions separately from model regressions.
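The round-trip guarantee above can be enforced as a mechanical gate before any run launches. This is a sketch only: `compile_scene` and `parse_svg_to_scene` are hypothetical project hooks, and the gate logic is the generic part:

```python
def round_trip_gate(gold_scenes, compile_scene, parse_svg_to_scene):
    """Compiler round-trip gate: every gold scene must survive
    scene -> SVG -> scene reconstruction before training starts.

    compile_scene / parse_svg_to_scene are project-specific hooks
    (hypothetical names); the pass/fail accounting is generic.
    """
    failures = []
    for name, scene in gold_scenes.items():
        svg = compile_scene(scene)
        recovered = parse_svg_to_scene(svg)
        if recovered != scene:
            failures.append(name)
    return {"passed": not failures, "failures": failures}
```

Treating this as a hard preflight check keeps compiler regressions visibly separate from model regressions, as the contract requires.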
These are real assets from the project library — the same asset families the training pipeline learns to plan and compose.
What We Are Actually Training
In this line, the model is not being trained to be a free-form SVG artist. It is being trained to choose structured visual intent against a stable compiler contract.
Composition
Composition means the overall information shape: compare two systems, show a pipeline, build a poster stack, or arrange dashboard cards.
Examples: poster_stack, comparison_span_chart, pipeline_lane.
Layout Family
Layout is the reusable geometry family inside the composition. It decides where headers, panels, tables, bars, and connectors belong.
Examples: one tall poster, two compare panels, three-column dashboard, staged pipeline lane.
Theme And Tone
Theme is the visual language. Tone is the accent family inside that language.
Examples: theme:infra_dark, theme:paper_editorial, tone:amber, tone:green.
Content Binding
The content is not the same thing as the scene structure. The model should choose where content goes; a separate payload should provide what the text and values actually are.
This is why later lines should move from literal prose inside scene tokens to keyed refs plus content.json.
| Layer | Example | Who Owns It |
|---|---|---|
| Composition | `comparison_span_chart` | Model |
| Theme | `infra_dark` | Model |
| Tone | `amber` | Model |
| Content role | `@section_card.0.title` | Model chooses the slot; external content provides the value |
| Exact SVG path, gradient, shadow, marker, wrap | `<path ...>`, `<linearGradient ...>` | Compiler |
Worked Example
A concrete keyed-scene example should look like this:
Request prompt
[task:svg]
[layout:poster_stack]
[topic:memory_reality]
[theme:infra_dark]
[tone:green]
[density:compact]
[OUT]
Scene DSL emitted by the model
[scene]
[canvas:tall]
[layout:poster_stack]
[theme:infra_dark]
[tone:green]
[frame:card]
[density:compact]
[inset:md]
[gap:sm]
[hero:center]
[columns:1]
[emphasis:top]
[rail:accent]
[background:rings]
[connector:line]
[topic:memory_reality]
[header_band:@header_band.0.kicker|@header_band.0.headline|@header_band.0.subtitle]
[section_card:@section_card.0.title|@section_card.0.value|@section_card.0.caption|variant=hero|accent=amber]
[compare_bar:@compare_bar.0.label|@compare_bar.0.value|@compare_bar.0.caption|accent=red]
[table_row:@table_row.0.column_1|@table_row.0.column_2|@table_row.0.column_3|state=highlight|accent=amber]
[/scene]
content.json bound by the compiler
{
"header_band": [
{
"kicker": "First Principle",
"headline": "LLM Memory Reality",
"subtitle": "The math marketing will not show you"
}
],
"section_card": [
{
"title": "Memory Capacity",
"value": "25x more memory capacity",
"caption": "Capacity sets the real context boundary"
}
],
"compare_bar": [
{
"label": "GPU VRAM",
"value": "80 GB",
"caption": "single device"
}
],
"table_row": [
{
"column_1": "128K",
"column_2": "3x GPUs",
"column_3": "Fits"
}
]
}
The training target here is the scene decision, not the literal prose. The compiler can now render the same scene with different content payloads without retraining the model.
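The keyed-ref binding in the worked example can be sketched as a small resolver. The `@component.index.field` grammar comes from the scene above; the function name itself is an assumption:

```python
def resolve_ref(ref, content):
    """Resolve a keyed content ref like '@section_card.0.title'
    against a content.json-style payload dict."""
    if not ref.startswith("@"):
        return ref  # literal value, not a ref
    component, index, field = ref[1:].split(".")
    return content[component][int(index)][field]

# Payload shaped like the content.json in the worked example.
content = {
    "section_card": [
        {"title": "Memory Capacity",
         "value": "25x more memory capacity",
         "caption": "Capacity sets the real context boundary"}
    ],
}
```

Because binding is this mechanical, swapping in a different payload dict re-renders the same scene with no model involvement at all.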
Topology Refs vs Content Refs
Some richer layout families need more than content slots. Trees, maps, and graph-style layouts also need stable structural handles so the compiler knows what connects to what.
| Ref Type | Example | Used For |
|---|---|---|
| Topology ref | `[node_id:start]`, `[from_ref:start]`, `[to_ref:l2]` | Identity, routing, graph layout, branch placement, segment attachment |
| Content ref | `[title_ref:nodes.start.title]`, `[branch_label_ref:edges.start_l2]` | Visible text and numeric payload from `content.json` |
[decision_node] [node_id:start] [title_ref:nodes.start.title] [/decision_node]
[decision_edge] [from_ref:start] [to_ref:l2] [branch_label_ref:edges.start_l2] [/decision_edge]
This dual-reference pattern is not a bug. It is the right compiler boundary for tree, map, and topology layouts.
Why Explicit Tokens Instead Of Generic BPE Right Now?
The short answer is: because this output language is small, formal, and brittle, and the current training line is optimizing for control and interpretability before token-efficiency.
Why not plain BPE first?
If the tokenizer is free to break tags arbitrarily, a tiny model must learn both the scene language and the spelling of the scene language at the same time.
What explicit reserved tokens buy
They make the contract visible. A probe miss can be read as the wrong scene decision instead of arbitrary subword drift across brackets, separators, and role markers.
Why this is not a universal rule
Frontier models absolutely use learned tokenizers like BPE or SentencePiece at base-model scale. The explicit token surface here is a local engineering choice for a narrow formal DSL.
What should happen later
Once the DSL is stable, revisit the tokenizer boundary as its own spec question: smaller reserved surface, learned merges, or mixed strategy. Do that after the contract is proven, not before.
A useful rule is:
Early line: prefer explicit structural tokens so the contract is obvious. Later line: shrink or relax the reserved surface only after the model, compiler, and evals agree on the language.
One mistake to avoid is treating a whole component row as the permanent atomic token surface.
| Shape | What It Buys | What It Risks |
|---|---|---|
| `[compare_bar:@compare_bar.0.label\|@compare_bar.0.value\|@compare_bar.0.caption\|accent=amber\|note=@compare_bar.0.note]` | Short sequences and strong format control. | Too brittle if kept forever. Similar component variants do not share enough structure, so the vocabulary becomes overly specific. |
| `[compare_bar][label_ref:compare_bar.0.label][value_ref:compare_bar.0.value][caption_ref:compare_bar.0.caption][accent:amber][note_ref:compare_bar.0.note][/compare_bar]` | More compositional reuse across fields and component variants. | Longer sequences and a slightly harder grammar, but much better long-term generalization. |
The first shape can be acceptable as a transition step. It should not be treated as the final production tokenizer boundary.
A related failure is packing rows like `[compare_bar:@compare_bar.0.label|@compare_bar.0.value|@compare_bar.0.caption|accent=amber]` back into
single atomic tokens. This is not training-token packing in the data pipeline. It is tokenizer-boundary
packing, and it makes the DSL brittle.
| Failure Mode | Why It Is Wrong | Corrective Direction |
|---|---|---|
| Whole bracketed component rows become reserved tokens because the tokenizer harvests every whitespace-delimited `[...]` chunk. | The model no longer learns a compositional scene language. It learns oversized one-off control ids. | Keep only small structural tokens reserved and stop reserving payload-bearing ref tokens like `@compare_bar.0.label`. |
| Dense packing is confused with normal training-token packing. | The packed-window step is not the issue here. The issue is the token boundary chosen before packing. | Debug tokenizer boundaries first, then debug total-token budgets and effective epochs second. |
| Keyed refs still sit inside one monolithic component token. | The content moved out of literal prose, but the structure is still too atomic. | Move to block-style components such as `[compare_bar]`, field-role tokens, and separate ref tokens or unreserved ref payloads. |
Bad tokenizer boundary:
[compare_bar:@compare_bar.0.label|@compare_bar.0.value|@compare_bar.0.caption|accent=amber]
Better boundary:
[compare_bar] [field:label] [@compare_bar.0.label] [field:value] [@compare_bar.0.value] [field:caption] [@compare_bar.0.caption] [accent:amber] [/compare_bar]
The rule is simple: reserve structure, not payload. If a token contains too much scene-specific data or too many keyed refs, it is probably the wrong token boundary.
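The "reserve structure, not payload" rule can itself be checked mechanically. This sketch assumes keyed refs are marked with `@` and structural tokens are bare bracketed identifiers, as in the examples above; the regex is an illustrative approximation, not the project's actual reservation logic:

```python
import re

# Purely structural tokens: [name], [/name], or [name:short_value].
STRUCTURAL = re.compile(r"^\[/?[a-z_]+(:[a-z_0-9]+)?\]$")

def is_reservable(token):
    """A token may be reserved only if it is purely structural:
    no keyed '@' refs and no payload-bearing field data."""
    if "@" in token:
        return False  # payload-bearing ref: never reserve
    return bool(STRUCTURAL.match(token))
```

Running a filter like this over a candidate vocabulary catches whole-component-row tokens before they quietly become permanent.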
Spec Lifecycle
State the hypothesis
Write down exactly what the next spec is trying to prove: richer DSL, better compiler, stronger asset family, or narrower repair.
Snapshot the assets
Copy the chosen asset family into the cache-backed run workspace so the training contract is frozen and reproducible.
Replace literal text
Strip shipped copy and replace it with placeholders. Train structure first, then bind content from a separate library.
Extract vocabulary
Build the scene vocabulary from layouts, components, composition patterns, style families, and semantic text roles.
Prove the compiler
Compile the DSL with dummy text and verify that the output is acceptably close to the reference asset family before training.
Freeze a gold asset gate
Require a real gold pack, ideally 5 to 10 assets spanning comparison, poster, table, and pipeline families. Treat compiler fidelity as a gate, not a note.
Freeze the tokenizer surface
Tokenize the DSL vocabulary plus placeholder text roles. Do not let arbitrary literal text dominate the tokenizer early, and do not keep whole component rows as permanent atomic tokens once the structure-content split is proven.
Train with one primary axis
Change one main thing per run: DSL, compiler, data mix, tokenizer budget, or capacity. Hold the rest stable.
Write the decision
After every run, decide: promote, reject, repair, or branch. Every run ends with a next action, not just a score.
Operational Training Process
The practical mistake is to let a rung answer too many questions at once. The better process is narrower: freeze measurement, name one failure class, teach family structure, and let validation close the last deterministic syntax gaps.
Freeze measurement first
Before changing the curriculum, freeze the probe prompt lists and keep them balanced across cases, forms, and prompt surfaces.
Rule: if the probe contract changes, version it. Do not compare old and new runs as if they were the same measurement.
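Versioning the probe contract can be as simple as hashing the frozen prompt list, so any drift changes the version id. This is a minimal sketch; how the id is stored and compared across runs is left to the project:

```python
import hashlib
import json

def probe_contract_version(probe_prompts):
    """Derive a stable version id from the frozen probe prompt list.

    Sorting makes the id order-insensitive; any added, removed, or
    edited prompt changes the id, so old and new runs can never be
    silently compared under different measurements."""
    canonical = json.dumps(sorted(probe_prompts), ensure_ascii=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]
```

Recording this id in every run report makes "same measurement" a checkable claim rather than an assumption.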
One rung, one question
Every rung should target one named failure class: missing terminal tail, wrong family choice, wrong style bundle, stop-boundary spill, or broken exactness.
Rule: if a rung changes grammar, probe shape, and repair mix together, it is not a clean experiment.
Train structure, not boilerplate
The model should learn the family program: layout, components, refs, style controls, and counts. Deterministic tails and canonical ordering should be pushed into the compiler or repair layer whenever possible.
Rule: do not waste multiple rungs trying to make SGD rediscover a fixed footer, terminal block, or closing sequence.
Add repair early
Validation and repair are part of the system, not an admission of failure. Use them as soon as the family structure is mostly right and only a small deterministic residual remains.
Rule: always report raw and repaired probe scores separately.
| Process gate | Required rule | Why it exists |
|---|---|---|
| Probe freeze | Keep balanced prompt lists on disk and version the probe contract when they change. | Otherwise a probe bug or skewed selector can make a healthy line look broken. |
| Rung scope | Declare one failure class target and one allowed intervention family per rung. | Clean attribution matters more than squeezing one more point from a noisy run. |
| Repair budget | Keep broad meta-repair rows capped. Default budget: 10-15% unless the rung is explicitly a syntax rung. | Too much repair prose teaches the warning language itself instead of the underlying scene contract. |
| Compiler-first closure | If a missing region is mechanically implied by form and counts, prefer validation or repair before adding another broad training rung. | This keeps the model focused on family structure rather than deterministic boilerplate. |
| Raw vs repaired reporting | Publish both raw probe metrics and repaired probe metrics. | This separates training quality from system quality and prevents false conclusions. |
| Seed and probe integrity | Fail preflight if seed staging, tokenizer sidecars, or probe-path hashes drift unexpectedly. | Many apparent curriculum regressions are really measurement or staging bugs. |
A useful default wording for the rung brief is: "Fix one named failure class while preserving the last good baseline on the frozen probe." That sentence is narrow enough to govern the materializer, the probe, and the post-run decision.
Freeze the raw winner
Keep the current raw champion as the comparison anchor. New runs compete against that run, not against train loss or operator intuition.
Rule: promotion gates should reference one frozen probe report and one frozen best rung.
Pilot before full rung
If a repair idea touches shared-family behavior, run a small pilot first. Scale the token budget down and treat the result as a gate, not an automatic successor rung.
Rule: a full rung stays blocked until the pilot improves the target family with no family or hidden-split regression.
Decode first after strong raw runs
Once the raw line is mostly correct, push stop-boundary, prompt-spill, and deterministic cleanup into decode, validation, and repair before spending another broad training cycle.
Rule: if the miss set is mostly repairable, train less and validate more.
Ban literal prompt junk repair rows
Do not teach the model the exact wrapper junk you want it to avoid. That turns the warning language itself into the target distribution.
Rule: use control-agnostic clean-stop prompts rather than rows that spell out [OUT], duplicated prompt blocks, or schema-noise tokens.
Context Budget Is A Gate
Do not guess the next context length. Measure the gold scenes with the current tokenizer family before training.
| Rule | Interpretation |
|---|---|
| `p95(prompt + output) < 400` | ctx=512 is probably still enough. |
| `400-600` | Move to ctx=768 only if the gold pack actually lives here. |
| `> 768` | Use ctx=1024 only when the gold scenes really need it. |
The important lesson from the keyed-scene line was that a layout can fail because of decode or context budget even when the model already learned the structure correctly. Budget diagnostics should therefore be part of the probe contract, not an afterthought.
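The gate above can be computed directly from measured token lengths. A sketch, assuming the nearest-rank p95 definition; the table's gap between 600 and 768 is resolved conservatively upward here, which is an assumption, not a rule from the source:

```python
import math

def p95(values):
    """Nearest-rank 95th percentile of a list of token lengths."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

def pick_context_length(prompt_plus_output_lens):
    """Map measured p95(prompt + output) to a context length,
    following the gate table above."""
    cut = p95(prompt_plus_output_lens)
    if cut < 400:
        return 512
    if cut <= 600:
        return 768
    return 1024
```

Measuring the gold scenes with the current tokenizer family and feeding the lengths through a gate like this replaces guesswork with a checkable number.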
How To Build Frontier-Style Intuition
No serious engineer would laugh at the core idea here. The separation of structure, content, compiler, and eval is exactly the kind of thinking that makes systems reliable. What strong teams would challenge is not the direction, but the discipline of the method.
Good instinct
Separating asset library, scene DSL, content, compiler, and probe contracts is a strong instinct. It reduces ambiguity and makes failures legible.
What frontier teams would push harder on
Cleaner ablations, stricter gates, more automatic validation, less manual drift between specs, and fewer simultaneous changes per run.
What to copy from them
Treat every run like an experiment with a written hypothesis, a fixed baseline, a canary, a non-regression gate, and a clear promotion or rejection rule.
What not to imitate blindly
Do not jump to giant-scale training intuition too early. On narrow compiler-backed tasks, clarity of interface and eval quality matter more than trying to mimic a frontier pretraining stack.
| If a frontier engineer reviewed this | Likely reaction | What to improve |
|---|---|---|
| Separating DSL and content | Good systems instinct | Push it fully through the dataset, compiler, and probe path |
| Compiler-first validation | Correct | Keep the gold asset round-trip gate strict |
| Custom explicit tokens | Reasonable for a narrow formal language | Revisit only after the DSL stabilizes |
| Many specs in sequence | Fine if each run is interpretable | Make the hypothesis and held-constant set even more explicit |
| Weak or drifting evals | Unacceptable | Keep the report contract and canary gate mandatory |
One Primary Axis Per Run
The model can only teach intuition if each run is interpretable. Every run should declare one primary axis and list everything held constant.
| Primary Axis | What Changes | What Stays Fixed | Question Answered |
|---|---|---|---|
| DSL run | Scene grammar, token order, component vocabulary, canonicalization rules | Asset set, compiler, tokenizer family, model size, training budgets | Did the representation become easier and cleaner to learn? |
| Compiler run | Layout engine, gradients, wrappers, markers, defs, text wrapping, style packs | DSL, prompts, tokenizer, model config | Did the rendering surface become richer without widening the model surface? |
| Data run | Asset coverage, placeholder catalog, contrast pairs, holdout balance, context library | DSL, compiler, model size | Is the model underfit on coverage, or was the contract itself weak? |
| Budget run | Packed token budgets, effective epochs, stage mix | DSL, compiler, dataset rows, model config | Was failure caused by under-training or over-training rather than representation? |
| Capacity run | Layers, embed dim, hidden dim, context, optimizer scaling | DSL, compiler, dataset mix, evaluation contract | Has the representation stabilized enough that more capacity is the next lever? |
| Repair run | Narrow contrast slices for specific misses | Everything else, especially the winning baseline | Can a specific failure cluster be closed without global regression? |
A run is only methodical when it can be summarized as one sentence: “This run changed X, held Y constant, and answered Z.”
Restart-Safe Agent Handoff
Yes, this page is the right umbrella documentation to keep around for restarts. It now has a companion handoff file so a future agent does not need to reconstruct the current plan from scattered reports or stale run logs.
Why this helps
It preserves the method baseline after a reboot: keep the last good spec[x] rung[y] as the training-method champion, move to a compiler-first successor DSL, and add one new family at a time.
What to copy after restart
Use the markdown handoff file below as the first message to the next agent. It points directly at the live contract and the autopilot policy.
What it prevents
It prevents blind resumption of stale runs, tokenizer guesswork, and launching training before the compiler and tokenizer contract are ready.
| Reference | Purpose |
|---|---|
| `docs/site/_pages/spec-training-method.html` | Human-readable umbrella method page. |
| `docs/site/_pages/agent-handoff-template.md` | Copy-paste restart prompt for future agents. |
| `version/v7/reports/SPEC[X]_EXECUTION_CONTRACT_YYYY-MM-DD.md` | Example shape of the current internal execution contract. |
| `version/v7/reports/spec_family_autopilot_policy.json` | Machine-readable rule for when autonomy is allowed and when it must stop. |
Suggested restart prompt:

Read these first and treat them as the live source of truth:

1. docs/site/_pages/spec-training-method.html
2. docs/site/_pages/agent-handoff-template.md
3. version/v7/reports/SPEC[X]_EXECUTION_CONTRACT_YYYY-MM-DD.md
4. version/v7/reports/spec_family_autopilot_policy.json

Current intent:

- keep the last good spec[x] rung[y] as the training-method baseline, not the tokenizer ceiling
- build the successor DSL/compiler/tokenizer path explicitly
- keep payload/content external to the model DSL
- add one capability at a time
- make the next family[z] the active family-construction line
- do not launch training until the compiler, tokenizer corpus, launcher, and rung policy checklist are complete

Then continue from the current execution contract checklist and update the repo artifacts before starting any background training or autopilot.
Recommended Asset-to-DSL Workflow for SVG
Asset intake
Track every reference SVG in the public asset set and copy it into the run-local cache workspace.
- Keep original assets immutable.
- Record family, source file, and intended holdout split.
- Use run-local copies for extraction and experiments.
Dummy-text normalization
Replace all literal text with placeholders before vocabulary design.
- Use placeholders such as `heading_1`, `heading_2`, `paragraph_1`, `callout_1`.
- Keep text roles, not the original prose.
- Make layout errors visible by removing content noise.
Vocabulary extraction
Infer the reusable scene vocabulary from the normalized assets.
- layout family
- component families such as panels, charts, connectors, legend, band, poster stack
- theme, tone, spacing, background motif, frame, emphasis, hero alignment
Compiler validation
Compile the new DSL back into placeholder SVG and compare it against the normalized asset.
- Require deterministic output.
- Require acceptable visual round-trip on the gold asset subset.
- Reject the spec if the compiler cannot express the asset family cleanly.
Training corpus
Train on the DSL and placeholder roles, not on raw shipped copy.
- Tokenize scene vocabulary plus placeholder text roles.
- Add content/context libraries later as a separate layer.
- Keep placeholder slot resolution external and deterministic when possible.
Probe discipline
Score both DSL exactness and compiled render exactness.
- A wrong DSL with right render means hygiene is bad but the contract may still be close.
- A right DSL with wrong render means the compiler is the problem.
- Do not read train loss as the final answer.
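The dual-scoring discipline above can be sketched as one scoring pass that keeps DSL exactness and render exactness separate. `compile_scene` is a hypothetical project hook; the assumption here is that it raises on DSL the compiler cannot express:

```python
def probe_scores(cases, compile_scene):
    """Score DSL exactness and compiled-render exactness separately,
    so a compiler bug is never misread as a model regression.

    cases: dicts with 'predicted_dsl' and 'expected_dsl'.
    compile_scene: project compile hook (hypothetical name); it is
    assumed to raise on invalid DSL, which counts as render failure.
    """
    dsl_exact = render_ok = 0
    for case in cases:
        if case["predicted_dsl"] == case["expected_dsl"]:
            dsl_exact += 1
        try:
            compile_scene(case["predicted_dsl"])
            render_ok += 1
        except Exception:
            pass
    n = len(cases)
    return {"dsl_exact": dsl_exact / n, "renderable": render_ok / n}
```

Reading the two numbers side by side gives exactly the diagnosis the bullets describe: wrong DSL with right render points at hygiene, right DSL with wrong render points at the compiler.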
HTML Report Contract After Every Spec or Run
Every run should leave behind a readable HTML report that answers the same questions in the same order. The point is not decoration. The point is to make the next decision obvious.
| Report Section | Question Answered | Recommended Content |
|---|---|---|
| Run card | What is this run? | Spec name, run id, date, model config, dataset version, primary axis changed, baseline compared against. |
| Hypothesis | Why was the run executed? | One paragraph stating what changed, why it should help, and what would count as success or rejection. |
| Held constant | What was intentionally not changed? | DSL/compiler/data/tokenizer/capacity matrix with one primary axis highlighted. |
| Data + tokenizer | What did the model actually see? | Row counts, packed token counts, effective epochs, tokenizer size, reserved tokens, placeholder coverage. |
| Compiler gate | Could the compiler express the target family? | Round-trip gallery on gold assets, render diffs, unsupported primitives, fallback paths. |
| Probe summary | Did the run work? | Exact, renderable, materialized exact, split metrics, per-layout metrics, delta vs baseline. |
| Product scorecard | Did it get closer to usable output? | Content-binding success, gold-asset parity status, family non-regression, and whether the output is visibly in-family. |
| Failure gallery | Where did it fail? | Show the remaining misses with prompt, expected DSL, actual DSL, content JSON or refs, compiled SVG, and a short diagnosis. |
| Lessons | What was learned? | Two or three concrete takeaways about representation, compiler, data, or budgeting. |
| Decision | What happens next? | Promote, reject, repair, or branch. Include the next run shape and the intended axis change. |
Beautiful means legible
A good report should show the metric headline, the delta from baseline, and the actual failure cards above the fold.
Use color sparingly: green for proven gain, amber for ambiguity, red for rejection, blue for structural notes.
Show the evidence
Every claim should anchor to an artifact: probe JSON, tested prompts report, compiler validation gallery, dataset profile, tokenizer stats.
End with a decision
The report is incomplete if it only says what happened. It must say what the next run should do and what should stay frozen.
Run summary template:

Spec:
Run:
Primary axis changed:
Held constant:
Baseline:
Hypothesis:
Data delta:
DSL delta:
Compiler delta:
Tokenizer delta:
Budget delta:
Capacity delta:
Preflight result:
Compiler round-trip result:
Probe result:
Failure clusters:
Decision:
Next run:
What The Experiments Proved
These conclusions did not come from theory alone. They came from repeated runs that failed, partially worked, regressed, recovered, and exposed the actual pressure points in the system. Across the full run history, the stable wins came from better contracts and better full curricula. Narrow raw repair churn became less reliable once a line was already mostly learned.
1. Evaluation bugs can masquerade as model collapse
We learned that decode/stop-marker mistakes and bad probe contracts can make a healthy model look broken. Evaluation infrastructure is part of the system, not an afterthought.
2. Freeze the winning baseline
Once a run becomes strong on the current contract, freeze it. Do not erase the baseline with broad speculative follow-ups.
3. Front-load predictable failures
Family drift, sibling-form confusion, style attractors, topology/count defaults, and stop leakage are usually visible in the contract before the first run. Teach those boundaries as clean positive coverage from day one instead of relying on post-hoc warning-language rows.
4. Use decode/repair before raw churn
When the remaining miss set is mostly syntax hygiene or mechanically repairable near-misses, deterministic decode/repair is the default path. Reopen raw training only for a true new branch such as a redesigned curriculum or a capacity test.
5. Separate asset, DSL, compiler, and content work
Runs became interpretable only after we started isolating which layer changed: richer scene vocabulary, richer compiler, broader asset family, or better structure-content separation.
6. Token granularity is its own spec axis
Whole-component tokens were useful as training wheels, but they should not remain the long-term boundary. Once keyed structure works, shrinking token granularity deserves a dedicated run.
How We Reached The Current Design
| Observed pattern | What it taught us | Design conclusion |
|---|---|---|
| Raw or flat structural targets could render, but were brittle and hard to steer. | The model was spending too much capacity on low-level surface decisions. | Move upward into a scene DSL. |
| A richer compiler improved visual output without requiring the model to emit raw gradients, paths, or markers. | Beauty and fidelity can often be improved at the compiler layer first. | Keep low-level SVG machinery compiler-owned. |
| Some runs got stronger exact-match by baking visible text directly into component tokens. | That can help short-term contract accuracy, but it mixes structure and content in the wrong place. | Separate scene structure from content payloads. |
| Narrow fixes sometimes improved the target slice and damaged solved slices elsewhere. | Repair runs need strong anchor replay and non-regression checks by family, not just aggregate score. | Gate every repair against the frozen baseline. |
| Loss often looked healthy even when contract behavior regressed. | Train loss is not the product metric. | Use probe, canary, compiler round-trip, and failure galleries as the real decision surface. |
How To Decide What Is Next
Diagnostic Matrix
After a run finishes, use the probe report, loss curve, and per-layout breakdown to decide what to do next. Do not guess — match the symptom pattern to the action.
| Symptom | Likely Cause | Action | Example |
|---|---|---|---|
| Low exact across all layouts | Undertrained (too few epochs or too little curriculum pressure) | Raise midtrain epochs from 1 to 2–3. Keep everything else frozen. | Example: 2/35 exact after 1 midtrain epoch, with all 5 layouts weak |
| Low exact on some layouts, others strong | Curriculum imbalance — weak families got less edit pressure | Add targeted negative/repair rows for the weak families. Anchor the strong families with replay. | Example: one over-weighted slice caused cascading failure across the weak families |
| High exact match, but rendered_svg_ok: null | Probe/compiler plumbing bug — model output is right but the render path fails silently | Fix the probe pipeline first. Do not retrain until the eval is trustworthy. | Example: exact=true but rendered_svg_ok=null because the accounting path was wrong |
| Loss drops below 0.1 but exact match stays low | The model memorized training distribution but not the DSL contract | Check tokenizer coverage, holdout prompt diversity, and whether edits cover all layout × topic combinations. | Example: loss converged but capability stayed near 0% because the representation was wrong |
| Loss stuck above 0.5 after full midtrain | Capacity wall, LR too low, or data packing issue | Check grad norms for saturation. Try LR sweep. Verify token packing fill rate (>0.85). | Example: 81% loss reduction but final loss stayed at 0.905, signaling a capacity or token-budget wall |
| Good exact on train prompts, bad on dev/test | Overfitting to train distribution | Widen holdout prompt coverage. Add more topic × layout combinations. Reduce epoch count if loss is very low. | Example: 100% train, 91.7% dev, 83.3% test, showing a visible but still manageable overfit gradient |
| Good exact on one spec, regression on next | DSL contract changed silently, or tokenizer/compiler mismatch | Diff the DSL grammar, tokenizer vocab, and compiler output between specs. Use canary probes from the previous best. | Example: new layout families were introduced and legacy families still passed, but the tokenizer/compiler contract drifted |
| 100% syntax valid, 0% semantic match | Model learned token order but not composition choices | Richer curriculum with edit pairs (wrong→right), not just direct generation examples. | Example: 100% syntactically valid DSL, 0% correct scene composition |
| Stage labels wrong in loss curve | Telemetry bookkeeping bug in pipeline — stage transitions not recorded | Fix the pipeline stage labeling. Do not trust source_stage field until verified. Use step counts to infer boundaries. | Example: all 665 steps were labeled pretrain even though midtrain steps were present |
| training_plan.json says "active" after run completed | Pipeline did not finalize stage status | Fix the plan update logic. Probe report should not depend on plan status for results. | Example: training_plan.json still reported midtrain after the run finished |
| One layout perfect until truncated — missing closing tags | Output exceeds context window — model learned the content but runs out of tokens | Measure output token count vs context length. Either compress the DSL (remove inferable fields) or raise context. Do not add more training data — the model already knows it. | Example: one family matched 2208/2606 characters perfectly, then truncated at the same point on every failure |
| Gold mapping token budget far exceeds context window | DSL is too verbose — carrying fields the compiler could infer | Run a DSL compression pass before defining the tokenizer. Target: output tokens < 80% of context length. Measure with a gold-budget report tool. | Example gold output: 2793 tokens at ctx = 512, or 5.5x over budget |
Severity key: Critical = fix before retraining; Warning = likely cause of weak results; Info = plumbing or telemetry issue.
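The symptom-to-action matching above can be sketched as a small triage function. This is an illustrative sketch only: the field names (`exact_rate`, `per_layout`, `loss_final`, `truncation_rate`, `rendered_svg_ok`) are assumptions, not the project's real probe-report schema.

```python
def triage(report: dict) -> str:
    """Map a probe report to the most likely next action (illustrative thresholds)."""
    per_layout = report.get("per_layout", {})
    exact = report.get("exact_rate", 0.0)
    loss = report.get("loss_final", 1.0)

    # Plumbing first: a broken eval makes every other metric untrustworthy.
    if report.get("rendered_svg_ok") is None:
        return "fix probe pipeline before retraining"
    # Budget failures masquerade as learning failures; separate them early.
    if report.get("truncation_rate", 0.0) > 0.1:
        return "measure output tokens vs context; compress DSL or raise context"
    # All layouts weak -> undertraining, not imbalance.
    if exact < 0.2 and all(v < 0.2 for v in per_layout.values()):
        return "undertrained: raise midtrain epochs, keep everything else frozen"
    # Some layouts strong, some collapsed -> curriculum imbalance.
    if per_layout and min(per_layout.values()) < 0.2 <= max(per_layout.values()):
        return "curriculum imbalance: targeted repair rows + anchor replay"
    # Converged loss with low exact -> representation/data problem.
    if loss < 0.1 and exact < 0.2:
        return "representation problem: check tokenizer coverage and holdout diversity"
    return "no matching symptom: inspect failure gallery manually"
```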
Reading the Probe Report
The probe report is the real decision surface, not the loss curve. A healthy loss curve with bad probe results means the representation or curriculum is wrong. A noisy loss curve with good probe results means the model learned despite training-signal noise.
| Metric | What It Tells You | Healthy Range | Red Flag |
|---|---|---|---|
| exact_match | Model output matches expected DSL token-for-token | >70% overall, >50% per layout | <20% after full curriculum = representation or data problem |
| renderable | Output parses and compiles to valid SVG | 100% (compiler contract must not break) | <90% = compiler or tokenizer bug, not model bug |
| rendered_svg_ok | Compiled SVG matches expected visual | Should track exact_match closely | null = probe pipeline broken, fix before interpreting other metrics |
| Per-layout breakdown | Which families the model learned vs. missed | All families above 50% | One family at 0% with others at 80%+ = curriculum imbalance |
| Train vs dev vs test split | Generalization gradient | Train≥dev≥test, gap <20pp | Train 100%, test 0% = severe overfitting |
Reading the Loss Curve
| Pattern | What It Means | Action |
|---|---|---|
| Smooth descent, final <0.05 | Model is memorizing the training set well | Check probe to confirm generalization, not just memorization |
| Descent plateaus at 0.1–0.3 | Model hit a capacity or data ceiling | If probe is also stuck, try more epochs, richer edits, or LR restart |
| Sudden spike at stage boundary | New data distribution — expected at pretrain→midtrain transition | Not a bug. Watch if it recovers within 50–100 steps |
| CK loss ≠ PT loss | Numerical parity broken | Stop training. Fix kernel parity before continuing. This should never happen. |
| Grad norms spike or collapse | Training instability — exploding or vanishing gradients | Check max_grad_norm clipping. If persistent, reduce LR or check data packing for degenerate sequences. |
| Loss oscillates without converging | LR too high, or data has conflicting signals (e.g., same prompt → different targets) | Verify dataset uniqueness. Try lower LR. Check for duplicate or contradictory training rows. |
Decision Actions
Use deterministic decode/repair first when the contract is almost closed
If the current representation is strong and the remaining misses are syntax-only or mechanically repairable, keep the raw winner frozen and improve the decode/repair layer first. Do not default to another raw repair rung on the same brittle surface.
Use a compiler step when visuals are still weak
If the generated infographic is structurally correct but still visually simple, improve the compiler and asset vocabulary before spending more model compute.
Use a tokenizer step when outputs are too brittle
If the model only behaves well when whole component rows are reserved atomically, run a token-granularity spec next instead of adding more layouts or model size.
Use a DSL step when the representation is still too literal or too loose
If the model surface still carries inferable fields, arbitrary prose, or non-canonical structure, fix the DSL before scaling the model.
Use a structure-content split when generalization is the problem
If the scene language is still carrying visible prose or one-off values, branch to keyed structure plus separate content.json. The model should emit refs like @section_card.0.title, not literal asset prose inside the component token.
Use clean contrast coverage when planning is still wrong
If a failure is truly still in training, add compiler-valid contrast sets and boundary cases, not warning-language rows about [OUT], wrappers, or singleton tags. The curriculum should teach the right choice, not narrate what not to do.
Promotion Rules
- Promote a run only if it improves the primary target metric without regressing the already-solved slices.
- Reject a run if it changes too many axes to interpret, even if the final number looks better.
- When a frozen raw winner exists, default to decode/repair and block more same-surface raw repair rungs until the next training idea is a true new branch.
- Branch a new spec only when the current spec ceiling is real and well measured.
- Make compiler fidelity a hard gate: use a gold pack, not just a couple of hand-picked examples.
- Scale model size only after the DSL and compiler contracts are stable.
- Keep run artifacts in ~/.cache/ck-engine-v7/models/train/; keep curated project docs in the repo.
- Measure output token budget against context window before training. If the longest gold mapping exceeds 80% of context length, compress the DSL or raise context first.
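The promotion rules above can be condensed into a single gate. A minimal sketch, assuming hypothetical report fields `primary_metric` and `per_family`; the tolerance parameter is an illustrative knob, not a project value.

```python
def promote(candidate: dict, baseline: dict,
            solved_families: list[str], regression_tol: float = 0.0) -> bool:
    """Accept a run only if the primary metric improves AND no already-solved
    family regresses past the tolerance (non-regression is a hard gate)."""
    if candidate["primary_metric"] <= baseline["primary_metric"]:
        return False  # no improvement on the primary target: reject
    for fam in solved_families:
        if candidate["per_family"][fam] + regression_tol < baseline["per_family"][fam]:
            return False  # regression on a solved slice blocks promotion
    return True
```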
Spec Versus Run
The project uses spec[x] and r[y] for a reason. A spec is a new training contract. A run revision is a controlled iteration inside the same contract.
| Level | Meaning | What May Change | What Should Stay Fixed |
|---|---|---|---|
| spec[x] | A new experiment question and a new learning boundary | DSL, prompt surface, compiler ownership, token granularity, output family, structure/content split | The project goal and the evaluation discipline |
| r[y] | A revision inside one spec | Repair rows, replay ratios, token budgets, epochs, balance, decode hygiene, capacity after the contract is stable | The DSL contract, the probe target, and the main question |
Start a new spec when the current ceiling is real and measured. Stay inside the same spec when the failure is narrow and the representation still matches the intended product boundary.
Working Definitions
| Term | Meaning Here | Example |
|---|---|---|
| Trainable contract | The exact input/output behavior the model is being taught, with a form that can be evaluated reliably. | prompt -> scene.dsl with exact-match and render checks. |
| Representation | The model-facing format for the task: the grammar, fields, token shapes, and structure it must predict. | Flat SVG atoms versus a scene DSL with layout, theme, and component blocks. |
| Data | The actual examples used to teach the contract, including prompts, targets, repairs, anchors, and holdouts. | Direct gold rows, topic-swap rows, close-tag continuation rows, or negative contrast rows. |
| Compiler | The deterministic layer that turns structured outputs into the final product artifact. | scene.dsl + content.json -> SVG. |
| Eval | The measurement layer that decides whether the run actually improved the desired behavior. | Exact match, renderability, materialized exactness, and per-family breakdowns. |
| Canary | A tiny, cheap, high-signal test slice run before a full job to catch obvious format or compiler failures. | 12 prompt cases that must parse and render before the real training run starts. |
| Eval gate | A pass/fail condition that must hold before a run is accepted or promoted. | 100% renderability on the canary or no regression on solved families. |
| Ablation | A controlled experiment where one factor is changed while the rest stays fixed, so the effect can be interpreted. | Keep the DSL fixed and change only midtrain epochs from 1 to 3. |
| Data mixture control | Deliberately choosing how much of each example type the model sees. | 40% anchor replay, 40% direct rows, 20% repair rows. |
| Failure taxonomy | A named breakdown of failure classes so the next action is chosen from evidence instead of guesswork. | Grammar failure versus compiler failure versus capacity failure. |
| Repair run | A narrow run that targets a known failure slice without redefining the overall contract. | Add transition rows to fix one broken family in the current baseline line. |
| Replay / anchor rows | Stable rows from already-solved behavior that are kept in the curriculum to prevent regressions. | Keep strong decision_tree rows present while repairing table_matrix. |
| Prompt surface | The information exposed on the input side of the task. | Explicit [layout:...] prompts versus intent prompts with only topic + goal + audience. |
| System boundary | The line between what the model must learn and what deterministic systems should own. | The model chooses scene structure; the compiler owns exact geometry and gradients. |
| Canonicalization | Forcing one stable legal form for equivalent outputs so exact-match metrics mean something. | One legal field order for scene attributes, not many interchangeable spellings. |
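The canonicalization definition above is the one most often implemented as code: force one legal attribute order so equivalent outputs collapse to a single exact-match form. A minimal sketch, assuming a made-up attribute list; the real canonical order belongs to the DSL contract.

```python
# Hypothetical canonical field order for the scene header (illustrative only).
CANONICAL_ORDER = ["layout", "theme", "tone", "density"]

def canonicalize_header(attrs: dict) -> str:
    """Emit scene-header attrs in the single legal order.

    Iterating the canonical list (not the input dict) fixes ordering and
    silently drops unknown keys, so exact-match comparisons stay meaningful.
    """
    parts = [f"[{k}:{attrs[k]}]" for k in CANONICAL_ORDER if k in attrs]
    return " ".join(parts)
```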
Use a new spec when the question changed
Examples: move from flat atoms to scene DSL, move from explicit layout prompts to intent prompts, or split visible content out into content.json.
Use a new run when the contract is right but weak
Examples: add closure repairs, strengthen replay, rebalance families, or raise effective epochs for the same grammar.
Reject blurry runs
If a run changes grammar, curriculum, and capacity at once, it may still improve a metric, but it will not teach much. That is wasted research signal.
Asset Scaling Strategy
Do not try to train on a large asset library immediately. Prove each capability level first, then widen. The pattern is: memorize → generalize → expand.
| Phase | Gold Assets | Goal | Pass Criteria | What Failure Means |
|---|---|---|---|---|
| Memorization | 3 hand-mapped | Can the model reproduce 3 exact gold scenes from their prompts? | 100% exact on train, compiler round-trips all 3 | DSL, tokenizer, or context window is wrong — fix before adding more data |
| Held-out generalization | 3 gold + 3 synthetic variants | Can the model produce correct scenes for unseen topic × layout combinations? | >70% exact on dev/test splits | Curriculum needs more edit diversity — add topic swaps, density changes, theme variations |
| Family expansion | 7–10 gold across 5+ families | Does adding new layout families break already-learned ones? | >70% exact overall, no per-family regression below 50% | Anchor replay too weak — increase replay ratio for stable families |
| Compositional generalization | 10+ gold, compositional tokens | Can the model compose components it has seen in new combinations? | >60% exact on novel layout × component combinations | Token granularity too coarse — break monolithic tokens into compositional pieces |
| Open planning | 20+ gold, underspecified prompts | Can the model choose layout family from an ambiguous request? | Reasonable family choice >80%, compiler renders successfully >90% | Model needs more prompt diversity and possibly larger capacity |
Each phase should be a separate spec or run. Do not skip phases — a failure at phase 2 means phase 3 data will be wasted compute. The gold asset count is a guide, not a rule. What matters is that each phase answers its specific question before the next one starts.
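The phase ladder above can be expressed as explicit gates so "which phase are we in?" is answered by the report, not by memory. A sketch under assumptions: the report field names and the first three gates mirror the table, but are not the project's real schema.

```python
# Each phase gate must pass before the next phase's data is worth spending.
PHASE_GATES = [
    ("memorization",      lambda r: r["train_exact"] == 1.0 and r["roundtrip_ok"]),
    ("generalization",    lambda r: r["dev_exact"] > 0.7),
    ("family_expansion",  lambda r: r["overall_exact"] > 0.7 and r["min_family"] >= 0.5),
]

def next_phase(report: dict) -> str:
    """Return the first phase whose gate fails; that is where work continues."""
    for name, gate in PHASE_GATES:
        if not gate(report):
            return name
    return "compositional_generalization"
```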
How To Build Training Intuition Without Frontier Compute
The useful lesson from frontier work is not that every internal circuit is understood. The useful lesson is that model behavior can still be shaped in predictable directions through disciplined control of data, interfaces, budgets, and evaluation.
| Layer | Question | What To Learn |
|---|---|---|
| Distribution | What experiences is the model compressing? | Data mix, replay pressure, repair rows, holdouts, contradiction checks |
| Interface | What problem is the model actually being asked to solve? | Prompt contract, DSL scope, structure/content split, canonical ordering |
| Budget | Did the model see enough clean signal to learn the task? | Effective epochs, packed token budgets, context usage, canary gates |
| Capacity | Is the model too small, or is the task still badly shaped? | Only scale after grammar, compiler, data, and probe paths are stable; large rung-to-rung swings on one small model usually mean recipe trouble before capacity trouble |
| Evaluation | Are the right things being measured? | Probe exactness, renderability, materialized exactness, per-family breakdowns, non-regression |
| System boundary | What belongs in the model versus the compiler or content system? | Keep deterministic rendering and data retrieval out of the model when possible |
In other words, practical intuition comes from asking: what changed, why did behavior change, and which layer actually caused it? This is why the spec/run method matters. It turns training into a ledger of answers instead of a collection of lucky outcomes.
What Frontier Labs Usually Know
The claim that "no one knows how LLMs work" is too broad to be useful. A better version is this: full mechanistic theory is still incomplete, but empirical control is strong, and product/system control is often stronger still.
In practice, serious teams may not be able to explain every internal circuit, but they can still learn, with real discipline, that a particular data mixture, architecture, scale, objective, post-training recipe, and evaluation set tends to produce particular kinds of behavior. That is not total understanding, but it is real engineering knowledge.
Why Data Curation Still Matters
Scaling does not erase the training distribution. The model still compresses what it sees. Bad data creates bad priors, noisy mixtures create unstable behavior, narrow data creates narrow generalization, and well-shaped data creates cleaner abstractions.
This is why data curation remains central even without a complete theory of generalization. The practical loop is still: define the contract, shape the data to teach that contract, measure the right behavior, and repair the actual failure layer.
The Failure-To-Repair Loop
In practice, much of the progress comes from a simple discipline: observe the weakness, classify the weakness, teach that weakness directly, and rerun from the best relevant checkpoint when the spec has not changed.
This is not the same as "keep adding more data." Before adding rows, decide whether the failure is mainly in the data, DSL, compiler, tokenizer, token budget, decode hygiene, or capacity. Then fix the right layer.
- If the model misses [/scene], add closure and continuation rows and tighten stop hygiene.
- If the model confuses two families, add contrast rows and family-local anchors.
- If the DSL is too verbose or ambiguous, start a new spec instead of piling on more data.
- If the compiler cannot express the target asset, fix the compiler before retraining.
- If a seeded checkpoint suddenly probes as all-zero before any new training, stop and verify seed/probe integrity before blaming the curriculum.
- If "negative" cleanup rows are added under ordinary cross-entropy supervision, remember they are still positive targets for the model; treat contamination rows as hazardous, not automatically corrective.
A practical rule is: same spec -> usually continue or rerun inside the same family; new spec -> usually start a new line. This is how failures become supervision instead of wasted compute.
Run Failure Matrix
The most useful next step is not "train more." It is to name the failure class correctly. The table below is the default matrix for interpreting a run before deciding whether to repair the data, change the DSL, fix the compiler, or scale capacity.
| Failure Class | What It Looks Like | What To Track | Likely Cause | Typical Fix |
|---|---|---|---|---|
| scene_prefix_failure | Missing [scene], missing [layout:...], duplicated top-level attrs | start-valid rate, missing-layout rate, duplicate-attr rate | weak canonical anchors, too much fragment training | add full-scene canonical rows and prefix-only repair rows |
| scene_suffix_failure | Missing [/scene] | close-tag miss rate | weak termination training, decode stop/budget issues | add close-tag rows, verify stop markers, verify decode budget |
| block_nesting_failure | wrong closing tag, invalid nesting, repeated block open | nesting error rate by block type | weak block grammar, fragment-heavy repair rows | add balanced block rows, transition rows, canonical block-order rows |
| budget_truncation | Output is a correct prefix but cut off | truncated_at_budget rate, prompt/output token counts | decode budget too small or context too small | raise decode budget first, then context only if needed |
| special_token_leak | <\|bos\|>, <\|eos\|>, or prompt tokens leaking inside scene output | special-token leak rate | tokenizer boundary contamination, bad row boundaries | strip/control special tokens and strengthen scene-only targets |
| contamination_supervision | cleanup or restart rows teach junk surfaces directly; outputs collapse to empty strings, stray single tokens, or contaminated prefixes after a repair push | empty-response rate, missing-scene-start rate, contamination-token frequency in training rows, before/after repair-row deltas | synthetic corruption was mixed into ordinary CE targets and learned as part of the distribution | remove contamination rows, protect full-scene replay anchors, and rerun a tiny canary before any broader repair pass |
| layout_drift | Wrong family or empty layout | per-layout confusion matrix | family overlap, weak family anchors | more direct family rows and layout-class repair rows |
| theme_tone_drift | wrong or duplicated theme/tone attrs | theme/tone confusion matrix, duplicate-attr rate | weak top-level canonicalization, over-repair | canonical scene-header rows and dedupe rules |
| renderable_but_not_exact | SVG compiles but scene DSL is off | exact vs renderable gap | semantic drift, ordering drift | targeted exactness repair rows |
| exact_but_not_materialized | scene matches but final SVG differs | materialized-exact gap | compiler or content-binding bug | fix renderer/probe path, not training |
| family_imbalance | one family learns, one collapses | per-family exact/renderable/materialized | data imbalance or family-specific grammar difficulty | family-specific anchors and family-weight tuning |
| undertraining | high loss, broad failure everywhere | loss curve, steps per epoch, token budget | too little budget for the grammar difficulty | raise epochs or total tokens |
| over_repair_fragmentation | valid local fragments but corrupted full scenes after a repair push | renderable drop after repair-row increase, local grammar error counts | too many fragment rows relative to clean full scenes | reduce fragment ratio and add more clean full-scene anchors |
| compiler_parity_gap | the model may be fine but the target family still looks weak | gold asset parity score | compiler not expressive enough | do a compiler pass before more training |
| probe_accounting_bug | obviously good outputs score wrong | mismatch between exact, renderable, and materialized evidence | reporting or probe bug | fix probe/report path first |
| seed_probe_integrity_gap | a copied "good" checkpoint probes as broken before any new training | seed-only probe result, artifact hash equality, tokenizer/template sidecar parity, live-vs-historic baseline repro | seed staging bug, probe-runner drift, or decode/runtime drift rather than new training damage | add a seed-only probe gate and repair staging/probe integrity before launching another rung |
| capacity_misdiagnosis | operators want a bigger model because a few repair runs failed, but the same architecture already moved from near-zero to strong probe scores on cleaner recipes | best-vs-worst rung spread on the same architecture, family stability under recipe changes, plateau across multiple clean canaries | recipe or evaluation instability is being mistaken for a hard parameter limit | stabilize the curriculum and probes first; scale only after several clean recipes plateau on the same failure class |
Minimum Run Scoreboard
Every run should publish the same small scoreboard. This keeps comparisons honest and makes failure classes visible without reading raw prompt dumps first.
| Metric | Why It Matters |
|---|---|
| exact_rate | scene contract fidelity |
| renderable_rate | structural validity |
| materialized_exact_rate | final compiler truth |
| budget_truncation_rate | separates learning failure from budget failure |
| missing_scene_start_rate | top-level grammar health |
| missing_scene_end_rate | termination health |
| duplicate_attr_rate | canonicalization health |
| block_nesting_error_rate | nested grammar health |
| special_token_leak_rate | tokenizer/output contamination |
| per-layout exact/renderable/materialized | family-specific diagnosis |
| train/dev/test split rates | overfit detection |
| gold_asset_parity_score | compiler readiness |
A simple interpretation rule helps keep the read honest: low exact with high renderable usually means semantic or ordering drift; low renderable with low truncation usually means grammar corruption; high truncation is a budget problem; high exact with low materialized exact is a compiler or probe bug.
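That interpretation rule is mechanical enough to encode. A sketch with illustrative thresholds (the cut-offs are assumptions, not calibrated project values); the branch order matters, since a budget problem should be ruled out before blaming the grammar.

```python
def read_scoreboard(exact: float, renderable: float,
                    materialized: float, truncation: float) -> str:
    """Apply the scoreboard interpretation rule with illustrative thresholds."""
    if truncation > 0.2:
        return "budget problem"                      # high truncation dominates
    if exact < 0.3 and renderable > 0.9:
        return "semantic or ordering drift"          # valid but wrong scenes
    if renderable < 0.5 and truncation < 0.1:
        return "grammar corruption"                  # broken outputs, not cut off
    if exact > 0.7 and materialized < 0.3:
        return "compiler or probe bug"               # model fine, pipeline suspect
    return "mixed signal: check per-family breakdown"
```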
Repair Ladder Lessons
A recurring mistake in small-to-medium DSL runs is to treat every bad output as a direct template for the next repair row. That is too naive. The model does not understand that a row was meant as a warning; under ordinary supervised training it only sees another target distribution to imitate.
- Do not treat synthetic corruption rows as automatically "negative data." Under cross-entropy, they are still positive supervision unless a different objective is used.
- Protect full-scene replay anchors from the last good rung. Local continuation repairs should stay subordinate to full canonical scenes.
- Separate three gates before training: probe-path integrity, seed-copy integrity, and then curriculum quality. A broken probe can look exactly like a broken rung.
- If one small architecture already spans a wide quality range across runs, the next bottleneck is usually recipe shape, not model size.
- Use tiny seeded canaries after a failure. If a 0.1x or 0.2x corrective run still destabilizes the same contract, pause rung-chasing and rethink the materializer itself.
In other words: do not use bigger models to paper over a broken supervision surface. Scale capacity when the contract is stable and the same clean failure persists across multiple disciplined canaries. Until then, the higher-signal move is to improve the curriculum, replay anchors, and measurement path.
What To Copy From Frontier Practice
- Use proxy runs before expensive runs.
- Design evals aggressively instead of trusting loss alone.
- Run controlled ablations instead of relying on folklore.
- Keep explicit failure taxonomies.
- Control data mixtures carefully.
- Separate base learning, post-training, and system scaffolding.
- Do not expect one run to answer five different questions at once.
Ethical Scaling Path
Going from something simple to something bigger should not mean removing constraints faster than understanding improves. The safer path is to widen capability in stages while keeping the boundaries of responsibility clear.
| Stage | Ethical Rule | Why It Matters |
|---|---|---|
| Bounded contract | Start with a narrow, measurable task. | Prevents vague demos from being mistaken for robust capability. |
| Honest boundaries | State clearly what the model does, what the compiler does, and what external data does. | Keeps the system understandable and avoids false claims about generality. |
| Bridge prompts | Move from explicit prompts to weaker prompts gradually. | Avoids turning one experiment into language learning, planning, retrieval, and rendering all at once. |
| Hard gates | Require non-regression, canaries, and probe honesty before widening scope. | Stops capability drift from being hidden behind bigger compute. |
| Human oversight | Keep humans in the loop for high-stakes domains and claims. | Capability should expand with accountability, not just with confidence. |
Ethical scaling is not only about policy. It is also about research honesty: know what the model is really doing, know what the compiler is doing, know what the content system is doing, and report those boundaries clearly.
Pre-Training Checklist
Before launching any training run in a new spec, verify all of these gates. If any fails, fix it before spending compute.
| Gate | Check | Tool |
|---|---|---|
| Compiler parity | Every gold scene.dsl + content.json compiles to valid SVG that matches the reference asset | Compiler round-trip test against gold pack |
| Token budget | Longest gold output fits within 80% of context window after tokenization | Gold-budget report tool or equivalent |
| Tokenizer coverage | All DSL tokens in the gold pack are in the tokenizer vocabulary — no <unk> fallbacks | Tokenize each gold scene and check for unknown tokens |
| Parity regimen | CK ↔ PyTorch forward/backward/optimizer parity passes at the target model size | training_parity_regimen_latest.json |
| Dataset QC | All training rows parse, no duplicates, holdout prompts are disjoint from train | Dataset QC step in training pipeline |
| Failure-frontier forecast | The curriculum blueprint names the predictable failure classes and shows which surfaces teach each one before the first run starts | spec17_curriculum_blueprint.json + audit_curriculum_blueprint_v7.py |
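The token-budget gate from the checklist can be sketched in a few lines. Here `tokenize` is a stand-in for the project's real tokenizer (an assumption for illustration); the 80% fill threshold comes from the promotion rules above.

```python
from typing import Callable, Iterable

def gold_budget_ok(gold_outputs: Iterable[str],
                   tokenize: Callable[[str], list],
                   ctx_len: int, max_fill: float = 0.8) -> tuple[bool, int]:
    """Check that the longest gold mapping fits within max_fill of context.

    Returns (ok, worst_token_count) so a failing gate reports how far over
    budget the worst gold output is, e.g. the 2793-token vs ctx=512 case above.
    """
    worst = max(len(tokenize(g)) for g in gold_outputs)
    return worst <= int(ctx_len * max_fill), worst
```

In practice this runs before the tokenizer and curriculum are frozen; a failing gate means "compress the DSL or raise context", never "train anyway".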
What A Solved Structured-DSL Run Actually Proves
A solved spec[x] rung[y] does not mean the model learned open-ended infographic design. It means something narrower and more useful: the model can reliably map a bounded prompt contract to a compiler-facing scene DSL, and that DSL can then be combined with external content.json to produce exact SVG.
- The model is responsible for the scene program: layout family, scene components, refs, and top-level style fields.
- The compiler is responsible for deterministic layout and SVG assembly.
- The external content payload is responsible for visible copy, data values, and topic-specific facts.
Keep those responsibilities separate. When adding new DSL families, do not blend asset identity into the output grammar unless the structure truly requires it. Case ids, topic-specific facts, and payload-specific copy should stay in prompt text, routing metadata, and external content.json wherever possible. The output DSL should stay as generic as the renderer contract allows.
This boundary matters. A strong spec[x] rung[y] result proves the training method can stabilize structured program generation. It does not yet prove arbitrary topic generalization, arbitrary tree depth, free-written infographic copy, or unconstrained scene planning.
Prompt-Surface And Decode-Boundary Discipline
Hidden eval should cover at least two different generalization surfaces:
- Prompt-surface robustness: paraphrases, wording shifts, and bridge prompts that preserve the same underlying task.
- Semantic breadth: new topics, new assets, or new content payloads that test whether the contract scales beyond the first few gold mappings.
Solving prompt paraphrases is not the same as solving new semantics. Treat those as different milestones and report them separately.
Also keep decode-boundary hygiene separate from model quality. In raw CLI inference, a scene DSL model may generate a correct [scene] ... [/scene] block and then continue into the next training-style prompt if no structural stop marker is supplied. That is a decode configuration issue, not automatically a training failure. Measure the first valid scene block, then fix stop hygiene at the inference boundary.
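The "measure the first valid scene block" step can be sketched with a non-greedy regex. The `[scene]`/`[/scene]` tags follow the DSL shown in this document; the regex approach itself is an assumption, not the project's real decode path.

```python
import re

# Non-greedy match so we capture only the FIRST [scene]...[/scene] block,
# ignoring any training-style continuation the model emits afterwards.
SCENE_RE = re.compile(r"\[scene\].*?\[/scene\]", re.DOTALL)

def first_scene_block(raw_output: str) -> "str | None":
    """Extract the first complete scene block from raw CLI inference output."""
    m = SCENE_RE.search(raw_output)
    return m.group(0) if m else None
```

Scoring the extracted block, rather than the raw output, keeps decode-boundary hygiene out of the model-quality metrics, as the paragraph above argues.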
Next Bridge After A Solved Explicit-Layout Line
The next step after a solved explicit-layout line is not open-ended prose. It is a bounded intent bridge where the prompt still names the topic and goal, but stops prescribing the scene directly.
The short rule is:
- keep content.json external
- keep the compiler deterministic
- let the model infer layout, theme, tone, density, and related scene-planning fields from topic + goal + audience
- measure planning quality separately from syntax and renderability
Example intent-bridge prompt:
[task:svg] [topic:topic_id] [goal:goal_id] [audience:audience_id] [OUT]
This keeps the task bounded. The model is not being asked to learn open-domain language or free-written infographic copy. It is being asked to choose a good scene plan under weaker prompt control.
spec17 established the bounded-intent bridge shape and spec18 tested a routing-first curriculum on the same frozen stack, but held-out exactness still stayed at zero. The next public recommendation is therefore spec19: a textbook-routing branch with named mixture buckets, denser minimal-pair routing coverage, and capacity held back as a separate fallback lever. See spec19-textbook-routing-mixture.html.
Progress Goal After The Last Good Baseline
The immediate goal after the last good baseline is not “teach the model English.” It is narrower and more useful: preserve compiler-backed precision while adding more SVG families and better within-family generalization.
- Keep the renderer deterministic and visually strong.
- Keep semantic payloads external until the family contract is stable.
- Add new families one at a time so failure causes stay legible.
- Judge progress by exact/materialized SVG quality and held-out family transfer, not by conversational fluency.
Related Pages
v7 Runbook
The concrete operator commands, parity gates, and train/infer workflow.
Training Intuition
The deeper failure-analysis and checkpointing page with the Phase 1–7 diagnostic matrix for training dynamics (gradients, attention, weights).
Training Curriculum
The long-range CK-native learning ladder from v7 through later versions.
Spec17 Curriculum
The bounded intent-bridge blueprint, failure-frontier map, and stage mix planned after the frozen spec16 winner.
Spec19 Routing Mixture
The current recommended next branch: named mixture buckets, textbook-style routing rows, and a clean capacity fallback if held-out exactness stays flat.
Version History
The public roadmap showing how each version track feeds the next capability layer.
Spec / Run Discipline
The versioned internal note explaining why the project uses spec[x] and r[y], and how that method builds training intuition.
version/v7/reports/SPEC_RUN_DISCIPLINE_AND_TRAINING_INTUITION_2026-03-18.md