Training Method

Methodical DSL Training Methods

A spec is not just a dataset or a model run. It is a contract between an asset library, a scene DSL, and a compiler. The model should learn structure against a stable contract, not absorb random changes in vocabulary, content, and rendering all at once.

[Figure: contract pipeline: asset library → placeholder content → scene DSL → compiler gate → tokenizer gate → probe report]
[Gallery: compute-bandwidth chasm, activation memory analysis, and operator spectrum map infographics]
Core rule: the model always sees both structure and content placeholders, but an experiment should have only one primary teaching target. If the DSL, the compiler, the asset library, the tokenizer, the token budget, and the model size all move at once, the run becomes uninterpretable.
Page split: keep this page for the stable method only. Use spec-training-results.html for the generated rung history, best-run ladder, and recent regressions, and keep bulky raw artifacts under the run archive rather than turning this page into a log dump.
Local lesson from spec02 through spec16: broad spec-level progress has been real, but once a line already has a strong raw winner, post-hoc raw repair rungs have regressed more often than they helped. The default policy is now: predict likely failure frontiers before training, encode them as clean positive coverage in the initial curriculum, freeze strong raw winners, and route syntax-only residue to deterministic decode/repair instead of reopening the same brittle CE surface.
Post-gen1 method update: the first true broad-contract scene-DSL run fit the widened surface far more effectively than the older narrow spec/rung loop. That does not prove every scaling-law claim, but it does justify the current policy shift: broaden the clean compiler-backed corpus first, then harden the eval, then change model size only after the frozen broad contract has been tested under recombination pressure.

Why This Method Exists

This curriculum is not just local preference. It follows the same broad logic seen in scaling-law work, compute-optimal training research, and capability-predictability discussions: use smaller and cheaper experiments to validate the contract, the data, the tokenizer surface, and the evaluation loop before spending serious compute on larger runs.

Scaling Laws

OpenAI's scaling-laws work is the clearest public reference for why small and medium runs are useful first: they help forecast how loss changes with model size, data size, and compute before larger training runs.

Outcome: larger trends can be forecast from smaller controlled experiments.

Applied here: use cheap spec runs to validate DSL, tokenizer, and eval contracts before spending serious training compute.

OpenAI — Scaling Laws for Neural Language Models

Compute-Optimal Training

DeepMind's Chinchilla result is the classic argument against blindly scaling parameters without checking data and token budgets. It supports the habit of using many smaller runs to find the recipe before the expensive run.

Outcome: better results often come from the right model/data/token balance, not just a bigger model.

Applied here: compute packed token budgets and effective epochs per spec instead of copying old totals into new DSLs.

DeepMind — Training Compute-Optimal Large Language Models
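The "compute packed token budgets and effective epochs per spec" habit can be made concrete with a small calculator. This is a minimal sketch under illustrative assumptions: the corpus size, batch size, and step count below are examples, not project constants, and dense packing with no padding is assumed.

```python
# Sketch: compute packed token budget and effective epochs for a spec run.
# All numbers here are illustrative assumptions, not project constants.

def packed_windows(corpus_tokens: int, ctx: int) -> int:
    """Number of packed training windows, assuming dense packing with no padding."""
    return corpus_tokens // ctx

def effective_epochs(steps: int, batch_size: int, ctx: int, corpus_tokens: int) -> float:
    """How many times the packed corpus is seen during training."""
    tokens_seen = steps * batch_size * ctx
    return tokens_seen / corpus_tokens

# Example: a 2M-token DSL corpus at ctx=512.
corpus = 2_000_000
print(packed_windows(corpus, 512))  # → 3906 windows
print(round(effective_epochs(steps=4000, batch_size=16, ctx=512,
                             corpus_tokens=corpus), 2))  # → 16.38
```

An effective-epoch count in the double digits on a small corpus is exactly the kind of number this gate is meant to surface before copying an old total into a new DSL.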

Predictability vs Surprise

Anthropic's paper is a useful reminder that broad trends can be predictable even while specific capabilities and outputs remain surprising. That is why this project insists on probes, canaries, and run reports rather than loss alone.

Outcome: aggregate scaling behavior can be smooth while concrete behaviors still surprise operators.

Applied here: every run needs split-aware probes, failure galleries, and promotion rules instead of trusting train loss.

Anthropic — Predictability and Surprise in Large Generative Models

Large-Scale Recipe Practice

Meta's Llama 3 paper is a useful public example of a model family trained with scaling-law thinking across multiple sizes, rather than one single blind flagship run.

Outcome: a family of models is often used to refine the recipe, not just to ship different sizes.

Applied here: keep `spec` lines interpretable at small scale first, then widen to mixed training and larger CK-native runs later.

Meta — The Llama 3 Herd of Models

Project Curriculum

The CK-native version ladder explains how the current visual DSL work fits into the broader path from compiler-backed training to page DSLs, code/data/tool IR, and eventually mixed coding/scientific tasks.

Outcome: visual DSL work is treated as foundation-building, not the final destination.

Applied here: move from narrow-family SVG work toward page DSLs, code/data/tool IR, and broader mixed models.

Open training-curriculum.html

Failure Visibility

The local intuition page explains why this method emphasizes canaries, parity gates, checkpoints, and structured post-run reports instead of random trial-and-error.

Outcome: visibility turns failed runs into reusable knowledge instead of wasted compute.

Applied here: every spec should leave behind a fixed HTML report with hypothesis, deltas, failures, and a decision.

Open training-intuition.html

Three Separate Contracts

1. Asset/Data Contract

The asset library is the visual reference set and the context library is the semantic reference set.

  • Source from docs/site/assets/*.svg or later public asset libraries.
  • Strip literal text into placeholders first.
  • Track layout family, composition, theme, color system, shapes, connectors, charts, and text roles.
  • Do not let real copy hide layout mistakes.

2. DSL Contract

The DSL is the model boundary. It should describe composition and roles, not raw SVG bookkeeping.

  • Emit scene choices such as layout, theme, rail, density, gap, background, and components.
  • Use placeholders like heading_1, paragraph_1, badge_1.
  • Keep canonical token order and fixed arity.
  • Remove inferable fields from the model surface.
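The canonical-order and fixed-arity rules above can be enforced mechanically rather than learned. A minimal sketch, assuming an illustrative CANONICAL_ORDER list drawn from the worked example below; the real key schema lives in the project's DSL contract:

```python
# Sketch: enforce canonical token order for scene header tokens.
# CANONICAL_ORDER is an illustrative assumption, not the project's real schema.

CANONICAL_ORDER = ["canvas", "layout", "theme", "tone", "frame", "density",
                   "inset", "gap", "hero", "columns", "emphasis", "rail",
                   "background", "connector", "topic"]

def canonicalize(tokens: list[str]) -> list[str]:
    """Sort [key:value] header tokens into canonical order; reject unknown keys."""
    pairs = {}
    for tok in tokens:
        key, _, value = tok.strip("[]").partition(":")
        if key not in CANONICAL_ORDER:
            raise ValueError(f"unknown scene key: {key}")
        pairs[key] = value
    return [f"[{k}:{pairs[k]}]" for k in CANONICAL_ORDER if k in pairs]

print(canonicalize(["[theme:infra_dark]", "[layout:poster_stack]", "[canvas:tall]"]))
# → ['[canvas:tall]', '[layout:poster_stack]', '[theme:infra_dark]']
```

Canonicalizing training targets this way keeps token order out of the learning problem entirely, which is the point of the contract.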

3. Compiler Contract

The compiler owns geometry, defs, gradients, markers, wrapping, and final render semantics.

  • Compile scene DSL into SVG with deterministic layout rules.
  • Guarantee round-trip reconstruction before any serious training run.
  • Keep rendering policy stable across runs unless the run is explicitly a compiler experiment.
  • Treat compiler regressions separately from model regressions.
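The round-trip guarantee can be expressed as a small gate script. This is only a sketch: compile_scene is a stand-in stub for the real compiler, and the structural fingerprint is an assumed comparison heuristic, not the project's actual parity metric.

```python
# Sketch of a round-trip gate: compile each gold scene twice, require
# byte-identical output (determinism), then compare a structural fingerprint
# against the normalized reference. compile_scene is a stand-in stub.
import hashlib
import re

def compile_scene(scene: str) -> str:
    # Stand-in: a real compiler would emit full SVG; here we emit a stub.
    return f"<svg><!-- {scene} --></svg>"

def structural_fingerprint(svg: str) -> str:
    """Hash only element names, ignoring text payloads and attribute values."""
    tags = re.findall(r"<([a-zA-Z]+)", svg)
    return hashlib.sha256(" ".join(tags).encode()).hexdigest()

def round_trip_gate(scene: str, reference_svg: str) -> bool:
    first, second = compile_scene(scene), compile_scene(scene)
    if first != second:
        return False  # non-deterministic compiler: hard fail
    return structural_fingerprint(first) == structural_fingerprint(reference_svg)

print(round_trip_gate("[scene][layout:poster_stack][/scene]", "<svg></svg>"))  # → True
```

The key property is that a determinism failure and a fidelity failure are reported as compiler problems, never as model regressions.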

These are real assets from the project library — the same asset families the training pipeline learns to plan and compose.

What We Are Actually Training

In this line, the model is not being trained to be a free-form SVG artist. It is being trained to choose structured visual intent against a stable compiler contract.

Composition

Composition means the overall information shape: compare two systems, show a pipeline, build a poster stack, or arrange dashboard cards.

Examples: poster_stack, comparison_span_chart, pipeline_lane.

Layout Family

Layout is the reusable geometry family inside the composition. It decides where headers, panels, tables, bars, and connectors belong.

Examples: one tall poster, two compare panels, three-column dashboard, staged pipeline lane.

Theme And Tone

Theme is the visual language. Tone is the accent family inside that language.

Examples: theme:infra_dark, theme:paper_editorial, tone:amber, tone:green.

Content Binding

The content is not the same thing as the scene structure. The model should choose where content goes; a separate payload should provide what the text and values actually are.

This is why later lines should move from literal prose inside scene tokens to keyed refs plus content.json.

Layer | Example | Who owns it
Composition | comparison_span_chart | Model
Theme | infra_dark | Model
Tone | amber | Model
Content role | @section_card.0.title | Model chooses the slot; external content provides the value
Exact SVG path, gradient, shadow, marker, wrap | <path ...>, <linearGradient ...> | Compiler
Activation memory infographic — poster_stack layout family
poster_stack layout family — the model chooses composition, theme, tone, and content slots. The compiler handles gradients, geometry, text wrapping, and final SVG rendering.

Worked Example

A concrete keyed-scene example should look like this:

Request prompt

[task:svg]
[layout:poster_stack]
[topic:memory_reality]
[theme:infra_dark]
[tone:green]
[density:compact]
[OUT]

Scene DSL emitted by the model

[scene]
[canvas:tall]
[layout:poster_stack]
[theme:infra_dark]
[tone:green]
[frame:card]
[density:compact]
[inset:md]
[gap:sm]
[hero:center]
[columns:1]
[emphasis:top]
[rail:accent]
[background:rings]
[connector:line]
[topic:memory_reality]
[header_band:@header_band.0.kicker|@header_band.0.headline|@header_band.0.subtitle]
[section_card:@section_card.0.title|@section_card.0.value|@section_card.0.caption|variant=hero|accent=amber]
[compare_bar:@compare_bar.0.label|@compare_bar.0.value|@compare_bar.0.caption|accent=red]
[table_row:@table_row.0.column_1|@table_row.0.column_2|@table_row.0.column_3|state=highlight|accent=amber]
[/scene]

content.json bound by the compiler

{
  "header_band": [
    {
      "kicker": "First Principle",
      "headline": "LLM Memory Reality",
      "subtitle": "The math marketing will not show you"
    }
  ],
  "section_card": [
    {
      "title": "Memory Capacity",
      "value": "25x more memory capacity",
      "caption": "Capacity sets the real context boundary"
    }
  ],
  "compare_bar": [
    {
      "label": "GPU VRAM",
      "value": "80 GB",
      "caption": "single device"
    }
  ],
  "table_row": [
    {
      "column_1": "128K",
      "column_2": "3x GPUs",
      "column_3": "Fits"
    }
  ]
}

The training target here is the scene decision, not the literal prose. The compiler can now render the same scene with different content payloads without retraining the model.
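The binding step the compiler performs can be sketched in a few lines. This assumes the '@family.index.field' ref grammar shown in the scene above; error handling is deliberately minimal.

```python
# Sketch: bind keyed refs like @section_card.0.title against content.json.
# The ref grammar follows the worked example; real binding would validate
# arity and missing keys instead of raising bare KeyError/IndexError.
import json

def resolve_ref(ref: str, content: dict) -> str:
    """Resolve '@family.index.field' to its value in the content payload."""
    family, index, field = ref.lstrip("@").split(".")
    return content[family][int(index)][field]

content = json.loads("""
{"header_band": [{"kicker": "First Principle",
                  "headline": "LLM Memory Reality",
                  "subtitle": "The math marketing will not show you"}]}
""")

print(resolve_ref("@header_band.0.headline", content))  # → LLM Memory Reality
```

Because resolution is deterministic and external, swapping content.json re-renders the scene without touching the model at all.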

Topology Refs vs Content Refs

Some richer layout families need more than content slots. Trees, maps, and graph-style layouts also need stable structural handles so the compiler knows what connects to what.

Ref type | Example | Used for
Topology ref | [node_id:start], [from_ref:start], [to_ref:l2] | Identity, routing, graph layout, branch placement, segment attachment
Content ref | [title_ref:nodes.start.title], [branch_label_ref:edges.start_l2] | Visible text and numeric payload from content.json
[decision_node]
[node_id:start]
[title_ref:nodes.start.title]
[/decision_node]

[decision_edge]
[from_ref:start]
[to_ref:l2]
[branch_label_ref:edges.start_l2]
[/decision_edge]

This dual-reference pattern is not a bug. It is the right compiler boundary for tree, map, and topology layouts.

Why Explicit Tokens Instead Of Generic BPE Right Now?

Tokenizer architecture — hash vs trie lookup
Tokenizer architecture — the CK-Engine tokenizer uses explicit reserved tokens for the DSL surface. This makes every probe miss interpretable as a scene decision error, not subword spelling drift.

The short answer is: because this output language is small, formal, and brittle, and the current training line is optimizing for control and interpretability before token-efficiency.

Why not plain BPE first?

If the tokenizer is free to break tags arbitrarily, a tiny model must learn both the scene language and the spelling of the scene language at the same time.

What explicit reserved tokens buy

They make the contract visible. A probe miss can be read as the wrong scene decision instead of arbitrary subword drift across brackets, separators, and role markers.

Why this is not a universal rule

Frontier models absolutely use learned tokenizers like BPE or SentencePiece at base-model scale. The explicit token surface here is a local engineering choice for a narrow formal DSL.

What should happen later

Once the DSL is stable, revisit the tokenizer boundary as its own spec question: smaller reserved surface, learned merges, or mixed strategy. Do that after the contract is proven, not before.

A useful rule is:

Early line:
prefer explicit structural tokens so the contract is obvious.

Later line:
shrink or relax the reserved surface only after the model, compiler, and evals agree on the language.

One mistake to avoid is treating a whole component row as the permanent atomic token surface.

Shape 1: packed component row

[compare_bar:@compare_bar.0.label|@compare_bar.0.value|@compare_bar.0.caption|accent=amber|note=@compare_bar.0.note]

What it buys: short sequences and strong format control.
What it risks: too brittle if kept forever. Similar component variants do not share enough structure, so the vocabulary becomes overly specific.

Shape 2: block-style component

[compare_bar]
[label_ref:compare_bar.0.label]
[value_ref:compare_bar.0.value]
[caption_ref:compare_bar.0.caption]
[accent:amber]
[note_ref:compare_bar.0.note]
[/compare_bar]

What it buys: more compositional reuse across fields and component variants.
What it risks: longer sequences and a slightly harder grammar, but much better long-term generalization.

The first shape can be acceptable as a transition step. It should not be treated as the final production tokenizer boundary.

What went wrong in practice: one of the keyed-scene lines still inherited the old tokenizer builder, which collected every whitespace-delimited bracket chunk as a reserved control token. That turned rows like [compare_bar:@compare_bar.0.label|@compare_bar.0.value|@compare_bar.0.caption|accent=amber] into single atomic tokens again. This is not training-token packing in the data pipeline. It is tokenizer-boundary packing, and it makes the DSL brittle.
Failure mode | Why it is wrong | Corrective direction
Whole bracketed component rows become reserved tokens because the tokenizer harvests every whitespace-delimited [...] chunk. | The model no longer learns a compositional scene language; it learns oversized one-off control ids. | Keep only small structural tokens reserved and stop reserving payload-bearing ref tokens like @compare_bar.0.label.
Dense packing is confused with normal training-token packing. | The packed-window step is not the issue here; the issue is the token boundary chosen before packing. | Debug tokenizer boundaries first, then debug total-token budgets and effective epochs second.
Keyed refs still sit inside one monolithic component token. | The content moved out of literal prose, but the structure is still too atomic. | Move to block-style components such as [compare_bar], field-role tokens, and separate ref tokens or unreserved ref payloads.
Bad tokenizer boundary

[compare_bar:@compare_bar.0.label|@compare_bar.0.value|@compare_bar.0.caption|accent=amber]

Better boundary

[compare_bar]
[field:label]
[@compare_bar.0.label]
[field:value]
[@compare_bar.0.value]
[field:caption]
[@compare_bar.0.caption]
[accent:amber]
[/compare_bar]

The rule is simple: reserve structure, not payload. If a token contains too much scene-specific data or too many keyed refs, it is probably the wrong token boundary.
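The "reserve structure, not payload" rule lends itself to a lint pass over the reserved vocabulary. A sketch with assumed heuristics: flagging '@' keyed refs and '|' field separators inside a single reserved token follows the failure modes above, but the exact checks would depend on the real tokenizer builder.

```python
# Sketch: flag reserved tokens that violate "reserve structure, not payload".
# Heuristics (keyed '@' refs or '|' field separators packed inside one
# reserved token) are illustrative assumptions, not the real builder's rules.

def is_bad_reserved_token(token: str) -> bool:
    body = token.strip("[]")
    if "@" in body:  # payload-bearing keyed ref baked into a reserved token
        return True
    if "|" in body:  # multiple fields packed into one atomic token
        return True
    return False

packed = "[compare_bar:@compare_bar.0.label|@compare_bar.0.value|accent=amber]"
print(is_bad_reserved_token(packed))           # → True: whole row as one token
print(is_bad_reserved_token("[compare_bar]"))  # → False: small structural token
print(is_bad_reserved_token("[accent:amber]")) # → False: structural key:value
```

Running a check like this at tokenizer-freeze time would have caught the harvested component rows described above before any training compute was spent.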

Spec Lifecycle

v6.6 evolution timeline — spec progression milestones
Evolution timeline — each spec passes through the same lifecycle stages below. The milestones above show how the v6.6 line matured through this exact process.
0. State the hypothesis
Write down exactly what the next spec is trying to prove: richer DSL, better compiler, stronger asset family, or narrower repair.

1. Snapshot the assets
Copy the chosen asset family into the cache-backed run workspace so the training contract is frozen and reproducible.

2. Replace literal text
Strip shipped copy and replace it with placeholders. Train structure first, then bind content from a separate library.

3. Extract vocabulary
Build the scene vocabulary from layouts, components, composition patterns, style families, and semantic text roles.

4. Prove the compiler
Compile the DSL with dummy text and verify that the output is acceptably close to the reference asset family before training.

5. Freeze a gold asset gate
Require a real gold pack, ideally 5 to 10 assets spanning comparison, poster, table, and pipeline families. Treat compiler fidelity as a gate, not a note.

6. Freeze the tokenizer surface
Tokenize the DSL vocabulary plus placeholder text roles. Do not let arbitrary literal text dominate the tokenizer early, and do not keep whole component rows as permanent atomic tokens once the structure-content split is proven.

7. Train with one primary axis
Change one main thing per run: DSL, compiler, data mix, tokenizer budget, or capacity. Hold the rest stable.

8. Write the decision
After every run, decide: promote, reject, repair, or branch. Every run ends with a next action, not just a score.

Operational Training Process

The practical mistake is to let a rung answer too many questions at once. The better process is narrower: freeze measurement, name one failure class, teach family structure, and let validation close the last deterministic syntax gaps.

Freeze measurement first

Before changing the curriculum, freeze the probe prompt lists and keep them balanced across cases, forms, and prompt surfaces.

Rule: if the probe contract changes, version it. Do not compare old and new runs as if they were the same measurement.

One rung, one question

Every rung should target one named failure class: missing terminal tail, wrong family choice, wrong style bundle, stop-boundary spill, or broken exactness.

Rule: if a rung changes grammar, probe shape, and repair mix together, it is not a clean experiment.

Train structure, not boilerplate

The model should learn the family program: layout, components, refs, style controls, and counts. Deterministic tails and canonical ordering should be pushed into the compiler or repair layer whenever possible.

Rule: do not waste multiple rungs trying to make SGD rediscover a fixed footer, terminal block, or closing sequence.

Add repair early

Validation and repair are part of the system, not an admission of failure. Use them as soon as the family structure is mostly right and only a small deterministic residual remains.

Rule: always report raw and repaired probe scores separately.

Process gate | Required rule | Why it exists
Probe freeze | Keep balanced prompt lists on disk and version the probe contract when they change. | Otherwise a probe bug or skewed selector can make a healthy line look broken.
Rung scope | Declare one failure class target and one allowed intervention family per rung. | Clean attribution matters more than squeezing one more point from a noisy run.
Repair budget | Keep broad meta-repair rows capped. Default budget: 10-15% unless the rung is explicitly a syntax rung. | Too much repair prose teaches the warning language itself instead of the underlying scene contract.
Compiler-first closure | If a missing region is mechanically implied by form and counts, prefer validation or repair before adding another broad training rung. | This keeps the model focused on family structure rather than deterministic boilerplate.
Raw vs repaired reporting | Publish both raw probe metrics and repaired probe metrics. | This separates training quality from system quality and prevents false conclusions.
Seed and probe integrity | Fail preflight if seed staging, tokenizer sidecars, or probe-path hashes drift unexpectedly. | Many apparent curriculum regressions are really measurement or staging bugs.
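The seed-and-probe integrity gate is easy to mechanize. A sketch under assumptions: the manifest shape and artifact names below are hypothetical, and a real preflight would read files from disk rather than a dict.

```python
# Sketch: preflight integrity check that fails fast when probe files or
# tokenizer sidecars drift from their recorded hashes. Manifest shape and
# artifact names are illustrative assumptions.
import hashlib

def file_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def preflight(manifest: dict[str, str], files: dict[str, bytes]) -> list[str]:
    """Return the artifacts whose content no longer matches the manifest."""
    return [name for name, expected in manifest.items()
            if file_hash(files.get(name, b"")) != expected]

probe = b"[task:svg][layout:poster_stack]"
manifest = {"probe_prompts.txt": file_hash(probe)}
print(preflight(manifest, {"probe_prompts.txt": probe}))         # → [] (clean)
print(preflight(manifest, {"probe_prompts.txt": probe + b"x"}))  # → drift detected
```

A non-empty drift list should abort the run before training starts, per the gate above.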

A useful default wording for the rung brief is: "Fix one named failure class while preserving the last good baseline on the frozen probe." That sentence is narrow enough to govern the materializer, the probe, and the post-run decision.

Operator rule from the live spec lines: once a raw rung becomes the champion on a frozen contract, freeze it. The next training change should default to decode-first work or a small pilot, not an unconstrained full rung. A full rung should be blocked until a pilot proves the target family improves without hidden-holdout or family regression.

Freeze the raw winner

Keep the current raw champion as the comparison anchor. New runs compete against that run, not against train loss or operator intuition.

Rule: promotion gates should reference one frozen probe report and one frozen best rung.

Pilot before full rung

If a repair idea touches shared-family behavior, run a small pilot first. Scale the token budget down and treat the result as a gate, not an automatic successor rung.

Rule: a full rung stays blocked until the pilot improves the target family with no family or hidden-split regression.
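The pilot gate above can be written as one predicate. This is a sketch in the spirit of the rule, not the live policy file: the metric names, family set, and zero-regression tolerance are illustrative assumptions.

```python
# Sketch: a pilot promotion gate. Metric names and the zero-regression
# tolerance are assumptions, not the machine-readable autopilot policy.

def pilot_passes(baseline: dict[str, float], pilot: dict[str, float],
                 target_family: str, tolerance: float = 0.0) -> bool:
    """Pilot must improve the target family and regress no other family."""
    if pilot[target_family] <= baseline[target_family]:
        return False
    return all(pilot[f] >= baseline[f] - tolerance
               for f in baseline if f != target_family)

baseline = {"poster_stack": 0.92, "pipeline_lane": 0.88, "table_matrix": 0.90}
pilot    = {"poster_stack": 0.95, "pipeline_lane": 0.88, "table_matrix": 0.90}
print(pilot_passes(baseline, pilot, "poster_stack"))  # → True: full rung unblocked
```

Keeping the gate this strict by default is deliberate: a pilot that merely ties the baseline does not unblock a full rung.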

Decode first after strong raw runs

Once the raw line is mostly correct, push stop-boundary, prompt-spill, and deterministic cleanup into decode, validation, and repair before spending another broad training cycle.

Rule: if the miss set is mostly repairable, train less and validate more.
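Stop-boundary spill is the canonical example of a deterministically repairable miss. A minimal decode-cleanup sketch, assuming the '[/scene]' terminal shown in the worked example; the real decode layer may differ:

```python
# Sketch: deterministic decode cleanup that truncates at the scene stop
# boundary instead of retraining stop behavior. The '[/scene]' terminal
# follows the DSL shown in the worked example.

def clean_stop(raw_output: str, terminal: str = "[/scene]") -> str:
    """Keep everything through the first terminal token; drop trailing spill."""
    end = raw_output.find(terminal)
    if end == -1:
        return raw_output  # no terminal emitted: leave for the repair layer
    return raw_output[: end + len(terminal)]

spilled = "[scene][layout:poster_stack][/scene][OUT][task:svg]"
print(clean_stop(spilled))  # → [scene][layout:poster_stack][/scene]
```

Note that this cleanup never teaches the model the spill tokens themselves, which is exactly what the ban on literal prompt-junk repair rows is protecting.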

Ban literal prompt junk repair rows

Do not teach the model the exact wrapper junk you want it to avoid. That turns the warning language itself into the target distribution.

Rule: use control-agnostic clean-stop prompts rather than rows that spell out [OUT], duplicated prompt blocks, or schema-noise tokens.

Context Budget Is A Gate

Do not guess the next context length. Measure the gold scenes with the current tokenizer family before training.

Rule | Interpretation
p95(prompt + output) < 400 | ctx=512 is probably still enough.
400-600 | Move to ctx=768 only if the gold pack actually lives here.
> 768 | Use ctx=1024 only when the gold scenes really need it.

The important lesson from the keyed-scene line was that a layout can fail because of decode or context budget even when the model already learned the structure correctly. Budget diagnostics should therefore be part of the probe contract, not an afterthought.
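The gate table above reduces to a few lines of measurement code. A sketch with assumed inputs: the token lengths below are illustrative, a real run would measure the gold pack with the frozen tokenizer, and the p95 here is a simple nearest-rank approximation.

```python
# Sketch: measure the gold pack before choosing ctx, per the gate table.
# Token counts are illustrative; p95 uses a simple nearest-rank rule.

def p95(values: list[int]) -> int:
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]

def pick_ctx(prompt_plus_output_tokens: list[int]) -> int:
    """Apply the ctx thresholds from the gate table to measured gold scenes."""
    budget = p95(prompt_plus_output_tokens)
    if budget < 400:
        return 512
    if budget <= 600:
        return 768
    return 1024

gold_lengths = [310, 355, 280, 390, 340, 360, 330, 300, 370, 345]
print(pick_ctx(gold_lengths))  # → 512
```

Making this computation part of the probe contract is what turns "the layout failed" into "the layout was truncated at ctx" instead of a false model regression.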

How To Build Frontier-Style Intuition

No serious engineer would laugh at the core idea here. The separation of structure, content, compiler, and eval is exactly the kind of thinking that makes systems reliable. What strong teams would challenge is not the direction, but the discipline of the method.

Good instinct

Separating asset library, scene DSL, content, compiler, and probe contracts is a strong instinct. It reduces ambiguity and makes failures legible.

What frontier teams would push harder on

Cleaner ablations, stricter gates, more automatic validation, less manual drift between specs, and fewer simultaneous changes per run.

What to copy from them

Treat every run like an experiment with a written hypothesis, a fixed baseline, a canary, a non-regression gate, and a clear promotion or rejection rule.

What not to imitate blindly

Do not jump to giant-scale training intuition too early. On narrow compiler-backed tasks, clarity of interface and eval quality matter more than trying to mimic a frontier pretraining stack.

If a frontier engineer reviewed this | Likely reaction | What to improve
Separating DSL and content | Good systems instinct | Push it fully through the dataset, compiler, and probe path
Compiler-first validation | Correct | Keep the gold asset round-trip gate strict
Custom explicit tokens | Reasonable for a narrow formal language | Revisit only after the DSL stabilizes
Many specs in sequence | Fine if each run is interpretable | Make the hypothesis and held-constant set even more explicit
Weak or drifting evals | Unacceptable | Keep the report contract and canary gate mandatory

One Primary Axis Per Run

The model can only teach intuition if each run is interpretable. Every run should declare one primary axis and list everything held constant.

Primary axis | What changes | What stays fixed | Question answered
DSL run | Scene grammar, token order, component vocabulary, canonicalization rules | Asset set, compiler, tokenizer family, model size, training budgets | Did the representation become easier and cleaner to learn?
Compiler run | Layout engine, gradients, wrappers, markers, defs, text wrapping, style packs | DSL, prompts, tokenizer, model config | Did the rendering surface become richer without widening the model surface?
Data run | Asset coverage, placeholder catalog, contrast pairs, holdout balance, context library | DSL, compiler, model size | Is the model underfit on coverage, or was the contract itself weak?
Budget run | Packed token budgets, effective epochs, stage mix | DSL, compiler, dataset rows, model config | Was failure caused by under-training or over-training rather than representation?
Capacity run | Layers, embed dim, hidden dim, context, optimizer scaling | DSL, compiler, dataset mix, evaluation contract | Has the representation stabilized enough that more capacity is the next lever?
Repair run | Narrow contrast slices for specific misses | Everything else, especially the winning baseline | Can a specific failure cluster be closed without global regression?
A run is only methodical when it can be summarized as one sentence: “This run changed X, held Y constant, and answered Z.”

Restart-Safe Agent Handoff

This page is the right umbrella documentation to keep around for restarts. It now has a companion handoff file so a future agent does not need to reconstruct the current plan from scattered reports or stale run logs.

Why this helps

It preserves the method baseline after a reboot: keep the last good spec[x] rung[y] as the training-method champion, move to a compiler-first successor DSL, and add one new family at a time.

What to copy after restart

Use the markdown handoff file below as the first message to the next agent. It points directly at the live contract and the autopilot policy.

What it prevents

It prevents blind resumption of stale runs, tokenizer guesswork, and launching training before the compiler and tokenizer contract are ready.

Reference | Purpose
docs/site/_pages/spec-training-method.html | Human-readable umbrella method page.
docs/site/_pages/agent-handoff-template.md | Copy-paste restart prompt for future agents.
version/v7/reports/SPEC[X]_EXECUTION_CONTRACT_YYYY-MM-DD.md | Example shape of the current internal execution contract.
version/v7/reports/spec_family_autopilot_policy.json | Machine-readable rule for when autonomy is allowed and when it must stop.
Suggested restart prompt

Read these first and treat them as the live source of truth:

1. docs/site/_pages/spec-training-method.html
2. docs/site/_pages/agent-handoff-template.md
3. version/v7/reports/SPEC[X]_EXECUTION_CONTRACT_YYYY-MM-DD.md
4. version/v7/reports/spec_family_autopilot_policy.json

Current intent:
- keep the last good spec[x] rung[y] as the training-method baseline, not the tokenizer ceiling
- build the successor DSL/compiler/tokenizer path explicitly
- keep payload/content external to the model DSL
- add one capability at a time
- make the next family[z] the active family-construction line
- do not launch training until the compiler, tokenizer corpus, launcher, and rung policy checklist are complete

Then continue from the current execution contract checklist and update the repo artifacts before starting any background training or autopilot.

Recommended Asset-to-DSL Workflow for SVG

IR pipeline flow — GGUF to C runtime
Full pipeline flow — from input artifacts through IR stages to compiled output. The compiler validation phase in the workflow below ensures the DSL round-trips through this pipeline deterministically.

Asset intake

Track every reference SVG in the public asset set and copy it into the run-local cache workspace.

  • Keep original assets immutable.
  • Record family, source file, and intended holdout split.
  • Use run-local copies for extraction and experiments.

Dummy-text normalization

Replace all literal text with placeholders before vocabulary design.

  • heading_1, heading_2, paragraph_1, callout_1
  • Keep text roles, not the original prose.
  • Make layout errors visible by removing content noise.
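The normalization step can be sketched as a small SVG text rewrite. This is a minimal sketch under assumptions: it only handles simple <text> elements (no tspans or nested markup), and the role inference is a trivial counter, where the real pipeline would classify roles per layout family.

```python
# Sketch: strip literal copy out of an SVG and replace it with role
# placeholders. Only plain <text> elements are handled; role naming is a
# trivial counter standing in for real per-family role classification.
import re

def normalize_text(svg: str) -> str:
    counter = {"n": 0}
    def repl(match: re.Match) -> str:
        counter["n"] += 1
        return f"{match.group(1)}paragraph_{counter['n']}{match.group(3)}"
    return re.sub(r"(<text[^>]*>)([^<]+)(</text>)", repl, svg)

svg = '<svg><text x="10">GPU VRAM is 80 GB</text><text x="20">single device</text></svg>'
print(normalize_text(svg))
# → <svg><text x="10">paragraph_1</text><text x="20">paragraph_2</text></svg>
```

After this pass, any visual diff between the compiled DSL and the reference is a layout error by construction, since the content noise is gone.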

Vocabulary extraction

Infer the reusable scene vocabulary from the normalized assets.

  • layout family
  • component families such as panels, charts, connectors, legend, band, poster stack
  • theme, tone, spacing, background motif, frame, emphasis, hero alignment

Compiler validation

Compile the new DSL back into placeholder SVG and compare it against the normalized asset.

  • Require deterministic output.
  • Require acceptable visual round-trip on the gold asset subset.
  • Reject the spec if the compiler cannot express the asset family cleanly.

Training corpus

Train on the DSL and placeholder roles, not on raw shipped copy.

  • Tokenize scene vocabulary plus placeholder text roles.
  • Add content/context libraries later as a separate layer.
  • Keep placeholder slot resolution external and deterministic when possible.

Probe discipline

Score both DSL exactness and compiled render exactness.

  • A wrong DSL with right render means hygiene is bad but the contract may still be close.
  • A right DSL with wrong render means the compiler is the problem.
  • Do not read train loss as the final answer.
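The two-axis scoring above yields four diagnostic quadrants, which can be encoded directly. A sketch assuming exact-string comparison for the DSL and a render-equality oracle for the compiled SVG; the diagnosis strings are illustrative labels, not the probe report's real schema.

```python
# Sketch: classify probe results into the quadrants described above.
# Diagnosis labels are illustrative, not the probe report's real schema.

def diagnose(dsl_exact: bool, render_exact: bool) -> str:
    if dsl_exact and render_exact:
        return "pass"
    if not dsl_exact and render_exact:
        return "hygiene miss: DSL surface drifted but the contract is close"
    if dsl_exact and not render_exact:
        return "compiler bug: model output is correct"
    return "model miss: wrong scene decision"

print(diagnose(dsl_exact=False, render_exact=True))
# → hygiene miss: DSL surface drifted but the contract is close
```

Bucketing every probe row this way is what keeps compiler regressions out of the model's scorecard, and vice versa.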

HTML Report Contract After Every Spec or Run

Every run should leave behind a readable HTML report that answers the same questions in the same order. The point is not decoration. The point is to make the next decision obvious.

Report section | Question answered | Recommended content
Run card | What is this run? | Spec name, run id, date, model config, dataset version, primary axis changed, baseline compared against.
Hypothesis | Why was the run executed? | One paragraph stating what changed, why it should help, and what would count as success or rejection.
Held constant | What was intentionally not changed? | DSL/compiler/data/tokenizer/capacity matrix with one primary axis highlighted.
Data + tokenizer | What did the model actually see? | Row counts, packed token counts, effective epochs, tokenizer size, reserved tokens, placeholder coverage.
Compiler gate | Could the compiler express the target family? | Round-trip gallery on gold assets, render diffs, unsupported primitives, fallback paths.
Probe summary | Did the run work? | Exact, renderable, materialized exact, split metrics, per-layout metrics, delta vs baseline.
Product scorecard | Did it get closer to usable output? | Content-binding success, gold-asset parity status, family non-regression, and whether the output is visibly in-family.
Failure gallery | Where did it fail? | Show the remaining misses with prompt, expected DSL, actual DSL, content JSON or refs, compiled SVG, and a short diagnosis.
Lessons | What was learned? | Two or three concrete takeaways about representation, compiler, data, or budgeting.
Decision | What happens next? | Promote, reject, repair, or branch. Include the next run shape and the intended axis change.

Beautiful means legible

A good report should show the metric headline, the delta from baseline, and the actual failure cards above the fold.

Use color sparingly: green for proven gain, amber for ambiguity, red for rejection, blue for structural notes.

Show the evidence

Every claim should anchor to an artifact: probe JSON, tested prompts report, compiler validation gallery, dataset profile, tokenizer stats.

End with a decision

The report is incomplete if it only says what happened. It must say what the next run should do and what should stay frozen.

Run summary template

Spec:
Run:
Primary axis changed:
Held constant:
Baseline:

Hypothesis:

Data delta:
DSL delta:
Compiler delta:
Tokenizer delta:
Budget delta:
Capacity delta:

Preflight result:
Compiler round-trip result:
Probe result:
Failure clusters:

Decision:
Next run:

What The Experiments Proved

These conclusions did not come from theory alone. They came from repeated runs that failed, partially worked, regressed, recovered, and exposed the actual pressure points in the system. Across the full run history, the stable wins came from better contracts and better full curricula. Narrow raw repair churn became less reliable once a line was already mostly learned.

1. Evaluation bugs can masquerade as model collapse

We learned that decode/stop-marker mistakes and bad probe contracts can make a healthy model look broken. Evaluation infrastructure is part of the system, not an afterthought.

2. Freeze the winning baseline

Once a run becomes strong on the current contract, freeze it. Do not erase the baseline with broad speculative follow-ups.

3. Front-load predictable failures

Family drift, sibling-form confusion, style attractors, topology/count defaults, and stop leakage are usually visible in the contract before the first run. Teach those boundaries as clean positive coverage from day one instead of relying on post-hoc warning-language rows.

4. Use decode/repair before raw churn

When the remaining miss set is mostly syntax hygiene or mechanically repairable near-misses, deterministic decode/repair is the default path. Reopen raw training only for a true new branch such as a redesigned curriculum or a capacity test.
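The deterministic decode/repair path described above can be sketched as a small pass over the raw output. This is a minimal illustration, assuming the [scene] ... [/scene] tag shape used on this page; the attribute names in the dedup step are placeholders, and a real pass would be driven by the compiler's grammar rather than a regex.

```python
# Minimal sketch of a deterministic decode/repair pass for syntax-only misses.
# Assumed: [scene]/[/scene] delimiters; the layout/theme/tone attribute names
# are illustrative placeholders. Semantic choices are never touched.
import re

def repair_scene(text: str) -> str:
    """Apply mechanical fixes only: prefix junk, stop leakage, missing close,
    duplicated top-level attributes."""
    out = text.strip()
    # 1. Drop anything before the first scene open tag (prompt echo, stray tokens).
    start = out.find("[scene]")
    if start > 0:
        out = out[start:]
    # 2. Truncate anything after the first close tag (stop-marker leakage).
    end = out.find("[/scene]")
    if end != -1:
        out = out[: end + len("[/scene]")]
    else:
        # 3. Append the missing close tag (scene_suffix_failure).
        out = out + "\n[/scene]"
    # 4. Deduplicate repeated top-level attributes, keeping the first occurrence.
    seen = set()
    lines = []
    for line in out.splitlines():
        m = re.match(r"\[(layout|theme|tone):", line)
        if m:
            if m.group(1) in seen:
                continue
            seen.add(m.group(1))
        lines.append(line)
    return "\n".join(lines)
```

The point of the sketch is the boundary: everything in this pass is reversible, mechanical hygiene, so the frozen raw winner is never retrained to fix it.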

5. Separate asset, DSL, compiler, and content work

Runs became interpretable only after we started isolating which layer changed: richer scene vocabulary, richer compiler, broader asset family, or better structure-content separation.

6. Token granularity is its own spec axis

Whole-component tokens were useful as training wheels, but they should not remain the long-term boundary. Once keyed structure works, shrinking token granularity deserves a dedicated run.

How We Reached The Current Design

Reference asset: "Quantization formats", a structured byte-level technical reference in the table_matrix layout family. The experiments below showed that this level of visual detail requires a compiler, not a model that emits raw SVG.
Observed pattern What it taught us Design conclusion
Raw or flat structural targets could render, but were brittle and hard to steer. The model was spending too much capacity on low-level surface decisions. Move upward into a scene DSL.
A richer compiler improved visual output without requiring the model to emit raw gradients, paths, or markers. Beauty and fidelity can often be improved at the compiler layer first. Keep low-level SVG machinery compiler-owned.
Some runs got stronger exact-match by baking visible text directly into component tokens. That can help short-term contract accuracy, but it mixes structure and content in the wrong place. Separate scene structure from content payloads.
Narrow fixes sometimes improved the target slice and damaged solved slices elsewhere. Repair runs need strong anchor replay and non-regression checks by family, not just aggregate score. Gate every repair against the frozen baseline.
Loss often looked healthy even when contract behavior regressed. Train loss is not the product metric. Use probe, canary, compiler round-trip, and failure galleries as the real decision surface.

How To Decide What Is Next

Diagnostic Matrix

After a run finishes, use the probe report, loss curve, and per-layout breakdown to decide what to do next. Do not guess — match the symptom pattern to the action.

Symptom Likely Cause Action Example
Low exact across all layouts Undertrained (too few epochs or too little curriculum pressure) Raise midtrain epochs from 1 to 2–3. Keep everything else frozen. Example: 2/35 exact after 1 midtrain epoch, with all 5 layouts weak
Low exact on some layouts, others strong Curriculum imbalance — weak families got less edit pressure Add targeted negative/repair rows for the weak families. Anchor the strong families with replay. Example: one over-weighted slice caused cascading failure across the weak families
High exact match, but rendered_svg_ok: null Probe/compiler plumbing bug — model output is right but the render path fails silently Fix the probe pipeline first. Do not retrain until the eval is trustworthy. Example: exact=true but rendered_svg_ok=null because the accounting path was wrong
Loss drops below 0.1 but exact match stays low The model memorized training distribution but not the DSL contract Check tokenizer coverage, holdout prompt diversity, and whether edits cover all layout × topic combinations. Example: loss converged but capability stayed near 0% because the representation was wrong
Loss stuck above 0.5 after full midtrain Capacity wall, LR too low, or data packing issue Check grad norms for saturation. Try LR sweep. Verify token packing fill rate (>0.85). Example: 81% loss reduction but final loss stayed at 0.905, signaling a capacity or token-budget wall
Good exact on train prompts, bad on dev/test Overfitting to train distribution Widen holdout prompt coverage. Add more topic × layout combinations. Reduce epoch count if loss is very low. Example: 100% train, 91.7% dev, 83.3% test, showing a visible but still manageable overfit gradient
Good exact on one spec, regression on next DSL contract changed silently, or tokenizer/compiler mismatch Diff the DSL grammar, tokenizer vocab, and compiler output between specs. Use canary probes from the previous best. Example: new layout families were introduced and legacy families still passed, but the tokenizer/compiler contract drifted
100% syntax valid, 0% semantic match Model learned token order but not composition choices Richer curriculum with edit pairs (wrong→right), not just direct generation examples. Example: 100% syntactically valid DSL, 0% correct scene composition
Stage labels wrong in loss curve Telemetry bookkeeping bug in pipeline — stage transitions not recorded Fix the pipeline stage labeling. Do not trust source_stage field until verified. Use step counts to infer boundaries. Example: all 665 steps were labeled pretrain even though midtrain steps were present
training_plan.json says "active" after run completed Pipeline did not finalize stage status Fix the plan update logic. Probe report should not depend on plan status for results. Example: training_plan.json still reported midtrain after the run finished
One layout perfect until truncated — missing closing tags Output exceeds context window — model learned the content but runs out of tokens Measure output token count vs context length. Either compress the DSL (remove inferable fields) or raise context. Do not add more training data — the model already knows it. Example: one family matched 2208/2606 characters perfectly, then truncated at the same point on every failure
Gold mapping token budget far exceeds context window DSL is too verbose — carrying fields the compiler could infer Run a DSL compression pass before defining the tokenizer. Target: output tokens < 80% of context length. Measure with a gold-budget report tool. Example gold output: 2793 tokens at ctx = 512, or 5.5x over budget
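The gold-budget check in that last row can be sketched as a small report tool. This is an illustration rather than the project's actual tool: the whitespace tokenizer is a stand-in for the real tokenizer, and the 80% target mirrors the row above.

```python
# Sketch of a gold-budget report: does the longest gold target fit within
# a fraction of the context window? Whitespace split is a tokenizer stand-in.
def gold_budget_report(gold_outputs, ctx_len, tokenize=str.split, target=0.80):
    budget = int(ctx_len * target)
    rows = []
    for name, text in gold_outputs.items():
        n = len(tokenize(text))
        rows.append({"asset": name, "tokens": n, "budget": budget,
                     "over_budget": n > budget, "ratio": round(n / ctx_len, 2)})
    worst = max(rows, key=lambda r: r["tokens"])
    return {"rows": rows, "worst": worst, "pass": not worst["over_budget"]}
```

A failing report means the DSL needs a compression pass (drop inferable fields) before the tokenizer is defined, not more training data.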

Severity legend: Critical (fix before retraining), Warning (likely cause of weak results), Info (plumbing or telemetry issue).

Reading the Probe Report

The probe report is the real decision surface, not the loss curve. A healthy loss curve with bad probe results means the representation or curriculum is wrong. A noisy loss curve with good probe results means the model learned despite training-signal noise.

Metric What It Tells You Healthy Range Red Flag
exact_match Model output matches expected DSL token-for-token >70% overall, >50% per layout <20% after full curriculum = representation or data problem
renderable Output parses and compiles to valid SVG 100% (compiler contract must not break) <90% = compiler or tokenizer bug, not model bug
rendered_svg_ok Compiled SVG matches expected visual Should track exact_match closely null = probe pipeline broken, fix before interpreting other metrics
Per-layout breakdown Which families the model learned vs. missed All families above 50% One family at 0% with others at 80%+ = curriculum imbalance
Train vs dev vs test split Generalization gradient Train≥dev≥test, gap <20pp Train 100%, test 0% = severe overfitting
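The red-flag column above can be applied mechanically to a probe report. A sketch, assuming JSON field names that match the metric names in this table (exact_match, renderable, rendered_svg_ok, per_layout, splits); the real probe schema may differ.

```python
# Sketch: turn the red-flag rules from the table into a first-pass reader.
# Field names are assumptions mirroring the metric names on this page.
def probe_red_flags(report: dict) -> list:
    flags = []
    if report.get("rendered_svg_ok") is None:
        flags.append("probe pipeline broken: rendered_svg_ok is null")
    if report.get("renderable", 1.0) < 0.90:
        flags.append("renderable <90%: compiler or tokenizer bug, not model bug")
    if report.get("exact_match", 0.0) < 0.20:
        flags.append("exact_match <20%: representation or data problem")
    for family, score in report.get("per_layout", {}).items():
        if score < 0.50:
            flags.append(f"family {family} below 50%: curriculum imbalance")
    splits = report.get("splits", {})
    if splits and splits.get("train", 0) - splits.get("test", 0) > 0.20:
        flags.append("train-test gap >20pp: overfitting")
    return flags
```

If the first flag fires, every other number in the report is untrustworthy; fix the probe pipeline before interpreting anything else.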

Reading the Loss Curve

Pattern What It Means Action
Smooth descent, final <0.05 Model is memorizing the training set well Check probe to confirm generalization, not just memorization
Descent plateaus at 0.1–0.3 Model hit a capacity or data ceiling If probe is also stuck, try more epochs, richer edits, or LR restart
Sudden spike at stage boundary New data distribution — expected at pretrain→midtrain transition Not a bug. Watch if it recovers within 50–100 steps
CK loss ≠ PT loss Numerical parity broken Stop training. Fix kernel parity before continuing. This should never happen.
Grad norms spike or collapse Training instability — exploding or vanishing gradients Check max_grad_norm clipping. If persistent, reduce LR or check data packing for degenerate sequences.
Loss oscillates without converging LR too high, or data has conflicting signals (e.g., same prompt → different targets) Verify dataset uniqueness. Try lower LR. Check for duplicate or contradictory training rows.

Decision Actions

Use deterministic decode/repair first when the contract is almost closed

If the current representation is strong and the remaining misses are syntax-only or mechanically repairable, keep the raw winner frozen and improve the decode/repair layer first. Do not default to another raw repair rung on the same brittle surface.

Use a compiler step when visuals are still weak

If the generated infographic is structurally correct but still visually simple, improve the compiler and asset vocabulary before spending more model compute.

Use a tokenizer step when outputs are too brittle

If the model only behaves well when whole component rows are reserved atomically, run a token-granularity spec next instead of adding more layouts or model size.

Use a DSL step when the representation is still too literal or too loose

If the model surface still carries inferable fields, arbitrary prose, or non-canonical structure, fix the DSL before scaling the model.

Use a structure-content split when generalization is the problem

If the scene language is still carrying visible prose or one-off values, branch to keyed structure plus separate content.json. The model should emit refs like @section_card.0.title, not literal asset prose inside the component token.
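The structure-content split can be illustrated with a tiny resolver: the model emits refs like @section_card.0.title and the compiler resolves them against external content.json. The content shape below is an assumption for illustration, not the project's actual schema.

```python
# Sketch of compiler-side ref resolution for the structure/content split.
# Assumed ref shape: @component.index.field; assumed content.json layout below.
def resolve_ref(ref: str, content: dict):
    """Resolve '@component.index.field' against the external content payload."""
    assert ref.startswith("@"), f"not a content ref: {ref}"
    component, index, field = ref[1:].split(".")
    return content[component][int(index)][field]

# Illustrative content.json payload (hypothetical shape).
content = {"section_card": [{"title": "Compute-bandwidth chasm"},
                            {"title": "Activation memory"}]}
```

The design point: the model only ever predicts the ref token, so swapping topics means swapping content.json, never retraining the structural contract.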

Use clean contrast coverage when planning is still wrong

If a failure is truly still in training, add compiler-valid contrast sets and boundary cases, not warning-language rows about [OUT], wrappers, or singleton tags. The curriculum should teach the right choice, not narrate what not to do.

Promotion Rules

Spec Versus Run

The project uses spec[x] and r[y] for a reason. A spec is a new training contract. A run revision is a controlled iteration inside the same contract.

Level Meaning What May Change What Should Stay Fixed
spec[x] A new experiment question and a new learning boundary DSL, prompt surface, compiler ownership, token granularity, output family, structure/content split The project goal and the evaluation discipline
r[y] A revision inside one spec Repair rows, replay ratios, token budgets, epochs, balance, decode hygiene, capacity after the contract is stable The DSL contract, the probe target, and the main question

Start a new spec when the current ceiling is real and measured. Stay inside the same spec when the failure is narrow and the representation still matches the intended product boundary.

Working Definitions

Term Meaning Here Example
Trainable contract The exact input/output behavior the model is being taught, with a form that can be evaluated reliably. prompt -> scene.dsl with exact-match and render checks.
Representation The model-facing format for the task: the grammar, fields, token shapes, and structure it must predict. Flat SVG atoms versus a scene DSL with layout, theme, and component blocks.
Data The actual examples used to teach the contract, including prompts, targets, repairs, anchors, and holdouts. Direct gold rows, topic-swap rows, close-tag continuation rows, or negative contrast rows.
Compiler The deterministic layer that turns structured outputs into the final product artifact. scene.dsl + content.json -> SVG.
Eval The measurement layer that decides whether the run actually improved the desired behavior. Exact match, renderability, materialized exactness, and per-family breakdowns.
Canary A tiny, cheap, high-signal test slice run before a full job to catch obvious format or compiler failures. 12 prompt cases that must parse and render before the real training run starts.
Eval gate A pass/fail condition that must hold before a run is accepted or promoted. 100% renderability on the canary or no regression on solved families.
Ablation A controlled experiment where one factor is changed while the rest stays fixed, so the effect can be interpreted. Keep the DSL fixed and change only midtrain epochs from 1 to 3.
Data mixture control Deliberately choosing how much of each example type the model sees. 40% anchor replay, 40% direct rows, 20% repair rows.
Failure taxonomy A named breakdown of failure classes so the next action is chosen from evidence instead of guesswork. Grammar failure versus compiler failure versus capacity failure.
Repair run A narrow run that targets a known failure slice without redefining the overall contract. Add transition rows to fix one broken family in the current baseline line.
Replay / anchor rows Stable rows from already-solved behavior that are kept in the curriculum to prevent regressions. Keep strong decision_tree rows present while repairing table_matrix.
Prompt surface The information exposed on the input side of the task. Explicit [layout:...] prompts versus intent prompts with only topic + goal + audience.
System boundary The line between what the model must learn and what deterministic systems should own. The model chooses scene structure; the compiler owns exact geometry and gradients.
Canonicalization Forcing one stable legal form for equivalent outputs so exact-match metrics mean something. One legal field order for scene attributes, not many interchangeable spellings.
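The data mixture control entry above (40% anchor replay, 40% direct rows, 20% repair rows) can be sketched as a deliberate sampler, assuming simple per-type row pools:

```python
# Sketch of deliberate data-mixture control: fix the anchor/direct/repair
# ratios explicitly instead of letting them drift with pool sizes.
import random

def build_mixture(anchor, direct, repair, total, ratios=(0.4, 0.4, 0.2), seed=0):
    rng = random.Random(seed)
    rows = []
    for pool, ratio in zip((anchor, direct, repair), ratios):
        k = int(total * ratio)
        # Sample with replacement so a small repair pool still hits its quota.
        rows.extend(rng.choice(pool) for _ in range(k))
    rng.shuffle(rows)
    return rows
```

Fixing the seed makes the mixture itself reproducible, which is what lets an r[y] revision change one ratio while holding the rest of the curriculum constant.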

Use a new spec when the question changed

Examples: move from flat atoms to scene DSL, move from explicit layout prompts to intent prompts, or split visible content out into content.json.

Use a new run when the contract is right but weak

Examples: add closure repairs, strengthen replay, rebalance families, or raise effective epochs for the same grammar.

Reject blurry runs

If a run changes grammar, curriculum, and capacity at once, it may still improve a metric, but it will not teach much. That is wasted research signal.

Asset Scaling Strategy

Do not try to train on a large asset library immediately. Prove each capability level first, then widen. The pattern is: memorize → generalize → expand.

Phase Gold Assets Goal Pass Criteria What Failure Means
Memorization 3 hand-mapped Can the model reproduce 3 exact gold scenes from their prompts? 100% exact on train, compiler round-trips all 3 DSL, tokenizer, or context window is wrong — fix before adding more data
Held-out generalization 3 gold + 3 synthetic variants Can the model produce correct scenes for unseen topic × layout combinations? >70% exact on dev/test splits Curriculum needs more edit diversity — add topic swaps, density changes, theme variations
Family expansion 7–10 gold across 5+ families Does adding new layout families break already-learned ones? >70% exact overall, no per-family regression below 50% Anchor replay too weak — increase replay ratio for stable families
Compositional generalization 10+ gold, compositional tokens Can the model compose components it has seen in new combinations? >60% exact on novel layout × component combinations Token granularity too coarse — break monolithic tokens into compositional pieces
Open planning 20+ gold, underspecified prompts Can the model choose layout family from an ambiguous request? Reasonable family choice >80%, compiler renders successfully >90% Model needs more prompt diversity and possibly larger capacity

Each phase should be a separate spec or run. Do not skip phases — a failure at phase 2 means phase 3 data will be wasted compute. The gold asset count is a guide, not a rule. What matters is that each phase answers its specific question before the next one starts.
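The memorize-generalize-expand ladder can be encoded as ordered gates so a run cannot silently skip a phase. A sketch, with illustrative thresholds taken from the table above and an assumed metrics-dict shape:

```python
# Sketch of phase gating: the first failing gate is where work continues.
# Thresholds mirror the pass criteria above; the metrics keys are assumptions.
PHASES = [
    ("memorization",
     lambda m: m["train_exact"] == 1.0 and m["roundtrip_ok"]),
    ("held_out_generalization",
     lambda m: m["dev_exact"] > 0.70 and m["test_exact"] > 0.70),
    ("family_expansion",
     lambda m: m["overall_exact"] > 0.70 and m["min_family_exact"] >= 0.50),
]

def next_phase(metrics: dict) -> str:
    for name, passed in PHASES:
        if not passed(metrics):
            return name
    return "advance"
```

Because the gates are ordered, a failure at held-out generalization blocks family-expansion data from being generated, which is exactly the "wasted compute" rule above.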

How To Build Training Intuition Without Frontier Compute

The useful lesson from frontier work is not that every internal circuit is understood. The useful lesson is that model behavior can still be shaped in predictable directions through disciplined control of data, interfaces, budgets, and evaluation.

Layer Question What To Learn
Distribution What experiences is the model compressing? Data mix, replay pressure, repair rows, holdouts, contradiction checks
Interface What problem is the model actually being asked to solve? Prompt contract, DSL scope, structure/content split, canonical ordering
Budget Did the model see enough clean signal to learn the task? Effective epochs, packed token budgets, context usage, canary gates
Capacity Is the model too small, or is the task still badly shaped? Only scale after grammar, compiler, data, and probe paths are stable; large rung-to-rung swings on one small model usually mean recipe trouble before capacity trouble
Evaluation Are the right things being measured? Probe exactness, renderability, materialized exactness, per-family breakdowns, non-regression
System boundary What belongs in the model versus the compiler or content system? Keep deterministic rendering and data retrieval out of the model when possible

In other words, practical intuition comes from asking: what changed, why did behavior change, and which layer actually caused it? This is why the spec/run method matters. It turns training into a ledger of answers instead of a collection of lucky outcomes.

What Frontier Labs Usually Know

The claim that "no one knows how LLMs work" is too broad to be useful. A better version is this: full mechanistic theory is still incomplete, but empirical control is strong, and product/system control is often stronger still.

In practice, serious teams may not be able to explain every internal circuit, but they can still learn, with real discipline, that a particular data mixture, architecture, scale, objective, post-training recipe, and evaluation set tends to produce particular kinds of behavior. That is not total understanding, but it is real engineering knowledge.

Why Data Curation Still Matters

Scaling does not erase the training distribution. The model still compresses what it sees. Bad data creates bad priors, noisy mixtures create unstable behavior, narrow data creates narrow generalization, and well-shaped data creates cleaner abstractions.

This is why data curation remains central even without a complete theory of generalization. The practical loop is still: define the contract, shape the data to teach that contract, measure the right behavior, and repair the actual failure layer.

The Failure-To-Repair Loop

In practice, much of the progress comes from a simple discipline: observe the weakness, classify the weakness, teach that weakness directly, and rerun from the best relevant checkpoint when the spec has not changed.

This is not the same as "keep adding more data." Before adding rows, decide whether the failure is mainly in the data, DSL, compiler, tokenizer, token budget, decode hygiene, or capacity. Then fix the right layer.

A practical rule is: same spec -> usually continue or rerun inside the same family; new spec -> usually start a new line. This is how failures become supervision instead of wasted compute.

Run Failure Matrix

The most useful next step is not "train more." It is to name the failure class correctly. The table below is the default matrix for interpreting a run before deciding whether to repair the data, change the DSL, fix the compiler, or scale capacity.

Failure Class What It Looks Like What To Track Likely Cause Typical Fix
scene_prefix_failure Missing [scene], missing [layout:...], duplicated top-level attrs start-valid rate, missing-layout rate, duplicate-attr rate weak canonical anchors, too much fragment training add full-scene canonical rows and prefix-only repair rows
scene_suffix_failure Missing [/scene] close-tag miss rate weak termination training, decode stop/budget issues add close-tag rows, verify stop markers, verify decode budget
block_nesting_failure wrong closing tag, invalid nesting, repeated block open nesting error rate by block type weak block grammar, fragment-heavy repair rows add balanced block rows, transition rows, canonical block-order rows
budget_truncation Output is a correct prefix but cut off truncated_at_budget rate, prompt/output token counts decode budget too small or context too small raise decode budget first, then context only if needed
special_token_leak <|bos|>, <|eos|>, or prompt tokens leaking inside scene output special-token leak rate tokenizer boundary contamination, bad row boundaries strip/control special tokens and strengthen scene-only targets
contamination_supervision cleanup or restart rows teach junk surfaces directly; outputs collapse to empty strings, stray single tokens, or contaminated prefixes after a repair push empty-response rate, missing-scene-start rate, contamination-token frequency in training rows, before/after repair-row deltas synthetic corruption was mixed into ordinary CE targets and learned as part of the distribution remove contamination rows, protect full-scene replay anchors, and rerun a tiny canary before any broader repair pass
layout_drift Wrong family or empty layout per-layout confusion matrix family overlap, weak family anchors more direct family rows and layout-class repair rows
theme_tone_drift wrong or duplicated theme/tone attrs theme/tone confusion matrix, duplicate-attr rate weak top-level canonicalization, over-repair canonical scene-header rows and dedupe rules
renderable_but_not_exact SVG compiles but scene DSL is off exact vs renderable gap semantic drift, ordering drift targeted exactness repair rows
exact_but_not_materialized scene matches but final SVG differs materialized-exact gap compiler or content-binding bug fix renderer/probe path, not training
family_imbalance one family learns, one collapses per-family exact/renderable/materialized data imbalance or family-specific grammar difficulty family-specific anchors and family-weight tuning
undertraining high loss, broad failure everywhere loss curve, steps per epoch, token budget too little budget for the grammar difficulty raise epochs or total tokens
over_repair_fragmentation valid local fragments but corrupted full scenes after a repair push renderable drop after repair-row increase, local grammar error counts too many fragment rows relative to clean full scenes reduce fragment ratio and add more clean full-scene anchors
compiler_parity_gap the model may be fine but the target family still looks weak gold asset parity score compiler not expressive enough do a compiler pass before more training
probe_accounting_bug obviously good outputs score wrong mismatch between exact, renderable, and materialized evidence reporting or probe bug fix probe/report path first
seed_probe_integrity_gap a copied "good" checkpoint probes as broken before any new training seed-only probe result, artifact hash equality, tokenizer/template sidecar parity, live-vs-historic baseline repro seed staging bug, probe-runner drift, or decode/runtime drift rather than new training damage add a seed-only probe gate and repair staging/probe integrity before launching another rung
capacity_misdiagnosis operators want a bigger model because a few repair runs failed, but the same architecture already moved from near-zero to strong probe scores on cleaner recipes best-vs-worst rung spread on the same architecture, family stability under recipe changes, plateau across multiple clean canaries recipe or evaluation instability is being mistaken for a hard parameter limit stabilize the curriculum and probes first; scale only after several clean recipes plateau on the same failure class

Minimum Run Scoreboard

Every run should publish the same small scoreboard. This keeps comparisons honest and makes failure classes visible without reading raw prompt dumps first.

Metric Why It Matters
exact_rate scene contract fidelity
renderable_rate structural validity
materialized_exact_rate final compiler truth
budget_truncation_rate separates learning failure from budget failure
missing_scene_start_rate top-level grammar health
missing_scene_end_rate termination health
duplicate_attr_rate canonicalization health
block_nesting_error_rate nested grammar health
special_token_leak_rate tokenizer/output contamination
per-layout exact/renderable/materialized family-specific diagnosis
train/dev/test split rates overfit detection
gold_asset_parity_score compiler readiness

A simple interpretation rule helps keep the read honest: low exact with high renderable usually means semantic or ordering drift; low renderable with low truncation usually means grammar corruption; high truncation is a budget problem; high exact with low materialized exact is a compiler or probe bug.
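That interpretation rule can be written down directly against the scoreboard metric names above. The thresholds here are illustrative assumptions, not calibrated values:

```python
# Sketch of the scoreboard interpretation rule as a first-pass triage.
# Thresholds are illustrative; a real tool would also read per-family rates.
def triage(score: dict) -> str:
    exact = score["exact_rate"]
    renderable = score["renderable_rate"]
    materialized = score["materialized_exact_rate"]
    truncation = score["budget_truncation_rate"]
    if truncation > 0.10:
        return "budget problem: raise decode budget or compress the DSL"
    if exact > 0.70 and materialized < exact - 0.10:
        return "compiler or probe bug: fix the render/accounting path, not training"
    if exact < 0.30 and renderable > 0.90:
        return "semantic or ordering drift: targeted exactness repair rows"
    if renderable < 0.70:
        return "grammar corruption: rebalance fragments vs full-scene anchors"
    return "no dominant failure class: read the per-family breakdown"
```

The ordering matters: budget and probe problems must be ruled out first, because both can mimic model-quality failures.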

Repair Ladder Lessons

A recurring mistake in small-to-medium DSL runs is to treat every bad output as a direct template for the next repair row. That is too naive. The model does not understand that a row was meant as a warning; under ordinary supervised training it only sees another target distribution to imitate.

The capacity corollary: do not use bigger models to paper over a broken supervision surface. Scale capacity when the contract is stable and the same clean failure persists across multiple disciplined canaries. Until then, the higher-signal move is to improve the curriculum, replay anchors, and measurement path.

What To Copy From Frontier Practice

Ethical Scaling Path

Going from something simple to something bigger should not mean removing constraints faster than understanding improves. The safer path is to widen capability in stages while keeping the boundaries of responsibility clear.

Stage Ethical Rule Why It Matters
Bounded contract Start with a narrow, measurable task. Prevents vague demos from being mistaken for robust capability.
Honest boundaries State clearly what the model does, what the compiler does, and what external data does. Keeps the system understandable and avoids false claims about generality.
Bridge prompts Move from explicit prompts to weaker prompts gradually. Avoids turning one experiment into language learning, planning, retrieval, and rendering all at once.
Hard gates Require non-regression, canaries, and probe honesty before widening scope. Stops capability drift from being hidden behind bigger compute.
Human oversight Keep humans in the loop for high-stakes domains and claims. Capability should expand with accountability, not just with confidence.

Ethical scaling is not only about policy. It is also about research honesty: know what the model is really doing, know what the compiler is doing, know what the content system is doing, and report those boundaries clearly.

Pre-Training Checklist

Before launching any training run in a new spec, verify all of these gates. If any fails, fix it before spending compute.

Gate Check Tool
Compiler parity Every gold scene.dsl + content.json compiles to valid SVG that matches the reference asset Compiler round-trip test against gold pack
Token budget Longest gold output fits within 80% of context window after tokenization Gold-budget report tool or equivalent
Tokenizer coverage All DSL tokens in the gold pack are in the tokenizer vocabulary — no <unk> fallbacks Tokenize each gold scene and check for unknown tokens
Parity regimen CK ↔ PyTorch forward/backward/optimizer parity passes at the target model size training_parity_regimen_latest.json
Dataset QC All training rows parse, no duplicates, holdout prompts are disjoint from train Dataset QC step in training pipeline
Failure-frontier forecast The curriculum blueprint names the predictable failure classes and shows which surfaces teach each one before the first run starts spec17_curriculum_blueprint.json + audit_curriculum_blueprint_v7.py
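The checklist can be enforced as a single preflight step that refuses to launch on any failing gate. A sketch with stubbed gate functions; real gates would call the compiler round-trip test, the gold-budget tool, and the dataset QC step named in the table:

```python
# Sketch of a hard preflight gate runner: one failing gate blocks launch.
# The gate callables here are stubs standing in for the real checks above.
def run_preflight(gates: dict) -> dict:
    results = {name: bool(check()) for name, check in gates.items()}
    return {"results": results, "launch_ok": all(results.values())}

gates = {
    "compiler_parity":    lambda: True,   # stub: gold pack round-trips to SVG
    "token_budget":       lambda: True,   # stub: longest gold output < 80% ctx
    "tokenizer_coverage": lambda: False,  # stub: pretend an <unk> was found
}
```

Making the gate binary and automatic is the point: a run that starts with a known-red gate cannot later be interpreted cleanly, whatever its metrics.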

What A Solved Structured-DSL Run Actually Proves

A solved spec[x] rung[y] does not mean the model learned open-ended infographic design. It means something narrower and more useful: the model can reliably map a bounded prompt contract to a compiler-facing scene DSL, and that DSL can then be combined with external content.json to produce exact SVG.

Keep those responsibilities separate. When adding new DSL families, do not blend asset identity into the output grammar unless the structure truly requires it. Case ids, topic-specific facts, and payload-specific copy should stay in prompt text, routing metadata, and external content.json wherever possible. The output DSL should stay as generic as the renderer contract allows.

This boundary matters. A strong spec[x] rung[y] result proves the training method can stabilize structured program generation. It does not yet prove arbitrary topic generalization, arbitrary tree depth, free-written infographic copy, or unconstrained scene planning.

Prompt-Surface And Decode-Boundary Discipline

Hidden eval should cover at least two different generalization surfaces: paraphrases of prompts whose semantics were already seen in training, and genuinely new semantics such as unseen topic and layout combinations.

Solving prompt paraphrases is not the same as solving new semantics. Treat those as different milestones and report them separately.

Also keep decode-boundary hygiene separate from model quality. In raw CLI inference, a scene DSL model may generate a correct [scene] ... [/scene] block and then continue into the next training-style prompt if no structural stop marker is supplied. That is a decode configuration issue, not automatically a training failure. Measure the first valid scene block, then fix stop hygiene at the inference boundary.
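Measuring the first valid scene block can be sketched as a small extraction step applied before scoring, assuming the [scene] ... [/scene] tag shape used on this page:

```python
# Sketch of decode-boundary hygiene: score only the first complete scene
# block, so stop-marker leakage is not counted as a training failure.
def first_scene_block(raw: str):
    start = raw.find("[scene]")
    if start == -1:
        return None
    end = raw.find("[/scene]", start)
    if end == -1:
        return None  # genuine termination failure, not stop-marker leakage
    return raw[start : end + len("[/scene]")]
```

A None return is the only case that should count against the model; trailing continuation after a complete block is an inference-configuration issue.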

Next Bridge After A Solved Explicit-Layout Line

The next step after a solved explicit-layout line is not open-ended prose. It is a bounded intent bridge where the prompt still names the topic and goal, but stops prescribing the scene directly.

The short rule is: keep topic, goal, and audience explicit in the prompt, but stop naming the layout family or prescribing the scene structure.

Example intent-bridge prompt:

[task:svg] [topic:topic_id] [goal:goal_id] [audience:audience_id] [OUT]

This keeps the task bounded. The model is not being asked to learn open-domain language or free-written infographic copy. It is being asked to choose a good scene plan under weaker prompt control.

Current branch recommendation: spec17 established the bounded-intent bridge shape and spec18 tested a routing-first curriculum on the same frozen stack, but held-out exactness still stayed at zero. The next public recommendation is therefore spec19: a textbook-routing branch with named mixture buckets, denser minimal-pair routing coverage, and capacity held back as a separate fallback lever. See spec19-textbook-routing-mixture.html.

Progress Goal After The Last Good Baseline

The immediate goal after the last good baseline is not “teach the model English.” It is narrower and more useful: preserve compiler-backed precision while adding more SVG families and better within-family generalization.

Related Pages

v7 Runbook

The concrete operator commands, parity gates, and train/infer workflow.

Open v7-runbook.html

Training Intuition

The deeper failure-analysis and checkpointing page with the Phase 1–7 diagnostic matrix for training dynamics (gradients, attention, weights).

Open training-intuition.html

Training Curriculum

The long-range CK-native learning ladder from v7 through later versions.

Open training-curriculum.html

Spec17 Curriculum

The bounded intent-bridge blueprint, failure-frontier map, and stage mix planned after the frozen spec16 winner.

Open spec17-curriculum-blueprint.html

Spec19 Routing Mixture

The current recommended next branch: named mixture buckets, textbook-style routing rows, and a clean capacity fallback if held-out exactness stays flat.

Open spec19-textbook-routing-mixture.html

Version History

The public roadmap showing how each version track feeds the next capability layer.

Open version-history.html

Spec / Run Discipline

The versioned internal note explaining why the project uses spec[x] and r[y], and how that method builds training intuition.

version/v7/reports/SPEC_RUN_DISCIPLINE_AND_TRAINING_INTUITION_2026-03-18.md
