Spec19 Textbook Routing Mixture

spec19 is the recommended next branch after the bounded-intent experiments on top of the frozen spec16 scene-bundle winner. The point is not to keep churning near-identical routing rungs. The point is to change the data philosophy in a controlled way: explicit named mixture buckets, denser textbook-style routing coverage, cleaner minimal pairs, and capacity as a separate fallback lever instead of a hidden curriculum tweak.

Current branch decision: spec17 was a useful diagnostic bridge and spec18 proved the routing-first launcher path end to end, but spec18 r1 still ended with zero held-out exactness. That blocks spec18 r2 on the same recipe. The next public recommendation is a new branch: spec19.

Why Spec19 Exists

Spec17 Fixed The Question

spec17 established the real problem: bounded intent is a planning task, not just a syntax task. It also forced better audits and cleaner contrast coverage.

Spec18 Fixed The Launch Shape

spec18 kept the same compiler, renderer, tokenizer, and shared [bundle] contract while testing a routing-first curriculum. The run path was correct and reproducible.

Held-Out Routing Still Failed

spec18 r1 ended with overall exactness around 4.8%, but visible and hidden held-out exactness stayed at 0%. That means the remaining problem is not a missing guardrail. It is still a data and capacity question.

What To Copy From The Papers And The NVIDIA Dataset Lesson

Named Mixtures, Not One Blob

The useful lesson from NVIDIA's large public pretraining collections is organizational: split the corpus into named components with explicit provenance and mixture control. For CK-native training, that means versioned bucket ids, row counts, token counts, and stage weights.
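A versioned bucket manifest could be as small as the sketch below. This is illustrative only: the field names and `@v1` id convention are assumptions, not the project's actual schema.

```python
# Hypothetical sketch of a versioned mixture manifest. Field names
# (bucket_id, provenance, stage_weights, ...) are illustrative.
import json
from dataclasses import dataclass, asdict

@dataclass
class BucketManifest:
    bucket_id: str       # stable, versioned identifier, e.g. "routebook_direct@v1"
    provenance: str      # generator or source that produced the rows
    row_count: int
    token_count: int
    stage_weights: dict  # stage name -> sampling weight

manifest = [
    BucketManifest("anchors@v1", "spec16_r9_replay", 1200, 480_000,
                   {"stage_a": 0.20, "stage_b": 0.20}),
    BucketManifest("routebook_direct@v1", "lexical_templates", 3000, 900_000,
                   {"stage_a": 0.25, "stage_b": 0.25}),
]

# Serialize alongside the run artifacts so the mixture can be replayed.
blob = json.dumps([asdict(b) for b in manifest], indent=2)
```

Writing `blob` into the run artifact directory gives each rung a replayable record of exactly which buckets, at which versions and weights, fed the run.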

phi-1: Increase Structural Density

Prefer clean, compiler-validated, textbook-style teaching rows over more noisy coverage. The next branch should spend more mass on direct routing supervision and minimal pairs than on broad scaffold language.

Grokking: Stop Same-Shape Churn

If loss moves and held-out exactness stays flat, that is a block on repeating the same training style. Another spec18 rung with the same recipe would likely add cost without adding clarity.

HumanEval / Codex: Measure Correctness

Renderable-but-wrong rows are semantic failures, not wins. Promotion should still depend on held-out exactness. Sidecar best-of-k analysis can help diagnose whether the model knows more than greedy decode shows.

CodeT5: Score Fields, Not Just Strings

Bundle tags behave like code identifiers with constrained semantics. That means evaluation should separate syntax from family, form, style, and topology errors instead of collapsing everything into one exact-match bucket.

AlphaCode: Generate Then Filter

The compiler is a structural oracle. Best-of-k decode plus compiler filtering is a valid sidecar metric and practical inference path when syntax is close but greedy decode is brittle.
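The generate-then-filter loop is simple to state in code. In this sketch, `compiles` is a toy stand-in for the real compiler oracle and `sample_candidate` for the real decoder; only the control flow is the point.

```python
# Sketch of best-of-k decode with compiler filtering. `compiles` is a
# placeholder structural oracle, not the project's actual compiler.
import random

def compiles(text: str) -> bool:
    # toy oracle: accept only bundles with a closed contract
    return text.startswith("[bundle]") and text.endswith("[/bundle]")

def best_of_k(sample_candidate, k: int = 8, seed: int = 0):
    rng = random.Random(seed)
    for _ in range(k):
        cand = sample_candidate(rng)
        if compiles(cand):
            return cand          # first structurally valid candidate wins
    return None                  # all k candidates failed the oracle

def toy_sampler(rng):
    # toy decoder that only sometimes emits a closed bundle
    return rng.choice(["[bundle]truncated", "[bundle]ok[/bundle]"])
```

As a sidecar metric, the interesting quantity is how often `best_of_k` succeeds where greedy decode fails: a large gap means the model knows more than greedy decoding shows.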

Nemotron: Curate The Mixture Explicitly

The dataset lesson is operational, not mystical: keep named buckets, provenance, dedupe discipline, and measured mix ratios. The run artifact directory should record those manifests so the branch can be replayed or shipped.

Hard Boundary

Keep Fixed

  • Seed from frozen spec16 r9.
  • Freeze the tokenizer unless a tokenizer branch is the explicit axis.
  • Keep the same shared [bundle] ... [/bundle] output contract.
  • Keep the same deterministic renderer and compiler boundary.

Change One Main Thing

  • Change the dataset philosophy, not the whole stack at once.
  • Move to named mixture buckets.
  • Increase direct routebook density and minimal-pair coverage.
  • Separate any later capacity step into its own branch decision.

Do Not Regress Into

  • warning-language repair rows
  • raw SVG targets
  • prompt-contamination teaching
  • blurry runs that change curriculum and capacity together

Named Mixture Buckets

| Bucket | Purpose | Typical contents |
| --- | --- | --- |
| Anchor Replay | Preserve closure and exact bundle syntax. | Explicit bundle anchors, clean-stop anchors, permuted explicit anchors. |
| Routebook Direct | Teach topic + goal + audience -> family + form. | Short direct prompts with deterministic lexical templates and canonical targets. |
| Form Minimal Pairs | Separate sibling forms inside one family. | Near-neighbor prompts with one controlled form-changing difference. |
| Family Minimal Pairs | Separate similar intents across different families. | Cross-family contrasts with explicit distractor structure when needed. |
| Routebook Paraphrase | Strengthen prompt-surface robustness after the direct route is learned. | Reordered tags, plain-language paraphrases, and small lexical alternations. |
| Style/Topology Bridge | Widen to style and count inference only after routing rows exist in volume. | Bounded emphasis, constraint, and content_pack hints. |
| Repair Hygiene | Keep narrow closure cleanup separate from semantic teaching. | Stop-boundary and singleton cleanup only, no broad warning prose. |
| Holdout / Hidden | Evaluation only. | Visible routing holdouts, hidden paraphrase, hidden recombination. |

Stage Mix

Stage A: Route Lock

| Surface family | Weight | Reason |
| --- | --- | --- |
| anchors | 20% | Keep the solved shared bundle contract alive. |
| routebook_direct | 25% | Make direct routing the center of mass. |
| form_minimal_pairs | 20% | Force sibling-form discrimination. |
| family_minimal_pairs | 15% | Force cross-family discrimination. |
| routebook_paraphrase | 10% | Improve robustness without widening semantics too early. |
| style_topology_bridge + hygiene | 10% | Keep widening pressure present, but subordinate to routing. |

Stage B: Controlled Widening

| Surface family | Weight | Reason |
| --- | --- | --- |
| anchors | 20% | Preserve non-regression on the solved contract. |
| direct + minimal-pair routing | 45% | Routing stays dominant even while the target space widens. |
| routebook_paraphrase | 10% | Keep surface robustness under the same semantic contract. |
| style_topology_bridge | 20% | Add style and topology only after stronger routing coverage exists. |
| recombination + hygiene | 5% | Probe whether routing survives slightly broader combinations. |
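The stage mixes above can be enforced with plain weighted sampling over bucket names. The Stage A weights below are taken from the first table; the sampler itself is a minimal sketch, not the actual data loader.

```python
# Minimal sketch of stage-weighted bucket sampling using the Stage A mix.
import random

STAGE_A = {
    "anchors": 0.20,
    "routebook_direct": 0.25,
    "form_minimal_pairs": 0.20,
    "family_minimal_pairs": 0.15,
    "routebook_paraphrase": 0.10,
    "style_topology_bridge_hygiene": 0.10,
}

def sample_bucket(weights: dict, rng: random.Random) -> str:
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

rng = random.Random(42)
draws = [sample_bucket(STAGE_A, rng) for _ in range(10_000)]
```

Because the weights are explicit and sum to 1.0, the realized mix can be audited against the manifest after every run rather than inferred from the shuffled corpus.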

Recommended Run Sequence

Spec19 R1 Canary

Keep the current architecture first. The question is whether the routebook mixture changes held-out exactness without introducing new stack risk.

  • same frozen tokenizer
  • same spec16 r9 seed
  • same shared bundle contract
  • full-pass canary only

Gate For Promotion

Require nonzero exactness on both visible held-out splits and at least one hidden split. Renderability alone is not enough.
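The gate is mechanical enough to write down directly. The split names in this sketch are illustrative placeholders; only the and/or structure (both visible splits, at least one hidden split) comes from the text above.

```python
# Hedged sketch of the promotion gate: nonzero exactness on both visible
# held-out splits and at least one hidden split. Split names are assumed.
def promote(exactness: dict) -> bool:
    visible = ("visible_holdout_a", "visible_holdout_b")
    hidden = ("hidden_paraphrase", "hidden_recombination")
    visible_ok = all(exactness.get(s, 0.0) > 0.0 for s in visible)
    hidden_ok = any(exactness.get(s, 0.0) > 0.0 for s in hidden)
    return visible_ok and hidden_ok
```

Note that renderability never appears as an input: a rung that renders everything but scores zero exactness on any visible split still fails the gate.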

If It Is Still Zero

Treat that as a clean signal that the current model size may be the limit. Run a separate capacity canary on the same spec19 dataset recipe instead of silently changing data and model size together.

How To Read The Spec19 Rungs

By 2026-04-02, spec19 is no longer a reset line. It is a learnable rung line. The main question is no longer “does this branch work at all?” but “which curriculum regions are still under-covered?”

| Rung | What changed | What it taught |
| --- | --- | --- |
| r2 | Expanded corpus, stronger compiler smoke gating, longer real training budget. | First real bounded-intent success. Held-out exactness crossed above zero on every split, proving the branch is learnable. |
| r3b | Coherent replay union of prior train corpora with clean eval collision filtering. | Replay-heavy cumulative data improved aggregate exactness, but visible test regressed. That showed replay helps, but the coverage was still unbalanced. |
| r3c | r3b replay base plus broader neighbor coverage around persistent miss classes. | Visible test recovered to 10/10 and the family miss class disappeared entirely. Aggregate exactness did not rise because the remaining pressure moved into style and syntax. |
Interpretation: this is now a curriculum-coverage problem, not another branch-reset problem. A rung should teach which regions are under-supported, not which exact prompts to memorize. The next additions should widen coverage around the surviving miss classes while keeping strong replay from the winning corpus.
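The eval collision filtering that r3b relied on can be sketched as a simple prompt-set exclusion. The normalization step (strip plus lowercase) is a simplifying assumption; the real pipeline may use a stricter canonical form.

```python
# Illustrative eval-collision filter for replay unions: drop any training
# row whose prompt also appears in a held-out split. Normalization here
# (strip + lowercase) is an assumed, simplified canonical form.
def filter_collisions(train_rows: list[dict], eval_prompts: list[str]) -> list[dict]:
    held = {p.strip().lower() for p in eval_prompts}
    return [r for r in train_rows if r["prompt"].strip().lower() not in held]

train = [
    {"prompt": "Draw a bar chart"},
    {"prompt": "draw a BAR chart "},   # collides after normalization
    {"prompt": "Plot a line"},
]
kept = filter_collisions(train, ["Draw a bar chart"])
```

Running the filter at union time, before training, is what keeps the replay-heavy r3b/r3c corpora comparable against the same held-out splits.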

Current Error-Class Shift

| Rung | family | form | style | syntax |
| --- | --- | --- | --- | --- |
| r3b | 1 | 3 | 2 | 2 |
| r3c | 0 | 2 | 3 | 3 |

That is a deterministic signal. The branch is progressing in meaningful areas and regressing in a few structured areas. The right response is broader balanced curriculum support around those unstable regions, not narrow prompt patches and not a new spec.
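The shift readout itself is a one-line diff over the per-class counts. The numbers below are the r3b/r3c failure mixes from the table above; the readout code is a sketch.

```python
# Error-class shift between rungs, using the r3b/r3c counts from the table.
from collections import Counter

r3b = Counter(family=1, form=3, style=2, syntax=2)
r3c = Counter(family=0, form=2, style=3, syntax=3)

shift = {k: r3c[k] - r3b[k] for k in ("family", "form", "style", "syntax")}
improving = [k for k, d in shift.items() if d < 0]   # classes going down
regressing = [k for k, d in shift.items() if d > 0]  # classes going up
```

Here `regressing` names exactly the unstable regions (style and syntax) that the next curriculum additions should widen coverage around.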

Related Pages

Spec Training Method

The broader method for one-axis runs, deterministic repair boundaries, and branch decisions.

Open spec-training-method.html

Training Curriculum

The long-range CK-native roadmap that places spec19 inside the wider version ladder.

Open training-curriculum.html

Spec17 Curriculum Blueprint

The bounded-intent predecessor that made the problem visible and established the family structure.

Open spec17-curriculum-blueprint.html
