v7 Inference + Training Runbook

Single place to run v7 inference from GGUF and v7 training with true_bpe as the tokenizer source of truth across train + infer.

One Pipeline, Two Paths

Inference path: GGUF -> run dir -> IR/codegen -> .ck_build -> ck_chat.py.

Training path: text data -> true_bpe train -> token stream -> ck_run_v7.py train -> reuse same tokenizer artifacts for inference.

(Diagram) v7 Pipeline — Two Paths, One Runtime. The v7 script orchestrators (ck_run_v7.py, train_data_pipeline_v7.py) take data input (text / JSONL / hf:// GGUF) and produce run artifacts (tokenizer_bin + .ck_build). Training path (true_bpe): data prep -> BPE train -> tokenize -> CK train -> checkpoint -> promote (Steps 0-5). Inference path (GGUF or trained run): GGUF/run -> IR -> codegen -> .ck_build -> ck_chat.py interactive/batch (Steps 6-8). Promoted weights feed into the inference path.
Demo Lane
For a short operator demo, show three things only: GGUF -> IR visualizer, one SVG-backed training run, and the family training matrix via make regression-training-fast or make regression-training-full.

Demo Lane: Shortest Video Path

1. GGUF -> IR Visualizer -> Chat

Use this when you want the fastest visual proof that v7 lowers, codegens, and serves an interactive run directory.

version/v7/scripts/cks-v7-run run \
  hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf \
  --context-len 1024 --force-compile --force-convert \
  --generate-visualizer

This leaves a run root under ${CK_CACHE_DIR:-$HOME/.cache/ck-engine-v7/models}/Qwen--Qwen3-0.6B-GGUF with ir_report.html ready to open.
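To open that report straight from the shell, the path can be resolved the same way (a sketch; the `Qwen--Qwen3-0.6B-GGUF` directory name follows the naming shown above and is an assumption if you pulled a different model):

```shell
# Resolve the expected run root and report path (directory naming assumed from this runbook).
CK_CACHE_DIR="${CK_CACHE_DIR:-$HOME/.cache/ck-engine-v7/models}"
RUN_ROOT="$CK_CACHE_DIR/Qwen--Qwen3-0.6B-GGUF"
REPORT="$RUN_ROOT/ir_report.html"
if [ -f "$REPORT" ]; then
  xdg-open "$REPORT" >/dev/null 2>&1 || echo "[OK] report at: $REPORT"
else
  echo "[INFO] report not generated yet: $REPORT"
fi
```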

2. SVG Training Demo -> Dashboard

This is the shortest end-to-end training demo: repo SVG assets become staged corpora, tokenizer, parity gate, checkpoints, and refreshed visualizer artifacts.

bash version/v7/scripts/v7_train_init_preset.sh \
  --name demo_qwen3_l16_d128_h512_ctx512 \
  --template qwen3 --init xavier_uniform \
  --tokenizer ascii_bpe --optimizer adamw \
  --layers 16 --embed-dim 128 --hidden-dim 512 \
  --context-len 512 --vocab-size 1024 \
  --num-heads 8 --num-kv-heads 4 \
  --mode commands | bash

The helper seeds training_plan.json, builds the SVG corpora from docs/site/assets/*.svg, bootstraps the run, runs parity, then refreshes $RUN/ir_report.html.

3. Family Training Proof Table

Use this when you want the neat nightly-style table that shows the v7 training parity surface across the supported families.

make regression-training-fast
make regression-training-full

The fast lane covers qwen2, qwen3, gemma, and nanbeige; the full lane adds qwen35 and matches the heavier nightly-sized training lane.

Training Family Quick Starts

These are toy-scale operator starters for the supported v7 training families. They all use the same SVG-backed pipeline and the same helper, so the demo shape stays stable while the family changes.

Qwen2

bash version/v7/scripts/v7_train_init_preset.sh \
  --name demo_qwen2_l12_d96_h384_ctx512 \
  --template qwen2 --init xavier_uniform \
  --tokenizer ascii_bpe --optimizer adamw \
  --layers 12 --embed-dim 96 --hidden-dim 384 \
  --context-len 512 --vocab-size 1024 \
  --num-heads 8 --num-kv-heads 4 \
  --mode commands | bash

Qwen3

bash version/v7/scripts/v7_train_init_preset.sh \
  --name demo_qwen3_l16_d128_h512_ctx512 \
  --template qwen3 --init xavier_uniform \
  --tokenizer ascii_bpe --optimizer adamw \
  --layers 16 --embed-dim 128 --hidden-dim 512 \
  --context-len 512 --vocab-size 1024 \
  --num-heads 8 --num-kv-heads 4 \
  --mode commands | bash

Qwen3.5

bash version/v7/scripts/v7_train_init_preset.sh \
  --name demo_qwen35_l12_d128_h512_ctx1024 \
  --template qwen35 --init xavier_uniform \
  --tokenizer ascii_bpe --optimizer adamw \
  --layers 12 --embed-dim 128 --hidden-dim 512 \
  --context-len 1024 --vocab-size 1024 \
  --num-heads 8 --num-kv-heads 4 \
  --mode commands | bash

Gemma 3

bash version/v7/scripts/v7_train_init_preset.sh \
  --name demo_gemma3_l18_d160_h640_ctx1024 \
  --template gemma3 --init xavier_uniform \
  --tokenizer ascii_bpe --optimizer adamw \
  --layers 18 --embed-dim 160 --hidden-dim 640 \
  --context-len 1024 --vocab-size 1536 \
  --num-heads 10 --num-kv-heads 5 \
  --mode commands | bash

Nanbeige

bash version/v7/scripts/v7_train_init_preset.sh \
  --name demo_nanbeige_l16_d128_h512_ctx1024 \
  --template nanbeige --init xavier_uniform \
  --tokenizer ascii_bpe --optimizer adamw \
  --layers 16 --embed-dim 128 --hidden-dim 512 \
  --context-len 1024 --vocab-size 1024 \
  --num-heads 8 --num-kv-heads 8 \
  --mode commands | bash

After any family run, regenerate and open the dashboard with the matching run path.

RUN="$HOME/.cache/ck-engine-v7/models/train/demo_qwen3_l16_d128_h512_ctx512"
python3 version/v7/tools/open_ir_visualizer.py --generate --run "$RUN" --html-only --strict-run-artifacts
python3 version/v7/tools/open_ir_hub.py --open

Swap the RUN=... value to the family you just trained. These presets are demo-sized starters, not full published model sizes.
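If you train several families in a row, you can select the newest demo run automatically instead of retyping the name (a sketch; `latest_demo_run` is a hypothetical helper, and the `demo_*` prefix matches the preset names above):

```shell
# Pick the most recently modified demo_* run dir under a given root.
latest_demo_run() {
  ls -1dt "$1"/demo_* 2>/dev/null | head -n 1
}

RUN="$(latest_demo_run "$HOME/.cache/ck-engine-v7/models/train")"
echo "RUN=${RUN:-<none found>}"
```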

Step map: 0.0 Config -> 0A Bootstrap -> 0.5 Data -> 0.7 Tokenize -> 3.1 Parity Gate -> 3.7 Train -> 5 Promote -> 6 Build -> 7 Visualizer -> 8 Chat

Click any step above to jump there. Steps light up as you work through the training path.

Training Path (true_bpe)

Dataset prep -> ASCII/SVG cleanup -> ascii_bpe tokenizer -> train -> run-local inference.

Start training flow at Step 0

Inference Path (GGUF)

No training needed. Pull a GGUF from HF and run directly in v7 runtime.

Jump to Step 8 (GGUF inference)

Dataset-only workflow: for focused pretrain/midtrain SVG corpus generation from docs/site/assets/*.svg, use v7-svg-dataset-runbook.html.

Multi-domain training progression (experimental): for staged playbooks across SVG, reasoning/agent routing, and code domains, use v7-training-progression-playbook.html.

Methodical spec design: for the contract-level method that separates asset libraries, scene DSLs, compilers, and per-run report discipline, use spec-training-method.html.

Long-range CK-native roadmap: for how v7 training foundations roll into v8-v17 capabilities, use training-curriculum.html.

Parity gate checklist: run v7-parity-checklist.html for the current GO criteria (dataset_qc, tokenizer_roundtrip exact_match, D1/E1/F1, canary row1/row2).

Caveat: A1/A2 remain a separate kernel-harness bug track and do not block runbook execution readiness under the current operator gate.

Python Authoring Notebooks

Quickstart Notebook

Launch from the repo root so the notebook can auto-detect ckernel_engine/ and version/v7/.

jupyter lab notebooks/v7_python_authoring_quickstart.ipynb

Walks through materialize -> train -> prepare_viewers() and renders a run artifact dashboard with direct links into the rest of the v7 surface.

Artifact Walkthrough Notebook

Use after the quickstart or the CLI example to inspect the run-dir handoff boundary and generated artifacts.

jupyter lab notebooks/v7_python_authoring_artifact_walkthrough.ipynb

Focuses on python_authoring_plan.json, weights_manifest.json, ir1/ir2, layout, reports, and viewer outputs.

DSL Dataset Preparation Notebook

Use this when the user needs to understand the split-aware SVG/DSL workspace itself, not just a tiny training run.

jupyter lab notebooks/v7_dsl_dataset_preparation.ipynb

Scaffolds the real dataset workflow: inspect version/v7/data/spec04, materialize workspace artifacts, stage the workspace into $RUN/dataset/, regenerate dataset_viewer.html, refresh ir_hub.html, and surface the Python + CLI training handoff commands.

What the notebook lane gives you: a Python front door into the existing v7 runtime and artifact surface, not a separate execution stack.

Typical tiny-run outputs: python_authoring_plan.json, ir_report.html, embeddings.json, train_e2e_latest.json, and the shared ir_hub.html.

Conditional artifacts: dataset_viewer.html requires dataset manifests or a staged dataset workspace, and attention.json requires tokenizer plus probe artifacts.

Recommended Notebook Suite

The current notebook set covers thin Python authoring plus DSL dataset prep. The broader notebook surface should still settle into three user-facing lanes.

The right design is notebook as guided front door, with ir_hub.html and run-local HTML artifacts remaining the durable operator surface.

IR Hub Quick Launch

Open the parent dashboard for all runs under $HOME/.cache/ck-engine-v7/models.

.venv/bin/python version/v7/tools/open_ir_hub.py --open

Your latest run will appear automatically when its run dir is under $HOME/.cache/ck-engine-v7/models/train/... (including svg_l16_d128_h512_v1024_ctx512).

Why SVG for v7

(Diagram) Why SVG works for training, and how to stage the data. Domain properties: deterministic syntax + parse gates, code-like structure (tags/attrs/tree), immediate visual feedback for operators, fast measurable reward loops, repo-local bootstrap from docs/site/assets/*.svg. Operator leverage: better diagnostics, faster iteration loops, visible quality checks, stable train pipeline. Data curriculum policy (this runbook): Step 0.5 = baseline cleanup on the current dataset; Step 0.55 = scale corpus (repo SVG + synthetic); Stage A pretrain = simple forms / closure stability; Stage B midtrain = charts + infographic composition; then SFT / DPO / RL for instruction behavior.

What to remember

  • Step 0.5 is minimal and fast; use it for quick validation runs.
  • Step 0.55 is for quality scaling before hyperparameter tuning.
  • Full infographics are useful after syntax/closure is already stable.

Default progression

Simple primitives -> layout patterns -> full infographic structures.

Use Stage A checkpoint as the base for Stage B midtraining.

After that, run instruction-focused SFT/RL stages for prompt fidelity.

(Diagram) Complete v7 Workflow — From Raw Data to Interactive Chat: Step 0 Data Prep (UTF-8 text file) -> Step 0.7 BPE Train (tokenizer.json) -> Step 0.8 Roundtrip determinism gate -> Step 3 CK Train (loss drops) -> Step 5 Promote (ckpt -> weights) -> Step 6 IR + Build (.ck_build/, ck_chat.py) -> Step 8 Interactive Generation ("<svg" -> valid SVG). The training path covers Steps 0-5, the inference path Steps 6-8; the GGUF shortcut skips straight to Step 8 (data -> chat directly).

Training Pipeline Graph

Paste a training_plan.json to visualize compute stages + split-aware data flow. Default layout is stage + train/dev/test, with optional holdout when you need a canary or OOD bucket.

Click "Load Example" or paste a training_plan.json and click Render Graph. Graph legend: node states are Completed, Active, Planned, and Deferred; node types are Dataset and Output / Checkpoint.

Step 0.0: Config Init + Pipeline Designer (Recommended)

Set model shape once (name/layers/embed/context/vocab), then auto-generate the full export + command block with consistent run/data paths.

Interactive Config Builder (JS)

Pick model settings and stage plan, then click Generate to produce copy-ready commands in strict operator sequence.

Preset note: family defaults here are toy-scale operator starters (not exact published model sizes).

Stage Script Activation

DPO/GRPO/PPO are generated in --plan-only mode by default (datasets + manifests + workflow artifacts). Remove --plan-only to execute CE-surrogate alignment updates.

Advanced: token budget override

Auto mode reads pretrain token count from $RUN/training_pipeline_latest.json after --prepare-only.

Pipeline Designer — Define Stages & Generate training_plan.json

Define which training stages to run, assign the primary train dataset and LR per stage, then append dev/test/holdout datasets as needed in JSON. Click Generate Plan to produce a training_plan.json you can commit alongside your run dir and load into the IR Visualizer.

Available stages: pretrain, midtrain, sft, dpo, grpo, rlhf.

# Click "Generate training_plan.json" to produce the plan.

Save this JSON as $RUN/training_plan.json. The IR Visualizer reads it to render the pipeline timeline. Toggle stages on/off to define exactly what you intend to run. The builder seeds one train dataset per stage; append dev/test/holdout entries under stages[].datasets as you formalize evaluation, and append each completed run under stages[].runs without overwriting history.
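As a concrete starting shape, a minimal hand-written plan might look like this (a sketch; field names beyond stages[].datasets and stages[].runs are assumptions, so treat the builder's generated output as the source of truth):

```shell
# Write a minimal training_plan.json skeleton (field names partly assumed).
cat > /tmp/training_plan_example.json <<'EOF'
{
  "stages": [
    {
      "name": "pretrain",
      "enabled": true,
      "lr": 5e-4,
      "datasets": [
        {"split": "train", "path": "data/stage_a_plus_bridge.txt"}
      ],
      "runs": []
    },
    {
      "name": "midtrain",
      "enabled": false,
      "lr": 2e-4,
      "datasets": [
        {"split": "train", "path": "data/stage_b.txt"}
      ],
      "runs": []
    }
  ]
}
EOF
echo "[OK] wrote /tmp/training_plan_example.json"
```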

# Click "Generate Commands" to populate this block.

Generated sequence: config -> dataset build -> pipeline bootstrap -> tokenizer check -> pretrain -> parity gate -> midtrain -> sft -> dashboards.

Shell script alternative — same config, run directly in terminal instead of using the builder above:

# Print exports + full command plan (copy/paste result; does NOT execute training)
bash version/v7/scripts/v7_train_init_preset.sh \
  --name svg_l16_d128_h512_v1024_ctx512_clean02 \
  --template qwen3 --init xavier_uniform \
  --tokenizer ascii_bpe --optimizer adamw \
  --layers 16 \
  --embed-dim 128 \
  --hidden-dim 512 \
  --context-len 512 \
  --vocab-size 1024 \
  --num-heads 8 \
  --num-kv-heads 4 \
  --mode both

# Optional: apply exports directly in current shell
eval "$(bash version/v7/scripts/v7_train_init_preset.sh \
  --name svg_l16_d128_h512_v1024_ctx512_clean02 \
  --template qwen3 --init xavier_uniform \
  --tokenizer ascii_bpe --optimizer adamw \
  --layers 16 --embed-dim 128 --hidden-dim 512 \
  --context-len 512 --vocab-size 1024 --mode env)"

# Optional: execute generated command plan directly
bash version/v7/scripts/v7_train_init_preset.sh \
  --name svg_l16_d128_h512_v1024_ctx512_clean02 \
  --template qwen3 --init xavier_uniform \
  --tokenizer ascii_bpe --optimizer adamw \
  --layers 16 --embed-dim 128 --hidden-dim 512 \
  --context-len 512 --vocab-size 1024 --mode commands | bash

The helper derives RUN under $HOME/.cache/ck-engine-v7/models/train/$CK_NAME and binds stage dataset paths using the same prefix, so the IR hub and visualizer stay aligned.

Produces: a $RUN path derived from the config name, consistent stage dataset paths, a $RUN/training_plan.json seed block, and a copyable train command set.

Step 0: Prereqs + Paths

Run this once before any raw .venv/bin/python ... commands:

make v7-init
make v7-doctor

Required Python packages (supported v7 workflows): numpy torch safetensors requests tqdm huggingface_hub gguf tokenizers transformers

This is the supported v7 bundle for run / train / parity workflows. Not every package is used on every execution path. If you prefer uv, conda, or manual pip, install that same package set into the interpreter you plan to use, then run make v7-doctor. The runbook examples assume that check already passed.

If you used the Config Builder above, $ROOT, $RUN, and $GEN_DIR are already exported — skip to Step 0A. Use this step when setting paths manually or after opening a new terminal.

export ROOT=/home/antshiv/Workspace/C-Kernel-Engine  # set to your local repo checkout
export RUN=$HOME/.cache/ck-engine-v7/models/train/v7_svg_assets_bpe_l24_full_e1
export DATA_DIR=$ROOT/version/v7/data
export DATA_TXT=$DATA_DIR/svg_assets_train.txt

mkdir -p "$RUN" "$DATA_DIR"
if [ ! -f "$DATA_TXT" ]; then
  echo "[ERROR] Missing data file: $DATA_TXT"
  echo "Set DATA_TXT to a real file or create one in $DATA_DIR."
else
  echo "[OK] Found data file: $DATA_TXT"
fi

Important: use $HOME/.cache/..., not /.cache/.... These vars are exported for later Python steps; if you open a new terminal, re-run Step 0 first. The repo includes a bundled demo file at version/v7/data/svg_assets_train.txt.
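Because these exports vanish with the terminal, it can help to persist them next to the run (a sketch; `.ck_env.sh` is a hypothetical file name, not something the tooling reads):

```shell
# Save the Step 0 exports so a fresh terminal can re-source them.
ENV_FILE="${RUN:-/tmp}/.ck_env.sh"
{
  echo "export ROOT=\"${ROOT:-}\""
  echo "export RUN=\"${RUN:-}\""
  echo "export DATA_DIR=\"${DATA_DIR:-}\""
  echo "export DATA_TXT=\"${DATA_TXT:-}\""
} > "$ENV_FILE"
echo "[OK] wrote $ENV_FILE  (later: source \"$ENV_FILE\")"
```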

Exports: $ROOT $RUN $DATA_DIR $DATA_TXT

Open Run Hub (all runs in one place):

# Generate + open consolidated run dashboard for ~/.cache/ck-engine-v7/models
.venv/bin/python version/v7/tools/open_ir_hub.py --open

# Optional explicit paths
# .venv/bin/python version/v7/tools/open_ir_hub.py \
#   --models-root "$HOME/.cache/ck-engine-v7/models" \
#   --output "$HOME/.cache/ck-engine-v7/models/ir_hub.html" \
#   --index-out "$HOME/.cache/ck-engine-v7/models/runs_hub_index.json" \
#   --open

Use Run Hub as the parent operator view, then drill into each run's ir_report.html.

Guardrail: keep RUN under $HOME/.cache/ck-engine-v7/models/train/... for automatic hub indexing. Repo-local run dirs are easy to forget and usually need manual move/symlink later.
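A quick guard for that rule (a sketch; `run_is_indexed` is a hypothetical helper):

```shell
# Return success only when a run path sits under the hub-indexed training tree.
run_is_indexed() {
  case "$1" in
    "$HOME"/.cache/ck-engine-v7/models/train/*) return 0 ;;
    *) return 1 ;;
  esac
}

if run_is_indexed "${RUN:-/nowhere}"; then
  echo "[OK] RUN will be auto-indexed by the hub"
else
  echo "[WARN] RUN is outside models/train/; move or symlink it for hub indexing"
fi
```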

Step 0A: Pipeline Bootstrap (Recommended)

Use this once per new run to make the training pipeline explicit from the start: generate staged datasets, lock tokenizer corpus, seed run artifacts, and generate the first operator view.

Dataset workspace first: if you are using the staged dataset flow, prepare a split-aware workspace such as version/v7/data/spec04 in v7-svg-dataset-runbook.html first, then stage that workspace into the run with --dataset-workspace. The repo workspace is the seed template; the run-local copy under $RUN/dataset/ is the working dataset for that run.

Operator rule: do not split generated artifacts between the repo and cache. Keep the staged dataset, dataset_viewer.html, ir_report.html, checkpoints, parity JSON, and perf outputs together under one cache run directory so the run can be inspected, tarred, shared, and discovered by python3 version/v7/tools/open_ir_hub.py --open without hidden side locations.

# 1) Build staged SVG corpora + instruction corpus
export GEN_DIR="$RUN/data"
mkdir -p "$GEN_DIR"

.venv/bin/python version/v7/scripts/build_svg_pretrain_corpus_v7.py \
  --out-dir "$GEN_DIR" \
  --prefix svg_pretrain_pack_l16d128 \
  --assets-glob "$ROOT/docs/site/assets/*.svg"

.venv/bin/python version/v7/scripts/generate_svg_instruction_dataset_v7.py \
  --out-dir "$GEN_DIR" \
  --prefix svg_pretrain_pack_l16d128 \
  --num-samples 1000 --jsonl

# 2) Canonical stage dataset bindings for this run
export TOKENIZER_CORPUS="$GEN_DIR/svg_pretrain_pack_l16d128_tokenizer_corpus.txt"
export PRETRAIN_DATA="$GEN_DIR/svg_pretrain_pack_l16d128_stage_a_plus_bridge.txt"
export MIDTRAIN_DATA="$GEN_DIR/svg_pretrain_pack_l16d128_stage_b.txt"
export SFT_DATA="$GEN_DIR/svg_pretrain_pack_l16d128_instruction_train.txt"

# 3) Bootstrap run + tokenizer artifacts (no training yet)
.venv/bin/python version/v7/scripts/train_data_pipeline_v7.py \
  --run "$RUN" \
  --init-if-missing \
  --init xavier_uniform \
  --template qwen3 \
  --curriculum-stage stage_a \
  --tokenizer ascii_bpe \
  --require-svg-rows \
  --strict-data-gates \
  --data "$TOKENIZER_CORPUS" \
  --vocab-size 1024 --bpe-vocab-size 1024 \
  --layers 16 --embed-dim 128 --hidden-dim 512 \
  --seq-len 512 --total-tokens 1048576 \
  --prepare-only \
  --json-out "$RUN/train_prepare_stage_a.json"

# 4) Generate initial operator artifacts
.venv/bin/python version/v7/tools/open_ir_visualizer.py --generate --run "$RUN" --html-only --strict-run-artifacts
.venv/bin/python version/v7/tools/open_ir_hub.py --open

Canonical path: Step 0A is the real bootstrap path. Use Step 0.7 only as a tiny tokenizer preview smoke-check (it uses stub dims and should not be your production bootstrap).

Split-aware staged-dataset bootstrap (spec04 default)
# 0) Prepare a split-aware workspace once (see v7-svg-dataset-runbook.html for import/normalize/classify/materialize)
bash version/v7/scripts/init_data_workspace_v7.sh --spec spec04 --dataset-type svg

# 1) Bootstrap a run with a staged dataset snapshot under $RUN/dataset
.venv/bin/python version/v7/scripts/ck_run_v7.py init \
  --run "$RUN" \
  --layers 16 --vocab-size 1024 --embed-dim 128 --hidden-dim 512 \
  --num-heads 8 --num-kv-heads 4 --context-len 512 \
  --template qwen3 \
  --dataset-workspace version/v7/data/spec04 \
  --dataset-stage-mode copy \
  --dataset-stage-force \
  --generate-ir --generate-runtime --strict

# 2) Regenerate both operator views after dataset changes
.venv/bin/python version/v7/tools/open_ir_visualizer.py --generate --run "$RUN" --html-only --strict-run-artifacts --output "$RUN/ir_report.html"
.venv/bin/python version/v7/scripts/build_svg_dataset_visualizer_v7.py \
  --workspace "$RUN/dataset" \
  --output "$RUN/dataset_viewer.html"

Use this flow when you want the whole run to be self-contained under $RUN: IR viewer, Dataset Viewer, manifests, stage split corpora (train/dev/test by default), tokenizer corpus, optional holdout, parity JSON, and perf artifacts all live under the same run directory.

Rule: after this bootstrap, keep tokenizer fixed and run later stages with --reuse-run-tokenizer to avoid vocabulary drift.
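A simple drift tripwire to pair with --reuse-run-tokenizer (a sketch; it assumes sha256sum is available, and the .sha256 sidecar file is a hypothetical convention, not a pipeline artifact):

```shell
# Record a tokenizer checksum once, then verify it before later stages.
tok_checksum() { sha256sum "$1" | awk '{print $1}'; }

# After bootstrap:
#   tok_checksum "$RUN/tokenizer.json" > "$RUN/tokenizer.sha256"
# Before midtrain/sft:
#   [ "$(tok_checksum "$RUN/tokenizer.json")" = "$(cat "$RUN/tokenizer.sha256")" ] \
#     || { echo "[ERROR] tokenizer drift detected"; exit 1; }
```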

Produces: $RUN/train_prepare_stage_a.json $RUN/training_pipeline_latest.json $RUN/ir_report.html $HOME/.cache/ck-engine-v7/models/ir_hub.html

Step 0B: Tag Contract — Specialist Model Interface

This model is a tag→SVG specialist renderer, not a general instruction-following LLM. The interface is a closed vocabulary of structured tags — understand this contract before training or running inference.

Training Row Format

# Every training row and inference prompt uses the same closed-vocab tag prefix:
[circle][palette:cool][style:minimal][layout:center]<svg width="140" height="140">...</svg><eos>
[bar-chart][bars:5][ascending][palette:warm][axes][trend-line]<svg width="280" height="180">...</svg><eos>
[infographic][palette:dark][style:gradient][labeled][complexity:rich]<svg width="320" height="180">...</svg><eos>

The old <task>...</task><svg> format is superseded. All spec-driven corpus rows use the tag prefix format above. Do not mix formats in the tokenizer corpus.
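One cheap way to enforce that rule before BPE training (a sketch; `count_legacy_rows` is a hypothetical helper):

```shell
# Count rows that still use the superseded <task>...</task> prefix format.
count_legacy_rows() {
  grep -c '<task>' "$1" 2>/dev/null || true
}

n="$(count_legacy_rows "${TOKENIZER_CORPUS:-/dev/null}")"
if [ "${n:-0}" -eq 0 ]; then
  echo "[OK] no legacy <task> rows in the tokenizer corpus"
else
  echo "[ERROR] $n legacy <task> rows found; rebuild the corpus before BPE training"
fi
```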

Tag Vocabulary

Shape: [circle] [rect] [line] [triangle] [ellipse] [polygon] [path] [arrow] [double-arrow] [polyline]
Chart: [bar-chart] [line-chart] [table] [pie-chart] [scatter]
Infographic: [infographic] [badge] [flow] [timeline] [legend] [progress]
Count: [bars:3] [bars:5] [bars:7] [points:4] [points:6] [points:8] [slices:3] [slices:4] [shapes:2] [shapes:3]
Data order: [ascending] [descending] [mixed] [axes] [trend-line] [labeled] [values]
Palette: [palette:neutral] [palette:bold] [palette:warm] [palette:cool] [palette:pastel] [palette:dark]
Style: [style:gradient] [style:minimal] [style:filled] [style:outline]
Layout: [layout:center] [layout:tiled] [layout:grid] [layout:stacked] [layout:horizontal] [layout:header] [layout:flow]
Complexity: [complexity:simple] [complexity:moderate] [complexity:rich]
Prompt style: [prompt:concise] [prompt:descriptive] [prompt:coordinate] [prompt:imperative]

Custom BPE is trained on this corpus — frequent patterns like [bar-chart] merge into a single token. Verify this after Step 0A: [bar-chart] should appear as 1 token ID, not 11 characters.
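A lightweight spot-check for that (a sketch; it assumes the vocab is stored as JSON string keys inside tokenizer.json, which holds for HuggingFace-style tokenizer files but should be verified against your artifact):

```shell
# True when a tag string appears as a single vocab key in a tokenizer JSON file.
tag_is_single_token() {
  grep -qF "\"$1\"" "$2"
}

# After Step 0A:
#   tag_is_single_token '[bar-chart]' "$RUN/tokenizer.json" \
#     && echo "[OK] [bar-chart] merged into one token" \
#     || echo "[WARN] [bar-chart] is not a single vocab entry"
```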

Two-Model Cascade (Inference)

# Big model (GPT/Claude): natural language → structured tags
# Small model (128d CK specialist): tags → SVG

# User: "5-bar ascending chart, warm colors"
#   → Big model → [bar-chart][bars:5][ascending][palette:warm][axes]
#   → Small model → <svg>...</svg>

# Direct inference after training:
echo "[bar-chart][bars:5][ascending][palette:cool][axes]" | \
  ./build/ck-cli-v7 --model "$CKPT_PATH" --max-tokens 512

Coverage Gate Rules

Gate is enforced by --strict-coverage in the corpus builder. Spec catalog: version/v7/data/spec_catalog_v1.json (128 specs across 6 groups). Coverage manifests are emitted per stage and consumed by the pipeline gate in Step 3.7.

Reference: version/v7/data/spec_catalog_v1.json $DATA_DIR/*_coverage_manifest.json

Step 0.1: Clean Start (Optional)

Use this when you want a true fresh runbook pass with no stale artifacts.

# Preview what will be removed (scoped to this RUN only)
echo "$RUN"
echo "$RUN/parity_canary"
echo "$RUN/parity_svg_row1"
echo "$RUN/parity_svg_row2"

# Fresh start: remove previous artifacts for this run scope
rm -rf "$RUN"

# Recreate run dir and verify env is still exported
mkdir -p "$RUN"
echo "[OK] RUN=$RUN"
echo "[OK] DATA_TXT=$DATA_TXT"

This does not touch repo source files. It only resets this specific run path under $HOME/.cache/ck-engine-v7/models/train/....

Step 0.5: Data Preparation (Training)

This is the minimal baseline data path. Training expects a UTF-8 plain text file. For a quick demo, use the bundled $DATA_TXT. For strict SVG + ASCII mode, treat cleanup as required before training.

When to use Step 0.55: if the baseline file is small/repetitive and output quality stalls (closure/valid SVG rate), run Step 0.55 to scale corpus coverage before changing model hyperparameters.

# Fast path: bundled demo dataset (already in repo)
BASE_DATA_TXT="${DATA_DIR:-$ROOT/version/v7/data}/svg_assets_train.txt"
if [ ! -s "$DATA_TXT" ] && [ -s "$BASE_DATA_TXT" ]; then
  echo "[WARN] DATA_TXT missing or empty: $DATA_TXT"
  echo "[WARN] Falling back to bundled dataset: $BASE_DATA_TXT"
  DATA_TXT="$BASE_DATA_TXT"
  export DATA_TXT
fi

if [ ! -s "$DATA_TXT" ]; then
  echo "[ERROR] Bundled dataset missing or empty: $DATA_TXT"
  echo "Fix: git pull, or set DATA_TXT to your own text file."
else
  echo "[OK] Using bundled dataset: $DATA_TXT"
  wc -l -c "$DATA_TXT"
  head -n 3 "$DATA_TXT"
fi

# Optional: replace with your own JSONL source
# Leave SRC_JSONL empty to skip this section.
SRC_JSONL=""
if [ -n "$SRC_JSONL" ]; then
  if [ ! -f "$SRC_JSONL" ]; then
    echo "[ERROR] Missing JSONL source: $SRC_JSONL"
    echo "Set SRC_JSONL to a real path, or keep it empty to use bundled data."
  else
    jq -r '.text // empty' "$SRC_JSONL" | sed '/^[[:space:]]*$/d' > "$DATA_TXT"
    perl -pe 's/\r\n?/\n/g' "$DATA_TXT" > "$DATA_TXT.tmp" && mv "$DATA_TXT.tmp" "$DATA_TXT"
    echo "[OK] Replaced DATA_TXT from JSONL: $DATA_TXT"
    wc -l -c "$DATA_TXT"
    head -n 3 "$DATA_TXT"
  fi
else
  echo "[INFO] SRC_JSONL not set; keeping current DATA_TXT."
fi
# Required for strict SVG + ASCII path
# This cleanup keeps bytes ASCII-only and drops non-SVG rows.
SRC_DATA="$DATA_TXT"      # or /absolute/path/raw.jsonl
SRC_FMT="text"            # text | jsonl | auto

python3 version/v7/scripts/prepare_ascii_dataset_v7.py \
  --input "$SRC_DATA" \
  --output "$DATA_TXT" \
  --input-format "$SRC_FMT" \
  --jsonl-text-key text \
  --ascii-map-common \
  --ascii-mode xml_escape \
  --svg-only

wc -l -c "$DATA_TXT"
head -n 3 "$DATA_TXT"

--ascii-map-common applies keyboard-style replacements for common symbols first; xml_escape then preserves any remaining Unicode semantics while keeping file bytes ASCII-only (example: é -> &#xE9;).

At runtime, v7 training uses next-token windows from the token stream, not "one JSONL row = one batch sample".
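A toy illustration of that windowing (illustrative only; seq_len=4 here, and this mirrors the idea rather than the exact v7 batching code):

```shell
# Turn a flat token stream into overlapping (input, target) next-token windows.
printf '%s\n' 10 11 12 13 14 15 16 > /tmp/ck_tokens_demo.txt
awk -v L=4 '
  { t[NR] = $1 }
  END {
    for (i = 1; i + L <= NR; i++) {
      inp = ""; tgt = ""
      for (j = 0; j < L; j++) { inp = inp t[i+j] " "; tgt = tgt t[i+j+1] " " }
      print "input: " inp "-> target: " tgt
    }
  }' /tmp/ck_tokens_demo.txt > /tmp/ck_windows_demo.txt
cat /tmp/ck_windows_demo.txt
```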

Step 0.55: Build Larger ASCII SVG Corpus (Recommended)

This is the scale-up data path. It increases dataset diversity before you tune model hyperparameters. Output remains strictly ASCII while coverage expands using repo SVG assets + synthetic SVG generator data.

# Recommended: one-command builder (stitches existing scripts).
python3 version/v7/scripts/build_svg_pretrain_corpus_v7.py \
  --out-dir "$DATA_DIR" \
  --prefix svg_pretrain_pack \
  --assets-glob "$ROOT/docs/site/assets/*.svg" \
  --spec-catalog "$ROOT/version/v7/data/spec_catalog_v1.json" \
  --stage-a-samples 24000 \
  --stage-b-samples 28000 \
  --holdout-ratio 0.10 \
  --seed 42 \
  --strict-coverage

DATA_TXT="$DATA_DIR/svg_pretrain_pack_stage_a_plus_bridge.txt"
DATA_TXT_STAGE_B="$DATA_DIR/svg_pretrain_pack_stage_b.txt"
echo "[OK] Stage-A+bridge: $DATA_TXT"
echo "[OK] Stage-B:        $DATA_TXT_STAGE_B"
echo "[OK] Coverage gate:  passed (strict)"
wc -l -c "$DATA_TXT" "$DATA_TXT_STAGE_B"

This wrapper calls existing scripts (build_svg_corpus_from_assets_v7.py, generate_svg_instruction_dataset_v7.py, build_stage_a_bridge_svg_v7.py, prepare_ascii_dataset_v7.py) and emits train-ready Stage-A+bridge and Stage-B corpora.

Manual equivalent (advanced / debugging):

ASSET_UTF8="$DATA_DIR/svg_assets_docs_utf8.txt"
ASSET_UTF8_MANIFEST="$DATA_DIR/svg_assets_docs_utf8_manifest.json"
MAP_TSV="$DATA_DIR/svg_utf8_to_ascii_table.tsv"
MAP_JSON="$DATA_DIR/svg_utf8_to_ascii_table.json"
ASSET_ASCII="$DATA_DIR/svg_assets_docs_ascii.txt"
ASSET_MANIFEST="$DATA_DIR/svg_assets_docs_ascii_manifest.json"
SYN_PREFIX="svg_instruction_aug"
SYN_SVG="$DATA_DIR/${SYN_PREFIX}_svg_train.txt"
DATA_TXT_ASCII="$DATA_DIR/svg_assets_train_scaled_ascii.txt"
TMP_ASCII="$DATA_DIR/svg_assets_train_scaled_ascii.tmp.txt"

# 1) Flatten docs/site/assets/*.svg to one UTF-8 SVG-per-line corpus.
python3 version/v7/scripts/build_svg_corpus_from_assets_v7.py \
  --assets-glob "$ROOT/docs/site/assets/*.svg" \
  --output "$ASSET_UTF8" \
  --manifest "$ASSET_UTF8_MANIFEST" \
  --ascii-mode none \
  --no-dedupe

# 2) Generate UTF-8 -> ASCII mapping coverage table.
python3 version/v7/scripts/generate_utf8_ascii_map_report_v7.py \
  --input "$ASSET_UTF8" \
  --tsv-out "$MAP_TSV" \
  --json-out "$MAP_JSON"

# 3) Build strict ASCII SVG corpus: keyboard mapping first, XML escapes as fallback.
python3 version/v7/scripts/build_svg_corpus_from_assets_v7.py \
  --assets-glob "$ROOT/docs/site/assets/*.svg" \
  --output "$ASSET_ASCII" \
  --manifest "$ASSET_MANIFEST" \
  --ascii-map-common \
  --ascii-mode xml_escape \
  --no-dedupe

# 4) Generate additional synthetic SVG-only rows (already ASCII).
python3 version/v7/scripts/generate_svg_instruction_dataset_v7.py \
  --out-dir "$DATA_DIR" \
  --prefix "$SYN_PREFIX" \
  --num-samples 20000 \
  --holdout-ratio 0.10 \
  --seed 42

# 5) Merge both sources, shuffle, and enforce strict ASCII + SVG-only rows.
{ cat "$ASSET_ASCII"; cat "$SYN_SVG"; } | sed '/^[[:space:]]*$/d' | shuf > "$TMP_ASCII"
python3 version/v7/scripts/prepare_ascii_dataset_v7.py \
  --input "$TMP_ASCII" \
  --output "$DATA_TXT_ASCII" \
  --input-format text \
  --ascii-map-common \
  --ascii-mode xml_escape \
  --svg-only

# 6) Promote the scaled corpus as canonical training input for next steps.
DATA_TXT="$DATA_TXT_ASCII"
echo "[OK] DATA_TXT=$DATA_TXT"
wc -l -c "$ASSET_UTF8" "$ASSET_ASCII" "$SYN_SVG" "$DATA_TXT"
echo "[OK] UTF->ASCII map table: $MAP_TSV"

What Step 0.55 does in plain terms: (1) flattens docs/site/assets/*.svg into line-per-SVG corpora, (2) reports UTF-8->ASCII mapping coverage, (3) builds strict ASCII corpus, (4) augments with synthetic SVG rows, (5) shuffles/filters and promotes the larger dataset for the next steps.

This keeps the training stream strictly ASCII while preserving intent: keyboard-style substitutions first, XML escapes second. The scaled corpus usually improves closure stability and lowers exact-copy rate on holdout checks compared with the tiny bundled demo file.
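A rough way to quantify that before/after (a sketch; `dup_rate` is a hypothetical helper reporting the percentage of exactly duplicated rows):

```shell
# Percentage of rows in a corpus that are exact duplicates of an earlier row.
dup_rate() {
  total=$(wc -l < "$1")
  uniq_rows=$(sort "$1" | uniq | wc -l)
  if [ "$total" -gt 0 ]; then
    echo $(( (total - uniq_rows) * 100 / total ))
  else
    echo 0
  fi
}

# dup_rate "$DATA_TXT"   # lower after Step 0.55 scaling is the expected direction
```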

Step 0.56: Stage-A Syntax Bridge Pack (Optional, Recommended)

If Stage-A uses a tiny/simple corpus, inject a small bridge pack so Stage-A already sees core Stage-B syntax primitives (<g>, <defs>, gradients, markers, transforms, filters, viewBox).

STAGE_B_SOURCE="${ASSET_ASCII:-$DATA_DIR/svg_assets_docs_ascii.txt}"
BRIDGE_TXT="$DATA_DIR/svg_stage_a_bridge_small.txt"
BRIDGE_MANIFEST="$DATA_DIR/svg_stage_a_bridge_small_manifest.json"
STAGE_A_PLUS="$DATA_DIR/svg_stage_a_plus_bridge_small.txt"

python3 version/v7/scripts/build_stage_a_bridge_svg_v7.py \
  --stage-a "$DATA_TXT" \
  --stage-b "$STAGE_B_SOURCE" \
  --out "$BRIDGE_TXT" \
  --manifest "$BRIDGE_MANIFEST" \
  --per-feature-cap 4 \
  --max-total 48 \
  --seed 42

{ cat "$DATA_TXT"; cat "$BRIDGE_TXT"; } | sed '/^[[:space:]]*$/d' | shuf > "$STAGE_A_PLUS"
python3 version/v7/scripts/prepare_ascii_dataset_v7.py \
  --input "$STAGE_A_PLUS" \
  --output "$STAGE_A_PLUS" \
  --input-format text \
  --ascii-map-common \
  --ascii-mode xml_escape \
  --svg-only

DATA_TXT="$STAGE_A_PLUS"
echo "[OK] Stage-A bridge enabled: DATA_TXT=$DATA_TXT"
wc -l "$BRIDGE_TXT" "$DATA_TXT"

This keeps Stage-A simple while preventing a hard syntax shock at Stage-B. You still keep Stage-A objective/hyperparameters; only the input syntax support broadens.

Step 0.6: Preview Dataset

Quick sanity pass before training to confirm file size/content shape.

echo "[INFO] DATA_TXT=$DATA_TXT"
ls -lh "$DATA_TXT"
wc -l -c "$DATA_TXT"
head -n 5 "$DATA_TXT"
tail -n 5 "$DATA_TXT"

# Optional: show any non-SVG-looking lines
grep -nEv '^[[:space:]]*<svg' "$DATA_TXT" | head -n 20 || true

Step 0.7: Tokenize Dataset (true_bpe preview)

Optional preview only: skip this if you already ran Step 0A. Step 0.7 is a quick tokenizer smoke-check path with intentionally tiny/stub dims.

Create tokenizer artifacts and a token stream in the main run so operators use one dashboard path from start to finish.

TOKENIZE_RUN="$RUN"
TOKENIZE_WORK="$TOKENIZE_RUN/.ck_pipeline/bpe_preview_$(date -u +%Y%m%d_%H%M%S)"
mkdir -p "$TOKENIZE_RUN"

cd "$ROOT"
# Use strict ASCII tokenizer mode:
#   --tokenizer ascii_bpe
.venv/bin/python version/v7/scripts/train_data_pipeline_v7.py \
  --run "$TOKENIZE_RUN" \
  --init-if-missing \
  --init xavier_uniform \
  --template qwen3 \
  --curriculum-stage stage_a \
  --tokenizer ascii_bpe \
  --require-svg-rows \
  --strict-data-gates --min-valid-svg-rate 0.70 \
  --roundtrip-max-lines 2048 --roundtrip-sample-limit 16 \
  --data "$DATA_TXT" \
  --vocab-size 320 --bpe-vocab-size 320 \
  --layers 2 --embed-dim 64 --hidden-dim 128 \
  --epochs 1 --seq-len 8 --total-tokens 64 \
  --grad-accum 1 --lr 5e-4 --max-grad-norm 1.0 --seed 42 \
  --work-dir "$TOKENIZE_WORK" \
  --json-out "$TOKENIZE_WORK/pipeline_report.json" \
  --prepare-only

echo "[OK] tokenizer: $TOKENIZE_WORK/tokenizer.json"
echo "[OK] tokens:    $TOKENIZE_WORK/train_tokens.txt"
wc -l "$TOKENIZE_WORK/train_tokens.txt"
head -n 40 "$TOKENIZE_WORK/train_tokens.txt"

This is a preflight path. --prepare-only builds tokenizer + token stream without launching training.

Open IR visualizer (canonical run path):

.venv/bin/python version/v7/tools/open_ir_visualizer.py --generate --run "$TOKENIZE_RUN" --html-only
echo "[OK] Visualizer: $TOKENIZE_RUN/ir_report.html"
xdg-open "$TOKENIZE_RUN/ir_report.html" >/dev/null 2>&1 || true

In the report, check: Architecture -> Tokenizer (click tokenizer.json) and Training -> Pipeline + Data Lab (dataset folder path + tokenizer path + roundtrip samples).

Produces: tokenizer.json train_tokens.txt ir_report.html

Step 0.8: Encode/Decode Determinism Gate

Most accurate correctness check for ascii_bpe: encode the full dataset, decode it back, and compare to original bytes.

.venv/bin/python version/v7/scripts/test_ascii_bpe_roundtrip_v7.py \
  --run "$TOKENIZE_RUN" \
  --dataset "$DATA_TXT" \
  --require-ascii

Pass condition: ascii_bpe roundtrip PASS and matching byte counts.
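The shape of this gate, sketched with a toy byte-level codec standing in for the real ascii_bpe encoder/decoder (encode/decode here are placeholders, not the project API):

```python
# Minimal shape of the determinism gate: encode the full dataset,
# decode it back, and require byte-exact equality.
def encode(text: str) -> list[int]:
    return list(text.encode("ascii"))

def decode(ids: list[int]) -> str:
    return bytes(ids).decode("ascii")

dataset = "<svg><circle r='2'/></svg>\n<svg><rect/></svg>\n"
ids = encode(dataset)
roundtrip = decode(ids)
assert roundtrip == dataset, "roundtrip mismatch"
assert len(dataset.encode("ascii")) == len(bytes(ids)), "byte count mismatch"
print("ascii_bpe-style roundtrip PASS")
```

The real tokenizer applies BPE merges instead of the identity map, but the pass condition is the same: decode(encode(data)) must reproduce the original bytes exactly.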

Step 0.9: Data Lab Artifacts + Paths

Pipeline now writes Data Lab JSON artifacts directly under $RUN so the visualizer can show dataset folder path, tokenizer JSON path, and row-level roundtrip samples.

ls -lh "$TOKENIZE_RUN"/dataset_qc.json \
       "$TOKENIZE_RUN"/dataset_profile.json \
       "$TOKENIZE_RUN"/tokenizer_roundtrip.json || true

jq '{status, path, dataset_dir, checks, non_empty_lines}' "$TOKENIZE_RUN/dataset_qc.json"
jq '{dataset_dir, line_length, duplicates, top_chars: (.top_chars|length), top_tokens: (.top_tokens|length)}' "$TOKENIZE_RUN/dataset_profile.json"
jq '{status, exact_match, tokenizer_json_path, line_eval, sample_rows: (.sample_rows|length)}' "$TOKENIZE_RUN/tokenizer_roundtrip.json"

Visualizer path: switch to Training mode and open Data Lab. This tab shows: dataset_dir, dataset_path, tokenizer_json_path, and row → token-id → decoded previews.

Produces: dataset_qc.json dataset_profile.json tokenizer_roundtrip.json

Step 2: Optional Alternative Data Ingestion (true_bpe)

This section is an alternative ingest path. Step 0.5 already covers the canonical data-prep flow; if you completed Step 0.5, skip this section and go straight to Step 3.1.

JSONL -> text with jq

# Optional: set only if you want to replace bundled DATA_TXT.
SRC_JSONL=""
if [ -z "$SRC_JSONL" ]; then
  echo "[INFO] SRC_JSONL empty; keeping DATA_TXT as-is: $DATA_TXT"
elif [ ! -f "$SRC_JSONL" ]; then
  echo "[ERROR] Missing JSONL source: $SRC_JSONL"
  echo "Set SRC_JSONL to a real path. Example: SRC_JSONL=$HOME/data/train.jsonl"
else
  jq -r '.text // empty' "$SRC_JSONL" > "$DATA_TXT"
  echo "[OK] Wrote DATA_TXT from JSONL: $DATA_TXT"
fi

JSONL -> text with Python

# Optional: set only if you want to replace bundled DATA_TXT.
export SRC_JSONL=""
export DATA_TXT
python3 - <<'PY'
import json
import os

src = os.environ.get("SRC_JSONL", "").strip()
dst = os.environ["DATA_TXT"]
if not src:
    print(f"[INFO] SRC_JSONL empty; keeping DATA_TXT as-is: {dst}")
elif not os.path.isfile(src):
    print(f"[ERROR] Missing JSONL source: {src}")
else:
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue
            text = json.loads(line).get("text") or ""
            if text:
                fout.write(text + "\n")
    print(f"[OK] Wrote DATA_TXT from JSONL: {src} -> {dst}")
PY

Produces (if SRC_JSONL set): $DATA_TXT overwritten from JSONL text fields

Step 3.1: Mandatory CK vs PyTorch Parity Gate (10 epochs)

Required gate before long CK-only training: run two tiny canaries (dataset rows 1 and 2) for 10 epochs with --with-torch-ref. This catches generated-runtime or codegen corruption early.

cd "$ROOT"
PARITY_ROOT="$RUN/parity_canary"
mkdir -p "$PARITY_ROOT"

# Use a stable canary source for strict CK-vs-PyTorch parity gating.
# You can override with: export V7_PARITY_CANARY_DATA=/path/to/your_two_row_svg_file.txt
CANARY_SRC="${V7_PARITY_CANARY_DATA:-$ROOT/version/v7/data/svg_assets_train.txt}"
if [ ! -s "$CANARY_SRC" ]; then
  echo "[WARN] Missing default canary source: $CANARY_SRC"
  echo "[WARN] Falling back to DATA_TXT for canaries: $DATA_TXT"
  CANARY_SRC="$DATA_TXT"
fi
echo "[INFO] parity canary source: $CANARY_SRC"

sed -n '1p' "$CANARY_SRC" > "$PARITY_ROOT/svg_row1.txt"
sed -n '2p' "$CANARY_SRC" > "$PARITY_ROOT/svg_row2.txt"

if [ ! -s "$PARITY_ROOT/svg_row1.txt" ] || [ ! -s "$PARITY_ROOT/svg_row2.txt" ]; then
  echo "[FAIL] Need at least 2 non-empty rows in DATA_TXT for parity canaries."
  echo "[FAIL] Re-run Step 0A and Step 0.5/0.55/0.56, then retry Step 3.1."
  exit 1
else
  for idx in 1 2; do
    ROW_FILE="$PARITY_ROOT/svg_row${idx}.txt"
    ROW_RUN="$RUN/parity_svg_row${idx}"
    rm -rf "$ROW_RUN"
    .venv/bin/python version/v7/scripts/train_data_pipeline_v7.py \
      --run "$ROW_RUN" \
      --init-if-missing \
      --init xavier_uniform \
      --template qwen3 \
      --tokenizer ascii_bpe \
      --require-svg-rows \
      --strict-data-gates \
      --data "$ROW_FILE" \
      --vocab-size 2048 --bpe-vocab-size 2048 \
      --layers 4 --embed-dim 96 --hidden-dim 192 --num-heads 8 --num-kv-heads 4 \
      --epochs 10 --seq-len 64 --total-tokens 12288 \
      --grad-accum 1 --lr 3e-4 --max-grad-norm 1.0 --seed 42 \
      --train-driver ck_run \
      --with-torch-ref \
      --no-post-train-eval \
      --no-open-visualizer \
      --json-out "$ROW_RUN/parity_pipeline.json"
  done
fi

Step 3.1 is a numerical parity gate. It intentionally disables post-train SVG quality eval so generation-quality noise does not mask CK-vs-PyTorch parity status. With --no-post-train-eval, strict output-quality SVG gating is skipped by design for this canary stage. The canary command uses --seq-len 64 on purpose so single-row SVG probes do not trigger wrap-heavy false drift.

Gate check: compare CK vs PyTorch loss curves and enforce thresholds.

python3 - <<'PY'
import json
import os
from pathlib import Path
from statistics import mean

# TH = threshold
TH_MAX = 1e-4
TH_MEAN = 5e-5
TH_PARAM = 1e-4
run_env = os.environ.get("RUN", "").strip()
if not run_env:
    print("[FAIL] RUN env var is empty. Example: export RUN=$HOME/.cache/ck-engine-v7/models/train/my_run")
    raise SystemExit(1)
ROOT = Path(run_env)
ok = True

for idx in (1, 2):
    run_dir = ROOT / f"parity_svg_row{idx}" / ".ck_pipeline"
    work_dirs = sorted([p for p in run_dir.glob("ascii_bpe_*") if p.is_dir()])
    if not work_dirs:
        print(f"[FAIL] row{idx}: missing {run_dir}/ascii_bpe_*")
        ok = False
        continue
    w = work_dirs[-1]
    ck = json.loads((w / "train_ck.json").read_text())
    pt = json.loads((w / "train_torch_ref.json").read_text())
    c = [float(x["loss_ck"]) for x in ck.get("loss_curve", [])]
    t = [float(x["loss"]) for x in pt.get("loss_curve", [])]
    n = min(len(c), len(t))
    if n == 0:
        print(f"[FAIL] row{idx}: empty loss curves")
        ok = False
        continue
    diffs = [abs(c[i] - t[i]) for i in range(n)]
    max_abs = max(diffs)
    mean_abs = mean(diffs)
    final_param = float(ck.get("final_param_max_abs_diff", 1.0))
    passed = max_abs <= TH_MAX and mean_abs <= TH_MEAN and final_param <= TH_PARAM
    print(f"[row{idx}] max_abs={max_abs:.6e} mean_abs={mean_abs:.6e} final_param={final_param:.6e} pass={passed}")
    ok = ok and passed

print("PARITY_GATE=PASS" if ok else "PARITY_GATE=FAIL")
PY

Go/No-go: Continue CK-only training only after PARITY_GATE=PASS. If it fails, keep --with-torch-ref enabled and debug before scaling model size, epochs, or data.

Canonical operator path: use $RUN/ir_report.html as the single dashboard. Parity canaries run under $RUN/parity_svg_row1 and $RUN/parity_svg_row2 only for isolated diagnostics; Step 3.2 writes aggregated parity artifacts back into $RUN.

Canary visibility: regenerate $RUN/ir_report.html and open Training -> Train Dashboard / Training -> CK vs PyTorch to see a canary summary (row1/row2 pass/fail) without opening sub-run reports.

Produces: parity_svg_row1/parity_pipeline.json parity_svg_row2/parity_pipeline.json PARITY_GATE=PASS required

Step 3.2: Automated Training Parity Regimen (Recommended)

Run the full staged regimen (first/second-step parity, backend xray attribution, grad-accum sweeps, multi-epoch stability, replay/stitch checks) and save one operator table artifact.

cd "$ROOT"
.venv/bin/python version/v7/scripts/run_training_parity_regimen_v7.py \
  --run-dir "$RUN"

Outputs: $RUN/training_parity_regimen_latest.json, $RUN/training_parity_regimen_latest.md, $RUN/regimen_backend_xray.json, and $RUN/training_parity_regimen_logs/.

Operator gate mapping: this step supplies the D1/E1/F1 pass evidence used in the checklist GO criteria.

Backend xray adds per-backend first-step deltas (rmsnorm/swiglu/loss swap table), suspected divergence source, and reference forward/backward formulas so operators can inspect exact math paths without scrolling large logs.

Auto-skip behavior: if previous regimen passed and runtime/codegen fingerprint is unchanged, this script exits quickly with a reused PASS state. Use --force to rerun anyway.
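The fingerprint idea behind auto-skip can be sketched as a content hash over the runtime/codegen files (fingerprint and the demo paths are illustrative; the regimen script's actual fingerprint inputs are not specified here):

```python
import hashlib
import os
import tempfile

# Illustrative fingerprint: hash file paths + contents in a stable order.
# An unchanged digest means runtime/codegen inputs did not change, so a
# previous PASS can be reused.
def fingerprint(paths):
    h = hashlib.sha256()
    for path in sorted(paths):
        h.update(path.encode())
        with open(path, "rb") as f:
            h.update(f.read())
    return h.hexdigest()

# Demo: the digest changes as soon as any input file changes.
d = tempfile.mkdtemp()
p = os.path.join(d, "runtime.c")
with open(p, "w") as f:
    f.write("int main(void){return 0;}")
before = fingerprint([p])
with open(p, "a") as f:
    f.write("/* edit */")
after = fingerprint([p])
print(before != after)  # True
```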

ck_run operator options: training commands can now suggest, run, or require this regimen.

# Non-blocking recommendation only (default)
python3 version/v7/scripts/ck_run_v7.py train --run "$RUN" --parity-regimen suggest

# Run regimen after train, but continue even if it fails
python3 version/v7/scripts/ck_run_v7.py train --run "$RUN" --parity-regimen run

# Run regimen after train and fail command if regimen fails
python3 version/v7/scripts/ck_run_v7.py train --run "$RUN" --parity-regimen require

Use --parity-regimen off only for deliberate fast-debug loops.

Produces: training_parity_regimen_latest.json training_parity_regimen_latest.md regimen_backend_xray.json training_parity_regimen_logs/

Step 3.3: Nightly Core Stabilization Matrix (1-4 Layers)

Run one nightly matrix that combines strict tokenizer gates and parity gates:

cd "$ROOT"
make v7-stabilization-nightly \
  V7_STABILIZATION_RUN_ROOT="$RUN/nightly_stabilization" \
  V7_STABILIZATION_DATA="$DATA_TXT"

Optional main-run validation: add V7_STABILIZATION_MAIN_RUN_DIR="$RUN" to include your current run-dir weights as an extra parity case.

Outputs: training_stabilization_scorecard_latest.json, training_stabilization_scorecard_latest.md, and append-only training_stabilization_history.jsonl.

Tune matrix size with: V7_STABILIZATION_LAYERS, V7_STABILIZATION_TOKEN_BUDGETS, V7_STABILIZATION_GRAD_ACCUM_SWEEP.

Produces: training_stabilization_scorecard_latest.json training_stabilization_scorecard_latest.md training_stabilization_history.jsonl

Step 3.7: Run Full Training Pipeline (After Parity Gates)

Execution order: run this section only after Step 3.1 and Step 3.2 pass.

Default Step 3.7 below is a functional end-to-end run (parity-first, quality gate non-blocking) so operators can complete the pipeline in one pass. For strict output-quality enforcement (valid_svg_rate gate), use the strict command in Step 9 after data scaling/SFT expansion.

Stage Pass Criteria (SVG Specialist Model)

Before advancing to the next training stage, verify these minimum bars. Check manually on 5–10 holdout samples using the tag sequences below.

Stage | Pass condition | Quick check command

Stage A (primitives)
  Loss converges. [circle][palette:cool] → valid SVG containing <circle. Output parses without error.
  echo "[circle][palette:cool][style:minimal]" | ./build/ck-cli-v7 --model $CKPT

Stage B (charts/infographic)
  [bar-chart][bars:5][ascending] → bars are actually ascending in output. [palette:warm] → output uses warm-family hex colors, not random CSS names.
  echo "[bar-chart][bars:5][ascending][palette:warm][axes]" | ./build/ck-cli-v7 --model $CKPT

SFT (enriched)
  [infographic][palette:dark][labeled][complexity:moderate] → labeled dashboard with dark background. Multiple elements visible and in-bounds.
  echo "[infographic][palette:dark][style:gradient][labeled][complexity:moderate]" | ./build/ck-cli-v7 --model $CKPT

If Stage A fails the circle test, check that [circle] merged to a single BPE token (Step 0B). If Stage B bars are not ascending, the bar_chart generator's ascending constraint may not be wired — check spec_catalog_v1.json constraint field vs generator logic.
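A quick way to verify the merge, assuming an HF-style tokenizer.json with a model.vocab map from piece string to id (the file layout is an assumption; adjust the key path if the run's tokenizer.json differs):

```python
import json

# Hypothetical check: if the control tag appears as a single vocab piece,
# it merged to one BPE token; otherwise encoding will split it.
def tag_is_single_token(tokenizer_json_path: str, tag: str) -> bool:
    with open(tokenizer_json_path, encoding="utf-8") as f:
        vocab = json.load(f)["model"]["vocab"]
    return tag in vocab

# Demo with an in-memory stand-in instead of a real tokenizer.json:
vocab = {"[circle]": 300, "<": 60, "c": 99}
print("[circle]" in vocab)  # True
print("[square]" in vocab)  # False
```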

train_data_pipeline_v7.py — What Happens Inside

1. Init Model: --init-if-missing, xavier_uniform init, qwen3 template dims → weights.bump
2. BPE Train: ck-bpe-train (C); corpus → byte pairs, merge loop × vocab_size → tokenizer.json + bins
3. Tokenize Data: encode full dataset, apply merge rules, GPT-2 pretokenizer → train_tokens.txt
4. CK Train: C runtime training, seq_len windows, AdamW + grad clip → checkpoints/
5. Report: pipeline_report.json (loss stats + SHA256), IR visualizer → ir_report.html

Data flow: svg_train.txt → tokenizer.json → train_tokens.txt → ck_train (seq_len windows) → checkpoints/step_N.bin → report

Key flags: --tokenizer ascii_bpe | --vocab-size 320 | --seq-len 32 | --total-tokens 841472 | --lr 5e-4 | --grad-accum 1

Same pipeline, two train executors. Pick one:

C Variant (ck-cli)

Native training loop in build/ck-cli-v7 with live step/epoch loss + perplexity logs.

cd "$ROOT"
.venv/bin/python version/v7/scripts/train_data_pipeline_v7.py \
  --run "$RUN" \
  --init-if-missing \
  --init xavier_uniform \
  --template qwen3 \
  --curriculum-stage stage_a \
  --tokenizer ascii_bpe \
  --require-svg-rows \
  --roundtrip-max-lines 2048 --roundtrip-sample-limit 16 \
  --data "$DATA_TXT" \
  --vocab-size 320 --bpe-vocab-size 320 \
  --layers 24 --embed-dim 64 --hidden-dim 128 \
  --epochs 1 --seq-len 32 --total-tokens 841472 \
  --grad-accum 1 --lr 5e-4 --max-grad-norm 1.0 --seed 42 \
  --train-driver ck_cli --ck-cli-log-every 200 \
  --json-out version/v7/reports/v7_svg_assets_bpe_pipeline_l24_full_e1_latest.json

Python Variant (ck_run)

Python harness path (ck_run_v7.py train --backend ck) with parity/debug flexibility.

cd "$ROOT"
.venv/bin/python version/v7/scripts/train_data_pipeline_v7.py \
  --run "$RUN" \
  --init-if-missing \
  --init xavier_uniform \
  --template qwen3 \
  --curriculum-stage stage_a \
  --tokenizer ascii_bpe \
  --require-svg-rows \
  --roundtrip-max-lines 2048 --roundtrip-sample-limit 16 \
  --data "$DATA_TXT" \
  --vocab-size 320 --bpe-vocab-size 320 \
  --layers 24 --embed-dim 64 --hidden-dim 128 \
  --epochs 1 --seq-len 32 --total-tokens 841472 \
  --grad-accum 1 --lr 5e-4 --max-grad-norm 1.0 --seed 42 \
  --train-driver ck_run \
  --json-out version/v7/reports/v7_svg_assets_bpe_pipeline_l24_full_e1_latest.json
Produces: weights.bump weights_manifest.json checkpoints/ pipeline_report.json ir_report.html

Step 4: Artifact Locations + Data Format

Artifact Map — Where Everything Lives (all paths relative to $RUN, your --run dir):

  tokenizer.json     HF-compatible vocab + merges; source of truth for inference
  tokenizer_bin/     vocab_offsets.bin, vocab_strings.bin, merges.bin
  weights.bump       model weights (promoted), with weights_manifest.json
  .ck_pipeline/      bpe_<ts>/tokenizer.json, bpe_<ts>/train_tokens.txt (pipeline working dir)
  .ck_build/         compiled IR + runtime, used by ck_chat.py
  checkpoints/       step_N.bin (promote → weights.bump)
Artifact | Path pattern | Purpose
Training BPE JSON $RUN/.ck_pipeline/bpe_<ts>/tokenizer.json Tokenizer produced by BPE training pass.
Training BPE binary $RUN/.ck_pipeline/bpe_<ts>/bpe_bin/* Binary artifacts consumed by CK true_bpe runtime.
Canonical run tokenizer $RUN/tokenizer.json + $RUN/tokenizer_bin/* Single source of truth copied into run dir for infer reuse.
Token stream file $RUN/.ck_pipeline/bpe_<ts>/train_tokens.txt One integer token id per line. Used by --train-token-file.
Dataset QC artifact $RUN/dataset_qc.json ASCII/SVG gate status, row/byte counts, dataset path metadata.
Dataset profile artifact $RUN/dataset_profile.json Length histogram, duplicate rows, top chars/tokens.
Tokenizer roundtrip artifact $RUN/tokenizer_roundtrip.json Encode/decode exactness + per-line sample browser rows.
Post-train eval artifact $RUN/post_train_eval.json valid SVG rate, closure success, repetition/loop score. If strict gate fails here, treat it as data/task quality (not CK-vs-PyTorch parity math); increase corpus coverage and add instruction-to-SVG SFT pairs.
Train report .../train_ck.json + your --json-out file Loss stats + pipeline metadata.
PyTorch parity report .../train_torch_ref.json + $RUN/training_parity.json Reference loss curve + stepwise parity diffs for CK-vs-PyTorch training validation.
How training consumes data
Training is next-token LM over the token stream in seq_len windows; it is not "one JSONL row = one sample" at runtime.
Key artifacts: checkpoints/step_N.bin weights.bump weights_manifest.json dataset_qc.json training_pipeline_latest.json training_parity.json
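The seq_len windowing described above can be sketched as follows (illustrative only; the C runtime's exact batching may differ):

```python
# Sketch: a flat token stream becomes next-token training windows.
# Each window of seq_len inputs predicts the stream shifted by one.
def windows(token_ids, seq_len):
    pairs = []
    for start in range(0, len(token_ids) - seq_len, seq_len):
        x = token_ids[start : start + seq_len]
        y = token_ids[start + 1 : start + seq_len + 1]
        pairs.append((x, y))
    return pairs

stream = list(range(10))  # stand-in for train_tokens.txt ids
for x, y in windows(stream, 4):
    print(x, y)
# [0, 1, 2, 3] predicts [1, 2, 3, 4]; [4, 5, 6, 7] predicts [5, 6, 7, 8]
```

This is why row boundaries in the source JSONL/text do not map one-to-one to training samples: the stream is chunked purely by seq_len.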

Step 5: Promote Latest Checkpoint to Inference Weights

Training writes runtime checkpoints under $RUN/checkpoints, while inference consumes $RUN/weights.bump + $RUN/weights_manifest.json. Promote by stage/pass for deterministic analysis.

# Inspect stage history + pass numbers first
python3 version/v7/scripts/promote_latest_checkpoint_v7.py --run "$RUN" --list-runs

# Promote latest checkpoint (legacy default: max step)
python3 version/v7/scripts/promote_latest_checkpoint_v7.py --run "$RUN"

# Optional: preview selected checkpoint only
python3 version/v7/scripts/promote_latest_checkpoint_v7.py --run "$RUN" --dry-run

# Optional: promote a specific step
python3 version/v7/scripts/promote_latest_checkpoint_v7.py --run "$RUN" --step 1234

# Preferred: promote latest available pass for a stage (scales to any model)
python3 version/v7/scripts/promote_latest_checkpoint_v7.py --run "$RUN" --stage pretrain
python3 version/v7/scripts/promote_latest_checkpoint_v7.py --run "$RUN" --stage midtrain
python3 version/v7/scripts/promote_latest_checkpoint_v7.py --run "$RUN" --stage sft

# Optional: exact stage pass (when multiple passes exist)
python3 version/v7/scripts/promote_latest_checkpoint_v7.py --run "$RUN" --stage sft --stage-pass 2

# Optional: promote exact run_id
python3 version/v7/scripts/promote_latest_checkpoint_v7.py --run "$RUN" --run-id ascii_bpe_YYYYmmdd_HHMMSS

Promotion is atomic and updates weights.bump + weights_manifest.json in $RUN. Use --stage/--stage-pass for apples-to-apples stage comparisons.

Promotes: weights.bump weights_manifest.json
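Atomic promotion typically follows the write-temp-then-rename pattern; a minimal sketch of that idea (promote and the manifest fields are illustrative, not the actual promote_latest_checkpoint_v7.py logic):

```python
import json
import os
import shutil
import tempfile

# Sketch of atomic promotion: copy checkpoint bytes to a temp file in the
# destination directory, then os.replace() so readers never observe a torn
# weights.bump. os.replace is atomic on POSIX within one filesystem.
def promote(checkpoint_path: str, run_dir: str, step: int) -> None:
    fd, tmp = tempfile.mkstemp(dir=run_dir, prefix=".weights.")
    os.close(fd)
    shutil.copyfile(checkpoint_path, tmp)
    os.replace(tmp, os.path.join(run_dir, "weights.bump"))

    # Manifest fields here are illustrative, not the real schema.
    manifest = {"promoted_step": step, "source": checkpoint_path}
    mfd, mtmp = tempfile.mkstemp(dir=run_dir, prefix=".manifest.")
    with os.fdopen(mfd, "w") as f:
        json.dump(manifest, f)
    os.replace(mtmp, os.path.join(run_dir, "weights_manifest.json"))
```

Writing the temp file in the same directory as the destination is what keeps the rename atomic; a temp file on another filesystem would force a non-atomic copy.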

Step 5.1: Stage Completion Test Loop (Promote -> Infer -> Refresh)

Run this loop after each completed stage to validate behavior before continuing.

# 0) List runs once (find stage_pass values)
python3 version/v7/scripts/promote_latest_checkpoint_v7.py --run "$RUN" --list-runs

# 1) Promote target stage checkpoint (latest pass for that stage)
python3 version/v7/scripts/promote_latest_checkpoint_v7.py --run "$RUN" --stage pretrain
# python3 version/v7/scripts/promote_latest_checkpoint_v7.py --run "$RUN" --stage midtrain
# python3 version/v7/scripts/promote_latest_checkpoint_v7.py --run "$RUN" --stage sft

# 2) Build inference runtime from promoted weights
.venv/bin/python version/v7/scripts/ck_run_v7.py run "$RUN" --generate-only --context-len 512

# 3) Probe behavior for this stage
.venv/bin/python scripts/ck_chat.py \
  --model-dir "$RUN/.ck_build" \
  --python-tokenizer --chat-template none \
  --prompt "[circle][palette:cool][style:minimal]<svg" \
  --max-tokens 96 --temperature 0 --top-p 1.0 --repeat-penalty 1.05 --repeat-last-n 256 \
  --stop-at-eos

# 4) Refresh visualizer to capture this state
python3 version/v7/tools/open_ir_visualizer.py --generate --run "$RUN" --html-only --strict-run-artifacts

Recommended probes by stage: pretrain: shape syntax + closure; midtrain: chart/layout structure; sft: strict tag-following + SVG validity.

Step 6: Build Inference from Trained Run

Generate runtime and chat from the same run

.venv/bin/python version/v7/scripts/ck_run_v7.py run "$RUN" --generate-only --context-len 128

.venv/bin/python scripts/ck_chat.py \
  --model-dir "$RUN/.ck_build" \
  --python-tokenizer --chat-template none \
  --show-token-ids --prompt "<svg" --max-tokens 80 --temperature 0.0

Use $RUN/.ck_build for chat so model + tokenizer artifacts come from the same run.

Produces: .ck_build/ runtime binary

Step 7: Open IR Visualizer

Generate a fresh visualizer report for this run and open it locally.

.venv/bin/python version/v7/tools/open_ir_visualizer.py --generate --run "$RUN" --html-only
echo "[OK] Visualizer: $RUN/ir_report.html"

# Optional: open in browser
xdg-open "$RUN/ir_report.html" >/dev/null 2>&1 || true

The training pipeline already generates this by default; this command regenerates it on demand.

Regression gate: make visualizer validates run-dir path wiring + decode/profile artifact loading. make visualizer-full also validates train-runtime ASan artifact loading via a tiny fixture run.

Runbook E2E gate: make v7-runbook-e2e executes training-path Steps 1-7 (smoke mode) and fails fast on path/gate/artifact drift.

Regression memory: the visualizer can load version/v7/reports/REGRESSION_LEDGER.json to show known root causes, detection gates, and proof artifacts directly in Training run-gates.

Checkpoint | Run path | What to inspect

After Step 0.7 tokenization preview ($RUN): Architecture -> Tokenizer, Training -> Data Lab (dataset folder + tokenizer path + roundtrip samples + Tokenizer Internals merges/pieces)
After Step 3.7 training ($RUN): Training -> Data Lab + Train Dashboard (valid SVG/closure/loop metrics)
After Step 6 inference build ($RUN): kernel graph, runtime files, and artifact links

Step 7.1: Live Training Monitor

Watch loss, grad norm, step count, and tok/s update in real-time while training is running — no need to regenerate the report or restart the training process. Open a second terminal alongside your training run and pick the path that suits your setup.

Path A — Zero-config (python3 -m http.server)

Works with any static HTTP server. The visualizer auto-detects it is served over HTTP (not file://) and starts polling the JSON artifact files that live next to ir_report.html. No flags, no server code — just stdlib.

# In a second terminal while training is running:
python3 -m http.server 7700 -d "$RUN"

# Open in browser:
xdg-open http://localhost:7700/ir_report.html 2>/dev/null || \
  echo "Open: http://localhost:7700/ir_report.html"

$RUN is the training run directory (e.g. ~/.cache/ck-engine-v7/models/train/svg_l16_d128_h512_v1024_ctx512). The ir_report.html must already exist in that directory (generated in Step 7 or by the training pipeline's --generate-visualizer flag).

Path B — Custom live server (single snapshot endpoint)

Generates a fresh report and starts a dedicated HTTP server that exposes a /api/snapshot endpoint — one request per poll cycle instead of N file fetches. Injects window.CK_LIVE_MODE automatically so the visualizer uses the snapshot path.

# Generates report, starts server, opens browser:
.venv/bin/python version/v7/tools/open_ir_visualizer.py \
  --run "$RUN" \
  --html-only \
  --live \
  --live-port 7700 \
  --live-interval 4000

# Press Ctrl+C to stop the server when done.

--live-interval is in milliseconds (default 5000 = 5 s). Reduce to 2000–3000 for faster feedback during early training.

What the live bar shows

A fixed status bar appears at the bottom of every page while live mode is active:

LIVE ↺ step 42 · loss 3.2401 · 92.1 tok/s · just now file poll · 5s stop
Field | Source file | What to watch for

step N (training_checkpoint_policy_latest.json): steps should increment steadily; gaps indicate stalls or restarts.
loss X.XXXX (training_loss_curve_latest.json): a healthy curve trends down. Spikes (>2× recent baseline) may signal LR or data issues.
X tok/s (training_step_profile_latest.json): a drop in tok/s without a loss spike suggests hardware throttle or memory pressure.
↺ prefix: appears when any artifact changed this cycle; absent = no new steps yet.

Which files are polled: training_loss_curve_latest.json, training_grad_norms_latest.json, training_parity_latest.json, training_step_profile_latest.json, training_checkpoint_policy_latest.json. All are <5 KB and written atomically after each step by the training harness. The active training tab (Train Dashboard, Grad Health, Parity, etc.) re-renders automatically on each change.
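The poll cycle can be sketched as an mtime check per artifact (poll_once is illustrative; the visualizer's actual polling runs in browser JavaScript, but the logic is the same):

```python
import json
import os
import tempfile

# Illustrative poll cycle: re-read an artifact only when its mtime changed
# since the last cycle; a non-empty result maps to the bar's change prefix.
POLLED = ["training_loss_curve_latest.json",
          "training_checkpoint_policy_latest.json"]

def poll_once(run_dir, last_mtimes):
    changed = {}
    for name in POLLED:
        path = os.path.join(run_dir, name)
        if not os.path.exists(path):
            continue
        mtime = os.path.getmtime(path)
        if mtime != last_mtimes.get(name):
            last_mtimes[name] = mtime
            with open(path, encoding="utf-8") as f:
                changed[name] = json.load(f)
    return changed

# Demo against a temp run dir with one artifact present:
run_dir = tempfile.mkdtemp()
with open(os.path.join(run_dir, POLLED[0]), "w") as f:
    json.dump({"loss_curve": []}, f)
mtimes = {}
first = poll_once(run_dir, mtimes)   # sees the new file
second = poll_once(run_dir, mtimes)  # nothing changed since
```

Because the harness writes these files atomically, a poll never reads a half-written JSON document.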

Stage detection: the active curriculum stage (active_stage, curriculum_stage) is embedded from the last report generation. It updates automatically if you regenerate the report mid-run: python3 version/v7/tools/open_ir_visualizer.py --run "$RUN" --html-only (no server restart needed for Path A — just refresh the browser after regeneration).

ir_report.html (live via HTTP)

Step 8: Inference from GGUF (HF URLs)

Support boundary
The GGUF runtime path is documented and supported on Linux. macOS and Windows/WSL may work for local smoke tests, but parity and profiling workflows remain Linux-first and are not the supported operator baseline.

Step 8A: Plain Runtime Path

Use this when you only want the compiled runtime and local chat loop.

make v7-init
make v7-demo-runtime \
  V7_MODEL=hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf

Scope: runtime only. This path does not promise a complete operator artifact set.

Step 8B: Full Artifact Capture Path (Non-privileged)

Use this when you want the run directory, HTML report, and parent Run Hub in one pass.

make v7-init
make v7-capture-artifacts \
  V7_MODEL=hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf

Scope: runtime + report + hub. Deep profiling artifacts (perf, VTune, Advisor, flamegraphs) remain separate Linux-only steps and may require elevated privileges or host tooling.

Step 8C: Full Profiling Dashboard Path

Use this on Linux when you want runtime, report, hub, and profiling artifacts refreshed together.

make v7-init
make v7-profile-dashboard \
  V7_MODEL=hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf

Scope: runtime + report + hub + profiling refresh. Profiling subtargets rely on Linux host tools; missing tools may cause portions of the capture to skip.

Use this path when you want inference-only runs without training a local model.

These raw CLI examples use the shell wrapper so a first-time user can be prompted to create .venv and install requirements-v7.txt if the repo-local environment is missing.

Gemma 3 270M

version/v7/scripts/cks-v7-run run \
  hf://unsloth/gemma-3-270m-it-GGUF/gemma-3-270m-it-Q5_K_M.gguf \
  --context-len 1024 --force-compile --force-convert --chat-template=none \
  --generate-visualizer

Qwen2 0.5B Instruct

version/v7/scripts/cks-v7-run run \
  hf://Qwen/Qwen2-0.5B-Instruct-GGUF/qwen2-0_5b-instruct-q4_k_m.gguf \
  --context-len 1024 --force-compile --force-convert \
  --generate-visualizer

Qwen3 0.6B

version/v7/scripts/cks-v7-run run \
  hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf \
  --context-len 1024 --force-compile --force-convert \
  --generate-visualizer

Qwen3.5 0.8B

python3 version/v7/scripts/ck_run_v7.py run \
  hf://unsloth/Qwen3.5-0.8B-GGUF/Qwen3.5-0.8B-Q4_K_M.gguf \
  --force-convert --force-compile \
  --context-len 1024

The canonical v7 bring-up path is the hf://... URI. That materializes the run under ${CK_CACHE_DIR:-$HOME/.cache/ck-engine-v7/models}/unsloth--Qwen3.5-0.8B-GGUF. Local GGUF paths are still supported when you intentionally want an offline or copied artifact.

Nanbeige / Llama-family template symptom
If the first reply echoes <|im_start|>assistant or starts with <think>, the prompt is being fed with the wrong chat wrapper or stop markers. For Nanbeige, keep the default --chat-template auto, prefer --python-tokenizer on first bring-up, and do not force --chat-template none unless you are testing raw logits on purpose.

Nanbeige 4.1 3B

version/v7/scripts/cks-v7-run run \
  hf://mradermacher/Nanbeige4.1-3B-GGUF/Nanbeige4.1-3B.Q4_K_M.gguf \
  --context-len 1024 --force-compile --force-convert \
  --chat-template auto \
  --generate-visualizer

Current scope: documented inference/parity bring-up plus the supported v7 training families qwen2, qwen3, qwen35, gemma3, and nanbeige.

Step 8.1: Inference + IR Visualizer for GGUF Runs

Inference runs can come from either hf://... URIs or local GGUF paths. HF-backed runs are stored under ${CK_CACHE_DIR:-$HOME/.cache/ck-engine-v7/models}/<org--repo>. If you used --generate-visualizer in Step 8, ir_report.html is already in the same folder. Otherwise generate visualizer reports directly from those run directories:

CACHE_MODELS="${CK_CACHE_DIR:-$HOME/.cache/ck-engine-v7/models}"
GEMMA_RUN="$CACHE_MODELS/unsloth--gemma-3-270m-it-GGUF"
QWEN2_RUN="$CACHE_MODELS/Qwen--Qwen2-0.5B-Instruct-GGUF"
QWEN3_RUN="$CACHE_MODELS/Qwen--Qwen3-0.6B-GGUF"
QWEN35_RUN="$CACHE_MODELS/unsloth--Qwen3.5-0.8B-GGUF"
NANBEIGE_RUN="$CACHE_MODELS/mradermacher--Nanbeige4.1-3B-GGUF"

for RUN_DIR in "$GEMMA_RUN" "$QWEN2_RUN" "$QWEN3_RUN" "$QWEN35_RUN" "$NANBEIGE_RUN"; do
  if [ ! -d "$RUN_DIR" ]; then
    echo "[Missing] $RUN_DIR (run Step 8 command first)"
    continue
  fi
  .venv/bin/python version/v7/tools/open_ir_visualizer.py --generate --run "$RUN_DIR" --html-only
  echo "[OK] $RUN_DIR/ir_report.html"
done
# Optional: open generated reports
CACHE_MODELS="${CK_CACHE_DIR:-$HOME/.cache/ck-engine-v7/models}"
xdg-open "$CACHE_MODELS/unsloth--gemma-3-270m-it-GGUF/ir_report.html" >/dev/null 2>&1 || true
xdg-open "$CACHE_MODELS/Qwen--Qwen2-0.5B-Instruct-GGUF/ir_report.html" >/dev/null 2>&1 || true
xdg-open "$CACHE_MODELS/Qwen--Qwen3-0.6B-GGUF/ir_report.html" >/dev/null 2>&1 || true
xdg-open "$CACHE_MODELS/unsloth--Qwen3.5-0.8B-GGUF/ir_report.html" >/dev/null 2>&1 || true
xdg-open "$CACHE_MODELS/mradermacher--Nanbeige4.1-3B-GGUF/ir_report.html" >/dev/null 2>&1 || true

Report files are written to each run root as ir_report.html (not inside .ck_build).

Step 9: Quality Improvement Phases (SVG)

Quality Improvement Roadmap — Small Dataset, Big Results

1. Make It Work. Goal: get a baseline running (roundtrip passes, loss drops steadily, some valid SVG output). Config: vocab 320 | seq 32 | layers 24 | lr 5e-4 | 1 epoch | grad_accum 1. If it fails: fix data, not hyperparams. Typical loss: 4.2 → 2.8.
2. Make It Better. Goal: improve structure (higher valid-SVG rate, fewer broken tails, better tag closure). Config changes: vocab 640 | seq 96 | 4 epochs | lr 2e-4 | grad_accum 4. Key: bigger vocab + longer context. Typical loss: 2.8 → 1.5.
3. Make It Generalize. Goal: not just memorize (clean holdout outputs, reduced exact-copy rate, novel valid SVGs). Strategy: 90/10 train/holdout split, more data + diversity, ablation matrix sweep (run_ablation_matrix_v7.py). Typical loss: 1.5 → 0.8.

Use this phased path for small custom datasets. The goal is not "ChatGPT quality"; the goal is stable, valid SVG generation that improves step-by-step.

Phase | Goal | Pass criteria

Phase 1: get baseline working. Roundtrip passes, loss drops, and at least some valid SVG samples are generated.
Phase 2: improve structure stability. Higher valid-SVG rate and fewer broken tails in longer generations.
Phase 3: generalize (not memorize only). Cleaner outputs on holdout prompts/data and reduced exact-copy rate.
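A minimal sketch of a valid-SVG-rate metric of the kind post_train_eval.json reports (valid_svg_rate is illustrative; the real gate may apply stricter checks than XML well-formedness):

```python
import xml.etree.ElementTree as ET

# Illustrative metric: a generated sample counts as valid if it parses as
# XML and its root element is an svg tag.
def valid_svg_rate(samples):
    ok = 0
    for s in samples:
        try:
            root = ET.fromstring(s)
            if root.tag.endswith("svg"):
                ok += 1
        except ET.ParseError:
            pass
    return ok / max(1, len(samples))

samples = ['<svg><circle r="1"/></svg>',  # valid
           '<svg><rect']                   # broken tail
print(valid_svg_rate(samples))  # 0.5
```

Broken tails (Phase 2's target) show up directly as ParseError cases, so this single number tracks both phases' pass criteria.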

Phase 1: Make It Work (Baseline)

cd "$ROOT"
.venv/bin/python version/v7/scripts/train_data_pipeline_v7.py \
  --run "$RUN" \
  --init-if-missing \
  --init xavier_uniform \
  --template qwen3 \
  --curriculum-stage stage_a \
  --tokenizer ascii_bpe \
  --require-svg-rows \
  --strict-data-gates --min-valid-svg-rate 0.70 \
  --roundtrip-max-lines 2048 --roundtrip-sample-limit 16 \
  --data "$DATA_TXT" \
  --vocab-size 320 --bpe-vocab-size 320 \
  --layers 24 --embed-dim 64 --hidden-dim 128 \
  --epochs 1 --seq-len 32 --total-tokens 841472 \
  --grad-accum 1 --lr 5e-4 --max-grad-norm 1.0 --seed 42 \
  --train-driver ck_cli --ck-cli-log-every 200 \
  --json-out version/v7/reports/v7_svg_assets_bpe_pipeline_l24_full_e1_latest.json

.venv/bin/python version/v7/scripts/test_ascii_bpe_roundtrip_v7.py \
  --run "$RUN" \
  --dataset "$DATA_TXT" \
  --require-ascii

If this phase fails, do not tune hyperparameters yet. First fix data cleanliness and tokenizer roundtrip.
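Before touching hyperparameters, it helps to quantify how clean the corpus actually is. The probe below is a sketch: its structural checks are illustrative heuristics, not the pipeline's real data gates, though the 0.70 threshold mirrors --min-valid-svg-rate above.

```python
# Data-cleanliness probe (sketch). The checks are illustrative heuristics,
# not the pipeline's real gates; 0.70 mirrors --min-valid-svg-rate above.
def svg_row_stats(lines):
    """Return (ascii_rate, valid_svg_rate) over non-empty rows."""
    rows = [ln for ln in lines if ln.strip()]
    if not rows:
        return 0.0, 0.0
    ascii_ok = sum(1 for ln in rows if ln.isascii())
    # Cheap structural check: row opens an <svg ...> element and closes it.
    svg_ok = sum(1 for ln in rows
                 if ln.lstrip().startswith("<svg")
                 and ln.rstrip().endswith("</svg>"))
    return ascii_ok / len(rows), svg_ok / len(rows)

a, v = svg_row_stats(["<svg width='8'></svg>", "not-an-svg-row", ""])
print(f"ascii_rate={a:.2f} valid_svg_rate={v:.2f} gate_pass={v >= 0.70}")
```

If the valid-SVG rate is low here, expect the pipeline's strict data gates to fail for the same reason.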

Phase 2: Expand Data + Continue From Latest Checkpoint (same RUN)

If Phase 1 converges but outputs are still brittle, expand corpus coverage and continue training from your latest checkpoint in the same run directory.

# 1) Expand dataset (Stage A data + richer docs SVG corpus), keep ASCII + one-row-per-svg.
STAGE_B_SOURCE="$DATA_DIR/svg_assets_docs_ascii.txt"
RESUME_DATA="$RUN/svg_resume_stageb_mix.txt"

if [ ! -s "$STAGE_B_SOURCE" ]; then
  echo "[ERROR] Missing $STAGE_B_SOURCE"
  echo "Run Step 0.55 first to build the larger docs corpus."
else
  { cat "$DATA_TXT"; cat "$STAGE_B_SOURCE"; } \
    | sed '/^[[:space:]]*$/d' \
    | awk 'length($0) <= 4096' \
    | sort -u > "$RESUME_DATA"

  python3 version/v7/scripts/prepare_ascii_dataset_v7.py \
    --input "$RESUME_DATA" \
    --output "$RESUME_DATA" \
    --input-format text \
    --ascii-map-common \
    --ascii-mode xml_escape \
    --svg-only

  DATA_TXT="$RESUME_DATA"
fi

# 2) Continue from latest checkpoint in SAME run (no re-init).
.venv/bin/python version/v7/scripts/train_data_pipeline_v7.py \
  --run "$RUN" \
  --resume-latest-checkpoint \
  --template qwen3 \
  --curriculum-stage stage_b \
  --tokenizer ascii_bpe \
  --require-svg-rows \
  --strict-data-gates \
  --roundtrip-max-lines 2048 --roundtrip-sample-limit 16 \
  --data "$DATA_TXT" \
  --vocab-size 320 --bpe-vocab-size 320 \
  --layers 24 --embed-dim 64 --hidden-dim 128 \
  --epochs 1 --seq-len 128 --total-tokens 1200000 \
  --grad-accum 4 --lr 2e-4 --max-grad-norm 1.0 --seed 42 \
  --train-driver ck_cli --ck-cli-log-every 200 \
  --json-out version/v7/reports/v7_svg_assets_bpe_pipeline_l24_resume_stageb_latest.json

Keep vocab/dims aligned with the existing run when resuming. If you want a larger vocab (for example 640 instead of 320), start a fresh run directory and re-init.
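A small guard can catch shape drift before a resume run starts. This is only a sketch: the train_config.json filename and its keys are assumptions about what the run directory records, not the pipeline's actual schema; adapt them to the files your run dir really contains.

```python
# Resume-compatibility guard (sketch). The "train_config.json" filename and
# its keys are assumptions, not the pipeline's real schema -- adapt them to
# whatever your run directory actually records.
import json
from pathlib import Path

def check_resume_compat(run_dir, requested):
    """Abort if requested shape params differ from the run's saved config."""
    cfg = json.loads((Path(run_dir) / "train_config.json").read_text())
    mismatched = {k: (cfg.get(k), want) for k, want in requested.items()
                  if cfg.get(k) != want}
    if mismatched:
        raise SystemExit(f"[resume] shape mismatch vs existing run: {mismatched}")
```

Called with, for example, {"vocab_size": 320, "layers": 24}, it exits loudly on any mismatch instead of letting training start with incompatible dimensions.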

Phase 3: Add Data + Check Memorization

More data helps only when it stays clean and you measure train-vs-holdout behavior.

# Build a simple 90/10 line split for memorization checks
SHUF="$RUN/svg_all_shuf.txt"
TRAIN_TXT="$RUN/svg_train_90.txt"
HOLDOUT_TXT="$RUN/svg_holdout_10.txt"

shuf "$DATA_TXT" > "$SHUF"
TOTAL=$(wc -l < "$SHUF")
HOLDOUT=$(( TOTAL / 10 ))
TRAIN=$(( TOTAL - HOLDOUT ))
head -n "$TRAIN" "$SHUF" > "$TRAIN_TXT"
tail -n "$HOLDOUT" "$SHUF" > "$HOLDOUT_TXT"

echo "[split] train_lines=$(wc -l < "$TRAIN_TXT") holdout_lines=$(wc -l < "$HOLDOUT_TXT")"
echo "[next] retrain using --data \"$TRAIN_TXT\" and compare sample quality against holdout prompts"

If outputs are mostly exact copies of training lines, you are memorizing. If outputs stay valid and varied on holdout prompts, you are generalizing.
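That exact-copy judgment can be quantified with a short probe. Sketch only: it assumes one SVG sample per line, as in the corpora above, and the strip() normalization is illustrative.

```python
# Memorization probe (sketch): fraction of generated samples that are
# verbatim copies of training lines. Assumes one SVG per line, as in the
# corpora above; strip() normalization is illustrative.
def exact_copy_rate(generated, train_lines):
    train = {ln.strip() for ln in train_lines if ln.strip()}
    gen = [ln.strip() for ln in generated if ln.strip()]
    return sum(1 for ln in gen if ln in train) / len(gen) if gen else 0.0

rate = exact_copy_rate(
    generated=["<svg a/>", "<svg b/>", "<svg z/>", ""],
    train_lines=["<svg a/>", "<svg b/>", "<svg c/>"])
print(f"exact_copy_rate={rate:.2f}")
```

Run it against samples generated from holdout prompts: a rate near 1.0 means memorization, a low rate with still-valid SVG means generalization.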

Step 10: Ablation Matrix Sweep (one command)

[Diagram: ablation matrix sweep dimensions. Each combination is an independent training run; results are ranked by final loss. Embed dim 64, 96 (--embed-dims); layers 12, 24 (--layers-list); BPE vocab 320, 640 (--bpe-vocab-sizes); dataset fraction 0.5, 1.0 (--dataset-fracs); KV divisor 1, 2, i.e. MHA vs GQA (--kv-divisors). 2×2×2×2×2 = 32 total runs. Outputs: ablation_results_ranked.json, ablation_results.csv, ablation_summary.md, ranked by final loss.]

Run a ranked matrix over embed dim, dataset size, token vocab size, layer count, and GQA ratio. This is the fastest way to see which configs stay stable and improve quality.
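The sweep's combinatorics can be sketched with itertools.product; this mirrors the dimension flags passed below, while the runner itself owns run naming, scheduling, and ranking.

```python
# Enumerating the sweep grid (sketch). This mirrors the dimension flags
# passed to run_ablation_matrix_v7.py; the runner owns execution and ranking.
from itertools import product

def sweep_grid(embed_dims=(64, 96), layers_list=(12, 24),
               bpe_vocabs=(320, 640), dataset_fracs=(0.5, 1.0),
               kv_divisors=(1, 2)):
    return [dict(embed_dim=e, layers=l, bpe_vocab=v, dataset_frac=f,
                 kv_divisor=k)
            for e, l, v, f, k in product(embed_dims, layers_list,
                                         bpe_vocabs, dataset_fracs,
                                         kv_divisors)]

print(len(sweep_grid()))  # 2 x 2 x 2 x 2 x 2 = 32 independent runs
```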

AB_RUN_ROOT=$HOME/.cache/ck-engine-v7/models/train/v7_ablation_matrix_svg_ascii

python3 version/v7/scripts/run_ablation_matrix_v7.py \
  --run-root "$AB_RUN_ROOT" \
  --data "$DATA_TXT" \
  --tokenizer ascii_bpe \
  --require-svg-rows \
  --embed-dims 64,96 \
  --layers-list 12,24 \
  --bpe-vocab-sizes 320,640 \
  --dataset-fracs 0.5,1.0 \
  --total-tokens-list 131072 \
  --num-heads 8 \
  --kv-divisors 1,2 \
  --seq-len 96 \
  --epochs 3 \
  --grad-accum 4 \
  --lr 2e-4 \
  --max-grad-norm 1.0 \
  --seed 42

Outputs: $AB_RUN_ROOT/ablation_results_ranked.json, $AB_RUN_ROOT/ablation_results.csv, $AB_RUN_ROOT/ablation_summary.md.
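A finished sweep can be re-ranked from the CSV without rerunning anything. Sketch only: the final_loss column name is an assumption about the schema of ablation_results.csv; verify it against the real header first.

```python
# Re-rank sweep results from the CSV (sketch). The "final_loss" column name
# is an assumption about the ablation_results.csv schema -- verify it
# against the real header before relying on this.
import csv

def top_runs(csv_path, n=3, loss_col="final_loss"):
    """Return the n rows with the lowest final loss."""
    with open(csv_path, newline="") as fh:
        rows = list(csv.DictReader(fh))
    return sorted(rows, key=lambda r: float(r[loss_col]))[:n]
```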

# Optional: dry-run preview (no training execution)
python3 version/v7/scripts/run_ablation_matrix_v7.py \
  --run-root "$AB_RUN_ROOT" \
  --data "$DATA_TXT" \
  --dry-run --max-runs 2

CLI Options Cheat Sheet

List all options

python3 version/v7/scripts/ck_run_v7.py --help
python3 version/v7/scripts/ck_run_v7.py run --help
python3 version/v7/scripts/ck_run_v7.py train --help
python3 version/v7/scripts/train_data_pipeline_v7.py --help
python3 version/v7/scripts/promote_latest_checkpoint_v7.py --help
python3 version/v7/scripts/run_ablation_matrix_v7.py --help
python3 version/v7/tools/open_ir_visualizer.py --help
python3 scripts/ck_chat.py --help

Most-used flags

Command                         | Useful flags
ck_run_v7.py run                | --context-len, --force-compile, --force-convert, --chat-template, --generate-only, --generate-visualizer
train_data_pipeline_v7.py       | --template qwen2|qwen3|qwen35|gemma3|nanbeige|llama, --curriculum-stage stage_a|stage_b, --tokenizer bpe|ascii_bpe, --require-svg-rows, --require-ascii-data, --bpe-vocab-size, --seq-len, --total-tokens, --lr, --train-driver ck_cli, --ck-cli-log-every, --prepare-only, --roundtrip-max-lines, --strict-data-gates, --min-valid-svg-rate, --post-train-eval
promote_latest_checkpoint_v7.py | --run, --list-runs, --stage, --stage-pass, --run-id, --dry-run, --step
run_ablation_matrix_v7.py       | --embed-dims, --layers-list, --bpe-vocab-sizes, --dataset-fracs, --kv-divisors, --resume
ck_chat.py                      | --python-tokenizer, --show-token-ids, --show-token-pieces, --ascii-display, --temperature

Human-Readable Output Check

.venv/bin/python scripts/ck_chat.py \
  --model-dir "$RUN/.ck_build" \
  --python-tokenizer --chat-template none \
  --prompt "<svg" --max-tokens 120 --temperature 0.0