v7 Inference + Training Runbook
Single place to run v7 inference from GGUF and v7 training with true_bpe as the tokenizer source of truth across train + infer.
One Pipeline, Two Paths
Inference path: GGUF -> run dir -> IR/codegen -> .ck_build -> ck_chat.py.
Training path: text data -> true_bpe train -> token stream -> ck_run_v7.py train -> reuse same tokenizer artifacts for inference.
For a short operator demo, show three things only:
GGUF -> IR visualizer, one SVG-backed training run, and the family training matrix via make regression-training-fast or make regression-training-full.
Demo Lane: Shortest Video Path
1. GGUF -> IR Visualizer -> Chat
Use this when you want the fastest visual proof that v7 lowers, codegens, and serves an interactive run directory.
version/v7/scripts/cks-v7-run run \
  hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf \
  --context-len 1024 --force-compile --force-convert \
  --generate-visualizer
This leaves a run root under ${CK_CACHE_DIR:-$HOME/.cache/ck-engine-v7/models}/Qwen--Qwen3-0.6B-GGUF with ir_report.html ready to open.
2. SVG Training Demo -> Dashboard
This is the shortest end-to-end training demo: repo SVG assets become staged corpora, tokenizer, parity gate, checkpoints, and refreshed visualizer artifacts.
bash version/v7/scripts/v7_train_init_preset.sh \
  --name demo_qwen3_l16_d128_h512_ctx512 \
  --template qwen3 --init xavier_uniform \
  --tokenizer ascii_bpe --optimizer adamw \
  --layers 16 --embed-dim 128 --hidden-dim 512 \
  --context-len 512 --vocab-size 1024 \
  --num-heads 8 --num-kv-heads 4 \
  --mode commands | bash
The helper seeds training_plan.json, builds the SVG corpora from docs/site/assets/*.svg, bootstraps the run, runs parity, then refreshes $RUN/ir_report.html.
3. Family Training Proof Table
Use this when you want the neat nightly-style table that shows the v7 training parity surface across the supported families.
make regression-training-fast
make regression-training-full
fast covers qwen2, qwen3, gemma, and nanbeige. full adds qwen35 and matches the heavier nightly-sized training lane.
Training Family Quick Starts
These are toy-scale operator starters for the supported v7 training families. They all use the same SVG-backed pipeline and the same helper, so the demo shape stays stable while the family changes.
Qwen2
bash version/v7/scripts/v7_train_init_preset.sh \
  --name demo_qwen2_l12_d96_h384_ctx512 \
  --template qwen2 --init xavier_uniform \
  --tokenizer ascii_bpe --optimizer adamw \
  --layers 12 --embed-dim 96 --hidden-dim 384 \
  --context-len 512 --vocab-size 1024 \
  --num-heads 8 --num-kv-heads 4 \
  --mode commands | bash
Qwen3
bash version/v7/scripts/v7_train_init_preset.sh \
  --name demo_qwen3_l16_d128_h512_ctx512 \
  --template qwen3 --init xavier_uniform \
  --tokenizer ascii_bpe --optimizer adamw \
  --layers 16 --embed-dim 128 --hidden-dim 512 \
  --context-len 512 --vocab-size 1024 \
  --num-heads 8 --num-kv-heads 4 \
  --mode commands | bash
Qwen3.5
bash version/v7/scripts/v7_train_init_preset.sh \
  --name demo_qwen35_l12_d128_h512_ctx1024 \
  --template qwen35 --init xavier_uniform \
  --tokenizer ascii_bpe --optimizer adamw \
  --layers 12 --embed-dim 128 --hidden-dim 512 \
  --context-len 1024 --vocab-size 1024 \
  --num-heads 8 --num-kv-heads 4 \
  --mode commands | bash
Gemma 3
bash version/v7/scripts/v7_train_init_preset.sh \
  --name demo_gemma3_l18_d160_h640_ctx1024 \
  --template gemma3 --init xavier_uniform \
  --tokenizer ascii_bpe --optimizer adamw \
  --layers 18 --embed-dim 160 --hidden-dim 640 \
  --context-len 1024 --vocab-size 1536 \
  --num-heads 10 --num-kv-heads 5 \
  --mode commands | bash
Nanbeige
bash version/v7/scripts/v7_train_init_preset.sh \
  --name demo_nanbeige_l16_d128_h512_ctx1024 \
  --template nanbeige --init xavier_uniform \
  --tokenizer ascii_bpe --optimizer adamw \
  --layers 16 --embed-dim 128 --hidden-dim 512 \
  --context-len 1024 --vocab-size 1024 \
  --num-heads 8 --num-kv-heads 8 \
  --mode commands | bash
After any family run, regenerate and open the dashboard with the matching run path.
RUN="$HOME/.cache/ck-engine-v7/models/train/demo_qwen3_l16_d128_h512_ctx512"
python3 version/v7/tools/open_ir_visualizer.py --generate --run "$RUN" --html-only --strict-run-artifacts
python3 version/v7/tools/open_ir_hub.py --open
Swap the RUN=... value to the family you just trained. These presets are demo-sized starters, not full published model sizes.
Click any step above to jump there. Steps light up as you work through the training path.
Training Path (true_bpe)
Dataset prep -> ASCII/SVG cleanup -> ascii_bpe tokenizer -> train -> run-local inference.
Inference Path (GGUF)
No training needed. Pull a GGUF from HF and run directly in v7 runtime.
Dataset-only workflow: for focused pretrain/midtrain SVG corpus generation from docs/site/assets/*.svg, use v7-svg-dataset-runbook.html.
Multi-domain training progression (experimental): for staged playbooks across SVG, reasoning/agent routing, and code domains, use v7-training-progression-playbook.html.
Methodical spec design: for the contract-level method that separates asset libraries, scene DSLs, compilers, and per-run report discipline, use spec-training-method.html.
Long-range CK-native roadmap: for how v7 training foundations roll into v8-v17 capabilities, use training-curriculum.html.
Parity gate checklist: run v7-parity-checklist.html for the current GO criteria (dataset_qc, tokenizer_roundtrip exact_match, D1/E1/F1, canary row1/row2).
Caveat: A1/A2 remain a separate kernel-harness bug track and do not block runbook execution readiness under the current operator gate.
Python Authoring Notebooks
Quickstart Notebook
Launch from the repo root so the notebook can auto-detect ckernel_engine/ and version/v7/.
jupyter lab notebooks/v7_python_authoring_quickstart.ipynb
Walks through materialize -> train -> prepare_viewers() and renders a run artifact dashboard with direct links into the rest of the v7 surface.
Artifact Walkthrough Notebook
Use after the quickstart or the CLI example to inspect the run-dir handoff boundary and generated artifacts.
jupyter lab notebooks/v7_python_authoring_artifact_walkthrough.ipynb
Focuses on python_authoring_plan.json, weights_manifest.json, ir1/ir2, layout, reports, and viewer outputs.
DSL Dataset Preparation Notebook
Use this when the user needs to understand the split-aware SVG/DSL workspace itself, not just a tiny training run.
jupyter lab notebooks/v7_dsl_dataset_preparation.ipynb
Scaffolds the real dataset workflow: inspect version/v7/data/spec04, materialize workspace artifacts, stage the workspace into $RUN/dataset/, regenerate dataset_viewer.html, refresh ir_hub.html, and surface the Python + CLI training handoff commands.
What the notebook lane gives you: a Python front door into the existing v7 runtime and artifact surface, not a separate execution stack.
Typical tiny-run outputs: python_authoring_plan.json, ir_report.html, embeddings.json, train_e2e_latest.json, and the shared ir_hub.html.
Conditional artifacts: dataset_viewer.html requires dataset manifests or a staged dataset workspace, and attention.json requires tokenizer plus probe artifacts.
Recommended Notebook Suite
The current notebook set now covers thin Python authoring plus DSL dataset prep. The broader notebook surface should still settle into three user-facing lanes:
- Getting Started + Viewer Tour: start a tiny run, open the IR visualizer, jump to the IR hub, and understand when the dataset viewer is available.
- DSL Dataset Prep + Inspection: stage SVG/DSL data, inspect manifests, regenerate dataset_viewer.html, and explain how staged datasets flow into runs.
- Training With C-Kernel-Engine: materialize, sanity-check, parity-check, and train from Python while still surfacing the underlying ck_run_v7.py commands and run-dir artifacts.
Suggested notebook names:
- notebooks/v7_python_authoring_quickstart.ipynb: current entrypoint; keep focused on the smallest successful run plus viewer links.
- notebooks/v7_python_authoring_artifact_walkthrough.ipynb: current run-dir inspection notebook; keep focused on manifests, IR, codegen, and reports.
- notebooks/v7_dsl_dataset_preparation.ipynb: current dataset workspace notebook for manifests, staging, run-local snapshots, and dataset_viewer.html.
- notebooks/v7_training_operator_workbench.ipynb: planned training notebook for parity, sanity, full train, and artifact refresh.
The right design is notebook as guided front door, with ir_hub.html and run-local HTML artifacts remaining the durable operator surface.
IR Hub Quick Launch
Open the parent dashboard for all runs under $HOME/.cache/ck-engine-v7/models.
.venv/bin/python version/v7/tools/open_ir_hub.py --open
Your latest run will appear automatically when its run dir is under $HOME/.cache/ck-engine-v7/models/train/... (including svg_l16_d128_h512_v1024_ctx512).
Why SVG for v7
What to remember
- Step 0.5 is minimal and fast; use it for quick validation runs.
- Step 0.55 is for quality scaling before hyperparameter tuning.
- Full infographics are useful after syntax/closure is already stable.
Default progression
Simple primitives -> layout patterns -> full infographic structures.
Use Stage A checkpoint as the base for Stage B midtraining.
After that, run instruction-focused SFT/RL stages for prompt fidelity.
Training Pipeline Graph
Paste a training_plan.json to visualize compute stages + split-aware data flow. Default layout is stage + train/dev/test, with optional holdout when you need a canary or OOD bucket.
Step 0.0: Config Init + Pipeline Designer (Recommended)
Set model shape once (name/layers/embed/context/vocab), then auto-generate the full export + command block with consistent run/data paths.
Interactive Config Builder (JS)
Pick model settings and stage plan, then click Generate to produce copy-ready commands in strict operator sequence.
Preset note: family defaults here are toy-scale operator starters (not exact published model sizes).
Stage Script Activation
DPO/GRPO/PPO are generated in --plan-only mode by default (datasets + manifests + workflow artifacts). Remove --plan-only to execute CE-surrogate alignment updates.
Advanced: token budget override
Auto mode reads pretrain token count from $RUN/training_pipeline_latest.json after --prepare-only.
Pipeline Designer — Define Stages & Generate training_plan.json
Define which training stages to run, assign the primary train dataset and LR per stage, then append dev/test/holdout datasets as needed in JSON. Click Generate Plan to produce a training_plan.json you can commit alongside your run dir and load into the IR Visualizer.
# Click "Generate training_plan.json" to produce the plan.
Save this JSON as $RUN/training_plan.json. The IR Visualizer reads it to render the pipeline timeline. Toggle stages on/off to define exactly what you intend to run. The builder seeds one train dataset per stage; append dev/test/holdout entries under stages[].datasets as you formalize evaluation, and append each completed run under stages[].runs without overwriting history.
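The plan shape can be sketched from Python; note that only stages[].datasets and stages[].runs are named by this runbook, so every other field name below is an illustrative assumption, not the authoritative schema:

```python
import json

# Hypothetical minimal training_plan.json: "name", "enabled", "lr", "split",
# and "path" are assumed field names for illustration only.
plan = {
    "stages": [
        {
            "name": "pretrain",
            "enabled": True,
            "lr": 5e-4,
            "datasets": [
                {"split": "train", "path": "data/svg_pretrain_pack_stage_a_plus_bridge.txt"},
                {"split": "dev", "path": "data/svg_dev.txt"},
            ],
            # Append each completed run here; never overwrite history.
            "runs": [],
        }
    ]
}

with open("training_plan.json", "w") as f:
    json.dump(plan, f, indent=2)
```

Use the builder's generated output as the source of truth for real field names; this sketch only shows the append-don't-overwrite discipline for stages[].datasets and stages[].runs.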
# Click "Generate Commands" to populate this block.
Generated sequence: config -> dataset build -> pipeline bootstrap -> tokenizer check -> pretrain -> parity gate -> midtrain -> sft -> dashboards.
Shell script alternative — same config, run directly in terminal instead of using the builder above:
# Print exports + full command plan (copy/paste result; does NOT execute training)
bash version/v7/scripts/v7_train_init_preset.sh \
  --name svg_l16_d128_h512_v1024_ctx512_clean02 \
  --template qwen3 --init xavier_uniform \
  --tokenizer ascii_bpe --optimizer adamw \
  --layers 16 \
  --embed-dim 128 \
  --hidden-dim 512 \
  --context-len 512 \
  --vocab-size 1024 \
  --num-heads 8 \
  --num-kv-heads 4 \
  --mode both

# Optional: apply exports directly in current shell
eval "$(bash version/v7/scripts/v7_train_init_preset.sh \
  --name svg_l16_d128_h512_v1024_ctx512_clean02 \
  --template qwen3 --init xavier_uniform \
  --tokenizer ascii_bpe --optimizer adamw \
  --layers 16 --embed-dim 128 --hidden-dim 512 \
  --context-len 512 --vocab-size 1024 --mode env)"

# Optional: execute generated command plan directly
bash version/v7/scripts/v7_train_init_preset.sh \
  --name svg_l16_d128_h512_v1024_ctx512_clean02 \
  --template qwen3 --init xavier_uniform \
  --tokenizer ascii_bpe --optimizer adamw \
  --layers 16 --embed-dim 128 --hidden-dim 512 \
  --context-len 512 --vocab-size 1024 --mode commands | bash
The helper derives RUN under $HOME/.cache/ck-engine-v7/models/train/$CK_NAME and binds stage dataset paths using the same prefix, so the IR hub and visualizer stay aligned.
Step 0: Prereqs + Paths
Run this once before any raw .venv/bin/python ... commands:
make v7-init
make v7-doctor
Required Python packages (supported v7 workflows): numpy torch safetensors requests tqdm huggingface_hub gguf tokenizers transformers
This is the supported v7 bundle for run / train / parity workflows. Not every package is used on every execution path. If you prefer uv, conda, or manual pip, install that same package set into the interpreter you plan to use, then run make v7-doctor. The runbook examples assume that check already passed.
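As a quick manual check that mirrors part of what make v7-doctor verifies, you can probe importability of the bundle from the interpreter you plan to use (the helper below is an illustrative sketch, not part of the repo):

```python
import importlib.util

# Supported v7 bundle from this runbook.
REQUIRED = ["numpy", "torch", "safetensors", "requests", "tqdm",
            "huggingface_hub", "gguf", "tokenizers", "transformers"]

def missing_packages(names):
    """Return the subset of package names not importable in this interpreter."""
    return [n for n in names if importlib.util.find_spec(n) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("[ERROR] Missing packages:", " ".join(missing))
else:
    print("[OK] All supported v7 packages importable.")
```

This only checks importability; make v7-doctor remains the authoritative gate.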
If you used the Config Builder above, $ROOT, $RUN, and $GEN_DIR are already exported — skip to Step 0A. Use this step when setting paths manually or after opening a new terminal.
export ROOT=/home/antshiv/Workspace/C-Kernel-Engine
export RUN=$HOME/.cache/ck-engine-v7/models/train/v7_svg_assets_bpe_l24_full_e1
export DATA_DIR=$ROOT/version/v7/data
export DATA_TXT=$DATA_DIR/svg_assets_train.txt
mkdir -p "$RUN" "$DATA_DIR"
if [ ! -f "$DATA_TXT" ]; then
  echo "[ERROR] Missing data file: $DATA_TXT"
  echo "Set DATA_TXT to a real file or create one in $DATA_DIR."
else
  echo "[OK] Found data file: $DATA_TXT"
fi
Important: use $HOME/.cache/..., not /.cache/.... These vars are exported for later Python steps; if you open a new terminal, re-run Step 0 first. The repo includes a bundled demo file at version/v7/data/svg_assets_train.txt.
Open Run Hub (all runs in one place):
# Generate + open consolidated run dashboard for ~/.cache/ck-engine-v7/models
.venv/bin/python version/v7/tools/open_ir_hub.py --open

# Optional explicit paths
# .venv/bin/python version/v7/tools/open_ir_hub.py \
#   --models-root "$HOME/.cache/ck-engine-v7/models" \
#   --output "$HOME/.cache/ck-engine-v7/models/ir_hub.html" \
#   --index-out "$HOME/.cache/ck-engine-v7/models/runs_hub_index.json" \
#   --open
Use Run Hub as the parent operator view, then drill into each run's ir_report.html.
Guardrail: keep RUN under $HOME/.cache/ck-engine-v7/models/train/... for automatic hub indexing. Repo-local run dirs are easy to forget and usually need manual move/symlink later.
Step 0A: Pipeline Bootstrap (Recommended)
Use this once per new run to make the training pipeline explicit from the start: generate staged datasets, lock tokenizer corpus, seed run artifacts, and generate the first operator view.
Dataset workspace first: if you are using the staged dataset flow, prepare a split-aware workspace such as version/v7/data/spec04 in v7-svg-dataset-runbook.html first, then stage that workspace into the run with --dataset-workspace. The repo workspace is the seed template; the run-local copy under $RUN/dataset/ is the working dataset for that run.
Operator rule: do not split generated artifacts between the repo and cache. Keep the staged dataset, dataset_viewer.html, ir_report.html, checkpoints, parity JSON, and perf outputs together under one cache run directory so the run can be inspected, tarred, shared, and discovered by python3 version/v7/tools/open_ir_hub.py --open without hidden side locations.
# 1) Build staged SVG corpora + instruction corpus
export GEN_DIR="$RUN/data"
mkdir -p "$GEN_DIR"
.venv/bin/python version/v7/scripts/build_svg_pretrain_corpus_v7.py \
  --out-dir "$GEN_DIR" \
  --prefix svg_pretrain_pack_l16d128 \
  --assets-glob "$ROOT/docs/site/assets/*.svg"
.venv/bin/python version/v7/scripts/generate_svg_instruction_dataset_v7.py \
  --out-dir "$GEN_DIR" \
  --prefix svg_pretrain_pack_l16d128 \
  --num-samples 1000 --jsonl

# 2) Canonical stage dataset bindings for this run
export TOKENIZER_CORPUS="$GEN_DIR/svg_pretrain_pack_l16d128_tokenizer_corpus.txt"
export PRETRAIN_DATA="$GEN_DIR/svg_pretrain_pack_l16d128_stage_a_plus_bridge.txt"
export MIDTRAIN_DATA="$GEN_DIR/svg_pretrain_pack_l16d128_stage_b.txt"
export SFT_DATA="$GEN_DIR/svg_pretrain_pack_l16d128_instruction_train.txt"

# 3) Bootstrap run + tokenizer artifacts (no training yet)
.venv/bin/python version/v7/scripts/train_data_pipeline_v7.py \
  --run "$RUN" \
  --init-if-missing \
  --init xavier_uniform \
  --template qwen3 \
  --curriculum-stage stage_a \
  --tokenizer ascii_bpe \
  --require-svg-rows \
  --strict-data-gates \
  --data "$TOKENIZER_CORPUS" \
  --vocab-size 1024 --bpe-vocab-size 1024 \
  --layers 16 --embed-dim 128 --hidden-dim 512 \
  --seq-len 512 --total-tokens 1048576 \
  --prepare-only \
  --json-out "$RUN/train_prepare_stage_a.json"

# 4) Generate initial operator artifacts
.venv/bin/python version/v7/tools/open_ir_visualizer.py --generate --run "$RUN" --html-only --strict-run-artifacts
.venv/bin/python version/v7/tools/open_ir_hub.py --open
Canonical path: Step 0A is the real bootstrap path. Use Step 0.7 only as a tiny tokenizer preview smoke-check (it uses stub dims and should not be your production bootstrap).
Split-aware staged-dataset bootstrap (spec04 default)
# 0) Prepare a split-aware workspace once (see v7-svg-dataset-runbook.html for import/normalize/classify/materialize)
bash version/v7/scripts/init_data_workspace_v7.sh --spec spec04 --dataset-type svg

# 1) Bootstrap a run with a staged dataset snapshot under $RUN/dataset
.venv/bin/python version/v7/scripts/ck_run_v7.py init \
  --run "$RUN" \
  --layers 16 --vocab-size 1024 --embed-dim 128 --hidden-dim 512 \
  --num-heads 8 --num-kv-heads 4 --context-len 512 \
  --template qwen3 \
  --dataset-workspace version/v7/data/spec04 \
  --dataset-stage-mode copy \
  --dataset-stage-force \
  --generate-ir --generate-runtime --strict

# 2) Regenerate both operator views after dataset changes
.venv/bin/python version/v7/tools/open_ir_visualizer.py --generate --run "$RUN" --html-only --strict-run-artifacts --output "$RUN/ir_report.html"
.venv/bin/python version/v7/scripts/build_svg_dataset_visualizer_v7.py \
  --workspace "$RUN/dataset" \
  --output "$RUN/dataset_viewer.html"
Use this flow when you want the whole run to be self-contained under $RUN: IR viewer, Dataset Viewer, manifests, stage split corpora (train/dev/test by default), tokenizer corpus, optional holdout, parity JSON, and perf artifacts all live under the same run directory.
Rule: after this bootstrap, keep tokenizer fixed and run later stages with --reuse-run-tokenizer to avoid vocabulary drift.
Step 0B: Tag Contract — Specialist Model Interface
This model is a tag→SVG specialist renderer, not a general instruction-following LLM. The interface is a closed vocabulary of structured tags — understand this contract before training or running inference.
Training Row Format
# Every training row and inference prompt uses the same closed-vocab tag prefix:
[circle][palette:cool][style:minimal][layout:center]<svg width="140" height="140">...</svg><eos>
[bar-chart][bars:5][ascending][palette:warm][axes][trend-line]<svg width="280" height="180">...</svg><eos>
[infographic][palette:dark][style:gradient][labeled][complexity:rich]<svg width="320" height="180">...</svg><eos>
The old <task>...</task><svg> format is superseded. All spec-driven corpus rows use the tag prefix format above. Do not mix formats in the tokenizer corpus.
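A quick format check for corpus rows can be sketched as follows; the regex is an illustrative approximation of the tag-prefix contract above, not the repo's actual validator:

```python
import re

# Row contract from this runbook: one or more [tag] or [key:value] prefix
# tokens, then an <svg ...>...</svg> body, then <eos>.
ROW_RE = re.compile(r"^(\[[a-z0-9-]+(?::[a-z0-9-]+)?\])+<svg\b.*</svg><eos>$")

def is_valid_row(line: str) -> bool:
    """True if the line matches the closed-vocab tag-prefix row format."""
    return bool(ROW_RE.match(line.strip()))

print(is_valid_row('[circle][palette:cool]<svg width="140" height="140">...</svg><eos>'))  # True
print(is_valid_row('<task>draw a circle</task><svg>...</svg>'))  # old format: False
```

A gate like this can catch mixed-format rows before they contaminate the tokenizer corpus.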
Tag Vocabulary
| Category | Tags |
|---|---|
| Shape | [circle] [rect] [line] [triangle] [ellipse] [polygon] [path] [arrow] [double-arrow] [polyline] |
| Chart | [bar-chart] [line-chart] [table] [pie-chart] [scatter] |
| Infographic | [infographic] [badge] [flow] [timeline] [legend] [progress] |
| Count | [bars:3] [bars:5] [bars:7] [points:4] [points:6] [points:8] [slices:3] [slices:4] [shapes:2] [shapes:3] |
| Data order | [ascending] [descending] [mixed] [axes] [trend-line] [labeled] [values] |
| Palette | [palette:neutral] [palette:bold] [palette:warm] [palette:cool] [palette:pastel] [palette:dark] |
| Style | [style:gradient] [style:minimal] [style:filled] [style:outline] |
| Layout | [layout:center] [layout:tiled] [layout:grid] [layout:stacked] [layout:horizontal] [layout:header] [layout:flow] |
| Complexity | [complexity:simple] [complexity:moderate] [complexity:rich] |
| Prompt style | [prompt:concise] [prompt:descriptive] [prompt:coordinate] [prompt:imperative] |
Custom BPE is trained on this corpus — frequent patterns like [bar-chart] merge into a single token. Verify this after Step 0A: [bar-chart] should appear as 1 token ID, not 11 characters.
Two-Model Cascade (Inference)
# Big model (GPT/Claude): natural language → structured tags
# Small model (128d CK specialist): tags → SVG
#
# User: "5-bar ascending chart, warm colors"
#   → Big model   → [bar-chart][bars:5][ascending][palette:warm][axes]
#   → Small model → <svg>...</svg>

# Direct inference after training:
echo "[bar-chart][bars:5][ascending][palette:cool][axes]" | \
  ./build/ck-cli-v7 --model "$CKPT_PATH" --max-tokens 512
Coverage Gate Rules
- ≥ 200 train variants per spec — enough diversity per tag combination
- ≥ 20 holdout variants per spec — holdout set is meaningful per spec
- ≥ 30 pairwise co-occurrences — every key tag pair appears together ≥ 30 times
Gate is enforced by --strict-coverage in the corpus builder. Spec catalog: version/v7/data/spec_catalog_v1.json (128 specs across 6 groups). Coverage manifests are emitted per stage and consumed by the pipeline gate in Step 3.7.
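The pairwise co-occurrence rule can be approximated offline with a sketch like this. It is a simplified stand-in for the builder's --strict-coverage gate, assuming tag-prefix rows as shown in Step 0B:

```python
import re
from collections import Counter
from itertools import combinations

TAG_RE = re.compile(r"\[[^\][]+\]")

def pair_counts(rows):
    """Count how often each unordered tag pair co-occurs in a row's tag prefix."""
    counts = Counter()
    for row in rows:
        prefix = row.split("<svg", 1)[0]          # tags precede the SVG body
        tags = sorted(set(TAG_RE.findall(prefix)))  # dedupe, canonical order
        for a, b in combinations(tags, 2):
            counts[(a, b)] += 1
    return counts

rows = ['[bar-chart][bars:5][ascending]<svg>...</svg><eos>'] * 30
counts = pair_counts(rows)
print(counts[("[ascending]", "[bar-chart]")])  # 30 — meets the >= 30 pair rule
```

The real gate lives in the corpus builder; this is only useful for eyeballing a corpus before running it.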
Step 0.1: Clean Start (Optional)
Use this when you want a true fresh runbook pass with no stale artifacts.
# Preview what will be removed (scoped to this RUN only)
echo "$RUN"
echo "$RUN/parity_canary"
echo "$RUN/parity_svg_row1"
echo "$RUN/parity_svg_row2"

# Fresh start: remove previous artifacts for this run scope
rm -rf "$RUN"

# Recreate run dir and verify env is still exported
mkdir -p "$RUN"
echo "[OK] RUN=$RUN"
echo "[OK] DATA_TXT=$DATA_TXT"
This does not touch repo source files. It only resets this specific run path under $HOME/.cache/ck-engine-v7/models/train/....
Step 0.5: Data Preparation (Training)
This is the minimal baseline data path. Training expects a UTF-8 plain text file. For a quick demo, use the bundled $DATA_TXT. For strict SVG + ASCII mode, treat cleanup as required before training.
- Fast path: use bundled demo file in this repo (no conversion step).
- Recommended: one logical sample per line (for SVG data, one SVG per line).
- JSONL is fine as source, but flatten to text first with an explicit source path.
- ASCII-only requirement for --tokenizer ascii_bpe: first map common symbols to keyboard-style ASCII (example: → -> ->, ± -> +/-, × -> x), then XML-escape any remaining non-ASCII chars (&#xNNNN;).
- Training now has dataset QC gates: ASCII is enforced by default for ascii_bpe; add --require-svg-rows to fail on non-<svg lines.
- Avoid binary data; normalize line endings to \n.
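The two-phase ASCII rule can be sketched as follows; this is an illustrative stand-in for prepare_ascii_dataset_v7.py, not its actual implementation:

```python
# Phase 1: keyboard-style replacements for common symbols.
COMMON_MAP = {"\u2192": "->", "\u00b1": "+/-", "\u00d7": "x"}

def to_ascii(text: str) -> str:
    """Map common symbols to ASCII, then XML-escape anything still non-ASCII."""
    for src, dst in COMMON_MAP.items():
        text = text.replace(src, dst)
    # Phase 2: XML numeric escapes (&#xNNNN;) keep file bytes ASCII-only
    # while preserving the original Unicode semantics.
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text)

print(to_ascii("a \u2192 b \u00d7 c"))  # a -> b x c
print(to_ascii("caf\u00e9"))            # caf&#xE9;
```

The ordering matters: keyboard mapping first keeps frequent symbols readable in the corpus, and escaping second guarantees the ASCII byte invariant.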
When to use Step 0.55: if the baseline file is small/repetitive and output quality stalls (closure/valid SVG rate), run Step 0.55 to scale corpus coverage before changing model hyperparameters.
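A rough way to measure the closure/valid-SVG rate mentioned above is a heuristic line scan; this sketch is not the runbook's QC gate, just a quick signal for whether Step 0.55 is warranted:

```python
def closure_rate(lines):
    """Fraction of non-empty rows whose SVG body closes with </svg>."""
    rows = [l.strip() for l in lines if l.strip()]
    if not rows:
        return 0.0
    # A row is "closed" if its body (before any <eos>) contains </svg>.
    closed = sum(1 for r in rows if "</svg>" in r.split("<eos>")[0])
    return closed / len(rows)

rows = ["<svg>...</svg>", "<svg>...", "<svg>...</svg><eos>"]
print(closure_rate(rows))  # 2 of 3 rows close
```

If this rate stalls on model output across epochs, scale the corpus (Step 0.55) before touching hyperparameters.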
# Fast path: bundled demo dataset (already in repo)
BASE_DATA_TXT="${DATA_DIR:-$ROOT/version/v7/data}/svg_assets_train.txt"
if [ ! -s "$DATA_TXT" ] && [ -s "$BASE_DATA_TXT" ]; then
echo "[WARN] DATA_TXT missing or empty: $DATA_TXT"
echo "[WARN] Falling back to bundled dataset: $BASE_DATA_TXT"
DATA_TXT="$BASE_DATA_TXT"
export DATA_TXT
fi
if [ ! -s "$DATA_TXT" ]; then
echo "[ERROR] Bundled dataset missing or empty: $DATA_TXT"
echo "Fix: git pull, or set DATA_TXT to your own text file."
else
echo "[OK] Using bundled dataset: $DATA_TXT"
wc -l -c "$DATA_TXT"
head -n 3 "$DATA_TXT"
fi
# Optional: replace with your own JSONL source
# Leave SRC_JSONL empty to skip this section.
SRC_JSONL=""
if [ -n "$SRC_JSONL" ]; then
if [ ! -f "$SRC_JSONL" ]; then
echo "[ERROR] Missing JSONL source: $SRC_JSONL"
echo "Set SRC_JSONL to a real path, or keep it empty to use bundled data."
else
jq -r '.text // empty' "$SRC_JSONL" | sed '/^[[:space:]]*$/d' > "$DATA_TXT"
perl -pe 's/\r\n?/\n/g' "$DATA_TXT" > "$DATA_TXT.tmp" && mv "$DATA_TXT.tmp" "$DATA_TXT"
echo "[OK] Replaced DATA_TXT from JSONL: $DATA_TXT"
wc -l -c "$DATA_TXT"
head -n 3 "$DATA_TXT"
fi
else
echo "[INFO] SRC_JSONL not set; keeping current DATA_TXT."
fi
# Required for strict SVG + ASCII path
# This cleanup keeps bytes ASCII-only and drops non-SVG rows.
SRC_DATA="$DATA_TXT"   # or /absolute/path/raw.jsonl
SRC_FMT="text"         # text | jsonl | auto
python3 version/v7/scripts/prepare_ascii_dataset_v7.py \
  --input "$SRC_DATA" \
  --output "$DATA_TXT" \
  --input-format "$SRC_FMT" \
  --jsonl-text-key text \
  --ascii-map-common \
  --ascii-mode xml_escape \
  --svg-only
wc -l -c "$DATA_TXT"
head -n 3 "$DATA_TXT"
--ascii-map-common applies keyboard-style replacements for common symbols first; xml_escape then preserves any remaining Unicode semantics while keeping file bytes ASCII-only (example: é -> &#xE9;).
At runtime, v7 training uses next-token windows from the token stream, not "one JSONL row = one batch sample".
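That windowing behavior can be illustrated with a simplified sketch; the actual trainer's batching (stride, padding, shuffling) may differ:

```python
def next_token_windows(token_ids, seq_len):
    """Yield (input, target) pairs where target is input shifted by one token.

    Illustrative stride-1 windows over a flat token stream; one JSONL row
    does NOT map to one batch sample.
    """
    for start in range(0, len(token_ids) - seq_len):
        window = token_ids[start : start + seq_len + 1]
        yield window[:-1], window[1:]

ids = [10, 11, 12, 13, 14]
pairs = list(next_token_windows(ids, seq_len=3))
print(pairs[0])  # ([10, 11, 12], [11, 12, 13])
```

This is why row boundaries in the source file matter less than the total token stream length and seq-len.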
Step 0.55: Build Larger ASCII SVG Corpus (Recommended)
This is the scale-up data path. It increases dataset diversity before you tune model hyperparameters. Output remains strictly ASCII while coverage expands using repo SVG assets + synthetic SVG generator data.
# Recommended: one-command builder (stitches existing scripts).
python3 version/v7/scripts/build_svg_pretrain_corpus_v7.py \
  --out-dir "$DATA_DIR" \
  --prefix svg_pretrain_pack \
  --assets-glob "$ROOT/docs/site/assets/*.svg" \
  --spec-catalog "$ROOT/version/v7/data/spec_catalog_v1.json" \
  --stage-a-samples 24000 \
  --stage-b-samples 28000 \
  --holdout-ratio 0.10 \
  --seed 42 \
  --strict-coverage
DATA_TXT="$DATA_DIR/svg_pretrain_pack_stage_a_plus_bridge.txt"
DATA_TXT_STAGE_B="$DATA_DIR/svg_pretrain_pack_stage_b.txt"
echo "[OK] Stage-A+bridge: $DATA_TXT"
echo "[OK] Stage-B: $DATA_TXT_STAGE_B"
echo "[OK] Coverage gate: passed (strict)"
wc -l -c "$DATA_TXT" "$DATA_TXT_STAGE_B"
This wrapper calls existing scripts (build_svg_corpus_from_assets_v7.py, generate_svg_instruction_dataset_v7.py, build_stage_a_bridge_svg_v7.py, prepare_ascii_dataset_v7.py) and emits train-ready Stage-A+bridge and Stage-B corpora.
Manual equivalent (advanced / debugging):
ASSET_UTF8="$DATA_DIR/svg_assets_docs_utf8.txt"
ASSET_UTF8_MANIFEST="$DATA_DIR/svg_assets_docs_utf8_manifest.json"
MAP_TSV="$DATA_DIR/svg_utf8_to_ascii_table.tsv"
MAP_JSON="$DATA_DIR/svg_utf8_to_ascii_table.json"
ASSET_ASCII="$DATA_DIR/svg_assets_docs_ascii.txt"
ASSET_MANIFEST="$DATA_DIR/svg_assets_docs_ascii_manifest.json"
SYN_PREFIX="svg_instruction_aug"
SYN_SVG="$DATA_DIR/${SYN_PREFIX}_svg_train.txt"
DATA_TXT_ASCII="$DATA_DIR/svg_assets_train_scaled_ascii.txt"
TMP_ASCII="$DATA_DIR/svg_assets_train_scaled_ascii.tmp.txt"
# 1) Flatten docs/site/assets/*.svg to one UTF-8 SVG-per-line corpus.
python3 version/v7/scripts/build_svg_corpus_from_assets_v7.py \
--assets-glob "$ROOT/docs/site/assets/*.svg" \
--output "$ASSET_UTF8" \
--manifest "$ASSET_UTF8_MANIFEST" \
--ascii-mode none \
--no-dedupe
# 2) Generate UTF-8 -> ASCII mapping coverage table.
python3 version/v7/scripts/generate_utf8_ascii_map_report_v7.py \
--input "$ASSET_UTF8" \
--tsv-out "$MAP_TSV" \
--json-out "$MAP_JSON"
# 3) Build strict ASCII SVG corpus: keyboard mapping first, XML escapes as fallback.
python3 version/v7/scripts/build_svg_corpus_from_assets_v7.py \
--assets-glob "$ROOT/docs/site/assets/*.svg" \
--output "$ASSET_ASCII" \
--manifest "$ASSET_MANIFEST" \
--ascii-map-common \
--ascii-mode xml_escape \
--no-dedupe
# 4) Generate additional synthetic SVG-only rows (already ASCII).
python3 version/v7/scripts/generate_svg_instruction_dataset_v7.py \
--out-dir "$DATA_DIR" \
--prefix "$SYN_PREFIX" \
--num-samples 20000 \
--holdout-ratio 0.10 \
--seed 42
# 5) Merge both sources, shuffle, and enforce strict ASCII + SVG-only rows,
#    then promote the scaled corpus. (Reconstructed from the variables
#    declared above and the Step 0.56 merge pattern.)
{ cat "$ASSET_ASCII"; cat "$SYN_SVG"; } | sed '/^[[:space:]]*$/d' | shuf > "$TMP_ASCII"
python3 version/v7/scripts/prepare_ascii_dataset_v7.py \
  --input "$TMP_ASCII" \
  --output "$DATA_TXT_ASCII" \
  --input-format text \
  --ascii-map-common \
  --ascii-mode xml_escape \
  --svg-only
rm -f "$TMP_ASCII"
DATA_TXT="$DATA_TXT_ASCII"
echo "[OK] Scaled ASCII corpus: $DATA_TXT"
wc -l -c "$DATA_TXT"
What Step 0.55 does in plain terms: (1) flattens docs/site/assets/*.svg into line-per-SVG corpora, (2) reports UTF-8->ASCII mapping coverage, (3) builds strict ASCII corpus, (4) augments with synthetic SVG rows, (5) shuffles/filters and promotes the larger dataset for the next steps.
This keeps the training stream strictly ASCII while preserving intent: keyboard-style substitutions first, XML escapes second. The scaled corpus usually improves closure stability and lowers exact-copy rate on holdout checks compared with the tiny bundled demo file.
Step 0.56: Stage-A Syntax Bridge Pack (Optional, Recommended)
If Stage-A uses a tiny/simple corpus, inject a small bridge pack so Stage-A already sees core Stage-B syntax primitives (<g>, <defs>, gradients, markers, transforms, filters, viewBox).
STAGE_B_SOURCE="${ASSET_ASCII:-$DATA_DIR/svg_assets_docs_ascii.txt}"
BRIDGE_TXT="$DATA_DIR/svg_stage_a_bridge_small.txt"
BRIDGE_MANIFEST="$DATA_DIR/svg_stage_a_bridge_small_manifest.json"
STAGE_A_PLUS="$DATA_DIR/svg_stage_a_plus_bridge_small.txt"
python3 version/v7/scripts/build_stage_a_bridge_svg_v7.py \
--stage-a "$DATA_TXT" \
--stage-b "$STAGE_B_SOURCE" \
--out "$BRIDGE_TXT" \
--manifest "$BRIDGE_MANIFEST" \
--per-feature-cap 4 \
--max-total 48 \
--seed 42
{ cat "$DATA_TXT"; cat "$BRIDGE_TXT"; } | sed '/^[[:space:]]*$/d' | shuf > "$STAGE_A_PLUS"
python3 version/v7/scripts/prepare_ascii_dataset_v7.py \
--input "$STAGE_A_PLUS" \
--output "$STAGE_A_PLUS" \
--input-format text \
--ascii-map-common \
--ascii-mode xml_escape \
--svg-only
DATA_TXT="$STAGE_A_PLUS"
echo "[OK] Stage-A bridge enabled: DATA_TXT=$DATA_TXT"
wc -l "$BRIDGE_TXT" "$DATA_TXT"
This keeps Stage-A simple while preventing a hard syntax shock at Stage-B. You still keep Stage-A objective/hyperparameters; only the input syntax support broadens.
Step 0.6: Preview Dataset
Quick sanity pass before training to confirm file size/content shape.
echo "[INFO] DATA_TXT=$DATA_TXT"
ls -lh "$DATA_TXT"
wc -l -c "$DATA_TXT"
head -n 5 "$DATA_TXT"
tail -n 5 "$DATA_TXT"

# Optional: show any non-SVG-looking lines
grep -nEv '^[[:space:]]*<svg' "$DATA_TXT" | head -n 20 || true
Step 0.7: Tokenize Dataset (true_bpe preview)
Optional preview only: skip this if you already ran Step 0A. Step 0.7 is a quick tokenizer smoke-check path with intentionally tiny/stub dims.
Create tokenizer artifacts and a token stream in the main run so operators use one dashboard path from start to finish.
TOKENIZE_RUN="$RUN"
TOKENIZE_WORK="$TOKENIZE_RUN/.ck_pipeline/bpe_preview_$(date -u +%Y%m%d_%H%M%S)"
mkdir -p "$TOKENIZE_RUN"
cd "$ROOT"

# Use strict ASCII tokenizer mode: --tokenizer ascii_bpe
.venv/bin/python version/v7/scripts/train_data_pipeline_v7.py \
  --run "$TOKENIZE_RUN" \
  --init-if-missing \
  --init xavier_uniform \
  --template qwen3 \
  --curriculum-stage stage_a \
  --tokenizer ascii_bpe \
  --require-svg-rows \
  --strict-data-gates --min-valid-svg-rate 0.70 \
  --roundtrip-max-lines 2048 --roundtrip-sample-limit 16 \
  --data "$DATA_TXT" \
  --vocab-size 320 --bpe-vocab-size 320 \
  --layers 2 --embed-dim 64 --hidden-dim 128 \
  --epochs 1 --seq-len 8 --total-tokens 64 \
  --grad-accum 1 --lr 5e-4 --max-grad-norm 1.0 --seed 42 \
  --work-dir "$TOKENIZE_WORK" \
  --json-out "$TOKENIZE_WORK/pipeline_report.json" \
  --prepare-only

echo "[OK] tokenizer: $TOKENIZE_WORK/tokenizer.json"
echo "[OK] tokens: $TOKENIZE_WORK/train_tokens.txt"
wc -l "$TOKENIZE_WORK/train_tokens.txt"
head -n 40 "$TOKENIZE_WORK/train_tokens.txt"
This is a preflight path. --prepare-only builds tokenizer + token stream without launching training.
Open IR visualizer (canonical run path):
.venv/bin/python version/v7/tools/open_ir_visualizer.py --generate --run "$TOKENIZE_RUN" --html-only
echo "[OK] Visualizer: $TOKENIZE_RUN/ir_report.html"
xdg-open "$TOKENIZE_RUN/ir_report.html" >/dev/null 2>&1 || true
In the report, check: Architecture -> Tokenizer (click tokenizer.json) and Training -> Pipeline + Data Lab (dataset folder path + tokenizer path + roundtrip samples).
Step 0.8: Encode/Decode Determinism Gate
Most accurate correctness check for ascii_bpe: encode the full dataset, decode it back, and compare to original bytes.
.venv/bin/python version/v7/scripts/test_ascii_bpe_roundtrip_v7.py \
  --run "$TOKENIZE_RUN" \
  --dataset "$DATA_TXT" \
  --require-ascii
Pass condition: ascii_bpe roundtrip PASS and matching byte counts.
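The pass condition reduces to byte-exact encode/decode over the whole dataset. A minimal sketch of that property, using a toy byte-level stand-in for ascii_bpe (the real check lives in test_ascii_bpe_roundtrip_v7.py):

```python
# Toy stand-in for the ascii_bpe encode/decode pair; illustrative only.
def encode(text: str) -> list[int]:
    return [ord(ch) for ch in text]          # one id per ASCII byte

def decode(ids: list[int]) -> str:
    return "".join(chr(i) for i in ids)

def roundtrip_exact(dataset_lines: list[str]) -> bool:
    # PASS condition: decode(encode(x)) == x for every row,
    # and total byte counts match on both sides.
    in_bytes = sum(len(line.encode("ascii")) for line in dataset_lines)
    out_lines = [decode(encode(line)) for line in dataset_lines]
    out_bytes = sum(len(line.encode("ascii")) for line in out_lines)
    return out_lines == dataset_lines and in_bytes == out_bytes

rows = ['<svg width="10"/>', '<circle r="4"/>']
assert roundtrip_exact(rows)
```

Any real BPE merge table must preserve the same invariant; the gate fails the moment a single byte differs.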
Step 0.9: Data Lab Artifacts + Paths
Pipeline now writes Data Lab JSON artifacts directly under $RUN so the visualizer can show dataset folder path, tokenizer JSON path, and row-level roundtrip samples.
ls -lh "$TOKENIZE_RUN"/dataset_qc.json \
"$TOKENIZE_RUN"/dataset_profile.json \
"$TOKENIZE_RUN"/tokenizer_roundtrip.json || true
jq '{status, path, dataset_dir, checks, non_empty_lines}' "$TOKENIZE_RUN/dataset_qc.json"
jq '{dataset_dir, line_length, duplicates, top_chars: (.top_chars|length), top_tokens: (.top_tokens|length)}' "$TOKENIZE_RUN/dataset_profile.json"
jq '{status, exact_match, tokenizer_json_path, line_eval, sample_rows: (.sample_rows|length)}' "$TOKENIZE_RUN/tokenizer_roundtrip.json"
Visualizer path: switch to Training mode and open Data Lab. This tab shows: dataset_dir, dataset_path, tokenizer_json_path, and row → token-id → decoded previews.
Step 2: Optional Alternative Data Ingestion (true_bpe)
This section is an alternative ingest path. Step 0.5 already covers the canonical data-prep flow; if you completed Step 0.5, skip this section and go straight to Step 3.1.
JSONL -> text with jq
# Optional: set only if you want to replace bundled DATA_TXT.
SRC_JSONL=""
if [ -z "$SRC_JSONL" ]; then
  echo "[INFO] SRC_JSONL empty; keeping DATA_TXT as-is: $DATA_TXT"
elif [ ! -f "$SRC_JSONL" ]; then
  echo "[ERROR] Missing JSONL source: $SRC_JSONL"
  echo "Set SRC_JSONL to a real path. Example: SRC_JSONL=$HOME/data/train.jsonl"
else
  jq -r '.text // empty' "$SRC_JSONL" > "$DATA_TXT"
  echo "[OK] Wrote DATA_TXT from JSONL: $DATA_TXT"
fi
JSONL -> text with Python
# Optional: set only if you want to replace bundled DATA_TXT.
export SRC_JSONL=""
export DATA_TXT
python3 - <<'PY'
import json
import os

src = os.environ.get("SRC_JSONL", "").strip()
dst = os.environ["DATA_TXT"]
if not src:
    print(f"[INFO] SRC_JSONL empty; keeping DATA_TXT as-is: {dst}")
else:
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            text = json.loads(line).get("text", "")
            if text:
                fout.write(text + "\n")
    print(f"[OK] Wrote DATA_TXT from JSONL: {src} -> {dst}")
PY
Step 3.1: Mandatory CK vs PyTorch Parity Gate (10 epochs)
Required gate before long CK-only training: run two tiny canaries (dataset rows 1 and 2) for 10 epochs with --with-torch-ref. This catches generated-runtime or codegen corruption early.
cd "$ROOT"
PARITY_ROOT="$RUN/parity_canary"
mkdir -p "$PARITY_ROOT"
# Use a stable canary source for strict CK-vs-PyTorch parity gating.
# You can override with: export V7_PARITY_CANARY_DATA=/path/to/your_two_row_svg_file.txt
CANARY_SRC="${V7_PARITY_CANARY_DATA:-$ROOT/version/v7/data/svg_assets_train.txt}"
if [ ! -s "$CANARY_SRC" ]; then
echo "[WARN] Missing default canary source: $CANARY_SRC"
echo "[WARN] Falling back to DATA_TXT for canaries: $DATA_TXT"
CANARY_SRC="$DATA_TXT"
fi
echo "[INFO] parity canary source: $CANARY_SRC"
sed -n '1p' "$CANARY_SRC" > "$PARITY_ROOT/svg_row1.txt"
sed -n '2p' "$CANARY_SRC" > "$PARITY_ROOT/svg_row2.txt"
if [ ! -s "$PARITY_ROOT/svg_row1.txt" ] || [ ! -s "$PARITY_ROOT/svg_row2.txt" ]; then
echo "[FAIL] Need at least 2 non-empty rows in canary source: $CANARY_SRC"
echo "[FAIL] Re-run Step 0A and Step 0.5/0.55/0.56, then retry Step 3.1."
exit 1
else
for idx in 1 2; do
ROW_FILE="$PARITY_ROOT/svg_row${idx}.txt"
ROW_RUN="$RUN/parity_svg_row${idx}"
rm -rf "$ROW_RUN"
.venv/bin/python version/v7/scripts/train_data_pipeline_v7.py \
--run "$ROW_RUN" \
--init-if-missing \
--init xavier_uniform \
--template qwen3 \
--tokenizer ascii_bpe \
--require-svg-rows \
--strict-data-gates \
--data "$ROW_FILE" \
--vocab-size 2048 --bpe-vocab-size 2048 \
--layers 4 --embed-dim 96 --hidden-dim 192 --num-heads 8 --num-kv-heads 4 \
--epochs 10 --seq-len 64 --total-tokens 12288 \
--grad-accum 1 --lr 3e-4 --max-grad-norm 1.0 --seed 42 \
--train-driver ck_run \
--with-torch-ref \
--no-post-train-eval \
--no-open-visualizer \
--json-out "$ROW_RUN/parity_pipeline.json"
done
fi
Step 3.1 is a numerical parity gate. It intentionally disables post-train SVG quality eval so generation-quality noise does not mask CK-vs-PyTorch parity status. With --no-post-train-eval, strict output-quality SVG gating is skipped by design for this canary stage. The canary command uses --seq-len 64 on purpose so single-row SVG probes do not trigger wrap-heavy false drift.
Gate check: compare CK vs PyTorch loss curves and enforce thresholds.
python3 - <<'PY'
import json
import os
from pathlib import Path
from statistics import mean

# TH = threshold
TH_MAX = 1e-4
TH_MEAN = 5e-5
TH_PARAM = 1e-4

run_env = os.environ.get("RUN", "").strip()
if not run_env:
    print("[FAIL] RUN env var is empty. Example: export RUN=$HOME/.cache/ck-engine-v7/models/train/my_run")
    raise SystemExit(1)
ROOT = Path(run_env)
ok = True
for idx in (1, 2):
    run_dir = ROOT / f"parity_svg_row{idx}" / ".ck_pipeline"
    work_dirs = sorted([p for p in run_dir.glob("ascii_bpe_*") if p.is_dir()])
    if not work_dirs:
        print(f"[FAIL] row{idx}: missing {run_dir}/ascii_bpe_*")
        ok = False
        continue
    w = work_dirs[-1]
    ck = json.loads((w / "train_ck.json").read_text())
    pt = json.loads((w / "train_torch_ref.json").read_text())
    c = [float(x["loss_ck"]) for x in ck.get("loss_curve", [])]
    t = [float(x["loss"]) for x in pt.get("loss_curve", [])]
    n = min(len(c), len(t))
    if n == 0:
        print(f"[FAIL] row{idx}: empty loss curves")
        ok = False
        continue
    diffs = [abs(c[i] - t[i]) for i in range(n)]
    max_abs = max(diffs)
    mean_abs = mean(diffs)
    final_param = float(ck.get("final_param_max_abs_diff", 1.0))
    passed = max_abs <= TH_MAX and mean_abs <= TH_MEAN and final_param <= TH_PARAM
    print(f"[row{idx}] max_abs={max_abs:.6e} mean_abs={mean_abs:.6e} final_param={final_param:.6e} pass={passed}")
    ok = ok and passed
print("PARITY_GATE=PASS" if ok else "PARITY_GATE=FAIL")
PY
Go/No-go: Continue CK-only training only after PARITY_GATE=PASS. If it fails, keep --with-torch-ref enabled and debug before scaling model size, epochs, or data.
Canonical operator path: use $RUN/ir_report.html as the single dashboard. Parity canaries run under $RUN/parity_svg_row1 and $RUN/parity_svg_row2 only for isolated diagnostics; Step 3.2 writes aggregated parity artifacts back into $RUN.
Canary visibility: regenerate $RUN/ir_report.html and open Training -> Train Dashboard / Training -> CK vs PyTorch to see a canary summary (row1/row2 pass/fail) without opening sub-run reports.
Step 3.2: Automated Training Parity Regimen (Recommended)
Run the full staged regimen (first/second-step parity, backend xray attribution, grad-accum sweeps, multi-epoch stability, replay/stitch checks) and save one operator table artifact.
cd "$ROOT"
.venv/bin/python version/v7/scripts/run_training_parity_regimen_v7.py \
  --run-dir "$RUN"
Outputs: $RUN/training_parity_regimen_latest.json, $RUN/training_parity_regimen_latest.md, $RUN/regimen_backend_xray.json, and $RUN/training_parity_regimen_logs/.
Operator gate mapping: this step supplies the D1/E1/F1 pass evidence used in the checklist GO criteria.
Backend xray adds per-backend first-step deltas (rmsnorm/swiglu/loss swap table), suspected divergence source, and reference forward/backward formulas so operators can inspect exact math paths without scrolling large logs.
Auto-skip behavior: if previous regimen passed and runtime/codegen fingerprint is unchanged, this script exits quickly with a reused PASS state. Use --force to rerun anyway.
ck_run operator options: training commands can now suggest, run, or require this regimen.
# Non-blocking recommendation only (default)
python3 version/v7/scripts/ck_run_v7.py train --run "$RUN" --parity-regimen suggest
# Run regimen after train, but continue even if it fails
python3 version/v7/scripts/ck_run_v7.py train --run "$RUN" --parity-regimen run
# Run regimen after train and fail command if regimen fails
python3 version/v7/scripts/ck_run_v7.py train --run "$RUN" --parity-regimen require
Use --parity-regimen off only for deliberate fast-debug loops.
Step 3.3: Nightly Core Stabilization Matrix (1-4 Layers)
Run one nightly matrix that combines strict tokenizer gates and parity gates:
- Tokenizer gates: ascii_bpe and bpe with strict encode/decode roundtrip exactness.
- Parity matrix: layers 1, 2, 3, 4; variable token budgets; grad-accum sweep 2, 4, 8.
- Replay/runtime checks: determinism + accum snapshot checks included by default.
cd "$ROOT"
make v7-stabilization-nightly \
  V7_STABILIZATION_RUN_ROOT="$RUN/nightly_stabilization" \
  V7_STABILIZATION_DATA="$DATA_TXT"
Optional main-run validation: add V7_STABILIZATION_MAIN_RUN_DIR="$RUN" to include your current run-dir weights as an extra parity case.
Outputs: training_stabilization_scorecard_latest.json, training_stabilization_scorecard_latest.md, and append-only training_stabilization_history.jsonl.
Tune matrix size with: V7_STABILIZATION_LAYERS, V7_STABILIZATION_TOKEN_BUDGETS, V7_STABILIZATION_GRAD_ACCUM_SWEEP.
Step 3.7: Run Full Training Pipeline (After Parity Gates)
Default Step 3.7 below is a functional end-to-end run (parity-first, quality gate non-blocking) so operators can complete the pipeline in one pass.
For strict output-quality enforcement (valid_svg_rate gate), use the strict command in Step 9 after data scaling/SFT expansion.
Stage Pass Criteria (SVG Specialist Model)
Before advancing to the next training stage, verify these minimum bars. Check manually on 5–10 holdout samples using the tag sequences below.
| Stage | Pass condition | Quick check command |
|---|---|---|
| Stage A primitives | Loss converges. `[circle][palette:cool]` → valid SVG containing `<circle`. Output parses without error. | `echo "[circle][palette:cool][style:minimal]" \| ./build/ck-cli-v7 --model $CKPT` |
| Stage B charts/infographic | `[bar-chart][bars:5][ascending]` → bars are actually ascending in output. `[palette:warm]` → output uses warm-family hex colors, not random CSS names. | `echo "[bar-chart][bars:5][ascending][palette:warm][axes]" \| ./build/ck-cli-v7 --model $CKPT` |
| SFT enriched | `[infographic][palette:dark][labeled][complexity:moderate]` → labeled dashboard with dark background. Multiple elements visible and in-bounds. | `echo "[infographic][palette:dark][style:gradient][labeled][complexity:moderate]" \| ./build/ck-cli-v7 --model $CKPT` |
If Stage A fails the circle test, check that [circle] merged to a single BPE token (Step 0B). If Stage B bars are not ascending, the bar_chart generator's ascending constraint may not be wired — check spec_catalog_v1.json constraint field vs generator logic.
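To script the Step 0B check, you can look the tag up in the tokenizer vocabulary. This sketch assumes a HF-style tokenizer.json with a model.vocab map; adjust the lookup if the v7 tokenizer.json schema differs:

```python
import json

def tag_is_single_token(tokenizer_json_path: str, tag: str) -> bool:
    # A tag like "[circle]" only steers generation reliably when BPE
    # merged it into one vocabulary entry; a multi-token split is the
    # usual culprit when Stage A fails the circle test.
    spec = json.loads(open(tokenizer_json_path).read())
    vocab = spec.get("model", {}).get("vocab", {})
    return tag in vocab
```

Usage would look like `tag_is_single_token(f"{RUN}/tokenizer.json", "[circle]")`; a False result points back at tokenizer training, not the model.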
Same pipeline, two train executors. Pick one:
C Variant (ck-cli)
Native training loop in build/ck-cli-v7 with live step/epoch loss + perplexity logs.
cd "$ROOT"
.venv/bin/python version/v7/scripts/train_data_pipeline_v7.py \
  --run "$RUN" \
  --init-if-missing \
  --init xavier_uniform \
  --template qwen3 \
  --curriculum-stage stage_a \
  --tokenizer ascii_bpe \
  --require-svg-rows \
  --roundtrip-max-lines 2048 --roundtrip-sample-limit 16 \
  --data "$DATA_TXT" \
  --vocab-size 320 --bpe-vocab-size 320 \
  --layers 24 --embed-dim 64 --hidden-dim 128 \
  --epochs 1 --seq-len 32 --total-tokens 841472 \
  --grad-accum 1 --lr 5e-4 --max-grad-norm 1.0 --seed 42 \
  --train-driver ck_cli --ck-cli-log-every 200 \
  --json-out version/v7/reports/v7_svg_assets_bpe_pipeline_l24_full_e1_latest.json
Python Variant (ck_run)
Python harness path (ck_run_v7.py train --backend ck) with parity/debug flexibility.
cd "$ROOT"
.venv/bin/python version/v7/scripts/train_data_pipeline_v7.py \
  --run "$RUN" \
  --init-if-missing \
  --init xavier_uniform \
  --template qwen3 \
  --curriculum-stage stage_a \
  --tokenizer ascii_bpe \
  --require-svg-rows \
  --roundtrip-max-lines 2048 --roundtrip-sample-limit 16 \
  --data "$DATA_TXT" \
  --vocab-size 320 --bpe-vocab-size 320 \
  --layers 24 --embed-dim 64 --hidden-dim 128 \
  --epochs 1 --seq-len 32 --total-tokens 841472 \
  --grad-accum 1 --lr 5e-4 --max-grad-norm 1.0 --seed 42 \
  --train-driver ck_run \
  --json-out version/v7/reports/v7_svg_assets_bpe_pipeline_l24_full_e1_latest.json
Step 4: Artifact Locations + Data Format
| Artifact | Path Pattern | Purpose |
|---|---|---|
| Training BPE JSON | $RUN/.ck_pipeline/bpe_<ts>/tokenizer.json | Tokenizer produced by BPE training pass. |
| Training BPE binary | $RUN/.ck_pipeline/bpe_<ts>/bpe_bin/* | Binary artifacts consumed by CK true_bpe runtime. |
| Canonical run tokenizer | $RUN/tokenizer.json + $RUN/tokenizer_bin/* | Single source of truth copied into run dir for infer reuse. |
| Token stream file | $RUN/.ck_pipeline/bpe_<ts>/train_tokens.txt | One integer token id per line. Used by --train-token-file. |
| Dataset QC artifact | $RUN/dataset_qc.json | ASCII/SVG gate status, row/byte counts, dataset path metadata. |
| Dataset profile artifact | $RUN/dataset_profile.json | Length histogram, duplicate rows, top chars/tokens. |
| Tokenizer roundtrip artifact | $RUN/tokenizer_roundtrip.json | Encode/decode exactness + per-line sample browser rows. |
| Post-train eval artifact | $RUN/post_train_eval.json | Valid SVG rate, closure success, repetition/loop score. If the strict gate fails here, treat it as data/task quality (not CK-vs-PyTorch parity math); increase corpus coverage and add instruction-to-SVG SFT pairs. |
| Train report | .../train_ck.json + your --json-out file | Loss stats + pipeline metadata. |
| PyTorch parity report | .../train_torch_ref.json + $RUN/training_parity.json | Reference loss curve + stepwise parity diffs for CK-vs-PyTorch training validation. |
Training is next-token LM over the token stream in seq_len windows; it is not "one JSONL row = one sample" at runtime.
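That windowing can be sketched as follows (illustrative only; the actual slicing lives in the train driver):

```python
def lm_windows(token_ids: list[int], seq_len: int) -> list[tuple[list[int], list[int]]]:
    # Next-token LM: each window of seq_len inputs predicts the same
    # window shifted by one token. Rows in the source text do not map
    # 1:1 to samples; the stream is sliced continuously.
    samples = []
    for start in range(0, len(token_ids) - seq_len, seq_len):
        x = token_ids[start:start + seq_len]
        y = token_ids[start + 1:start + seq_len + 1]
        samples.append((x, y))
    return samples

ids = list(range(10))
assert lm_windows(ids, 4)[0] == ([0, 1, 2, 3], [1, 2, 3, 4])
```

This is why --total-tokens, not the row count, controls how much the model actually sees per epoch.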
Step 5: Promote Latest Checkpoint to Inference Weights
Training writes runtime checkpoints under $RUN/checkpoints, while inference consumes $RUN/weights.bump + $RUN/weights_manifest.json. Promote by stage/pass for deterministic analysis.
# Inspect stage history + pass numbers first
python3 version/v7/scripts/promote_latest_checkpoint_v7.py --run "$RUN" --list-runs
# Promote latest checkpoint (legacy default: max step)
python3 version/v7/scripts/promote_latest_checkpoint_v7.py --run "$RUN"
# Optional: preview selected checkpoint only
python3 version/v7/scripts/promote_latest_checkpoint_v7.py --run "$RUN" --dry-run
# Optional: promote a specific step
python3 version/v7/scripts/promote_latest_checkpoint_v7.py --run "$RUN" --step 1234
# Preferred: promote latest available pass for a stage (scales to any model)
python3 version/v7/scripts/promote_latest_checkpoint_v7.py --run "$RUN" --stage pretrain
python3 version/v7/scripts/promote_latest_checkpoint_v7.py --run "$RUN" --stage midtrain
python3 version/v7/scripts/promote_latest_checkpoint_v7.py --run "$RUN" --stage sft
# Optional: exact stage pass (when multiple passes exist)
python3 version/v7/scripts/promote_latest_checkpoint_v7.py --run "$RUN" --stage sft --stage-pass 2
# Optional: promote exact run_id
python3 version/v7/scripts/promote_latest_checkpoint_v7.py --run "$RUN" --run-id ascii_bpe_YYYYmmdd_HHMMSS
Promotion is atomic and updates weights.bump + weights_manifest.json in $RUN. Use --stage/--stage-pass for apples-to-apples stage comparisons.
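The write-then-rename pattern behind that atomicity can be sketched as follows (a minimal illustration, not the promote_latest_checkpoint_v7.py implementation):

```python
import json
import os
import tempfile

def promote_atomically(run_dir: str, ckpt_bytes: bytes, manifest: dict) -> None:
    # Write each artifact to a temp file in the same directory, then
    # os.replace() it into place, so a concurrent inference build never
    # observes a half-written weights.bump or manifest.
    targets = (
        ("weights.bump", ckpt_bytes),
        ("weights_manifest.json", json.dumps(manifest).encode()),
    )
    for name, payload in targets:
        fd, tmp = tempfile.mkstemp(dir=run_dir)
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        os.replace(tmp, os.path.join(run_dir, name))  # atomic on POSIX
```

Keeping the temp file on the same filesystem as the destination is what makes the final rename atomic.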
Step 5.1: Stage Completion Test Loop (Promote -> Infer -> Refresh)
Run this loop after each completed stage to validate behavior before continuing.
# 0) List runs once (find stage_pass values)
python3 version/v7/scripts/promote_latest_checkpoint_v7.py --run "$RUN" --list-runs
# 1) Promote target stage checkpoint (latest pass for that stage)
python3 version/v7/scripts/promote_latest_checkpoint_v7.py --run "$RUN" --stage pretrain
# python3 version/v7/scripts/promote_latest_checkpoint_v7.py --run "$RUN" --stage midtrain
# python3 version/v7/scripts/promote_latest_checkpoint_v7.py --run "$RUN" --stage sft
# 2) Build inference runtime from promoted weights
.venv/bin/python version/v7/scripts/ck_run_v7.py run "$RUN" --generate-only --context-len 512
# 3) Probe behavior for this stage
.venv/bin/python scripts/ck_chat.py \
  --model-dir "$RUN/.ck_build" \
  --python-tokenizer --chat-template none \
  --prompt "[circle][palette:cool][style:minimal]<svg" \
  --max-tokens 96 --temperature 0 --top-p 1.0 --repeat-penalty 1.05 --repeat-last-n 256 \
  --stop-at-eos
# 4) Refresh visualizer to capture this state
python3 version/v7/tools/open_ir_visualizer.py --generate --run "$RUN" --html-only --strict-run-artifacts
Recommended probes by stage: pretrain: shape syntax + closure; midtrain: chart/layout structure; sft: strict tag-following + SVG validity.
Step 6: Build Inference from Trained Run
Generate runtime and chat from the same run
.venv/bin/python version/v7/scripts/ck_run_v7.py run "$RUN" --generate-only --context-len 128
.venv/bin/python scripts/ck_chat.py \
  --model-dir "$RUN/.ck_build" \
  --python-tokenizer --chat-template none \
  --show-token-ids --prompt "<svg" --max-tokens 80 --temperature 0.0
Use $RUN/.ck_build for chat so model + tokenizer artifacts come from the same run.
Step 7: Open IR Visualizer
Generate a fresh visualizer report for this run and open it locally.
.venv/bin/python version/v7/tools/open_ir_visualizer.py --generate --run "$RUN" --html-only
echo "[OK] Visualizer: $RUN/ir_report.html"
# Optional: open in browser
xdg-open "$RUN/ir_report.html" >/dev/null 2>&1 || true
The training pipeline already generates this by default; this command regenerates it on demand.
Regression gate: make visualizer validates run-dir path wiring + decode/profile artifact loading. make visualizer-full also validates train-runtime ASan artifact loading via a tiny fixture run.
Runbook E2E gate: make v7-runbook-e2e executes training-path Steps 1-7 (smoke mode) and fails fast on path/gate/artifact drift.
Regression memory: the visualizer can load version/v7/reports/REGRESSION_LEDGER.json to show known root causes, detection gates, and proof artifacts directly in Training run-gates.
| Checkpoint | Run Path | What to inspect |
|---|---|---|
| After Step 0.7 tokenization preview | $RUN | Architecture -> Tokenizer, Training -> Data Lab (dataset folder + tokenizer path + roundtrip samples + Tokenizer Internals merges/pieces) |
| After Step 3.7 training | $RUN | Training -> Data Lab + Train Dashboard (valid SVG/closure/loop metrics) |
| After Step 6 inference build | $RUN | kernel graph, runtime files, and artifact links |
Step 7.1: Live Training Monitor
Watch loss, grad norm, step count, and tok/s update in real-time while training is running — no need to regenerate the report or restart the training process. Open a second terminal alongside your training run and pick the path that suits your setup.
Path A — Zero-config (python3 -m http.server)
Works with any static HTTP server.
The visualizer auto-detects it is served over HTTP (not file://) and
starts polling the JSON artifact files that live next to ir_report.html.
No flags, no server code — just stdlib.
# In a second terminal while training is running:
python3 -m http.server 7700 -d "$RUN"
# Open in browser:
xdg-open http://localhost:7700/ir_report.html 2>/dev/null || \
  echo "Open: http://localhost:7700/ir_report.html"
$RUN is the training run directory (e.g.
~/.cache/ck-engine-v7/models/train/svg_l16_d128_h512_v1024_ctx512).
The ir_report.html must already exist in that directory (generated in Step 7 or by the
training pipeline's --generate-visualizer flag).
Path B — Custom live server (single snapshot endpoint)
Generates a fresh report and starts a dedicated HTTP server that exposes a
/api/snapshot endpoint — one request per poll cycle instead of N file fetches.
Injects window.CK_LIVE_MODE automatically so the visualizer uses the snapshot path.
# Generates report, starts server, opens browser:
.venv/bin/python version/v7/tools/open_ir_visualizer.py \
  --run "$RUN" \
  --html-only \
  --live \
  --live-port 7700 \
  --live-interval 4000
# Press Ctrl+C to stop the server when done.
--live-interval is in milliseconds (default 5000 = 5 s).
Reduce to 2000–3000 for faster feedback during early training.
What the live bar shows
A fixed status bar appears at the bottom of every page while live mode is active:
| Field | Source file | What to watch for |
|---|---|---|
step N | training_checkpoint_policy_latest.json | Steps should increment steadily; gaps indicate stalls or restarts. |
loss X.XXXX | training_loss_curve_latest.json | Healthy curve trends down. Spikes (>2× recent baseline) may signal LR or data issues. |
X tok/s | training_step_profile_latest.json | Drop in tok/s without loss spike → hardware throttle or memory pressure. |
↺ prefix | — | Appears when any artifact changed this cycle. Absent = no new steps yet. |
Which files are polled:
training_loss_curve_latest.json,
training_grad_norms_latest.json,
training_parity_latest.json,
training_step_profile_latest.json,
training_checkpoint_policy_latest.json.
All are <5 KB and written atomically after each step by the training harness.
The active training tab (Train Dashboard, Grad Health, Parity, etc.) re-renders automatically on each change.
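A change-detection cycle of this kind can be sketched with mtimes (illustrative only; the visualizer's own polling logic may differ):

```python
import os

def changed_artifacts(run_dir: str, names: list[str], last_mtimes: dict) -> list[str]:
    # Compare each artifact's mtime against the previous poll cycle and
    # return the ones that changed. A non-empty result corresponds to
    # the "changed this cycle" indicator in the live status bar.
    changed = []
    for name in names:
        path = os.path.join(run_dir, name)
        if not os.path.exists(path):
            continue  # artifact not written yet; skip silently
        mtime = os.path.getmtime(path)
        if last_mtimes.get(name) != mtime:
            changed.append(name)
            last_mtimes[name] = mtime
    return changed
```

Because the harness writes each file atomically, a poll never needs to guard against reading a half-written JSON.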
Stage detection: the active curriculum stage (active_stage,
curriculum_stage) is embedded from the last report generation.
It updates automatically if you regenerate the report mid-run:
python3 version/v7/tools/open_ir_visualizer.py --run "$RUN" --html-only
(no server restart needed for Path A — just refresh the browser after regeneration).
Step 8: Inference from GGUF (HF URLs)
The GGUF runtime path is documented and supported on Linux. macOS and Windows/WSL may work for local smoke tests, but parity and profiling workflows remain Linux-first and are not the supported operator baseline.
Step 8A: Plain Runtime Path
Use this when you only want the compiled runtime and local chat loop.
make v7-init
make v7-demo-runtime \
  V7_MODEL=hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf
Scope: runtime only. This path does not promise a complete operator artifact set.
Step 8B: Full Artifact Capture Path (Non-privileged)
Use this when you want the run directory, HTML report, and parent Run Hub in one pass.
make v7-init
make v7-capture-artifacts \
  V7_MODEL=hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf
Scope: runtime + report + hub. Deep profiling artifacts (perf, VTune, Advisor, flamegraphs) remain separate Linux-only steps and may require elevated privileges or host tooling.
Step 8C: Full Profiling Dashboard Path
Use this on Linux when you want runtime, report, hub, and profiling artifacts refreshed together.
make v7-init
make v7-profile-dashboard \
  V7_MODEL=hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf
Scope: runtime + report + hub + profiling refresh. Profiling subtargets rely on Linux host tools; missing tools may cause portions of the capture to skip.
Use this path when you want inference-only runs without training a local model.
These raw CLI examples use the shell wrapper so a first-time user can be prompted to create .venv and install requirements-v7.txt if the repo-local environment is missing.
Gemma 3 270M
version/v7/scripts/cks-v7-run run \
  hf://unsloth/gemma-3-270m-it-GGUF/gemma-3-270m-it-Q5_K_M.gguf \
  --context-len 1024 --force-compile --force-convert --chat-template=none \
  --generate-visualizer
Qwen2 0.5B Instruct
version/v7/scripts/cks-v7-run run \
  hf://Qwen/Qwen2-0.5B-Instruct-GGUF/qwen2-0_5b-instruct-q4_k_m.gguf \
  --context-len 1024 --force-compile --force-convert \
  --generate-visualizer
Qwen3 0.6B
version/v7/scripts/cks-v7-run run \
  hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf \
  --context-len 1024 --force-compile --force-convert \
  --generate-visualizer
Qwen3.5 0.8B
python3 version/v7/scripts/ck_run_v7.py run \
  hf://unsloth/Qwen3.5-0.8B-GGUF/Qwen3.5-0.8B-Q4_K_M.gguf \
  --force-convert --force-compile \
  --context-len 1024
The canonical v7 bring-up path is the hf://... URI. That materializes the run under ${CK_CACHE_DIR:-$HOME/.cache/ck-engine-v7/models}/unsloth--Qwen3.5-0.8B-GGUF. Local GGUF paths are still supported when you intentionally want an offline or copied artifact.
If the first reply echoes <|im_start|>assistant or starts with <think>, the prompt is being fed with the wrong chat wrapper or stop markers. For Nanbeige, keep the default --chat-template auto, prefer --python-tokenizer on first bring-up, and do not force --chat-template none unless you are testing raw logits on purpose.
Nanbeige 4.1 3B
version/v7/scripts/cks-v7-run run \
  hf://mradermacher/Nanbeige4.1-3B-GGUF/Nanbeige4.1-3B.Q4_K_M.gguf \
  --context-len 1024 --force-compile --force-convert \
  --chat-template auto \
  --generate-visualizer
Current scope: documented inference/parity bring-up plus the supported v7 training families qwen2, qwen3, qwen35, gemma3, and nanbeige.
Step 8.1: Inference + IR Visualizer for GGUF Runs
Inference runs can come from either hf://... URIs or local GGUF paths.
HF-backed runs are stored under
${CK_CACHE_DIR:-$HOME/.cache/ck-engine-v7/models}/<org--repo>.
If you used --generate-visualizer in Step 8, ir_report.html is already in the same folder.
Otherwise generate visualizer reports directly from those run directories:
CACHE_MODELS="${CK_CACHE_DIR:-$HOME/.cache/ck-engine-v7/models}"
GEMMA_RUN="$CACHE_MODELS/unsloth--gemma-3-270m-it-GGUF"
QWEN2_RUN="$CACHE_MODELS/Qwen--Qwen2-0.5B-Instruct-GGUF"
QWEN3_RUN="$CACHE_MODELS/Qwen--Qwen3-0.6B-GGUF"
QWEN35_RUN="$CACHE_MODELS/unsloth--Qwen3.5-0.8B-GGUF"
NANBEIGE_RUN="$CACHE_MODELS/mradermacher--Nanbeige4.1-3B-GGUF"
for RUN_DIR in "$GEMMA_RUN" "$QWEN2_RUN" "$QWEN3_RUN" "$QWEN35_RUN" "$NANBEIGE_RUN"; do
if [ ! -d "$RUN_DIR" ]; then
echo "[Missing] $RUN_DIR (run Step 8 command first)"
continue
fi
.venv/bin/python version/v7/tools/open_ir_visualizer.py --generate --run "$RUN_DIR" --html-only
echo "[OK] $RUN_DIR/ir_report.html"
done
# Optional: open generated reports
CACHE_MODELS="${CK_CACHE_DIR:-$HOME/.cache/ck-engine-v7/models}"
xdg-open "$CACHE_MODELS/unsloth--gemma-3-270m-it-GGUF/ir_report.html" >/dev/null 2>&1 || true
xdg-open "$CACHE_MODELS/Qwen--Qwen2-0.5B-Instruct-GGUF/ir_report.html" >/dev/null 2>&1 || true
xdg-open "$CACHE_MODELS/Qwen--Qwen3-0.6B-GGUF/ir_report.html" >/dev/null 2>&1 || true
xdg-open "$CACHE_MODELS/unsloth--Qwen3.5-0.8B-GGUF/ir_report.html" >/dev/null 2>&1 || true
xdg-open "$CACHE_MODELS/mradermacher--Nanbeige4.1-3B-GGUF/ir_report.html" >/dev/null 2>&1 || true
Report files are written to each run root as ir_report.html (not inside .ck_build).
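The URI-to-directory mapping used above can be sketched as a small helper (inferred from the examples in this runbook; hf_run_dir is a hypothetical name, not a shipped function):

```python
def hf_run_dir(cache_models: str, uri: str) -> str:
    # Maps hf://<org>/<repo>/<file>.gguf to the run root
    # <cache_models>/<org>--<repo>, matching the layout shown above.
    assert uri.startswith("hf://"), "expected an hf:// model URI"
    org, repo = uri[len("hf://"):].split("/")[:2]
    return f"{cache_models}/{org}--{repo}"

assert hf_run_dir("/c", "hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf") \
    == "/c/Qwen--Qwen3-0.6B-GGUF"
```

Handy when scripting over many run roots instead of hard-coding each directory name.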
Step 9: Quality Improvement Phases (SVG)
Use this phased path for small custom datasets. The goal is not "ChatGPT quality"; the goal is stable, valid SVG generation that improves step-by-step.
| Phase | Goal | Pass Criteria |
|---|---|---|
| Phase 1 | Get baseline working | Roundtrip passes, loss drops, and at least some valid SVG samples are generated |
| Phase 2 | Improve structure stability | Higher valid-SVG rate and fewer broken tails in longer generations |
| Phase 3 | Generalize (not memorize only) | Cleaner outputs on holdout prompts/data and reduced exact-copy rate |
Phase 1: Make It Work (Baseline)
cd "$ROOT"
.venv/bin/python version/v7/scripts/train_data_pipeline_v7.py \
  --run "$RUN" \
  --init-if-missing \
  --init xavier_uniform \
  --template qwen3 \
  --curriculum-stage stage_a \
  --tokenizer ascii_bpe \
  --require-svg-rows \
  --strict-data-gates --min-valid-svg-rate 0.70 \
  --roundtrip-max-lines 2048 --roundtrip-sample-limit 16 \
  --data "$DATA_TXT" \
  --vocab-size 320 --bpe-vocab-size 320 \
  --layers 24 --embed-dim 64 --hidden-dim 128 \
  --epochs 1 --seq-len 32 --total-tokens 841472 \
  --grad-accum 1 --lr 5e-4 --max-grad-norm 1.0 --seed 42 \
  --train-driver ck_cli --ck-cli-log-every 200 \
  --json-out version/v7/reports/v7_svg_assets_bpe_pipeline_l24_full_e1_latest.json
.venv/bin/python version/v7/scripts/test_ascii_bpe_roundtrip_v7.py \
  --run "$RUN" \
  --dataset "$DATA_TXT" \
  --require-ascii
If this phase fails, do not tune hyperparameters yet. First fix data cleanliness and tokenizer roundtrip.
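A quick way to estimate the "some valid SVG samples" bar outside the pipeline is to parse each generated row as XML. This is a rough stand-in for the pipeline's own valid_svg_rate metric, not its exact definition:

```python
import xml.etree.ElementTree as ET

def valid_svg_rate(lines: list[str]) -> float:
    # Counts rows that parse as well-formed XML whose root tag is <svg>.
    ok = 0
    for line in lines:
        try:
            root = ET.fromstring(line)
            if root.tag.endswith("svg"):  # tolerate a namespaced root tag
                ok += 1
        except ET.ParseError:
            pass
    return ok / max(len(lines), 1)

assert valid_svg_rate(['<svg><circle r="1"/></svg>', '<svg><rect']) == 0.5
```

If this rate is near zero while loss still drops, suspect tokenizer or data issues before touching hyperparameters.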
Phase 2: Expand Data + Continue From Latest Checkpoint (same RUN)
If Phase 1 converges but outputs are still brittle, expand corpus coverage and continue training from your latest checkpoint in the same run directory.
# 1) Expand dataset (Stage A data + richer docs SVG corpus), keep ASCII + one-row-per-svg.
STAGE_B_SOURCE="$DATA_DIR/svg_assets_docs_ascii.txt"
RESUME_DATA="$RUN/svg_resume_stageb_mix.txt"
if [ ! -s "$STAGE_B_SOURCE" ]; then
echo "[ERROR] Missing $STAGE_B_SOURCE"
echo "Run Step 0.55 first to build the larger docs corpus."
else
{ cat "$DATA_TXT"; cat "$STAGE_B_SOURCE"; } \
| sed '/^[[:space:]]*$/d' \
| awk 'length($0) <= 4096' \
| sort -u > "$RESUME_DATA"
python3 version/v7/scripts/prepare_ascii_dataset_v7.py \
--input "$RESUME_DATA" \
--output "$RESUME_DATA" \
--input-format text \
--ascii-map-common \
--ascii-mode xml_escape \
--svg-only
DATA_TXT="$RESUME_DATA"
fi
# 2) Continue from latest checkpoint in SAME run (no re-init).
.venv/bin/python version/v7/scripts/train_data_pipeline_v7.py \
--run "$RUN" \
--resume-latest-checkpoint \
--template qwen3 \
--curriculum-stage stage_b \
--tokenizer ascii_bpe \
--require-svg-rows \
--strict-data-gates \
--roundtrip-max-lines 2048 --roundtrip-sample-limit 16 \
--data "$DATA_TXT" \
--vocab-size 320 --bpe-vocab-size 320 \
--layers 24 --embed-dim 64 --hidden-dim 128 \
--epochs 1 --seq-len 128 --total-tokens 1200000 \
--grad-accum 4 --lr 2e-4 --max-grad-norm 1.0 --seed 42 \
--train-driver ck_cli --ck-cli-log-every 200 \
--json-out version/v7/reports/v7_svg_assets_bpe_pipeline_l24_resume_stageb_latest.json
Keep vocab/dims aligned with the existing run when resuming. If you want a larger vocab (for example 640 instead of 320), start a fresh run directory and re-init.
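The alignment rule can be made mechanical with a small compatibility check (field names here are illustrative, not the exact run-config schema):

```python
def resume_compatible(run_cfg: dict, new_cfg: dict) -> bool:
    # Shape-defining fields must match for existing checkpoints to load;
    # anything else (lr, epochs, seq_len) can change between passes.
    keys = ("vocab_size", "embed_dim", "hidden_dim", "layers",
            "num_heads", "num_kv_heads")
    return all(run_cfg.get(k) == new_cfg.get(k) for k in keys)

old = {"vocab_size": 320, "embed_dim": 64, "hidden_dim": 128, "layers": 24}
assert resume_compatible(old, dict(old))
assert not resume_compatible(old, {**old, "vocab_size": 640})
```

Run a check like this before --resume-latest-checkpoint; a vocab bump (320 to 640) means a fresh run directory and re-init, as noted above.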
Phase 3: Add Data + Check Memorization
More data helps only when it stays clean and you measure train-vs-holdout behavior.
# Build a simple 90/10 line split for memorization checks
SHUF="$RUN/svg_all_shuf.txt"
TRAIN_TXT="$RUN/svg_train_90.txt"
HOLDOUT_TXT="$RUN/svg_holdout_10.txt"
shuf "$DATA_TXT" > "$SHUF"
TOTAL=$(wc -l < "$SHUF")
HOLDOUT=$(( TOTAL / 10 ))
TRAIN=$(( TOTAL - HOLDOUT ))
head -n "$TRAIN" "$SHUF" > "$TRAIN_TXT"
tail -n "$HOLDOUT" "$SHUF" > "$HOLDOUT_TXT"
echo "[split] train_lines=$(wc -l < "$TRAIN_TXT") holdout_lines=$(wc -l < "$HOLDOUT_TXT")"
echo "[next] retrain using --data \"$TRAIN_TXT\" and compare sample quality against holdout prompts"
If outputs are mostly exact copies of training lines, you are memorizing. If outputs stay valid and varied on holdout prompts, you are generalizing.
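The exact-copy rate mentioned above can be measured in a few lines (illustrative sketch; line-level containment only, no fuzzy matching):

```python
def exact_copy_rate(generated_lines: list[str], train_lines: list[str]) -> float:
    # Fraction of generated rows that appear verbatim in the training
    # data. High values on holdout prompts suggest memorization rather
    # than generalization.
    train_set = set(train_lines)
    hits = sum(1 for g in generated_lines if g in train_set)
    return hits / max(len(generated_lines), 1)

assert exact_copy_rate(["a", "b", "c"], ["a", "x"]) == 1 / 3
```

Track this alongside valid-SVG rate: valid-and-novel on holdout prompts is the target, valid-but-copied is not.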
Step 10: Ablation Matrix Sweep (one command)
Run a ranked matrix over embed dim, dataset size, token vocab size, layer count, and GQA ratio. This is the fastest way to see which configs stay stable and improve quality.
AB_RUN_ROOT=$HOME/.cache/ck-engine-v7/models/train/v7_ablation_matrix_svg_ascii
python3 version/v7/scripts/run_ablation_matrix_v7.py \
  --run-root "$AB_RUN_ROOT" \
  --data "$DATA_TXT" \
  --tokenizer ascii_bpe \
  --require-svg-rows \
  --embed-dims 64,96 \
  --layers-list 12,24 \
  --bpe-vocab-sizes 320,640 \
  --dataset-fracs 0.5,1.0 \
  --total-tokens-list 131072 \
  --num-heads 8 \
  --kv-divisors 1,2 \
  --seq-len 96 \
  --epochs 3 \
  --grad-accum 4 \
  --lr 2e-4 \
  --max-grad-norm 1.0 \
  --seed 42
Outputs: $AB_RUN_ROOT/ablation_results_ranked.json, $AB_RUN_ROOT/ablation_results.csv, $AB_RUN_ROOT/ablation_summary.md.
# Optional: dry-run preview (no training execution)
python3 version/v7/scripts/run_ablation_matrix_v7.py \
  --run-root "$AB_RUN_ROOT" \
  --data "$DATA_TXT" \
  --dry-run --max-runs 2
CLI Options Cheat Sheet
List all options
python3 version/v7/scripts/ck_run_v7.py --help
python3 version/v7/scripts/ck_run_v7.py run --help
python3 version/v7/scripts/ck_run_v7.py train --help
python3 version/v7/scripts/train_data_pipeline_v7.py --help
python3 version/v7/scripts/promote_latest_checkpoint_v7.py --help
python3 version/v7/scripts/run_ablation_matrix_v7.py --help
python3 version/v7/tools/open_ir_visualizer.py --help
python3 scripts/ck_chat.py --help
Most-used flags
| Command | Useful flags |
|---|---|
| ck_run_v7.py run | --context-len, --force-compile, --force-convert, --chat-template, --generate-only, --generate-visualizer |
| train_data_pipeline_v7.py | --template qwen2\|qwen3\|qwen35\|gemma3\|nanbeige\|llama, --curriculum-stage stage_a\|stage_b, --tokenizer bpe\|ascii_bpe, --require-svg-rows, --require-ascii-data, --bpe-vocab-size, --seq-len, --total-tokens, --lr, --train-driver ck_cli, --ck-cli-log-every, --prepare-only, --roundtrip-max-lines, --strict-data-gates, --min-valid-svg-rate, --post-train-eval |
| promote_latest_checkpoint_v7.py | --run, --list-runs, --stage, --stage-pass, --run-id, --dry-run, --step |
| run_ablation_matrix_v7.py | --embed-dims, --layers-list, --bpe-vocab-sizes, --dataset-fracs, --kv-divisors, --resume |
| ck_chat.py | --python-tokenizer, --show-token-ids, --show-token-pieces, --ascii-display, --temperature |
Human-Readable Output Check
- Human-readable output: run chat without --show-token-ids.
- Debug token bytes/pieces: add --show-token-ids --show-token-pieces --ascii-display.
- Tokenizer source-of-truth check: startup log should include Loaded CK true_bpe tokenizer ....
.venv/bin/python scripts/ck_chat.py \
  --model-dir "$RUN/.ck_build" \
  --python-tokenizer --chat-template none \
  --prompt "<svg" --max-tokens 120 --temperature 0.0