v7 SVG Dataset Generation Runbook
Operator-focused runbook to build clean SVG training corpora from this repo, split the progression into pretrain and midtrain stages, and hand off directly into v7 training.
Demo goal: start from repo SVG assets and reach coherent, closed SVG/infographic outputs that are visibly useful, then iterate with Stage A -> Stage B and parity checks.
What this page covers
Stage A (pretrain): syntax + closure stability on strict ASCII SVG rows.
Stage B (midtrain): richer composition from docs assets and continuation training from Stage A checkpoint.
Source of truth: docs/site/assets/*.svg + generated synthetic rows in version/v7/data.
Step 0: Paths and Environment
Use one canonical run path and data directory so artifacts and the IR visualizer remain aligned.
Split-aware workspace note: the current staged-dataset workflow should use a split-aware SVG workspace such as version/v7/data/spec04. This page prepares that workspace; the main v7 Runbook then stages it into $RUN/dataset/ with --dataset-workspace version/v7/data/spec04.
export ROOT=/home/antshiv/Workspace/C-Kernel-Engine
export RUN=$HOME/.cache/ck-engine-v7/models/train/v7_svg_dataset_run
export DATA_DIR=$ROOT/version/v7/data
export DATA_TXT=$DATA_DIR/svg_assets_train.txt
mkdir -p "$RUN" "$DATA_DIR"
echo "[OK] RUN=$RUN"
echo "[OK] DATA_DIR=$DATA_DIR"
echo "[OK] DATA_TXT=$DATA_TXT"
Optional one-time workspace scaffold if your split-aware workspace does not exist yet:
bash version/v7/scripts/init_data_workspace_v7.sh --spec spec04 --dataset-type svg
After scaffold + data prep, keep working in version/v7/data/spec04 until you are ready to stage a run-local copy into $RUN/dataset/.
Operator rule: version/v7/data/spec04 is a seed workspace, not the operator artifact home. Once staged, keep the working dataset, dataset_viewer.html, ir_report.html, checkpoints, and training/parity/perf outputs together under the same cache run directory.
Guardrail: keep RUN in $HOME/.cache/ck-engine-v7/models/train/... so python3 version/v7/tools/open_ir_hub.py --open can discover the run automatically.
Step 1: Baseline ASCII SVG Preparation (0.5)
Fast baseline path: keep one SVG per line, enforce ASCII bytes, and reject non-SVG rows.
# Fast path: bundled dataset
if [ ! -s "$DATA_TXT" ]; then
echo "[ERROR] Missing bundled dataset: $DATA_TXT"
exit 1
fi
python3 version/v7/scripts/prepare_ascii_dataset_v7.py \
--input "$DATA_TXT" \
--output "$DATA_TXT" \
--input-format text \
--ascii-map-common \
--ascii-mode xml_escape \
--svg-only
wc -l -c "$DATA_TXT"
head -n 3 "$DATA_TXT"
This is the minimum clean dataset gate for Stage A.
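The Stage A gate above enforces one SVG per line, ASCII-only bytes, and a closed root element. A minimal sketch of that row-level check (illustrative only; the real gate lives in prepare_ascii_dataset_v7.py and is stricter):

```python
def is_clean_svg_row(line: str) -> bool:
    # Sketch of the Stage A row gate: non-empty, pure ASCII,
    # and a single closed <svg>...</svg> element on one line.
    s = line.strip()
    if not s or not s.isascii():
        return False
    return s.startswith("<svg") and s.endswith("</svg>")

good = '<svg xmlns="http://www.w3.org/2000/svg"><rect x="1" y="1"/></svg>'
bad = '<svg>caf\u00e9</svg>'  # non-ASCII byte -> rejected
```

Rows failing any of these checks are dropped before they can poison the tokenizer or closure statistics.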
Step 2: Scale Corpus from Docs Assets (0.55)
Build larger corpora from docs/site/assets/*.svg and synthetic rows, then promote a strict ASCII merged file for training.
ASSET_UTF8="$DATA_DIR/svg_assets_docs_utf8.txt"
ASSET_UTF8_MANIFEST="$DATA_DIR/svg_assets_docs_utf8_manifest.json"
MAP_TSV="$DATA_DIR/svg_utf8_to_ascii_table.tsv"
MAP_JSON="$DATA_DIR/svg_utf8_to_ascii_table.json"
ASSET_ASCII="$DATA_DIR/svg_assets_docs_ascii.txt"
ASSET_MANIFEST="$DATA_DIR/svg_assets_docs_ascii_manifest.json"
SYN_PREFIX="svg_instruction_aug"
SYN_SVG="$DATA_DIR/${SYN_PREFIX}_svg_train.txt"
DATA_TXT_ASCII="$DATA_DIR/svg_assets_train_scaled_ascii.txt"
TMP_ASCII="$DATA_DIR/svg_assets_train_scaled_ascii.tmp.txt"
python3 version/v7/scripts/build_svg_corpus_from_assets_v7.py \
--assets-glob "$ROOT/docs/site/assets/*.svg" \
--output "$ASSET_UTF8" \
--manifest "$ASSET_UTF8_MANIFEST" \
--ascii-mode none \
--no-dedupe
python3 version/v7/scripts/generate_utf8_ascii_map_report_v7.py \
--input "$ASSET_UTF8" \
--tsv-out "$MAP_TSV" \
--json-out "$MAP_JSON"
python3 version/v7/scripts/build_svg_corpus_from_assets_v7.py \
--assets-glob "$ROOT/docs/site/assets/*.svg" \
--output "$ASSET_ASCII" \
--manifest "$ASSET_MANIFEST" \
--ascii-map-common \
--ascii-mode xml_escape \
--no-dedupe
python3 version/v7/scripts/generate_svg_instruction_dataset_v7.py \
--out-dir "$DATA_DIR" \
--prefix "$SYN_PREFIX" \
--num-samples 20000 \
--holdout-ratio 0.10 \
--seed 42
{ cat "$ASSET_ASCII"; cat "$SYN_SVG"; } \
| sed '/^[[:space:]]*$/d' \
| shuf > "$TMP_ASCII"
python3 version/v7/scripts/prepare_ascii_dataset_v7.py \
--input "$TMP_ASCII" \
--output "$DATA_TXT_ASCII" \
--input-format text \
--ascii-map-common \
--ascii-mode xml_escape \
--svg-only
export DATA_TXT="$DATA_TXT_ASCII"
echo "[OK] DATA_TXT=$DATA_TXT"
wc -l -c "$ASSET_UTF8" "$ASSET_ASCII" "$SYN_SVG" "$DATA_TXT"
One-command corpus builder (recommended): stitch existing scripts into a single run that emits both pretrain and midtrain corpora.
python3 version/v7/scripts/build_svg_pretrain_corpus_v7.py \
--out-dir "$DATA_DIR" \
--prefix svg_pretrain_pack \
--assets-glob "$ROOT/docs/site/assets/*.svg" \
--stage-a-samples 24000 \
--stage-b-samples 28000 \
--holdout-ratio 0.10 \
--seed 42
export DATA_TXT="$DATA_DIR/svg_pretrain_pack_stage_a_plus_bridge.txt"
export DATA_TXT_STAGE_B="$DATA_DIR/svg_pretrain_pack_stage_b.txt"
echo "[OK] Stage-A+bridge: $DATA_TXT"
echo "[OK] Stage-B: $DATA_TXT_STAGE_B"
wc -l "$DATA_TXT" "$DATA_TXT_STAGE_B"
Synthetic families are produced by generate_svg_instruction_dataset_v7.py (line, triangle, rounded_triangle, ellipse, polygon, polyline, path, arrow, double_arrow, bar_chart, comparison_table, text, rect_circle) and then combined with docs/site/assets/*.svg via existing ASCII + bridge scripts.
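To make the synthetic-family idea concrete, here is a hedged sketch of what one family (rect_circle-style) might emit as a single training row; the function name and parameter ranges are illustrative, not the actual logic of generate_svg_instruction_dataset_v7.py:

```python
import random

def make_rect_circle_row(rng: random.Random) -> str:
    # Hypothetical single-family generator: one strict-ASCII,
    # single-line, closed SVG per call, suitable as a corpus row.
    w, h = rng.randint(64, 256), rng.randint(64, 256)
    r = rng.randint(8, min(w, h) // 2)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{w}" height="{h}">'
        f'<rect x="0" y="0" width="{w}" height="{h}" fill="none" stroke="black"/>'
        f'<circle cx="{w // 2}" cy="{h // 2}" r="{r}" fill="gray"/></svg>'
    )

rng = random.Random(42)  # fixed seed keeps rows reproducible, like --seed 42
row = make_rect_circle_row(rng)
```

Each family varies geometry under a fixed seed, so reruns with the same seed reproduce the same corpus.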
Step 3: Stage-A Bridge Pack (0.56)
Inject a small syntax bridge so Stage A includes core Stage B primitives (g, defs, gradients, markers, transforms, filters).
STAGE_B_SOURCE="${ASSET_ASCII:-$DATA_DIR/svg_assets_docs_ascii.txt}"
BRIDGE_TXT="$DATA_DIR/svg_stage_a_bridge_small.txt"
BRIDGE_MANIFEST="$DATA_DIR/svg_stage_a_bridge_small_manifest.json"
STAGE_A_PLUS="$DATA_DIR/svg_stage_a_plus_bridge_small.txt"
python3 version/v7/scripts/build_stage_a_bridge_svg_v7.py \
--stage-a "$DATA_TXT" \
--stage-b "$STAGE_B_SOURCE" \
--out "$BRIDGE_TXT" \
--manifest "$BRIDGE_MANIFEST" \
--per-feature-cap 4 \
--max-total 48 \
--seed 42
{ cat "$DATA_TXT"; cat "$BRIDGE_TXT"; } \
| sed '/^[[:space:]]*$/d' \
| shuf > "$STAGE_A_PLUS"
python3 version/v7/scripts/prepare_ascii_dataset_v7.py \
--input "$STAGE_A_PLUS" \
--output "$STAGE_A_PLUS" \
--input-format text \
--ascii-map-common \
--ascii-mode xml_escape \
--svg-only
export DATA_TXT="$STAGE_A_PLUS"
echo "[OK] Stage-A dataset with bridge: $DATA_TXT"
wc -l "$DATA_TXT"
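The bridge builder's core idea is feature-capped sampling: pick a few Stage-B rows per syntax feature so Stage A has seen each primitive, with a global cap. A minimal sketch of that selection (the feature list and logic here are illustrative, not build_stage_a_bridge_svg_v7.py's actual implementation):

```python
import random

FEATURES = ("<g", "<defs", "linearGradient", "marker", "transform=", "filter")

def sample_bridge(rows, per_feature_cap=4, max_total=48, seed=42):
    # Shuffle deterministically, then accept a row only while every
    # feature it exhibits is still under its per-feature cap.
    rng = random.Random(seed)
    picked, counts = [], {f: 0 for f in FEATURES}
    for row in rng.sample(rows, len(rows)):
        feats = [f for f in FEATURES if f in row]
        if feats and all(counts[f] < per_feature_cap for f in feats):
            picked.append(row)
            for f in feats:
                counts[f] += 1
        if len(picked) >= max_total:
            break
    return picked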
Step 4: Dataset Validation and Tokenizer Gate
Run prepare-only tokenizer build and strict roundtrip gate on the same run path.
.venv/bin/python version/v7/scripts/train_data_pipeline_v7.py \
--run "$RUN" \
--init-if-missing \
--init xavier_uniform \
--template qwen3 \
--curriculum-stage stage_a \
--tokenizer ascii_bpe \
--require-svg-rows \
--strict-data-gates --min-valid-svg-rate 0.70 \
--roundtrip-max-lines 2048 --roundtrip-sample-limit 16 \
--data "$DATA_TXT" \
--vocab-size 320 --bpe-vocab-size 320 \
--layers 2 --embed-dim 64 --hidden-dim 128 \
--epochs 1 --seq-len 8 --total-tokens 64 \
--grad-accum 1 --lr 5e-4 --max-grad-norm 1.0 --seed 42 \
--prepare-only \
--json-out "$RUN/train_prepare_stage_a.json"
.venv/bin/python version/v7/scripts/test_ascii_bpe_roundtrip_v7.py \
--run "$RUN" \
--dataset "$DATA_TXT" \
--require-ascii
Step 5: Stage A Pretraining Run
Train Stage A from the current dataset and write run-scoped telemetry/artifacts.
.venv/bin/python version/v7/scripts/train_data_pipeline_v7.py \
--run "$RUN" \
--template qwen3 \
--curriculum-stage stage_a \
--tokenizer ascii_bpe \
--require-svg-rows \
--strict-data-gates --min-valid-svg-rate 0.70 \
--roundtrip-max-lines 2048 --roundtrip-sample-limit 16 \
--data "$DATA_TXT" \
--vocab-size 320 --bpe-vocab-size 320 \
--layers 24 --embed-dim 64 --hidden-dim 128 \
--epochs 1 --seq-len 128 --total-tokens 1200000 \
--grad-accum 4 --lr 2e-4 --max-grad-norm 1.0 --seed 42 \
--train-driver ck_cli --ck-cli-log-every 200 \
--json-out version/v7/reports/v7_svg_stage_a_latest.json
Step 6: Stage B Midtraining Resume Run
Continue from the latest Stage A checkpoint in the same run, using the richer docs SVG corpus.
STAGE_B_SOURCE="$DATA_DIR/svg_assets_docs_ascii.txt"
RESUME_DATA="$RUN/svg_resume_stageb_mix.txt"
if [ ! -s "$STAGE_B_SOURCE" ]; then
echo "[ERROR] Missing $STAGE_B_SOURCE"
echo "Run Step 2 first."
exit 1
fi
{ cat "$DATA_TXT"; cat "$STAGE_B_SOURCE"; } \
| sed '/^[[:space:]]*$/d' \
| awk 'length($0) <= 4096' \
| sort -u > "$RESUME_DATA"
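The shell mix above drops blank rows, filters over-long rows, and dedupes with sort -u. An equivalent sketch in Python, useful when scripting the same mix outside the shell (function name is illustrative):

```python
def build_resume_mix(stage_a_rows, stage_b_rows, max_len=4096):
    # Mirror of the shell pipeline: strip, drop blanks, drop rows
    # longer than max_len bytes of text, dedupe, and sort (sort -u).
    merged = (r.strip() for r in (*stage_a_rows, *stage_b_rows))
    return sorted({r for r in merged if r and len(r) <= max_len})
```

The length cap keeps pathological mega-rows from dominating seq-len-128 batches.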
python3 version/v7/scripts/prepare_ascii_dataset_v7.py \
--input "$RESUME_DATA" \
--output "$RESUME_DATA" \
--input-format text \
--ascii-map-common \
--ascii-mode xml_escape \
--svg-only
export DATA_TXT="$RESUME_DATA"
.venv/bin/python version/v7/scripts/train_data_pipeline_v7.py \
--run "$RUN" \
--resume-latest-checkpoint \
--template qwen3 \
--curriculum-stage stage_b \
--tokenizer ascii_bpe \
--require-svg-rows \
--strict-data-gates \
--roundtrip-max-lines 2048 --roundtrip-sample-limit 16 \
--data "$DATA_TXT" \
--vocab-size 320 --bpe-vocab-size 320 \
--layers 24 --embed-dim 64 --hidden-dim 128 \
--epochs 1 --seq-len 128 --total-tokens 1200000 \
--grad-accum 4 --lr 2e-4 --max-grad-norm 1.0 --seed 42 \
--train-driver ck_cli --ck-cli-log-every 200 \
--json-out version/v7/reports/v7_svg_stage_b_resume_latest.json
When resuming, keep model dims and vocab aligned to Stage A. Change dims/vocab only in a new run directory.
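A quick sanity check for the resume rule can be scripted before launching Stage B. This is a sketch under assumed config keys (layers, embed_dim, hidden_dim, vocab_size); the real checkpoint schema may differ:

```python
def check_resume_compat(stage_a_cfg: dict, stage_b_cfg: dict) -> list:
    # Return the list of architecture keys that diverge from the
    # Stage A checkpoint; resuming is only safe when this is empty.
    keys = ("layers", "embed_dim", "hidden_dim", "vocab_size")
    return [k for k in keys if stage_a_cfg.get(k) != stage_b_cfg.get(k)]
```

If any key diverges, start a fresh run directory instead of resuming.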
Step 7: Visualizer and Artifacts
Generate a run-scoped visualizer and inspect data lineage + stage progression.
.venv/bin/python version/v7/tools/open_ir_visualizer.py \
--generate \
--run "$RUN" \
--html-only \
--strict-run-artifacts \
--output "$RUN/ir_report.html"
echo "[OK] $RUN/ir_report.html"
xdg-open "$RUN/ir_report.html" >/dev/null 2>&1 || true
In the report, open Training -> Data Lab and check Stage Flow, Dataset Catalog, tokenizer roundtrip, and data artifact paths.
For the full end-to-end train + parity + inference runbook, continue in v7-runbook.html.
Step 8: Optional Autopilot Loop (overnight)
Use the autopilot script to continuously expand synthetic SVG data, resume from latest checkpoint, and emit per-iteration metrics for morning review.
export RUN="$HOME/.cache/ck-engine-v7/models/train/v7_svg_autopilot_seq128"
export HOURS=8
export SEQ_LEN=128
export TOTAL_TOKENS=1200000
export GRAD_ACCUM=8
bash version/v7/scripts/svg_autopilot_train_loop_v7.sh
# Inspect results
tail -n 80 "$RUN/autopilot/autopilot.log"
tail -n 20 "$RUN/autopilot/summary.jsonl"
Autopilot is an optimization workflow, not a parity replacement. Keep periodic CK-vs-PyTorch parity gates in your main runbook path.