v8 Inference Runbook
This is the operator runbook for the current v8 inference lane:
promoted text-family bring-up through the v8 runner surface plus the validated
Qwen3-VL-8B-Instruct multimodal path with the matching
mmproj-Qwen3VL-8B-Instruct encoder projection file.
v8 now has a credible inference surface for promoted text-family bring-up, and the tested Qwen3-VL pair runs encoder -> bridge -> decoder end to end, producing coherent captions instead of stop-marker or prompt-shell garbage.
What This Runbook Covers
Two things only: the promoted text-family v8 bring-up commands and the one multimodal family that is validated today.
It does not claim that unrelated multimodal families are ready in v8.
Validated Pair
- Decoder GGUF: `hf://Qwen/Qwen3-VL-8B-Instruct-GGUF/Qwen3VL-8B-Instruct-Q4_K_M.gguf`
- mmproj GGUF example path: `./mmproj-Qwen3VL-8B-Instruct-Q8_0.gguf`
Use a matching decoder/mmproj pair. If those drift apart, the bridge result is not meaningful.
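Before wiring a pair together, you can at least confirm that both files carry the GGUF magic bytes. This is a minimal sketch; `check_gguf` is a hypothetical helper name, not part of the v8 tooling, and it only proves the files are GGUF at all, not that the decoder/mmproj pair actually matches.

```shell
# Sketch: every GGUF file starts with the 4-byte magic "GGUF".
# check_gguf is a hypothetical helper, not part of the v8 tooling.
check_gguf() {
  magic=$(head -c 4 "$1" 2>/dev/null)
  if [ "$magic" = "GGUF" ]; then
    echo "gguf-ok: $1"
  else
    echo "not-gguf: $1" >&2
    return 1
  fi
}

# Usage against your local pair, e.g.:
#   check_gguf ./mmproj-Qwen3VL-8B-Instruct-Q8_0.gguf
```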
v8 is the inference bring-up lane. Training workflows remain in v7.
How The Vision Side Works
If you want the architecture view instead of the operator commands, read v8 Vision Encoder Architecture. That page explains how the encoder graph is derived from GGUF + template + lowering and how the bridge hands the prefix into the decoder.
For parity-specific investigation, see Vision Encoder Parity.
Prerequisites
Supported Host
- Linux host with a working C toolchain and Python 3.
- Repo-local `.venv` with `requirements-v8.txt`, or let `cks-v8-run` bootstrap it interactively.
- A local Qwen3-VL mmproj file available at a known path.

```shell
python3 -m venv .venv
. .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements-v8.txt
```
Canonical Text Bring-Up
Use this path when you want inference-only runs without training a local model.
These raw CLI examples use the shell wrapper so that a first-time user can be prompted to create `.venv` and install `requirements-v8.txt` if the repo-local environment is missing.
Gemma 3 270M
```shell
version/v8/scripts/cks-v8-run run \
  hf://unsloth/gemma-3-270m-it-GGUF/gemma-3-270m-it-Q5_K_M.gguf \
  --context-len 1024 --force-compile --force-convert --chat-template=auto \
  --generate-visualizer
```
Qwen2 0.5B Instruct
```shell
version/v8/scripts/cks-v8-run run \
  hf://Qwen/Qwen2-0.5B-Instruct-GGUF/qwen2-0_5b-instruct-q4_k_m.gguf \
  --context-len 1024 --force-compile --force-convert \
  --generate-visualizer
```
Qwen3 0.6B
```shell
version/v8/scripts/cks-v8-run run \
  hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf \
  --context-len 1024 \
  --force-convert --force-compile \
  --generate-visualizer
```
Qwen3.5 0.8B
```shell
python3 version/v8/scripts/ck_run_v8.py run \
  hf://unsloth/Qwen3.5-0.8B-GGUF/Qwen3.5-0.8B-Q4_K_M.gguf \
  --force-convert --force-compile \
  --context-len 1024
```
The canonical v8 bring-up path is the `hf://...` URI. That materializes the run under `${CK_CACHE_DIR:-$HOME/.cache/ck-engine-v8/models}/unsloth--Qwen3.5-0.8B-GGUF`. Local GGUF paths are still supported when you intentionally want an offline or copied artifact.
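Because the cache root uses standard shell default expansion, exporting `CK_CACHE_DIR` before a run redirects all large artifacts to another volume. A small sketch of the expansion rule (the `/mnt/scratch` path is illustrative):

```shell
# ${CK_CACHE_DIR:-default} expands to $CK_CACHE_DIR when it is set and
# non-empty, otherwise to the default under $HOME.
unset CK_CACHE_DIR
echo "${CK_CACHE_DIR:-$HOME/.cache/ck-engine-v8/models}"  # -> $HOME/.cache/ck-engine-v8/models

CK_CACHE_DIR=/mnt/scratch/ck-models
echo "${CK_CACHE_DIR:-$HOME/.cache/ck-engine-v8/models}"  # -> /mnt/scratch/ck-models
```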
If the first reply echoes `<|im_start|>assistant` or starts with `<think>`, the prompt is being fed with the wrong chat wrapper or stop markers. For Nanbeige, keep the default `--chat-template auto`, prefer `--python-tokenizer` on first bring-up, and do not force `--chat-template none` unless you are testing raw logits on purpose.
Nanbeige 4.1 3B
```shell
version/v8/scripts/cks-v8-run run \
  hf://mradermacher/Nanbeige4.1-3B-GGUF/Nanbeige4.1-3B.Q4_K_M.gguf \
  --context-len 1024 --force-compile --force-convert \
  --chat-template auto \
  --generate-visualizer
```
Current scope: these are the validated v8 text-family operator surfaces. The multimodal Qwen3-VL path is separate and remains the only promoted vision family today.
Text-Family Notes
- Gemma 3: use `--chat-template auto` for normal instruction/chat runs. `--chat-template none` is raw continuation mode now and needs `--allow-raw-prompt` if you intentionally want it.
- Qwen2 / Qwen3 / Qwen3.5: the `v8` runner now reproduces the same public command shapes as `v7`, and clean short-prompt smokes succeed on the promoted examples.
- Nanbeige / llama-family symptom: if the first reply echoes `<|im_start|>assistant` or starts with `<think>`, keep `--chat-template auto`, do not force `none`, and treat it as a prompt-wrapper/chat-contract symptom rather than the expected reply.
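The prompt-shell symptom above is easy to detect mechanically: a reply that leads with chat-control markers should be treated as a wrapper bug, not as model output. A hedged sketch; the `looks_like_prompt_shell` helper and its marker list are our own (mirroring the symptoms named above), not part of the v8 tooling.

```shell
# Hypothetical helper: flag replies that begin with chat-control markers
# instead of natural language. The marker list is not exhaustive.
looks_like_prompt_shell() {
  case "$1" in
    "<|im_start|>"*|"<think>"*|"/no_think"*) return 0 ;;  # symptom present
    *) return 1 ;;                                        # looks like a real reply
  esac
}

reply='<|im_start|>assistant'
if looks_like_prompt_shell "$reply"; then
  echo "prompt-wrapper symptom: rerun with --chat-template auto"
fi
```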
Canonical Qwen3-VL Vision Run
Real Image Bring-Up
Replace the `--mmproj` value below if your local mmproj file lives somewhere else.
```shell
version/v8/scripts/cks-v8-run run \
  hf://Qwen/Qwen3-VL-8B-Instruct-GGUF/Qwen3VL-8B-Instruct-Q4_K_M.gguf \
  --mmproj hf://Qwen/Qwen3-VL-8B-Instruct-GGUF/mmproj-Qwen3VL-8B-Instruct-Q8_0.gguf \
  --image-path version/v8/test_assets/v8_vision_doc_card_72.png \
  --prompt "Explain this image." \
  --context-len 1024 \
  --force-convert --force-compile \
  --thinking-mode suppressed
```
The first run converts the GGUF, lowers/codegens the runtimes, compiles them, runs the encoder bridge, then generates from the decoder.
For a cleaner caption sanity check, add `--max-tokens 48` so the current bridge generation loop does not ramble into a long repetition tail.
Synthetic Prefix Probe
For a seam-only smoke test without a real image file, keep the decoder pair but switch to a synthetic prefix:
```shell
version/v8/scripts/cks-v8-run run \
  hf://Qwen/Qwen3-VL-8B-Instruct-GGUF/Qwen3VL-8B-Instruct-Q4_K_M.gguf \
  --mmproj ./mmproj-Qwen3VL-8B-Instruct-Q8_0.gguf \
  --image-mode checker \
  --prompt "Describe the image." \
  --context-len 1024 \
  --force-convert --force-compile
```
What Success Looks Like
- The bridge logs reach `[v8-bridge] done report=...`.
- A bridge report exists under the model cache and contains `"status": "ok"`.
- The generated caption is coherent natural language, not raw stop tokens, `/no_think`, or an empty assistant turn.
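The report-side checks can be scripted against the bridge report directly. A minimal sketch; `bridge_ok` is a hypothetical helper, not part of the v8 tooling, and you pass it the `bridge_report.json` path from your own run.

```shell
# Hypothetical check: passes when the given bridge_report.json exists
# and records "status": "ok".
bridge_ok() {
  [ -f "$1" ] && grep -q '"status": "ok"' "$1"
}

# Usage after a run, e.g.:
#   bridge_ok "$HOME/.cache/ck-engine-v8/models/<model-dir>/multimodal_bridge/bridge_report.json"
```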
A healthy prompt shell uses the GGUF chat template in auto mode and preserves the vision markers around `<image_embeds>`.
Artifacts To Inspect
| Artifact | Purpose |
|---|---|
| `~/.cache/ck-engine-v8/models/.../multimodal_bridge/bridge_report.json` | Final bridge status, prompt accounting, prefix grid, and generated text. |
| `~/.cache/ck-engine-v8/models/.../multimodal_bridge/encoder/` | Encoder-side lowered IR, compiled runtime, and bridge artifacts. |
| `~/.cache/ck-engine-v8/models/.../multimodal_bridge/decoder/` | Decoder lowered IR, generated C, compiled shared library, and decode bridge artifacts. |
`./build/ck-cli-v8 --list`
Troubleshooting
- If you omit `--image-path`, the bridge uses a synthetic image path. That is useful for seam probes, not for validating real-image captioning.
- If the bridge report is missing, check disk space first. The first compile path writes large intermediate artifacts.
- If output starts echoing raw chat markers or thinking-control strings, keep `--chat-template auto` and do not force a manual template override on this model.
- If captions are incoherent again, verify that the decoder GGUF and mmproj GGUF are the matching Qwen3-VL pair.
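Since the first compile writes large intermediates, a quick free-space probe on the cache volume can save a confusing failed run. A sketch; the 10 GiB threshold is an illustrative guess, not a measured requirement of the v8 pipeline.

```shell
# Rough free-space probe for the cache volume before the first compile.
# The 10 GiB threshold below is illustrative, not a measured requirement.
cache="${CK_CACHE_DIR:-$HOME/.cache/ck-engine-v8/models}"
mkdir -p "$cache"
avail_kb=$(df -Pk "$cache" | awk 'NR==2 {print $4}')
if [ "$avail_kb" -lt $((10 * 1024 * 1024)) ]; then
  echo "warning: under 10 GiB free on $cache" >&2
fi
```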
Support Statement
The tested Qwen3-VL artifact pair works in `v8` for end-to-end multimodal inference bring-up. That does not automatically extend to unrelated model families or future multimodal templates that need kernels the engine does not have yet.