v8 Inference Runbook
This is the operator runbook for the current v8 inference lane:
promoted text-family bring-up through the v8 runner surface plus the validated
Qwen3-VL-8B-Instruct multimodal path with the matching
mmproj-Qwen3VL-8B-Instruct encoder projection file.
v8 now has a credible inference surface for promoted text-family bring-up, and the tested Qwen3-VL pair runs encoder -> bridge -> decoder end to end, producing coherent captions instead of stop-marker or prompt-shell garbage.
What This Runbook Covers
Two things only: the promoted text-family v8 bring-up commands and the one multimodal family that is validated today.
It does not claim that unrelated multimodal families are ready in v8.
Validated Pair
- Decoder GGUF: `hf://Qwen/Qwen3-VL-8B-Instruct-GGUF/Qwen3VL-8B-Instruct-Q4_K_M.gguf`
- mmproj GGUF example path: `./mmproj-Qwen3VL-8B-Instruct-Q8_0.gguf`
Use a matching decoder/mmproj pair. If those drift apart, the bridge result is not meaningful.
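Before wiring a pair together, you can at least confirm that both files carry the GGUF magic bytes. This is a minimal sketch; `check_gguf` is a hypothetical helper name, not part of the v8 tooling, and it only proves the files are GGUF at all, not that the decoder/mmproj pair actually matches.

```shell
# Sketch: every GGUF file starts with the 4-byte magic "GGUF".
# check_gguf is a hypothetical helper, not part of the v8 tooling.
check_gguf() {
  magic=$(head -c 4 "$1" 2>/dev/null)
  if [ "$magic" = "GGUF" ]; then
    echo "gguf-ok: $1"
  else
    echo "not-gguf: $1" >&2
    return 1
  fi
}

# Usage against your local pair, e.g.:
#   check_gguf ./mmproj-Qwen3VL-8B-Instruct-Q8_0.gguf
```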
v8 is the inference bring-up lane. Training workflows remain in v7.
How The Vision Side Works
If you want the architecture view instead of the operator commands, read v8 Vision Encoder Architecture. That page explains how the encoder graph is derived from GGUF + template + lowering and how the bridge hands the prefix into the decoder.
For parity-specific investigation, see Vision Encoder Parity.
Prerequisites
Supported Host
- Linux host with a working C toolchain and Python 3.
- Repo-local `.venv` with `requirements-v8.txt`, or let `cks-v8-run` bootstrap it interactively.
- A local Qwen3-VL mmproj file available at a known path.

```shell
python3 -m venv .venv
. .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements-v8.txt
```
Canonical Text Bring-Up
Use this path when you want inference-only runs without training a local model.
These raw CLI examples use the shell wrapper so that a first-time user can be prompted to create `.venv` and install `requirements-v8.txt` if the repo-local environment is missing.
Gemma 3 270M
```shell
version/v8/scripts/cks-v8-run run \
  hf://unsloth/gemma-3-270m-it-GGUF/gemma-3-270m-it-Q5_K_M.gguf \
  --context-len 1024 --force-compile --force-convert --chat-template=auto \
  --generate-visualizer
```
Qwen2 0.5B Instruct
```shell
version/v8/scripts/cks-v8-run run \
  hf://Qwen/Qwen2-0.5B-Instruct-GGUF/qwen2-0_5b-instruct-q4_k_m.gguf \
  --context-len 1024 --force-compile --force-convert \
  --generate-visualizer
```
Qwen3 0.6B
```shell
version/v8/scripts/cks-v8-run run \
  hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf \
  --context-len 1024 \
  --force-convert --force-compile \
  --generate-visualizer
```
Qwen3.5 0.8B
```shell
python3 version/v8/scripts/ck_run_v8.py run \
  hf://unsloth/Qwen3.5-0.8B-GGUF/Qwen3.5-0.8B-Q4_K_M.gguf \
  --force-convert --force-compile \
  --context-len 1024
```
The canonical v8 bring-up path is the `hf://...` URI. That materializes the run under `${CK_CACHE_DIR:-$HOME/.cache/ck-engine-v8/models}/unsloth--Qwen3.5-0.8B-GGUF`. Local GGUF paths are still supported when you intentionally want an offline or copied artifact.
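Because the cache root uses standard shell default expansion, exporting `CK_CACHE_DIR` before a run redirects all large artifacts to another volume. A small sketch of the expansion rule (the `/mnt/scratch` path is illustrative):

```shell
# ${CK_CACHE_DIR:-default} expands to $CK_CACHE_DIR when it is set and
# non-empty, otherwise to the default under $HOME.
unset CK_CACHE_DIR
echo "${CK_CACHE_DIR:-$HOME/.cache/ck-engine-v8/models}"  # -> $HOME/.cache/ck-engine-v8/models

CK_CACHE_DIR=/mnt/scratch/ck-models
echo "${CK_CACHE_DIR:-$HOME/.cache/ck-engine-v8/models}"  # -> /mnt/scratch/ck-models
```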
If the first reply echoes `<|im_start|>assistant` or starts with `<think>`, the prompt is being fed with the wrong chat wrapper or stop markers. For Nanbeige, keep the default `--chat-template auto`, prefer `--python-tokenizer` on first bring-up, and do not force `--chat-template none` unless you are testing raw logits on purpose.
Nanbeige 4.1 3B
```shell
version/v8/scripts/cks-v8-run run \
  hf://mradermacher/Nanbeige4.1-3B-GGUF/Nanbeige4.1-3B.Q4_K_M.gguf \
  --context-len 1024 --force-compile --force-convert \
  --chat-template auto \
  --generate-visualizer
```
Current scope: these are the validated v8 text-family operator surfaces. The multimodal Qwen3-VL path is separate and remains the only promoted vision family today.
Text-Family Notes
- Gemma 3: use `--chat-template auto` for normal instruction/chat runs. `--chat-template none` is raw continuation mode now and needs `--allow-raw-prompt` if you intentionally want it.
- Qwen2 / Qwen3 / Qwen3.5: the `v8` runner now reproduces the same public command shapes as `v7`, and clean short-prompt smokes succeed on the promoted examples.
- Nanbeige / llama-family symptom: if the first reply echoes `<|im_start|>assistant` or starts with `<think>`, keep `--chat-template auto`, do not force `none`, and treat it as a prompt-wrapper/chat-contract symptom rather than the expected reply.
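The prompt-shell symptom above is easy to detect mechanically: a reply that leads with chat-control markers should be treated as a wrapper bug, not as model output. A hedged sketch; the `looks_like_prompt_shell` helper and its marker list are our own (mirroring the symptoms named above), not part of the v8 tooling.

```shell
# Hypothetical helper: flag replies that begin with chat-control markers
# instead of natural language. The marker list is not exhaustive.
looks_like_prompt_shell() {
  case "$1" in
    "<|im_start|>"*|"<think>"*|"/no_think"*) return 0 ;;  # symptom present
    *) return 1 ;;                                        # looks like a real reply
  esac
}

reply='<|im_start|>assistant'
if looks_like_prompt_shell "$reply"; then
  echo "prompt-wrapper symptom: rerun with --chat-template auto"
fi
```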
Canonical Qwen3-VL Vision Run
Real Image Bring-Up
Replace the `--mmproj` value below if your local mmproj file lives somewhere else.
```shell
version/v8/scripts/cks-v8-run run \
  hf://Qwen/Qwen3-VL-8B-Instruct-GGUF/Qwen3VL-8B-Instruct-Q4_K_M.gguf \
  --mmproj hf://Qwen/Qwen3-VL-8B-Instruct-GGUF/mmproj-Qwen3VL-8B-Instruct-Q8_0.gguf \
  --image-path version/v8/test_assets/v8_vision_doc_card_72.png \
  --prompt "Explain this image." \
  --context-len 1024 \
  --force-convert --force-compile \
  --thinking-mode suppressed
```
The first run converts the GGUF, lowers/codegens the runtimes, compiles them, runs the encoder bridge, then generates from the decoder.
For a cleaner caption sanity check, add `--max-tokens 48` so the current bridge generation loop does not ramble into a long repetition tail.
Synthetic Prefix Probe
For a seam-only smoke test without a real image file, keep the decoder pair but switch to a synthetic prefix:
```shell
version/v8/scripts/cks-v8-run run \
  hf://Qwen/Qwen3-VL-8B-Instruct-GGUF/Qwen3VL-8B-Instruct-Q4_K_M.gguf \
  --mmproj ./mmproj-Qwen3VL-8B-Instruct-Q8_0.gguf \
  --image-mode checker \
  --prompt "Describe the image." \
  --context-len 1024 \
  --force-convert --force-compile
```
What Success Looks Like
- The bridge logs reach `[v8-bridge] done report=...`.
- A bridge report exists under the model cache and contains `"status": "ok"`.
- The generated caption is coherent natural language, not raw stop tokens, `/no_think`, or an empty assistant turn.
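The report-side checks can be scripted against the bridge report directly. A minimal sketch; `bridge_ok` is a hypothetical helper, not part of the v8 tooling, and you pass it the `bridge_report.json` path from your own run.

```shell
# Hypothetical check: passes when the given bridge_report.json exists
# and records "status": "ok".
bridge_ok() {
  [ -f "$1" ] && grep -q '"status": "ok"' "$1"
}

# Usage after a run, e.g.:
#   bridge_ok "$HOME/.cache/ck-engine-v8/models/<model-dir>/multimodal_bridge/bridge_report.json"
```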
A healthy prompt shell uses the GGUF chat template in auto mode and preserves the vision markers around `<image_embeds>`.
Artifacts To Inspect
| Artifact | Purpose |
|---|---|
| `~/.cache/ck-engine-v8/models/.../multimodal_bridge/bridge_report.json` | Final bridge status, prompt accounting, prefix grid, and generated text. |
| `~/.cache/ck-engine-v8/models/.../multimodal_bridge/encoder/` | Encoder-side lowered IR, compiled runtime, and bridge artifacts. |
| `~/.cache/ck-engine-v8/models/.../multimodal_bridge/decoder/` | Decoder lowered IR, generated C, compiled shared library, and decode bridge artifacts. |
`./build/ck-cli-v8 --list`
Troubleshooting
- If you omit `--image-path`, the bridge uses a synthetic image path. That is useful for seam probes, not for validating real-image captioning.
- If the bridge report is missing, check disk space first. The first compile path writes large intermediate artifacts.
- If output starts echoing raw chat markers or thinking-control strings, keep `--chat-template auto` and do not force a manual template override on this model.
- If captions are incoherent again, verify that the decoder GGUF and mmproj GGUF are the matching Qwen3-VL pair.
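Since the first compile writes large intermediates, a quick free-space probe on the cache volume can save a confusing failed run. A sketch; the 10 GiB threshold is an illustrative guess, not a measured requirement of the v8 pipeline.

```shell
# Rough free-space probe for the cache volume before the first compile.
# The 10 GiB threshold below is illustrative, not a measured requirement.
cache="${CK_CACHE_DIR:-$HOME/.cache/ck-engine-v8/models}"
mkdir -p "$cache"
avail_kb=$(df -Pk "$cache" | awk 'NR==2 {print $4}')
if [ "$avail_kb" -lt $((10 * 1024 * 1024)) ]; then
  echo "warning: under 10 GiB free on $cache" >&2
fi
```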
Support Statement
The tested Qwen3-VL artifact pair works in `v8` for end-to-end multimodal inference bring-up. That does not automatically extend to unrelated model families or future multimodal templates that need kernels the engine does not have yet.