# v8 Vision Encoder Architecture
This page explains how the validated v8 vision lane works under the hood:
how a real image becomes a deterministic encoder prefix, how that prefix crosses the bridge,
and where the current system is already template-driven versus where the bridge host still carries explicit policy.
The working path today is the validated
v8 Qwen3-VL encoder + decoder pair. This page is about that architecture and why it matters, not a claim that all multimodal families are already solved.
## Why This Matters
This is the first point where the same C-Kernel-Engine inference stack is visibly spanning both sides of a multimodal transformer:
a vision encoder graph on one end and a decoder graph on the other.
That is a real architecture milestone. The system is no longer only “text inference with extra kernels.” It now proves that templates, lowering, memory planning, codegen, and runtime stitching can carry an image-derived prefix into a decoder without a separate one-off runtime.
## Validated Components
- Template: version/v8/templates/qwen3_vl_vision.json
- Converter: version/v8/scripts/convert_gguf_to_bump_v8.py
- Lowering: version/v8/scripts/build_ir_v8.py
- Host bridge: version/v8/scripts/run_multimodal_bridge_v8.py
- Operator surface: version/v8/scripts/ck_run_v8.py
## End-to-End Flow
### GGUF Intake
The converter reads tensor names, dtypes, shapes, tokenizer metadata, and Qwen3-VL vision metadata from the GGUF files. This is where artifact-specific truth enters the system.
### Template Resolution
The encoder graph shape comes from qwen3_vl_vision.json:
patch frontend, position path, attention body, deepstack branch, and projector footer.
### IR + Layout
build_ir_v8.py lowers that template into concrete kernel calls, activation buffers,
and offsets. The result is deterministic IR plus a deterministic memory plan.
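The converter's name (convert_gguf_to_bump_v8.py) suggests a bump-style allocator behind the deterministic memory plan. A minimal sketch of that idea, with illustrative names and alignment rather than the real planner's logic:

```python
def plan_layout(buffers, align=64):
    """Assign deterministic offsets to named activation buffers.

    buffers: list of (name, size_in_bytes) in a fixed order.
    Bump allocation: each buffer starts at the aligned cursor, then the
    cursor advances. Same input order always yields the same plan.
    """
    offsets = {}
    cursor = 0
    for name, size in buffers:
        cursor = (cursor + align - 1) // align * align  # round up to alignment
        offsets[name] = cursor
        cursor += size
    return offsets, cursor

offsets, total = plan_layout([("patch_embed", 1000), ("attn_out", 4096)])
```

Determinism here is the point: because offsets fall out of a fixed traversal order, the layout can be serialized once and trusted by both codegen and the bridge.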
### Bridge Prefix
The encoder runtime writes a named activation carrying projector output rows. Those rows become the multimodal prefix consumed by the decoder prefill/decode bridge.
### Decoder Continuation
The decoder runs on the decode-layout runtime and treats the encoder rows as a staged prefix. Text tokens resume after the vision grid using the bridge contract’s row/grid semantics.
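The five stages above can be sketched as a thin driver. Every name below is a hypothetical stand-in for the real v8 scripts; the sketch shows the handoffs, not the actual APIs:

```python
# Hypothetical stand-ins for the five v8 stages; each returns plain dicts so
# the dataflow (GGUF -> template -> IR -> encoder rows -> decoder prefix)
# stays visible. None of these names are the real v8 entry points.

def gguf_intake():
    # Artifact truth: values the converter reads from the GGUF files.
    return {"patch_size": 16, "proj_dim": 2048}

def resolve_template(meta):
    # Structural truth: graph order from the template, numbers from GGUF.
    return {"graph": ["patchify", "attn_body", "projector"], "meta": meta}

def lower_to_ir(template):
    # Execution truth: concrete calls plus deterministic buffer names.
    return {"calls": list(template["graph"]), "meta": template["meta"]}

def run_encoder(ir, img_h, img_w):
    # The generated runtime writes projector rows into a named activation.
    p = ir["meta"]["patch_size"]
    return [[0.0] * 4 for _ in range((img_h // p) * (img_w // p))]

def run_decoder(prefix_rows, prompt_tokens):
    # The decoder stages the encoder rows as a prefix before text tokens.
    return len(prefix_rows) + len(prompt_tokens)

ir = lower_to_ir(resolve_template(gguf_intake()))
rows = run_encoder(ir, 32, 32)
consumed = run_decoder(rows, [1, 2, 3])
```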
## Source of Truth Stack
### 1. GGUF
The artifact-specific source of truth.
- Tensor names and shapes
- Weight dtypes and quantization types
- Vision metadata like patch size, image size, projector dims, deepstack hints
- Tokenizer and chat template metadata
### 2. Template
The structural source of truth.
- Header/body/footer graph order
- Vision-specific operator sequence
- Kernel ids and contract sections
- What the encoder is, not the exact artifact numbers
### 3. Lowered Manifest + Layout
The execution source of truth.
- Concrete call graph
- Resolved buffers and offsets
- Quant/kernel decisions
- Bridge-visible activation locations
### 4. Bridge Host
The orchestration source of truth.
- Loads image input
- Runs encoder and decoder runtimes
- Maps encoder output into decoder prefix rows
- Owns the remaining family-specific bridge glue
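The host's core job, mapping encoder output into decoder prefix rows, can be sketched as a contract check. Names and the validation rule here are illustrative, not the bridge's actual code:

```python
def stage_prefix(projector_rows, grid_h, grid_w):
    """Map encoder output rows into the decoder's staged prefix.

    The bridge contract ties prefix length to the vision grid area,
    so a mismatch is a hard error rather than a silent truncation.
    Names and the exact check are illustrative.
    """
    expected = grid_h * grid_w
    if len(projector_rows) != expected:
        raise ValueError(
            f"prefix rows {len(projector_rows)} != grid area {expected}"
        )
    return {"rows": projector_rows, "grid": (grid_h, grid_w)}

prefix = stage_prefix([[0.0]] * 196, 14, 14)
```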
## What Is Already Derived
### Mostly Derived Today
- Vision graph structure from qwen3_vl_vision.json
- Patch/grid/projector dimensions from GGUF metadata
- Quant summary and weight mapping from the actual artifact
- Kernel call order from lowering, not handwritten encoder code
- Memory buffer sizes and offsets from the layout planner
- Named activation resolution for the projector output bridge
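Named activation resolution amounts to a lookup against the layout plan. The dict structure below is an assumption for illustration, not the real layout.json schema:

```python
def resolve_activation(layout, name):
    """Find a named activation's offset and size in the layout plan.

    The layout dict shape is invented for this sketch; the real
    layout.json schema may differ.
    """
    for buf in layout["buffers"]:
        if buf["name"] == name:
            return buf["offset"], buf["bytes"]
    raise KeyError(f"activation {name!r} not found in layout")

layout = {"buffers": [
    {"name": "attn_out", "offset": 0, "bytes": 4096},
    {"name": "projector_out", "offset": 4096, "bytes": 8192},
]}
offset, size = resolve_activation(layout, "projector_out")
```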
### Still Explicit In The Host
- Qwen3-VL smart-resize/image preprocessing logic
- Synthetic image generators for seam-only probes
- Some bridge row/grid fallback rules
- Chat-template fallback parsing in the bridge runner
- The current choice of which bridge decoder API to call
The encoder runtime itself is already driven mostly by the template, the manifest, and the lowering output. The remaining hardcoded family knowledge is concentrated in the bridge shell, not in the generated encoder graph.
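Of that remaining host glue, the smart-resize step is the most mechanical. Qwen-style smart resize snaps image dimensions to multiples of the patch grid while keeping the pixel count inside a budget; this sketch follows that idea with illustrative constants and rounding, not the exact v8 logic:

```python
import math

def smart_resize(h, w, factor=28, min_pixels=56 * 56,
                 max_pixels=14 * 14 * 4 * 1280):
    """Snap (h, w) to multiples of `factor` inside a pixel budget.

    Mirrors the Qwen-style resize idea; constants are illustrative.
    Downscaling floors and upscaling ceils so the result stays in budget.
    """
    h2 = max(factor, round(h / factor) * factor)
    w2 = max(factor, round(w / factor) * factor)
    if h2 * w2 > max_pixels:
        scale = math.sqrt(h * w / max_pixels)
        h2 = math.floor(h / scale / factor) * factor
        w2 = math.floor(w / scale / factor) * factor
    elif h2 * w2 < min_pixels:
        scale = math.sqrt(min_pixels / (h * w))
        h2 = math.ceil(h * scale / factor) * factor
        w2 = math.ceil(w * scale / factor) * factor
    return h2, w2

h, w = smart_resize(480, 640)
```

Snapping to a multiple of patch size times the spatial merge factor is what guarantees the downstream grid math divides evenly.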
## The Actual Encoder Shape
The current Qwen3-VL encoder template is not a toy wrapper. It encodes a real transformer-style path:
- Patch frontend: patchify, dual patch projection, stream merge, patch bias, tiled 2D position addition
- Position contract: 2D position ids plus multi-section RoPE for vision attention
- Transformer body: layernorm, packed QKV projection, split Q/K/V, full attention, out projection, residual, MLP
- Deepstack branch: selected layer taps are spatially merged and projected through branch MLP blocks
- Projector footer: the resulting rows become the bridge-visible vision embedding prefix for the decoder
That matters because the engine is not “calling an external encoder.” It is lowering and running the encoder as part of the same system.
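The graph above can be traced at the shape level. This sketch only propagates dimensions through the stages; all sizes are illustrative, not the real Qwen3-VL config:

```python
def trace_encoder_shapes(img_h, img_w, patch=16, merge=2,
                         d_model=1024, d_proj=2048):
    """Trace tensor shapes through the encoder stages.

    Sizes are illustrative; the point is how each stage transforms
    (tokens, dim), not the actual model dimensions.
    """
    shapes = {}
    # Patch frontend: patchify the image into a (tokens, d_model) stream.
    gh, gw = img_h // patch, img_w // patch
    shapes["patch_frontend"] = (gh * gw, d_model)
    # Transformer body: attention + MLP keep (tokens, d_model) throughout.
    shapes["attention_body"] = (gh * gw, d_model)
    # Spatial merge + projector: the grid shrinks, rows project to d_proj.
    merged = (gh // merge) * (gw // merge)
    shapes["projector_out"] = (merged, d_proj)
    return shapes

shapes = trace_encoder_shapes(448, 448)
```

The projector_out row count is exactly what the bridge contract later calls the prefix length.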
## Important Artifacts
| Artifact | Role |
|---|---|
| encoder/weights_manifest.json | Resolved config, embedded template, quant summary, and weight inventory |
| encoder/ir1.json | Pre-lowering operator expansion from the template |
| encoder/layout.json | Activation buffers and memory offsets used by the generated runtime |
| encoder/call.json | Call-ready IR consumed by code generation |
| encoder/encoder_v8.c | Generated C runtime for the vision encoder graph |
| multimodal_bridge/bridge_report.json | Final bridge contract: prefix rows, grid, prompt segmentation, generated text |
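A bridge report covering those four facts might look roughly like this; every field name is invented for illustration, and the real bridge_report.json schema may differ:

```json
{
  "prefix_rows": 196,
  "grid": {"h": 14, "w": 14},
  "prompt_segments": ["chat_shell", "image_prefix", "user_text"],
  "generated_text": "..."
}
```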
## What “Template-Driven Bridge” Means
The next cleanup step is not to add more handwritten Qwen logic. It is to promote bridge policy into explicit contracts:
- Image preprocess contract: resize policy, alignment rule, normalization stats, channel layout
- Multimodal bridge contract: which activation is the prefix, how many rows it has, what grid metadata it carries, and which decoder entrypoint should consume it
- Chat contract: prompt shell, stop markers, and image marker placement
Once that happens, the bridge shell stops asking “is this Qwen3-VL?” and starts asking “what does the resolved contract say?”
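Resolved, such contracts could be plain data attached at intake time. A hypothetical sketch, with every key invented for illustration:

```json
{
  "image_preprocess": {
    "resize": "smart_resize",
    "align_multiple": 28,
    "normalize_mean": [0.481, 0.458, 0.408],
    "normalize_std": [0.269, 0.261, 0.276],
    "channel_layout": "rgb"
  },
  "multimodal_bridge": {
    "prefix_activation": "projector_out",
    "grid_source": "gguf_vision_metadata",
    "decoder_entrypoint": "prefill_with_prefix"
  },
  "chat": {
    "image_marker": "<|image|>",
    "stop_markers": ["<|im_end|>"]
  }
}
```

With that in place, family-specific questions collapse into field lookups against the resolved contract.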
## Why This Is A Real Milestone
The important thing is not only that Qwen3-VL now captions an image. It is that the same deterministic stack now spans GGUF intake, template lowering, generated vision execution, bridge stitching, and decoder continuation. That is the shape of a real transformer runtime, not an application wrapper.