v8 Vision Encoder Architecture

This page explains how the validated v8 vision lane works under the hood: how a real image becomes a deterministic encoder prefix, how that prefix crosses the bridge, and where the current system is already template-driven versus where the bridge host still carries explicit policy.

Current Scope

The working path today is the validated v8 Qwen3-VL encoder + decoder pair. This page is about that architecture and why it matters, not a claim that all multimodal families are already solved.

Why This Matters

This is the first point where the same C-Kernel-Engine inference stack is visibly spanning both sides of a multimodal transformer: a vision encoder graph on one end and a decoder graph on the other.

That is a real architecture milestone. The system is no longer only “text inference with extra kernels.” It now proves that templates, lowering, memory planning, codegen, and runtime stitching can carry an image-derived prefix into a decoder without a separate one-off runtime.

Key properties: template-driven graph, GGUF-derived config, deterministic memory, encoder → bridge → decoder.

Validated Components

  • Template: version/v8/templates/qwen3_vl_vision.json
  • Converter: version/v8/scripts/convert_gguf_to_bump_v8.py
  • Lowering: version/v8/scripts/build_ir_v8.py
  • Host bridge: version/v8/scripts/run_multimodal_bridge_v8.py
  • Operator surface: version/v8/scripts/ck_run_v8.py

End-to-End Flow

Step 1: GGUF Intake

The converter reads tensor names, dtypes, shapes, tokenizer metadata, and Qwen3-VL vision metadata from the GGUF files. This is where artifact-specific truth enters the system.
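The first bytes of intake can be sketched from GGUF's documented preamble layout (magic, version, tensor count, metadata KV count). This is a minimal illustration only; the real converter goes on to decode every metadata key/value pair and tensor record.

```python
import struct

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed GGUF preamble: 4-byte magic, uint32 version,
    uint64 tensor count, uint64 metadata KV count (little-endian)."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensor_count": n_tensors, "kv_count": n_kv}

# Synthetic preamble for demonstration: version 3, 12 tensors, 5 metadata keys
blob = struct.pack("<4sIQQ", b"GGUF", 3, 12, 5)
header = read_gguf_header(blob)
```

Everything after this preamble (tensor names, shapes, dtypes, vision metadata, tokenizer metadata) is what makes GGUF the artifact-specific source of truth described below.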

Step 2: Template Resolution

The encoder graph shape comes from qwen3_vl_vision.json: patch frontend, position path, attention body, deepstack branch, and projector footer.
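The header/body/footer split can be sketched as follows. The section and operator names here are illustrative, not the actual qwen3_vl_vision.json schema; the point is that the template names the structure once and lowering repeats the body per layer.

```python
# Hypothetical template shape; keys and op names are illustrative.
template = {
    "header": ["patchify", "patch_proj", "pos_add"],
    "body":   ["ln", "qkv_proj", "attention", "out_proj", "mlp"],
    "footer": ["deepstack_merge", "projector"],
}

def resolve_graph(template: dict, n_layers: int) -> list:
    """Expand header/body/footer into a flat operator sequence.
    The body section repeats once per transformer layer."""
    ops = list(template["header"])
    for layer in range(n_layers):
        ops += [f"L{layer}.{op}" for op in template["body"]]
    ops += template["footer"]
    return ops

graph = resolve_graph(template, n_layers=2)
```

The actual layer count and dimensions come from GGUF metadata at intake, which is why the template can stay artifact-agnostic.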

Step 3: IR + Layout

build_ir_v8.py lowers that template into concrete kernel calls, activation buffers, and offsets. The result is deterministic IR plus a deterministic memory plan.
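A deterministic memory plan of this kind can be sketched as a bump allocator over an ordered buffer list: same inputs, same offsets, every run. The alignment value and buffer names below are illustrative, not the planner's real parameters.

```python
def plan_layout(buffers, align=64):
    """Assign deterministic offsets with a simple bump allocator.
    buffers: ordered list of (name, size_bytes) pairs."""
    layout, cursor = {}, 0
    for name, size in buffers:
        cursor = (cursor + align - 1) // align * align  # round up to alignment
        layout[name] = {"offset": cursor, "size": size}
        cursor += size
    return layout, cursor  # cursor is the total arena size

layout, total = plan_layout(
    [("patch_embed", 100), ("attn_tmp", 200), ("proj_out", 50)]
)
```

Because the plan is a pure function of the buffer list, the generated runtime can bake offsets in at codegen time instead of allocating dynamically.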

Step 4: Bridge Prefix

The encoder runtime writes a named activation carrying projector output rows. Those rows become the multimodal prefix consumed by the decoder prefill/decode bridge.
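A named activation handoff can be sketched as a small self-describing record: a header naming the activation and its shape, followed by the packed rows. The on-disk format here is invented for illustration; the real bridge resolves the activation through the layout, not this format.

```python
import json, os, struct, tempfile

def write_activation(path, name, rows):
    """Write a named activation: JSON header (name, shape) + float32 rows."""
    flat = [x for row in rows for x in row]
    header = json.dumps(
        {"name": name, "rows": len(rows), "cols": len(rows[0])}
    ).encode()
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(header)))
        f.write(header)
        f.write(struct.pack(f"<{len(flat)}f", *flat))

def read_activation(path):
    """Read the record back: metadata dict plus row-major float rows."""
    with open(path, "rb") as f:
        (hlen,) = struct.unpack("<I", f.read(4))
        meta = json.loads(f.read(hlen))
        n = meta["rows"] * meta["cols"]
        flat = struct.unpack(f"<{n}f", f.read(4 * n))
    cols = meta["cols"]
    rows = [list(flat[i * cols:(i + 1) * cols]) for i in range(meta["rows"])]
    return meta, rows

# Round-trip two projector-output rows through the handoff
path = os.path.join(tempfile.mkdtemp(), "proj_out.act")
write_activation(path, "proj_out", [[0.5, 1.0], [2.0, 3.0]])
meta, rows = read_activation(path)
```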

Step 5: Decoder Continuation

The decoder runs on the decode-layout runtime and treats the encoder rows as a staged prefix. Text tokens resume after the vision grid using the bridge contract’s row/grid semantics.
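The prefix-splicing step can be sketched as replacing an image placeholder in the prompt with the encoder's rows, so text tokens resume immediately after the vision span. Token names and the placeholder convention are illustrative, not the actual bridge contract.

```python
def build_prefill_sequence(vision_rows, prompt_tokens, image_slot):
    """Splice vision prefix rows into the token stream at the image slot.
    Returns (sequence, span), where span marks the vision-row positions."""
    i = prompt_tokens.index(image_slot)
    seq = (
        prompt_tokens[:i]
        + [("vision", r) for r in range(vision_rows)]
        + prompt_tokens[i + 1:]
    )
    return seq, (i, i + vision_rows)

seq, span = build_prefill_sequence(
    4, ["<bos>", "<image>", "describe", "this"], "<image>"
)
```

The decoder's prefill then runs over the whole spliced sequence, with the span telling it which positions carry staged encoder rows rather than embedded text tokens.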

Source of Truth Stack

1. GGUF

The artifact-specific source of truth.

  • Tensor names and shapes
  • Weight dtypes and quantization types
  • Vision metadata like patch size, image size, projector dims, deepstack hints
  • Tokenizer and chat template metadata

2. Template

The structural source of truth.

  • Header/body/footer graph order
  • Vision-specific operator sequence
  • Kernel ids and contract sections
  • What the encoder is, not the exact artifact numbers

3. Lowered Manifest + Layout

The execution source of truth.

  • Concrete call graph
  • Resolved buffers and offsets
  • Quant/kernel decisions
  • Bridge-visible activation locations

4. Bridge Host

The orchestration source of truth.

  • Loads image input
  • Runs encoder and decoder runtimes
  • Maps encoder output into decoder prefix rows
  • Owns the remaining family-specific bridge glue

What Is Already Derived

Mostly Derived Today

  • Vision graph structure from qwen3_vl_vision.json
  • Patch/grid/projector dimensions from GGUF metadata
  • Quant summary and weight mapping from the actual artifact
  • Kernel call order from lowering, not handwritten encoder code
  • Memory buffer sizes and offsets from the layout planner
  • Named activation resolution for the projector output bridge

Still Explicit In The Host

  • Qwen3-VL smart-resize/image preprocessing logic
  • Synthetic image generators for seam-only probes
  • Some bridge row/grid fallback rules
  • Chat-template fallback parsing in the bridge runner
  • The current choice of which bridge decoder API to call

Honest State

The encoder runtime itself is already mostly template/manifest/lowering driven. The remaining hardcoded family knowledge is concentrated in the bridge shell, not in the generated encoder graph.

The Actual Encoder Shape

The current Qwen3-VL encoder template is not a toy wrapper. It encodes a real transformer-style path:

  1. Patch frontend: patchify, dual patch projection, stream merge, patch bias, tiled 2D position addition
  2. Position contract: 2D position ids plus multi-section RoPE for vision attention
  3. Transformer body: layernorm, packed QKV projection, split Q/K/V, full attention, out projection, residual, MLP
  4. Deepstack branch: selected layer taps are spatially merged and projected through branch MLP blocks
  5. Projector footer: the resulting rows become the bridge-visible vision embedding prefix for the decoder
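The patch-frontend arithmetic above determines how many prefix rows the projector footer ultimately emits. A minimal sketch, assuming Qwen-style defaults (16-pixel patches, 2×2 spatial merge); the real values come from GGUF vision metadata, not constants.

```python
def vision_prefix_rows(image_h, image_w, patch=16, spatial_merge=2):
    """Patch grid for an image, and the row count after spatial merge.
    patch/spatial_merge are illustrative defaults, read from GGUF
    metadata in the real pipeline."""
    grid_h, grid_w = image_h // patch, image_w // patch            # patch grid
    rows = (grid_h // spatial_merge) * (grid_w // spatial_merge)   # merged rows
    return grid_h, grid_w, rows

grid_h, grid_w, rows = vision_prefix_rows(448, 448)
```

For a 448×448 input under these assumed defaults, a 28×28 patch grid merges down to 196 bridge-visible rows.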

That matters because the engine is not “calling an external encoder.” It is lowering and running the encoder as part of the same system.

Important Artifacts

  • encoder/weights_manifest.json: Resolved config, embedded template, quant summary, and weight inventory
  • encoder/ir1.json: Pre-lowering operator expansion from the template
  • encoder/layout.json: Activation buffers and memory offsets used by the generated runtime
  • encoder/call.json: Call-ready IR consumed by code generation
  • encoder/encoder_v8.c: Generated C runtime for the vision encoder graph
  • multimodal_bridge/bridge_report.json: Final bridge contract: prefix rows, grid, prompt segmentation, generated text

What “Template-Driven Bridge” Means

The next cleanup step is not to add more handwritten Qwen logic. It is to promote bridge policy, such as the row/grid rules and prefix placement, into explicit contracts the host resolves at run time.

Once that happens, the bridge shell stops asking “is this Qwen3-VL?” and starts asking “what does the resolved contract say?”
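The shift can be sketched as replacing a family check with a contract lookup. The manifest keys below are illustrative, not the actual weights_manifest.json schema.

```python
# Before: family-specific branching (the pattern being removed)
#   if family == "qwen3-vl": rows, grid = 196, (14, 14)
# After: policy read from a resolved contract in the manifest.
def bridge_policy(manifest: dict) -> dict:
    """Resolve bridge policy from the manifest's contract section.
    Key names here are hypothetical."""
    contract = manifest["bridge_contract"]
    return {
        "prefix_activation": contract["prefix_activation"],
        "rows_per_image": contract["rows_per_image"],
        "grid": tuple(contract["grid"]),
    }

manifest = {"bridge_contract": {
    "prefix_activation": "proj_out",
    "rows_per_image": 196,
    "grid": [14, 14],
}}
policy = bridge_policy(manifest)
```

The host then consumes `policy` without knowing which model family produced it, which is exactly the "what does the resolved contract say?" posture described above.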

Why This Is A Real Milestone

Serious Software Signal
The important thing is not only that Qwen3-VL now captions an image. It is that the same deterministic stack now spans GGUF intake, template lowering, generated vision execution, bridge stitching, and decoder continuation. That is the shape of a real transformer runtime, not an application wrapper.