# v8 Vision Encoder Architecture
This page explains how the validated v8 vision lane works under the hood:
how a real image becomes a deterministic encoder prefix, how that prefix crosses the bridge,
and where the current system is already template-driven versus where the bridge host still carries explicit policy.
The working path today is the validated
v8 Qwen3-VL encoder + decoder pair. This page is about that architecture and why it matters, not a claim that all multimodal families are already solved.
## Why This Matters
This is the first point where the same C-Kernel-Engine inference stack is visibly spanning both sides of a multimodal transformer:
a vision encoder graph on one end and a decoder graph on the other.
That is a real architecture milestone. The system is no longer only “text inference with extra kernels.” It now proves that templates, lowering, memory planning, codegen, and runtime stitching can carry an image-derived prefix into a decoder without a separate one-off runtime.
## Validated Components
- Template: version/v8/templates/qwen3_vl_vision.json
- Converter: version/v8/scripts/convert_gguf_to_bump_v8.py
- Lowering: version/v8/scripts/build_ir_v8.py
- Host bridge: version/v8/scripts/run_multimodal_bridge_v8.py
- Operator surface: version/v8/scripts/ck_run_v8.py
## End-to-End Flow
### GGUF Intake
The converter reads tensor names, dtypes, shapes, tokenizer metadata, and Qwen3-VL vision metadata from the GGUF files. This is where artifact-specific truth enters the system.
### Template Resolution
The encoder graph shape comes from qwen3_vl_vision.json:
patch frontend, position path, attention body, deepstack branch, and projector footer.
### IR + Layout
build_ir_v8.py lowers that template into concrete kernel calls, activation buffers,
and offsets. The result is deterministic IR plus a deterministic memory plan.
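The converter's name (convert_gguf_to_bump_v8.py) suggests a bump-style allocator behind the deterministic memory plan. A minimal sketch of that idea, with illustrative names and alignment rather than the real planner's logic:

```python
def plan_layout(buffers, align=64):
    """Assign deterministic offsets to named activation buffers.

    buffers: list of (name, size_in_bytes) in a fixed order.
    Bump allocation: each buffer starts at the aligned cursor, then the
    cursor advances. Same input order always yields the same plan.
    """
    offsets = {}
    cursor = 0
    for name, size in buffers:
        cursor = (cursor + align - 1) // align * align  # round up to alignment
        offsets[name] = cursor
        cursor += size
    return offsets, cursor

offsets, total = plan_layout([("patch_embed", 1000), ("attn_out", 4096)])
```

Determinism here is the point: because offsets fall out of a fixed traversal order, the layout can be serialized once and trusted by both codegen and the bridge.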
### Bridge Prefix
The encoder runtime writes a named activation carrying projector output rows. Those rows become the multimodal prefix consumed by the decoder prefill/decode bridge.
### Decoder Continuation
The decoder runs on the decode-layout runtime and treats the encoder rows as a staged prefix. Text tokens resume after the vision grid using the bridge contract’s row/grid semantics.
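The five stages above can be sketched as a thin driver. Every name below is a hypothetical stand-in for the real v8 scripts; the sketch shows the handoffs, not the actual APIs:

```python
# Hypothetical stand-ins for the five v8 stages; each returns plain dicts so
# the dataflow (GGUF -> template -> IR -> encoder rows -> decoder prefix)
# stays visible. None of these names are the real v8 entry points.

def gguf_intake():
    # Artifact truth: values the converter reads from the GGUF files.
    return {"patch_size": 16, "proj_dim": 2048}

def resolve_template(meta):
    # Structural truth: graph order from the template, numbers from GGUF.
    return {"graph": ["patchify", "attn_body", "projector"], "meta": meta}

def lower_to_ir(template):
    # Execution truth: concrete calls plus deterministic buffer names.
    return {"calls": list(template["graph"]), "meta": template["meta"]}

def run_encoder(ir, img_h, img_w):
    # The generated runtime writes projector rows into a named activation.
    p = ir["meta"]["patch_size"]
    return [[0.0] * 4 for _ in range((img_h // p) * (img_w // p))]

def run_decoder(prefix_rows, prompt_tokens):
    # The decoder stages the encoder rows as a prefix before text tokens.
    return len(prefix_rows) + len(prompt_tokens)

ir = lower_to_ir(resolve_template(gguf_intake()))
rows = run_encoder(ir, 32, 32)
consumed = run_decoder(rows, [1, 2, 3])
```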
## Source of Truth Stack
### 1. GGUF
The artifact-specific source of truth.
- Tensor names and shapes
- Weight dtypes and quantization types
- Vision metadata like patch size, image size, projector dims, deepstack hints
- Tokenizer and chat template metadata
### 2. Template
The structural source of truth.
- Header/body/footer graph order
- Vision-specific operator sequence
- Kernel ids and contract sections
- What the encoder is, not the exact artifact numbers
### 3. Lowered Manifest + Layout
The execution source of truth.
- Concrete call graph
- Resolved buffers and offsets
- Quant/kernel decisions
- Bridge-visible activation locations
### 4. Bridge Host
The orchestration source of truth.
- Loads image input
- Runs encoder and decoder runtimes
- Maps encoder output into decoder prefix rows
- Owns the remaining family-specific bridge glue
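The host's core job, mapping encoder output into decoder prefix rows, can be sketched as a contract check. Names and the validation rule here are illustrative, not the bridge's actual code:

```python
def stage_prefix(projector_rows, grid_h, grid_w):
    """Map encoder output rows into the decoder's staged prefix.

    The bridge contract ties prefix length to the vision grid area,
    so a mismatch is a hard error rather than a silent truncation.
    Names and the exact check are illustrative.
    """
    expected = grid_h * grid_w
    if len(projector_rows) != expected:
        raise ValueError(
            f"prefix rows {len(projector_rows)} != grid area {expected}"
        )
    return {"rows": projector_rows, "grid": (grid_h, grid_w)}

prefix = stage_prefix([[0.0]] * 196, 14, 14)
```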
## What Is Already Derived
### Mostly Derived Today
- Vision graph structure from qwen3_vl_vision.json
- Patch/grid/projector dimensions from GGUF metadata
- Quant summary and weight mapping from the actual artifact
- Kernel call order from lowering, not handwritten encoder code
- Memory buffer sizes and offsets from the layout planner
- Named activation resolution for the projector output bridge
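Named activation resolution amounts to a lookup against the layout plan. The dict structure below is an assumption for illustration, not the real layout.json schema:

```python
def resolve_activation(layout, name):
    """Find a named activation's offset and size in the layout plan.

    The layout dict shape is invented for this sketch; the real
    layout.json schema may differ.
    """
    for buf in layout["buffers"]:
        if buf["name"] == name:
            return buf["offset"], buf["bytes"]
    raise KeyError(f"activation {name!r} not found in layout")

layout = {"buffers": [
    {"name": "attn_out", "offset": 0, "bytes": 4096},
    {"name": "projector_out", "offset": 4096, "bytes": 8192},
]}
offset, size = resolve_activation(layout, "projector_out")
```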
### Still Explicit In The Host
- Qwen3-VL smart-resize/image preprocessing logic
- Synthetic image generators for seam-only probes
- Some bridge row/grid fallback rules
- Chat-template fallback parsing in the bridge runner
- The current choice of which bridge decoder API to call
The encoder runtime itself is already driven mostly by the template, the manifest, and the lowering output. The remaining hardcoded family knowledge is concentrated in the bridge shell, not in the generated encoder graph.
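Of that remaining host glue, the smart-resize step is the most mechanical. Qwen-style smart resize snaps image dimensions to multiples of the patch grid while keeping the pixel count inside a budget; this sketch follows that idea with illustrative constants and rounding, not the exact v8 logic:

```python
import math

def smart_resize(h, w, factor=28, min_pixels=56 * 56,
                 max_pixels=14 * 14 * 4 * 1280):
    """Snap (h, w) to multiples of `factor` inside a pixel budget.

    Mirrors the Qwen-style resize idea; constants are illustrative.
    Downscaling floors and upscaling ceils so the result stays in budget.
    """
    h2 = max(factor, round(h / factor) * factor)
    w2 = max(factor, round(w / factor) * factor)
    if h2 * w2 > max_pixels:
        scale = math.sqrt(h * w / max_pixels)
        h2 = math.floor(h / scale / factor) * factor
        w2 = math.floor(w / scale / factor) * factor
    elif h2 * w2 < min_pixels:
        scale = math.sqrt(min_pixels / (h * w))
        h2 = math.ceil(h * scale / factor) * factor
        w2 = math.ceil(w * scale / factor) * factor
    return h2, w2

h, w = smart_resize(480, 640)
```

Snapping to a multiple of patch size times the spatial merge factor is what guarantees the downstream grid math divides evenly.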
## The Actual Encoder Shape
The current Qwen3-VL encoder template is not a toy wrapper. It encodes a real transformer-style path:
- Patch frontend: patchify, dual patch projection, stream merge, patch bias, tiled 2D position addition
- Position contract: 2D position ids plus multi-section RoPE for vision attention
- Transformer body: layernorm, packed QKV projection, split Q/K/V, full attention, out projection, residual, MLP
- Deepstack branch: selected layer taps are spatially merged and projected through branch MLP blocks
- Projector footer: the resulting rows become the bridge-visible vision embedding prefix for the decoder
That matters because the engine is not “calling an external encoder.” It is lowering and running the encoder as part of the same system.
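The graph above can be traced at the shape level. This sketch only propagates dimensions through the stages; all sizes are illustrative, not the real Qwen3-VL config:

```python
def trace_encoder_shapes(img_h, img_w, patch=16, merge=2,
                         d_model=1024, d_proj=2048):
    """Trace tensor shapes through the encoder stages.

    Sizes are illustrative; the point is how each stage transforms
    (tokens, dim), not the actual model dimensions.
    """
    shapes = {}
    # Patch frontend: patchify the image into a (tokens, d_model) stream.
    gh, gw = img_h // patch, img_w // patch
    shapes["patch_frontend"] = (gh * gw, d_model)
    # Transformer body: attention + MLP keep (tokens, d_model) throughout.
    shapes["attention_body"] = (gh * gw, d_model)
    # Spatial merge + projector: the grid shrinks, rows project to d_proj.
    merged = (gh // merge) * (gw // merge)
    shapes["projector_out"] = (merged, d_proj)
    return shapes

shapes = trace_encoder_shapes(448, 448)
```

The projector_out row count is exactly what the bridge contract later calls the prefix length.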
## Important Artifacts
| Artifact | Role |
|---|---|
| encoder/weights_manifest.json | Resolved config, embedded template, quant summary, and weight inventory |
| encoder/ir1.json | Pre-lowering operator expansion from the template |
| encoder/layout.json | Activation buffers and memory offsets used by the generated runtime |
| encoder/call.json | Call-ready IR consumed by code generation |
| encoder/encoder_v8.c | Generated C runtime for the vision encoder graph |
| multimodal_bridge/bridge_report.json | Final bridge contract: prefix rows, grid, prompt segmentation, generated text |
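A bridge report covering those four facts might look roughly like this; every field name is invented for illustration, and the real bridge_report.json schema may differ:

```json
{
  "prefix_rows": 196,
  "grid": {"h": 14, "w": 14},
  "prompt_segments": ["chat_shell", "image_prefix", "user_text"],
  "generated_text": "..."
}
```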
## What “Template-Driven Bridge” Means
The next cleanup step is not to add more handwritten Qwen logic. It is to promote bridge policy into explicit contracts:
- Image preprocess contract: resize policy, alignment rule, normalization stats, channel layout
- Multimodal bridge contract: which activation is the prefix, how many rows it has, what grid metadata it carries, and which decoder entrypoint should consume it
- Chat contract: prompt shell, stop markers, and image marker placement
Once that happens, the bridge shell stops asking “is this Qwen3-VL?” and starts asking “what does the resolved contract say?”
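Resolved, such contracts could be plain data attached at intake time. A hypothetical sketch, with every key invented for illustration:

```json
{
  "image_preprocess": {
    "resize": "smart_resize",
    "align_multiple": 28,
    "normalize_mean": [0.481, 0.458, 0.408],
    "normalize_std": [0.269, 0.261, 0.276],
    "channel_layout": "rgb"
  },
  "multimodal_bridge": {
    "prefix_activation": "projector_out",
    "grid_source": "gguf_vision_metadata",
    "decoder_entrypoint": "prefill_with_prefix"
  },
  "chat": {
    "image_marker": "<|image|>",
    "stop_markers": ["<|im_end|>"]
  }
}
```

With that in place, family-specific questions collapse into field lookups against the resolved contract.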
## Why This Is A Real Milestone
The important thing is not only that Qwen3-VL now captions an image. It is that the same deterministic stack now spans GGUF intake, template lowering, generated vision execution, bridge stitching, and decoder continuation. That is the shape of a real transformer runtime, not an application wrapper.