Vision Encoder Parity
Qwen3-VL vision encoder parity was closed on March 30, 2026. The generated C runtime now matches local llama.cpp end to end in native strict mode.
max_abs=0.0, mean_abs=0.0, rmse=0.0, cosine=1.0.
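For reference, the four metrics above can be computed from two flat activation dumps in a few lines. This is a minimal sketch (the function name and pure-Python style are illustrative, not the project's actual tooling):

```python
import math

def parity_metrics(ref, got):
    """Compare two equal-length flat activation dumps element-wise."""
    assert len(ref) == len(got) and len(ref) > 0
    diffs = [abs(r - g) for r, g in zip(ref, got)]
    max_abs = max(diffs)
    mean_abs = sum(diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    dot = sum(r * g for r, g in zip(ref, got))
    norm = math.sqrt(sum(r * r for r in ref)) * math.sqrt(sum(g * g for g in got))
    cosine = dot / norm if norm else 1.0
    return {"max_abs": max_abs, "mean_abs": mean_abs,
            "rmse": rmse, "cosine": cosine}
```

Exact parity means max_abs, mean_abs, and rmse are all literally 0.0 and cosine is literally 1.0, not merely close.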
This is not the temporary mtmd-oracle shortcut.
What Was Actually Wrong
The final blocker was not patching, not deepstack stitching, and not a missing attention op. We initially localized the drift to full attention because that was the first place tiny errors became visible. The real remaining seed was later in the layer: the GGML fp16-table GELU path was not bit-identical to llama.cpp.
Once layer-0 attention became exact, the first remaining mismatch showed up at ffn_gelu, then propagated through ffn_out, residual addition, and the remaining encoder stack.
Why This Took Time
Encoder Parity Is Not Just "Correct Math"
For decoder bring-up, mathematically correct kernels were often enough to converge quickly.
For this vision encoder, the target was ggml-exact graph behavior.
Tiny fp32 or fp16 rounding differences that look harmless in one row compound across full attention, MLP, and 27 layers of residual accumulation.
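A toy residual stack makes the compounding concrete. The numbers here are illustrative, not the encoder's actual values: a per-layer mismatch of 1e-6 grows by roughly two orders of magnitude over 27 residual layers whenever the residual branch amplifies the signal:

```python
def run_stack(x, layers=27, eps=0.0):
    """Toy residual stack; eps models a tiny per-layer kernel mismatch."""
    for _ in range(layers):
        branch = 0.1 * x  # stand-in for the attention/MLP branch output
        x = x + branch + eps
    return x

exact = run_stack(1.0)               # bit-exact port
drifted = run_stack(1.0, eps=1e-6)   # port with a 1e-6 per-layer error
# |drifted - exact| ends up around 1.2e-4: roughly 120x the per-layer error.
```

This is why "harmless in one row" is meaningless as a quality bar for a 27-layer encoder.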
The painful lesson is simple: if a model is loaded from GGUF and the reference is llama.cpp, "close" is not done.
Method That Worked
- Get end-to-end parity metrics first, so the failure is measurable.
- Dump per-activation tensors for both CK and llama.cpp.
- Find the first layer and first tensor that diverges.
- Replay that tensor boundary in isolation until the drift is explained.
- Fix the exact seed error, then rerun end-to-end because later layers amplify tiny misses.
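The first-divergence step above can be sketched as a linear scan over graph-ordered tensor dumps. Tensor names and the dump format here are hypothetical:

```python
def first_divergence(dumps, tol=0.0):
    """dumps: ordered list of (tensor_name, ref_values, got_values),
    in graph execution order. Returns (name, max_abs) for the first
    tensor whose max abs diff exceeds tol, or None if all match."""
    for name, ref, got in dumps:
        max_abs = max((abs(r - g) for r, g in zip(ref, got)), default=0.0)
        if max_abs > tol:
            return name, max_abs
    return None
```

With tol=0.0 this enforces strict bit-level parity; a nonzero tol is only useful for triaging which seam to replay first, never as the acceptance bar.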
Major Issues To Watch For On Future Encoder Ports
1. GGML-Exact Activation Tables
If the reference path uses lookup-table math, especially fp16 tables, do not assume a "same formula" implementation is enough. The exact table and the exact fp32→fp16 conversion path matter.
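The shape of a table-based GELU path can be sketched as follows. This mirrors the structure of ggml's fp16 GELU table but is not its exact implementation; the precise table contents and fp32→fp16 conversion inside ggml are exactly what must be replicated for bit parity:

```python
import math
import struct

def f32_to_f16_bits(x):
    # Round-to-nearest IEEE half-precision packing via the struct module.
    return struct.unpack('<H', struct.pack('<e', x))[0]

def f16_bits_to_f32(bits):
    return struct.unpack('<e', struct.pack('<H', bits))[0]

def gelu_ref(x):
    # tanh-approximation GELU, the formula such a table is built from.
    return 0.5 * x * (1.0 + math.tanh(0.7978845608028654
                                      * (x + 0.044715 * x ** 3)))

# Build a lookup table over every fp16 bit pattern, storing entries as
# fp16. Output precision is therefore capped at fp16, unlike evaluating
# the formula directly in fp32.
TABLE = [0.0] * 65536
for bits in range(65536):
    x = f16_bits_to_f32(bits)
    y = gelu_ref(x) if math.isfinite(x) else x
    TABLE[bits] = f16_bits_to_f32(f32_to_f16_bits(y))

def gelu_table(x):
    # The input is routed through its fp16 bit pattern, as a table path does.
    return TABLE[f32_to_f16_bits(x)]
```

The point is not this particular table; it is that two implementations agreeing on the formula can still disagree in the stored fp16 entries or in the conversions on the way in and out.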
2. Full Attention Is Less Forgiving
Full bidirectional attention spreads tiny row-level differences across the entire layer. Decoder-style causal paths tend to localize drift much more naturally.
3. Mixed Dtype Boundaries Matter
Even when most activations are fp32, the decisive mismatch may come from an fp16 lookup, a quantized projection boundary, or a reference-only conversion convention.
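A one-line round trip shows why such a boundary shifts values at all: most fp32 values are not representable in fp16, so a reference-side fp16 conversion perturbs everything downstream. A minimal sketch:

```python
import struct

def round_trip_f16(x):
    """fp32 value -> fp16 -> fp32, as at a mixed-dtype tensor boundary."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

print(round_trip_f16(0.5))  # 0.5 is exactly representable: unchanged
print(round_trip_f16(0.1))  # 0.1 is not: comes back slightly perturbed
```

If the reference performs this round trip and the port does not (or performs it with different rounding), the two sides disagree on every downstream element.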
4. Standalone Test Builds Must Track Runtime Splits
Splitting production code into new helper files is safe, but standalone libs and generated harnesses must be updated too. Otherwise the main engine passes while test-only link targets fail.
What To Check First On The Next Encoder
- Template contract: make sure the model path is expressed in the template, not hidden in lowering code.
- Kernel map coverage: verify every required op resolves to a concrete kernel id before codegen.
- Layer-0 first divergence: do not start with layer 12 or final embeddings.
- Reference activations: dump both sides and replay isolated seams before editing kernels broadly.
- Activation-table parity: GELU, RoPE tables, and any special lookup path deserve direct standalone checks.
Is The v8 IR Builder Hard-Coded?
Mostly no, but not entirely.
The Qwen3-VL encoder path is driven by:
- the template in version/v8/templates/qwen3_vl_vision.json
- kernel maps in version/v8/kernel_maps/*.json
- overlay bindings in version/v8/kernel_maps/kernel_bindings.overlay.json
That is the correct direction: model structure in the template, kernel identity in maps, runtime function binding in the registry.
But build_ir_v8.py still contains real lowering policy: buffer contracts, generic attention selection, fallback rules, and parameter shaping for vision operators. That policy is mostly generic vision lowering rather than Qwen3-VL-specific hardcoding, but it is still code, not pure data.
Regression Gates That Matter
- make test-v8-qwen3vl
- make test-v8-vision-kernels
- make llamacpp-parity-full
- make test
These four are the minimum confidence set for future encoder compatibility work. If a parity fix only passes one of them, it is not ready.
Bottom Line
The engine did not need a special-case Qwen3-VL runtime. It needed one more round of exactness at the seam where GGML conventions still mattered. That is a good outcome: the path remained template- and kernel-driven, and the final fix improved the shared exactness story instead of adding one-off model glue.