---
layout: default
title: Vision Encoder Parity
permalink: /vision-encoder-parity/
---

Vision Encoder Parity

The Qwen3-VL vision encoder parity gap was closed on March 30, 2026. The generated C runtime now matches local llama.cpp end to end in native strict mode.

Current Result: Native strict parity for the generated v8 Qwen3-VL encoder is exact: max_abs=0.0, mean_abs=0.0, rmse=0.0, cosine=1.0. This is not the temporary mtmd-oracle shortcut.
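As a reference for what those four numbers mean, here is a minimal sketch of the metrics; the function name, shapes, and call style are illustrative, not the project's actual API.

```python
# Minimal sketch of the four parity metrics quoted above; names and shapes
# are illustrative assumptions, not the project's API.
import numpy as np

def parity_metrics(ref: np.ndarray, out: np.ndarray) -> dict:
    """Elementwise comparison of a candidate tensor against a reference."""
    r = ref.astype(np.float64).ravel()
    o = out.astype(np.float64).ravel()
    diff = np.abs(r - o)
    cosine = float(np.dot(r, o) / (np.linalg.norm(r) * np.linalg.norm(o)))
    return {
        "max_abs": float(diff.max()),
        "mean_abs": float(diff.mean()),
        "rmse": float(np.sqrt(np.mean(diff ** 2))),
        "cosine": cosine,
    }

# Bit-identical tensors give max_abs = mean_abs = rmse = 0.0 and
# cosine = 1.0 (up to float64 rounding in the norm product).
x = np.random.default_rng(0).standard_normal((4, 8)).astype(np.float32)
m = parity_metrics(x, x.copy())
assert m["max_abs"] == 0.0 and m["rmse"] == 0.0
assert abs(m["cosine"] - 1.0) < 1e-12
```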

What Was Actually Wrong

The final blocker was not patching, not deepstack stitching, and not a missing attention op. We initially localized the drift to full attention because that was the first place tiny errors became visible. The real remaining seed error sat later in the layer: the GGML fp16-table GELU path was not bit-identical to llama.cpp's.

Once layer-0 attention became exact, the first remaining mismatch showed up at ffn_gelu, then propagated through ffn_out, residual addition, and the remaining encoder stack.

Why This Took Time

Encoder Parity Is Not Just "Correct Math"

For decoder bring-up, mathematically correct kernels were often enough to converge quickly. For this vision encoder, the target was ggml-exact graph behavior. Tiny fp32 or fp16 rounding differences that look harmless in one row compound across full attention, MLP, and 27 layers of residual accumulation.

The painful lesson is simple: if a model is loaded from GGUF and the reference is llama.cpp, "close" is not done.
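The compounding effect can be illustrated with a toy residual stack; this is deliberately not the encoder itself, and the width (64) and weight scale (0.2) are arbitrary assumptions.

```python
# Toy residual stack, NOT the real encoder: it only shows how a roughly
# one-in-a-million seed error is amplified by 27 layers of residual
# accumulation. Width and weight scale are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(1)
layers = [rng.standard_normal((64, 64)).astype(np.float32) * np.float32(0.2)
          for _ in range(27)]

a = rng.standard_normal(64).astype(np.float32)
b = a.copy()
b[0] += np.float32(1e-6)      # seed error: one element off by ~one millionth

for W in layers:              # 27 residual layers, matching the encoder depth
    a = a + W @ a
    b = b + W @ b

drift = float(np.abs(a - b).max())
# drift ends up orders of magnitude above the ~1e-6 seed error
```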

Method That Worked

  1. Get end-to-end parity metrics first, so the failure is measurable.
  2. Dump per-activation tensors for both CK and llama.cpp.
  3. Find the first layer and first tensor that diverges.
  4. Replay that tensor boundary in isolation until the drift is explained.
  5. Fix the exact seed error, then rerun end-to-end because later layers amplify tiny misses.

Do not skip the first-divergence step. Large final embedding drift does not imply a large bug. In this case, a tiny fp16-table activation mismatch was enough to poison later layers.
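Step 3 above can be sketched as a scan over per-activation dumps; the dump layout here (one `.npy` file per tensor, identical names on both sides, names sorting in graph order) is an assumption, not the project's actual dump format.

```python
# Hypothetical first-divergence scan: walk dumped activations in graph
# order and report the first tensor where the two runtimes disagree.
# Assumes one .npy per tensor and that filenames sort in graph order.
from pathlib import Path
import numpy as np

def first_divergence(ck_dir: str, ref_dir: str, atol: float = 0.0):
    """Return (tensor_name, max_abs) for the first mismatch, or None."""
    for ck_path in sorted(Path(ck_dir).glob("*.npy")):
        ck = np.load(ck_path).astype(np.float64)
        ref = np.load(Path(ref_dir) / ck_path.name).astype(np.float64)
        max_abs = float(np.abs(ck - ref).max())
        if max_abs > atol:
            return ck_path.stem, max_abs
    return None   # bit-identical across every dumped tensor
```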

Major Issues To Watch For On Future Encoder Ports

1. GGML-Exact Activation Tables

If the reference path uses lookup-table math, especially fp16 tables, do not assume a "same formula" implementation is enough. The exact table and the exact fp32→fp16 conversion path matter.
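A lookup-table GELU in this style can be sketched as follows; the tanh-approximation constants are the standard ones, and the indexing scheme (one table entry per fp16 bit pattern of the input) is stated here as an assumption. The point is that the exact shipped table, not the formula, is the thing to match.

```python
# Sketch of an fp16 lookup-table GELU: one table entry per fp16 bit
# pattern, input converted fp32 -> fp16 to form the index. Indexing
# scheme is an assumption for illustration.
import numpy as np

SQRT_2_OVER_PI = np.float32(0.7978845608028654)
COEF = np.float32(0.044715)

def gelu_f32(x: np.ndarray) -> np.ndarray:
    """Direct tanh-approximation GELU, evaluated in fp32."""
    x = x.astype(np.float32)
    inner = SQRT_2_OVER_PI * (x + COEF * x * x * x)
    return np.float32(0.5) * x * (np.float32(1.0) + np.tanh(inner))

# One entry per fp16 bit pattern: table[bits] = fp16(gelu(fp32(bits))).
with np.errstate(invalid="ignore", over="ignore"):
    bits = np.arange(65536, dtype=np.uint16)
    table = gelu_f32(bits.view(np.float16).astype(np.float32)).astype(np.float16)

def gelu_table(x: np.ndarray) -> np.ndarray:
    """Lookup path: fp32 input -> fp16 bits -> table entry -> back to fp32."""
    idx = x.astype(np.float32).astype(np.float16).view(np.uint16)
    return table[idx].astype(np.float32)

# The lookup path quantizes through fp16 on the way in and on the way
# out, so it is close to, but generally not bit-identical to, direct
# fp32 evaluation of the same formula.
x = np.array([1.2345], dtype=np.float32)
direct, looked = gelu_f32(x)[0], gelu_table(x)[0]
```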

2. Full Attention Is Less Forgiving

Full bidirectional attention spreads tiny row-level differences across the entire layer. Decoder-style causal paths tend to localize drift much more naturally.

3. Mixed Dtype Boundaries Matter

Even when most activations are fp32, the decisive mismatch may come from an fp16 lookup, quantized projection boundary, or a reference-only conversion convention.
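A minimal illustration of such a boundary: a value that is stable in fp32 changes under an fp32 to fp16 to fp32 round trip, which is exactly the kind of conversion an fp16 lookup introduces.

```python
# Dtype-boundary round trip: the fp16 side of the boundary sees a
# slightly different value than the fp32 side produced.
import numpy as np

x = np.float32(0.1)
roundtrip = np.float32(np.float16(x))   # fp32 -> fp16 -> fp32
delta = abs(float(x) - float(roundtrip))
# delta is small (fp16 carries ~3 decimal digits) but nonzero, and
# nonzero is enough to seed the kind of drift described above.
```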

4. Standalone Test Builds Must Track Runtime Splits

Splitting production code into new helper files is safe, but standalone libs and generated harnesses must be updated too. Otherwise the main engine passes while test-only link targets fail.

What To Check First On The Next Encoder

Is The v8 IR Builder Hard-Coded?

Mostly no, but not entirely.

The Qwen3-VL encoder path is driven by:

  1. the model-structure template,
  2. the kernel identity maps, and
  3. the runtime function-binding registry.

That is the correct direction: model structure in the template, kernel identity in maps, runtime function binding in the registry.

But build_ir_v8.py still contains real lowering policy: buffer contracts, generic attention selection, fallback rules, and parameter shaping for vision operators. That policy is mostly generic vision lowering, not Qwen3-VL-specific hardcoding, but it is still code, not pure data.

The honest architectural answer: the system is template-driven and kernel-driven, but not yet 100% declarative. The builder still owns some generic lowering intelligence.

Regression Gates That Matter

These four are the minimum confidence set for future encoder compatibility work. If a parity fix only passes one of them, it is not ready.

Bottom Line

The engine did not need a special-case Qwen3-VL runtime. It needed one more round of exactness at the seam where GGML conventions still mattered. That is a good outcome: the path remained template- and kernel-driven, and the final fix improved the shared exactness story instead of adding one-off model glue.
