Vision Encoder Parity
Qwen3-VL vision encoder parity was closed on March 30, 2026. The generated C runtime now matches local llama.cpp end to end in native strict mode.
max_abs=0.0, mean_abs=0.0, rmse=0.0, cosine=1.0.
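For reference, the four metrics above can be computed from two flat activation dumps in a few lines. This is a minimal sketch (the function name and pure-Python style are illustrative, not the project's actual tooling):

```python
import math

def parity_metrics(ref, got):
    """Compare two equal-length flat activation dumps element-wise."""
    assert len(ref) == len(got) and len(ref) > 0
    diffs = [abs(r - g) for r, g in zip(ref, got)]
    max_abs = max(diffs)
    mean_abs = sum(diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    dot = sum(r * g for r, g in zip(ref, got))
    norm = math.sqrt(sum(r * r for r in ref)) * math.sqrt(sum(g * g for g in got))
    cosine = dot / norm if norm else 1.0
    return {"max_abs": max_abs, "mean_abs": mean_abs,
            "rmse": rmse, "cosine": cosine}
```

Exact parity means max_abs, mean_abs, and rmse are all literally 0.0 and cosine is literally 1.0, not merely close.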
This is not the temporary mtmd-oracle shortcut.
What Was Actually Wrong
The final blocker was not patching, not deepstack stitching, and not a missing attention op. We initially localized the drift to full attention because that was the first place tiny errors became visible. The real remaining seed was later in the layer: the GGML fp16-table GELU path was not bit-identical to llama.cpp.
Once layer-0 attention became exact, the first remaining mismatch showed up at ffn_gelu, then propagated through ffn_out, residual addition, and the remaining encoder stack.
Why This Took Time
Encoder Parity Is Not Just "Correct Math"
For decoder bring-up, mathematically correct kernels were often enough to converge quickly.
For this vision encoder, the target was ggml-exact graph behavior.
Tiny fp32 or fp16 rounding differences that look harmless in one row compound across full attention, MLP, and 27 layers of residual accumulation.
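A toy residual stack makes the compounding concrete. The numbers here are illustrative, not the encoder's actual values: a per-layer mismatch of 1e-6 grows by roughly two orders of magnitude over 27 residual layers whenever the residual branch amplifies the signal:

```python
def run_stack(x, layers=27, eps=0.0):
    """Toy residual stack; eps models a tiny per-layer kernel mismatch."""
    for _ in range(layers):
        branch = 0.1 * x  # stand-in for the attention/MLP branch output
        x = x + branch + eps
    return x

exact = run_stack(1.0)               # bit-exact port
drifted = run_stack(1.0, eps=1e-6)   # port with a 1e-6 per-layer error
# |drifted - exact| ends up around 1.2e-4: roughly 120x the per-layer error.
```

This is why "harmless in one row" is meaningless as a quality bar for a 27-layer encoder.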
The painful lesson is simple: if a model is loaded from GGUF and the reference is llama.cpp, "close" is not done.
Method That Worked
- Get end-to-end parity metrics first, so the failure is measurable.
- Dump per-activation tensors for both CK and llama.cpp.
- Find the first layer and first tensor that diverges.
- Replay that tensor boundary in isolation until the drift is explained.
- Fix the exact seed error, then rerun end-to-end because later layers amplify tiny misses.
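The first-divergence step above can be sketched as a linear scan over graph-ordered tensor dumps. Tensor names and the dump format here are hypothetical:

```python
def first_divergence(dumps, tol=0.0):
    """dumps: ordered list of (tensor_name, ref_values, got_values),
    in graph execution order. Returns (name, max_abs) for the first
    tensor whose max abs diff exceeds tol, or None if all match."""
    for name, ref, got in dumps:
        max_abs = max((abs(r - g) for r, g in zip(ref, got)), default=0.0)
        if max_abs > tol:
            return name, max_abs
    return None
```

With tol=0.0 this enforces strict bit-level parity; a nonzero tol is only useful for triaging which seam to replay first, never as the acceptance bar.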
Major Issues To Watch For On Future Encoder Ports
1. GGML-Exact Activation Tables
If the reference path uses lookup-table math, especially fp16 tables, do not assume a "same formula" implementation is enough. The exact table and the exact fp32→fp16 conversion path matter.
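The shape of a table-based GELU path can be sketched as follows. This mirrors the structure of ggml's fp16 GELU table but is not its exact implementation; the precise table contents and fp32→fp16 conversion inside ggml are exactly what must be replicated for bit parity:

```python
import math
import struct

def f32_to_f16_bits(x):
    # Round-to-nearest IEEE half-precision packing via the struct module.
    return struct.unpack('<H', struct.pack('<e', x))[0]

def f16_bits_to_f32(bits):
    return struct.unpack('<e', struct.pack('<H', bits))[0]

def gelu_ref(x):
    # tanh-approximation GELU, the formula such a table is built from.
    return 0.5 * x * (1.0 + math.tanh(0.7978845608028654
                                      * (x + 0.044715 * x ** 3)))

# Build a lookup table over every fp16 bit pattern, storing entries as
# fp16. Output precision is therefore capped at fp16, unlike evaluating
# the formula directly in fp32.
TABLE = [0.0] * 65536
for bits in range(65536):
    x = f16_bits_to_f32(bits)
    y = gelu_ref(x) if math.isfinite(x) else x
    TABLE[bits] = f16_bits_to_f32(f32_to_f16_bits(y))

def gelu_table(x):
    # The input is routed through its fp16 bit pattern, as a table path does.
    return TABLE[f32_to_f16_bits(x)]
```

The point is not this particular table; it is that two implementations agreeing on the formula can still disagree in the stored fp16 entries or in the conversions on the way in and out.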
2. Full Attention Is Less Forgiving
Full bidirectional attention spreads tiny row-level differences across the entire layer. Decoder-style causal paths tend to localize drift much more naturally.
3. Mixed Dtype Boundaries Matter
Even when most activations are fp32, the decisive mismatch may come from an fp16 lookup, a quantized projection boundary, or a reference-only conversion convention.
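A one-line round trip shows why such a boundary shifts values at all: most fp32 values are not representable in fp16, so a reference-side fp16 conversion perturbs everything downstream. A minimal sketch:

```python
import struct

def round_trip_f16(x):
    """fp32 value -> fp16 -> fp32, as at a mixed-dtype tensor boundary."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

print(round_trip_f16(0.5))  # 0.5 is exactly representable: unchanged
print(round_trip_f16(0.1))  # 0.1 is not: comes back slightly perturbed
```

If the reference performs this round trip and the port does not (or performs it with different rounding), the two sides disagree on every downstream element.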
4. Standalone Test Builds Must Track Runtime Splits
Splitting production code into new helper files is safe, but standalone libs and generated harnesses must be updated too. Otherwise the main engine passes while test-only link targets fail.
What To Check First On The Next Encoder
- Template contract: make sure the model path is expressed in the template, not hidden in lowering code.
- Kernel map coverage: verify every required op resolves to a concrete kernel id before codegen.
- Layer-0 first divergence: do not start with layer 12 or final embeddings.
- Reference activations: dump both sides and replay isolated seams before editing kernels broadly.
- Activation-table parity: GELU, RoPE tables, and any special lookup path deserve direct standalone checks.
Is The v8 IR Builder Hard-Coded?
Mostly no, but not entirely.
The Qwen3-VL encoder path is driven by:
- the template in version/v8/templates/qwen3_vl_vision.json
- kernel maps in version/v8/kernel_maps/*.json
- overlay bindings in version/v8/kernel_maps/kernel_bindings.overlay.json
That is the correct direction: model structure in the template, kernel identity in maps, runtime function binding in the registry.
But build_ir_v8.py still contains real lowering policy: buffer contracts, generic attention selection, fallback rules, and parameter shaping for vision operators. That policy is mostly generic vision lowering rather than Qwen3-VL-specific hardcoding, but it is still code, not pure data.
Regression Gates That Matter
- make test-v8-qwen3vl
- make test-v8-vision-kernels
- make llamacpp-parity-full
- make test
These four are the minimum confidence set for future encoder compatibility work. If a parity fix only passes one of them, it is not ready.
Bottom Line
The engine did not need a special-case Qwen3-VL runtime. It needed one more round of exactness at the seam where GGML conventions still mattered. That is a good outcome: the path remained template- and kernel-driven, and the final fix improved the shared exactness story instead of adding one-off model glue.