IR Pipeline v6.6
This page explains the v6.6 pipeline using visuals that match how the engine actually builds and runs models today. The goal is clarity: templates are declarative, IR is dynamic, and codegen is intentionally dumb.
See v7-backprop-ir.html for the IR1/IR2 + layout/codegen training flow, canary diagnostics, and the oracle parity runbook.
Read This Diagram in 30 Seconds
- IR1 validates template logic and resolves kernel IDs
- IR2 is where producer/consumer wiring, conservative fusion, and parallel-safe scheduling are decided
- IR3 freezes memory layout (arena offsets + pointer expressions) and emits call-ready arguments
- Runtime C executes that fixed plan using thread-pool orchestration and ISA dispatch (scalar/AVX/AVX2/VNNI)
If you forget where a bug belongs: dataflow issues usually originate in IR2, offset/alias issues in IR3, and call-order issues in codegen/runtime glue.
Operator Start Here (Snapshot: February 11, 2026)
If you revisit this page in six months, start with the gate entrypoint. It encodes the current v6.6 release contract better than any prose summary.
One Command First
make v6.6-gate
This runs kernel-map sync, tooling contracts, matrix smoke, parity matrix (runtime-optional), and long-decode stability in sequence.
Runtime Modes and Preconditions
v6.6 operates in two practical modes: build-smoke mode and full-parity mode. Full parity requires llama parity runtime artifacts; without them, strict parity gates can report SKIP and fail under --require-all.
Minimal Preflight (No Parity Runtime)
make v6.6-kernel-map-gate
make v6.6-validate-contracts
make v6.6-validate-matrix-smoke
Use this path to validate IR/codegen/build stability without parity runtime binaries.
Full Release Gate (Parity Required)
make llamacpp-parity-build
make v6.6-gate
make v6.6-validate-parity-matrix-required
Use this path before release claims that depend on strict CK vs llama parity confidence.
Fast Failure Triage
Do not read all logs linearly. Start from the first failed gate, open the corresponding JSON report, then move upstream to the producer script.
| Gate | Command | First File to Inspect |
|---|---|---|
| L0 Kernel map sync | make v6.6-kernel-map-gate | version/v6.6/kernel_maps/KERNEL_REGISTRY.json |
| L1 Tooling contracts | make v6.6-validate-contracts | version/v6.6/tools/contract_report_latest.json |
| L2 Matrix smoke | make v6.6-validate-matrix-smoke | version/v6.6/tools/model_matrix_report_latest.json |
| L3 Parity matrix (runtime-optional) | make v6.6-validate-parity-matrix | version/v6.6/tools/parity_matrix_report_latest.json |
| L3R Strict parity matrix (runtime-required) | make v6.6-validate-parity-matrix-required | version/v6.6/tools/parity_matrix_report_latest.json |
| L4 Long decode stability | make v6.6-validate-longdecode | version/v6.6/tools/long_decode_report_latest.json |
Live Gate Status (From JSON Reports)
This panel reads the latest v6.6 gate artifacts from version/v6.6/tools/*.json and summarizes gate health at a glance.
v6.6 Gate Dashboard
Validation + Test Gates (Detailed View)
This stack diagram shows what each gate validates, where it runs, and how failures surface in artifacts and CI output.
Artifact Lineage (Who Produces What)
For fast debugging, think in artifact lineage: report JSON -> producer script -> upstream artifact input.
Status Semantics
Gate status words are shared, but strictness is configurable. Use this table as operational truth.
| Status | Operational Meaning | Effect in Strict Gate |
|---|---|---|
| PASS | Contract/rule satisfied | Continues |
| WARN | Potential drift, still executable | Can fail when strict mode is enabled |
| SKIP | Validation not run (runtime/input unavailable) | Fails when --require-all is active |
| FAIL | Contract violation or runtime error | Stops gate immediately |
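The strictness rules in this table can be sketched as a small decision function. This is an illustrative model only: the function name and the strict/require_all flags mirror the table and the --require-all option described above, not the real gate implementation.

```python
# Sketch of strict-gate status handling; names are illustrative, not the
# actual gate runner's API.
def gate_outcome(status: str, strict: bool = False, require_all: bool = False) -> str:
    """Map a shared status word to a gate action under configured strictness."""
    if status == "PASS":
        return "continue"
    if status == "WARN":
        # WARN is executable but can fail when strict mode is enabled.
        return "fail" if strict else "continue"
    if status == "SKIP":
        # SKIP (runtime/input unavailable) fails when --require-all is active.
        return "fail" if require_all else "continue"
    if status == "FAIL":
        # FAIL stops the gate immediately.
        return "stop"
    raise ValueError(f"unknown status: {status}")
```

The key operational point the sketch encodes: SKIP is only safe in runtime-optional mode; under --require-all it is treated as a failure.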
v6.6 Evolution Timeline
This timeline uses explicit dates so later readers can reconstruct why gate behavior changed and what constraints were introduced.
Edge-Case Coverage to Preserve
This matrix captures edge cases that matter most for keeping v6.6 stable over time.
High-Value Test Cases and Edge Cases
- No parity runtime installed: verify the expected SKIP path and the explicit failure path for required parity mode
- Offline cached model flow: v6.6-validate-matrix-smoke should pass when the cache is warm and the network is unavailable
- Sliding-window boundaries: validate sliding_window = -1, 0, 1, > seq_len for both prefill and decode
- Model override hygiene: wrappers passing --model/--model-dir must avoid hidden default-cache behavior
- Layout compatibility: ensure flat memory.weights.entries and legacy formats both remain parseable where needed
- Concurrent gate runs: protect shared cache outputs from cross-run artifact corruption
- Long decode reliability: enforce minimum decode-run count and stable no-NaN/no-Inf summary checks
Six-Month Operator Memory Card
This card is intentionally redundant. If you read only one visual after a long break, read this one.
Kernel AMP Strategy
AMP here means automatic mixed precision per operation. It is not the GPU-style AMP most people know. In v6.6, AMP decides whether each op runs on FP32 activations or Q8 activations based on quant summary, template flags, and kernel availability.
Templates to IR
Templates are architecture-level graphs. They define the op sequence and can optionally provide kernel overrides, but the IR builder still resolves the final kernel IDs from the registry based on quantization and availability.
Template Mechanics (How Ops Are Chosen)
Templates are declarative. They list ops in order and set flags. The IR builder interprets those ops, then resolves kernels using the registry. Templates may provide targeted kernel overrides (for stability or parity), but they never contain function pointers or memory layout details.
Example Template Snippet
{
"name": "qwen3",
"flags": {
"use_qk_norm": true,
"prefer_fp32_logits": false
},
"block_types": {
"decoder": {
"body": { "ops": ["attn_norm", "qkv_proj", "qk_norm", "rope_qk", "attn", "out_proj", "residual_add",
"ffn_norm", "mlp_gate_up", "silu_mul", "mlp_down", "residual_add"] }
}
}
}
IR1 builder expands this into concrete ops, assigns IDs, and attaches dataflow.
Why IR1 Exists
IR1 is the validation and contract layer. It ensures the template ops are valid, mapped, and supported before we allocate memory or emit C. This is the earliest point to fail fast.
IR1 Responsibilities
- Validate template ops are mapped to kernel families
- Check kernel availability in the registry
- Attach dataflow edges (op IDs → inputs)
- Record kernel IDs without allocating memory
Implementation: version/v6.6/scripts/build_ir_v6_6.py
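The responsibilities above can be condensed into a validate-then-record pass. This is a hedged sketch of the idea, not the internals of build_ir_v6_6.py: the registry is modeled as a flat op-name-to-kernel-ID map, and dataflow is simplified to a linear chain.

```python
# Illustrative IR1 pass: validate template ops against a registry, attach
# dataflow edges, record kernel IDs -- no memory is allocated here.
def build_ir1(template_ops, registry):
    """Expand template ops into IR1 ops with kernel IDs and dataflow edges."""
    ir1 = []
    prev_id = None
    for op_id, op_name in enumerate(template_ops):
        kernel_id = registry.get(op_name)
        if kernel_id is None:
            # Fail fast: IR1 is the earliest point to reject an unmapped op.
            raise ValueError(f"op '{op_name}' has no kernel in registry")
        ir1.append({
            "op_id": op_id,
            "op": op_name,
            "kernel": kernel_id,
            # Simplified linear dataflow: each op consumes its predecessor.
            "dataflow": {"inputs": {"x": prev_id}},
        })
        prev_id = op_id
    return ir1
```

The design point it illustrates: an unmapped op aborts the build before any layout or codegen work happens.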
Kernel Maps + Registry
Kernel maps define how kernels are registered and discovered. IR never hardcodes kernels — it resolves them from the registry based on kernel IDs and supported dtypes.
Kernel Map → C Signature
Kernel maps describe the C function signature, expected dtypes, and buffer roles. The IR lowerer wires input/output pointers and sizes based on this schema.
Example Kernel Map Snippet
{
"id": "gemv_q8_0",
"signature": "void gemv_q8_0(const void* w, const float* x, float* y, int m, int k)",
"quant": { "weight": "q8_0", "activation": "fp32" },
"buffers": {
"w": { "role": "weight" },
"x": { "role": "activation", "buffer": "main_stream" },
"y": { "role": "activation", "buffer": "main_stream" }
}
}
Kernel Map JSON → C Args (Concrete Mapping)
| Kernel Map Field | Meaning | Lowered/C Code |
|---|---|---|
| inputs[].name | Argument name in C signature | Pointer emitted by IR lower (e.g., model->q_scratch) |
| inputs[].dtype | Expected activation dtype | Kernel variant selection (fp32 vs q8) |
| outputs[].name | Output buffer label | Pointer emitted by IR lower |
| params[].name | Scalar arg (dims, stride) | Literal value in call |
Example Kernel Map Entry
{
"id": "embedding_forward_q8_0",
"inputs": [{ "name": "tokens", "dtype": "int32", "shape": ["T"] }],
"outputs": [{ "name": "output", "dtype": "fp32", "shape": ["T","E"] }]
}
Lowered call will pass token_ids and embedded_input pointers.
See: version/v6.6/kernel_maps/embedding_forward_q8_0.json
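The mapping from kernel-map entry to call arguments can be sketched as a lookup over name bindings. The field names mirror the entry above; the binding names (token_ids, embedded_input) are the pointers the prose says the lowered call will pass, and the function itself is an assumption about the lowerer, not its real code.

```python
# Sketch: turn a kernel-map entry plus name->pointer bindings into an
# ordered argument list (inputs first, then outputs).
def lower_call(kernel_map, bindings):
    """Return (function_name, args) for one kernel-map entry."""
    args = []
    for spec in kernel_map.get("inputs", []) + kernel_map.get("outputs", []):
        name = spec["name"]
        if name not in bindings:
            raise KeyError(f"no buffer bound for argument '{name}'")
        args.append(bindings[name])
    return kernel_map["id"], args

fn, args = lower_call(
    {"id": "embedding_forward_q8_0",
     "inputs": [{"name": "tokens", "dtype": "int32", "shape": ["T"]}],
     "outputs": [{"name": "output", "dtype": "fp32", "shape": ["T", "E"]}]},
    {"tokens": "token_ids", "output": "embedded_input"},
)
```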
Producer vs Consumer (Plain Language)
In this page, producer and consumer are data-flow terms, not people. A producer writes a value/buffer/artifact. A consumer reads it in the same stage or the next stage.
| Context | Producer | Consumer | What It Means |
|---|---|---|---|
| Stage handoff | Current stage script writes JSON/C artifact | Next stage script reads that artifact | Example: build_ir_v6_6.py produces lowered_decode_call.json; codegen_v6_6.py consumes it |
| IR op graph (IR1) | Earlier op writes named outputs | Later op references those outputs via dataflow.inputs | If op 13 has "q": 12, op 12 is the producer of q and op 13 is its consumer |
| Kernel call contract | outputs[] args in kernel map | inputs[] args in kernel map | The kernel map defines who writes and who reads each call argument |
| Memory planner | Logical buffer assignment for produced tensors | Later calls reading the same logical value | The planner decides where produced data lives (scratch/main/KV) so consumers can read it safely |
| Codegen/runtime | Lowered call IR produces concrete call args in C | Compiled runtime executes calls in order | Codegen does not invent producers/consumers; it emits what IR already resolved |
How to Read One Dataflow Edge
{
"op_id": 12,
"op": "qkv_proj",
"outputs": { "q": "...", "k": "...", "v": "..." }
}
{
"op_id": 13,
"op": "rope_qk",
"dataflow": { "inputs": { "q": 12, "k": 12 } }
}
Interpretation: op 12 produces q/k; op 13 consumes those outputs and produces rotated q/k for the next consumer (usually attention).
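Reading an edge like this mechanically is a one-liner over the JSON. A minimal sketch, assuming the IR1 field shapes shown above (integer sources are op IDs; external inputs carry dicts instead):

```python
# Sketch: resolve the producer op for each consumer input of one IR1 op.
def producers_of(op, ops_by_id):
    """Return {input_name: producer_op} for one op's dataflow inputs."""
    return {
        name: ops_by_id[src]
        for name, src in op.get("dataflow", {}).get("inputs", {}).items()
        if isinstance(src, int)  # external inputs are dicts, not op IDs
    }
```

Applied to the snippet above: for op 13, both q and k resolve to op 12, matching the interpretation in the prose.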
When QKV "Split" Happens (Head Handling)
Short answer: in v6.6, split happens during Q/K/V projection itself (head-major outputs), not as a separate "split kernel" after one big projection.
| Stage | What Happens | Shape Intent |
|---|---|---|
| Q/K/V projection | IR emits q_proj, k_proj, v_proj (or fused head-major QKV in selected paths) | q: [H, T, AD], k/v: [KV, T, AD] (already split by heads) |
| QK norm (optional) | qk_norm runs on projected Q/K | Same head-major layout, normalized in place |
| RoPE | rope_qk rotates Q and K only | Q/K stay head-major; V is unchanged |
| Attention | attention_forward_*_head_major_gqa_* consumes Q/K/V and computes attention | Kernel loops over all query heads internally and maps them to KV heads via GQA |
Do We Do QKV + RoPE "All At Once"?
Not in the default conservative v6.6 path.
- Q, K, V are produced first (three ops or a selected fused QKV op).
- RoPE runs after Q/K exist, before attention.
- Attention kernel then handles head iteration and GQA mapping internally.
A sliding-window variant uses the same contract and adds sliding_window as an extra runtime parameter.
Kernel-Level Head Mapping (What "Automatic" Means)
Inside attention kernels, heads are iterated in-kernel and mapped to KV groups:
for (h = 0; h < num_heads; ++h) {
kv_head = h * num_kv_heads / num_heads; // GQA mapping
... attention for this head ...
}
Caller still provides num_heads, num_kv_heads, head_dim, and buffers. The kernel does not infer model dims.
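The mapping arithmetic in the C loop above is easy to sanity-check outside the kernel. A Python restatement of the same integer expression (this is a checking aid, not engine code):

```python
# Python restatement of the in-kernel GQA mapping: query head h maps to a
# KV head via integer division, grouping num_heads/num_kv_heads query heads
# per KV head. Degenerates to identity when num_kv_heads == num_heads (MHA).
def gqa_kv_head(h: int, num_heads: int, num_kv_heads: int) -> int:
    """Map query head h to its KV head (same formula as the C kernel loop)."""
    return h * num_kv_heads // num_heads
```

For example, with num_heads=8 and num_kv_heads=2, heads 0..3 share KV head 0 and heads 4..7 share KV head 1.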
References
- version/v6.6/kernel_maps/attention_forward_causal_head_major_gqa_flash_strided.json
- version/v6.6/kernel_maps/attention_forward_causal_head_major_gqa_flash_strided_sliding.json
- src/kernels/attention_kernels.c
- src/kernels/attention_kernels_sliding.c
IR1 to IR Lowering
IR1 is a declarative graph. IR2 adds scheduling and kernel selection details. The lower stages compute memory layout, resolve offsets, and produce call arguments for codegen.
What IR2 Does (Optional, Conservative Optimizations)
IR2 is the optimization staging area. This is where we can apply fusions, reorder safe ops, or choose specialized kernels. In v6.6, most of this is intentionally conservative or disabled to avoid behavioral drift.
IR2 Focus
- Optional fusion patterns (e.g., norm + QKV)
- Kernel variant selection for prefill vs decode
- Keeping behavior stable while enabling future speedups
IR2 Fusions (Currently Disabled / Conservative)
- version/v6.6/kernel_maps/mega_fused_attention_prefill.json
- version/v6.6/kernel_maps/mega_fused_outproj_mlp_prefill.json
- version/v6.6/kernel_maps/mega_fused_attention_decode_q5_0.json
These exist in the registry but are kept conservative in v6.6 to avoid parity drift. Some require head-major constraints or rely on quantized activation contracts that are still being stabilized.
Fusion vs Unfused (One Attention Block)
Dataflow Stitching (How IR Connects Ops)
IR1 uses op IDs and named outputs. Each op declares its inputs as outputs of previous ops. The dataflow tracker builds a graph, and the memory planner assigns buffers to each edge.
IR1 Example (Stitching)
{
"op_id": 12,
"op": "qkv_proj",
"outputs": { "q": "q_scratch", "k": "k_scratch", "v": "v_scratch" },
"dataflow": { "inputs": { "x": 11 } }
}
{
"op_id": 13,
"op": "rope_qk",
"dataflow": { "inputs": { "q": 12, "k": 12 } }
}
Pipeline Outputs (Artifacts)
Each stage writes explicit artifacts. These files are the contract between stages and are the inputs for debugging tools.
IR File Schema (Quick Reference)
| File | Key Fields | Purpose |
|---|---|---|
| ir1_decode.json | ops, dataflow, kernel | Validated op graph, kernel IDs, dataflow edges |
| lowered_decode.json | operations, args, config | Call-ready args, resolved dims, runtime pointers |
| layout_decode.json | memory.weights, memory.activations | Buffer offsets and sizes for weights and activations |
| init_call.json | ops, params | One-time init kernels (RoPE cache, KV init) |
JSON Walkthrough
This diagram shows how template → IR1 → lowered IR → layout relate. It is the quickest way to understand where each value comes from.
IR Timeline (Why Each Stage Exists)
Real Snippet (From Model Cache)
This is a real IR1 + lowered op pulled from a v6.6 model cache (Qwen2‑0.5B). It shows how the same op is represented before and after lowering.
IR1 (before lowering)
{
"op_id": 0,
"kernel": "embedding_forward_q8_0",
"op": "dense_embedding_lookup",
"section": "header",
"layer": -1,
"dataflow": {
"inputs": { "token_ids": { "from": "external:token_ids", "dtype": "i32" } },
"outputs": { "out": { "dtype": "fp32" } }
},
"weights": {
"token_emb": { "dtype": "q8_0", "offset": 496, "size": 144643072 }
}
}
Lowered (call-ready)
{
"kernel": "embedding_forward_q8_0",
"function": "embedding_forward_q8_0",
"weights": {
"token_emb": { "ptr_expr": "bump_weights + 0", "dtype": "q8_0" }
},
"activations": {
"tokens": { "ptr_expr": "activations + 16384", "dtype": "int32" }
},
"outputs": {
"output": { "ptr_expr": "activations + 20480", "dtype": "fp32" }
}
}
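The ptr_expr strings above can be resolved mechanically once base addresses are known. A hedged sketch, assuming the grammar is simply "base + byte_offset" with named arena bases (the real lowerer's expression grammar may be richer):

```python
# Sketch: evaluate a lowered "ptr_expr" like "activations + 16384" against
# a map of base-name -> address. Grammar assumption: "base" or "base + int".
def resolve_ptr(ptr_expr: str, bases: dict) -> int:
    """Return the concrete address for a 'base + offset' pointer expression."""
    base, _, offset = ptr_expr.partition("+")
    off = int(offset.strip()) if offset.strip() else 0
    return bases[base.strip()] + off
```

This is the sense in which IR3 "freezes" layout: every pointer the runtime sees is a fixed base plus a fixed offset, decided before any C runs.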
Scratch vs Persistent Buffers
Generated C has two concerns: logic (which kernel to call) and memory (where buffers live). Scratch buffers are reused aggressively, while persistent buffers (KV cache, logits) are stable across steps.
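The scratch/persistent split can be pictured as a two-region arena: persistent buffers (KV cache, logits) keep fixed offsets across steps, while scratch is a bump region reclaimed wholesale each step. A toy sketch of that discipline (sizes and class names are illustrative, not the engine's planner):

```python
# Toy arena: persistent region below a fixed boundary, scratch bumped above
# it and reset between decode steps.
class Arena:
    def __init__(self, persistent_size: int):
        self.persistent_end = persistent_size  # KV cache, logits live below
        self.cursor = persistent_size          # scratch grows above

    def alloc_scratch(self, size: int) -> int:
        """Bump-allocate a scratch buffer; returns its byte offset."""
        off = self.cursor
        self.cursor += size
        return off

    def reset_scratch(self):
        """Reclaim all scratch at once; persistent offsets are untouched."""
        self.cursor = self.persistent_end
```

Because offsets are decided by this kind of plan ahead of time, identical allocation decisions replay on every step, which is what keeps the runtime deterministic.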
Logic vs Memory
We separate logic (what to run) from memory (where it lives). This is why the pipeline stays deterministic and codegen can stay dumb.
Codegen Is Dumb (By Design)
Codegen only prints what the lowered IR says. This keeps C output clean and predictable. It also makes failures traceable to the IR builder, not the generator.
Example Generated C Call
/* Lowered IR says: gemv_q8_0(w2, x, y, m, k) */
gemv_q8_0(
model->bump + W_L2_MLP_DOWN, // weight pointer (from layout)
model->layer_input, // activation pointer
model->mlp_scratch, // output scratch
INTERMEDIATE_SIZE, // m
EMBED_DIM // k
);
Codegen does not decide kernels. It emits the call with exact pointers and sizes from IR + layout.
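A minimal sketch of this "dumb" emitter, assuming the lowered-op field names shown earlier (weights/activations/outputs entries carrying ptr_expr, plus optional scalar params); the real codegen_v6_6.py will differ in detail:

```python
# Sketch of decision-free codegen: format exactly what the lowered IR
# resolved, choosing nothing.
def emit_call(op) -> str:
    """Format one lowered op as a C call statement."""
    args = [spec["ptr_expr"] for spec in
            list(op.get("weights", {}).values())
            + list(op.get("activations", {}).values())
            + list(op.get("outputs", {}).values())]
    args += [str(p) for p in op.get("params", [])]  # scalar dims/strides
    return f'{op["function"]}({", ".join(args)});'
```

Note what is absent: no kernel selection, no offset math, no dtype logic. If the emitted call is wrong, the bug is upstream in the IR builder or layout, which is exactly the traceability property the prose claims.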
Why C (Not Rust) for Runtime Control
The runtime is intentionally written in C so we control both the call logic and the memory layout without compiler-owned decisions. C gives explicit control over alignment, buffer reuse, and function dispatch — exactly what we need for deterministic, cache-aware inference.
View This in the IR Visualizer
The IR visualizer can load IR1, lowered IR, and layouts directly from a model cache directory.
Quick Start
python version/v6.6/tools/open_ir_visualizer.py --list
python version/v6.6/tools/open_ir_visualizer.py gemma3
# Or generate only (no browser):
python version/v6.6/tools/open_ir_visualizer.py --generate gemma3
# Generate with decode profile artifacts:
python version/v6.6/tools/open_ir_visualizer.py --generate gemma3 --with-profile --force-compile
# Generate with probes (memory sign-off + perf stat/flamegraph + perf budgets):
python version/v6.6/tools/open_ir_visualizer.py --generate gemma3 --with-probes --force-compile
This generates ir_report.html inside the model’s ck_build directory.
For cached model aliases that are not directly runnable checkpoints, pass --run-model hf://.../model.gguf. Example for Gemma:
--run-model hf://unsloth/gemma-3-270m-it-GGUF/gemma-3-270m-it-Q5_K_M.gguf --chat-template none
Manual Load
Open version/v6.6/tools/ir_visualizer.html in a browser and load:
- ir1_decode.json / ir1_prefill.json
- lowered_decode_call.json / lowered_prefill_call.json (or lowered_*.json fallback)
- layout_decode.json / layout_prefill.json
- weights_manifest.json (optional but recommended)
- profile_summary.json (optional)
Recommended Run Flow (Baked Report)
# 1) List models that already have IR artifacts
python version/v6.6/tools/open_ir_visualizer.py --list
# 2) Open an interactive report in your browser
python version/v6.6/tools/open_ir_visualizer.py Qwen--Qwen3-0.6B-GGUF
# 3) Generate report only (no auto-open)
python version/v6.6/tools/open_ir_visualizer.py --generate Qwen--Qwen3-0.6B-GGUF
# Optional: custom output path
python version/v6.6/tools/open_ir_visualizer.py --generate Qwen--Qwen3-0.6B-GGUF --output /tmp/ir_report.html
# One-command rich report (profile + probes + embedded artifacts)
python version/v6.6/tools/open_ir_visualizer.py \
  --generate Qwen--Qwen3-0.6B-GGUF \
  --with-probes --force-compile \
  --output /tmp/ir_report.html
By default this writes ir_report.html in the model's ck_build directory.
The generated report now embeds decode/prefill IR data so it opens as a standalone artifact.
Open the Generated Report
# Linux desktop
xdg-open ~/.cache/ck-engine-v6.6/models/Qwen--Qwen3-0.6B-GGUF/ir_report.html
# Or run auto-open directly (no separate xdg-open step)
python version/v6.6/tools/open_ir_visualizer.py Qwen--Qwen3-0.6B-GGUF
If your environment has no GUI, copy the generated ir_report.html to a machine with a browser and open it there.
How to Read the Viewer in 10 Minutes
| Panel | What to Verify | Bug Signal to Watch |
|---|---|---|
| Operator Snapshot | Mode (decode/prefill), required file coverage, warnings, runbook | Missing required files, unexpected warnings, wrong model metadata |
| Memory Layout | Weights/activations sizes and region bars match expectation | Offsets jump backward, suspicious tiny/huge buffers, KV cache missing |
| Kernel Flow | Op order is coherent; expected kernels appear per layer | Missing op families, wrong kernel variants, layer count drift |
| Quantization Audit | Dtype-to-kernel mapping consistency (q8_0 -> q8 kernels, etc.) | Dtype mismatch rows, unexpected fp32 fallback in quant path |
| Dataflow Graph | Producer/consumer edges for q/k/v, residual, and MLP chain | Uninitialized inputs, missing producer IDs, broken residual flow |
| Profile | Hotspots align with architecture expectations | Sudden hotspot shift without a deliberate kernel/runtime change |
Producer/Consumer Tracing (Concrete Recipe)
1. Load the report and select the right mode (decode or prefill).
2. Open the Dataflow Graph and filter to the layer/op of interest (e.g., rope_qk, attn).
3. Read the op row: find inputs with from_op and from_output.
4. Jump to that producer op ID and confirm it emits the named output with the expected dtype.
5. Cross-check memory implications in Memory Layout (buffer exists, size and role are sane).
Rule of thumb: every consumer input should trace to exactly one intentional producer path. If you cannot explain the edge in one sentence, treat it as suspect.
When Tests Pass But You Still Suspect a Bug
- Decode vs prefill divergence: same op family but different kernel variant or dtype without explicit reason
- Residual path anomalies: extra/missing residual save-add pair in one layer
- Silent dtype drift: quantized weights but fp32-heavy execution path in lowered calls
- Memory planner smell: unusually large scratch growth or KV cache footprint jump
- Graph coherence: orphan node/edge patterns that suggest missing producer wiring
These are high-signal human checks that catch regressions before they become parity failures.
Operator Workflow (Daily Use)
# Gate first
make v6.6-gate
# For release-level strictness
make v6.6-validate-parity-matrix-required
# Then inspect the visual report for topology/memory sanity
python version/v6.6/tools/open_ir_visualizer.py Qwen--Qwen3-0.6B-GGUF
What AMP Means Here
In v6.6, AMP means automatic mixed precision per op. It decides the activation dtype each kernel uses. For example, Qwen2/Qwen3 can use Q8 activations for GEMV/GEMM, while Gemma may prefer FP32 logits for stability. This is not GPU AMP — it is a CPU‑oriented kernel selection policy.
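The per-op decision described above can be sketched as a small policy function. This is an illustrative model of the inputs named in the prose (quant summary, template flags, kernel availability); the fallback order and signature are assumptions, not the engine's real policy code.

```python
# Hedged sketch of per-op AMP: choose fp32 or q8 activations for one op.
def pick_activation_dtype(op: str, weight_dtype: str,
                          prefer_fp32_logits: bool,
                          q8_kernel_available: bool) -> str:
    """Decide the activation dtype one kernel call will use."""
    if op == "logits" and prefer_fp32_logits:
        return "fp32"  # stability override (e.g., Gemma's template flag)
    if weight_dtype.startswith("q") and q8_kernel_available:
        return "q8"    # quantized activations when a q8 kernel variant exists
    return "fp32"      # safe default when no quantized path applies
```

This captures the Qwen-vs-Gemma contrast in the prose: with quantized weights and q8 kernels available, GEMV/GEMM runs on Q8 activations, while a prefer_fp32_logits flag forces the logits op back to FP32.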
Remaining Hardcoded Assumptions
These are known constraints in v6.6. They are documented so v7 can cleanly remove them.
| Area | Current Behavior | Risk |
|---|---|---|
| RoPE scaling | No scaling types (linear, dynamic, yarn) yet | Context extension models may diverge |
| RoPE layout | Assumes half-dim cache layout | Models with rotary_dim or interleaved layouts can break |
| Activation mapping | silu_mul maps to swiglu by default | Non-SwiGLU models need explicit hidden_act mapping |
| KV cache layout | Head-major, static layout | Paged or alternative KV layouts not supported |
| BUMP defaults | Fallback layout constants | Should fail if converter omits layout |
Templates and IR are dynamic by design. The generated C is the stable artifact that can be stripped into a lightweight runtime.