Architecture Decision Records

This page documents key architectural decisions for the C-Kernel-Engine IR v4 code generation system. Each ADR explains the context, decision, and consequences to help future contributors understand why things are built the way they are.

IR v4 Context
IR v4 is the "bridge release" unifying v2 (lowering + memory planning) with v3 (deterministic layout). These ADRs trace why each step exists in the pipeline.

Start Here: ADR-001

ADR-001 (Pipeline Split) is the foundation that explains why we separate Graph IR, Lowered IR, and Layout. Read it first to understand the architecture.

Decision Index

ID       Title                                             Category     Status
ADR-001  IR v4 Pipeline Split (Graph → Lowered → Layout)   Design       Accepted
ADR-002  Templates as Canonical Architecture Source        Design       Accepted
ADR-003  Mode-Specific Lowering (Prefill/Decode/Backward)  Performance  Accepted
ADR-004  Deterministic Layout + Canaries for Debug         Design       Accepted
ADR-005  Kernel Selection Happens in Lowering              Performance  Accepted
ADR-006  Weights Map from HuggingFace Names via Template   Design       Accepted
ADR-001: IR v4 Pipeline Split (Graph → Lowered → Layout)
Status: Accepted
Date: 2025-01
Category: Design
Supersedes: IR v2, v3
This is the foundational ADR. It explains the three-phase pipeline. All other ADRs build on these concepts.

Context

Previous IR versions conflated several concerns:

  • IR v2: Combined lowering and memory planning into a single phase
  • IR v3: Clean layout but hardcoded architecture knowledge in Python

Decision

Split the IR pipeline into three distinct phases:

IR v4 PIPELINE

config.json + template.yaml
        │
        ▼
GRAPH IR (high-level)
    Abstract ops: Embed, GEMM, Attention
    No kernels, no offsets, no modes
        │
        ▼
LOWERED IR (per-mode)
    Concrete kernels selected
    prefill: gemm_blocked_parallel_bf16
    decode:  gemm_1x1_bf16
        │
        ▼
LAYOUT (memory plan)
    Byte offsets, 64B alignment
    Canary markers for debug
        │
        ▼
CODEGEN (output)
    Emit C code with real kernel calls
    model.c, model.h, layout.json
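A minimal sketch of how a driver could wire the three phases together. Everything here (function names, the IR node shapes, the placeholder layout sizes) is an illustrative assumption, not the real IR v4 API:

```python
# Hypothetical three-phase pipeline driver (ADR-001). All names and data
# shapes are illustrative assumptions, not the actual IR v4 code.

def build_graph_ir(template, config):
    # Phase 1: Graph IR - abstract ops only; no kernels, offsets, or modes.
    return [{"op": op} for op in template["ops"]]

def lower(graph, mode):
    # Phase 2: one lowering pass per mode, selecting concrete kernels.
    table = {("GEMM", "prefill"): "gemm_blocked_parallel_bf16",
             ("GEMM", "decode"): "gemm_1x1_bf16"}
    return [{"kernel": table.get((n["op"], mode), n["op"].lower())}
            for n in graph]

def plan_layout(lowered, align=64):
    # Phase 3: deterministic, 64B-aligned offsets (sizes are placeholders).
    offsets, cursor = {}, 0
    for mode, nodes in sorted(lowered.items()):
        for i, _ in enumerate(nodes):
            offsets[f"{mode}.buf{i}"] = cursor
            cursor += align
    return offsets

template = {"ops": ["Embed", "GEMM", "Attention"]}
graph = build_graph_ir(template, config={})
lowered = {m: lower(graph, m) for m in ("prefill", "decode")}
layout = plan_layout(lowered)
```

Codegen would then consume both the lowered IR and the layout, which is what keeps each phase independently testable.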

Consequences

Benefits

  • Each phase testable independently
  • New modes = new lowering pass
  • Layout is deterministic and exportable
  • Graph IR can be visualized

Costs

  • Three-phase pipeline more complex
  • More intermediate artifacts
  • Phases must stay in sync
ADR-002: Templates as Canonical Architecture Source
Status: Accepted
Date: 2025-01
Category: Design
Supersedes: Hardcoded Python

Context

In IR v3, model architecture was embedded in build_layer_layout(). Adding new models required modifying core codegen.

Decision

Templates are the canonical definition of model architectures. The IR is a compiled artifact derived from templates + config.json.

# templates/qwen2.yaml
name: qwen2
config_mapping:
  hidden_size: embed_dim
  num_attention_heads: num_heads

layers:
  - name: decoder_layers
    repeat: "{{num_layers}}"
    ops:
      - name: pre_attn_norm
        op: RMSNorm
        input: "{{prev_output}}"
        output: normed

      - name: qkv_proj
        op: GEMM
        input: normed
        outputs: [q, k, v]

      - name: attention
        op: Attention
        inputs: [q, k, v]
        output: attn_out
        params:
          causal: true

Consequences

  • New architectures: Add YAML file, not Python code
  • Explicit: Architecture visible at a glance
  • Validation: Can diff templates against HuggingFace
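The template's config_mapping amounts to a rename pass over config.json. A small sketch, assuming the helper name and the Qwen2-0.5B values (hidden_size 896, 14 attention heads) for illustration:

```python
# Sketch: derive internal IR config names from a HuggingFace config.json
# via the template's config_mapping (ADR-002). Helper name is hypothetical.

def apply_config_mapping(hf_config, config_mapping):
    # Rename HF keys (e.g. hidden_size) to internal names (e.g. embed_dim).
    return {internal: hf_config[hf_name]
            for hf_name, internal in config_mapping.items()
            if hf_name in hf_config}

config_mapping = {"hidden_size": "embed_dim",
                  "num_attention_heads": "num_heads"}
hf_config = {"hidden_size": 896, "num_attention_heads": 14,
             "model_type": "qwen2"}
internal = apply_config_mapping(hf_config, config_mapping)
print(internal)  # {'embed_dim': 896, 'num_heads': 14}
```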
ADR-003: Mode-Specific Lowering (Prefill/Decode/Backward)
Status: Accepted
Date: 2025-01
Category: Performance
Related: ADR-005

Context

Transformer inference has fundamentally different execution patterns:

  • Prefill: Process entire prompt (T tokens, large buffers)
  • Decode: Generate one token (T=1, KV cache reused)
  • Backward: Compute gradients for training

Decision

Allocate separate activation buffers for each execution mode. Weights are shared, working memory is mode-specific.

┌─────────────────────────────────────────────────────────────┐
│                    MEMORY LAYOUT                             │
├─────────────────────────────────────────────────────────────┤
│ Weights (read-only, shared across modes)                     │
├─────────────────────────────────────────────────────────────┤
│ KV Cache (persistent across decode steps)                    │
├─────────────────────────────────────────────────────────────┤
│ Prefill Buffers [max_seq, embed_dim] ← Large                 │
├─────────────────────────────────────────────────────────────┤
│ Decode Buffers [1, embed_dim] ← 2000x smaller                │
└─────────────────────────────────────────────────────────────┘

Buffer Size Comparison (Qwen2-0.5B)

Buffer        Prefill   Decode    Ratio
layer_input   3.5 MB    1.75 KB   2048x
attn_scores   224 MB    112 KB    2048x

Decode buffers are ~2000x smaller, enabling better cache utilization.
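The table's numbers can be reproduced from Qwen2-0.5B shapes. The exact assumptions here (embed_dim=896, 14 heads, max_seq=2048, bf16 activations at 2 bytes, fp32 attention scores at 4 bytes) are inferred from the ratios, not stated by the layout itself:

```python
# Reproduce the buffer-size table for Qwen2-0.5B. Shapes and dtypes are
# assumptions: embed_dim=896, num_heads=14, max_seq=2048; bf16 activations
# (2 bytes), fp32 attention scores (4 bytes).
embed_dim, num_heads, max_seq = 896, 14, 2048

layer_input_prefill = max_seq * embed_dim * 2            # bytes
layer_input_decode  = 1 * embed_dim * 2
attn_scores_prefill = num_heads * max_seq * max_seq * 4
attn_scores_decode  = num_heads * 1 * max_seq * 4

print(layer_input_prefill / 2**20, "MB")   # 3.5 MB
print(layer_input_decode / 2**10, "KB")    # 1.75 KB
print(attn_scores_prefill / 2**20, "MB")   # 224.0 MB
print(attn_scores_decode / 2**10, "KB")    # 112.0 KB
print(layer_input_prefill // layer_input_decode)  # 2048
```

The 2048x ratio is simply max_seq: decode processes one token where prefill processes up to max_seq.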

ADR-004: Deterministic Layout + Canaries for Debug
Status: Accepted
Date: 2025-01
Category: Design
Origin: IR v3

Context

Buffer overflows in GEMM/attention are common and hard to diagnose. IR v2 used dynamic allocation, making bugs non-reproducible.

Decision

  • All tensor offsets computed at codegen time, baked into C code
  • 64-byte alignment for AVX-512 cache line optimization
  • Insert 64-byte canary markers (0xDEADBEEF) between tensors in debug builds
  • Export memory_layout.json for debugging tools
// Generated code - offsets are compile-time constants
#define OFFSET_EMBED_WEIGHT      64
#define OFFSET_LAYER_0_WQ        1606720
#define CANARY_0                 272629824  // Between embed and layer_0

// Verify one 64-byte canary region (debug mode); the generated code
// checks every canary after each layer.
bool verify_canaries(Model *model) {
    uint64_t *canary = (uint64_t *)(model->base + CANARY_0);
    for (int j = 0; j < 8; j++) {  // 8 x 8 bytes = one 64-byte canary
        if (canary[j] != 0xDEADBEEFDEADBEEFULL) {
            fprintf(stderr, "CANARY VIOLATION at byte %d!\n", j * 8);
            return false;
        }
    }
    return true;
}
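The planner side of this can be sketched as a cursor walk over a fixed tensor order, which is what makes the offsets deterministic. The function name, tensor list, and sizes below are illustrative, and deliberately do not reproduce the generated constants above:

```python
# Sketch of deterministic layout planning with debug canaries (ADR-004).
# plan() and the example tensors are hypothetical, not the real planner.

ALIGN = 64         # AVX-512 cache-line alignment
CANARY_BYTES = 64  # one canary region between adjacent tensors

def align_up(n, a=ALIGN):
    return (n + a - 1) // a * a

def plan(tensors, debug=True):
    # tensors: ordered list of (name, size_in_bytes). Fixed order means the
    # offsets are reproducible and can be baked into generated C code.
    layout, cursor = {}, ALIGN  # reserve a leading guard region
    for name, size in tensors:
        layout[name] = cursor
        cursor = align_up(cursor + size)
        if debug:
            layout[f"canary_after_{name}"] = cursor
            cursor += CANARY_BYTES
    return layout

layout = plan([("a", 1000), ("b", 2048)])
print(layout["a"], layout["b"])  # 64 1152
```

In release builds `debug=False` would drop the canary regions, and the layout dict is what gets exported as memory_layout.json.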
ADR-005: Kernel Selection Happens in Lowering
Status: Accepted
Date: 2025-01
Category: Performance
Related: ADR-003

Context

Multiple kernel variants exist for each operation:

  • GEMM: gemm_blocked_serial_bf16, gemm_blocked_parallel_bf16, gemm_1x1_bf16
  • Attention: batch vs single-query, GQA vs MHA

Decision

Graph IR uses abstract ops. Lowering resolves to concrete kernel function names.

Graph IR (Abstract)

{"op": "GEMM",
 "inputs": ["x", "w"],
 "output": "y"}

Lowered IR (Concrete)

// Prefill mode
{"kernel": "gemm_blocked_parallel_bf16"}

// Decode mode
{"kernel": "gemm_1x1_bf16"}

Result: No runtime branching for kernel selection. Mode-specific code paths are explicit.
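The lowering step above reduces to a table lookup keyed on (op, mode). A sketch, where the attention kernel names are assumptions and the real pass would also consider dtype and GQA vs MHA:

```python
# Sketch of kernel selection during lowering (ADR-005). The attention
# entries are assumed names; the GEMM entries come from the examples above.

KERNELS = {
    ("GEMM", "prefill"):      "gemm_blocked_parallel_bf16",
    ("GEMM", "decode"):       "gemm_1x1_bf16",
    ("Attention", "prefill"): "attention_batch_bf16",         # assumed
    ("Attention", "decode"):  "attention_single_query_bf16",  # assumed
}

def select_kernel(op, mode):
    # Fails loudly at lowering time if no kernel covers this (op, mode),
    # instead of branching at runtime.
    return KERNELS[(op, mode)]

node = {"op": "GEMM", "inputs": ["x", "w"], "output": "y"}
lowered = dict(node, kernel=select_kernel(node["op"], "decode"))
print(lowered["kernel"])  # gemm_1x1_bf16
```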

ADR-006: Weights Map from HuggingFace Names via Template
Status: Accepted
Date: 2025-01
Category: Design
Related: ADR-002

Context

HuggingFace models use inconsistent naming conventions:

  • Llama: model.layers.0.self_attn.q_proj.weight
  • GPT-2: transformer.h.0.attn.c_attn.weight (combined QKV!)

Decision

Weight name mapping defined in architecture templates.

# templates/qwen2.yaml
weight_mapping:
  "model.embed_tokens.weight": "embed.weight"
  "model.layers.{i}.self_attn.q_proj.weight": "layers.{i}.attn.wq"
  "model.layers.{i}.self_attn.k_proj.weight": "layers.{i}.attn.wk"
  "model.layers.{i}.mlp.gate_proj.weight": "layers.{i}.mlp.gate"
  "model.norm.weight": "final_norm.gamma"
  "lm_head.weight": "lm_head.weight"

At codegen time, validate that every mapped weight exists in the safetensors header, and warn about HuggingFace weights left unmapped.
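Expanding the `{i}` patterns and validating against the safetensors header can be sketched as follows. The helper names are hypothetical; only the mapping keys come from the template above:

```python
# Sketch of weight-name mapping expansion and validation (ADR-006).
# expand() and validate() are illustrative helpers, not the real code.

def expand(weight_mapping, num_layers):
    # Expand "{i}" patterns into one entry per layer.
    out = {}
    for hf, internal in weight_mapping.items():
        if "{i}" in hf:
            for i in range(num_layers):
                out[hf.format(i=i)] = internal.format(i=i)
        else:
            out[hf] = internal
    return out

def validate(mapping, safetensors_names):
    # Every mapped weight must exist; report HF weights left unmapped.
    missing = [n for n in mapping if n not in safetensors_names]
    unmapped = [n for n in safetensors_names if n not in mapping]
    return missing, unmapped

mapping = expand({"model.embed_tokens.weight": "embed.weight",
                  "model.layers.{i}.self_attn.q_proj.weight":
                      "layers.{i}.attn.wq"}, num_layers=2)
missing, unmapped = validate(mapping, set(mapping) | {"lm_head.weight"})
print(mapping["model.layers.1.self_attn.q_proj.weight"])  # layers.1.attn.wq
```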
