Architecture Decision Records

This page documents key architectural decisions for the C-Kernel-Engine IR v4 code generation system. Each ADR explains the context, decision, and consequences to help future contributors understand why things are built the way they are.

IR v4 Context
IR v4 is the "bridge release" unifying v2 (lowering + memory planning) with v3 (deterministic layout). These ADRs trace why each step exists in the pipeline.

Start Here: ADR-001

ADR-001 (Pipeline Split) is the foundation that explains why we separate Graph IR, Lowered IR, and Layout. Read it first to understand the architecture.

Decision Index

ID       Title                                             Category     Status
ADR-001  IR v4 Pipeline Split (Graph → Lowered → Layout)   Design       Accepted
ADR-002  Templates as Canonical Architecture Source        Design       Accepted
ADR-003  Mode-Specific Lowering (Prefill/Decode/Backward)  Performance  Accepted
ADR-004  Deterministic Layout + Canaries for Debug         Design       Accepted
ADR-005  Kernel Selection Happens in Lowering              Performance  Accepted
ADR-006  Weights Map from HuggingFace Names via Template   Design       Accepted
ADR-001: IR v4 Pipeline Split (Graph → Lowered → Layout)
Status: Accepted
Date: 2025-01
Category: Design
Supersedes: IR v2, v3
This is the foundational ADR. It explains the three-phase pipeline. All other ADRs build on these concepts.

Context

Previous IR versions conflated several concerns:

  • IR v2: Combined lowering and memory planning into a single phase
  • IR v3: Clean layout but hardcoded architecture knowledge in Python

Decision

Split the IR pipeline into three distinct phases:

IR v4 PIPELINE

config.json + template.yaml
        │
        ▼
GRAPH IR (high-level)
    Abstract ops: Embed, GEMM, Attention
    No kernels, no offsets, no modes
        │
        ▼
LOWERED IR (per-mode)
    Concrete kernels selected
    prefill: gemm_blocked_parallel_bf16
    decode:  gemm_1x1_bf16
        │
        ▼
LAYOUT (memory plan)
    Byte offsets, 64B alignment
    Canary markers for debug
        │
        ▼
CODEGEN (output)
    Emit C code with real kernel calls
    model.c, model.h, layout.json
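A minimal sketch of how a driver could wire the three phases together. Everything here (function names, the IR node shapes, the placeholder layout sizes) is an illustrative assumption, not the real IR v4 API:

```python
# Hypothetical three-phase pipeline driver (ADR-001). All names and data
# shapes are illustrative assumptions, not the actual IR v4 code.

def build_graph_ir(template, config):
    # Phase 1: Graph IR - abstract ops only; no kernels, offsets, or modes.
    return [{"op": op} for op in template["ops"]]

def lower(graph, mode):
    # Phase 2: one lowering pass per mode, selecting concrete kernels.
    table = {("GEMM", "prefill"): "gemm_blocked_parallel_bf16",
             ("GEMM", "decode"): "gemm_1x1_bf16"}
    return [{"kernel": table.get((n["op"], mode), n["op"].lower())}
            for n in graph]

def plan_layout(lowered, align=64):
    # Phase 3: deterministic, 64B-aligned offsets (sizes are placeholders).
    offsets, cursor = {}, 0
    for mode, nodes in sorted(lowered.items()):
        for i, _ in enumerate(nodes):
            offsets[f"{mode}.buf{i}"] = cursor
            cursor += align
    return offsets

template = {"ops": ["Embed", "GEMM", "Attention"]}
graph = build_graph_ir(template, config={})
lowered = {m: lower(graph, m) for m in ("prefill", "decode")}
layout = plan_layout(lowered)
```

Codegen would then consume both the lowered IR and the layout, which is what keeps each phase independently testable.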

Consequences

Benefits

  • Each phase testable independently
  • New modes = new lowering pass
  • Layout is deterministic and exportable
  • Graph IR can be visualized

Costs

  • Three-phase pipeline more complex
  • More intermediate artifacts
  • Phases must stay in sync
ADR-002: Templates as Canonical Architecture Source
Status: Accepted
Date: 2025-01
Category: Design
Supersedes: Hardcoded Python

Context

In IR v3, model architecture was embedded in build_layer_layout(). Adding new models required modifying core codegen.

Decision

Templates are the canonical definition of model architectures. The IR is a compiled artifact derived from templates + config.json.

# templates/qwen2.yaml
name: qwen2
config_mapping:
  hidden_size: embed_dim
  num_attention_heads: num_heads

layers:
  - name: decoder_layers
    repeat: "{{num_layers}}"
    ops:
      - name: pre_attn_norm
        op: RMSNorm
        input: "{{prev_output}}"
        output: normed

      - name: qkv_proj
        op: GEMM
        input: normed
        outputs: [q, k, v]

      - name: attention
        op: Attention
        inputs: [q, k, v]
        output: attn_out
        params:
          causal: true

Consequences

  • New architectures: Add YAML file, not Python code
  • Explicit: Architecture visible at a glance
  • Validation: Can diff templates against HuggingFace
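The template's config_mapping amounts to a rename pass over config.json. A small sketch, assuming the helper name and the Qwen2-0.5B values (hidden_size 896, 14 attention heads) for illustration:

```python
# Sketch: derive internal IR config names from a HuggingFace config.json
# via the template's config_mapping (ADR-002). Helper name is hypothetical.

def apply_config_mapping(hf_config, config_mapping):
    # Rename HF keys (e.g. hidden_size) to internal names (e.g. embed_dim).
    return {internal: hf_config[hf_name]
            for hf_name, internal in config_mapping.items()
            if hf_name in hf_config}

config_mapping = {"hidden_size": "embed_dim",
                  "num_attention_heads": "num_heads"}
hf_config = {"hidden_size": 896, "num_attention_heads": 14,
             "model_type": "qwen2"}
internal = apply_config_mapping(hf_config, config_mapping)
print(internal)  # {'embed_dim': 896, 'num_heads': 14}
```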
ADR-003: Mode-Specific Lowering (Prefill/Decode/Backward)
Status: Accepted
Date: 2025-01
Category: Performance
Related: ADR-005

Context

Transformer inference has fundamentally different execution patterns:

  • Prefill: Process entire prompt (T tokens, large buffers)
  • Decode: Generate one token (T=1, KV cache reused)
  • Backward: Compute gradients for training

Decision

Allocate separate activation buffers for each execution mode. Weights are shared, working memory is mode-specific.

┌─────────────────────────────────────────────────────────────┐
│                    MEMORY LAYOUT                             │
├─────────────────────────────────────────────────────────────┤
│ Weights (read-only, shared across modes)                     │
├─────────────────────────────────────────────────────────────┤
│ KV Cache (persistent across decode steps)                    │
├─────────────────────────────────────────────────────────────┤
│ Prefill Buffers [max_seq, embed_dim] ← Large                 │
├─────────────────────────────────────────────────────────────┤
│ Decode Buffers [1, embed_dim] ← 2000x smaller                │
└─────────────────────────────────────────────────────────────┘

Buffer Size Comparison (Qwen2-0.5B)

Buffer        Prefill   Decode    Ratio
layer_input   3.5 MB    1.75 KB   2048x
attn_scores   224 MB    112 KB    2048x

Decode buffers are ~2000x smaller, enabling better cache utilization.
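The table's numbers can be reproduced from Qwen2-0.5B shapes. The exact assumptions here (embed_dim=896, 14 heads, max_seq=2048, bf16 activations at 2 bytes, fp32 attention scores at 4 bytes) are inferred from the ratios, not stated by the layout itself:

```python
# Reproduce the buffer-size table for Qwen2-0.5B. Shapes and dtypes are
# assumptions: embed_dim=896, num_heads=14, max_seq=2048; bf16 activations
# (2 bytes), fp32 attention scores (4 bytes).
embed_dim, num_heads, max_seq = 896, 14, 2048

layer_input_prefill = max_seq * embed_dim * 2            # bytes
layer_input_decode  = 1 * embed_dim * 2
attn_scores_prefill = num_heads * max_seq * max_seq * 4
attn_scores_decode  = num_heads * 1 * max_seq * 4

print(layer_input_prefill / 2**20, "MB")   # 3.5 MB
print(layer_input_decode / 2**10, "KB")    # 1.75 KB
print(attn_scores_prefill / 2**20, "MB")   # 224.0 MB
print(attn_scores_decode / 2**10, "KB")    # 112.0 KB
print(layer_input_prefill // layer_input_decode)  # 2048
```

The 2048x ratio is simply max_seq: decode processes one token where prefill processes up to max_seq.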

ADR-004: Deterministic Layout + Canaries for Debug
Status: Accepted
Date: 2025-01
Category: Design
Origin: IR v3

Context

Buffer overflows in GEMM/attention are common and hard to diagnose. IR v2 used dynamic allocation, making bugs non-reproducible.

Decision

  • All tensor offsets computed at codegen time, baked into C code
  • 64-byte alignment for AVX-512 cache line optimization
  • Insert 64-byte canary markers (0xDEADBEEF) between tensors in debug builds
  • Export memory_layout.json for debugging tools
// Generated code - offsets are compile-time constants
#define OFFSET_EMBED_WEIGHT      64
#define OFFSET_LAYER_0_WQ        1606720
#define CANARY_0                 272629824  // Between embed and layer_0

// Verify one 64-byte canary region (debug mode); the generated code
// checks every canary after each layer.
bool verify_canaries(Model *model) {
    uint64_t *canary = (uint64_t *)(model->base + CANARY_0);
    for (int j = 0; j < 8; j++) {  // 8 x 8 bytes = one 64-byte canary
        if (canary[j] != 0xDEADBEEFDEADBEEFULL) {
            fprintf(stderr, "CANARY VIOLATION at byte %d!\n", j * 8);
            return false;
        }
    }
    return true;
}
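The planner side of this can be sketched as a cursor walk over a fixed tensor order, which is what makes the offsets deterministic. The function name, tensor list, and sizes below are illustrative, and deliberately do not reproduce the generated constants above:

```python
# Sketch of deterministic layout planning with debug canaries (ADR-004).
# plan() and the example tensors are hypothetical, not the real planner.

ALIGN = 64         # AVX-512 cache-line alignment
CANARY_BYTES = 64  # one canary region between adjacent tensors

def align_up(n, a=ALIGN):
    return (n + a - 1) // a * a

def plan(tensors, debug=True):
    # tensors: ordered list of (name, size_in_bytes). Fixed order means the
    # offsets are reproducible and can be baked into generated C code.
    layout, cursor = {}, ALIGN  # reserve a leading guard region
    for name, size in tensors:
        layout[name] = cursor
        cursor = align_up(cursor + size)
        if debug:
            layout[f"canary_after_{name}"] = cursor
            cursor += CANARY_BYTES
    return layout

layout = plan([("a", 1000), ("b", 2048)])
print(layout["a"], layout["b"])  # 64 1152
```

In release builds `debug=False` would drop the canary regions, and the layout dict is what gets exported as memory_layout.json.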
ADR-005: Kernel Selection Happens in Lowering
Status: Accepted
Date: 2025-01
Category: Performance
Related: ADR-003

Context

Multiple kernel variants exist for each operation:

  • GEMM: gemm_blocked_serial_bf16, gemm_blocked_parallel_bf16, gemm_1x1_bf16
  • Attention: batch vs single-query, GQA vs MHA

Decision

Graph IR uses abstract ops. Lowering resolves to concrete kernel function names.

Graph IR (Abstract)

{"op": "GEMM",
 "inputs": ["x", "w"],
 "output": "y"}

Lowered IR (Concrete)

// Prefill mode
{"kernel": "gemm_blocked_parallel_bf16"}

// Decode mode
{"kernel": "gemm_1x1_bf16"}

Result: No runtime branching for kernel selection. Mode-specific code paths are explicit.
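The lowering step above reduces to a table lookup keyed on (op, mode). A sketch, where the attention kernel names are assumptions and the real pass would also consider dtype and GQA vs MHA:

```python
# Sketch of kernel selection during lowering (ADR-005). The attention
# entries are assumed names; the GEMM entries come from the examples above.

KERNELS = {
    ("GEMM", "prefill"):      "gemm_blocked_parallel_bf16",
    ("GEMM", "decode"):       "gemm_1x1_bf16",
    ("Attention", "prefill"): "attention_batch_bf16",         # assumed
    ("Attention", "decode"):  "attention_single_query_bf16",  # assumed
}

def select_kernel(op, mode):
    # Fails loudly at lowering time if no kernel covers this (op, mode),
    # instead of branching at runtime.
    return KERNELS[(op, mode)]

node = {"op": "GEMM", "inputs": ["x", "w"], "output": "y"}
lowered = dict(node, kernel=select_kernel(node["op"], "decode"))
print(lowered["kernel"])  # gemm_1x1_bf16
```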

ADR-006: Weights Map from HuggingFace Names via Template
Status: Accepted
Date: 2025-01
Category: Design
Related: ADR-002

Context

HuggingFace models use inconsistent naming conventions:

  • Llama: model.layers.0.self_attn.q_proj.weight
  • GPT-2: transformer.h.0.attn.c_attn.weight (combined QKV!)

Decision

Weight name mapping defined in architecture templates.

# templates/qwen2.yaml
weight_mapping:
  "model.embed_tokens.weight": "embed.weight"
  "model.layers.{i}.self_attn.q_proj.weight": "layers.{i}.attn.wq"
  "model.layers.{i}.self_attn.k_proj.weight": "layers.{i}.attn.wk"
  "model.layers.{i}.mlp.gate_proj.weight": "layers.{i}.mlp.gate"
  "model.norm.weight": "final_norm.gamma"
  "lm_head.weight": "lm_head.weight"

At codegen time, validate that every mapped weight exists in the safetensors header, and warn about HuggingFace weights left unmapped.
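Expanding the `{i}` patterns and validating against the safetensors header can be sketched as follows. The helper names are hypothetical; only the mapping keys come from the template above:

```python
# Sketch of weight-name mapping expansion and validation (ADR-006).
# expand() and validate() are illustrative helpers, not the real code.

def expand(weight_mapping, num_layers):
    # Expand "{i}" patterns into one entry per layer.
    out = {}
    for hf, internal in weight_mapping.items():
        if "{i}" in hf:
            for i in range(num_layers):
                out[hf.format(i=i)] = internal.format(i=i)
        else:
            out[hf] = internal
    return out

def validate(mapping, safetensors_names):
    # Every mapped weight must exist; report HF weights left unmapped.
    missing = [n for n in mapping if n not in safetensors_names]
    unmapped = [n for n in safetensors_names if n not in mapping]
    return missing, unmapped

mapping = expand({"model.embed_tokens.weight": "embed.weight",
                  "model.layers.{i}.self_attn.q_proj.weight":
                      "layers.{i}.attn.wq"}, num_layers=2)
missing, unmapped = validate(mapping, set(mapping) | {"lm_head.weight"})
print(mapping["model.layers.1.self_attn.q_proj.weight"])  # layers.1.attn.wq
```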
