Architecture Decision Records
This page documents key architectural decisions for the C-Kernel-Engine IR v4 code generation system. Each ADR explains the context, decision, and consequences to help future contributors understand why things are built the way they are.
IR v4 is the "bridge release" unifying v2 (lowering + memory planning) with v3 (deterministic layout). These ADRs trace why each step exists in the pipeline.
Start Here: ADR-001
ADR-001 (Pipeline Split) is the foundation that explains why we separate Graph IR, Lowered IR, and Layout. Read it first to understand the architecture.
Decision Index
| ID | Title | Category | Status |
|---|---|---|---|
| ADR-001 | IR v4 Pipeline Split (Graph → Lowered → Layout) | Design | Accepted |
| ADR-002 | Templates as Canonical Architecture Source | Design | Accepted |
| ADR-003 | Mode-Specific Lowering (Prefill/Decode/Backward) | Performance | Accepted |
| ADR-004 | Deterministic Layout + Canaries for Debug | Design | Accepted |
| ADR-005 | Kernel Selection Happens in Lowering | Performance | Accepted |
| ADR-006 | Weights Map from HuggingFace Names via Template | Design | Accepted |
ADR-001: IR v4 Pipeline Split (Graph → Lowered → Layout) (Accepted)
Context
Previous IR versions conflated several concerns:
- IR v2: Combined lowering and memory planning into a single phase
- IR v3: Clean layout but hardcoded architecture knowledge in Python
Decision
Split the IR pipeline into three distinct phases:

1. Graph IR: built from `config.json` + `template.yaml`; no kernels, no offsets, no modes
2. Lowered IR: mode-specific kernel bindings, e.g. prefill: `gemm_blocked_parallel_bf16`, decode: `gemm_1x1_bf16`
3. Layout: deterministic offsets and canary markers for debug; emits `model.c`, `model.h`, `layout.json`

Consequences
Benefits
- Each phase testable independently
- New modes = new lowering pass
- Layout is deterministic and exportable
- Graph IR can be visualized
Costs
- Three-phase pipeline more complex
- More intermediate artifacts
- Phases must stay in sync
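The three phases can be sketched end to end. This is a minimal illustration, not the engine's actual API: the function names, the flattened template shape, and the kernel table are assumptions made for the sketch.

```python
# Illustrative three-phase pipeline sketch; names are hypothetical.

def build_graph_ir(template: dict) -> list:
    """Phase 1: architecture-level ops. No kernels, offsets, or modes."""
    return [{"name": op["name"], "op": op["op"]} for op in template["ops"]]

# Kernel names from ADR-005; the dispatch table itself is an assumption.
GEMM_KERNELS = {"prefill": "gemm_blocked_parallel_bf16",
                "decode": "gemm_1x1_bf16"}

def lower(graph_ir: list, mode: str) -> list:
    """Phase 2: bind each abstract op to a concrete kernel for one mode."""
    return [dict(node, kernel=GEMM_KERNELS[mode]) if node["op"] == "GEMM"
            else dict(node) for node in graph_ir]

def plan_layout(lowered: list, sizes: dict, align: int = 64) -> dict:
    """Phase 3: deterministic, 64-byte-aligned offsets for every buffer."""
    offsets, cursor = {}, 0
    for node in lowered:
        offsets[node["name"]] = cursor
        cursor += -(-sizes[node["name"]] // align) * align  # round up
    return offsets
```

Because each phase takes and returns plain data, every phase boundary is a test point, which is exactly the "each phase testable independently" benefit above.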
ADR-002: Templates as Canonical Architecture Source (Accepted)
Context
In IR v3, model architecture was embedded in build_layer_layout(). Adding new models required modifying core codegen.
Decision
Templates are the canonical definition of model architectures. The IR is a compiled artifact derived from templates + config.json.
```yaml
# templates/qwen2.yaml
name: qwen2

config_mapping:
  hidden_size: embed_dim
  num_attention_heads: num_heads

layers:
  - name: decoder_layers
    repeat: "{{num_layers}}"
    ops:
      - name: pre_attn_norm
        op: RMSNorm
        input: "{{prev_output}}"
        output: normed
      - name: qkv_proj
        op: GEMM
        input: normed
        outputs: [q, k, v]
      - name: attention
        op: Attention
        inputs: [q, k, v]
        output: attn_out
        params:
          causal: true
```
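The `config_mapping` block is effectively a key rename over the HuggingFace `config.json`. A hedged sketch of how it might be applied (the function is hypothetical; the key names come from the template excerpt above):

```python
# Hypothetical config_mapping resolver; not the engine's actual API.

def apply_config_mapping(hf_config: dict, mapping: dict) -> dict:
    """Rename HuggingFace config keys to the engine's internal names."""
    return {internal: hf_config[hf_key] for hf_key, internal in mapping.items()}

mapping = {"hidden_size": "embed_dim", "num_attention_heads": "num_heads"}
# Example values are illustrative, not taken from a real config.json.
cfg = apply_config_mapping({"hidden_size": 896, "num_attention_heads": 14},
                           mapping)
```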
Consequences
- New architectures: Add YAML file, not Python code
- Explicit: Architecture visible at a glance
- Validation: Can diff templates against HuggingFace
ADR-003: Mode-Specific Lowering (Prefill/Decode/Backward) (Accepted)
Context
Transformer inference has fundamentally different execution patterns:
- Prefill: Process entire prompt (T tokens, large buffers)
- Decode: Generate one token (T=1, KV cache reused)
- Backward: Compute gradients for training
Decision
Allocate separate activation buffers for each execution mode. Weights are shared, working memory is mode-specific.
```
┌─────────────────────────────────────────────────────────────┐
│ MEMORY LAYOUT                                               │
├─────────────────────────────────────────────────────────────┤
│ Weights (read-only, shared across modes)                    │
├─────────────────────────────────────────────────────────────┤
│ KV Cache (persistent across decode steps)                   │
├─────────────────────────────────────────────────────────────┤
│ Prefill Buffers [max_seq, embed_dim]        ← Large         │
├─────────────────────────────────────────────────────────────┤
│ Decode Buffers [1, embed_dim]               ← 2000x smaller │
└─────────────────────────────────────────────────────────────┘
```
Buffer Size Comparison (Qwen2-0.5B)
| Buffer | Prefill | Decode | Ratio |
|---|---|---|---|
| layer_input | 3.5 MB | 1.75 KB | 2048x |
| attn_scores | 224 MB | 112 KB | 2048x |
Decode buffers are ~2000x smaller, enabling better cache utilization.
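The 2048x ratio follows directly from the shapes: prefill buffers scale with `max_seq` while decode uses T=1. A back-of-envelope check, assuming max_seq=2048, embed_dim=896, num_heads=14 (Qwen2-0.5B-like values), bf16 activations (2 bytes), and fp32 attention scores (4 bytes); the dtype assumptions are inferred from the table, not stated in the source:

```python
# Reproduce the buffer-size table under the stated shape assumptions.
MAX_SEQ, EMBED_DIM, NUM_HEADS = 2048, 896, 14

layer_input_prefill = MAX_SEQ * EMBED_DIM * 2            # bf16 activations
layer_input_decode  = 1 * EMBED_DIM * 2                  # T = 1
attn_scores_prefill = NUM_HEADS * MAX_SEQ * MAX_SEQ * 4  # fp32 scores
attn_scores_decode  = NUM_HEADS * 1 * MAX_SEQ * 4        # one query row

print(layer_input_prefill / 2**20)   # 3.5 (MB)
print(attn_scores_prefill / 2**20)   # 224.0 (MB)
print(layer_input_prefill // layer_input_decode)   # 2048
```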
ADR-004: Deterministic Layout + Canaries for Debug (Accepted)
Context
Buffer overflows in GEMM/attention are common and hard to diagnose. IR v2 used dynamic allocation, making bugs non-reproducible.
Decision
- All tensor offsets computed at codegen time, baked into C code
- 64-byte alignment for AVX-512 cache line optimization
- Insert 64-byte canary markers (`0xDEADBEEF`) between tensors in debug builds
- Export `memory_layout.json` for debugging tools
```c
// Generated code: offsets are compile-time constants
#define OFFSET_EMBED_WEIGHT 64
#define OFFSET_LAYER_0_WQ   1606720
#define CANARY_0            272629824  // Between embed and layer_0

// Verify canaries after each layer (debug mode)
bool verify_canaries(Model *model) {
    uint64_t *canary = (uint64_t *)(model->base + CANARY_0);
    for (int j = 0; j < 8; j++) {  // 64 bytes = 8 x uint64_t
        if (canary[j] != 0xDEADBEEFDEADBEEFULL) {
            fprintf(stderr, "CANARY VIOLATION!\n");
            return false;
        }
    }
    return true;
}
```
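The planning side of this decision can be sketched as follows. This is a simplified model of the layout phase, not the actual planner: the tensor names, sizes, and export shape are illustrative, and the real `memory_layout.json` format may differ.

```python
# Simplified debug-build layout planner sketch (hypothetical).
import json

ALIGN = 64          # cache-line / AVX-512 alignment
CANARY_BYTES = 64   # one cache line of 0xDEADBEEF per gap

def plan_debug_layout(tensors):
    """tensors: list of (name, size_bytes) -> (offsets, canary_offsets)."""
    offsets, canaries, cursor = {}, [], ALIGN  # bytes 0..63 reserved
    for name, size in tensors:
        offsets[name] = cursor
        cursor += -(-size // ALIGN) * ALIGN    # round size up to alignment
        canaries.append(cursor)                # canary right after tensor
        cursor += CANARY_BYTES
    return offsets, canaries

offsets, canaries = plan_debug_layout([("a", 1000), ("b", 64)])
layout_json = json.dumps({"offsets": offsets, "canaries": canaries})
```

Because the plan is pure arithmetic over static sizes, the same inputs always produce byte-identical offsets, which is what makes overflow bugs reproducible.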
ADR-005: Kernel Selection Happens in Lowering (Accepted)
Context
Multiple kernel variants exist for each operation:
- GEMM: `gemm_blocked_serial_bf16`, `gemm_blocked_parallel_bf16`, `gemm_1x1_bf16`
- Attention: batch vs single-query, GQA vs MHA
Decision
Graph IR uses abstract ops. Lowering resolves to concrete kernel function names.
Graph IR (Abstract)

```json
{"op": "GEMM", "inputs": ["x", "w"], "output": "y"}
```

Lowered IR (Concrete)

```jsonc
{"kernel": "gemm_blocked_parallel_bf16"}  // prefill mode
{"kernel": "gemm_1x1_bf16"}               // decode mode
```
Result: No runtime branching for kernel selection. Mode-specific code paths are explicit.
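A plausible selection rule, written out explicitly. The kernel names come from the list above; the exact dispatch conditions are an assumption for illustration:

```python
# Hypothetical GEMM kernel selection during lowering.
def select_gemm_kernel(mode: str, num_threads: int = 1) -> str:
    if mode == "decode":
        return "gemm_1x1_bf16"  # T=1: one output row, blocking buys nothing
    if num_threads > 1:
        return "gemm_blocked_parallel_bf16"
    return "gemm_blocked_serial_bf16"
```

Running this once at codegen time, rather than per call, is what removes the runtime branch: the generated C simply calls the chosen function by name.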
ADR-006: Weights Map from HuggingFace Names via Template (Accepted)
Context
HuggingFace models use inconsistent naming conventions:
- Llama: `model.layers.0.self_attn.q_proj.weight`
- GPT-2: `transformer.h.0.attn.c_attn.weight` (combined QKV!)
Decision
Weight name mapping defined in architecture templates.
```yaml
# templates/qwen2.yaml
weight_mapping:
  "model.embed_tokens.weight": "embed.weight"
  "model.layers.{i}.self_attn.q_proj.weight": "layers.{i}.attn.wq"
  "model.layers.{i}.self_attn.k_proj.weight": "layers.{i}.attn.wk"
  "model.layers.{i}.mlp.gate_proj.weight": "layers.{i}.mlp.gate"
  "model.norm.weight": "final_norm.gamma"
  "lm_head.weight": "lm_head.weight"
```
At codegen time, validate that every mapped weight exists in the safetensors header, and warn about safetensors weights that no mapping covers.
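The validation step might look like the following sketch, which expands the `{i}` placeholders and diffs the result against the safetensors header. Function names are illustrative, not the engine's actual API:

```python
# Hypothetical weight_mapping expansion and validation.

def expand_mapping(mapping: dict, num_layers: int) -> dict:
    """Expand '{i}' placeholders into one entry per layer."""
    out = {}
    for hf_name, internal in mapping.items():
        if "{i}" in hf_name:
            for i in range(num_layers):
                out[hf_name.format(i=i)] = internal.format(i=i)
        else:
            out[hf_name] = internal
    return out

def validate(mapping: dict, safetensors_names: set):
    """Return (mapped-but-missing names, present-but-unmapped names)."""
    missing = [n for n in mapping if n not in safetensors_names]
    unmapped = [n for n in safetensors_names if n not in mapping]
    return missing, unmapped
```

Missing weights are a hard codegen error (the generated offsets would point at garbage); unmapped weights only warrant a warning, matching the policy above.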
Related Documentation
- Code Generation Guide - How generated C code is structured
- IR v2 Format - Previous IR format documentation
- Memory Safety - Canary and debug features