# IR v2 Format Specification

IR v2 is a self-documenting JSON format that captures model architecture, buffer definitions, and computation graphs in a portable, machine-readable form.
## Key Features

- Portable - the same IR works across different batch sizes
- Self-documenting - contains notes explaining the format
- Extensible - supports encoder, decoder, and encoder-decoder architectures
## Pipeline Overview
## Dimension Mapping

The IR uses symbolic dimension IDs that map to concrete values from the model config:

| ID | Name | Source in config.json | Example (Qwen2-0.5B) |
|---|---|---|---|
| 0 | tokens | batch_size × seq_len (runtime) | 131072 |
| 1 | embed | hidden_size | 896 |
| 2 | aligned_embed | hidden_size (64-byte aligned) | 896 |
| 3 | head_dim | hidden_size / num_attention_heads | 64 |
| 5 | num_heads | num_attention_heads | 14 |
| 6 | num_kv_heads | num_key_value_heads | 2 |
| 8 | intermediate | intermediate_size | 4864 |
| 10 | vocab | vocab_size | 151936 |
### How Dimensions Resolve

```
shape[{dim:10}] → dimensions[10] → {name:"vocab", value:151936} → 151936 elements
```

Example: `token_emb` shape = [vocab, aligned_embed] = [151936, 896]
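The lookup above can be sketched in C. This is a hypothetical illustration, not the actual loader: `resolve_dim` is an assumed helper name, and the table values mirror the Qwen2-0.5B example (the real loader parses these entries out of ir_v2.json).

```c
#include <stddef.h>

/* Sketch of the dim-ID lookup: a dimension table resolved from config.json. */
typedef struct { int id; const char *name; long value; } Dim;

static const Dim dimensions[] = {
    {0, "tokens", 131072},
    {2, "aligned_embed", 896},
    {10, "vocab", 151936},
};

/* Resolve a symbolic dimension ID to its concrete value (-1 if unknown). */
static long resolve_dim(int id) {
    for (size_t i = 0; i < sizeof dimensions / sizeof dimensions[0]; i++)
        if (dimensions[i].id == id) return dimensions[i].value;
    return -1;
}
```

With this table, the `token_emb` element count is `resolve_dim(10) * resolve_dim(2)` = 151936 × 896 = 136,134,656 elements.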
## Input Files

### config.json (from HuggingFace)

Downloaded automatically; provides the model dimensions.

```json
{
  "hidden_size": 896,
  "vocab_size": 151936,
  "num_attention_heads": 14,
  "num_key_value_heads": 2,
  "intermediate_size": 4864,
  "num_hidden_layers": 24
}
```
### global_buffers.json (kernel_maps/)

Defines global buffers using symbolic dimension names.

```json
{
  "buffers": [
    {
      "name": "token_emb",
      "scope": "global",
      "role": "weight",
      "shape": [{"dim": "vocab"}, {"dim": "aligned_embed"}]
    }
  ]
}
```
### decoder_layer_plan.json (kernel_maps/)

Defines the operations in each transformer layer.

```json
{
  "steps": [
    {"kernel": "rmsnorm", "bind": {"input": "input", "gamma": "ln1_gamma"}},
    {"kernel": "qkv_project", ...},
    {"kernel": "attention", ...}
  ]
}
```
### ir_v2.json (Output)

Self-documenting IR with all dimensions resolved.

```json
{
  "version": 2,
  "notes": ["...format guide..."],
  "dimensions": [...],
  "buffers": [...],
  "nodes": [...]
}
```
## Usage

### Generate IR v2

```sh
# Fetch config from HuggingFace
make fetch-v2 IR_V2_HF=Qwen/Qwen2-0.5B

# Generate IR v2
make ir-v2 IR_V2_HF=Qwen/Qwen2-0.5B

# Generate C runtime
./build/ck_ir_v2_demo --ir build/ir_v2.json --emit build/generated_v2.c
```
### For GGUF Quantized Models

```sh
# Use the GGUF-specific repo
make ir-v2 IR_V2_HF=Qwen/Qwen2-0.5B-Instruct-GGUF \
    IR_V2_WEIGHTS=qwen2-0_5b-instruct-q4_k_m.gguf
```
## IR Lowering

The high-level IR is portable but not executable. The lowering pass converts it into a mode-specific form with a concrete memory plan.
### Prefill Mode

Processes the full sequence (prompt), all tokens in parallel.

```sh
./build/ck_ir_v2_demo config.json \
    --lower prefill
```
### Decode Mode

Generates one token at a time, using the KV cache.

```sh
./build/ck_ir_v2_demo config.json \
    --lower decode \
    --lower-out build/decode.json
```
### Backward Mode

Training; automatically builds the backward graph.

```sh
./build/ck_ir_v2_demo config.json \
    --lower backward
```
### What Lowering Adds

```json
{
  "lowering": {
    "mode": "prefill",
    "nodes_filtered": 12,
    "nodes_kept": 252
  },
  "memory_plan": {
    "total_bytes": 976914176,
    "buffers": [
      {"name": "token_emb", "offset": 0, "size": 272269312},
      {"name": "embedded_input", "offset": 272269312, "size": 469762048},
      {"name": "layer_0.ln1_gamma", "offset": 742031360, "size": 1792},
      {"name": "layer_0.ln1_output", "offset": 742033152, "size": ...}
    ]
  },
  "nodes": [ ... filtered & tagged ... ]
}
```
Note: all offsets are relative to a single base pointer; weights and activations are interleaved in execution order.
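Such a plan can be produced by a simple bump allocator walking the buffers in execution order from one base. The sketch below is an assumption about the mechanism, not the engine's actual planner; the 64-byte alignment is likewise an assumed value.

```c
#include <stddef.h>

/* Sequential offset assignment: round the cursor up to the alignment,
   hand out that offset, then advance the cursor past the buffer.
   align must be a power of two. */
static size_t plan_next(size_t *cursor, size_t size, size_t align) {
    size_t off = (*cursor + align - 1) & ~(align - 1);
    *cursor = off + size;
    return off;
}
```

Running the sizes from the example plan through this rule reproduces its offsets: token_emb at 0, embedded_input at 272269312, layer_0.ln1_gamma at 742031360.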
## Memory Layout Philosophy

C-Kernel-Engine uses a single contiguous memory allocation for the entire model. This design is critical for CPU performance.

### Why One Allocation?

- The CPU doesn't care whether data is a weight or an activation - it just needs sequential memory access
- A single base pointer plus offsets means simpler memory management and no pointer chasing
- Hugepage alignment (1GB/2MB) enables NUMA-aware placement
- Layer-level prefetching - while layer N computes, layer N+1's data streams in from a different memory channel
### NUMA / Sub-NUMA Optimization

With 1GB hugepages, each page can be placed on a different DRAM bank:

```
Page 0 (0-1GB) → DRAM Bank 0 (Layers 0-5)
Page 1 (1-2GB) → DRAM Bank 1 (Layers 6-11)
Page 2 (2-3GB) → DRAM Bank 2 (Layers 12-17)
...
```
While Layer 0 computes, Layer 6's data can be prefetched from a different memory channel, so there is no bandwidth contention.
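A minimal sketch of that prefetch pattern, assuming cache-line (64-byte) granularity; `__builtin_prefetch` is a GCC/Clang intrinsic, and `prefetch_layer` is an illustrative name, not the engine's API:

```c
#include <stddef.h>

/* While layer N computes, hint the hardware to stream layer N+1's
   bytes in from its offset, one cache line at a time. Prefetch is
   advisory only: it never faults and never modifies memory. */
static void prefetch_layer(const char *base, size_t offset, size_t bytes) {
    for (size_t i = 0; i < bytes; i += 64)
        __builtin_prefetch(base + offset + i, 0, 1); /* read, low temporal locality */
}
```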
### Offset-Based Access

All tensors are accessed via offsets from a single base pointer:

```c
typedef struct {
    size_t ln1_gamma;    // weight
    size_t ln1_output;   // activation
    size_t wq, wk, wv;   // weights
    size_t q, k, v;      // activations
    // ...
} LayerOffsets;

// Access: base + layer->ln1_gamma
```
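Applied to a single allocation, every access reduces to pointer arithmetic. A toy sketch under assumed names - `ToyOffsets`, `tensor_at`, and the sizes are illustrative, not the real plan:

```c
#include <stdlib.h>

/* Toy offsets into one contiguous block; in practice the offsets come
   from the lowered memory_plan. */
typedef struct { size_t ln1_gamma, ln1_output; } ToyOffsets;

/* All tensor access is base + offset: no per-tensor pointers to chase. */
static float *tensor_at(void *base, size_t off) {
    return (float *)((char *)base + off);
}
```

Usage: allocate once, then read and write tensors purely through offsets.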
### Why NOT Separate Arenas?

Separating weights, activations, and gradients into different allocations would mean:

- 3+ base pointers to track (or 6+ with sections)
- Multiple malloc/canary regions to manage
- Cache misses when jumping between arenas
- Harder to optimize for NUMA placement

One allocation = one base pointer + offsets = maximum simplicity and performance.
## Key Concepts

### Shape Resolution

Each shape dimension is `{dim, mult, div}`:

```
size = dimensions[dim].value * mult / div
```
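The rule is plain integer arithmetic. One plausible use, assuming the IR expresses head_dim as embed scaled by 1/num_heads (the exact encodings are not shown in this document):

```c
/* {dim, mult, div}: scale a resolved dimension value to get an extent. */
static long shape_extent(long dim_value, long mult, long div) {
    return dim_value * mult / div;
}
```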
### Buffer Roles

- `weight` - model parameters
- `activation` - runtime tensors
- `grad` - gradients
### Buffer Scope

- `global` - shared across layers
- `layer` - per-layer tensors
## IR v2 JSON Structure

```json
{
  "version": 2,
  "notes": [
    "=== IR V2 FORMAT GUIDE ===",
    "DIMENSIONS: maps numeric IDs to named values from config.json",
    "  - dim:10 → dimensions[10] → 'vocab' → 151936",
    "BUFFERS: each has shape, role, dtype, scope",
    "NODES: kernel invocations with bindings to buffers"
  ],
  "config": {
    "num_layers": 24,
    "hidden_size": 896,
    "vocab_size": 151936,
    ...
  },
  "dimensions": [
    {"id": 0, "name": "tokens", "value": 131072},
    {"id": 1, "name": "embed", "value": 896},
    {"id": 10, "name": "vocab", "value": 151936}
  ],
  "buffers": [
    {
      "name": "token_emb",
      "scope": "global",
      "role": "weight",
      "dtype": "bf16",
      "shape": [{"dim": 10}, {"dim": 2}]
    }
  ],
  "nodes": [
    {
      "layer": 0,
      "op": "rmsnorm",
      "kernel": "rmsnorm_forward",
      "bindings": [
        {"arg": "input", "buffer": "input"},
        {"arg": "gamma", "buffer": "ln1_gamma"}
      ]
    }
  ]
}
```