# IR v2 Format Specification

IR v2 is a self-documenting JSON format that captures model architecture, buffer definitions, and computation graphs in a portable, machine-readable form.
## Key Features

- Portable - the same IR works across different batch sizes
- Self-documenting - contains notes explaining the format
- Extensible - supports encoder, decoder, and encoder-decoder architectures
## Pipeline Overview
## Dimension Mapping

The IR uses symbolic dimension IDs that map to concrete values from the model config:

| ID | Name | Source in config.json | Example (Qwen2-0.5B) |
|---|---|---|---|
| 0 | tokens | batch_size × seq_len (runtime) | 131072 |
| 1 | embed | hidden_size | 896 |
| 2 | aligned_embed | hidden_size (64-byte aligned) | 896 |
| 3 | head_dim | hidden_size / num_attention_heads | 64 |
| 5 | num_heads | num_attention_heads | 14 |
| 6 | num_kv_heads | num_key_value_heads | 2 |
| 8 | intermediate | intermediate_size | 4864 |
| 10 | vocab | vocab_size | 151936 |
### How Dimensions Resolve

```
shape[{dim:10}] → dimensions[10] → {name:"vocab", value:151936} → 151936 elements
```

Example: `token_emb` shape = [vocab, aligned_embed] = [151936, 896]
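The lookup above can be sketched in C. This is a hypothetical illustration, not the actual loader: `resolve_dim` is an assumed helper name, and the table values mirror the Qwen2-0.5B example (the real loader parses these entries out of ir_v2.json).

```c
#include <stddef.h>

/* Sketch of the dim-ID lookup: a dimension table resolved from config.json. */
typedef struct { int id; const char *name; long value; } Dim;

static const Dim dimensions[] = {
    {0, "tokens", 131072},
    {2, "aligned_embed", 896},
    {10, "vocab", 151936},
};

/* Resolve a symbolic dimension ID to its concrete value (-1 if unknown). */
static long resolve_dim(int id) {
    for (size_t i = 0; i < sizeof dimensions / sizeof dimensions[0]; i++)
        if (dimensions[i].id == id) return dimensions[i].value;
    return -1;
}
```

With this table, the `token_emb` element count is `resolve_dim(10) * resolve_dim(2)` = 151936 × 896 = 136,134,656 elements.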
## Input Files

### config.json (from HuggingFace)

Downloaded automatically; provides the model dimensions.

```json
{
  "hidden_size": 896,
  "vocab_size": 151936,
  "num_attention_heads": 14,
  "num_key_value_heads": 2,
  "intermediate_size": 4864,
  "num_hidden_layers": 24
}
```
### global_buffers.json (kernel_maps/)

Defines global buffers using symbolic dimension names.

```json
{
  "buffers": [
    {
      "name": "token_emb",
      "scope": "global",
      "role": "weight",
      "shape": [{"dim": "vocab"}, {"dim": "aligned_embed"}]
    }
  ]
}
```
### decoder_layer_plan.json (kernel_maps/)

Defines the operations in each transformer layer.

```json
{
  "steps": [
    {"kernel": "rmsnorm", "bind": {"input": "input", "gamma": "ln1_gamma"}},
    {"kernel": "qkv_project", ...},
    {"kernel": "attention", ...}
  ]
}
```
### ir_v2.json (Output)

Self-documenting IR with all dimensions resolved.

```json
{
  "version": 2,
  "notes": ["...format guide..."],
  "dimensions": [...],
  "buffers": [...],
  "nodes": [...]
}
```
## Usage

### Generate IR v2

```sh
# Fetch config from HuggingFace
make fetch-v2 IR_V2_HF=Qwen/Qwen2-0.5B

# Generate IR v2
make ir-v2 IR_V2_HF=Qwen/Qwen2-0.5B

# Generate C runtime
./build/ck_ir_v2_demo --ir build/ir_v2.json --emit build/generated_v2.c
```
### For GGUF Quantized Models

```sh
# Use the GGUF-specific repo
make ir-v2 IR_V2_HF=Qwen/Qwen2-0.5B-Instruct-GGUF \
    IR_V2_WEIGHTS=qwen2-0_5b-instruct-q4_k_m.gguf
```
## IR Lowering

The high-level IR is portable but not executable. The lowering pass converts it into a mode-specific form with a concrete memory plan.
### Prefill Mode

Processes the full sequence (prompt), all tokens in parallel.

```sh
./build/ck_ir_v2_demo config.json \
    --lower prefill
```
### Decode Mode

Generates one token at a time, using the KV cache.

```sh
./build/ck_ir_v2_demo config.json \
    --lower decode \
    --lower-out build/decode.json
```
### Backward Mode

Training; automatically builds the backward graph.

```sh
./build/ck_ir_v2_demo config.json \
    --lower backward
```
### What Lowering Adds

```json
{
  "lowering": {
    "mode": "prefill",
    "nodes_filtered": 12,
    "nodes_kept": 252
  },
  "memory_plan": {
    "total_bytes": 976914176,
    "buffers": [
      {"name": "token_emb", "offset": 0, "size": 272269312},
      {"name": "embedded_input", "offset": 272269312, "size": 469762048},
      {"name": "layer_0.ln1_gamma", "offset": 742031360, "size": 1792},
      {"name": "layer_0.ln1_output", "offset": 742033152, "size": ...}
    ]
  },
  "nodes": [ ... filtered & tagged ... ]
}
```
Note: all offsets are relative to a single base pointer; weights and activations are interleaved in execution order.
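Such a plan can be produced by a simple bump allocator walking the buffers in execution order from one base. The sketch below is an assumption about the mechanism, not the engine's actual planner; the 64-byte alignment is likewise an assumed value.

```c
#include <stddef.h>

/* Sequential offset assignment: round the cursor up to the alignment,
   hand out that offset, then advance the cursor past the buffer.
   align must be a power of two. */
static size_t plan_next(size_t *cursor, size_t size, size_t align) {
    size_t off = (*cursor + align - 1) & ~(align - 1);
    *cursor = off + size;
    return off;
}
```

Running the sizes from the example plan through this rule reproduces its offsets: token_emb at 0, embedded_input at 272269312, layer_0.ln1_gamma at 742031360.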
## Memory Layout Philosophy

C-Kernel-Engine uses a single contiguous memory allocation for the entire model. This design is critical for CPU performance.

### Why One Allocation?

- The CPU doesn't care whether data is a weight or an activation - it just needs sequential memory access
- A single base pointer plus offsets means simpler memory management and no pointer chasing
- Hugepage alignment (1GB/2MB) enables NUMA-aware placement
- Layer-level prefetching - while layer N computes, layer N+1's data streams in from a different memory channel
### NUMA / Sub-NUMA Optimization

With 1GB hugepages, each page can be placed on a different DRAM bank:

```
Page 0 (0-1GB) → DRAM Bank 0 (Layers 0-5)
Page 1 (1-2GB) → DRAM Bank 1 (Layers 6-11)
Page 2 (2-3GB) → DRAM Bank 2 (Layers 12-17)
...
```
While Layer 0 computes, Layer 6's data can be prefetched from a different memory channel, so there is no bandwidth contention.
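A minimal sketch of that prefetch pattern, assuming cache-line (64-byte) granularity; `__builtin_prefetch` is a GCC/Clang intrinsic, and `prefetch_layer` is an illustrative name, not the engine's API:

```c
#include <stddef.h>

/* While layer N computes, hint the hardware to stream layer N+1's
   bytes in from its offset, one cache line at a time. Prefetch is
   advisory only: it never faults and never modifies memory. */
static void prefetch_layer(const char *base, size_t offset, size_t bytes) {
    for (size_t i = 0; i < bytes; i += 64)
        __builtin_prefetch(base + offset + i, 0, 1); /* read, low temporal locality */
}
```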
### Offset-Based Access

All tensors are accessed via offsets from a single base pointer:

```c
typedef struct {
    size_t ln1_gamma;    // weight
    size_t ln1_output;   // activation
    size_t wq, wk, wv;   // weights
    size_t q, k, v;      // activations
    // ...
} LayerOffsets;

// Access: base + layer->ln1_gamma
```
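Applied to a single allocation, every access reduces to pointer arithmetic. A toy sketch under assumed names - `ToyOffsets`, `tensor_at`, and the sizes are illustrative, not the real plan:

```c
#include <stdlib.h>

/* Toy offsets into one contiguous block; in practice the offsets come
   from the lowered memory_plan. */
typedef struct { size_t ln1_gamma, ln1_output; } ToyOffsets;

/* All tensor access is base + offset: no per-tensor pointers to chase. */
static float *tensor_at(void *base, size_t off) {
    return (float *)((char *)base + off);
}
```

Usage: allocate once, then read and write tensors purely through offsets.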
### Why NOT Separate Arenas?

Separating weights, activations, and gradients into different allocations would mean:

- 3+ base pointers to track (or 6+ with sections)
- Multiple malloc/canary regions to manage
- Cache misses when jumping between arenas
- Harder to optimize for NUMA placement

One allocation = one base pointer + offsets = maximum simplicity and performance.
## Key Concepts

### Shape Resolution

Each shape dimension is `{dim, mult, div}`:

```
size = dimensions[dim].value * mult / div
```
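The rule is plain integer arithmetic. One plausible use, assuming the IR expresses head_dim as embed scaled by 1/num_heads (the exact encodings are not shown in this document):

```c
/* {dim, mult, div}: scale a resolved dimension value to get an extent. */
static long shape_extent(long dim_value, long mult, long div) {
    return dim_value * mult / div;
}
```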
### Buffer Roles

- `weight` - model parameters
- `activation` - runtime tensors
- `grad` - gradients
### Buffer Scope

- `global` - shared across layers
- `layer` - per-layer tensors
## IR v2 JSON Structure

```json
{
  "version": 2,
  "notes": [
    "=== IR V2 FORMAT GUIDE ===",
    "DIMENSIONS: maps numeric IDs to named values from config.json",
    "  - dim:10 → dimensions[10] → 'vocab' → 151936",
    "BUFFERS: each has shape, role, dtype, scope",
    "NODES: kernel invocations with bindings to buffers"
  ],
  "config": {
    "num_layers": 24,
    "hidden_size": 896,
    "vocab_size": 151936,
    ...
  },
  "dimensions": [
    {"id": 0, "name": "tokens", "value": 131072},
    {"id": 1, "name": "embed", "value": 896},
    {"id": 10, "name": "vocab", "value": 151936}
  ],
  "buffers": [
    {
      "name": "token_emb",
      "scope": "global",
      "role": "weight",
      "dtype": "bf16",
      "shape": [{"dim": 10}, {"dim": 2}]
    }
  ],
  "nodes": [
    {
      "layer": 0,
      "op": "rmsnorm",
      "kernel": "rmsnorm_forward",
      "bindings": [
        {"arg": "input", "buffer": "input"},
        {"arg": "gamma", "buffer": "ln1_gamma"}
      ]
    }
  ]
}
```