IR v2 Format Specification

IR v2 is a self-documenting JSON format that captures model architecture, buffer definitions, and computation graphs in portable, machine-readable form.

Pipeline Overview

IR v2 pipeline, from HuggingFace to C runtime:

config.json (HuggingFace; external input)
    hidden_size: 896, vocab_size: 151936, num_layers: 24
global_buffers.json (kernel_maps/; config you write)
    token_emb: [vocab, embed], logits: [tokens, vocab]
decoder_layer_plan.json (kernel_maps/; config you write)
    rmsnorm → qkv_project → rope → attention → mlp → residual
        ↓
scripts/build_ir_v2.py (Python compiler)
    parse config → build dimensions → load buffers → plan nodes
        ↓
ir_v2.json (generated output)
    dimensions: [{id:10, name:"vocab"}], buffers: [{name:"token_emb",...}], nodes: [{kernel:"rmsnorm",...}]
        ↓
generated_v2.c (generated output)
    CKV2BufferLayout buffers[] = { {0, 272269312, 0}, // token_emb ... };

Dimension Mapping

The IR uses symbolic dimension IDs that map to actual values from the model config:

ID   Name           Source in config.json               Example (Qwen2-0.5B)
0    tokens         batch_size x seq_len (runtime)      131072
1    embed          hidden_size                         896
2    aligned_embed  hidden_size (64-byte aligned)       896
3    head_dim       hidden_size / num_attention_heads   64
5    num_heads      num_attention_heads                 14
6    num_kv_heads   num_key_value_heads                 2
8    intermediate   intermediate_size                   4864
10   vocab          vocab_size                          151936

How Dimensions Resolve

shape[{dim:10}] → dimensions[10] → {name:"vocab", value:151936} → 151936 elements

Example: token_emb shape = [vocab, aligned_embed] = [151936, 896]
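A buffer's byte size follows from the resolved shape and the dtype width. A minimal sketch of that arithmetic (the DTYPE_SIZES table is an assumption; the IR records dtype names, not widths):

```python
# Sketch: resolving a buffer's byte size from the dimension table.
# Dimension values come from the Qwen2-0.5B example above; DTYPE_SIZES
# is an assumption (the IR stores dtype names like "bf16", not widths).
dimensions = {10: ("vocab", 151936), 2: ("aligned_embed", 896)}

DTYPE_SIZES = {"bf16": 2, "fp32": 4}

def buffer_bytes(shape_dim_ids, dtype):
    elements = 1
    for dim_id in shape_dim_ids:
        name, value = dimensions[dim_id]
        elements *= value
    return elements * DTYPE_SIZES[dtype]

# token_emb: [vocab, aligned_embed] in bf16
print(buffer_bytes([10, 2], "bf16"))  # 272269312, matching generated_v2.c
```

The result matches the token_emb size emitted into generated_v2.c in the pipeline overview.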

Input Files

config.json (from HuggingFace)

Downloaded automatically. Provides model dimensions.

{
  "hidden_size": 896,
  "vocab_size": 151936,
  "num_attention_heads": 14,
  "num_key_value_heads": 2,
  "intermediate_size": 4864,
  "num_hidden_layers": 24
}

global_buffers.json (kernel_maps/)

Defines global buffers using symbolic dimension names.

{"buffers": [
  {"name": "token_emb",
   "scope": "global",
   "role": "weight",
   "shape": [{"dim":"vocab"},
             {"dim":"aligned_embed"}]}
]}

decoder_layer_plan.json (kernel_maps/)

Defines operations in each transformer layer.

{"steps": [
  {"kernel": "rmsnorm",
   "bind": {"input": "input",
            "gamma": "ln1_gamma"}},
  {"kernel": "qkv_project", ...},
  {"kernel": "attention", ...}
]}
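The compiler presumably replays this plan once per layer to produce the flat node list. A minimal sketch, assuming every bound buffer is layer-scoped and gets a layer_N. prefix (as the memory-plan names suggest); expand_plan is illustrative, not the actual build_ir_v2.py API:

```python
# Sketch: expanding the per-layer plan into the flat node list in ir_v2.json.
# expand_plan is illustrative; the real compiler's scoping rules (which
# buffers actually get the per-layer prefix) may differ.
def expand_plan(steps, num_layers):
    nodes = []
    for layer in range(num_layers):
        for step in steps:
            nodes.append({
                "layer": layer,
                "kernel": step["kernel"],
                # layer-scoped buffers appear as e.g. "layer_0.ln1_gamma"
                "bindings": [{"arg": arg, "buffer": f"layer_{layer}.{buf}"}
                             for arg, buf in step.get("bind", {}).items()],
            })
    return nodes

steps = [{"kernel": "rmsnorm", "bind": {"input": "input", "gamma": "ln1_gamma"}}]
nodes = expand_plan(steps, num_layers=24)
print(len(nodes))  # 24: one rmsnorm node per layer
```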

ir_v2.json (Output)

Self-documenting IR with all dimensions resolved.

{
  "version": 2,
  "notes": ["...format guide..."],
  "dimensions": [...],
  "buffers": [...],
  "nodes": [...]
}

Usage

Generate IR v2

# Fetch config from HuggingFace
make fetch-v2 IR_V2_HF=Qwen/Qwen2-0.5B

# Generate IR v2
make ir-v2 IR_V2_HF=Qwen/Qwen2-0.5B

# Generate C runtime
./build/ck_ir_v2_demo --ir build/ir_v2.json --emit build/generated_v2.c

For GGUF Quantized Models

# Use the GGUF-specific repo
make ir-v2 IR_V2_HF=Qwen/Qwen2-0.5B-Instruct-GGUF \
           IR_V2_WEIGHTS=qwen2-0_5b-instruct-q4_k_m.gguf

IR Lowering

The high-level IR is portable but not executable. The lowering pass converts it to a mode-specific form with a concrete memory plan.

ir_v2.json (portable IR)
    --lower→ lowering pass: filter nodes, compute memory
        → mode-specific output: prefill.json, decode.json, or backward.json

Each output includes:

  • lowering metadata
  • memory_plan (offsets)
  • filtered nodes

Prefill Mode

Process full sequence (prompt). All tokens in parallel.

./build/ck_ir_v2_demo config.json \
  --lower prefill

Decode Mode

Generate one token at a time. Uses KV cache.

./build/ck_ir_v2_demo config.json \
  --lower decode \
  --lower-out build/decode.json

Backward Mode

Training. Auto-builds backward graph.

./build/ck_ir_v2_demo config.json \
  --lower backward

What Lowering Adds

{
  "lowering": {
    "mode": "prefill",
    "nodes_filtered": 12,
    "nodes_kept": 252
  },
  "memory_plan": {
    "total_bytes": 976914176,
    "buffers": [
      {"name": "token_emb", "offset": 0, "size": 272269312},
      {"name": "embedded_input", "offset": 272269312, "size": 469762048},
      {"name": "layer_0.ln1_gamma", "offset": 742031360, "size": 1792},
      {"name": "layer_0.ln1_output", "offset": 742033152, "size": ...}
    ]
  },
  "nodes": [ ... filtered & tagged ... ]
}

Note: All offsets are relative to a single base pointer. Weights and activations are interleaved in execution order.
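The offsets above can be reproduced as a running sum of buffer sizes. A minimal sketch, assuming no alignment padding between buffers (the real pass may insert some):

```python
# Sketch: the memory plan as a running sum of buffer sizes in execution order.
# Sizes are taken from the prefill excerpt above.
def plan_offsets(buffers):
    plan, offset = [], 0
    for name, size in buffers:
        plan.append({"name": name, "offset": offset, "size": size})
        offset += size
    return plan, offset  # final offset = total bytes planned so far

plan, total = plan_offsets([
    ("token_emb",         272269312),
    ("embedded_input",    469762048),
    ("layer_0.ln1_gamma", 1792),
])
print(plan[2]["offset"])  # 742031360, matching layer_0.ln1_gamma above
```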

Memory Layout Philosophy

C-Kernel-Engine uses a single contiguous memory allocation for the entire model. This design is critical for CPU performance.

Why One Allocation?

Memory layout, a single contiguous block in execution order:

  Token embeddings (weight)
  Embedded input (activation)
  Layer 0: ln1_gamma (W), ln1_output (A), wq/wk/wv (W), q/k/v (A), attn_scores (A), proj_weight (W), MLP weights & activations interleaved
  Layer 1 ... Layer N: same pattern
  Final LayerNorm (weight + activation)
  Logits (activation)
  Gradient storage (if training): same layout, d_weights and d_activations interleaved

NUMA / Sub-NUMA Optimization

With 1GB hugepages, each page can be placed on a different DRAM bank:

Page 0 (0-1GB)   → DRAM Bank 0 (Layer 0-5)
Page 1 (1-2GB)   → DRAM Bank 1 (Layer 6-11)
Page 2 (2-3GB)   → DRAM Bank 2 (Layer 12-17)
...

While Layer 0 computes, Layer 6's data can be prefetched from a different memory channel, with no bandwidth contention.
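Which page (and hence which bank) a buffer lands on follows directly from its offset. A small sketch of that bookkeeping; the binding itself would happen at allocation time (e.g. via mbind on Linux), and the second offset below is hypothetical:

```python
# Sketch: pure arithmetic mapping a buffer offset to its 1 GB hugepage
# (and, under the binding scheme above, its DRAM bank).
PAGE = 1 << 30  # 1 GB hugepage

def page_of(offset):
    return offset // PAGE

print(page_of(272269312))      # 0: token_emb lands in page 0 / bank 0
print(page_of(2_500_000_000))  # 2: a hypothetical offset past 2 GB -> bank 2
```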

Offset-Based Access

All tensors are accessed via offsets from a single base pointer:

typedef struct {
    size_t ln1_gamma;      // weight
    size_t ln1_output;     // activation
    size_t wq, wk, wv;     // weights
    size_t q, k, v;        // activations
    /* ... */
} LayerOffsets;

// Access: base + layer->ln1_gamma

Why NOT Separate Arenas?

Separating weights/activations/gradients into different allocations would mean:

  • Multiple base pointers to carry through every kernel instead of one
  • Weights and the activations they produce far apart in memory, losing the execution-order locality that makes prefetching effective
  • Harder hugepage and NUMA placement, since each layer's data would span allocations

One allocation = one base pointer + offsets = maximum simplicity and performance.

Key Concepts

Shape Resolution

Each shape dimension is {dim, mult, div}:

size = dimensions[dim].value
       * mult / div
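A minimal sketch of this rule; the default mult=1/div=1 behavior and the example entries are assumptions, with dimension values from the Qwen2-0.5B table:

```python
# Sketch of the {dim, mult, div} resolution rule. The example entries
# (and the assumed defaults mult=1, div=1) are illustrative.
dimensions = {1: 896, 5: 14}  # embed, num_heads

def resolve(entry):
    return dimensions[entry["dim"]] * entry.get("mult", 1) // entry.get("div", 1)

print(resolve({"dim": 1}))                        # 896 (embed)
print(resolve({"dim": 1, "mult": 1, "div": 14}))  # 64  (embed / num_heads = head_dim)
```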

Buffer Roles

  • weight - Model params
  • activation - Runtime tensors
  • grad - Gradients

Buffer Scope

  • global - Shared across layers
  • layer - Per-layer tensors

IR v2 JSON Structure

{
  "version": 2,
  "notes": [
    "=== IR V2 FORMAT GUIDE ===",
    "DIMENSIONS: maps numeric IDs to named values from config.json",
    "  - dim:10 → dimensions[10] → 'vocab' → 151936",
    "BUFFERS: each has shape, role, dtype, scope",
    "NODES: kernel invocations with bindings to buffers"
  ],
  "config": {
    "num_layers": 24,
    "hidden_size": 896,
    "vocab_size": 151936,
    ...
  },
  "dimensions": [
    {"id": 0, "name": "tokens", "value": 131072},
    {"id": 1, "name": "embed", "value": 896},
    {"id": 10, "name": "vocab", "value": 151936}
  ],
  "buffers": [
    {
      "name": "token_emb",
      "scope": "global",
      "role": "weight",
      "dtype": "bf16",
      "shape": [{"dim": 10}, {"dim": 2}]
    }
  ],
  "nodes": [
    {
      "layer": 0,
      "op": "rmsnorm",
      "kernel": "rmsnorm_forward",
      "bindings": [
        {"arg": "input", "buffer": "input"},
        {"arg": "gamma", "buffer": "ln1_gamma"}
      ]
    }
  ]
}
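A consumer can sanity-check such a document before emitting code: every shape entry should name a declared dimension, and every node binding a declared buffer. A minimal sketch, not part of the actual toolchain (the embed_lookup kernel name is made up):

```python
# Sketch: a minimal consistency check over an IR v2 document.
# Illustrative only; real layer-scoped buffer naming may differ.
def validate(ir):
    dim_ids = {d["id"] for d in ir["dimensions"]}
    buf_names = {b["name"] for b in ir["buffers"]}
    for b in ir["buffers"]:
        for s in b["shape"]:
            if s["dim"] not in dim_ids:
                raise ValueError(f"buffer {b['name']}: unknown dim id {s['dim']}")
    for n in ir["nodes"]:
        for bind in n["bindings"]:
            if bind["buffer"] not in buf_names:
                raise ValueError(f"node {n['kernel']}: unknown buffer {bind['buffer']}")

ir = {
    "dimensions": [{"id": 10, "name": "vocab", "value": 151936},
                   {"id": 2, "name": "aligned_embed", "value": 896}],
    "buffers": [{"name": "token_emb", "role": "weight", "dtype": "bf16",
                 "shape": [{"dim": 10}, {"dim": 2}]}],
    "nodes": [{"kernel": "embed_lookup",
               "bindings": [{"arg": "table", "buffer": "token_emb"}]}],
}
validate(ir)  # passes silently for a consistent document
```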