Code Generation
C-Kernel-Engine can generate optimized C runtime code from HuggingFace model configs. This enables running models with zero Python overhead.
Pipeline Overview
1. Config Parsing
Load a HuggingFace-style config.json and extract model dimensions.
ck_model_config_from_hf_json("config.json", &cfg);
2. IR Building
Construct an intermediate representation graph of all operations.
ck_build_decoder_ir(&cfg, &forward_graph);
ck_build_decoder_backward_ir(&forward_graph, &backward_graph);
3. Code Emission
Generate complete C code with forward and backward passes.
ck_codegen_emit_runtime(&forward_graph, "model.c");
Quick Start
Generate from Config
# Build the codegen tool
make build/ck_ir_demo

# Generate C runtime from a HuggingFace config
./build/ck_ir_demo path/to/config.json --emit build/model.c

# Or use the make target
make ck-emit CONFIG=path/to/config.json OUT=build/model.c
Library Mode (Prefill + Decode)
Library mode emits a generated runtime with an exported ABI suitable for dlopen/ctypes.
# Emit a model runtime with exported symbols (prefill + KV-cache decode + training entry points)
./build/ck_ir_demo path/to/config.json --emit build/model.c --emit-lib

# The generator also writes build/model.c.kernels (one kernel .c per line)
cc -O3 -fPIC -fopenmp -shared -Iinclude -o build/libmodel.so build/model.c $(cat build/model.c.kernels) -lm
Inference sequence: enable KV cache → prefill once → decode token-by-token. Decode is disabled when training is enabled.
Input: Model Config
HuggingFace config.json
The codegen accepts standard HuggingFace model configurations:
{
"architectures": ["LlamaForCausalLM"],
"hidden_size": 768,
"num_attention_heads": 12,
"num_key_value_heads": 4,
"num_hidden_layers": 6,
"intermediate_size": 2048,
"vocab_size": 32000,
"max_position_embeddings": 2048,
"rms_norm_eps": 1e-5,
"rope_theta": 10000.0
}
Parsed CKModelConfig
typedef struct {
int hidden_size; // 768
int num_attention_heads; // 12
int num_key_value_heads; // 4 (for GQA)
int num_hidden_layers; // 6
int intermediate_size; // 2048
int vocab_size; // 32000
int max_position_embeddings; // 2048
float rms_norm_eps; // 1e-5
float rope_theta; // 10000.0
} CKModelConfig;
Kernel Map: Single Source of Truth
Every kernel is tested individually with PyTorch parity. The codegen uses the exact same kernels that passed tests. No drift, no duplication.
CKKernelSpec: Kernel Registry
Each kernel maps to its tested C implementation:
const CKKernelSpec ck_kernel_specs[] = {
{"rmsnorm", "rmsnorm_forward", "rmsnorm_backward",
{"src/kernels/rmsnorm_kernels.c", NULL, ...}},
{"attention", "attention_forward_causal_head_major_gqa",
"attention_backward_causal_head_major_gqa",
{"src/kernels/attention_kernels.c",
"src/kernels/softmax_kernels.c", NULL, ...}},
{"swiglu", "swiglu_forward", "swiglu_backward",
{"src/kernels/swiglu_kernels.c",
"src/kernels/sigmoid_kernels.c", NULL, ...}},
};
CKBufferSpec: Buffer Definitions
All buffers with symbolic dimensions, roles, and conditions:
const CKBufferSpec ck_decoder_buffers[] = {
// Global buffers
{"token_emb", CK_SCOPE_GLOBAL, CK_ROLE_WEIGHT,
{{CK_DIM_VOCAB, 1, 1}, {CK_DIM_ALIGNED_EMBED, 1, 1}, ...}},
{"lm_head_weight", CK_SCOPE_GLOBAL, CK_ROLE_WEIGHT,
{{CK_DIM_VOCAB, 1, 1}, {CK_DIM_ALIGNED_EMBED, 1, 1}, ...},
.alias_of = "token_emb"}, // tied embeddings
{"rope_cos_cache", CK_SCOPE_GLOBAL, CK_ROLE_ACTIVATION,
{{CK_DIM_TOKENS, 1, 1}, {CK_DIM_HEAD_DIM, 1, 2}, ...},
.condition = "rope_theta"}, // only if RoPE enabled
// Per-layer buffers
{"q", CK_SCOPE_LAYER, CK_ROLE_OUTPUT,
{{CK_DIM_NUM_HEADS, 1, 1}, {CK_DIM_TOKENS, 1, 1},
{CK_DIM_ALIGNED_HEAD, 1, 1}, ...}},
};
CKPlanStep: Execution Plan
Forward pass as a sequence of kernel invocations:
const CKPlanStep ck_decoder_forward_plan[] = {
{"rmsnorm", NULL}, // LN1
{"qkv_project", NULL}, // Q, K, V projections
{"rope", "rope_theta>0"}, // RoPE (conditional)
{"attention", NULL}, // Self-attention
{"attn_proj", NULL}, // Output projection
{"residual_add", NULL}, // Residual connection
{"rmsnorm", NULL}, // LN2
{"mlp_up", NULL}, // FC1: hidden → 2×intermediate
{"swiglu", NULL}, // SwiGLU activation
{"mlp_down", NULL}, // FC2: intermediate → hidden
{"residual_add", NULL}, // Residual connection
};
How Codegen Uses This
- Read the execution plan (ck_decoder_forward_plan)
- For each step, look up the kernel in ck_kernel_specs
- Emit source files from sources[]
- Emit the function call using forward/backward
- Bind buffers from ck_decoder_buffers
Why This Matters
- No duplication: codegen uses tested kernels
- No drift: one source of truth
- Composable: add new kernels to registry
- Conditional: RoPE, bias, etc. are optional
- Typed: symbolic dims, roles, conditions
Registry Flow: How It All Fits Together
The Complete Pipeline
| Step | Component | What It Does |
|---|---|---|
| 1 | config.json | HuggingFace model config with dimensions (hidden_size, num_heads, etc.) |
| 2 | CKIRGraph | Parse config into IR with operations and tensor shapes |
| 3 | CKKernelSpec[] | Lookup kernel name → get forward/backward function + source files |
| 4 | CKBufferSpec[] | Get buffer shapes, roles, dtypes → compute memory layout |
| 5 | CKPlanStep[] | Execute kernels in order, respecting conditions |
| 6 | Bump Allocator | Resolve symbolic dims → concrete byte offsets (dtype-aware) |
| 7 | Source Emitter | Emit tested kernel sources + function call wiring |
| 8 | generated_model.c | Standalone C file: gcc -O3 model.c -o model |
The generated model.c contains the exact same kernel code that passed PyTorch parity tests. The registry ensures no code is written twice.
Adding a New Kernel
To add a new kernel (e.g., GeLU activation):
// 1. Write and test the kernel
// src/kernels/gelu_kernels.c
void gelu_forward(const float *in, float *out, int n);
void gelu_backward(const float *in, const float *d_out, float *d_in, int n);
// 2. Add to kernel specs
{"gelu", "gelu_forward", "gelu_backward",
{"src/kernels/gelu_kernels.c", NULL, ...}},
// 3. Add buffers if needed
{"gelu_out", CK_SCOPE_LAYER, CK_ROLE_OUTPUT,
{{CK_DIM_TOKENS, 1, 1}, {CK_DIM_ALIGNED_EMBED, 1, 1}, ...}},
// 4. Add to execution plan (for models that use GeLU)
{"gelu", "activation_type==gelu"},
Codegen automatically picks up the new kernel. No other changes needed.
Intermediate Representation
The IR layer decouples config parsing from code generation, enabling optimizations and different backends.
IR Graph Structure
typedef struct {
CKOpType op; // CK_OP_RMSNORM, CK_OP_ATTENTION, etc.
int layer_index;
int input_ids[4]; // References to input nodes
int output_id; // Output node ID
// ... dimension info
} CKIRNode;
typedef struct {
CKModelConfig config;
CKIRNode *nodes;
int num_nodes;
int num_layers;
} CKIRGraph;
Supported Operations
| Op Type | Forward | Backward |
|---|---|---|
| CK_OP_RMSNORM | rmsnorm_forward | rmsnorm_backward |
| CK_OP_ATTENTION | attention_forward_causal_head_major_gqa | attention_backward_causal_head_major_gqa |
| CK_OP_ROPE | rope_forward_qk | rope_backward_qk |
| CK_OP_SWIGLU | swiglu_forward | swiglu_backward |
| CK_OP_LINEAR | gemm_blocked_serial | fc_backward_kernel |
| CK_OP_RESIDUAL | ck_add_inplace | (gradient passthrough) |
Output: Generated Code
Generated Forward Pass
void forward_layer_0(
const float *input,
const ModelWeights *weights,
LayerActivations *acts,
const float *cos_cache,
const float *sin_cache,
int num_tokens
) {
// 1. Pre-attention RMSNorm
rmsnorm_forward(input, weights->ln1_gamma, acts->ln1_out,
acts->rstd1, num_tokens, 768, 768, 1e-5f);
// 2. QKV projection
ck_qkv_project_head_major(acts->ln1_out,
weights->wq, weights->bq,
weights->wk, weights->bk,
weights->wv, weights->bv,
acts->q, acts->k, acts->v,
num_tokens, 768, 12, 4, 64);
// 3. Apply RoPE
rope_forward_qk(acts->q, acts->k, cos_cache, sin_cache,
12, 4, num_tokens, 64, 0);
// 4. Attention
attention_forward_causal_head_major_gqa(
acts->q, acts->k, acts->v,
acts->scores, acts->attn_out,
12, 4, num_tokens, 64, 64, 2048);
// 5. Output projection + residual
ck_attention_project_head_major(acts->attn_out, weights->wo, weights->bo,
acts->proj_out, acts->scratch,
num_tokens, 768, 12, 64);
ck_add_inplace(acts->proj_out, input, num_tokens, 768);
// 6-8. MLP block...
}
Generated Backward Pass
void backward_layer_0(
const float *d_output,
const ModelWeights *weights,
const LayerActivations *acts,
WeightGradients *grads,
float *d_input
) {
// Reverse order of forward pass
// 1. Backward through MLP residual
// d_mlp_out = d_output (residual gradient passthrough)
// 2. Backward through FC2
fc2_backward_kernel(d_output, acts->swiglu_out, weights->w2,
d_swiglu, grads->d_w2, grads->d_b2, ...);
// 3. Backward through SwiGLU
swiglu_backward(acts->fc1_out, d_swiglu, d_fc1, num_tokens, 2048);
// ... continue backwards through all ops
// N. Backward through RMSNorm 1
rmsnorm_backward(d_rmsnorm, input, weights->ln1_gamma,
acts->rstd1, d_input, grads->d_ln1_gamma,
num_tokens, 768, 768);
}
IR Serialization
Export IR to JSON
ck_ir_serialize_json(&graph, "model_ir.json");
Enables inspection, debugging, and external tooling.
Load IR from JSON
ck_ir_parse_json("model_ir.json", &graph);
Two-stage pipeline: generate IR once, emit code multiple times.
Memory Layout
Buffer Allocation
The codegen calculates all buffer sizes based on config dimensions:
layout_transformer_from_ir(&model, &ir);

// Computed sizes:
// - Weight memory: embeddings + all layer weights
// - Activation memory: per-layer intermediates
// - KV cache: for autoregressive generation
| Buffer | Size Formula |
|---|---|
| Q | num_heads × max_tokens × head_dim |
| K, V | num_kv_heads × max_tokens × head_dim |
| Attention scores | num_heads × max_tokens × context_window |
| MLP hidden | max_tokens × 2 × intermediate_size |
| Layer weights | num_layers × (QKV + O + MLP weights) |
Bump Allocator: Dtype-Aware Layout
All memory is allocated in one contiguous block. Offsets are computed at codegen time based on dtype. No runtime malloc, clean strides, cache-friendly.
Mixed Precision Layout
Weights can be bf16/int8 while activations stay fp32:
// Generated at codegen time - dtype-aware
size_t off = 0;

// Weights: bf16 (2 bytes each)
m->token_emb_offset = off;
off += align64(V * aligned_embed * 2);              // bf16
m->wq_offset = off;
off += align64(num_heads * head_dim * embed * 2);   // bf16

// Activations: fp32 (4 bytes each)
m->embedded_input_offset = off;
off += align64(T * aligned_embed * 4);              // fp32
m->q_offset = off;
off += align64(num_heads * T * head_dim * 4);       // fp32
Dtype Enum
typedef enum {
CK_DTYPE_F32 = 0, // 4 bytes
CK_DTYPE_F16, // 2 bytes
CK_DTYPE_BF16, // 2 bytes
CK_DTYPE_I8, // 1 byte
CK_DTYPE_I4, // 0.5 bytes (packed)
} CKDType;
Per-Buffer Dtype
{"wq", CK_SCOPE_LAYER, CK_ROLE_WEIGHT,
CK_DTYPE_BF16, // weights in bf16
{{CK_DIM_NUM_HEADS, ...}}},
{"q", CK_SCOPE_LAYER, CK_ROLE_OUTPUT,
CK_DTYPE_F32, // activations in fp32
{{CK_DIM_NUM_HEADS, ...}}},
All Offsets Cacheline-Aligned
Every buffer starts at a 64-byte boundary for optimal SIMD access:
static size_t align64(size_t n) {
return (n + 63) & ~63;
}
static size_t bump(size_t *off, size_t bytes) {
size_t start = align64(*off);
*off = start + bytes;
return start;
}
Source Files
| File | Purpose |
|---|---|
| ckernel_ir.c | IR graph building, serialization |
| ckernel_codegen.c | C code emission from IR |
| ckernel_kernel_specs.c | Kernel map: specs, buffers, execution plan |
| ckernel_model_layout.c | Bump allocator, memory layout |
| ckernel_model_load.c | Weight loading from files |
| ckernel_registry.c | Op validation, kernel lookup |
For full source code, see the Doxygen source browser.