Code Generation

C-Kernel-Engine can generate optimized C runtime code from HuggingFace model configs. This enables running models with zero Python overhead.

Code Generation Pipeline

Pipeline Overview

1. Config Parsing

Load a HuggingFace-style config.json and extract model dimensions.

ck_model_config_from_hf_json(
    "config.json",
    &cfg
);

2. IR Building

Construct an intermediate representation graph of all operations.

ck_build_decoder_ir(
    &cfg,
    &forward_graph
);
ck_build_decoder_backward_ir(
    &forward_graph,
    &backward_graph
);

3. Code Emission

Generate complete C code with forward and backward passes.

ck_codegen_emit_runtime(
    &forward_graph,
    "model.c"
);

Quick Start

Generate from Config

# Build the codegen tool
make build/ck_ir_demo

# Generate C runtime from a HuggingFace config
./build/ck_ir_demo path/to/config.json --emit build/model.c

# Or use the make target
make ck-emit CONFIG=path/to/config.json OUT=build/model.c

Library Mode (Prefill + Decode)

Library mode emits a generated runtime with an exported ABI suitable for dlopen/ctypes.

# Emit a model runtime with exported symbols (prefill + KV-cache decode + training entry points)
./build/ck_ir_demo path/to/config.json --emit build/model.c --emit-lib

# The generator also writes build/model.c.kernels (one kernel .c per line)
cc -O3 -fPIC -fopenmp -shared -Iinclude -o build/libmodel.so build/model.c $(cat build/model.c.kernels) -lm

Inference sequence: enable KV cache → prefill once → decode token-by-token. Decode is disabled when training is enabled.

Input: Model Config

HuggingFace config.json

The codegen accepts standard HuggingFace model configurations:

{
  "architectures": ["LlamaForCausalLM"],
  "hidden_size": 768,
  "num_attention_heads": 12,
  "num_key_value_heads": 4,
  "num_hidden_layers": 6,
  "intermediate_size": 2048,
  "vocab_size": 32000,
  "max_position_embeddings": 2048,
  "rms_norm_eps": 1e-5,
  "rope_theta": 10000.0
}

Parsed CKModelConfig

typedef struct {
    int hidden_size;           // 768
    int num_attention_heads;   // 12
    int num_key_value_heads;   // 4 (for GQA)
    int num_hidden_layers;     // 6
    int intermediate_size;     // 2048
    int vocab_size;            // 32000
    int max_position_embeddings; // 2048
    float rms_norm_eps;        // 1e-5
    float rope_theta;          // 10000.0
} CKModelConfig;

Kernel Map: Single Source of Truth

Why a Kernel Map?
Every kernel is tested individually with PyTorch parity. The codegen uses the exact same kernels that passed tests. No drift, no duplication.

CKKernelSpec: Kernel Registry

Each kernel maps to its tested C implementation:

const CKKernelSpec ck_kernel_specs[] = {
    {"rmsnorm",   "rmsnorm_forward",   "rmsnorm_backward",
     {"src/kernels/rmsnorm_kernels.c", NULL, ...}},

    {"attention", "attention_forward_causal_head_major_gqa",
                  "attention_backward_causal_head_major_gqa",
     {"src/kernels/attention_kernels.c",
      "src/kernels/softmax_kernels.c", NULL, ...}},

    {"swiglu",    "swiglu_forward",    "swiglu_backward",
     {"src/kernels/swiglu_kernels.c",
      "src/kernels/sigmoid_kernels.c", NULL, ...}},
};

CKBufferSpec: Buffer Definitions

All buffers with symbolic dimensions, roles, and conditions:

const CKBufferSpec ck_decoder_buffers[] = {
    // Global buffers
    {"token_emb", CK_SCOPE_GLOBAL, CK_ROLE_WEIGHT,
     {{CK_DIM_VOCAB, 1, 1}, {CK_DIM_ALIGNED_EMBED, 1, 1}, ...}},

    {"lm_head_weight", CK_SCOPE_GLOBAL, CK_ROLE_WEIGHT,
     {{CK_DIM_VOCAB, 1, 1}, {CK_DIM_ALIGNED_EMBED, 1, 1}, ...},
     .alias_of = "token_emb"},  // tied embeddings

    {"rope_cos_cache", CK_SCOPE_GLOBAL, CK_ROLE_ACTIVATION,
     {{CK_DIM_TOKENS, 1, 1}, {CK_DIM_HEAD_DIM, 1, 2}, ...},
     .condition = "rope_theta"},  // only if RoPE enabled

    // Per-layer buffers
    {"q", CK_SCOPE_LAYER, CK_ROLE_OUTPUT,
     {{CK_DIM_NUM_HEADS, 1, 1}, {CK_DIM_TOKENS, 1, 1},
      {CK_DIM_ALIGNED_HEAD, 1, 1}, ...}},
};

CKPlanStep: Execution Plan

Forward pass as a sequence of kernel invocations:

const CKPlanStep ck_decoder_forward_plan[] = {
    {"rmsnorm",      NULL},           // LN1
    {"qkv_project",  NULL},           // Q, K, V projections
    {"rope",         "rope_theta>0"}, // RoPE (conditional)
    {"attention",    NULL},           // Self-attention
    {"attn_proj",    NULL},           // Output projection
    {"residual_add", NULL},           // Residual connection
    {"rmsnorm",      NULL},           // LN2
    {"mlp_up",       NULL},           // FC1: hidden → 2×intermediate
    {"swiglu",       NULL},           // SwiGLU activation
    {"mlp_down",     NULL},           // FC2: intermediate → hidden
    {"residual_add", NULL},           // Residual connection
};

How Codegen Uses This

  1. Read execution plan (ck_decoder_forward_plan)
  2. For each step, lookup kernel in ck_kernel_specs
  3. Emit source files from sources[]
  4. Emit function call using forward/backward
  5. Bind buffers from ck_decoder_buffers

Why This Matters

  • No duplication: codegen uses tested kernels
  • No drift: one source of truth
  • Composable: add new kernels to registry
  • Conditional: RoPE, bias, etc. are optional
  • Typed: symbolic dims, roles, conditions

Registry Flow: How It All Fits Together

Kernel Registry Flow Diagram

The Complete Pipeline

Step  Component           What It Does
1     config.json         HuggingFace model config with dimensions (hidden_size, num_heads, etc.)
2     CKIRGraph           Parse config into IR with operations and tensor shapes
3     CKKernelSpec[]      Look up kernel name → forward/backward function + source files
4     CKBufferSpec[]      Get buffer shapes, roles, dtypes → compute memory layout
5     CKPlanStep[]        Execute kernels in order, respecting conditions
6     Bump Allocator      Resolve symbolic dims → concrete byte offsets (dtype-aware)
7     Source Emitter      Emit tested kernel sources + function call wiring
8     generated_model.c   Standalone C file: gcc -O3 model.c -o model

Key Insight
The generated model.c contains the exact same kernel code that passed PyTorch parity tests. The registry ensures no code is written twice.

Adding a New Kernel

To add a new kernel (e.g., GeLU activation):

// 1. Write and test the kernel
// src/kernels/gelu_kernels.c
void gelu_forward(const float *in, float *out, int n);
void gelu_backward(const float *in, const float *d_out, float *d_in, int n);

// 2. Add to kernel specs
{"gelu", "gelu_forward", "gelu_backward",
 {"src/kernels/gelu_kernels.c", NULL, ...}},

// 3. Add buffers if needed
{"gelu_out", CK_SCOPE_LAYER, CK_ROLE_OUTPUT,
 {{CK_DIM_TOKENS, 1, 1}, {CK_DIM_ALIGNED_EMBED, 1, 1}, ...}},

// 4. Add to execution plan (for models that use GeLU)
{"gelu", "activation_type==gelu"},

Codegen automatically picks up the new kernel. No other changes needed.

Intermediate Representation

Why IR?
The IR layer decouples config parsing from code generation, enabling optimizations and different backends.

IR Graph Structure

typedef struct {
    CKOpType op;        // CK_OP_RMSNORM, CK_OP_ATTENTION, etc.
    int layer_index;
    int input_ids[4];   // References to input nodes
    int output_id;      // Output node ID
    // ... dimension info
} CKIRNode;

typedef struct {
    CKModelConfig config;
    CKIRNode *nodes;
    int num_nodes;
    int num_layers;
} CKIRGraph;

Supported Operations

Op Type          Forward                                   Backward
CK_OP_RMSNORM    rmsnorm_forward                           rmsnorm_backward
CK_OP_ATTENTION  attention_forward_causal_head_major_gqa   attention_backward_causal_head_major_gqa
CK_OP_ROPE       rope_forward_qk                           rope_backward_qk
CK_OP_SWIGLU     swiglu_forward                            swiglu_backward
CK_OP_LINEAR     gemm_blocked_serial                       fc_backward_kernel
CK_OP_RESIDUAL   ck_add_inplace                            (gradient passthrough)

Output: Generated Code

Generated Forward Pass

void forward_layer_0(
    const float *input,
    const ModelWeights *weights,
    LayerActivations *acts,
    const float *cos_cache,
    const float *sin_cache,
    int num_tokens
) {
    // 1. Pre-attention RMSNorm
    rmsnorm_forward(input, weights->ln1_gamma, acts->ln1_out,
                    acts->rstd1, num_tokens, 768, 768, 1e-5f);

    // 2. QKV projection
    ck_qkv_project_head_major(acts->ln1_out,
        weights->wq, weights->bq,
        weights->wk, weights->bk,
        weights->wv, weights->bv,
        acts->q, acts->k, acts->v,
        num_tokens, 768, 12, 4, 64);

    // 3. Apply RoPE
    rope_forward_qk(acts->q, acts->k, cos_cache, sin_cache,
                    12, 4, num_tokens, 64, 0);

    // 4. Attention
    attention_forward_causal_head_major_gqa(
        acts->q, acts->k, acts->v,
        acts->scores, acts->attn_out,
        12, 4, num_tokens, 64, 64, 2048);

    // 5. Output projection + residual
    ck_attention_project_head_major(acts->attn_out, weights->wo, weights->bo,
                                     acts->proj_out, acts->scratch,
                                     num_tokens, 768, 12, 64);
    ck_add_inplace(acts->proj_out, input, num_tokens, 768);

    // 6-8. MLP block...
}

Generated Backward Pass

void backward_layer_0(
    const float *d_output,
    const ModelWeights *weights,
    const LayerActivations *acts,
    WeightGradients *grads,
    float *d_input
) {
    // Reverse order of forward pass

    // 1. Backward through MLP residual
    // d_mlp_out = d_output (residual gradient passthrough)

    // 2. Backward through FC2
    fc2_backward_kernel(d_output, acts->swiglu_out, weights->w2,
                        d_swiglu, grads->d_w2, grads->d_b2, ...);

    // 3. Backward through SwiGLU
    swiglu_backward(acts->fc1_out, d_swiglu, d_fc1, num_tokens, 2048);

    // ... continue backwards through all ops

    // N. Backward through RMSNorm 1
    rmsnorm_backward(d_rmsnorm, input, weights->ln1_gamma,
                     acts->rstd1, d_input, grads->d_ln1_gamma,
                     num_tokens, 768, 768);
}

IR Serialization

Export IR to JSON

ck_ir_serialize_json(
    &graph,
    "model_ir.json"
);

Enables inspection, debugging, and external tooling.

Load IR from JSON

ck_ir_parse_json(
    "model_ir.json",
    &graph
);

Two-stage pipeline: generate IR once, emit code multiple times.

Memory Layout

Buffer Allocation

The codegen calculates all buffer sizes based on config dimensions:

layout_transformer_from_ir(&model, &ir);

// Computed sizes:
// - Weight memory: embeddings + all layer weights
// - Activation memory: per-layer intermediates
// - KV cache: for autoregressive generation

Buffer            Size Formula
Q                 num_heads × max_tokens × head_dim
K, V              num_kv_heads × max_tokens × head_dim
Attention scores  num_heads × max_tokens × context_window
MLP hidden        max_tokens × 2 × intermediate_size
Layer weights     num_layers × (QKV + O + MLP weights)

Bump Allocator: Dtype-Aware Layout

Why Bump Allocation?
All memory is allocated in one contiguous block. Offsets are computed at codegen time based on dtype. No runtime malloc, clean strides, cache-friendly.

Mixed Precision Layout

Weights can be bf16/int8 while activations stay fp32:

// Generated at codegen time - dtype-aware
size_t off = 0;

// Weights: bf16 (2 bytes each)
m->token_emb_offset = off;
off += align64(V * aligned_embed * 2);     // bf16

m->wq_offset = off;
off += align64(num_heads * head_dim * embed * 2);  // bf16

// Activations: fp32 (4 bytes each)
m->embedded_input_offset = off;
off += align64(T * aligned_embed * 4);     // fp32

m->q_offset = off;
off += align64(num_heads * T * head_dim * 4);  // fp32

Dtype Enum

typedef enum {
    CK_DTYPE_F32 = 0,  // 4 bytes
    CK_DTYPE_F16,      // 2 bytes
    CK_DTYPE_BF16,     // 2 bytes
    CK_DTYPE_I8,       // 1 byte
    CK_DTYPE_I4,       // 0.5 bytes (packed)
} CKDType;

Per-Buffer Dtype

{"wq", CK_SCOPE_LAYER, CK_ROLE_WEIGHT,
 CK_DTYPE_BF16,  // weights in bf16
 {{CK_DIM_NUM_HEADS, ...}}},

{"q", CK_SCOPE_LAYER, CK_ROLE_OUTPUT,
 CK_DTYPE_F32,   // activations in fp32
 {{CK_DIM_NUM_HEADS, ...}}},

All Offsets Cacheline-Aligned

Every buffer starts at a 64-byte boundary for optimal SIMD access:

static size_t align64(size_t n) {
    return (n + 63) & ~63;
}

static size_t bump(size_t *off, size_t bytes) {
    size_t start = align64(*off);
    *off = start + bytes;
    return start;
}

Source Files

File                     Purpose
ckernel_ir.c             IR graph building, serialization
ckernel_codegen.c        C code emission from IR
ckernel_kernel_specs.c   Kernel map: specs, buffers, execution plan
ckernel_model_layout.c   Bump allocator, memory layout
ckernel_model_load.c     Weight loading from files
ckernel_registry.c       Op validation, kernel lookup

For full source code, see the Doxygen source browser.
