Code Generation
C-Kernel-Engine can generate optimized C runtime code from HuggingFace model configs. This enables running models with zero Python overhead.
Pipeline Overview
1. Config Parsing
Load a HuggingFace-style config.json and extract model dimensions.
ck_model_config_from_hf_json("config.json", &cfg);
2. IR Building
Construct an intermediate representation graph of all operations.
ck_build_decoder_ir(&cfg, &forward_graph);
ck_build_decoder_backward_ir(&forward_graph, &backward_graph);
3. Code Emission
Generate complete C code with forward and backward passes.
ck_codegen_emit_runtime(&forward_graph, "model.c");
Quick Start
Generate from Config
# Build the codegen tool
make build/ck_ir_demo

# Generate C runtime from a HuggingFace config
./build/ck_ir_demo path/to/config.json --emit build/model.c

# Or use the make target
make ck-emit CONFIG=path/to/config.json OUT=build/model.c
Library Mode (Prefill + Decode)
Library mode emits a generated runtime with an exported ABI suitable for dlopen/ctypes.
# Emit a model runtime with exported symbols (prefill + KV-cache decode + training entry points)
./build/ck_ir_demo path/to/config.json --emit build/model.c --emit-lib

# The generator also writes build/model.c.kernels (one kernel .c per line)
cc -O3 -fPIC -fopenmp -shared -Iinclude -o build/libmodel.so build/model.c $(cat build/model.c.kernels) -lm
Inference sequence: enable KV cache → prefill once → decode token-by-token. Decode is disabled when training is enabled.
Input: Model Config
HuggingFace config.json
The codegen accepts standard HuggingFace model configurations:
{
"architectures": ["LlamaForCausalLM"],
"hidden_size": 768,
"num_attention_heads": 12,
"num_key_value_heads": 4,
"num_hidden_layers": 6,
"intermediate_size": 2048,
"vocab_size": 32000,
"max_position_embeddings": 2048,
"rms_norm_eps": 1e-5,
"rope_theta": 10000.0
}
Parsed CKModelConfig
typedef struct {
int hidden_size; // 768
int num_attention_heads; // 12
int num_key_value_heads; // 4 (for GQA)
int num_hidden_layers; // 6
int intermediate_size; // 2048
int vocab_size; // 32000
int max_position_embeddings; // 2048
float rms_norm_eps; // 1e-5
float rope_theta; // 10000.0
} CKModelConfig;
Kernel Map: Single Source of Truth
Every kernel is tested individually with PyTorch parity. The codegen uses the exact same kernels that passed tests. No drift, no duplication.
CKKernelSpec: Kernel Registry
Each kernel maps to its tested C implementation:
const CKKernelSpec ck_kernel_specs[] = {
{"rmsnorm", "rmsnorm_forward", "rmsnorm_backward",
{"src/kernels/rmsnorm_kernels.c", NULL, ...}},
{"attention", "attention_forward_causal_head_major_gqa",
"attention_backward_causal_head_major_gqa",
{"src/kernels/attention_kernels.c",
"src/kernels/softmax_kernels.c", NULL, ...}},
{"swiglu", "swiglu_forward", "swiglu_backward",
{"src/kernels/swiglu_kernels.c",
"src/kernels/sigmoid_kernels.c", NULL, ...}},
};
CKBufferSpec: Buffer Definitions
All buffers with symbolic dimensions, roles, and conditions:
const CKBufferSpec ck_decoder_buffers[] = {
// Global buffers
{"token_emb", CK_SCOPE_GLOBAL, CK_ROLE_WEIGHT,
{{CK_DIM_VOCAB, 1, 1}, {CK_DIM_ALIGNED_EMBED, 1, 1}, ...}},
{"lm_head_weight", CK_SCOPE_GLOBAL, CK_ROLE_WEIGHT,
{{CK_DIM_VOCAB, 1, 1}, {CK_DIM_ALIGNED_EMBED, 1, 1}, ...},
.alias_of = "token_emb"}, // tied embeddings
{"rope_cos_cache", CK_SCOPE_GLOBAL, CK_ROLE_ACTIVATION,
{{CK_DIM_TOKENS, 1, 1}, {CK_DIM_HEAD_DIM, 1, 2}, ...},
.condition = "rope_theta"}, // only if RoPE enabled
// Per-layer buffers
{"q", CK_SCOPE_LAYER, CK_ROLE_OUTPUT,
{{CK_DIM_NUM_HEADS, 1, 1}, {CK_DIM_TOKENS, 1, 1},
{CK_DIM_ALIGNED_HEAD, 1, 1}, ...}},
};
CKPlanStep: Execution Plan
Forward pass as a sequence of kernel invocations:
const CKPlanStep ck_decoder_forward_plan[] = {
{"rmsnorm", NULL}, // LN1
{"qkv_project", NULL}, // Q, K, V projections
{"rope", "rope_theta>0"}, // RoPE (conditional)
{"attention", NULL}, // Self-attention
{"attn_proj", NULL}, // Output projection
{"residual_add", NULL}, // Residual connection
{"rmsnorm", NULL}, // LN2
{"mlp_up", NULL}, // FC1: hidden → 2×intermediate
{"swiglu", NULL}, // SwiGLU activation
{"mlp_down", NULL}, // FC2: intermediate → hidden
{"residual_add", NULL}, // Residual connection
};
How Codegen Uses This
- Read the execution plan (ck_decoder_forward_plan)
- For each step, look up the kernel in ck_kernel_specs
- Emit source files from sources[]
- Emit the function call using forward/backward
- Bind buffers from ck_decoder_buffers
Why This Matters
- No duplication: codegen uses tested kernels
- No drift: one source of truth
- Composable: add new kernels to registry
- Conditional: RoPE, bias, etc. are optional
- Typed: symbolic dims, roles, conditions
Registry Flow: How It All Fits Together
The Complete Pipeline
| Step | Component | What It Does |
|---|---|---|
| 1 | config.json | HuggingFace model config with dimensions (hidden_size, num_heads, etc.) |
| 2 | CKIRGraph | Parse config into IR with operations and tensor shapes |
| 3 | CKKernelSpec[] | Lookup kernel name → get forward/backward function + source files |
| 4 | CKBufferSpec[] | Get buffer shapes, roles, dtypes → compute memory layout |
| 5 | CKPlanStep[] | Execute kernels in order, respecting conditions |
| 6 | Bump Allocator | Resolve symbolic dims → concrete byte offsets (dtype-aware) |
| 7 | Source Emitter | Emit tested kernel sources + function call wiring |
| 8 | generated_model.c | Standalone C file: gcc -O3 model.c -o model |
The generated model.c contains the exact same kernel code that passed PyTorch parity tests. The registry ensures no code is written twice.
Adding a New Kernel
To add a new kernel (e.g., GeLU activation):
// 1. Write and test the kernel
// src/kernels/gelu_kernels.c
void gelu_forward(const float *in, float *out, int n);
void gelu_backward(const float *in, const float *d_out, float *d_in, int n);
// 2. Add to kernel specs
{"gelu", "gelu_forward", "gelu_backward",
{"src/kernels/gelu_kernels.c", NULL, ...}},
// 3. Add buffers if needed
{"gelu_out", CK_SCOPE_LAYER, CK_ROLE_OUTPUT,
{{CK_DIM_TOKENS, 1, 1}, {CK_DIM_ALIGNED_EMBED, 1, 1}, ...}},
// 4. Add to execution plan (for models that use GeLU)
{"gelu", "activation_type==gelu"},
Codegen automatically picks up the new kernel. No other changes needed.
Intermediate Representation
The IR layer decouples config parsing from code generation, enabling optimizations and different backends.
IR Graph Structure
typedef struct {
CKOpType op; // CK_OP_RMSNORM, CK_OP_ATTENTION, etc.
int layer_index;
int input_ids[4]; // References to input nodes
int output_id; // Output node ID
// ... dimension info
} CKIRNode;
typedef struct {
CKModelConfig config;
CKIRNode *nodes;
int num_nodes;
int num_layers;
} CKIRGraph;
Supported Operations
| Op Type | Forward | Backward |
|---|---|---|
| CK_OP_RMSNORM | rmsnorm_forward | rmsnorm_backward |
| CK_OP_ATTENTION | attention_forward_causal_head_major_gqa | attention_backward_causal_head_major_gqa |
| CK_OP_ROPE | rope_forward_qk | rope_backward_qk |
| CK_OP_SWIGLU | swiglu_forward | swiglu_backward |
| CK_OP_LINEAR | gemm_blocked_serial | fc_backward_kernel |
| CK_OP_RESIDUAL | ck_add_inplace | (gradient passthrough) |
Output: Generated Code
Generated Forward Pass
void forward_layer_0(
const float *input,
const ModelWeights *weights,
LayerActivations *acts,
const float *cos_cache,
const float *sin_cache,
int num_tokens
) {
// 1. Pre-attention RMSNorm
rmsnorm_forward(input, weights->ln1_gamma, acts->ln1_out,
acts->rstd1, num_tokens, 768, 768, 1e-5f);
// 2. QKV projection
ck_qkv_project_head_major(acts->ln1_out,
weights->wq, weights->bq,
weights->wk, weights->bk,
weights->wv, weights->bv,
acts->q, acts->k, acts->v,
num_tokens, 768, 12, 4, 64);
// 3. Apply RoPE
rope_forward_qk(acts->q, acts->k, cos_cache, sin_cache,
12, 4, num_tokens, 64, 0);
// 4. Attention
attention_forward_causal_head_major_gqa(
acts->q, acts->k, acts->v,
acts->scores, acts->attn_out,
12, 4, num_tokens, 64, 64, 2048);
// 5. Output projection + residual
ck_attention_project_head_major(acts->attn_out, weights->wo, weights->bo,
acts->proj_out, acts->scratch,
num_tokens, 768, 12, 64);
ck_add_inplace(acts->proj_out, input, num_tokens, 768);
// 6-8. MLP block...
}
Generated Backward Pass
void backward_layer_0(
const float *d_output,
const ModelWeights *weights,
const LayerActivations *acts,
WeightGradients *grads,
float *d_input
) {
// Reverse order of forward pass
// 1. Backward through MLP residual
// d_mlp_out = d_output (residual gradient passthrough)
// 2. Backward through FC2
fc2_backward_kernel(d_output, acts->swiglu_out, weights->w2,
d_swiglu, grads->d_w2, grads->d_b2, ...);
// 3. Backward through SwiGLU
swiglu_backward(acts->fc1_out, d_swiglu, d_fc1, num_tokens, 2048);
// ... continue backwards through all ops
// N. Backward through RMSNorm 1
rmsnorm_backward(d_rmsnorm, input, weights->ln1_gamma,
acts->rstd1, d_input, grads->d_ln1_gamma,
num_tokens, 768, 768);
}
IR Serialization
Export IR to JSON
ck_ir_serialize_json(&graph, "model_ir.json");
Enables inspection, debugging, and external tooling.
Load IR from JSON
ck_ir_parse_json("model_ir.json", &graph);
Two-stage pipeline: generate IR once, emit code multiple times.
Memory Layout
Buffer Allocation
The codegen calculates all buffer sizes based on config dimensions:
layout_transformer_from_ir(&model, &ir);

// Computed sizes:
// - Weight memory: embeddings + all layer weights
// - Activation memory: per-layer intermediates
// - KV cache: for autoregressive generation
| Buffer | Size Formula |
|---|---|
| Q | num_heads × max_tokens × head_dim |
| K, V | num_kv_heads × max_tokens × head_dim |
| Attention scores | num_heads × max_tokens × context_window |
| MLP hidden | max_tokens × 2 × intermediate_size |
| Layer weights | num_layers × (QKV + O + MLP weights) |
Bump Allocator: Dtype-Aware Layout
All memory is allocated in one contiguous block. Offsets are computed at codegen time based on dtype. No runtime malloc, clean strides, cache-friendly.
Mixed Precision Layout
Weights can be bf16/int8 while activations stay fp32:
// Generated at codegen time - dtype-aware
size_t off = 0;

// Weights: bf16 (2 bytes each)
m->token_emb_offset = off;
off += align64(V * aligned_embed * 2);              // bf16
m->wq_offset = off;
off += align64(num_heads * head_dim * embed * 2);   // bf16

// Activations: fp32 (4 bytes each)
m->embedded_input_offset = off;
off += align64(T * aligned_embed * 4);              // fp32
m->q_offset = off;
off += align64(num_heads * T * head_dim * 4);       // fp32
Dtype Enum
typedef enum {
CK_DTYPE_F32 = 0, // 4 bytes
CK_DTYPE_F16, // 2 bytes
CK_DTYPE_BF16, // 2 bytes
CK_DTYPE_I8, // 1 byte
CK_DTYPE_I4, // 0.5 bytes (packed)
} CKDType;
Per-Buffer Dtype
{"wq", CK_SCOPE_LAYER, CK_ROLE_WEIGHT,
CK_DTYPE_BF16, // weights in bf16
{{CK_DIM_NUM_HEADS, ...}}},
{"q", CK_SCOPE_LAYER, CK_ROLE_OUTPUT,
CK_DTYPE_F32, // activations in fp32
{{CK_DIM_NUM_HEADS, ...}}},
All Offsets Cacheline-Aligned
Every buffer starts at a 64-byte boundary for optimal SIMD access:
static size_t align64(size_t n) {
return (n + 63) & ~63;
}
static size_t bump(size_t *off, size_t bytes) {
size_t start = align64(*off);
*off = start + bytes;
return start;
}
Source Files
| File | Purpose |
|---|---|
| ckernel_ir.c | IR graph building, serialization |
| ckernel_codegen.c | C code emission from IR |
| ckernel_kernel_specs.c | Kernel map: specs, buffers, execution plan |
| ckernel_model_layout.c | Bump allocator, memory layout |
| ckernel_model_load.c | Weight loading from files |
| ckernel_registry.c | Op validation, kernel lookup |
For full source code, see the Doxygen source browser.