System Architecture
C-Kernel-Engine uses a three-stage pipeline to transform model configurations into optimized C runtimes.
New: See the IR Pipeline v6.6 page for end-to-end visuals of templates, IR1/IR2, lowering, memory layout, and dumb codegen.
The "Website" Metaphor
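The engine treats an LLM the way a static site generator treats a page: a header emitted once, a block section repeated many times, and a footer emitted once. Because the block section is the same transformer layer repeated, the codegen can unroll it in C without complex control flow.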
| Section | Website | LLM |
|---|---|---|
| Header | <head>, Nav, CSS | Embeddings, Positional Encoding |
| Block | Blog Posts, Articles | Transformer Layers (repeated) |
| Footer | Copyright, Scripts | Final Norm, Language Head |
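To make the metaphor concrete, here is a minimal sketch of an emitter that writes the header call once, one straight-line call per layer, and the footer call once. It is an illustration rather than the engine's actual codegen: `append_line`, `emit_unrolled_forward`, `embed_forward`, and `final_norm_and_lm_head` are hypothetical names, and only the `forward_layer_N` naming follows the generated code shown on this page.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Append `line` to `out`; returns 0 on success, -1 if it would not fit. */
static int append_line(char *out, size_t cap, size_t *used, const char *line) {
    size_t n = strlen(line);
    if (*used + n + 1 > cap) return -1;      /* +1 for the terminating NUL */
    memcpy(out + *used, line, n + 1);
    *used += n;
    return 0;
}

/* Hypothetical emitter: header once, one call per layer, footer once.
 * embed_forward and final_norm_and_lm_head are placeholder names. */
static int emit_unrolled_forward(char *out, size_t cap, int num_layers) {
    assert(out != NULL && cap > 0 && num_layers >= 0);
    size_t used = 0;
    out[0] = '\0';

    /* Header section */
    if (append_line(out, cap, &used, "embed_forward(...);\n") != 0) return -1;

    /* Block section: this loop runs at generation time, so the emitted
     * C contains only straight-line calls. */
    for (int i = 0; i < num_layers; ++i) {
        char line[64];
        snprintf(line, sizeof line, "forward_layer_%d(...);\n", i);
        if (append_line(out, cap, &used, line) != 0) return -1;
    }

    /* Footer section */
    if (append_line(out, cap, &used, "final_norm_and_lm_head(...);\n") != 0) return -1;
    return (int)used;   /* number of characters written */
}
```

The loop lives in the generator, not in the generated runtime, which is what keeps the emitted block section free of control flow.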
Operator Spectrum Map
This view connects the data path (pretrain to RLHF/GRPO) with the compute path (math to inference/backprop) so operators can reason about the full stack in one place.
Data Runbook Links
- v7 SVG Dataset Runbook (Stage A and Stage B corpus generation)
- v7 Train + Inference Runbook (parity gates and train execution)
Computation Runbook Links
- IR Pipeline (IR1/IR2/lowering and memory mapping)
- v7 Profiling (perf, flamegraph, cache, VTune/Advisor)
- v7 Backprop IR (gradient path and training runtime checks)
Stage 1: Model Configuration
The engine accepts HuggingFace-style config.json files as input:
{
"hidden_size": 768,
"num_attention_heads": 12,
"num_key_value_heads": 4,
"num_hidden_layers": 6,
"intermediate_size": 2048,
"rms_norm_eps": 1e-5,
"rope_theta": 10000.0
}
This defines all the dimensions needed to generate layer structures.
Stage 2: Intermediate Representation
The IR Builder creates a structured representation of each layer:
CKLayerIR Structure
typedef struct {
int layer_index;
int embed_dim;
int num_heads;
int num_kv_heads;
int head_dim;
int intermediate_dim;
int context_window;
float eps;
float rope_theta;
} CKLayerIR;
Key Decisions
- Aligned dimensions: Head dim padded to cache-friendly sizes
- GQA ratio: Computed from num_heads / num_kv_heads
- Buffer sizing: Calculated for all intermediate activations
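The first two decisions can be sketched as a small derivation step. This is a sketch under stated assumptions, not the IR Builder's real code: `DerivedDims`, `round_up`, `derive_dims`, and `kv_head_for_query_head` are hypothetical names, head_dim is assumed to be hidden_size / num_attention_heads, the padding multiple is left as a parameter (16 floats would correspond to one 64-byte cache line), and the query-to-KV head mapping is the conventional integer division.

```c
#include <assert.h>

/* Hypothetical container for values derived from config.json. */
typedef struct {
    int head_dim;          /* hidden_size / num_attention_heads (assumed) */
    int aligned_head_dim;  /* head_dim rounded up to the padding multiple */
    int gqa_ratio;         /* query heads that share one KV head */
} DerivedDims;

/* Round `value` up to the next multiple of `multiple`. */
static int round_up(int value, int multiple) {
    assert(value >= 0 && multiple > 0);
    return (value + multiple - 1) / multiple * multiple;
}

static DerivedDims derive_dims(int hidden_size, int num_heads,
                               int num_kv_heads, int align_floats) {
    assert(num_heads > 0 && num_kv_heads > 0);
    assert(hidden_size % num_heads == 0);    /* heads must tile the hidden size */
    assert(num_heads % num_kv_heads == 0);   /* GQA groups must be equal-sized */

    DerivedDims d;
    d.head_dim = hidden_size / num_heads;
    d.aligned_head_dim = round_up(d.head_dim, align_floats);
    d.gqa_ratio = num_heads / num_kv_heads;
    return d;
}

/* Assumed mapping: consecutive query heads share a KV head. */
static int kv_head_for_query_head(int q_head, int gqa_ratio) {
    assert(q_head >= 0 && gqa_ratio > 0);
    return q_head / gqa_ratio;
}
```

With the Stage 1 example config (hidden_size 768, 12 attention heads, 4 KV heads), this gives a head_dim of 64 and a GQA ratio of 3, so query heads 0 through 2 would read KV head 0.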
Stage 3: Code Generation
The codegen emits complete C functions for forward and backward passes:
Generated Forward Pass
void forward_layer_0(
const float *input,
const ModelWeights *weights,
LayerActivations *acts,
const float *cos_cache,
const float *sin_cache,
int num_tokens
) {
// 1. Pre-attention RMSNorm
rmsnorm_forward(input, weights->ln1_gamma, acts->ln1_out, ...);
// 2. QKV projection
ck_qkv_project_head_major(acts->ln1_out, weights->wq, ...);
// 3. Apply RoPE
rope_forward_qk(acts->q, acts->k, cos_cache, sin_cache, ...);
// 4. Attention
attention_forward_causal_head_major_gqa(acts->q, acts->k, acts->v, ...);
// 5. Output projection + residual
// 6. Post-attention RMSNorm
// 7. MLP (SwiGLU)
// 8. Final residual
}
Generated Backward Pass
void backward_layer_0(
const float *d_output,
const ModelWeights *weights,
const LayerActivations *acts,
WeightGradients *grads,
float *d_input
) {
// Reverse order of forward pass
// Each kernel uses saved activations from forward
// 1. Backward through final residual
// 2. Backward through MLP (SwiGLU)
// 3. Backward through RMSNorm 2
// 4. Backward through attention output projection
// 5. Backward through attention
attention_backward_causal_head_major_gqa(d_attn_out, acts->q, ...);
// 6. Backward through RoPE (inverse rotation)
rope_backward_qk(d_q, d_k, ...);
// 7. Backward through QKV projection
// 8. Backward through RMSNorm 1
}
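Step 6 describes the RoPE backward pass as an inverse rotation. The sketch below shows why on a single pair of elements: the forward pass is a 2D rotation, and its gradient is the transposed rotation, which is the rotation by the negative angle. The function names are hypothetical and this is not the signature of `rope_backward_qk`; how the engine pairs elements within a head is not specified here, so the sketch works on one pair with precomputed cos/sin values.

```c
#include <assert.h>

/* Forward: rotate the pair (x0, x1) by the angle whose cosine and sine
 * are c and s (as read from cos_cache / sin_cache). */
static void rope_pair_forward(float x0, float x1, float c, float s,
                              float *y0, float *y1) {
    assert(y0 != 0 && y1 != 0);
    *y0 = x0 * c - x1 * s;
    *y1 = x0 * s + x1 * c;
}

/* Backward: multiply the upstream gradient by the transpose of the
 * rotation matrix, i.e. rotate by the negative angle. */
static void rope_pair_backward(float dy0, float dy1, float c, float s,
                               float *dx0, float *dx1) {
    assert(dx0 != 0 && dx1 != 0);
    *dx0 =  dy0 * c + dy1 * s;
    *dx1 = -dy0 * s + dy1 * c;
}
```

Because a rotation matrix is orthogonal, applying the backward function to the forward output recovers the original pair, which makes for a cheap parity check.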
Memory Layout
Q/K/V use a [num_heads, num_tokens, head_dim] layout for cache-efficient attention computation.
| Buffer | Layout | Size |
|---|---|---|
| input | [B, T, D] | batch * tokens * embed_dim |
| Q | [H, T, d_k] | num_heads * tokens * head_dim |
| K, V | [H_kv, T, d_k] | num_kv_heads * tokens * head_dim |
| scores | [H, T, T] | num_heads * tokens * context_window |
| mlp_hidden | [T, 2*I] | tokens * 2 * intermediate_dim |
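The head-major layout and the Size column above can be expressed as a short sketch. The names `head_major_offset`, `BufferElems`, and `buffer_elems` are hypothetical, sizes are counted in float elements rather than bytes, and the formulas are copied from the table as written.

```c
#include <assert.h>
#include <stddef.h>

/* Flat index of element (head, token, dim) in a [H, T, d_k] buffer.
 * All head_dim values for one (head, token) pair are contiguous, and
 * all tokens of one head are contiguous. */
static size_t head_major_offset(size_t head, size_t token, size_t dim,
                                size_t num_tokens, size_t head_dim) {
    assert(token < num_tokens && dim < head_dim);
    return (head * num_tokens + token) * head_dim + dim;
}

/* Element counts per buffer, following the Size column of the table. */
typedef struct {
    size_t q;           /* num_heads * tokens * head_dim */
    size_t kv_each;     /* num_kv_heads * tokens * head_dim, for K and for V */
    size_t scores;      /* num_heads * tokens * context_window */
    size_t mlp_hidden;  /* tokens * 2 * intermediate_dim */
} BufferElems;

static BufferElems buffer_elems(size_t num_heads, size_t num_kv_heads,
                                size_t tokens, size_t head_dim,
                                size_t context_window, size_t intermediate_dim) {
    BufferElems b;
    b.q = num_heads * tokens * head_dim;
    b.kv_each = num_kv_heads * tokens * head_dim;
    b.scores = num_heads * tokens * context_window;
    b.mlp_hidden = tokens * 2 * intermediate_dim;
    return b;
}
```

Multiplying each count by `sizeof(float)` gives the byte size to allocate, assuming float32 activations as in the generated function signatures.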
Kernel Composition
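Kernels are composed following the transformer layer structure, in the same order as the generated forward pass: RMSNorm, QKV projection, RoPE, attention, output projection with residual, RMSNorm, SwiGLU MLP, and the final residual.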
Build System
Full Library
make
Builds libckernel_engine.so with all kernels linked together.
Per-Kernel Libraries
make libckernel_attention.so
make libckernel_rope.so
make libckernel_rmsnorm.so
Builds individual kernel libraries for testing.
Codegen Pipeline
Generate Runtime from Config
# Build the IR demo tool
make build/ck_ir_demo
# Generate C runtime
./build/ck_ir_demo config.json --emit build/model.c
# Or use the make target
make ck-emit CONFIG=config.json OUT=build/model.c
The generated file contains:
- Buffer allocation functions
- Forward pass for all layers
- Backward pass for all layers
- Parameter gradient accumulation
Project Structure
The codebase is organized for easy navigation:
src/kernels
`-- fused
version/v6.6
|-- docs
|-- include
|-- kernel_maps
|-- patches
|-- scripts
| `-- parity
|-- src
| |-- generated
| |-- kernel_config
| |-- scripts
| `-- test_generated
|-- templates
|-- test
|-- testing
|-- tests
|-- tools
`-- unittest
version/v7
|-- artifacts
| `-- svg_dsl
| |-- gen1_archive_2026-04-05
| `-- spec_archive_2026-04-08
|-- contracts
|-- data
| |-- eval_contracts
| |-- generated
| | |-- toy_svg_semantic_shapes_tokenizer
| | `-- toy_svg_structured_atoms_tokenizer
| |-- probe_contracts
| |-- spec03
| | |-- contracts
| | |-- holdout
| | |-- manifests
| | |-- midtrain
| | |-- normalized
| | |-- pretrain
| | |-- raw_assets
| | |-- sft
| | `-- tokenizer
| `-- spec04
| |-- contracts
| |-- holdout
| |-- manifests
| |-- midtrain
| |-- normalized
| |-- pretrain
| |-- raw_assets
| |-- sft
| `-- tokenizer
|-- docs
|-- examples
|-- experiments
| `-- svg_dsl
| |-- catalog
| |-- core
| |-- programs
| `-- renderers
|-- include
|-- kernel_maps
|-- regression
|-- reports
| |-- spec12_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec12_gold_mappings
| |-- spec13b_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec13b_gold_mappings
| |-- spec14a_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec14a_gold_mappings
| |-- spec14b_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec14b_gold_mappings
| |-- spec15a_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec15a_gold_mappings
| |-- spec15b_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec15b_gold_mappings
| |-- spec_broader_1_family_packs -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec_broader_1_family_packs
| `-- spec_broader_1_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec_broader_1_gold_mappings
|-- runs
| |-- logs
| `-- overnight_monitor
| `-- spec10
|-- scripts
| |-- dataset
| `-- parity
|-- src
|-- templates
|-- test
|-- tests
| |-- contracts
| `-- fixtures
`-- tools
`-- src
87 directories