Developer Guide
This guide explains how C-Kernel-Engine works from config.json to running inference, with deep dives into code generation, memory allocation, and profiling.
Overview: The Big Picture
The Complete Pipeline
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ config.json │────▶│ IR Graph │────▶│ generated.c │
│ (model params) │ │ (compute DAG) │ │ (C runtime) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│
▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Output logits │◀────│ Kernel Exec │◀────│ GCC compile │
│ (inference) │ │ (forward pass) │ │ (native binary)│
└─────────────────┘ └─────────────────┘ └─────────────────┘
What Makes This Different
| Traditional ML Framework | C-Kernel-Engine |
|---|---|
| Runtime graph interpretation | Compile-time code generation |
| Dynamic memory allocation | Single bump allocation at startup |
| Python/C++ dispatch overhead | Pure C, direct kernel calls |
| Generic kernels | Model-specific generated code |
Step 1: Config to Code Generation
1.1 The Config File
A standard HuggingFace-style config defines the model:
{
"hidden_size": 64,
"num_attention_heads": 2,
"num_key_value_heads": 2,
"num_hidden_layers": 2,
"intermediate_size": 128,
"vocab_size": 256,
"max_position_embeddings": 64,
"rms_norm_eps": 1e-5,
"rope_theta": 10000.0
}
1.2 IR Generation
The codegen tool (build/ck_ir_demo) parses the config and builds an IR graph:
$ make ck
=== Forward IR ===
CKIRGraph: layers=2, hidden_size=64, intermediate_size=128
L0 N0 EMBED_TOKENS  outputs=[embedded]  inputs=[tokens]
L0 N1 RMSNORM       outputs=[norm1]     inputs=[embedded]
L0 N2 LINEAR_QKV    outputs=[qkv]       inputs=[norm1]
L0 N3 ROPE          outputs=[qkv_rope]  inputs=[qkv]
L0 N4 ATTENTION     outputs=[attn_out]  inputs=[qkv_rope]
L0 N5 LINEAR_O      outputs=[proj]      inputs=[attn_out]
L0 N6 RESIDUAL_ADD  outputs=[res1]      inputs=[embedded,proj]
L0 N7 RMSNORM       outputs=[norm2]     inputs=[res1]
L0 N8 MLP_SWIGLU    outputs=[mlp]       inputs=[norm2]
L0 N9 RESIDUAL_ADD  outputs=[L0_out]    inputs=[res1,mlp]
...
Each node maps to a tested C kernel function.
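To make the dump above concrete, here is a minimal sketch of what one IR node could look like as a C struct. The type and field names (`CKIRNode`, `CKOpKind`, etc.) are illustrative assumptions, not the engine's actual internal API:

```c
#include <string.h>

/* Hypothetical IR node layout, reconstructed from the `make ck` dump.
 * All names here are illustrative, not the engine's real definitions. */
typedef enum {
    CK_EMBED_TOKENS, CK_RMSNORM, CK_LINEAR_QKV, CK_ROPE,
    CK_ATTENTION, CK_LINEAR_O, CK_RESIDUAL_ADD, CK_MLP_SWIGLU
} CKOpKind;

typedef struct {
    CKOpKind kind;          /* which kernel the node lowers to */
    int layer;              /* decoder layer the node belongs to */
    const char *inputs[2];  /* named input buffers (at most 2 in the dump) */
    const char *output;     /* named output buffer */
} CKIRNode;

/* N6 from the dump: res1 = embedded + proj */
static const CKIRNode residual = {
    .kind = CK_RESIDUAL_ADD, .layer = 0,
    .inputs = {"embedded", "proj"}, .output = "res1",
};
```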
1.3 Code Emission
The IR is lowered to C code with all sizes baked in:
$ make emit CONFIG=tiny.config.json OUT=build/tiny_generated.c
// Generated code snippet (simplified):
void run_decoder_forward(TransformerModel *m) {
float *input = ptr_f32(m->memory_base, m->embedded_input_offset);
float *norm1 = ptr_f32(m->memory_base, m->layer[0].norm1_offset);
// Layer 0: Attention block
rmsnorm_forward(norm1, input, m->layer[0].attn_norm_weight,
64, 64, 1e-5f); // All sizes known at codegen!
attention_forward_causal_head_major_gqa(
attn_out, norm1, qkv, ...
64, 64, 2, 2, 32, 64); // T, D, heads, kv_heads, head_dim, ctx
...
}
1.4 The Kernel Manifest
Along with the C file, a .kernels manifest is generated:
$ cat build/tiny_generated.c.kernels
src/kernels/rmsnorm_kernels.c
src/kernels/attention_kernels.c
src/kernels/softmax_kernels.c
src/kernels/gemm_kernels.c
src/kernels/rope_kernels.c
src/kernels/swiglu_kernels.c
src/kernels/embedding_kernels.c
src/kernels/loss_kernels.c
This tells GCC which kernel source files to compile:
gcc -O3 build/tiny_generated.c $(cat build/tiny_generated.c.kernels) -o build/tiny_model
Step 2: Memory Layout & Bump Allocator
2.1 The Bump Allocator Philosophy
One Allocation, Zero Fragmentation
Instead of thousands of malloc/free calls, we:
- Compute total size at codegen time - All buffer sizes known from config
- Allocate once at startup - A single mmap or aligned_alloc
- Use offsets, not pointers - Each buffer is base + offset
- Free once at exit - No leaks possible
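The core of a bump allocator fits in a few lines. This is a minimal sketch of the idea described above; the `CKBump`/`ck_bump_reserve` names are illustrative, not the engine's API:

```c
#include <stddef.h>

/* Minimal bump-allocation sketch: reserve offsets at codegen time,
 * allocate the whole buffer once at startup. Names are illustrative. */
typedef struct {
    char  *base;   /* the single allocation */
    size_t used;   /* bump pointer; only ever grows */
    size_t cap;
} CKBump;

/* Reserve `bytes` and return its offset. Buffers are later
 * addressed as base + offset, so no per-buffer malloc is needed. */
static size_t ck_bump_reserve(CKBump *b, size_t bytes) {
    size_t off = b->used;
    b->used += (bytes + 63) & ~(size_t)63;  /* keep 64-byte alignment */
    return off;
}
```

At codegen time every buffer gets an offset from `ck_bump_reserve`; at startup one allocation of `b->used` bytes backs them all.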
2.2 Memory Layout Visualization
┌──────────────────────────────────────────────────────────────────┐
│ 2MB Bump Buffer │
├──────────────┬──────────────┬──────────────┬────────────────────┤
│ Embeddings │ Weights │ Activations │ Gradients │
│ (static) │ (static) │ (reused) │ (if training) │
├──────────────┼──────────────┼──────────────┼────────────────────┤
│ offset: 0 │ offset: 64K │ offset: 512K │ offset: 1M │
└──────────────┴──────────────┴──────────────┴────────────────────┘
// Accessing a buffer:
float *attn_output = (float*)(model.memory_base + model.attn_output_offset);
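The `ptr_f32` helper that appears in the generated code can be as simple as the following. This is a plausible one-line definition, assumed here for illustration; the real helper may differ:

```c
#include <stddef.h>

/* Resolve an offset into the bump buffer as a float pointer.
 * A plausible sketch of the ptr_f32 helper used by the generated code. */
static inline float *ptr_f32(void *base, size_t offset) {
    return (float *)((char *)base + offset);
}
```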
2.3 Huge Page Optimization
The allocator tries to use 2MB huge pages for better TLB performance:
void *ck_huge_alloc(size_t bytes) {
    const size_t HUGE_2MB = 2UL * 1024 * 1024;
    size_t len = align_up(bytes, HUGE_2MB);
    // 1. Try explicit huge pages (best)
    void *p = mmap(NULL, len, PROT_READ|PROT_WRITE,
                   MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED) return p;
    // 2. Fallback: aligned_alloc + THP hint
    void *q = aligned_alloc(HUGE_2MB, len);
    if (q) madvise(q, len, MADV_HUGEPAGE); // Ask kernel for transparent huge pages
    return q;
}
Why huge pages? A 540MB model needs 138,240 TLB entries with 4KB pages, but only 270 with 2MB pages.
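The page-count arithmetic is just a round-up division. A tiny helper makes the 138,240-vs-270 numbers easy to verify (`tlb_entries` is an illustrative name, not part of the engine):

```c
#include <stddef.h>

/* Number of page-table entries needed to map `bytes` at a given
 * page size: the arithmetic behind the huge-page claim above. */
static size_t tlb_entries(size_t bytes, size_t page) {
    return (bytes + page - 1) / page;  /* round up to whole pages */
}
```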
2.4 Viewing the Layout
$ make tiny-train TINY_TRAIN_ARGS="--dump"
=== Memory Layout ===
Total bytes: 2097152 (2.00 MB)
embedded_input_offset:       0  (16384 bytes)
embed_tokens_weight:     16384  (65536 bytes)
layer[0].attn_norm:      81920  (256 bytes)
layer[0].qkv_weight:     82176  (24576 bytes)
...
logits_offset:         1048576  (65536 bytes)
gradients_offset:      1114112  (524288 bytes)
Step 3: Kernel Execution
3.1 Kernel Registry
Each kernel is registered with its forward and backward functions:
// src/ckernel_kernel_specs.c
const CKKernelSpec ck_kernel_specs[] = {
{
.name = "rmsnorm",
.forward_fn = "rmsnorm_forward",
.backward_fn = "rmsnorm_backward",
.source_files = {"src/kernels/rmsnorm_kernels.c", NULL}
},
{
.name = "attention",
.forward_fn = "attention_forward_causal_head_major_gqa",
.backward_fn = "attention_backward_causal_head_major_gqa",
.source_files = {"src/kernels/attention_kernels.c",
"src/kernels/softmax_kernels.c", NULL}
},
...
};
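Resolving a spec at codegen time is then a linear scan over the array. The sketch below assumes a plausible `CKKernelSpec` layout matching the snippet above; the lookup function `ck_find_kernel_spec` is illustrative, not necessarily the engine's API:

```c
#include <string.h>
#include <stddef.h>

/* Assumed layout matching the registry snippet above. */
typedef struct {
    const char *name;
    const char *forward_fn;
    const char *backward_fn;
    const char *source_files[4];
} CKKernelSpec;

static const CKKernelSpec ck_kernel_specs[] = {
    { "rmsnorm", "rmsnorm_forward", "rmsnorm_backward",
      { "src/kernels/rmsnorm_kernels.c", NULL } },
};

/* Illustrative lookup: find a spec by name, or NULL if unregistered. */
static const CKKernelSpec *ck_find_kernel_spec(const char *name) {
    for (size_t i = 0; i < sizeof ck_kernel_specs / sizeof ck_kernel_specs[0]; i++)
        if (strcmp(ck_kernel_specs[i].name, name) == 0)
            return &ck_kernel_specs[i];
    return NULL;
}
```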
3.2 Forward Pass Flow
void run_model_forward(TransformerModel *m) {
// 1. Token embedding lookup
embed_tokens(m, tokens, seq_len);
// 2. Process each layer
for (int layer = 0; layer < m->num_layers; layer++) {
// Attention block
rmsnorm_forward(norm1, input, weights, D, T, eps);
linear_forward(qkv, norm1, qkv_weight, T, D, 3*D);
rope_forward(qkv, qkv, T, head_dim, theta);
attention_forward_causal(attn_out, qkv, T, D, heads, kv_heads);
linear_forward(proj, attn_out, o_weight, T, D, D);
residual_add(res1, input, proj, T * D);
// MLP block
rmsnorm_forward(norm2, res1, weights, D, T, eps);
mlp_swiglu_forward(mlp_out, norm2, gate, up, down, T, D, I);
residual_add(output, res1, mlp_out, T * D);
input = output; // Next layer's input
}
// 3. Final norm + LM head
rmsnorm_forward(final_norm, input, weights, D, T, eps);
lm_head_forward(logits, final_norm, lm_weight, T, D, V);
}
3.3 Example Kernel: RMSNorm
// src/kernels/rmsnorm_kernels.c
void rmsnorm_forward(float *out, const float *x, const float *weight,
int D, int T, float eps) {
for (int t = 0; t < T; t++) {
const float *row = x + t * D;
float *out_row = out + t * D;
// Compute RMS
float sum_sq = 0.0f;
for (int d = 0; d < D; d++) {
sum_sq += row[d] * row[d];
}
float rms = sqrtf(sum_sq / D + eps);
float scale = 1.0f / rms;
// Normalize and scale
for (int d = 0; d < D; d++) {
out_row[d] = row[d] * scale * weight[d];
}
}
}
Step 4: Backward Pass & Training
4.1 Gradient Computation
Every forward kernel has a corresponding backward kernel:
// Forward: y = rmsnorm(x, weight)
rmsnorm_forward(y, x, weight, D, T, eps);
// Backward: given dy, compute dx and dweight
rmsnorm_backward(dx, dweight, dy, x, weight, D, T, eps);
4.2 Training Loop
for (int step = 0; step < num_steps; step++) {
// 1. Forward pass
embed_tokens(&m, tokens, T);
run_model_forward(&m);
// 2. Compute loss
float loss;
softmax_cross_entropy_forward(logits, targets, T, V, &loss);
// 3. Backward pass (reverse order)
softmax_cross_entropy_backward(d_logits, logits, targets, T, V);
run_model_backward(&m, d_logits);
// 4. SGD update
for (int i = 0; i < num_params; i++) {
weights[i] -= learning_rate * gradients[i];
}
printf("Step %d: loss=%.4f\n", step, loss);
}
4.3 Training Parity Test
$ make tiny-parity
# Runs identical training in C and PyTorch, compares:
#  - Forward outputs (logits)
#  - Loss values
#  - All gradients
#  - Updated weights after SGD
Step 0: C=10.234567  PyTorch=10.234568  diff=1e-06 ✓
Step 1: C=9.876543   PyTorch=9.876544   diff=1e-06 ✓
...
Max weight diff: 1.2e-05
PASS: Training parity verified!
Testing Workflow
5.1 Test Hierarchy
| Level | Command | What It Tests |
|---|---|---|
| Kernel Unit | make test | Each kernel vs PyTorch (GELU, RMSNorm, Attention...) |
| Layer Parity | make layer-parity | Full decoder layer forward vs PyTorch |
| E2E Parity | make tiny-parity | Full model training vs PyTorch |
| Comprehensive | make test-quick | Multiple configs: tiny, GQA, no-RoPE, single-layer |
| Stress | make test-stress | Convergence tests, 500-step overfit |
5.2 Running Individual Kernel Tests
# Build kernel libs
make test-libs

# Run specific tests
python3 unittest/test_attention.py
python3 unittest/test_rmsnorm.py
python3 unittest/test_cross_entropy.py

# Run with custom parameters
python3 unittest/test_attention.py --tokens 1024 --heads 8 --kv-heads 2
5.3 Test Output Interpretation
$ python3 unittest/test_attention.py
Testing attention forward (T=64, D=64, heads=2, kv_heads=2)...
  PyTorch output shape: (64, 64)
  C output shape: (64, 64)
  Max diff: 2.3e-06
  Mean diff: 1.1e-07
  OK (tolerance: 1e-04)
Testing attention backward...
  dQ max diff: 3.1e-06
  dK max diff: 2.8e-06
  dV max diff: 2.5e-06
  OK
Profiling Workflow
6.1 Memory Profiling (Valgrind)
$ make profile-memory
==537520== HEAP SUMMARY:
==537520==   total heap usage: 11 allocs, 10 frees, 2,144,864 bytes allocated
==537520== LEAK SUMMARY:
==537520==   definitely lost: 0 bytes in 0 blocks   ← Clean!
==537520==   suppressed: 8 bytes in 1 blocks        ← OpenMP internal

What this tells us:
- 11 allocations total (minimal!)
- 10 frees (the 1 remaining is OpenMP's internal state)
- 2MB allocated = our bump buffer
- No memory leaks
6.2 CPU Profiling (perf)
$ echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid
$ make profile-cpu
# Overhead Symbol
# ........ ..............................
95.15% gemm_blocked_serial ← GEMM dominates (expected!)
3.12% [kernel]
1.21% ck_mlp_swiglu_forward
0.52% attention_forward_causal...
What this tells us:
- GEMM (matrix multiply) is 95% of compute
- This is correct for short sequences
- For long sequences, attention would dominate
6.3 Flamegraph
$ make flamegraph
$ firefox build/flamegraph.svg
The flamegraph shows call stacks as stacked bars. Width = time spent. Look for:
- Wide bars at top - Hot leaf functions (optimize these)
- Tall stacks - Deep call chains (possible overhead)
- Flat wide sections - Single functions dominating (expected for GEMM)
6.4 Cache Profiling (Cachegrind)
$ make profile-cache
==12345== D1 miss rate: 2.1%
==12345== LL miss rate: 0.3%

What to look for:
- D1 miss rate < 5% is good
- LL (last-level cache) misses go to RAM - expensive!
- High miss rates in inner loops = need better memory access patterns
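Access order is usually what drives those inner-loop miss rates. The toy pair below illustrates the point for a row-major matrix: both functions compute the same sum, but the first walks memory contiguously while the second strides by `cols` floats per access, touching a new cache line almost every iteration on large matrices:

```c
/* Cache-friendly: the inner loop walks memory contiguously. */
float sum_row_major(const float *m, int rows, int cols) {
    float s = 0.0f;
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            s += m[r * cols + c];
    return s;
}

/* Cache-hostile on large matrices: the inner loop strides by `cols`
 * floats per access, defeating spatial locality. */
float sum_col_major(const float *m, int rows, int cols) {
    float s = 0.0f;
    for (int c = 0; c < cols; c++)
        for (int r = 0; r < rows; r++)
            s += m[r * cols + c];
    return s;
}
```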
Adding New Kernels
7.1 Kernel Implementation Checklist
- Write the kernel in src/kernels/:

  // src/kernels/my_kernel.c
  void my_kernel_forward(float *out, const float *in, int N) { ... }
  void my_kernel_backward(float *dx, const float *dy, int N) { ... }

- Add it to the kernel map in kernel_maps/kernels/:

  // kernel_maps/kernels/my_kernel.json
  {
    "name": "my_kernel",
    "forward": "my_kernel_forward",
    "backward": "my_kernel_backward",
    "sources": ["src/kernels/my_kernel.c"]
  }

- Regenerate specs:

  make gen-specs

- Write a Python test in unittest/:

  # unittest/test_my_kernel.py
  # Compare C kernel output vs PyTorch reference

- Add the test to the Makefile PY_TESTS list
7.2 Kernel Design Guidelines
- Pure functions - No global state, no allocations
- Explicit sizes - All dimensions passed as parameters
- Contiguous memory - Assume row-major layout
- OpenMP parallel - Use #pragma omp parallel for on outer loops
- SIMD-friendly - Inner loops should vectorize
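A toy kernel that ticks all five boxes (the kernel itself, `scale_rows_forward`, is a made-up example, not part of the engine):

```c
#include <stddef.h>

/* Example kernel following the guidelines: pure (no global state, no
 * allocations), explicit sizes, contiguous row-major data, OpenMP on
 * the outer loop, unit-stride (vectorizable) inner loop. */
void scale_rows_forward(float *out, const float *x, const float *scale,
                        int T, int D) {
    #pragma omp parallel for            /* rows are independent */
    for (int t = 0; t < T; t++) {
        const float *row = x + (size_t)t * D;
        float *out_row = out + (size_t)t * D;
        for (int d = 0; d < D; d++)     /* unit-stride: SIMD-friendly */
            out_row[d] = row[d] * scale[t];
    }
}
```

Without `-fopenmp` the pragma is simply ignored, so the kernel stays correct in serial builds.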
Quick Reference
Common Commands
# Build
make                  # Build library
make clean            # Clean build

# Test
make test             # Unit tests
make test-quick       # Comprehensive quick tests
make tiny-parity      # Training parity vs PyTorch

# Profile
make profile-memory   # Memory leaks (Valgrind)
make profile-cpu      # CPU hotspots (perf)
make flamegraph       # Visualization

# Code generation
make ck                             # Print IR
make emit CONFIG=x.json OUT=out.c   # Generate runtime