Developer Guide
This guide explains how C-Kernel-Engine works from config.json to running inference, with deep dives into code generation, memory allocation, and profiling.
Overview: The Big Picture
The Complete Pipeline
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ config.json │────▶│ IR Graph │────▶│ generated.c │
│ (model params) │ │ (compute DAG) │ │ (C runtime) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│
▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Output logits │◀────│ Kernel Exec │◀────│ GCC compile │
│ (inference) │ │ (forward pass) │ │ (native binary)│
└─────────────────┘ └─────────────────┘ └─────────────────┘
What Makes This Different
| Traditional ML Framework | C-Kernel-Engine |
|---|---|
| Runtime graph interpretation | Compile-time code generation |
| Dynamic memory allocation | Single bump allocation at startup |
| Python/C++ dispatch overhead | Pure C, direct kernel calls |
| Generic kernels | Model-specific generated code |
Step 1: Config to Code Generation
1.1 The Config File
A standard HuggingFace-style config defines the model:
{
"hidden_size": 64,
"num_attention_heads": 2,
"num_key_value_heads": 2,
"num_hidden_layers": 2,
"intermediate_size": 128,
"vocab_size": 256,
"max_position_embeddings": 64,
"rms_norm_eps": 1e-5,
"rope_theta": 10000.0
}
1.2 IR Generation
The codegen tool (build/ck_ir_demo) parses the config and builds an IR graph:
$ make ck
=== Forward IR ===
CKIRGraph: layers=2, hidden_size=64, intermediate_size=128
L0 N0 EMBED_TOKENS  outputs=[embedded]  inputs=[tokens]
L0 N1 RMSNORM       outputs=[norm1]     inputs=[embedded]
L0 N2 LINEAR_QKV    outputs=[qkv]       inputs=[norm1]
L0 N3 ROPE          outputs=[qkv_rope]  inputs=[qkv]
L0 N4 ATTENTION     outputs=[attn_out]  inputs=[qkv_rope]
L0 N5 LINEAR_O      outputs=[proj]      inputs=[attn_out]
L0 N6 RESIDUAL_ADD  outputs=[res1]      inputs=[embedded,proj]
L0 N7 RMSNORM       outputs=[norm2]     inputs=[res1]
L0 N8 MLP_SWIGLU    outputs=[mlp]       inputs=[norm2]
L0 N9 RESIDUAL_ADD  outputs=[L0_out]    inputs=[res1,mlp]
...
Each node maps to a tested C kernel function.
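To make the dump above concrete, here is a minimal sketch of what one IR node could look like as a C struct. The type and field names (`CKIRNode`, `CKOpKind`, etc.) are illustrative assumptions, not the engine's actual internal API:

```c
#include <string.h>

/* Hypothetical IR node layout, reconstructed from the `make ck` dump.
 * All names here are illustrative, not the engine's real definitions. */
typedef enum {
    CK_EMBED_TOKENS, CK_RMSNORM, CK_LINEAR_QKV, CK_ROPE,
    CK_ATTENTION, CK_LINEAR_O, CK_RESIDUAL_ADD, CK_MLP_SWIGLU
} CKOpKind;

typedef struct {
    CKOpKind kind;          /* which kernel the node lowers to */
    int layer;              /* decoder layer the node belongs to */
    const char *inputs[2];  /* named input buffers (at most 2 in the dump) */
    const char *output;     /* named output buffer */
} CKIRNode;

/* N6 from the dump: res1 = embedded + proj */
static const CKIRNode residual = {
    .kind = CK_RESIDUAL_ADD, .layer = 0,
    .inputs = {"embedded", "proj"}, .output = "res1",
};
```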
1.3 Code Emission
The IR is lowered to C code with all sizes baked in:
$ make emit CONFIG=tiny.config.json OUT=build/tiny_generated.c
// Generated code snippet (simplified):
void run_decoder_forward(TransformerModel *m) {
float *input = ptr_f32(m->memory_base, m->embedded_input_offset);
float *norm1 = ptr_f32(m->memory_base, m->layer[0].norm1_offset);
// Layer 0: Attention block
rmsnorm_forward(norm1, input, m->layer[0].attn_norm_weight,
64, 64, 1e-5f); // All sizes known at codegen!
attention_forward_causal_head_major_gqa(
attn_out, norm1, qkv, ...
64, 64, 2, 2, 32, 64); // T, D, heads, kv_heads, head_dim, ctx
...
}
1.4 The Kernel Manifest
Along with the C file, a .kernels manifest is generated:
$ cat build/tiny_generated.c.kernels
src/kernels/rmsnorm_kernels.c
src/kernels/attention_kernels.c
src/kernels/softmax_kernels.c
src/kernels/gemm_kernels.c
src/kernels/rope_kernels.c
src/kernels/swiglu_kernels.c
src/kernels/embedding_kernels.c
src/kernels/loss_kernels.c
This tells GCC which kernel source files to compile:
gcc -O3 build/tiny_generated.c $(cat build/tiny_generated.c.kernels) -o build/tiny_model
Step 2: Memory Layout & Bump Allocator
2.1 The Bump Allocator Philosophy
One Allocation, Zero Fragmentation
Instead of thousands of malloc/free calls, we:
- Compute total size at codegen time - All buffer sizes known from config
- Allocate once at startup - A single mmap or aligned_alloc
- Use offsets, not pointers - Each buffer is base + offset
- Free once at exit - No leaks possible
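The core of a bump allocator fits in a few lines. This is a minimal sketch of the idea described above; the `CKBump`/`ck_bump_reserve` names are illustrative, not the engine's API:

```c
#include <stddef.h>

/* Minimal bump-allocation sketch: reserve offsets at codegen time,
 * allocate the whole buffer once at startup. Names are illustrative. */
typedef struct {
    char  *base;   /* the single allocation */
    size_t used;   /* bump pointer; only ever grows */
    size_t cap;
} CKBump;

/* Reserve `bytes` and return its offset. Buffers are later
 * addressed as base + offset, so no per-buffer malloc is needed. */
static size_t ck_bump_reserve(CKBump *b, size_t bytes) {
    size_t off = b->used;
    b->used += (bytes + 63) & ~(size_t)63;  /* keep 64-byte alignment */
    return off;
}
```

At codegen time every buffer gets an offset from `ck_bump_reserve`; at startup one allocation of `b->used` bytes backs them all.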
2.2 Memory Layout Visualization
┌──────────────────────────────────────────────────────────────────┐
│ 2MB Bump Buffer │
├──────────────┬──────────────┬──────────────┬────────────────────┤
│ Embeddings │ Weights │ Activations │ Gradients │
│ (static) │ (static) │ (reused) │ (if training) │
├──────────────┼──────────────┼──────────────┼────────────────────┤
│ offset: 0 │ offset: 64K │ offset: 512K │ offset: 1M │
└──────────────┴──────────────┴──────────────┴────────────────────┘
// Accessing a buffer:
float *attn_output = (float*)(model.memory_base + model.attn_output_offset);
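The `ptr_f32` helper that appears in the generated code can be as simple as the following. This is a plausible one-line definition, assumed here for illustration; the real helper may differ:

```c
#include <stddef.h>

/* Resolve an offset into the bump buffer as a float pointer.
 * A plausible sketch of the ptr_f32 helper used by the generated code. */
static inline float *ptr_f32(void *base, size_t offset) {
    return (float *)((char *)base + offset);
}
```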
2.3 Huge Page Optimization
The allocator tries to use 2MB huge pages for better TLB performance:
void *ck_huge_alloc(size_t bytes) {
    const size_t HUGE_2MB = 2UL * 1024 * 1024;
    size_t len = align_up(bytes, HUGE_2MB);
    // 1. Try explicit huge pages (best)
    void *p = mmap(NULL, len, PROT_READ|PROT_WRITE,
                   MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED) return p;
    // 2. Fallback: aligned_alloc + THP hint
    void *q = aligned_alloc(HUGE_2MB, len);
    if (q) madvise(q, len, MADV_HUGEPAGE); // Ask kernel for transparent huge pages
    return q;
}
Why huge pages? A 540MB model needs 138,240 TLB entries with 4KB pages, but only 270 with 2MB pages.
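The page-count arithmetic is just a round-up division. A tiny helper makes the 138,240-vs-270 numbers easy to verify (`tlb_entries` is an illustrative name, not part of the engine):

```c
#include <stddef.h>

/* Number of page-table entries needed to map `bytes` at a given
 * page size: the arithmetic behind the huge-page claim above. */
static size_t tlb_entries(size_t bytes, size_t page) {
    return (bytes + page - 1) / page;  /* round up to whole pages */
}
```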
2.4 Viewing the Layout
$ make tiny-train TINY_TRAIN_ARGS="--dump"
=== Memory Layout ===
Total bytes: 2097152 (2.00 MB)
embedded_input_offset:       0  (16384 bytes)
embed_tokens_weight:     16384  (65536 bytes)
layer[0].attn_norm:      81920  (256 bytes)
layer[0].qkv_weight:     82176  (24576 bytes)
...
logits_offset:         1048576  (65536 bytes)
gradients_offset:      1114112  (524288 bytes)
Step 3: Kernel Execution
3.1 Kernel Registry
Each kernel is registered with its forward and backward functions:
// src/ckernel_kernel_specs.c
const CKKernelSpec ck_kernel_specs[] = {
{
.name = "rmsnorm",
.forward_fn = "rmsnorm_forward",
.backward_fn = "rmsnorm_backward",
.source_files = {"src/kernels/rmsnorm_kernels.c", NULL}
},
{
.name = "attention",
.forward_fn = "attention_forward_causal_head_major_gqa",
.backward_fn = "attention_backward_causal_head_major_gqa",
.source_files = {"src/kernels/attention_kernels.c",
"src/kernels/softmax_kernels.c", NULL}
},
...
};
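Resolving a spec at codegen time is then a linear scan over the array. The sketch below assumes a plausible `CKKernelSpec` layout matching the snippet above; the lookup function `ck_find_kernel_spec` is illustrative, not necessarily the engine's API:

```c
#include <string.h>
#include <stddef.h>

/* Assumed layout matching the registry snippet above. */
typedef struct {
    const char *name;
    const char *forward_fn;
    const char *backward_fn;
    const char *source_files[4];
} CKKernelSpec;

static const CKKernelSpec ck_kernel_specs[] = {
    { "rmsnorm", "rmsnorm_forward", "rmsnorm_backward",
      { "src/kernels/rmsnorm_kernels.c", NULL } },
};

/* Illustrative lookup: find a spec by name, or NULL if unregistered. */
static const CKKernelSpec *ck_find_kernel_spec(const char *name) {
    for (size_t i = 0; i < sizeof ck_kernel_specs / sizeof ck_kernel_specs[0]; i++)
        if (strcmp(ck_kernel_specs[i].name, name) == 0)
            return &ck_kernel_specs[i];
    return NULL;
}
```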
3.2 Forward Pass Flow
void run_model_forward(TransformerModel *m) {
// 1. Token embedding lookup
embed_tokens(m, tokens, seq_len);
// 2. Process each layer
for (int layer = 0; layer < m->num_layers; layer++) {
// Attention block
rmsnorm_forward(norm1, input, weights, D, T, eps);
linear_forward(qkv, norm1, qkv_weight, T, D, 3*D);
rope_forward(qkv, qkv, T, head_dim, theta);
attention_forward_causal(attn_out, qkv, T, D, heads, kv_heads);
linear_forward(proj, attn_out, o_weight, T, D, D);
residual_add(res1, input, proj, T * D);
// MLP block
rmsnorm_forward(norm2, res1, weights, D, T, eps);
mlp_swiglu_forward(mlp_out, norm2, gate, up, down, T, D, I);
residual_add(output, res1, mlp_out, T * D);
input = output; // Next layer's input
}
// 3. Final norm + LM head
rmsnorm_forward(final_norm, input, weights, D, T, eps);
lm_head_forward(logits, final_norm, lm_weight, T, D, V);
}
3.3 Example Kernel: RMSNorm
// src/kernels/rmsnorm_kernels.c
void rmsnorm_forward(float *out, const float *x, const float *weight,
int D, int T, float eps) {
for (int t = 0; t < T; t++) {
const float *row = x + t * D;
float *out_row = out + t * D;
// Compute RMS
float sum_sq = 0.0f;
for (int d = 0; d < D; d++) {
sum_sq += row[d] * row[d];
}
float rms = sqrtf(sum_sq / D + eps);
float scale = 1.0f / rms;
// Normalize and scale
for (int d = 0; d < D; d++) {
out_row[d] = row[d] * scale * weight[d];
}
}
}
Step 4: Backward Pass & Training
4.1 Gradient Computation
Every forward kernel has a corresponding backward kernel:
// Forward: y = rmsnorm(x, weight)
rmsnorm_forward(y, x, weight, D, T, eps);
// Backward: given dy, compute dx and dweight
rmsnorm_backward(dx, dweight, dy, x, weight, D, T, eps);
4.2 Training Loop
for (int step = 0; step < num_steps; step++) {
// 1. Forward pass
embed_tokens(&m, tokens, T);
run_model_forward(&m);
// 2. Compute loss
float loss;
softmax_cross_entropy_forward(logits, targets, T, V, &loss);
// 3. Backward pass (reverse order)
softmax_cross_entropy_backward(d_logits, logits, targets, T, V);
run_model_backward(&m, d_logits);
// 4. SGD update
for (int i = 0; i < num_params; i++) {
weights[i] -= learning_rate * gradients[i];
}
printf("Step %d: loss=%.4f\n", step, loss);
}
4.3 Training Parity Test
$ make tiny-parity
# Runs identical training in C and PyTorch, compares:
#  - Forward outputs (logits)
#  - Loss values
#  - All gradients
#  - Updated weights after SGD
Step 0: C=10.234567  PyTorch=10.234568  diff=1e-06 ✓
Step 1: C=9.876543   PyTorch=9.876544   diff=1e-06 ✓
...
Max weight diff: 1.2e-05
PASS: Training parity verified!
Testing Workflow
5.1 Test Hierarchy
| Level | Command | What It Tests |
|---|---|---|
| Kernel Unit | make test | Each kernel vs PyTorch (GELU, RMSNorm, Attention...) |
| Layer Parity | make layer-parity | Full decoder layer forward vs PyTorch |
| E2E Parity | make tiny-parity | Full model training vs PyTorch |
| Comprehensive | make test-quick | Multiple configs: tiny, GQA, no-RoPE, single-layer |
| Stress | make test-stress | Convergence tests, 500-step overfit |
5.2 Running Individual Kernel Tests
# Build kernel libs
make test-libs

# Run specific tests
python3 unittest/test_attention.py
python3 unittest/test_rmsnorm.py
python3 unittest/test_cross_entropy.py

# Run with custom parameters
python3 unittest/test_attention.py --tokens 1024 --heads 8 --kv-heads 2
5.3 Test Output Interpretation
$ python3 unittest/test_attention.py
Testing attention forward (T=64, D=64, heads=2, kv_heads=2)...
  PyTorch output shape: (64, 64)
  C output shape: (64, 64)
  Max diff: 2.3e-06
  Mean diff: 1.1e-07
  OK (tolerance: 1e-04)
Testing attention backward...
  dQ max diff: 3.1e-06
  dK max diff: 2.8e-06
  dV max diff: 2.5e-06
  OK
Profiling Workflow
6.1 Memory Profiling (Valgrind)
$ make profile-memory
==537520== HEAP SUMMARY:
==537520==   total heap usage: 11 allocs, 10 frees, 2,144,864 bytes allocated
==537520== LEAK SUMMARY:
==537520==   definitely lost: 0 bytes in 0 blocks   ← Clean!
==537520==   suppressed: 8 bytes in 1 blocks        ← OpenMP internal

What this tells us:
- 11 allocations total (minimal!)
- 10 frees (the 1 remaining is OpenMP's internal state)
- 2MB allocated = our bump buffer
- No memory leaks
6.2 CPU Profiling (perf)
$ echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid
$ make profile-cpu
# Overhead Symbol
# ........ ..............................
95.15% gemm_blocked_serial ← GEMM dominates (expected!)
3.12% [kernel]
1.21% ck_mlp_swiglu_forward
0.52% attention_forward_causal...
What this tells us:
- GEMM (matrix multiply) is 95% of compute
- This is correct for short sequences
- For long sequences, attention would dominate
6.3 Flamegraph
$ make flamegraph
$ firefox build/flamegraph.svg
The flamegraph shows call stacks as stacked bars. Width = time spent. Look for:
- Wide bars at top - Hot leaf functions (optimize these)
- Tall stacks - Deep call chains (possible overhead)
- Flat wide sections - Single functions dominating (expected for GEMM)
6.4 Cache Profiling (Cachegrind)
$ make profile-cache
==12345== D1 miss rate: 2.1%
==12345== LL miss rate: 0.3%

What to look for:
- D1 miss rate < 5% is good
- LL (last-level cache) misses go to RAM - expensive!
- High miss rates in inner loops = need better memory access patterns
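Access order is usually what drives those inner-loop miss rates. The toy pair below illustrates the point for a row-major matrix: both functions compute the same sum, but the first walks memory contiguously while the second strides by `cols` floats per access, touching a new cache line almost every iteration on large matrices:

```c
/* Cache-friendly: the inner loop walks memory contiguously. */
float sum_row_major(const float *m, int rows, int cols) {
    float s = 0.0f;
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            s += m[r * cols + c];
    return s;
}

/* Cache-hostile on large matrices: the inner loop strides by `cols`
 * floats per access, defeating spatial locality. */
float sum_col_major(const float *m, int rows, int cols) {
    float s = 0.0f;
    for (int c = 0; c < cols; c++)
        for (int r = 0; r < rows; r++)
            s += m[r * cols + c];
    return s;
}
```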
Adding New Kernels
7.1 Kernel Implementation Checklist
- Write the kernel in src/kernels/:

  // src/kernels/my_kernel.c
  void my_kernel_forward(float *out, const float *in, int N) { ... }
  void my_kernel_backward(float *dx, const float *dy, int N) { ... }

- Add it to the kernel map in kernel_maps/kernels/:

  // kernel_maps/kernels/my_kernel.json
  {
    "name": "my_kernel",
    "forward": "my_kernel_forward",
    "backward": "my_kernel_backward",
    "sources": ["src/kernels/my_kernel.c"]
  }

- Regenerate specs:

  make gen-specs

- Write a Python test in unittest/:

  # unittest/test_my_kernel.py
  # Compare C kernel output vs PyTorch reference

- Add the test to the Makefile PY_TESTS list
7.2 Kernel Design Guidelines
- Pure functions - No global state, no allocations
- Explicit sizes - All dimensions passed as parameters
- Contiguous memory - Assume row-major layout
- OpenMP parallel - Use #pragma omp parallel for on outer loops
- SIMD-friendly - Inner loops should vectorize
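A toy kernel that ticks all five boxes (the kernel itself, `scale_rows_forward`, is a made-up example, not part of the engine):

```c
#include <stddef.h>

/* Example kernel following the guidelines: pure (no global state, no
 * allocations), explicit sizes, contiguous row-major data, OpenMP on
 * the outer loop, unit-stride (vectorizable) inner loop. */
void scale_rows_forward(float *out, const float *x, const float *scale,
                        int T, int D) {
    #pragma omp parallel for            /* rows are independent */
    for (int t = 0; t < T; t++) {
        const float *row = x + (size_t)t * D;
        float *out_row = out + (size_t)t * D;
        for (int d = 0; d < D; d++)     /* unit-stride: SIMD-friendly */
            out_row[d] = row[d] * scale[t];
    }
}
```

Without `-fopenmp` the pragma is simply ignored, so the kernel stays correct in serial builds.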
Quick Reference
Common Commands
# Build
make                  # Build library
make clean            # Clean build

# Test
make test             # Unit tests
make test-quick       # Comprehensive quick tests
make tiny-parity      # Training parity vs PyTorch

# Profile
make profile-memory   # Memory leaks (Valgrind)
make profile-cpu      # CPU hotspots (perf)
make flamegraph       # Visualization

# Code generation
make ck                             # Print IR
make emit CONFIG=x.json OUT=out.c   # Generate runtime