System Architecture

C-Kernel-Engine uses a three-stage pipeline to transform model configurations into optimized C runtimes.

New: See the IR Pipeline v6.6 page for end-to-end visuals of templates, IR1/IR2, lowering, memory layout, and dumb codegen.

Performance Methodology: See Kernel Tuning Methodology for the VTune, Advisor, roofline, CK profile, parity, and E2E workflow used to push generated kernels toward CPU-node limits.

v8 Compiler Hardening: See v8 Numerical Contracts for circuits, kernel-map capabilities, explicit attention reductions, deterministic resolution, generated-artifact equivalence, and parity promotion gates.

Active v8 Compiler Architecture

The active v8 inference compiler consumes three authoritative inputs: model weights, an architecture circuit, and executable kernel maps. The circuit owns graph structure and required semantics. Kernel maps own ABI and implementation capability. The DSL resolves and stitches; generated C does not guess.

This boundary is executable policy, not only documentation. The v8 and v7 DSL audits reject new model-family dispatch in generic compiler/codegen functions, including dispatch hidden behind local aliases. Exact kernel operations do not fall back to broader families, malformed per-layer dimensions fail, and circuit weight policies are schema validated. New-model support must therefore extend a circuit, a kernel map, or a tested kernel instead of accumulating another compiler conditional.

Figure: v8 weights, circuits, kernel maps, numerical contract resolution, IR lowering, and generated C.

Compiler Input

weights + circuits + kernel maps
             ↓
deterministic DSL lowering
             ↓
generated C

Model definitions live in version/v8/circuits/. The historical embedded BUMP key is still named template so that previously serialized files remain compatible.

Semantic Selection

Precision-sensitive operations declare required_contracts. Real kernel maps advertise provides and supported_reductions. Missing and ambiguous providers are compile-time errors.
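The selection rule above can be sketched as a lookup that accepts exactly one provider. This is a minimal illustration, not the engine's resolver: the KernelMapEntry struct, the ResolveStatus enum, and resolve_kernel are hypothetical names, and each entry is simplified to a single provides string and a single reduction string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified kernel-map entry: one contract and one reduction
 * per kernel. Real kernel maps carry more fields. */
typedef struct {
    const char *kernel;     /* kernel symbol name */
    const char *provides;   /* contract the kernel advertises */
    const char *reduction;  /* reduction order it supports */
} KernelMapEntry;

typedef enum { RESOLVE_OK, RESOLVE_MISSING, RESOLVE_AMBIGUOUS } ResolveStatus;

/* Select the single entry that satisfies both the required contract and the
 * required reduction. Zero matches or more than one match is an error; there
 * is no fallback to a "close enough" kernel. */
ResolveStatus resolve_kernel(const KernelMapEntry *maps, size_t n,
                             const char *required_contract,
                             const char *required_reduction,
                             const KernelMapEntry **out)
{
    const KernelMapEntry *found = NULL;
    size_t matches = 0;
    assert(out != NULL);
    for (size_t i = 0; i < n; ++i) {
        if (strcmp(maps[i].provides, required_contract) == 0 &&
            strcmp(maps[i].reduction, required_reduction) == 0) {
            found = &maps[i];
            ++matches;
        }
    }
    *out = (matches == 1) ? found : NULL;
    if (matches == 0) return RESOLVE_MISSING;
    if (matches > 1)  return RESOLVE_AMBIGUOUS;
    return RESOLVE_OK;
}
```

A caller would treat anything other than RESOLVE_OK as a compile-time failure of the model build, rather than retrying with a broader kernel family.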

Authoritative Resolution

The resolver selects the compatible kernel and contract before GraphIR construction. GraphIR records that decision; LoweredIR, call-ready IR, and generated C must preserve it without fallback or reselection.
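One way to picture "preserve it without fallback or reselection" is a check that every later stage still names the kernel and contract GraphIR recorded. The sketch below assumes a hypothetical ResolvedChoice record copied through each stage; the real IR structures are not shown in this page.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical record: GraphIR stores the resolver's choice and each later
 * stage carries its own copy forward. */
typedef struct {
    const char *kernel;
    const char *contract;
} ResolvedChoice;

/* Returns 1 when a later stage still names exactly the kernel and contract
 * that GraphIR recorded, 0 if anything was reselected. */
int choice_preserved(const ResolvedChoice *graph_ir, const ResolvedChoice *later)
{
    assert(graph_ir != NULL && later != NULL);
    return strcmp(graph_ir->kernel, later->kernel) == 0 &&
           strcmp(graph_ir->contract, later->contract) == 0;
}

/* Check a whole pipeline: stages[0] is GraphIR; the rest stand for LoweredIR,
 * call-ready IR, and the choice baked into generated C. */
int pipeline_preserves_choice(const ResolvedChoice *stages, int n)
{
    for (int i = 1; i < n; ++i) {
        if (!choice_preserved(&stages[0], &stages[i])) return 0;
    }
    return 1;
}
```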

Promotion Gate

Leaf tests are necessary but insufficient. Public routing, threaded thresholds, stitched layer boundaries, full replay, persistent decode, and E2E behavior must agree before production validation.

Why reduction is architecture

Attention is not fully specified by softmax(QKᵀ)V. Q/K rounding, score accumulation, online-softmax state, V accumulation, split thresholds, partial storage, and merge order can change logits. v8 records these as complete contracts instead of inferring them from strict-mode flags, cache dtypes, sequence lengths, or thread counts.
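The reason merge order belongs in the contract is that floating-point addition is not associative. The sketch below is a generic illustration, not engine code: it sums the same three float values in two orders and gets two different answers. It assumes IEEE-754 single precision with round-to-nearest and no fast-math flags; the function names are made up for the example.

```c
#include <assert.h>

/* Sum three floats left to right: (a + b) + c. The volatile temporaries force
 * each partial result to be rounded to float before the next add. */
float sum_left_to_right(float a, float b, float c)
{
    volatile float partial = a + b;
    volatile float total = partial + c;
    return total;
}

/* Same inputs, different merge order: a + (b + c). */
float sum_right_to_left(float a, float b, float c)
{
    volatile float partial = b + c;
    volatile float total = a + partial;
    return total;
}

/* Usage: the two orders disagree for these inputs. */
void demo_order_matters(void)
{
    assert(sum_left_to_right(1e8f, -1e8f, 1.0f) !=
           sum_right_to_left(1e8f, -1e8f, 1.0f));
}
```

With a = 1e8f, b = -1e8f, c = 1.0f, the left-to-right sum is 1.0f, while the right-to-left sum is 0.0f because -1e8f + 1.0f rounds back to -1e8f. Split-KV attention merges partial sums in the same way, so a different split threshold or merge order can move the logits.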

Why threading is a kernel capability

Threading metadata applies across attention, GEMM, GEMV, and other parallel operators. Kernel maps declare the runtime, work partition, dispatch mechanism, and whether parallel order changes a reduction. The performance planner may choose among semantically compatible implementations, but it cannot silently change accumulation order to gain speed.
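The planner constraint can be sketched as a filter followed by a cost comparison. The ThreadedImpl struct, its changes_reduction_order flag, and plan_impl are hypothetical names used only for illustration; real kernel maps also declare the runtime, work partition, and dispatch mechanism.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical threading metadata for one implementation of an operator. */
typedef struct {
    const char *kernel;
    int changes_reduction_order; /* 1 if parallel execution reorders a reduction */
    double est_cost;             /* planner's cost estimate, lower is better */
} ThreadedImpl;

/* Pick the cheapest implementation that is semantically compatible.
 * When the contract pins the reduction order, reordering kernels are
 * skipped no matter how fast they are. Returns NULL if nothing qualifies. */
const ThreadedImpl *plan_impl(const ThreadedImpl *impls, size_t n,
                              int contract_pins_order)
{
    const ThreadedImpl *best = NULL;
    assert(impls != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        if (contract_pins_order && impls[i].changes_reduction_order) continue;
        if (best == NULL || impls[i].est_cost < best->est_cost) best = &impls[i];
    }
    return best;
}
```

The point of the design is that speed is only compared among candidates that already satisfy the contract, so a faster split reduction can never be chosen by accident.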

The "Website" Metaphor

The engine treats an LLM the way a website generator treats a page: a header, a repeated block, and a footer. Because only the "Block" section repeats, the codegen can unroll it in C without complex control flow.

Section	Website	LLM
Header	<head>, Nav, CSS	Embeddings, Positional Encoding
Block	Blog Posts, Articles	Transformer Layers (repeated)
Footer	Copyright, Scripts	Final Norm, Language Head
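As a rough sketch of the metaphor, a generator can write the header calls once, unroll one call per transformer layer, and then write the footer calls once. The emitted names (embed_tokens, final_norm, lm_head) and the emit_model_calls helper are illustrative assumptions, not the engine's actual codegen.

```c
#include <assert.h>
#include <stdio.h>

/* Emit the call sequence for a model with num_layers transformer layers.
 * Header and footer are written once; the block is unrolled per layer, so
 * the emitted text has no loop or dispatch over layers.
 * Returns the number of characters written, or -1 if buf is too small. */
int emit_model_calls(char *buf, size_t cap, int num_layers)
{
    size_t used = 0;
    int n;
    assert(buf != NULL && cap > 0);

    n = snprintf(buf, cap, "embed_tokens();\n");              /* header */
    if (n < 0 || (size_t)n >= cap) return -1;
    used += (size_t)n;

    for (int i = 0; i < num_layers; ++i) {                    /* block */
        n = snprintf(buf + used, cap - used, "forward_layer_%d();\n", i);
        if (n < 0 || (size_t)n >= cap - used) return -1;
        used += (size_t)n;
    }

    n = snprintf(buf + used, cap - used, "final_norm();\nlm_head();\n"); /* footer */
    if (n < 0 || (size_t)n >= cap - used) return -1;
    used += (size_t)n;
    return (int)used;
}
```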

Operator Spectrum Map

This view connects the data path (pretrain to RLHF/GRPO) with the compute path (math to inference/backprop) so operators can reason about the full stack in one place.

Data Runbook Links

v7 SVG Dataset Runbook (Stage A and Stage B corpus generation)
v7 Train + Inference Runbook (parity gates and train execution)

Computation Runbook Links

IR Pipeline (IR1/IR2/lowering and memory mapping)
Kernel Tuning Methodology (VTune, Advisor, roofline, CK profile, and parity gates)
v7 Profiling (perf, flamegraph, cache, VTune/Advisor)
v7 Backprop IR (gradient path and training runtime checks)

Stage 1: Model Configuration

The engine accepts HuggingFace-style config.json files as input:

{
  "hidden_size": 768,
  "num_attention_heads": 12,
  "num_key_value_heads": 4,
  "num_hidden_layers": 6,
  "intermediate_size": 2048,
  "rms_norm_eps": 1e-5,
  "rope_theta": 10000.0
}

These fields define the dimensions needed to generate the layer structures; derived values such as the per-head dimension are computed from them.

Stage 2: Intermediate Representation

The IR Builder creates a structured representation of each layer:

CKLayerIR Structure

typedef struct {
    int layer_index;
    int embed_dim;
    int num_heads;
    int num_kv_heads;
    int head_dim;
    int intermediate_dim;
    int context_window;
    float eps;
    float rope_theta;
} CKLayerIR;

Key Decisions

Aligned dimensions: Head dim padded to cache-friendly sizes
GQA ratio: Computed from num_heads / num_kv_heads
Buffer sizing: Calculated for all intermediate activations
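The first two decisions can be sketched as simple integer arithmetic on the config fields. ModelConfig, DerivedDims, round_up, and derive_dims are hypothetical names for this example, and the alignment value is a parameter because the page does not state which padding size the engine uses.

```c
#include <assert.h>
#include <stddef.h>

/* Subset of config.json fields, already parsed (parsing is out of scope). */
typedef struct {
    int hidden_size;
    int num_attention_heads;
    int num_key_value_heads;
} ModelConfig;

/* Derived per-layer values; field names here are illustrative. */
typedef struct {
    int head_dim;          /* hidden_size / num_attention_heads */
    int aligned_head_dim;  /* head_dim rounded up to a multiple of align */
    int gqa_ratio;         /* query heads sharing one KV head */
} DerivedDims;

/* Round n up to the next multiple of align (align > 0). */
int round_up(int n, int align)
{
    return (n + align - 1) / align * align;
}

/* Returns 0 on success, -1 if the dimensions do not divide evenly. */
int derive_dims(const ModelConfig *cfg, int align, DerivedDims *out)
{
    assert(cfg != NULL && out != NULL && align > 0);
    if (cfg->num_attention_heads <= 0 || cfg->num_key_value_heads <= 0) return -1;
    if (cfg->hidden_size % cfg->num_attention_heads != 0) return -1;
    if (cfg->num_attention_heads % cfg->num_key_value_heads != 0) return -1;
    out->head_dim = cfg->hidden_size / cfg->num_attention_heads;
    out->aligned_head_dim = round_up(out->head_dim, align);
    out->gqa_ratio = cfg->num_attention_heads / cfg->num_key_value_heads;
    return 0;
}
```

For the sample config (hidden_size 768, 12 attention heads, 4 KV heads) this gives head_dim = 64 and a GQA ratio of 3; with an assumed alignment of 16 floats the head dimension needs no padding.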

Stage 3: Code Generation

The codegen emits complete C functions for forward and backward passes:

Generated Forward Pass

void forward_layer_0(
    const float *input,
    const ModelWeights *weights,
    LayerActivations *acts,
    const float *cos_cache,
    const float *sin_cache,
    int num_tokens
) {
    // 1. Pre-attention RMSNorm
    rmsnorm_forward(input, weights->ln1_gamma, acts->ln1_out, ...);

    // 2. QKV projection
    ck_qkv_project_head_major(acts->ln1_out, weights->wq, ...);

    // 3. Apply RoPE
    rope_forward_qk(acts->q, acts->k, cos_cache, sin_cache, ...);

    // 4. Attention
    attention_forward_causal_head_major_gqa(acts->q, acts->k, acts->v, ...);

    // 5. Output projection + residual
    // 6. Post-attention RMSNorm
    // 7. MLP (SwiGLU)
    // 8. Final residual
}

Generated Backward Pass

void backward_layer_0(
    const float *d_output,
    const ModelWeights *weights,
    const LayerActivations *acts,
    WeightGradients *grads,
    float *d_input
) {
    // Reverse order of forward pass
    // Each kernel uses saved activations from forward

    // 1. Backward through final residual
    // 2. Backward through MLP (SwiGLU)
    // 3. Backward through RMSNorm 2
    // 4. Backward through attention output projection
    // 5. Backward through attention
    attention_backward_causal_head_major_gqa(d_attn_out, acts->q, ...);

    // 6. Backward through RoPE (inverse rotation)
    rope_backward_qk(d_q, d_k, ...);

    // 7. Backward through QKV projection
    // 8. Backward through RMSNorm 1
}
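Step 6 relies on the fact that a rotation matrix is orthogonal, so its backward pass is a rotation by the negative angle. The sketch below shows this for a single pair of values; the function names and the per-pair signature are assumptions for illustration, and how the engine pairs dimensions inside rope_backward_qk is not specified here.

```c
#include <assert.h>
#include <stddef.h>

/* Forward rotation of one (x0, x1) pair by the angle whose cosine and sine
 * are c and s (values that would come from cos_cache / sin_cache). */
void rope_rotate_pair(float c, float s, float x0, float x1,
                      float *y0, float *y1)
{
    assert(y0 != NULL && y1 != NULL);
    *y0 = x0 * c - x1 * s;
    *y1 = x0 * s + x1 * c;
}

/* Backward pass for the same pair: the gradient is multiplied by the
 * transpose of the rotation matrix, i.e. rotated by the negative angle. */
void rope_rotate_pair_backward(float c, float s, float dy0, float dy1,
                               float *dx0, float *dx1)
{
    assert(dx0 != NULL && dx1 != NULL);
    *dx0 =  dy0 * c + dy1 * s;
    *dx1 = -dy0 * s + dy1 * c;
}
```

Applying the backward function to the output of the forward function with the same c and s recovers the original pair (exactly for a 90-degree rotation, up to rounding otherwise), which makes a convenient parity check.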

Memory Layout

Head-Major Layout
Q/K/V use [num_heads, num_tokens, head_dim] layout for cache-efficient attention computation.

Buffer	Layout	Size
`input`	[B, T, D]	batch * tokens * embed_dim
`Q`	[H, T, d_k]	num_heads * tokens * head_dim
`K, V`	[H_kv, T, d_k]	num_kv_heads * tokens * head_dim
`scores`	[H, T, context_window]	num_heads * tokens * context_window
`mlp_hidden`	[T, 2*I]	tokens * 2 * intermediate_dim
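The head-major layout in the table implies the following flat-offset arithmetic. Both helpers are hypothetical, and the GQA mapping assumes that consecutive query heads share a KV head, which is a common convention but is not confirmed by this page.

```c
#include <assert.h>
#include <stddef.h>

/* Flat offset of element (head, token, dim) in a [H, T, d_k] buffer.
 * All values for one head are contiguous, and within a head all values
 * for one token are contiguous. */
size_t head_major_index(size_t head, size_t token, size_t dim,
                        size_t num_tokens, size_t head_dim)
{
    assert(token < num_tokens && dim < head_dim);
    return (head * num_tokens + token) * head_dim + dim;
}

/* With GQA, each group of num_heads / num_kv_heads consecutive query heads
 * reads the same KV head (assumed grouping). */
size_t kv_head_for_query_head(size_t q_head, size_t num_heads, size_t num_kv_heads)
{
    assert(num_kv_heads > 0 && num_heads % num_kv_heads == 0);
    return q_head / (num_heads / num_kv_heads);
}
```

With 12 query heads and 4 KV heads, query heads 0 to 2 map to KV head 0, heads 3 to 5 to KV head 1, and so on.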

Kernel Composition

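Kernels are composed in transformer-layer order: pre-attention RMSNorm, QKV projection, RoPE, causal GQA attention, output projection with residual, post-attention RMSNorm, SwiGLU MLP, and the final residual.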

Build System

Full Library

make

Builds libckernel_engine.so with all kernels linked together.

Per-Kernel Libraries

make libckernel_attention.so
make libckernel_rope.so
make libckernel_rmsnorm.so

Builds individual kernel libraries for testing.

Codegen Pipeline

Generate Runtime from Config

# Build the IR demo tool
make build/ck_ir_demo

# Generate C runtime
./build/ck_ir_demo config.json --emit build/model.c

# Or use the make target
make ck-emit CONFIG=config.json OUT=build/model.c

The generated file contains:

Buffer allocation functions
Forward pass for all layers
Backward pass for all layers
Parameter gradient accumulation

Project Structure

The codebase is organized for easy navigation:

Focused source tree: src/kernels, version/v6.6, version/v7 (updated 2026-07-18 18:09)

src/kernels
`-- fused
version/v6.6
|-- docs
|-- include
|-- kernel_maps
|-- patches
|-- scripts
|   `-- parity
|-- src
|   |-- generated
|   |-- kernel_config
|   |-- scripts
|   `-- test_generated
|-- templates
|-- test
|-- testing
|-- tests
|-- tools
`-- unittest
version/v7
|-- artifacts
|   `-- svg_dsl
|       |-- gen1_archive_2026-04-05
|       `-- spec_archive_2026-04-08
|-- contracts
|-- data
|   |-- eval_contracts
|   |-- spec03
|   |   |-- contracts
|   |   |-- holdout
|   |   |-- manifests
|   |   |-- midtrain
|   |   |-- normalized
|   |   |-- pretrain
|   |   |-- raw_assets
|   |   |-- sft
|   |   `-- tokenizer
|   `-- spec04
|       |-- contracts
|       |-- holdout
|       |-- manifests
|       |-- normalized
|       |-- raw_assets
|       `-- tokenizer
|-- docs
|-- examples
|-- experiments
|   `-- svg_dsl
|-- include
|-- kernel_maps
|-- regression
|-- reports
|   |-- spec12_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec12_gold_mappings
|   |-- spec13b_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec13b_gold_mappings
|   |-- spec14a_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec14a_gold_mappings
|   |-- spec14b_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec14b_gold_mappings
|   |-- spec15a_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec15a_gold_mappings
|   |-- spec15b_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec15b_gold_mappings
|   |-- spec_broader_1_family_packs -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec_broader_1_family_packs
|   `-- spec_broader_1_gold_mappings -> ../artifacts/svg_dsl/spec_archive_2026-04-08/spec_broader_1_gold_mappings
|-- scripts
|   |-- dataset
|   `-- parity
|-- src
|-- templates
|-- test
|-- tests
|   |-- contracts
|   `-- fixtures
`-- tools
    `-- src

72 directories