Memory Safety

Design Document
This page describes our memory safety design philosophy and planned features. Sections marked Planned are not yet implemented.

Philosophy: Safety Through Simplicity

Rather than relying on language-level safety (Rust) or runtime garbage collection, C-Kernel-Engine achieves memory safety through architectural simplicity:

Traditional C Problems

  • malloc/free scattered everywhere
  • Use-after-free vulnerabilities
  • Double-free crashes
  • Runtime size calculations (overflow risk)
  • Manual pointer arithmetic
  • Unknown buffer boundaries

Our Approach

  • One mmap at start, one munmap at end
  • No frees during execution → no use-after-free
  • Nothing to double-free
  • All sizes from config at codegen time
  • Pre-computed offsets, no pointer math
  • Every buffer size baked into generated code

Memory Allocation Hierarchy

Why Huge Pages?
A 540MB model needs 138,240 TLB entries with 4KB pages, but only 270 entries with 2MB huge pages. Fewer TLB misses = faster memory access.

The Fallback Hierarchy

void *ck_huge_alloc(size_t bytes)
{
    size_t len = align_up_bytes(bytes, HUGE_PAGE_SIZE);  // 2MB alignment

    // 1. Try explicit huge pages (best performance)
    void *p = mmap(NULL, len,
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                   -1, 0);
    if (p != MAP_FAILED) {
        return p;  // Got 2MB pages directly from kernel
    }

    // 2. Fallback: aligned_alloc + transparent huge page hint
    //    (note: this path must be released with free(), not munmap())
    void *q = aligned_alloc(HUGE_PAGE_SIZE, len);
    if (!q) return NULL;

    // 3. Hint to kernel: please use huge pages if possible
    madvise(q, len, MADV_HUGEPAGE);
    return q;
}

Step 1: MAP_HUGETLB (Explicit Huge Pages)

What it does: Requests 2MB pages directly from the kernel's hugepage pool.

Requirements:

# Reserve 1GB of huge pages (512 × 2MB)
echo 512 | sudo tee /proc/sys/vm/nr_hugepages

# Or make permanent in /etc/sysctl.conf
vm.nr_hugepages = 512

Why it's best: Guaranteed 2MB pages, lowest TLB pressure, physically contiguous memory.

When it fails: No huge pages reserved, or pool exhausted.

Step 2: aligned_alloc + madvise(MADV_HUGEPAGE)

What it does: Allocates 2MB-aligned memory, then hints to kernel to use Transparent Huge Pages (THP).

// Allocate with 2MB alignment
void *q = aligned_alloc(2 * 1024 * 1024, len);

// Tell kernel: "I'd like huge pages here please"
madvise(q, len, MADV_HUGEPAGE);

How THP works: Kernel can promote 4KB pages to 2MB pages in the background (khugepaged daemon).

Trade-off: Not guaranteed, may take time to promote, but works without root.

Step 3: Regular 4KB Pages (Implicit Fallback)

If madvise hint is ignored, you get regular pages. Still works, just more TLB pressure.

Understanding THP and madvise

THP = Transparent Huge Pages
A Linux kernel feature that automatically promotes 4KB pages to 2MB pages. madvise() is how you tell the kernel which memory regions to promote.

THP Modes

# Check current mode
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never

Mode      Behavior                          madvise needed?         Recommendation
always    Kernel promotes ALL allocations   No, automatic           Can cause stalls
madvise   Only promotes if you hint         Yes, you must call it   Best for production
never     THP disabled entirely             N/A, ignored            Use explicit hugepages

How THP Modes Affect Your Code

// Your allocation (illustrative sizes):
void *q = aligned_alloc(2UL << 20, 100UL << 20);  // 2MB alignment, 100MB
madvise(q, 100UL << 20, MADV_HUGEPAGE);

// What happens next depends on THP mode:

mode = "always"

Kernel already trying to promote everything. Your madvise is redundant but harmless.

mode = "madvise"

Kernel marks your region for promotion. khugepaged will promote it in background.

mode = "never"

THP disabled. Your madvise is ignored. Stays as 4KB pages.

Timeline: What Actually Happens

Time 0ms:    aligned_alloc(100MB)
             → Kernel allocates 25,600 × 4KB pages

Time 0ms:    madvise(ptr, 100MB, MADV_HUGEPAGE)
             → Kernel marks region: "promote this when possible"

Time 1ms:    Your code continues running...
             → Still using 4KB pages, TLB pressure is high

Time 100ms:  khugepaged daemon wakes up
             → Scans memory for marked regions

Time 150ms:  khugepaged finds your region
             → Merges 512 × 4KB pages into 1 × 2MB page
             → Repeats until region is fully promoted

Time 200ms+: Your code now using 2MB pages
             → TLB pressure reduced 512×

Recommended: Use "madvise" Mode

# Set system to madvise mode
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

# Make permanent in /etc/rc.local or systemd service

Why madvise mode is best:

  • "always" promotes every allocation system-wide, which can cause allocation stalls in unrelated processes
  • "never" ignores the hint entirely, leaving only explicit MAP_HUGETLB pages
  • "madvise" promotes exactly the regions you mark: huge pages where they matter, no system-wide side effects

Complete Allocation Flow

void *ck_huge_alloc(size_t bytes)
{
    // ATTEMPT 1: Explicit huge pages
    // Bypasses THP entirely - goes directly to kernel's hugepage pool
    void *p = mmap(..., MAP_HUGETLB);
    if (p != MAP_FAILED) {
        return p;  // Got guaranteed 2MB pages, done!
    }

    // ATTEMPT 2: Regular allocation + THP hint
    void *q = aligned_alloc(2MB, len);  // Initially 4KB pages

    madvise(q, len, MADV_HUGEPAGE);     // Hint to kernel
    // ↓
    // If THP="always"  → already being promoted, hint redundant
    // If THP="madvise" → khugepaged will promote in background
    // If THP="never"   → stays 4KB, hint ignored

    return q;
}

Method                  Page Size          TLB Entries (540MB)  Requirements        Status
MAP_HUGETLB             2MB                270                  Hugepages reserved  Implemented
madvise(MADV_HUGEPAGE)  2MB (if promoted)  270 (best case)      THP enabled         Implemented
Regular mmap            4KB                138,240              None                Fallback

Why This Hierarchy Works

  1. Best case (production): Pre-reserve huge pages → MAP_HUGETLB succeeds → maximum performance
  2. Development: No reservation → THP kicks in → good performance without root
  3. Worst case: Everything falls back to 4KB → still correct, just slower

The model always works. Performance scales with what the system provides.

Production Tuning: Squeezing Every Ounce

The Hidden Truth
When you buy NVIDIA GPUs, all this tuning is done for you. VRAM, TLB, memory channels - all optimized by NVIDIA engineers. For CPU inference, you are the systems engineer.

1GB Huge Pages for Large Models

For models >1GB, use 1GB huge pages instead of 2MB:

# Check if 1GB pages are supported
grep pdpe1gb /proc/cpuinfo

# Reserve 1GB huge pages (requires reboot or boot param)
# In /etc/default/grub:
GRUB_CMDLINE_LINUX="hugepagesz=1G hugepages=16 default_hugepagesz=1G"

# Or at runtime (if supported):
echo 16 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

Model Size   2MB Pages        1GB Pages    TLB Improvement
7B (28GB)    14,336 entries   28 entries   512× fewer
70B (280GB)  143,360 entries  280 entries  512× fewer

DDR Channel Awareness

Modern CPUs have multiple DDR channels (4-8 on server chips). Each channel provides ~25-50 GB/s bandwidth.

# Check memory topology
lsmem
numactl --hardware
dmidecode -t memory | grep -E "Size|Locator"

Why it matters: A 12GB model allocation with 1GB huge pages gets physically contiguous memory. The kernel can place each 1GB page on a different DDR channel, maximizing bandwidth.

Goal: Spread memory across all channels, not concentrated on one.

NUMA and Core Pinning

Multi-socket servers have Non-Uniform Memory Access. Memory is "local" or "remote" to each CPU.

# Run on specific NUMA node (memory + CPU affinity)
numactl --cpunodebind=0 --membind=0 ./model

# Or pin to specific cores
taskset -c 0-15 ./model

# In code: set thread affinity
#define _GNU_SOURCE
#include <sched.h>
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(0, &cpuset);  // Pin to core 0
pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);

Rule: Keep threads and their memory on the same NUMA node. Remote memory access is 2-3× slower.

Model Size → System Tuning Guide

Model        Memory   Huge Pages        NUMA                       Cores
SmolLM-135M  ~540MB   2MB sufficient    Single node                4-8 cores
Llama-7B     ~28GB    1GB recommended   Single node                16-32 cores
Llama-70B    ~280GB   1GB required      Multi-node, pin carefully  All cores, both sockets

Production Checklist

#!/bin/bash
# production_setup.sh - Run before deploying model

MODEL_GB=$1  # e.g., 28 for 7B model

# 1. Reserve 1GB huge pages (need MODEL_GB + buffer)
PAGES=$((MODEL_GB + 4))
echo $PAGES | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

# 2. Disable THP compaction (avoid latency spikes)
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

# 3. Set CPU governor to performance
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance | sudo tee $cpu
done

# 4. Disable swap (avoid page-out during inference)
sudo swapoff -a

# 5. Check NUMA topology
numactl --hardware

echo "System ready for ${MODEL_GB}GB model"

Theoretical Limits
With proper tuning, CPU inference can approach theoretical memory bandwidth limits. A DDR5 system with 8 channels at 4800 MT/s provides ~300 GB/s. A memory-bound 7B model at 2 bytes/param must stream ~14GB of weights per token, giving a theoretical ceiling of roughly 20 tokens/second. Tuning gets you close; without it, you're at 10-20% of that.

Bump Allocator: Why It's Safer

Single Contiguous Memory Block

// At startup: one allocation
void *memory = mmap(NULL, total_bytes, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

// All buffers are offsets into this block
// (cast to char* so the pointer arithmetic is well-defined C)
char *base = (char *)memory;
float *token_emb = (float*)(base + m->token_emb_offset);
float *q = (float*)(base + layer->q_offset);
float *k = (float*)(base + layer->k_offset);

// At shutdown: one deallocation
munmap(memory, total_bytes);

No malloc/free during model execution. No fragmentation. No leaks possible.

All Sizes Known at Codegen Time

// Generated at codegen time - not runtime
m->q_offset = 1048576;      // Computed from config
m->q_size = 786432;         // num_heads × tokens × head_dim × sizeof(float)
m->k_offset = 1835008;      // q_offset + q_size + alignment
m->k_size = 262144;         // num_kv_heads × tokens × head_dim × sizeof(float)

Buffer sizes come from model config. No runtime calculations that could overflow.

Defense in Depth

Layer                Protection                                Status
Architecture         Bump allocator, no dynamic allocation     Implemented
Codegen              All offsets/sizes computed from config    Implemented
Testing              PyTorch parity tests catch logic errors   Implemented
Canaries             Guard values between buffers              Planned
Bounds Checking      Debug-mode index validation               Planned
Static Analysis      Automated cppcheck/scan-build             Planned
Sanitizers           AddressSanitizer in CI                    Planned
Layout Verification  Script to verify no buffer overlap        Planned

Planned Canary System

Guard values placed between buffers to detect overflow at runtime:

Canary Placement

// Memory layout with canaries (debug builds)
// [CANARY][token_emb][CANARY][pos_emb][CANARY][q][CANARY][k][CANARY]...

#define CANARY_VALUE 0xDEADBEEFCAFEBABEULL  // ULL: constant exceeds 32 bits
#define CANARY_SIZE 64  // one cacheline

// `base` is the arena start; write_canary fills one 64-byte guard slot
static size_t bump_with_canary(size_t *off, size_t bytes) {
    // Canary before
    write_canary(base + *off);
    *off += CANARY_SIZE;

    size_t data_offset = *off;
    *off += align64(bytes);

    // Canary after
    write_canary(base + *off);
    *off += CANARY_SIZE;

    return data_offset;
}

Canary Verification

void ck_verify_memory_integrity(const TransformerModel *m) {
    const uint8_t *base = m->memory_base;

    for (int i = 0; i < m->num_buffers; i++) {
        CKBufferInfo *buf = &m->buffers[i];

        // Check canary before buffer
        uint64_t *before = (uint64_t*)(base + buf->offset - CANARY_SIZE);
        if (*before != CANARY_VALUE) {
            fprintf(stderr, "UNDERFLOW detected: buffer '%s' at offset %zu\n",
                    buf->name, buf->offset);
            abort();
        }

        // Check canary after buffer
        uint64_t *after = (uint64_t*)(base + buf->offset + buf->size);
        if (*after != CANARY_VALUE) {
            fprintf(stderr, "OVERFLOW detected: buffer '%s' at offset %zu\n",
                    buf->name, buf->offset);
            abort();
        }
    }
}

Planned Bounds-Checked Access

Optional bounds checking in debug builds:

#ifdef CK_DEBUG_BOUNDS

#define CK_LOAD(buf, idx, max) ck_load_checked(buf, idx, max, __FILE__, __LINE__)
#define CK_STORE(buf, idx, max, val) ck_store_checked(buf, idx, max, val, __FILE__, __LINE__)

static inline float ck_load_checked(const float *buf, size_t idx, size_t max,
                                     const char *file, int line) {
    if (idx >= max) {
        fprintf(stderr, "%s:%d: OUT OF BOUNDS READ: idx=%zu, max=%zu\n",
                file, line, idx, max);
        abort();
    }
    return buf[idx];
}

static inline void ck_store_checked(float *buf, size_t idx, size_t max, float val,
                                     const char *file, int line) {
    if (idx >= max) {
        fprintf(stderr, "%s:%d: OUT OF BOUNDS WRITE: idx=%zu, max=%zu\n",
                file, line, idx, max);
        abort();
    }
    buf[idx] = val;
}

#else

#define CK_LOAD(buf, idx, max) ((buf)[idx])
#define CK_STORE(buf, idx, max, val) ((buf)[idx] = (val))

#endif

Planned Verification Script

verify_memory_safety.sh

#!/bin/bash
set -e

echo "=== C-Kernel-Engine Memory Safety Verification ==="

# 1. Static Analysis
echo "[1/5] Running static analysis..."
cppcheck --enable=all --error-exitcode=1 src/kernels/*.c
scan-build -o /tmp/scan-build make clean all

# 2. Verify buffer layout (no overlaps)
echo "[2/5] Verifying buffer layout..."
./build/ck_layout_check --config test/configs/smollm-135m.json

# 3. Build with AddressSanitizer
echo "[3/5] Building with AddressSanitizer..."
make clean
CFLAGS="-fsanitize=address -g -O1" make all

# 4. Run tests with ASAN
echo "[4/5] Running tests with AddressSanitizer..."
./build/test_all

# 5. Valgrind check
echo "[5/5] Running Valgrind..."
valgrind --leak-check=full --error-exitcode=1 ./build/test_orchestration

echo "=== All memory safety checks passed ==="

Planned Layout Verification

Startup verification that buffers don't overlap (runs once, before inference):

int ck_verify_layout(const TransformerModel *m) {
    // Sort buffers by offset
    CKBufferRange ranges[MAX_BUFFERS];
    int n = collect_buffer_ranges(m, ranges);
    qsort(ranges, n, sizeof(CKBufferRange), compare_by_offset);

    // Check for overlaps
    for (int i = 0; i < n - 1; i++) {
        size_t end_i = ranges[i].offset + ranges[i].size;
        size_t start_next = ranges[i + 1].offset;

        if (end_i > start_next) {
            fprintf(stderr, "OVERLAP: '%s' [%zu-%zu] overlaps '%s' [%zu-%zu]\n",
                    ranges[i].name, ranges[i].offset, end_i,
                    ranges[i+1].name, start_next,
                    ranges[i+1].offset + ranges[i+1].size);
            return -1;
        }
    }

    // Verify total doesn't exceed allocation
    size_t last_end = ranges[n-1].offset + ranges[n-1].size;
    if (last_end > m->total_bytes) {
        fprintf(stderr, "OVERFLOW: buffers extend to %zu but only %zu allocated\n",
                last_end, m->total_bytes);
        return -1;
    }

    return 0;  // Layout is valid
}

Rust Comparison

Different Approaches, Same Goal
Rust prevents memory errors at compile time. We prevent them through architectural constraints + runtime verification.

Aspect                  Rust                                  C-Kernel-Engine
Safety guarantee        Compile-time (borrow checker)         Design-time + runtime verification
Dynamic allocation      Safe via ownership                    Eliminated entirely (bump allocator)
Buffer bounds           Checked by default                    Sizes baked in at codegen + optional runtime checks
Post-compile tampering  Vulnerable (binary can be modified)   Vulnerable (same)
Auditability            Requires Rust expertise               Simple C, one memory block, predictable layout

Guiding Principles

1. Don't Touch Generated Code

The generated model.c comes from tested kernels. Manual edits introduce risk.

2. Trust the Config

All sizes derived from config.json. If config is valid, layout is valid.

3. Verify, Don't Assume

Canaries, bounds checks, and layout verification catch errors early.
