Memory Safety

Design Document
This page describes our memory safety design philosophy and planned features. Sections marked Planned are not yet implemented.

Philosophy: Safety Through Simplicity

Rather than relying on language-level safety (Rust) or runtime garbage collection, C-Kernel-Engine achieves memory safety through architectural simplicity:

Traditional C Problems

  • malloc/free scattered everywhere
  • Use-after-free vulnerabilities
  • Double-free crashes
  • Runtime size calculations (overflow risk)
  • Manual pointer arithmetic
  • Unknown buffer boundaries

Our Approach

  • One mmap at start, one munmap at end
  • No frees during execution → no use-after-free
  • Nothing to double-free
  • All sizes from config at codegen time
  • Pre-computed offsets, no pointer math
  • Every buffer size baked into generated code

Memory Allocation Hierarchy

Why Huge Pages?
A 540MB model needs 138,240 TLB entries with 4KB pages, but only 270 entries with 2MB huge pages. Fewer TLB misses = faster memory access.

The Fallback Hierarchy

void *ck_huge_alloc(size_t bytes)
{
    size_t len = align_up_bytes(bytes, HUGE_PAGE_SIZE);  // 2MB alignment

    // 1. Try explicit huge pages (best performance)
    void *p = mmap(NULL, len,
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                   -1, 0);
    if (p != MAP_FAILED) {
        return p;  // Got 2MB pages directly from kernel
    }

    // 2. Fallback: aligned_alloc + transparent huge page hint
    //    (note: this path must be released with free(), not munmap())
    void *q = aligned_alloc(HUGE_PAGE_SIZE, len);
    if (!q) return NULL;

    // 3. Hint to kernel: please use huge pages if possible
    madvise(q, len, MADV_HUGEPAGE);
    return q;
}

Step 1: MAP_HUGETLB (Explicit Huge Pages)

What it does: Requests 2MB pages directly from the kernel's hugepage pool.

Requirements:

# Reserve 1GB of huge pages (512 × 2MB)
echo 512 | sudo tee /proc/sys/vm/nr_hugepages

# Or make permanent in /etc/sysctl.conf
vm.nr_hugepages = 512

Why it's best: Guaranteed 2MB pages, lowest TLB pressure, physically contiguous memory.

When it fails: No huge pages reserved, or pool exhausted.

Step 2: aligned_alloc + madvise(MADV_HUGEPAGE)

What it does: Allocates 2MB-aligned memory, then hints to kernel to use Transparent Huge Pages (THP).

// Allocate with 2MB alignment
void *q = aligned_alloc(2 * 1024 * 1024, len);

// Tell kernel: "I'd like huge pages here please"
madvise(q, len, MADV_HUGEPAGE);

How THP works: Kernel can promote 4KB pages to 2MB pages in the background (khugepaged daemon).

Trade-off: Not guaranteed, may take time to promote, but works without root.

Step 3: Regular 4KB Pages (Implicit Fallback)

If madvise hint is ignored, you get regular pages. Still works, just more TLB pressure.

Understanding THP and madvise

THP = Transparent Huge Pages
A Linux kernel feature that automatically promotes 4KB pages to 2MB pages. madvise() is how you tell the kernel which memory regions to promote.

THP Modes

# Check current mode
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never

Mode      Behavior                          madvise needed?         Recommendation
always    Kernel promotes ALL allocations   No, automatic           Can cause stalls
madvise   Only promotes if you hint         Yes, you must call it   Best for production
never     THP disabled entirely             N/A, ignored            Use explicit hugepages

How THP Modes Affect Your Code

// Your allocation (illustrative sizes):
void *q = aligned_alloc(2UL << 20, 100UL << 20);  // 2MB alignment, 100MB
madvise(q, 100UL << 20, MADV_HUGEPAGE);

// What happens next depends on THP mode:

mode = "always"

Kernel already trying to promote everything. Your madvise is redundant but harmless.

mode = "madvise"

Kernel marks your region for promotion. khugepaged will promote it in background.

mode = "never"

THP disabled. Your madvise is ignored. Stays as 4KB pages.

Timeline: What Actually Happens

Time 0ms:    aligned_alloc(100MB)
             → Kernel allocates 25,600 × 4KB pages

Time 0ms:    madvise(ptr, 100MB, MADV_HUGEPAGE)
             → Kernel marks region: "promote this when possible"

Time 1ms:    Your code continues running...
             → Still using 4KB pages, TLB pressure is high

Time 100ms:  khugepaged daemon wakes up
             → Scans memory for marked regions

Time 150ms:  khugepaged finds your region
             → Merges 512 × 4KB pages into 1 × 2MB page
             → Repeats until region is fully promoted

Time 200ms+: Your code now using 2MB pages
             → TLB pressure reduced 512×

Recommended: Use "madvise" Mode

# Set system to madvise mode
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

# Make permanent in /etc/rc.local or systemd service

Why madvise mode is best:

  • "always" promotes every allocation system-wide, which can cause allocation stalls in unrelated processes
  • "never" ignores the hint entirely, leaving only explicit MAP_HUGETLB pages
  • "madvise" promotes exactly the regions you mark: huge pages where they matter, no system-wide side effects

Complete Allocation Flow

void *ck_huge_alloc(size_t bytes)
{
    // ATTEMPT 1: Explicit huge pages
    // Bypasses THP entirely - goes directly to kernel's hugepage pool
    void *p = mmap(..., MAP_HUGETLB);
    if (p != MAP_FAILED) {
        return p;  // Got guaranteed 2MB pages, done!
    }

    // ATTEMPT 2: Regular allocation + THP hint
    void *q = aligned_alloc(2MB, len);  // Initially 4KB pages

    madvise(q, len, MADV_HUGEPAGE);     // Hint to kernel
    // ↓
    // If THP="always"  → already being promoted, hint redundant
    // If THP="madvise" → khugepaged will promote in background
    // If THP="never"   → stays 4KB, hint ignored

    return q;
}

Method                  Page Size          TLB Entries (540MB)  Requirements        Status
MAP_HUGETLB             2MB                270                  Hugepages reserved  Implemented
madvise(MADV_HUGEPAGE)  2MB (if promoted)  270 (best case)      THP enabled         Implemented
Regular mmap            4KB                138,240              None                Fallback

Why This Hierarchy Works

  1. Best case (production): Pre-reserve huge pages → MAP_HUGETLB succeeds → maximum performance
  2. Development: No reservation → THP kicks in → good performance without root
  3. Worst case: Everything falls back to 4KB → still correct, just slower

The model always works. Performance scales with what the system provides.

Production Tuning: Squeezing Every Ounce

The Hidden Truth
When you buy NVIDIA GPUs, all this tuning is done for you. VRAM, TLB, memory channels - all optimized by NVIDIA engineers. For CPU inference, you are the systems engineer.

1GB Huge Pages for Large Models

For models >1GB, use 1GB huge pages instead of 2MB:

# Check if 1GB pages are supported
grep pdpe1gb /proc/cpuinfo

# Reserve 1GB huge pages (requires reboot or boot param)
# In /etc/default/grub:
GRUB_CMDLINE_LINUX="hugepagesz=1G hugepages=16 default_hugepagesz=1G"

# Or at runtime (if supported):
echo 16 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

Model Size   2MB Pages        1GB Pages    TLB Improvement
7B (28GB)    14,336 entries   28 entries   512× fewer
70B (280GB)  143,360 entries  280 entries  512× fewer

DDR Channel Awareness

Modern CPUs have multiple DDR channels (4-8 on server chips). Each channel provides ~25-50 GB/s bandwidth.

# Check memory topology
lsmem
numactl --hardware
dmidecode -t memory | grep -E "Size|Locator"

Why it matters: A 12GB model allocation with 1GB huge pages gets physically contiguous memory. The kernel can place each 1GB page on a different DDR channel, maximizing bandwidth.

Goal: Spread memory across all channels, not concentrated on one.

NUMA and Core Pinning

Multi-socket servers have Non-Uniform Memory Access. Memory is "local" or "remote" to each CPU.

# Run on specific NUMA node (memory + CPU affinity)
numactl --cpunodebind=0 --membind=0 ./model

# Or pin to specific cores
taskset -c 0-15 ./model

# In code: set thread affinity
#define _GNU_SOURCE
#include <sched.h>
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(0, &cpuset);  // Pin to core 0
pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);

Rule: Keep threads and their memory on the same NUMA node. Remote memory access is 2-3× slower.

Model Size → System Tuning Guide

Model        Memory   Huge Pages        NUMA                       Cores
SmolLM-135M  ~540MB   2MB sufficient    Single node                4-8 cores
Llama-7B     ~28GB    1GB recommended   Single node                16-32 cores
Llama-70B    ~280GB   1GB required      Multi-node, pin carefully  All cores, both sockets

Production Checklist

#!/bin/bash
# production_setup.sh - Run before deploying model

MODEL_GB=$1  # e.g., 28 for 7B model

# 1. Reserve 1GB huge pages (need MODEL_GB + buffer)
PAGES=$((MODEL_GB + 4))
echo $PAGES | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

# 2. Disable THP compaction (avoid latency spikes)
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

# 3. Set CPU governor to performance
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance | sudo tee $cpu
done

# 4. Disable swap (avoid page-out during inference)
sudo swapoff -a

# 5. Check NUMA topology
numactl --hardware

echo "System ready for ${MODEL_GB}GB model"

Theoretical Limits
With proper tuning, CPU inference can approach theoretical memory bandwidth limits. A DDR5 system with 8 channels at 4800 MT/s provides ~300 GB/s. A memory-bound 7B model at 2 bytes/param must stream ~14GB of weights per token, giving a theoretical ceiling of roughly 20 tokens/second. Tuning gets you close; without it, you're at 10-20% of that.

Bump Allocator: Why It's Safer

Single Contiguous Memory Block

// At startup: one allocation
void *memory = mmap(NULL, total_bytes, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

// All buffers are offsets into this block
// (cast to char* so the pointer arithmetic is well-defined C)
char *base = (char *)memory;
float *token_emb = (float*)(base + m->token_emb_offset);
float *q = (float*)(base + layer->q_offset);
float *k = (float*)(base + layer->k_offset);

// At shutdown: one deallocation
munmap(memory, total_bytes);

No malloc/free during model execution. No fragmentation. No leaks possible.

All Sizes Known at Codegen Time

// Generated at codegen time - not runtime
m->q_offset = 1048576;      // Computed from config
m->q_size = 786432;         // num_heads × tokens × head_dim × sizeof(float)
m->k_offset = 1835008;      // q_offset + q_size + alignment
m->k_size = 262144;         // num_kv_heads × tokens × head_dim × sizeof(float)

Buffer sizes come from model config. No runtime calculations that could overflow.

Defense in Depth

Layer                Protection                                Status
Architecture         Bump allocator, no dynamic allocation     Implemented
Codegen              All offsets/sizes computed from config    Implemented
Testing              PyTorch parity tests catch logic errors   Implemented
Canaries             Guard values between buffers              Planned
Bounds Checking      Debug-mode index validation               Planned
Static Analysis      Automated cppcheck/scan-build             Planned
Sanitizers           AddressSanitizer in CI                    Planned
Layout Verification  Script to verify no buffer overlap        Planned

Planned Canary System

Guard values placed between buffers to detect overflow at runtime:

Canary Placement

// Memory layout with canaries (debug builds)
// [CANARY][token_emb][CANARY][pos_emb][CANARY][q][CANARY][k][CANARY]...

#define CANARY_VALUE 0xDEADBEEFCAFEBABEULL  // ULL: constant exceeds 32 bits
#define CANARY_SIZE 64  // one cacheline

// `base` is the arena start; write_canary fills one 64-byte guard slot
static size_t bump_with_canary(size_t *off, size_t bytes) {
    // Canary before
    write_canary(base + *off);
    *off += CANARY_SIZE;

    size_t data_offset = *off;
    *off += align64(bytes);

    // Canary after
    write_canary(base + *off);
    *off += CANARY_SIZE;

    return data_offset;
}

Canary Verification

void ck_verify_memory_integrity(const TransformerModel *m) {
    const uint8_t *base = m->memory_base;

    for (int i = 0; i < m->num_buffers; i++) {
        CKBufferInfo *buf = &m->buffers[i];

        // Check canary before buffer
        uint64_t *before = (uint64_t*)(base + buf->offset - CANARY_SIZE);
        if (*before != CANARY_VALUE) {
            fprintf(stderr, "UNDERFLOW detected: buffer '%s' at offset %zu\n",
                    buf->name, buf->offset);
            abort();
        }

        // Check canary after buffer
        uint64_t *after = (uint64_t*)(base + buf->offset + buf->size);
        if (*after != CANARY_VALUE) {
            fprintf(stderr, "OVERFLOW detected: buffer '%s' at offset %zu\n",
                    buf->name, buf->offset);
            abort();
        }
    }
}

Planned Bounds-Checked Access

Optional bounds checking in debug builds:

#ifdef CK_DEBUG_BOUNDS

#define CK_LOAD(buf, idx, max) ck_load_checked(buf, idx, max, __FILE__, __LINE__)
#define CK_STORE(buf, idx, max, val) ck_store_checked(buf, idx, max, val, __FILE__, __LINE__)

static inline float ck_load_checked(const float *buf, size_t idx, size_t max,
                                     const char *file, int line) {
    if (idx >= max) {
        fprintf(stderr, "%s:%d: OUT OF BOUNDS READ: idx=%zu, max=%zu\n",
                file, line, idx, max);
        abort();
    }
    return buf[idx];
}

static inline void ck_store_checked(float *buf, size_t idx, size_t max, float val,
                                     const char *file, int line) {
    if (idx >= max) {
        fprintf(stderr, "%s:%d: OUT OF BOUNDS WRITE: idx=%zu, max=%zu\n",
                file, line, idx, max);
        abort();
    }
    buf[idx] = val;
}

#else

#define CK_LOAD(buf, idx, max) ((buf)[idx])
#define CK_STORE(buf, idx, max, val) ((buf)[idx] = (val))

#endif

Planned Verification Script

verify_memory_safety.sh

#!/bin/bash
set -e

echo "=== C-Kernel-Engine Memory Safety Verification ==="

# 1. Static Analysis
echo "[1/5] Running static analysis..."
cppcheck --enable=all --error-exitcode=1 src/kernels/*.c
scan-build -o /tmp/scan-build make clean all

# 2. Verify buffer layout (no overlaps)
echo "[2/5] Verifying buffer layout..."
./build/ck_layout_check --config test/configs/smollm-135m.json

# 3. Build with AddressSanitizer
echo "[3/5] Building with AddressSanitizer..."
make clean
CFLAGS="-fsanitize=address -g -O1" make all

# 4. Run tests with ASAN
echo "[4/5] Running tests with AddressSanitizer..."
./build/test_all

# 5. Valgrind check
echo "[5/5] Running Valgrind..."
valgrind --leak-check=full --error-exitcode=1 ./build/test_orchestration

echo "=== All memory safety checks passed ==="

Planned Layout Verification

Startup verification that buffers don't overlap (runs once, before inference):

int ck_verify_layout(const TransformerModel *m) {
    // Sort buffers by offset
    CKBufferRange ranges[MAX_BUFFERS];
    int n = collect_buffer_ranges(m, ranges);
    qsort(ranges, n, sizeof(CKBufferRange), compare_by_offset);

    // Check for overlaps
    for (int i = 0; i < n - 1; i++) {
        size_t end_i = ranges[i].offset + ranges[i].size;
        size_t start_next = ranges[i + 1].offset;

        if (end_i > start_next) {
            fprintf(stderr, "OVERLAP: '%s' [%zu-%zu] overlaps '%s' [%zu-%zu]\n",
                    ranges[i].name, ranges[i].offset, end_i,
                    ranges[i+1].name, start_next,
                    ranges[i+1].offset + ranges[i+1].size);
            return -1;
        }
    }

    // Verify total doesn't exceed allocation
    size_t last_end = ranges[n-1].offset + ranges[n-1].size;
    if (last_end > m->total_bytes) {
        fprintf(stderr, "OVERFLOW: buffers extend to %zu but only %zu allocated\n",
                last_end, m->total_bytes);
        return -1;
    }

    return 0;  // Layout is valid
}

Rust Comparison

Different Approaches, Same Goal
Rust prevents memory errors at compile time. We prevent them through architectural constraints + runtime verification.

Aspect                  Rust                                  C-Kernel-Engine
Safety guarantee        Compile-time (borrow checker)         Design-time + runtime verification
Dynamic allocation      Safe via ownership                    Eliminated entirely (bump allocator)
Buffer bounds           Checked by default                    Sizes baked in at codegen + optional runtime checks
Post-compile tampering  Vulnerable (binary can be modified)   Vulnerable (same)
Auditability            Requires Rust expertise               Simple C, one memory block, predictable layout

Guiding Principles

1. Don't Touch Generated Code

The generated model.c comes from tested kernels. Manual edits introduce risk.

2. Trust the Config

All sizes derived from config.json. If config is valid, layout is valid.

3. Verify, Don't Assume

Canaries, bounds checks, and layout verification catch errors early.
