Memory Safety
This page describes our memory safety design philosophy and planned features. Sections marked Planned are not yet implemented.
Philosophy: Safety Through Simplicity
Rather than relying on language-level safety (Rust) or runtime garbage collection, C-Kernel-Engine achieves memory safety through architectural simplicity:
Traditional C Problems
- `malloc`/`free` scattered everywhere
- Use-after-free vulnerabilities
- Double-free crashes
- Runtime size calculations (overflow risk)
- Manual pointer arithmetic
- Unknown buffer boundaries
Our Approach
- One `mmap` at start, one `munmap` at end
- No frees during execution → no use-after-free
- Nothing to double-free
- All sizes from config at codegen time
- Pre-computed offsets, no pointer math
- Every buffer size baked into generated code
Memory Allocation Hierarchy
A 540MB model needs 138,240 TLB entries with 4KB pages, but only 270 entries with 2MB huge pages. Fewer TLB misses = faster memory access.
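The arithmetic behind those numbers is simple enough to sanity-check in a few lines (the helper name `tlb_entries` is ours, not part of the engine):

```c
#include <assert.h>

// Entries needed to map `bytes` of address space at a given page size,
// rounding up for a partial final page.
static unsigned long tlb_entries(unsigned long bytes, unsigned long page_size)
{
    return (bytes + page_size - 1) / page_size;
}

// For a 540MB model:
//   tlb_entries(540UL << 20, 4096)      == 138240  (4KB pages)
//   tlb_entries(540UL << 20, 2UL << 20) == 270     (2MB pages)
```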
The Fallback Hierarchy
void *ck_huge_alloc(size_t bytes)
{
size_t len = align_up_bytes(bytes, HUGE_PAGE_SIZE); // 2MB alignment
// 1. Try explicit huge pages (best performance)
void *p = mmap(NULL, len,
PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
-1, 0);
if (p != MAP_FAILED) {
return p; // Got 2MB pages directly from kernel
}
// 2. Fallback: aligned_alloc + transparent huge page hint
void *q = aligned_alloc(HUGE_PAGE_SIZE, len);
if (!q) return NULL;
// 3. Hint to kernel: please use huge pages if possible
madvise(q, len, MADV_HUGEPAGE);
return q;
}
Step 1: MAP_HUGETLB (Explicit Huge Pages)
What it does: Requests 2MB pages directly from the kernel's hugepage pool.
Requirements:
# Reserve 1GB of huge pages (512 × 2MB)
echo 512 | sudo tee /proc/sys/vm/nr_hugepages

# Or make permanent in /etc/sysctl.conf:
vm.nr_hugepages = 512
Why it's best: Guaranteed 2MB pages, lowest TLB pressure, physically contiguous memory.
When it fails: No huge pages reserved, or pool exhausted.
Step 2: aligned_alloc + madvise(MADV_HUGEPAGE)
What it does: Allocates 2MB-aligned memory, then hints to kernel to use Transparent Huge Pages (THP).
// Allocate with 2MB alignment
void *q = aligned_alloc(2 * 1024 * 1024, len);
// Tell kernel: "I'd like huge pages here please"
madvise(q, len, MADV_HUGEPAGE);
How THP works: Kernel can promote 4KB pages to 2MB pages in the background (khugepaged daemon).
Trade-off: Not guaranteed, may take time to promote, but works without root.
Step 3: Regular 4KB Pages (Implicit Fallback)
If madvise hint is ignored, you get regular pages. Still works, just more TLB pressure.
Understanding THP and madvise
Transparent Huge Pages (THP) is a Linux kernel feature that automatically promotes 4KB pages to 2MB pages.
madvise() is how you tell the kernel which memory regions to promote.
THP Modes
# Check current mode
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never
| Mode | Behavior | madvise needed? | Recommendation |
|---|---|---|---|
| `always` | Kernel promotes ALL allocations | No, automatic | Can cause stalls |
| `madvise` | Only promotes if you hint | Yes, you must call it | Best for production |
| `never` | THP disabled entirely | N/A, ignored | Use explicit hugepages |
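Code can also act on the current mode (for example, to warn when THP is disabled) by parsing the bracketed entry in that sysfs file. A sketch; `ck_thp_mode` is a hypothetical helper, not an existing engine function:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

// The active mode is the bracketed word in
// /sys/kernel/mm/transparent_hugepage/enabled, e.g. "always [madvise] never".
typedef enum { THP_ALWAYS, THP_MADVISE, THP_NEVER } ck_thp_mode_t;

// Returns the mode, or -1 if the file is unreadable (non-Linux,
// or THP not compiled into the kernel).
static int ck_thp_mode(void)
{
    char buf[128];
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");
    if (!f) return -1;
    if (!fgets(buf, sizeof buf, f)) { fclose(f); return -1; }
    fclose(f);
    if (strstr(buf, "[always]"))  return THP_ALWAYS;
    if (strstr(buf, "[madvise]")) return THP_MADVISE;
    if (strstr(buf, "[never]"))   return THP_NEVER;
    return -1;
}
```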
How THP Modes Affect Your Code
// Your allocation:
void *q = aligned_alloc(2MB, 100MB);
madvise(q, 100MB, MADV_HUGEPAGE);

// What happens next depends on THP mode:
mode = "always"
Kernel already trying to promote everything. Your madvise is redundant but harmless.
mode = "madvise"
Kernel marks your region for promotion. khugepaged will promote it in background.
mode = "never"
THP disabled. Your madvise is ignored. Stays as 4KB pages.
Timeline: What Actually Happens
Time 0ms: aligned_alloc(100MB)
→ Kernel allocates 25,600 × 4KB pages
Time 0ms: madvise(ptr, 100MB, MADV_HUGEPAGE)
→ Kernel marks region: "promote this when possible"
Time 1ms: Your code continues running...
→ Still using 4KB pages, TLB pressure is high
Time 100ms: khugepaged daemon wakes up
→ Scans memory for marked regions
Time 150ms: khugepaged finds your region
→ Merges 512 × 4KB pages into 1 × 2MB page
→ Repeats until region is fully promoted
Time 200ms+: Your code now using 2MB pages
→ TLB pressure reduced 512×
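Whether that promotion has actually happened can be observed from userspace: the kernel reports a process's huge-page-backed anonymous memory in `/proc/self/smaps_rollup` (Linux 4.14+). A sketch; `ck_anon_huge_kb` is a hypothetical helper name:

```c
#include <assert.h>
#include <stdio.h>

// Total anonymous memory of this process currently backed by huge pages,
// in kB, from the AnonHugePages line of /proc/self/smaps_rollup.
// Returns -1 if the file or line is unavailable.
static long ck_anon_huge_kb(void)
{
    char line[256];
    FILE *f = fopen("/proc/self/smaps_rollup", "r");
    if (!f) return -1;
    long kb = -1;
    while (fgets(line, sizeof line, f)) {
        if (sscanf(line, "AnonHugePages: %ld kB", &kb) == 1)
            break;
    }
    fclose(f);
    return kb;
}
```

Polling this value after `madvise(..., MADV_HUGEPAGE)` shows the khugepaged promotion from the timeline above actually taking effect.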
Recommended: Use "madvise" Mode
# Set system to madvise mode
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

# Make permanent in /etc/rc.local or a systemd service
Why madvise mode is best:
- `always` → kernel tries to promote everything, causes random stalls
- `madvise` → you control exactly which allocations get huge pages
- Your model memory → promoted (you called madvise)
- Small temp buffers → left as 4KB (no overhead)
Complete Allocation Flow
void *ck_huge_alloc(size_t bytes)
{
// ATTEMPT 1: Explicit huge pages
// Bypasses THP entirely - goes directly to kernel's hugepage pool
void *p = mmap(..., MAP_HUGETLB);
if (p != MAP_FAILED) {
return p; // Got guaranteed 2MB pages, done!
}
// ATTEMPT 2: Regular allocation + THP hint
void *q = aligned_alloc(2MB, len); // Initially 4KB pages
madvise(q, len, MADV_HUGEPAGE); // Hint to kernel
// ↓
// If THP="always" → already being promoted, hint redundant
// If THP="madvise" → khugepaged will promote in background
// If THP="never" → stays 4KB, hint ignored
return q;
}
| Method | Page Size | TLB Entries (540MB) | Requirements | Status |
|---|---|---|---|---|
| `MAP_HUGETLB` | 2MB | 270 | Hugepages reserved | Implemented |
| `madvise(MADV_HUGEPAGE)` | 2MB (if promoted) | 270 (best case) | THP enabled | Implemented |
| Regular mmap | 4KB | 138,240 | None | Fallback |
Why This Hierarchy Works
- Best case (production): Pre-reserve huge pages → `MAP_HUGETLB` succeeds → maximum performance
- Development: No reservation → THP kicks in → good performance without root
- Worst case: Everything falls back to 4KB → still correct, just slower
The model always works. Performance scales with what the system provides.
Production Tuning: Squeezing Every Ounce
When you buy NVIDIA GPUs, all this tuning is done for you: VRAM, TLB, and memory channels are all optimized by NVIDIA engineers. For CPU inference, you are the systems engineer.
1GB Huge Pages for Large Models
For models >1GB, use 1GB huge pages instead of 2MB:
# Check if 1GB pages are supported
grep pdpe1gb /proc/cpuinfo

# Reserve 1GB huge pages (requires reboot or boot param)
# In /etc/default/grub:
GRUB_CMDLINE_LINUX="hugepagesz=1G hugepages=16 default_hugepagesz=1G"

# Or at runtime (if supported):
echo 16 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
| Model Size | 2MB Pages | 1GB Pages | TLB Improvement |
|---|---|---|---|
| 7B (28GB) | 14,336 entries | 28 entries | 512× fewer |
| 70B (280GB) | 143,360 entries | 280 entries | 512× fewer |
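Selecting the 1GB pool from code takes one extra mmap flag, `MAP_HUGE_1GB`. A sketch under the same fallback contract as `ck_huge_alloc` above; the function name `ck_huge_alloc_1g` is ours, and the flag is defined manually in case the system headers predate it:

```c
#define _GNU_SOURCE  // MAP_ANONYMOUS / MAP_HUGETLB under strict -std modes
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)  // log2(1GB) = 30, encoded in the flags
#endif

// Request 1GB pages explicitly. Returns NULL when the 1GB pool is empty
// or the CPU lacks pdpe1gb, so the caller can fall back to 2MB pages.
// Assumes len is a multiple of 1GB.
static void *ck_huge_alloc_1g(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                   -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}
```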
DDR Channel Awareness
Modern CPUs have multiple DDR channels (4-8 on server chips). Each channel provides ~25-50 GB/s bandwidth.
# Check memory topology
lsmem
numactl --hardware
dmidecode -t memory | grep -E "Size|Locator"
Why it matters: A 12GB model allocation with 1GB huge pages gets physically contiguous memory. The kernel can place each 1GB page on a different DDR channel, maximizing bandwidth.
Goal: Spread memory across all channels, not concentrated on one.
NUMA and Core Pinning
Multi-socket servers have Non-Uniform Memory Access. Memory is "local" or "remote" to each CPU.
# Run on specific NUMA node (memory + CPU affinity)
numactl --cpunodebind=0 --membind=0 ./model

# Or pin to specific cores
taskset -c 0-15 ./model

// In code: set thread affinity
#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>

cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(0, &cpuset); // Pin to core 0
pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
Rule: Keep threads and their memory on the same NUMA node. Remote memory access is 2-3× slower.
Model Size → System Tuning Guide
| Model | Memory | Huge Pages | NUMA | Cores |
|---|---|---|---|---|
| SmolLM-135M | ~540MB | 2MB sufficient | Single node | 4-8 cores |
| Llama-7B | ~28GB | 1GB recommended | Single node | 16-32 cores |
| Llama-70B | ~280GB | 1GB required | Multi-node, pin carefully | All cores, both sockets |
Production Checklist
#!/bin/bash
# production_setup.sh - Run before deploying model
MODEL_GB=$1 # e.g., 28 for 7B model
# 1. Reserve 1GB huge pages (need MODEL_GB + buffer)
PAGES=$((MODEL_GB + 4))
echo $PAGES | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
# 2. Disable THP compaction (avoid latency spikes)
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
# 3. Set CPU governor to performance
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
echo performance | sudo tee $cpu
done
# 4. Disable swap (avoid page-out during inference)
sudo swapoff -a
# 5. Check NUMA topology
numactl --hardware
echo "System ready for ${MODEL_GB}GB model"
With proper tuning, CPU inference can approach theoretical memory bandwidth limits. A DDR5 system with 8 channels at 4800 MT/s provides ~300 GB/s. A memory-bound 7B model at 2 bytes/param streams ~14GB of weights per token, giving a theoretical max of ~21 tokens/second. Tuning gets you close; without it, you're at 10-20% of that.
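That roofline estimate is just bandwidth divided by bytes streamed per token; as a checkable sketch (assuming fp16 weights and that every parameter is read once per generated token):

```c
#include <assert.h>

// Memory-bound decode roofline: tokens/s ≈ bandwidth / (params × bytes/param).
// params_b is in billions, so params_b × bytes_per_param is GB per token
// and the GB units cancel.
static double peak_tokens_per_sec(double gb_per_sec, double params_b,
                                  double bytes_per_param)
{
    return gb_per_sec / (params_b * bytes_per_param);
}

// 8-channel DDR5-4800 ≈ 300 GB/s, 7B params at 2 bytes (fp16):
//   peak_tokens_per_sec(300.0, 7.0, 2.0) ≈ 21.4 tokens/s
```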
Bump Allocator: Why It's Safer
Single Contiguous Memory Block
// At startup: one allocation
void *memory = mmap(NULL, total_bytes, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
// All buffers are offsets into this block
// (cast to char*: byte arithmetic on void* is a GCC extension, not standard C)
float *token_emb = (float*)((char*)memory + m->token_emb_offset);
float *q = (float*)((char*)memory + layer->q_offset);
float *k = (float*)((char*)memory + layer->k_offset);
// At shutdown: one deallocation
munmap(memory, total_bytes);
No malloc/free during model execution. No fragmentation. No leaks possible.
All Sizes Known at Codegen Time
// Generated at codegen time - not runtime
m->q_offset = 1048576; // Computed from config
m->q_size   = 786432;  // num_heads × tokens × head_dim × sizeof(float)
m->k_offset = 1835008; // q_offset + q_size + alignment
m->k_size   = 262144;  // num_kv_heads × tokens × head_dim × sizeof(float)
Buffer sizes come from model config. No runtime calculations that could overflow.
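The generator can produce those offsets with a two-line bump routine; a minimal sketch, assuming 64-byte (cacheline) alignment:

```c
#include <assert.h>
#include <stddef.h>

// Round up to the next 64-byte boundary.
static size_t align64(size_t n) { return (n + 63) & ~(size_t)63; }

// Codegen-time bump "allocator": never allocates at runtime, it only
// assigns each buffer the current offset and advances the cursor.
static size_t bump(size_t *cursor, size_t bytes)
{
    size_t off = *cursor;       // this buffer starts here
    *cursor += align64(bytes);  // next buffer starts after it
    return off;
}
```

Usage at codegen time: `size_t cur = 0; m->q_offset = bump(&cur, q_size);` and so on, with `cur` ending up as the total to `mmap`.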
Defense in Depth
| Layer | Protection | Status |
|---|---|---|
| Architecture | Bump allocator, no dynamic allocation | Implemented |
| Codegen | All offsets/sizes computed from config | Implemented |
| Testing | PyTorch parity tests catch logic errors | Implemented |
| Canaries | Guard values between buffers | Planned |
| Bounds Checking | Debug-mode index validation | Planned |
| Static Analysis | Automated cppcheck/scan-build | Planned |
| Sanitizers | AddressSanitizer in CI | Planned |
| Layout Verification | Script to verify no buffer overlap | Planned |
Planned Canary System
Guard values placed between buffers to detect overflow at runtime:
Canary Placement
// Memory layout with canaries (debug builds)
// [CANARY][token_emb][CANARY][pos_emb][CANARY][q][CANARY][k][CANARY]...
#define CANARY_VALUE 0xDEADBEEFCAFEBABE
#define CANARY_SIZE 64 // one cacheline
static size_t bump_with_canary(size_t *off, size_t bytes) {
// Canary before
write_canary(base + *off);
*off += CANARY_SIZE;
size_t data_offset = *off;
*off += align64(bytes);
// Canary after
write_canary(base + *off);
*off += CANARY_SIZE;
return data_offset;
}
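The sketch above leaves `base`, `write_canary`, and `align64` undefined; one plausible shape for the canary helpers (the names, the `ULL` suffix, and the memcpy-based access are our assumptions):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CANARY_VALUE 0xDEADBEEFCAFEBABEULL
#define CANARY_SIZE  64 // one cacheline

// Fill a CANARY_SIZE guard region with repeated copies of CANARY_VALUE.
// memcpy avoids unaligned-access and strict-aliasing pitfalls.
static void write_canary(uint8_t *at)
{
    uint64_t v = CANARY_VALUE;
    for (size_t i = 0; i < CANARY_SIZE; i += sizeof v)
        memcpy(at + i, &v, sizeof v);
}

// Check the region the same way; returns 1 if intact, 0 if stomped.
static int canary_intact(const uint8_t *at)
{
    uint64_t v;
    for (size_t i = 0; i < CANARY_SIZE; i += sizeof v) {
        memcpy(&v, at + i, sizeof v);
        if (v != CANARY_VALUE) return 0;
    }
    return 1;
}
```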
Canary Verification
void ck_verify_memory_integrity(const TransformerModel *m) {
const uint8_t *base = m->memory_base;
for (int i = 0; i < m->num_buffers; i++) {
CKBufferInfo *buf = &m->buffers[i];
// Check canary before buffer
uint64_t *before = (uint64_t*)(base + buf->offset - CANARY_SIZE);
if (*before != CANARY_VALUE) {
fprintf(stderr, "UNDERFLOW detected: buffer '%s' at offset %zu\n",
buf->name, buf->offset);
abort();
}
// Check canary after buffer
uint64_t *after = (uint64_t*)(base + buf->offset + buf->size);
if (*after != CANARY_VALUE) {
fprintf(stderr, "OVERFLOW detected: buffer '%s' at offset %zu\n",
buf->name, buf->offset);
abort();
}
}
}
Planned Bounds-Checked Access
Optional bounds checking in debug builds:
#ifdef CK_DEBUG_BOUNDS
#define CK_LOAD(buf, idx, max) ck_load_checked(buf, idx, max, __FILE__, __LINE__)
#define CK_STORE(buf, idx, max, val) ck_store_checked(buf, idx, max, val, __FILE__, __LINE__)
static inline float ck_load_checked(const float *buf, size_t idx, size_t max,
const char *file, int line) {
if (idx >= max) {
fprintf(stderr, "%s:%d: OUT OF BOUNDS READ: idx=%zu, max=%zu\n",
file, line, idx, max);
abort();
}
return buf[idx];
}
static inline void ck_store_checked(float *buf, size_t idx, size_t max, float val,
const char *file, int line) {
if (idx >= max) {
fprintf(stderr, "%s:%d: OUT OF BOUNDS WRITE: idx=%zu, max=%zu\n",
file, line, idx, max);
abort();
}
buf[idx] = val;
}
#else
#define CK_LOAD(buf, idx, max) ((buf)[idx])
#define CK_STORE(buf, idx, max, val) ((buf)[idx] = (val))
#endif
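A kernel written against these macros looks identical in both build modes; a small sketch (the release-mode macro definitions are repeated so the example stands alone, and `scale_buffer` is a hypothetical kernel, not part of the engine):

```c
#include <assert.h>
#include <stddef.h>

// Release-mode definitions from the section above; debug builds would
// define CK_DEBUG_BOUNDS and get the checked versions instead.
#ifndef CK_LOAD
#define CK_LOAD(buf, idx, max)       ((buf)[idx])
#define CK_STORE(buf, idx, max, val) ((buf)[idx] = (val))
#endif

// With CK_DEBUG_BOUNDS the indices are validated against n; in release
// builds this compiles to plain array accesses with zero overhead.
static void scale_buffer(float *buf, size_t n, float s)
{
    for (size_t i = 0; i < n; i++)
        CK_STORE(buf, i, n, CK_LOAD(buf, i, n) * s);
}
```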
Planned Verification Script
verify_memory_safety.sh
#!/bin/bash
set -e
echo "=== C-Kernel-Engine Memory Safety Verification ==="

# 1. Static Analysis
echo "[1/5] Running static analysis..."
cppcheck --enable=all --error-exitcode=1 src/kernels/*.c
scan-build -o /tmp/scan-build make clean all

# 2. Verify buffer layout (no overlaps)
echo "[2/5] Verifying buffer layout..."
./build/ck_layout_check --config test/configs/smollm-135m.json

# 3. Build with AddressSanitizer
echo "[3/5] Building with AddressSanitizer..."
make clean
CFLAGS="-fsanitize=address -g -O1" make all

# 4. Run tests with ASAN
echo "[4/5] Running tests with AddressSanitizer..."
./build/test_all

# 5. Valgrind check
echo "[5/5] Running Valgrind..."
valgrind --leak-check=full --error-exitcode=1 ./build/test_orchestration

echo "=== All memory safety checks passed ==="
Planned Layout Verification
Startup-time verification that buffers don't overlap:
int ck_verify_layout(const TransformerModel *m) {
// Sort buffers by offset
CKBufferRange ranges[MAX_BUFFERS];
int n = collect_buffer_ranges(m, ranges);
qsort(ranges, n, sizeof(CKBufferRange), compare_by_offset);
// Check for overlaps
for (int i = 0; i < n - 1; i++) {
size_t end_i = ranges[i].offset + ranges[i].size;
size_t start_next = ranges[i + 1].offset;
if (end_i > start_next) {
fprintf(stderr, "OVERLAP: '%s' [%zu-%zu] overlaps '%s' [%zu-%zu]\n",
ranges[i].name, ranges[i].offset, end_i,
ranges[i+1].name, start_next,
ranges[i+1].offset + ranges[i+1].size);
return -1;
}
}
// Verify total doesn't exceed allocation (guard against n == 0)
size_t last_end = (n > 0) ? ranges[n - 1].offset + ranges[n - 1].size : 0;
if (last_end > m->total_bytes) {
fprintf(stderr, "OVERFLOW: buffers extend to %zu but only %zu allocated\n",
last_end, m->total_bytes);
return -1;
}
return 0; // Layout is valid
}
Rust Comparison
Rust prevents memory errors at compile time. We prevent them through architectural constraints + runtime verification.
| Aspect | Rust | C-Kernel-Engine |
|---|---|---|
| Safety guarantee | Compile-time (borrow checker) | Design-time + runtime verification |
| Dynamic allocation | Safe via ownership | Eliminated entirely (bump allocator) |
| Buffer bounds | Checked by default | Sizes baked in at codegen + optional runtime checks |
| Post-compile tampering | Vulnerable (binary can be modified) | Vulnerable (same) |
| Auditability | Requires Rust expertise | Simple C, one memory block, predictable layout |
Guiding Principles
1. Don't Touch Generated Code
The generated model.c comes from tested kernels. Manual edits introduce risk.
2. Trust the Config
All sizes derived from config.json. If config is valid, layout is valid.
3. Verify, Don't Assume
Canaries, bounds checks, and layout verification catch errors early.