Scaling Philosophy
The Bet Behind This Project
The bet here is simple: AI will not stay locked inside premium proprietary boxes forever. C-Kernel-Engine takes a different path: Linux-only, CPU-only, server-grade hardware, open software, and standard data-center parts. If the software gets good enough, ordinary servers become practical AI machines.
This is not a claim that CPUs always beat GPU systems on raw peak throughput. That is not the point. The real question is whether CPU-only Linux systems can become good enough, cheap enough, and accessible enough to handle serious inference and eventually serious training. That is the thesis being tested here.
The early numbers already show that CPU-only inference is practical. A quantized 0.6B model on a 12th-gen Intel Alder Lake machine has reached about 100 tokens/sec when the system is otherwise idle. On an older 4-core machine, the same model still runs at about 20–25 tokens/sec. No GPU. No CUDA. No special hardware. Just common x86 instructions and Linux.
The bigger target is server-grade CPU infrastructure. Testing is underway on 5th-gen Intel Xeon Scalable systems with AVX-512 and AMX — the same class of machines already sitting in real data centers. The bet is that these machines will be cheaper, easier to procure, easier to operate, and more broadly deployable than proprietary accelerator-heavy stacks.
The method is straightforward: profile, find the real bottleneck, fix one kernel at a time, and measure again.
C-Kernel-Engine uses VTune, FlameGraph, Intel Advisor, perf stat, and roofline analysis to do exactly that.
The point of this page is not hype. It is to state the engineering bet clearly and then earn it with measurements.
Engineering Compass — What This Page Is Really For
This page is the reminder to stay focused on the real thesis: make CPU-only Linux systems useful for modern AI by improving the software until commodity server hardware becomes practical. The scaling story is a direction, not a marketing slogan.
The engineering discipline is simple and repeatable:
- Find the slowest-moving part. Profile it. Understand why it's slow.
- Find the fastest-moving part. Understand what makes it fast. Replicate the pattern.
- Get more RAM, more cores. Test on bigger hardware. See if the architecture holds.
- Keep profiling. VTune, perf stat, roofline, FlameGraph — every run, every change.
- Fix one kernel at a time. Don't boil the ocean. One bottleneck, one fix, one measurement.
People will disagree with the thesis, and that is fine. The only useful answer is better measurement, better kernels, and clearer system design. Follow the data, not the hype.
The Two First Principles
1) 0 × ∞ = 0: If the model doesn't fit in memory, FLOPS don't matter.
2) Theory of Constraints: At the Ethernet boundary, CPUs and GPUs face the same bottleneck.
Adding CPU servers still requires sharding, scheduling, networking, and NUMA discipline. But it avoids vendor-specific programming models, specialized accelerator fabrics, and the assumption that serious AI must depend on external accelerators from the start. The bet here is that standard Linux servers on Ethernet remain a simpler operational path for many real deployments.
The Computation Is Not Exotic
AI training and inference reduce to five operations: matrix multiply, attention, softmax, layer normalization, and backpropagation. These are linear algebra and calculus — mathematics developed in the 17th through 19th centuries, long before the first computer. Nothing about the computation requires physically exotic hardware. A CPU with SIMD instructions executes every one of these operations natively. If the math isn't exotic, the hardware requirement isn't permanent — it's a market condition. And market conditions change.
In 1995, "You can't run serious workloads on cheap PCs" was conventional wisdom.
In 2025, "You can't run serious AI on cheap CPUs" is conventional wisdom.
One of these beliefs aged very poorly. The pattern suggests which way this one goes.
Historical reference
| Era | Proprietary Incumbent | Commodity Disruptor | Result |
|---|---|---|---|
| 1990s | SPARC, Alpha, PA-RISC | x86 commodity chips | Proprietary RISC faded |
| 1998 | Sun/SGI servers ($500K+) | x86 PCs + MapReduce/GFS | Sun bankrupt (2010) |
| 2009 | Teradata, Netezza ($1M+) | Hadoop on commodity clusters | Big data democratized |
| Now | GPU clusters ($M+) | CPU clusters + software | → ? |
Principle 1: The Cost of 0 × ∞ = 0
The GPU Memory Trap
Yes, you CAN fit a 70B model on GPUs using tensor/pipeline parallelism. That's not the point.
The point is: you're now FORCED to buy 8+ GPUs in a cluster.
And GPUs are:
- Proprietary — locked to NVIDIA (CUDA), no open ecosystem
- Export-controlled — H100/H200 blocked in many countries, supply constrained
- Cluster-required — single GPUs can't handle large models, need NVLink infrastructure
- Expensive — $40K+ per GPU, $200K+ for NVLink switches
The Math That Matters
GPU Path:
Model doesn't fit in 80GB → Buy 8 GPUs → $320K for GPUs alone
Need NVLink for fast communication → Another $50K+
Need DGX chassis → Another $80K+
Total: $450K+ just to START
CPU Path:
Model fits in 4TB RAM → Buy 1-2 servers → $30K each
Standard Ethernet networking β $2K
Total: $60K and you're running
The "0 × ∞ = 0" constraint is what forces GPU deployments into expensive multi-GPU setups. A large-memory CPU server sidesteps the problem entirely.
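The fit check that "0 × ∞ = 0" describes is a single comparison. A minimal sketch using the figures above (Python here is purely illustrative, not project code):

```python
GB = 1024**3  # bytes per GiB

def fits(params: int, bytes_per_param: float, capacity_bytes: int) -> bool:
    """0 x infinity = 0: if the weights exceed capacity, FLOPS are irrelevant."""
    return params * bytes_per_param <= capacity_bytes

# 70B parameters in FP16 = 140 GB of weights (as in the math above)
print(fits(70_000_000_000, 2, 80 * GB))    # one 80GB GPU   -> False
print(fits(70_000_000_000, 2, 4096 * GB))  # one 4TB server -> True
```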
Target Platform
Server-Grade Hardware by Instruction Set
C-Kernel-Engine uses ck_features.h for feature detection. We target by SIMD capability, not CPU model:
Instruction Set Priority
- AMX - tiled matrix ops (BF16/INT8) on 1KB tile registers (Intel Sapphire Rapids+)
- AVX-512 - 512-bit vector (Intel Skylake-X+, AMD Zen 4)
- AVX2+FMA - 256-bit with FMA (Intel Haswell+, AMD Zen 2+)
- AVX - 256-bit vector (Intel Sandy Bridge+, AMD Zen 1)
- NEON - 128-bit (ARM64, Apple Silicon)
Auto-detection: The engine selects the best kernel at build time with runtime dispatch for optional extensions.
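The selection logic described above can be sketched as a priority walk over detected features. The feature names and function below are illustrative stand-ins, not the actual ck_features.h API:

```python
# Best-first order, mirroring the instruction-set priority list above.
KERNEL_PRIORITY = ["amx", "avx512", "avx2_fma", "avx", "neon"]

def select_kernel(detected: set) -> str:
    """Return the best kernel variant the CPU supports, else portable scalar."""
    for feature in KERNEL_PRIORITY:
        if feature in detected:
            return feature
    return "scalar"

print(select_kernel({"avx", "avx2_fma"}))  # AVX2+FMA beats plain AVX
print(select_kernel(set()))                # nothing detected -> scalar fallback
```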
CPU Requirements
- High core count - 64-128+ cores per socket
- Large L3 cache - Good core-to-cache ratio (1-2MB/core)
- Vector width - 256-bit minimum (AVX)
- FMA - Recommended for 2x throughput
- Multiple sockets - NUMA-aware memory access
Memory Requirements
- DDR5 - Higher bandwidth, lower latency
- Multi-channel - 8-12 channels per socket
- Large capacity - 512GB - 2TB+ per node
- ECC - Error correction for reliability
- NUMA-local - Pin threads to local memory
Accelerators
- Intel DSA - Data Streaming Accelerator for memory copies
- Intel IAA - Analytics Accelerator for compression
- Intel QAT - QuickAssist for crypto (if needed)
- CXL - Memory expansion and pooling (future)
Networking
- RDMA - InfiniBand or RoCEv2
- 100-400 Gbps - High bandwidth interconnect
- Low latency - 1-2 μs for RDMA operations
- Kernel bypass - Zero-copy transfers
Operating System
Linux-only. We use Linux-specific features:
- mmap() with MAP_HUGETLB for huge pages
- madvise(MADV_HUGEPAGE) for transparent huge pages
- numactl / set_mempolicy() for NUMA binding
- sched_setaffinity() for core pinning
- perf for profiling
- io_uring for async I/O (weight loading)
- Intel DSA via libaccel-config / idxd driver
C-Kernel-Engine targets by instruction set capability, not CPU model. Any server-grade CPU with AVX2+FMA or better is a valid target β specific models change, the instruction sets don't. See include/ck_features.h for detection logic.
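Several of the interfaces listed above are reachable from any Linux process. A small sketch of core pinning via sched_setaffinity(), which Python's os module exposes on Linux (illustrative, not project code):

```python
import os

# Linux-only: query the set of cores this process may run on ...
allowed = os.sched_getaffinity(0)           # 0 = the calling process
print(f"runnable on {len(allowed)} cores")

# ... then pin to a single core, as a NUMA/cache-locality experiment would.
os.sched_setaffinity(0, {min(allowed)})
print(sorted(os.sched_getaffinity(0)))      # now a single-core mask

os.sched_setaffinity(0, allowed)            # restore the original mask
```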
Why CPU-Only?
GPUs dominate when you can keep them highly utilized. Large batches, dense GEMMs, and well-packed workloads that fit comfortably in VRAM let GPUs exercise their theoretical FLOPS advantage. C-Kernel-Engine isn't anti-GPU — we're anti-waste: wasted money on unused capacity, wasted energy at low utilization, and wasted coordination overhead at scale.
Advantages
- No vendor lock-in - Works on any x86/ARM CPU
- Commodity hardware - Standard servers, not $40K GPUs
- Larger memory - 2TB RAM per node, no 80GB VRAM limit
- Better debugging - GDB, Valgrind, perf all work
- Simpler deployment - No CUDA, no driver hell
- Open ecosystem - GCC, Linux, standard tools
The Trade-off
- GPUs have higher peak FLOPS
- But: memory bandwidth often bottlenecks anyway
- But: PCIe transfer overhead for large models
- But: multi-GPU coordination is complex
- But: CPU memory is 10-100x larger and cheaper
For inference: CPUs are often faster for batch=1
For training: Scale horizontally with RDMA
The Fundamental Math: 0 × ∞ = 0
It doesn't matter how fast your compute is if your model won't fit in memory. Being 10x faster at compute doesn't help when the fit term is zero: 0 × ∞ = 0.
CPU: Memory Wins
- Dual-socket server: 4-6TB DDR5
- Can train: 1TB model in BF16
- Math: 4-6TB capacity ≥ 1TB of weights → non-zero fit, the model loads
- Result: Actually trains the model
GPU: Compute Fast, Memory Fails
- Single GPU VRAM: tops out well below what large models need
- Each GPU generation: more HBM — at exponentially higher price per unit
- The constraint: 1TB model ÷ per-GPU VRAM = many GPUs, minimum, just for weights
- Math: 0 utility per GPU (model won't fit alone) × fast FLOPS = 0 — compute speed doesn't solve a memory problem
- Result: You buy more GPUs. Cost compounds. Complexity compounds. Memory is still the bottleneck.
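The division above, made concrete with illustrative numbers (1TB of weights against 80GB of VRAM per GPU):

```python
import math

weights_gb = 1000        # ~1TB of weights (e.g. ~500B params in FP16)
vram_gb = 80             # per-GPU VRAM
server_ram_gb = 4096     # one 4TB dual-socket server

min_gpus = math.ceil(weights_gb / vram_gb)   # GPUs needed just to hold weights
print(min_gpus)                              # -> 13, before any KV cache or activations
print(weights_gb <= server_ram_gb)           # -> True: the whole model fits in one box
```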
The GPU Cluster Reality
GPUs Require Clusters
Here's the fundamental problem: no single GPU can handle large models. You need a cluster.
- Even a flagship 80GB GPU can't fit a 70B+ model in FP16 (140GB of weights)
- Multi-GPU needed: 8-32 GPUs for practical workloads
- NVLink required: $200K+ in interconnects for fast GPU-to-GPU communication
- DGX systems: Pre-configured clusters start at $250K+
Every GPU hits the same VRAM wall. Whether 24GB or 80GB per GPU, large models require massive GPU counts in coordinated clusters. This is the fundamental constraint that CPU-only architecture bypasses entirely.
Energy Efficiency: The CPU Advantage at Realistic Utilization
This is the final argument GPU advocates use. But the math changes dramatically when we look at realistic utilization, not theoretical peak FLOPS.
The Utilization Problem
GPU efficiency claims assume 100% compute utilization. Real inference doesn't work that way:
- Batch=1 latency: Most production inference is single-request
- Memory-bound: KV cache and weight loading dominate
- Token-to-token: 90%+ of time is waiting for next token
- I/O bound: Network, disk, and tokenization overhead
The dirty secret: GPUs spend most of their time idle, still drawing full power.
The Idle Power Reality
GPU: Always Hungry
- High-end GPU at idle: ~150W (just sitting there)
- High-end GPU at compute: ~700W
- PCIe overhead: +50W for data transfer
- VRAM stays powered: Weights must remain loaded
Real-world: If your GPU computes only 20% of the time, it still burns ~150W for the other 80% — idle power paid for doing nothing, with the weights pinned in VRAM the whole time.
CPU: Scales Down
- Dual Xeon at idle: ~100-150W (bare OS, minimal load)
- Dual Xeon at compute: ~800-1000W (full load)
- DVFS: Scales from 0.8GHz to 3.5GHz dynamically
- C-states: Deep sleep cores when waiting for I/O
Real-world: Enterprise server with 2TB RAM typically draws 200-400W average for inference workloads.
Power-Per-Token Analysis
| Scenario | GPU Power | CPU Power | Winner |
|---|---|---|---|
| Theoretical peak FLOPS | 700W / 989 TFLOPS = 0.71 W/TFLOPS | 1000W / 6 TFLOPS = 167 W/TFLOPS | GPU (theoretical) |
| Memory-bound (typical inference) | 700W (can't scale down) | 200-400W (scales with load) | CPU (2-3.5x less) |
| Batch=1, high I/O wait | 300W average (60% idle) | 150-200W average (70% idle) | CPU (1.5-2x less) |
| Multi-tenant (6 models) | 6 × 700W = 4,200W (all active) | 800-1000W (all on one server) | CPU (4-5x less) |
The Utilization Math
GPU Cluster (6× high-end GPUs) for 6-department enterprise:
6 departments × 1 GPU each × 700W = 4,200W continuous
Even when only 1-2 departments are active.
Plus: $240,000+ in hardware, NVLink complexity.
CPU (1× Dual Xeon Platinum) for 6-department enterprise:
All 6 models resident in 2TB RAM = ~1000W max
Each department waits its turn = efficient time-sharing
Scales power with actual compute load (not fixed at max)
Net difference: 4-5x less power, 10x lower hardware cost
Watts Per Token: The Real Numbers
Enterprise Deployment Comparison
Scenario: 6 models, 24/7 operation, mixed workload
GPU Cluster (6× high-end GPUs):
Idle power: 6 × 150W = 900W
Compute power: 6 × 700W = 4,200W (when all busy)
Average (typical 20% compute): ~1,500W
Power/24hr: 36 kWh
Power/year: 13,140 kWh
@ $0.10/kWh: $1,314/year
CPU Server (1× Dual Xeon Platinum, 2TB RAM):
Idle power: ~150W (bare OS, all models in RAM)
Compute power: ~1000W (all models active)
Average (typical 20% compute): ~320W
Power/24hr: 7.7 kWh
Power/year: 2,800 kWh
@ $0.10/kWh: $280/year
Net difference: 4-5x less power = ~$1,000+/year savings
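The averages above follow from a simple duty-cycle model: average draw = utilization × active power + (1 − utilization) × idle power. A sketch using the same figures (the text rounds 1,560W to ~1,500W):

```python
def avg_power_w(idle_w: float, active_w: float, util: float) -> float:
    """Time-weighted average draw for a duty-cycled workload."""
    return util * active_w + (1 - util) * idle_w

gpu_avg = avg_power_w(idle_w=900, active_w=4200, util=0.20)  # 6-GPU cluster
cpu_avg = avg_power_w(idle_w=150, active_w=1000, util=0.20)  # one CPU server
print(gpu_avg, cpu_avg)                    # 1560.0 320.0

print(round(cpu_avg * 24 / 1000, 1))       # kWh per day
print(round(cpu_avg * 24 * 365 / 1000))    # kWh per year (~$280 at $0.10/kWh)
```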
Carbon Footprint: Real-World Impact
CPU Advantage
- 4-5x less electricity for multi-tenant inference
- No accelerator manufacturing footprint - no additional leading-edge GPU silicon to fab and ship
- Lower cooling due to lower heat output
- Uses existing infrastructure - no new hardware needed
- 10x lower hardware cost ($60K vs $600K+)
When GPUs Make Sense
- Training large models (100B+) at 100% utilization
- Very high throughput with batching
- Research where peak FLOPS matter more than efficiency
GPU efficiency claims assume 100% compute utilization. Real inference workloads are typically 10-30% compute-bound. At realistic utilization, CPUs consume 3-5x less power for multi-tenant inference. Plus: 10x lower hardware cost. This isn't theory - it's duty-cycle arithmetic: what matters is how often your hardware is actually doing work.
The Hidden Cost: Power Delivery and Signal Integrity
GPU marketing quotes peak TFLOPS. What they don't mention is the electrical engineering nightmare required to actually deliver those peaks.
The GPU Power Profile: Burst-Idle-Burst
GPUs don't draw steady power. They spike to peak compute (hundreds of watts), then drop when waiting for data transfer, then spike again. This burst-idle-burst pattern creates massive di/dt (rate of current change) that cascades into real electrical engineering problems:
- di/dt spikes — Rapid current transitions from idle to peak compute stress every component in the power delivery path
- Signal reflections — High-speed switching creates signal integrity issues on PCB traces and interconnects
- Crosstalk — Adjacent high-speed signal lines interfere with each other at GPU clock speeds
- Ground bounce — Simultaneous switching of thousands of CUDA cores causes ground plane voltage fluctuation
- Power supply design — PSUs must handle massive transient spikes, requiring expensive voltage regulation and capacitor banks
Result: Data centers running GPU clusters need specialized power infrastructure — substations, high-capacity PDUs, and overprovisioned power delivery — to handle these peak bursts that occur for fractions of a second.
GPU: Spiky, Unpredictable Power
Power draw over time:
  700W      150W      700W      150W      700W      150W
  compute   wait      compute   wait      compute   wait
Peak-to-idle ratio: ~5:1
di/dt: Extreme
Infrastructure: Substation-grade power delivery
CPU: Steady, Predictable Power
Power draw over time:
200-400W consistent draw
Peak-to-idle ratio: ~2:1
di/dt: Minimal (DVFS transitions are gradual)
Infrastructure: Standard data center power
Theory of Constraints Applied to Power
The fastest moving part and the slowest moving part of a system should be as close together as possible. GPUs violate this principle at the electrical level:
- Fastest part: GPU peak compute at 700W+ burst
- Slowest part: Data transfer at 150W idle
- Gap: 5:1 ratio — massive mismatch that the power infrastructure must absorb
CPUs don't have GPU-level peak FLOPS, but they also don't need substation-grade power infrastructure to handle those peaks. The power draw is consistent and predictable. Standard, well-designed data center power delivery handles it without complication. No specialized substations. No overprovisioned PDUs. No capacitor banks for transient spikes.
The peak FLOPS that GPUs advertise are real — but the cost of actually delivering that power is hidden from every benchmark and every marketing slide.
Principle 2: The Ethernet Equalizer
NVLink is 900 GB/s within a single node. But you can only fit 8 GPUs per node. Go beyond that? You hit Ethernet. And at the Ethernet boundary, CPUs and GPUs face the exact same constraint.
The Bandwidth Reality
Let's be precise about the numbers:
| Connection Type | Bandwidth | Where It Applies | Scales To |
|---|---|---|---|
| NVLink 4.0 | 900 GB/s | GPU-to-GPU within 1 node | 8 GPUs max |
| DDR5 (12-channel) | 460 GB/s | CPU-to-RAM within 1 socket | Per socket |
| 400GbE Ethernet | 50 GB/s | Node-to-node | Infinite nodes |
| 100GbE Ethernet | 12.5 GB/s | Node-to-node | Infinite nodes |
| InfiniBand HDR | 25 GB/s | Node-to-node | Thousands of nodes |
The Theory of Constraints Applied
GPU Cluster at Scale:
Within node: NVLink 900 GB/s (fast!)
Between nodes: Ethernet 50 GB/s (constraint!)
System speed = 50 GB/s (bottleneck)
CPU Cluster at Scale:
Within node: DDR5 460 GB/s
Between nodes: Ethernet 50 GB/s (same constraint!)
System speed = 50 GB/s (same bottleneck)
AT SCALE, THEY HIT THE SAME WALL.
If you can't afford NVLink switches ($200K+), your GPUs communicate over PCIe → Ethernet. That's 12.5-50 GB/s. A server-grade CPU with DDR5 has 460 GB/s to its own memory. For memory-bound workloads, the CPU wins.
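The Theory of Constraints claim is just a min() over the links a byte must traverse. A sketch using the bandwidths from the table above (GB/s):

```python
def system_bw_gbps(path: list) -> float:
    """A chain of links moves data no faster than its slowest link."""
    return min(path)

# Cross-node traffic at cluster scale, per the table above:
gpu_path = system_bw_gbps([900, 50])   # NVLink inside the node, 400GbE between
cpu_path = system_bw_gbps([460, 50])   # DDR5 inside the node, 400GbE between
print(gpu_path, cpu_path)              # 50 50 -> the same Ethernet wall
```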
The Compute-to-Bandwidth Chasm
Every system has a fastest thing (compute) and a slowest thing (cross-node data movement). C-Kernel-Engine's entire goal is to bring these two into sync. On CPUs, they're close enough that software can bridge the gap. On GPUs, they're orders of magnitude apart — no software fixes that.
On CPUs, the gap between the fastest thing the system can do (compute) and the slowest thing it must do (move data across the network) is close. Peak FLOPS and Ethernet bandwidth live in the same neighborhood. That means the remaining optimization work is pure engineering — tiling, prefetching, computation-communication overlap — real techniques that bring compute and data movement closer to sync. That's what C-Kernel-Engine is built to do.
On GPUs, the fastest and slowest are worlds apart. GPU peak compute is orders of magnitude faster than the Ethernet pipe that feeds it at cluster scale. It's the difference between the summit of Mt. Everest and the floor of the Mariana Trench. Most of that compute sits permanently idle, burning power, waiting for data that will never arrive fast enough. No amount of software engineering changes the physics.
CPU: Rolling Hills — Bridgeable
- Fastest (compute) and slowest (network) are close
- Local memory bandwidth sits in between — a smooth gradient, not a cliff
- Software can bridge the remaining gap: tiling, prefetch, overlap
- Every hardware generation makes the gap smaller (more bandwidth, same physics)
The terrain is gentle enough to walk. C-Kernel-Engine's job is to build the bridge: bring compute throughput and data movement into sync through aggressive kernel engineering.
GPU: Everest to Mariana — Unbridgeable
- Fastest (compute) and slowest (network) are orders of magnitude apart
- Intra-node interconnects are fast, but only reach a handful of GPUs
- At cluster scale, everything hits the same Ethernet wall
- Most compute capacity sits permanently idle, starved for data
The terrain is a cliff face. No bridge spans from the peak of Everest to the bottom of the Mariana Trench. The gap is structural — physics, not engineering.
The goal is straightforward: bring the fastest-moving thing and the slowest-moving thing as close to sync as possible. On CPUs, they're already neighbors — the remaining work is tiling, cache management, prefetching, and overlapping computation with communication. That is a solvable engineering problem, and it's exactly what this project is built to solve. On GPUs at cluster scale, the gap between compute and data movement is structural. No kernel optimization closes it.
Designing Your Ethernet Network
Ethernet Switch Topology for ML Clusters
Since Ethernet is the equalizer at scale, designing it well is critical.
Leaf-Spine Architecture
                ┌─────────┐           ┌─────────┐
                │ Spine 1 │           │ Spine 2 │
                │ 400GbE  │           │ 400GbE  │
                └────┬────┘           └────┬────┘
                     │                     │
      ───────────────┴─────────────────────┴───────────────
          (full mesh: every leaf uplinks to every spine)
┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│ Leaf 1  │    │ Leaf 2  │    │ Leaf 3  │    │ Leaf 4  │
│ 100GbE  │    │ 100GbE  │    │ 100GbE  │    │ 100GbE  │
└────┬────┘    └────┬────┘    └────┬────┘    └────┬────┘
     │              │              │              │
┌────┴────┐    ┌────┴────┐    ┌────┴────┐    ┌────┴────┐
│Server 1 │    │Server 3 │    │Server 5 │    │Server 7 │
│Server 2 │    │Server 4 │    │Server 6 │    │Server 8 │
└─────────┘    └─────────┘    └─────────┘    └─────────┘
Switch Sizing Calculator
Bandwidth Requirements
For distributed training with data parallelism:
Gradient size = Model parameters Γ bytes per param
70B model in FP16 = 70B × 2 bytes = 140 GB
All-reduce bandwidth needed:
Per iteration: 2 × gradient size (reduce-scatter + all-gather)
70B model: 2 × 140 GB = 280 GB per iteration
With 100GbE (12.5 GB/s):
All-reduce time = 280 GB ÷ 12.5 GB/s = 22.4 seconds
With 400GbE (50 GB/s):
All-reduce time = 280 GB ÷ 50 GB/s = 5.6 seconds
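The arithmetic above as a reusable helper, using the same 2× gradient-size approximation for an all-reduce (reduce-scatter plus all-gather):

```python
def allreduce_seconds(params: float, bytes_per_param: float, link_gb_per_s: float) -> float:
    """Approximate all-reduce time: ~2x the gradient size crosses the wire."""
    gradient_gb = params * bytes_per_param / 1e9
    return 2 * gradient_gb / link_gb_per_s

print(round(allreduce_seconds(70e9, 2, 12.5), 1))  # 100GbE -> 22.4 s
print(round(allreduce_seconds(70e9, 2, 50.0), 1))  # 400GbE -> 5.6 s
```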
Small Cluster (8-16 servers)
- Topology: Single 400GbE switch
- Switch: Arista 7060X5 or similar
- Ports: 32× 400GbE
- Cost: ~$30,000
- Bisection BW: 12.8 Tbps
Medium Cluster (32-64 servers)
- Topology: Leaf-spine (2 spine, 4 leaf)
- Spine: 2× 400GbE switches
- Leaf: 4× 100GbE switches
- Cost: ~$120,000
- Bisection BW: 25.6 Tbps
Large Cluster (100+ servers)
- Topology: 3-tier Clos or fat-tree
- Consider: InfiniBand for lower latency
- RDMA: RoCEv2 over Ethernet or native IB
- Cost: $500K-2M depending on scale
Key insight: A GPU cluster at this scale needs the SAME network infrastructure. The Ethernet cost is equal. But CPUs don't need the $2M+ in NVLink switches.
RDMA Configuration
RoCEv2 Setup
# Enable RDMA over Converged Ethernet v2
# On each server with Mellanox/NVIDIA ConnectX NICs:
# 1. Enable PFC (Priority Flow Control) on switch
# Required for lossless Ethernet
# 2. Configure ECN (Explicit Congestion Notification)
mlnx_qos -i eth0 --pfc 0,0,0,1,0,0,0,0 # Enable PFC on priority 3
mlnx_qos -i eth0 --trust=dscp
# 3. Set up RDMA
modprobe rdma_ucm
modprobe rdma_cm
# 4. Verify RDMA is working
ibv_devinfo
rdma link show
Scale Economics: The Real Comparison
Why CPU-Only Wins at Scale
GPU Cluster at Scale
- For a 1TB model: 100+ GPUs minimum, just to hold the weights
- Total cost: $6M+ by this page's own figures (~13 fully equipped 8-GPU nodes at $450K+ each)
- Plus: NVLink fabric, specialized power delivery, proprietary procurement
- Result: access limited to well-funded organizations
CPU Cluster at Scale
- For the same model: commodity servers at ~$30K each; even an 80-server cluster is ~$2.5M with switching
- Standard Ethernet, standard parts, standard ops
- Result: accessible to far more teams
The Economics: A Fraction of the Cost, Far Broader Access
By the per-unit numbers used throughout this page, the CPU path costs a fraction of the GPU path and is built entirely from parts any company can already procure. That moves large-scale ML from a handful of GPU-rich labs toward any organization that can buy standard servers.
The Hybrid Trap: CPU+GPU = CPU-Bound
"Why not use CPU for memory and GPU for compute? Get the best of both!"
Answer: the combined system then runs at the speed of the CPU-to-GPU transfer path anyway. You get the complexity of both with the performance of neither.
The Hybrid Bottleneck
When CPU holds weights and transfers to GPU for compute:
- Transfer bottleneck: the CPU-to-GPU link (PCIe, tens of GB/s) caps throughput
- GPU idle time: Waits for data from CPU
- You get: GPU performance = CPU performance
- Plus: Double the code complexity, double the cost
Conclusion: If the GPU is limited by the CPU anyway, just use CPUs! Simpler, faster, cheaper.
The GPU Workaround Stack: "Innovations" That Are Actually Patches
The entire field has been optimizing around a hardware constraint and mistaking the workarounds for progress. Every major "breakthrough" in LLM architecture is actually compensating for GPU memory limitations.
Every "Innovation" Maps to a GPU Constraint
| "Innovation" | What It Actually Does | The GPU Constraint It Patches | Needed on CPU with 2-4TB RAM? |
|---|---|---|---|
| GQA (Grouped Query Attention) | Shares KV heads across query heads | KV cache blows up GPU VRAM | No — KV cache fits |
| MoE (Mixture of Experts) | Activates sparse subset of parameters | Dense model won't fit on one GPU | No — dense model fits |
| KV Caching | Stores past attention keys/values — literally a key-value database in every layer | GPU VRAM limits cache size, forces eviction strategies | CPU home turf — this is literally how databases work. CPUs have been running key-value stores for decades. |
| Gradient Checkpointing | Recomputes activations instead of storing them | Training activations don't fit in GPU VRAM | No — store everything |
| Tensor Parallelism | Shards weight matrices across GPUs | Single GPU can't hold the full matrix | No — full matrix fits in RAM |
| Pipeline Parallelism | Distributes layers across GPUs | All layers don't fit on one GPU | No — all layers fit |
| Flash Attention | Online softmax with tiled computation — streams through attention in blocks | Full attention matrix doesn't fit in GPU SRAM/VRAM | Brilliant for CPU — tiling maps naturally to CPU cache hierarchies. CPUs process data in cache-line-sized tiles inherently. |
| Quantization Research | Compresses model weights (Q4, Q8, etc.) | Model doesn't fit in GPU VRAM at full precision | Optional — use for bandwidth, not capacity |
On a CPU with 2-4TB RAM, half of these become unnecessary, and the other half become simpler. The entire research direction has been shaped by GPU limitations, and people have confused "optimizations forced by GPU memory walls" with "fundamental advances in model architecture."
The Sequential Reality: GPUs Were Never Designed for This
Transformers Are Sequential. Period.
A transformer forward pass is:
Layer 1 → Layer 2 → Layer 3 → ... → Layer 80
You cannot compute layer 10 without the output of layer 9. That is the definition of sequential dependency. There is no debate here — it's mathematical fact.
What people mean when they say "LLMs are parallel" is that within a single layer, the matrix multiplication can be parallelized across rows and columns. But that's not the model being parallel — that's a single operation within a sequential pipeline being decomposable. Every CPU has been decomposing matmuls across SIMD lanes and cores for decades.
Amdahl's Law: Why Parallel Hardware Still Hits a Ceiling
Strong-Scaling Limit
Amdahl's Law is the simplest way to say what the sequential transformer argument implies in practice:
speedup(N) = 1 / (S + (1 - S) / N)
Where S is the fraction of work that stays effectively serial or synchronization-bound. Even with infinite parallel hardware, the maximum speedup is still:
max_speedup = 1 / S
| Serial / sync fraction | Theoretical max speedup | What it means |
|---|---|---|
| 10% | 10× | A surprisingly hard wall for giant clusters |
| 5% | 20× | Still not "infinite scaling" |
| 2% | 50× | Requires extraordinary system design |
| 1% | 100× | Already very difficult in real distributed systems |
For LLMs, that serial fraction is not just "the next token depends on the previous token." It also includes layer boundaries, synchronization, collective communication, optimizer steps, routing decisions, and all the places where the system must wait for the slowest participant.
CKE takeaway: do not optimize only for peak FLOPS. Reduce the effective serial fraction, reduce synchronization pressure, and keep compute, memory, and network in the same performance neighborhood.
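The ceiling in the table can be checked numerically with a direct transcription of the two formulas above:

```python
def speedup(s: float, n: float) -> float:
    """Amdahl's Law: speedup(N) = 1 / (S + (1 - S) / N)."""
    return 1.0 / (s + (1.0 - s) / n)

def max_speedup(s: float) -> float:
    """Limit as N -> infinity: 1 / S."""
    return 1.0 / s

for s in (0.10, 0.05, 0.02, 0.01):
    # Even 1024-way hardware lands well under the 1/S ceiling.
    print(f"S={s:.0%}: ceiling {max_speedup(s):.0f}x, 1024-way {speedup(s, 1024):.1f}x")
```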
What GPUs Were Designed For
Independent pixel computation.
Millions of pixels with zero data dependency on each other. Pixel (0,0) doesn't need the result of pixel (1920,1080) to compute its color. That IS embarrassingly parallel.
pixel(0,0)       → compute color → done
pixel(0,1)       → compute color → done
pixel(1,0)       → compute color → done
...
pixel(1920,1080) → compute color → done
(all independent, all simultaneous)
What LLM Inference Actually Is
Sequential token generation on a sequential layer stack.
Token N+1 depends on the attention computation over ALL previous tokens. The autoregressive decode loop is the opposite of pixel independence.
token 1 → 80 layers → token 2
token 2 → 80 layers → token 3
token 3 → 80 layers → token 4
...
(each waits for the previous)
The Vocabulary of Mismatch
The fact that the industry bolted LLM training onto hardware designed for independent pixel shading and then invented an entire vocabulary to work around the mismatch should tell you something is wrong with the foundational assumption:
- Tensor parallelism — because one GPU can't hold the weights
- Pipeline parallelism — because one GPU can't hold the layers
- Model parallelism — because the model doesn't fit
- Gradient checkpointing — because activations don't fit
- Flash Attention — because attention intermediates don't fit
- Activation recomputation — because you ran out of memory
Every one of these is a workaround for the same problem: you're running a sequential, memory-hungry workload on hardware designed for embarrassingly parallel, compute-bound pixel shading.
MoE: The Right Hardware for Dynamic Routing
Native Fit vs. Forced Workaround
Yes, MoE runs on GPUs — frontier models do it at scale. But runs on is not the same as naturally fits. The same argument made against CPUs for AI ("it works, but it's not the right tool") applies equally to GPUs for MoE.
MoE routing is conditional: for each token, a gating function decides which experts activate. GPUs — designed for dense, predictable, lockstep computation — handle this through software workarounds: load-balancing losses, capacity factors, expert dropout, and auxiliary training objectives, just to keep GPU utilization from collapsing under sparse activation patterns.
Consider fixed-function inference chips. They prefer dense computation above all else. The minute you introduce dynamic routing — conditional branching, variable expert selection — they require architectural hacks. GPUs face the same underlying tension, just with more memory headroom to absorb it.
CPUs have handled conditional branching natively since the beginning: branch prediction, out-of-order execution, speculative paths. The hardware was built for exactly this. AVX-512 computes 16 FP32 multiply-accumulates per cycle per core β across 128 cores that's 2048 parallel operations β and the dynamic routing overhead that GPUs must paper over in software is just normal control flow on a CPU.
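What "dynamic routing is just control flow" means in practice: per-token top-k expert selection is an ordinary branchy selection problem. A toy sketch (the gating scores are made up; a real gate is a learned projection):

```python
import heapq

def route_top_k(gate_scores: list, k: int = 2) -> list:
    """Pick the k best-scoring experts for one token - plain control flow,
    the kind of branchy work CPUs have executed natively for decades."""
    return heapq.nlargest(k, range(len(gate_scores)), key=lambda i: gate_scores[i])

# One token's gating scores over 8 experts (toy values):
scores = [0.1, 0.7, 0.05, 0.3, 0.9, 0.02, 0.4, 0.15]
print(route_top_k(scores))   # -> [4, 1]: only these experts' MLPs run for this token
```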
Why MoE All-to-All Makes Topology Matter
Jensen Huang's point about MoE is not that "one expert equals one GPU" in a literal one-expert-per-device sense. Real deployments usually place multiple experts per GPU, or shard very large experts across several GPUs. But the core communication pattern is the same:
tokens on many GPUs
        ↓  route top-k experts
dispatch token states to expert owners (all-to-all)
        ↓
run expert MLPs
        ↓
send expert outputs back / combine (all-to-all again)
That happens at every MoE layer. If the interconnect is a one-hop switched fabric, the communication cost is painful but bounded. If the topology is multi-hop, ring-like, torus-like, or otherwise forces traffic through several devices before reaching the destination, the waiting and queueing add up quickly.
This is where Amdahl's Law shows up in systems form: the expert GEMMs may parallelize beautifully, but the dispatch, synchronization, and combine phases become the part you cannot hide. The more often the model must do all-to-all, the more the communication fraction dominates the ceiling on real speedup.
Training: Batch Size Is Not What You Think
A common claim is that training requires large batch sizes. This is true on GPUs — and that's a GPU problem, not a training requirement. Training does not need batch size greater than 1. There are many ways to simulate larger effective batch sizes without holding multiple sequences in memory simultaneously.
GPU: The Batch Balancing Act
On a GPU, you must balance three competing constraints:
- Model size — weights consume VRAM
- Context length — KV cache consumes VRAM
- Batch size — activations consume VRAM
Increase any one, and you must decrease the others. Want longer context? Reduce batch size. Want larger batch? Reduce context. Want a bigger model? Reduce both.
```
GPU VRAM budget (80GB):
  Model weights: 40GB   (fixed)
  KV cache:      20GB   (varies with context)
  Activations:   20GB   (varies with batch)

Longer context = less batch
Larger batch   = less context
ALWAYS a tradeoff.
```
CPU: No Tradeoff Required
With 2-4TB of RAM, all three fit simultaneously:
- Model size — even 400B+ models fit (800GB+ in FP16)
- Context length — variable, up to 1M+ tokens
- Batch size — whatever you need
You can have variable context length AND batch greater than one at the same time. No balancing act required.
```
CPU RAM budget (4TB = 4096GB):
  Model weights:   810GB  (400B+ FP16)
  KV cache:        500GB  (long context)
  Activations:     200GB  (large batch)
  Remaining:     2,586GB  (room to spare)

Variable context + variable batch
NO tradeoffs.
```
You Don't Need Large Batches — You Can Simulate Them
Training with batch=1 works. The gradient is noisier, but there are well-established techniques to get the benefits of large batches without the memory cost:
| Technique | How It Works | GPU Benefit | CPU Benefit |
|---|---|---|---|
| Gradient Accumulation | Accumulate gradients over N forward passes, update once | Simulates batch=N with batch=1 memory | Same, but can also do actual batch=N |
| Micro-batching | Process small sub-batches, aggregate gradients | Fits in VRAM per micro-batch | Can use larger micro-batches |
| Online Learning | Update after every single example (batch=1) | Works but GPU underutilized | Natural fit for CPU sequential processing |
| Data Parallelism | Each node processes different batch, average gradients | Requires NVLink/IB for gradient sync | RDMA gradient sync, same principle |
Our kernel architecture supports variable context lengths and flexible batch sizes natively. The quantized GEMV/GEMM kernels (Q4_K, Q5_0, Q5_1, Q5_K, Q6_K, Q8_0) are designed from the ground up for CPU-native inference and training — not as ports of GPU code. We're building the evidence that CPU-only training and inference at scale isn't theoretical — it's practical, and the kernel-level work is happening now.
Why CPU-Only is the Future
The Strategic Advantage
Technical Wins
- Memory capacity: 4-6TB per node vs 80GB VRAM
- No transfer bottleneck: CPU memory + CPU compute
- Optimize interconnect: RDMA, core pinning, cache
- Get close to theoretical FLOPS
Market Wins
- 16x cheaper at scale
- 200x more accessible to companies
- Commodity hardware - no special requirements
- Open ecosystem - standard Linux tools
Commodity Economics: The CPU Trajectory
CPUs Follow Commodity Price Curves
Unlike GPUs, which are supply-constrained and vendor-controlled, CPUs are commodity hardware. Every improvement happens automatically, at scale, with competition driving prices down.
| Component | 2023 | 2025 | 2027 (projected) | Trend |
|---|---|---|---|---|
| Memory Standard | DDR5-4800 | DDR5-6400 | DDR6 | +33% bandwidth per gen |
| Channels per Socket | 8 | 12 | 16 | +50% channels |
| Max RAM per Server | 4TB | 8TB | 16TB+ | 2x per generation |
| Cores per Socket | 64 (Genoa) | 128 (Turin) | 192+ | +50% per gen |
| L3 Cache | 256MB | 384MB | 512MB+ | Growing |
| Ethernet Speed | 100GbE | 400GbE | 800GbE | 4x per 3 years |
| $/TFLOP (CPU) | $500 | $300 | $150 | -50% per 2 years |
| $/TFLOP (GPU) | $50 | $40 | $35 | -15% per 2 years |
GPUs will always have higher peak FLOPS per dollar for raw compute. But for memory-bound ML inference (which is most real-world LLM usage), CPUs are already competitive. And the gap closes every year because CPU economics follow commodity curves.
Why CPU Economics Win
- Competition: Intel vs AMD vs ARM
- Volume: Billions of CPUs vs millions of GPUs
- Supply: Multiple fabs, no bottlenecks
- Standards: DDR5 is an industry standard, not proprietary
- Software: Linux, GCC, standard tools (free)
Why GPU Economics Struggle
- Monopoly: NVIDIA dominates (~90% market)
- Supply-constrained: Artificial scarcity
- Proprietary: CUDA lock-in, HBM limited fabs
- High margins: No price competition pressure
- Lead times: 12-24 month waits
Infinite CPU Scaling: The Internet Proof
CPUs Scale to Infinity
Unlike GPUs, which hit cost and availability ceilings, CPUs scale out through distributed computing across ordinary networks. Your computer can be part of that pool.
Want more FLOPS? Add more CPUs + RDMA, and accumulate your way to 100 PFLOPS while GPU clusters stay capped by procurement. Aggregate FLOPS is an accumulation problem that CPUs solve through clever distributed training.
Infinite Scaling
- Internet-scale distributed computing
- No ceiling: Add 10, 100, 1000 CPUs
- Your computer contributes: Every device matters
- Result: Unlimited FLOP accumulation
CPU Improvements Track
- Cores: Increasing year over year
- Cache: Growing bigger
- Efficiency: Getting better
- Cost: Decreasing over time
Memory Scaling
- DDR5 → DDR6: higher memory bandwidth each generation
- Memory slots: 8 β 12 β 16 slots per node
- Total per node: 4TB β 8TB β 16TB RAM
- Cost/GB: Constantly decreasing
Model Efficiency
- Smaller models: Getting more efficient
- Better quantization: Q4, Q8, hybrid methods
- Architecture improvements: mixture of experts (MoE), sparsity
- Training optimization: Better algorithms
The C-Kernel Engine Goal
Exploit this infinite CPU scaling on server-grade CPUs. As models get smaller and CPUs get more powerful, the combination becomes unbeatable:
- No GPU dependency: Commodity hardware only
- Accessible to everyone: Your laptop contributes to the cluster
- Sustainable scaling: Economics favor CPU-only approach
- Future-proof: CPU improvements continue while GPU costs stay high
Multi-Model Parallelism: Enterprise Real Workload
Unlike GPUs, which are designed for single-model throughput, CPUs excel at running multiple models simultaneously on the same system.
Enterprise & Government: The Multi-Model Reality
In government ministries, large enterprises, and multi-department organizations, different teams need to run different models simultaneously. This is where GPUs become a nightmare.
Government Ministry Example
- Health Dept: Medical document analysis model
- Finance Dept: Fraud detection model
- Education Dept: Student assessment model
- Transport Dept: Traffic prediction model
- HR Dept: Resume screening model
- Legal Dept: Contract analysis model
Reality: 6 departments, 6 different models, all need to run simultaneously.
Enterprise Example
- Sales: Lead scoring + CRM assistant
- Support: Customer chatbot + ticket routing
- Engineering: Code review + documentation
- Marketing: Content generation + analytics
- Security: Threat detection + log analysis
- Finance: Expense analysis + forecasting
Reality: Each department has specialized models for their domain.
The GPU Context-Switching Nightmare
GPU Model Switching: Slow and Expensive
GPU Workflow for Multi-Model Serving:
```
1. Load Model A weights into VRAM (80GB)  → 30-60 seconds
2. Run inference for Department A         → fast
3. Unload Model A from VRAM               → 10 seconds
4. Load Model B weights into VRAM         → 30-60 seconds
5. Run inference for Department B         → fast
6. Repeat for each model...
```
Problem: each context switch leaves the GPU sitting idle for 40-70 seconds!
With 6 departments cycling through, most of the time is spent loading and unloading.
GPU Multi-Model Issues
- VRAM limit: 80GB can only hold 1-2 large models
- Context switching: Load/unload weights takes 30-60s per model
- No sharing: GPU can't easily share between containers
- MIG complexity: Multi-Instance GPU splits VRAM, reduces per-model capacity
- Kubernetes pain: GPU scheduling requires special plugins, node affinity
- Cost: Need dedicated GPU per department = $40K × 6 = $240K
CPU Multi-Model Solution
- RAM abundance: 4TB holds ALL models simultaneously
- No context switching: All models resident in RAM, instant access
- Native sharing: Standard Linux process isolation works
- Docker native: Just containers, no special GPU passthrough
- Kubernetes native: Standard pod scheduling, no GPU plugins
- Cost: One $60K server runs all 6 departments
Kubernetes & Docker: CPUs Are First-Class Citizens
The Container Orchestration Reality
GPU Kubernetes Deployment:
- Install NVIDIA device plugin
- Configure GPU node pools
- Set resource limits: nvidia.com/gpu: 1
- Deal with GPU memory fragmentation
- Handle driver version mismatches
- Manage MIG partitioning (if sharing)
- Debug CUDA OOM errors in production
CPU Kubernetes Deployment:
- Just deploy your container
- Set resource limits: cpu: "32", memory: "64Gi"
- Done. Standard k8s scheduling handles the rest.
CPU Multi-Model Architecture
```
┌──────────────────────────────────────────────────────────────┐
│                  Single CPU Server (4TB RAM)                 │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────┐  ┌───────────────┐  ┌──────────────┐       │
│  │ Container 1  │  │ Container 2   │  │ Container 3  │       │
│  │ Health Model │  │ Finance Model │  │ Education    │       │
│  │ 70B (140GB)  │  │ 13B (26GB)    │  │ 7B (14GB)    │       │
│  │ 32 cores     │  │ 16 cores      │  │ 8 cores      │       │
│  └──────────────┘  └───────────────┘  └──────────────┘       │
│                                                              │
│  ┌──────────────┐  ┌───────────────┐  ┌──────────────┐       │
│  │ Container 4  │  │ Container 5   │  │ Container 6  │       │
│  │ Transport    │  │ HR Model      │  │ Legal Model  │       │
│  │ 7B (14GB)    │  │ 3B (6GB)      │  │ 13B (26GB)   │       │
│  │ 8 cores      │  │ 4 cores      │  │ 16 cores     │       │
│  └──────────────┘  └───────────────┘  └──────────────┘       │
│                                                              │
│  Total: 226GB used / 4TB available = 5.6% RAM utilization    │
│  All 6 models running simultaneously, no context switching   │
└──────────────────────────────────────────────────────────────┘
```
Key insight: 6 models totaling 226GB fit easily in 4TB RAM. All running 24/7, no loading/unloading, instant response for any department.
Most enterprise/government deployments need to run 5-20 different models for different use cases. CPUs handle this natively with standard containers. GPUs require expensive, complex multi-GPU setups with painful orchestration. This alone makes CPUs the practical choice for most real-world deployments.
CPU-Only: Complete Solution for All Model Sizes
One Architecture, All Scales
CPU-only isn't just for large models. It's the complete solution for every model size, for both training and inference.
Small Models (1-7B)
- Training: Single CPU server
- Inference: Laptop or edge device
- Latency: Sub-millisecond
- Cost: < $100/month
Use cases: Chatbots, code completion, mobile AI
Medium Models (7-70B)
- Training: 2-8 CPU servers
- Inference: Single CPU server
- Throughput: High batch processing
- Cost: $500-2000/month
Use cases: Enterprise AI, content generation, analysis
Large Models (70B+)
- Training: 10-100 CPU servers
- Inference: 2-20 CPU servers
- Scale: Distributed across RDMA
- Cost: $5000-50000/month
Use cases: Foundation models, research, large-scale analytics
Same CPU-only architecture scales from your laptop (1B model) to global clusters (1T+ parameters). No architecture changes, no GPU lock-in, no complexity multiplication.
Training at Every Scale
- Small: Fine-tune on laptop (1-7B)
- Medium: Train on workstation (7-70B)
- Large: Distributed training (70B+)
- Methodology: Same principles, same tools
Inference at Every Scale
- Small: Edge deployment (1-7B)
- Medium: Server deployment (7-70B)
- Large: Distributed inference (70B+)
- Performance: Optimized for each tier
Why This Matters
Organizations can start small and scale infinitely without ever switching architectures:
- Start: Train small model on laptop (fine-tuning)
- Grow: Move to server as model size increases
- Scale: Add more servers when needed (distributed training)
- Enterprise: Run multiple models on same hardware
No lock-in. No architecture migrations. No GPU dependency. Just scale as you grow.
Memory Reality: What NVIDIA Marketing Won't Tell You
GPU marketing focuses on FLOPS. Real-world LLM inference is dominated by memory bandwidth and capacity. Here are the actual numbers.
Activation Memory Per Token (Decode)
Memory Writes Per Token
For a typical 0.5B parameter model (hidden=896, intermediate=4864, 24 layers):
| Operation | Per Layer | 24 Layers |
|---|---|---|
| RMSNorm output | 3.5 KB | 84 KB |
| Q, K, V projections | 10.5 KB | 252 KB |
| Attention output | 3.5 KB | 84 KB |
| O projection | 3.5 KB | 84 KB |
| MLP gate + up | 38 KB | 912 KB |
| MLP down | 3.5 KB | 84 KB |
| KV cache (new token) | 1 KB | 24 KB |
| Final logits (once) | - | 600 KB |
| Total per token | ~63 KB | ~2.1 MB |
This is memory bandwidth consumed per generated token. Reducing this is key for decode performance.
Context Length: The Memory Multiplier
KV Cache Formula
```
KV cache = 2 × n_layers × n_kv_heads × head_dim × context_length × bytes_per_element
           ↑
           the factor of 2 covers K and V
```
| Context | KV Cache (FP16) | + Model (140GB) | Fits 80GB GPU? | Fits 2TB Server? |
|---|---|---|---|---|
| 8K | 2.6 GB | 143 GB | No (need 2×) | Yes |
| 32K | 10.5 GB | 150 GB | No (need 2×) | Yes |
| 128K | 42 GB | 182 GB | No (need 3×) | Yes |
| 1M | 335 GB | 475 GB | No (need 6×) | Yes |
Show the Math: How We Calculated This
Model Architecture (70B Dense Model)
| Parameter | Value | Explanation |
|---|---|---|
| `n_layers` | 80 | Number of transformer layers |
| `n_attention_heads` | 64 | Query heads per layer |
| `n_kv_heads` | 8 | KV heads (GQA: 8 groups, each serves 8 Q heads) |
| `hidden_dim` | 8192 | Model hidden dimension |
| `head_dim` | 128 | = hidden_dim / n_attention_heads = 8192 / 64 |
Step 1: KV Cache Per Token
```
// FP16 = 2 bytes per element
KV_per_token = 2 × n_layers × n_kv_heads × head_dim × bytes
             = 2 × 80 × 8 × 128 × 2
             = 327,680 bytes
             = 320 KB per token
```
Step 2: Scale by Context Length
| Context | Calculation | KV Cache |
|---|---|---|
| 8K | 8,192 × 320 KB | 2.62 GB |
| 32K | 32,768 × 320 KB | 10.5 GB |
| 128K | 131,072 × 320 KB | 41.9 GB |
| 1M | 1,048,576 × 320 KB | 335 GB |
Step 3: Add Model Weights
```
Model weights (70B params × 2 bytes FP16) = 140 GB

Total memory = model weights + KV cache

8K context:   140 + 2.6  = 143 GB → need 2× 80GB GPUs
32K context:  140 + 10.5 = 150 GB → need 2× 80GB GPUs
128K context: 140 + 42   = 182 GB → need 3× 80GB GPUs
1M context:   140 + 335  = 475 GB → need 6× 80GB GPUs
```
Why GQA Matters
Grouped Query Attention (GQA) reduces KV cache by sharing KV heads across Q heads:
| Attention Type | KV Heads | KV per Token | 128K KV Cache |
|---|---|---|---|
| Multi-Head (MHA) | 64 | 2.56 MB | 335 GB |
| Grouped Query (GQA-8) | 8 | 320 KB | 42 GB |
| Multi-Query (MQA) | 1 | 40 KB | 5.2 GB |
GQA-8 gives 8× memory savings over MHA with minimal quality loss.
70B model + 1M context needs ~475 GB. That's 6× 80GB GPUs ($240K+) vs one 2TB CPU server ($30K). This isn't about FLOPS - it's about where the data can physically exist.
Prefill Memory Scaling
Prefill (processing the input prompt) requires storing activations for ALL tokens simultaneously:
| Prefill Length | Activation Memory | + KV Cache |
|---|---|---|
| 256 tokens | ~400 MB | ~6 MB |
| 1K tokens | ~1.5 GB | ~24 MB |
| 4K tokens | ~6 GB | ~96 MB |
| 16K tokens | ~24 GB | ~384 MB |
Prefill is compute-bound but still needs memory for activations. Long prompts can exceed GPU VRAM.
Visual Guide
Training Advantages: Where CPUs Dominate
Critical Training Scenarios
CPU-only architecture solves training problems that GPU-only simply can't handle.
1. Long Context Training: KV Cache Explosion
The Problem: KV cache grows linearly with context length (and attention compute grows quadratically), so long-context training state balloons fast. At training batch sizes, illustratively:
- 4K context: ~32GB KV cache
- 32K context: ~256GB KV cache
- 128K context: ~1TB KV cache
- 1M context: ~8TB KV cache
CPU Advantage:
- 4-16TB RAM: Fits massive KV caches
- No transfer bottleneck: Everything in memory
- GPU Reality: KV cache doesn't fit = can't train
2. Massive Batch Training
The Need: Large batch sizes improve convergence and throughput
- Standard batch: 32-128 samples
- Large batch: 512-2048 samples
- Massive batch: 8192+ samples
- Enterprise batch: 100K+ samples
CPU Scaling Strategy:
- 32 CPUs: 32 batches in parallel
- Each batch: 1-10TB RAM available
- Total effective batch: 32x larger
- Low communication overhead: gradients synced only at step boundaries
The Training Reality Check
GPU Training Limits
- ✗ KV cache must fit in VRAM → max context ~32K
- ✗ Batch size limited by VRAM
- ✗ Large batches require model parallelism
- ✗ Complex coordination across GPUs
- ✗ Expensive hardware (NVLink required)
CPU Training Advantages
- ✓ KV cache fits in RAM → context up to 1M+
- ✓ Batch size scales with CPUs
- ✓ Data parallelism = simple scaling
- ✓ RDMA for necessary communication
- ✓ Commodity hardware (Ethernet)
Real-world training needs long context (RAG, document analysis) and large batches (throughput, convergence). CPU-only architecture handles both naturally. GPU-only hits walls that require expensive workarounds.
Distributed Architecture
RDMA-Connected CPU Cluster
```
┌──────────────────────────────────────────────────────────────┐
│                    RDMA Fabric (100Gbps+)                    │
└──────┬──────────────────┬──────────────────┬─────────────────┘
       │                  │                  │
       ▼                  ▼                  ▼
 ┌───────────┐      ┌───────────┐      ┌───────────┐
 │  Node 0   │      │  Node 1   │      │  Node 2   │  ...  Node N
 │ 128 cores │      │ 128 cores │      │ 128 cores │
 │  2TB RAM  │      │  2TB RAM  │      │  2TB RAM  │
 │           │      │           │      │           │
 │  Layers   │      │  Layers   │      │  Layers   │
 │   0-15    │      │  16-31    │      │  32-47    │
 └───────────┘      └───────────┘      └───────────┘
```
Parallelism Strategies
1. Pipeline Parallelism
Different layers on different nodes. Activations flow through the pipeline.
```
Node 0: Layers 0-15  → activations →  Node 1: Layers 16-31  → ...
        (forward)        (RDMA)               (forward)
```
Communication: Send activations between pipeline stages via RDMA.
2. Tensor Parallelism
Large matrices split across nodes. Each node computes a shard.
```
// 16384 x 16384 weight matrix split across 4 nodes
Node 0: W[0:4096, :]       // Shard 0
Node 1: W[4096:8192, :]    // Shard 1
Node 2: W[8192:12288, :]   // Shard 2
Node 3: W[12288:16384, :]  // Shard 3

// After local GEMM, all-reduce to combine
```
Communication: RDMA all-reduce after each sharded operation.
3. Data Parallelism
Same model replicated. Different batches. Gradient averaging.
```
Node 0: Model copy, Batch 0 → gradients ──┐
Node 1: Model copy, Batch 1 → gradients ──┼─→ All-reduce → Update all
Node 2: Model copy, Batch 2 → gradients ──┤
Node 3: Model copy, Batch 3 → gradients ──┘
```
RDMA: The Key Enabler
Why RDMA?
Remote Direct Memory Access - Zero-copy, kernel-bypass networking.
| Metric | TCP/IP | RDMA |
|---|---|---|
| Latency | ~50-100 µs | ~1-2 µs |
| Bandwidth | 10-25 Gbps | 100-400 Gbps |
| CPU overhead | High (kernel, copies) | Near zero |
| Memory copies | Multiple | Zero (DMA) |
RDMA Primitives We Need
```c
// One-sided operations (no remote CPU involvement)
rdma_write(remote_addr, local_buf, size);  // Write to remote memory
rdma_read(local_buf, remote_addr, size);   // Read from remote memory

// Collective operations (built on one-sided)
rdma_allreduce(buf, size, SUM);    // Gradient averaging
rdma_broadcast(buf, size, root);   // Weight distribution
rdma_barrier();                    // Synchronization
```
Implementation Roadmap
| Phase | Feature | Status |
|---|---|---|
| 1 | Single-node training (current) | Done |
| 2 | Multi-core parallelism (OpenMP) | Done |
| 3 | RDMA communication primitives | Planned |
| 4 | Pipeline parallelism | Planned |
| 5 | Tensor parallelism (sharded GEMM) | Planned |
| 6 | Encoder + cross-attention | Planned |
| 7 | 600B+ training | Future |
The Math Doesn't Change
From Tiny to Massive: Same Operations
Forward pass (any size model):
```
1. embed_tokens()        // Lookup: tokens → vectors
2. for each layer:
   a. rmsnorm()          // Normalize
   b. linear() × 3       // Q, K, V projections
   c. rope()             // Rotary embeddings
   d. attention()        // Softmax(QK^T)V
   e. linear()           // Output projection
   f. residual_add()     // Skip connection
   g. rmsnorm()          // Normalize
   h. mlp_swiglu()       // FFN with gating
   i. residual_add()     // Skip connection
3. rmsnorm()             // Final norm
4. lm_head()             // Logits
```
Backward pass: Same operations in reverse.
SGD: weights -= lr * gradients
That's it. For any model size.
Hardware Recommendations
For Different Scales
| Model Size | Recommended Setup |
|---|---|
| < 7B | Single server, 32+ cores, 128GB+ RAM |
| 7B - 70B | Single server, 128 cores, 512GB-2TB RAM |
| 70B - 200B | 2-4 nodes, RDMA interconnect, 2TB RAM each |
| 200B - 600B | 8-16 nodes, 100Gbps+ RDMA fabric |
| 600B+ | 32+ nodes, 400Gbps RDMA, pipeline + tensor parallel |
Recommended CPUs by Instruction Set
- AMD EPYC 9004/9005 (Genoa/Turin) - Up to 192 cores, AVX-512 (512-bit), 12-channel DDR5
- Intel Xeon Sapphire Rapids/Granite Rapids - Up to 128 cores, AVX-512 + AMX (512-bit tile)
- Ampere Altra Max - 128 ARM cores, NEON (128-bit), good perf/watt
- AWS Graviton3/4 - Cost-effective ARM, NEON + SVE2
Minimum requirement: AVX (256-bit) or NEON (128-bit). For best performance, AVX-512 or AMX.
RDMA Options
- Mellanox/NVIDIA ConnectX-7 - 400Gbps InfiniBand
- Intel E810 - 100Gbps RoCE (RDMA over Ethernet)
- AWS EFA - Cloud RDMA for EC2 instances
Why Not GPU?
"Nvidia, f**k you."
— Linus Torvalds, 2012 (video source), regarding their closed-source Linux drivers
Beyond the open-source concerns:
- Cost - High-end GPU cluster = $200K+. EPYC server with 2TB RAM = $15,000
- Memory - 80GB VRAM vs 2TB+ system RAM
- Availability - Export-controlled, supply constrained vs commodity CPUs
- Flexibility - Run anywhere: cloud, on-prem, edge, embedded
- Debugging - printf works. GDB works. Valgrind works.
- Longevity - C code compiles forever. CUDA versions break.
Further Reading
- Memory Reality - What NVIDIA marketing won't tell you about LLM memory
- Developer Guide - How the engine works
- Memory Safety - Bump allocator design
- Profiling Guide - Performance optimization