Scaling Philosophy

The Bet Behind This Project

The bet here is simple: AI will not stay locked inside premium proprietary boxes forever. C-Kernel-Engine takes a different path: Linux-only, CPU-only, server-grade hardware, open software, and standard data-center parts. If the software gets good enough, ordinary servers become practical AI machines.

This is not a claim that CPUs always beat GPU systems on raw peak throughput. That is not the point. The real question is whether CPU-only Linux systems can become good enough, cheap enough, and accessible enough to handle serious inference and eventually serious training. That is the thesis being tested here.

The early numbers already show that CPU-only inference is practical. A quantized 0.6B model on a 12th-gen Intel Alder Lake machine has reached about 100 tokens/sec when the system is otherwise idle. On an older 4-core machine, the same model still runs at about 20–25 tokens/sec. No GPU. No CUDA. No special hardware. Just common x86 instructions and Linux.

The bigger target is server-grade CPU infrastructure. Testing is underway on 5th-gen Intel Xeon Scalable systems with AVX-512 and AMX — the same class of machines already sitting in real data centers. The bet is that these machines will be cheaper, easier to procure, easier to operate, and more broadly deployable than proprietary accelerator-heavy stacks.

The method is straightforward: profile, find the real bottleneck, fix one kernel at a time, and measure again. C-Kernel-Engine uses VTune, FlameGraph, Intel Advisor, perf stat, and roofline analysis to do exactly that. The point of this page is not hype. It is to state the engineering bet clearly and then earn it with measurements.
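That loop has a quantitative core: the roofline model, which caps attainable throughput at either peak compute or memory bandwidth times arithmetic intensity. A minimal sketch in C; the peak-compute and bandwidth figures passed in are inputs for comparison, not claims about any particular machine:

```c
#include <math.h>

/* Roofline model: attainable GFLOP/s is capped either by peak compute
 * or by memory bandwidth times arithmetic intensity (AI).
 * AI = useful FLOPs per byte moved from DRAM. */
double roofline_gflops(double peak_gflops, double bw_gbs, double ai_flops_per_byte) {
    double mem_bound = bw_gbs * ai_flops_per_byte;
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}

/* Arithmetic intensity of a matrix-vector product (the batch=1 decode
 * GEMV): ~2 FLOPs per weight, each weight read once from memory. */
double gemv_ai(double bytes_per_weight) {
    return 2.0 / bytes_per_weight;
}
```

With FP16 weights, `gemv_ai(2.0)` is 1 FLOP/byte, which is why batch=1 decode sits on the memory-bound side of the roofline on nearly any hardware.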

Engineering Compass — Profile, Find Bottleneck, Fix Kernel, Measure, Repeat

🧭 Engineering Compass — What This Page Is Really For

This page is the reminder to stay focused on the real thesis: make CPU-only Linux systems useful for modern AI by improving the software until commodity server hardware becomes practical. The scaling story is a direction, not a marketing slogan.

The engineering discipline is simple and repeatable:

  1. Find the slowest-moving part. Profile it. Understand why it's slow.
  2. Find the fastest-moving part. Understand what makes it fast. Replicate the pattern.
  3. Get more RAM, more cores. Test on bigger hardware. See if the architecture holds.
  4. Keep profiling. VTune, perf stat, roofline, FlameGraph — every run, every change.
  5. Fix one kernel at a time. Don't boil the ocean. One bottleneck, one fix, one measurement.

People will disagree with the thesis, and that is fine. The only useful answer is better measurement, better kernels, and clearer system design. Follow the data, not the hype.

TL;DR — Two First Principles
1) 0 × ∞ = 0: If the model doesn't fit in memory, FLOPS don't matter.
2) Theory of Constraints: At the Ethernet boundary, CPUs and GPUs face the same bottleneck.

The Two First Principles

Principle 1: 0 × ∞ = 0 — memory gates everything. Principle 2: Theory of Constraints — Ethernet equalizes GPUs and CPUs at cluster scale.
CPU Horizontal Scaling Can Stay Simpler
Adding CPU servers still requires sharding, scheduling, networking, and NUMA discipline. But it avoids vendor-specific programming models, specialized accelerator fabrics, and the assumption that serious AI must depend on external accelerators from the start. The bet here is that standard Linux servers on Ethernet remain a simpler operational path for many real deployments.

The Computation Is Not Exotic

AI training and inference reduce to five operations: matrix multiply, attention, softmax, layer normalization, and backpropagation. These are linear algebra and calculus β€” mathematics developed in the 17th through 19th centuries, long before the first computer. Nothing about the computation requires physically exotic hardware. A CPU with SIMD instructions executes every one of these operations natively. If the math isn't exotic, the hardware requirement isn't permanent β€” it's a market condition. And market conditions change.

Left: AI operations are standard linear algebra and calculus. Right: the hardware displacement pattern, with proprietary incumbents repeatedly losing to commodity.
The same pattern, repeating
In 1995, "You can't run serious workloads on cheap PCs" was conventional wisdom.
In 2025, "You can't run serious AI on cheap CPUs" is conventional wisdom.
One of these beliefs aged very poorly. The pattern suggests which way this one goes.
Historical reference
| Era | Proprietary Incumbent | Commodity Disruptor | Result |
| --- | --- | --- | --- |
| 1990s | SPARC, Alpha, PA-RISC | x86 commodity chips | Proprietary RISC faded |
| 1998 | Sun/SGI servers ($500K+) | x86 PCs + MapReduce/GFS | Sun acquired by Oracle (2010) |
| 2009 | Teradata, Netezza ($1M+) | Hadoop on commodity clusters | Big data democratized |
| Now | GPU clusters ($M+) | CPU clusters + software | → ? |

Principle 1: The Cost of 0 × ∞ = 0

The GPU Memory Trap

Yes, you CAN fit a 70B model on GPUs using tensor/pipeline parallelism. That's not the point.

The point is: you're now FORCED to buy 8+ GPUs in a cluster, with all the cost, power draw, and coordination complexity that entails.

The Math That Matters

GPU Path:
  Model doesn't fit in 80GB → Buy 8 GPUs → $320K for GPUs alone
  Need NVLink for fast communication → Another $50K+
  Need DGX chassis → Another $80K+
  Total: $450K+ just to START

CPU Path:
  Model fits in 4TB RAM → Buy 1-2 servers → $30K each
  Standard Ethernet networking → $2K
  Total: $60K and you're running
    

The "0 × ∞ = 0" principle forces GPU users into expensive multi-GPU setups. CPUs avoid this entirely.

Target Platform

Server-Grade Hardware by Instruction Set

C-Kernel-Engine uses ck_features.h for feature detection. We target by SIMD capability, not CPU model:

Instruction Set Priority

  • AMX - tile-based matrix multiply (Intel Sapphire Rapids+)
  • AVX-512 - 512-bit vector (Intel Skylake-X+, AMD Zen 4)
  • AVX2+FMA - 256-bit with FMA (Intel Haswell+, AMD Zen 2+)
  • AVX - 256-bit vector (Intel Sandy Bridge+, AMD Zen 1)
  • NEON - 128-bit (ARM64, Apple Silicon)

Auto-detection: The engine selects the best kernel at build time with runtime dispatch for optional extensions.
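The runtime-dispatch half of that pattern can be sketched with GCC/Clang's CPUID builtins. The function names here are hypothetical and the real ck_features.h API may differ; this shows the pattern, not the file:

```c
/* Runtime kernel dispatch, sketched. */
typedef void (*saxpy_fn)(long n, float a, const float *x, float *y);

static void saxpy_scalar(long n, float a, const float *x, float *y) {
    for (long i = 0; i < n; i++) y[i] += a * x[i];
}

/* In a real build this would be a separate translation unit compiled
 * with -mavx2 -mfma; the scalar body stands in so the sketch compiles
 * anywhere. */
static void saxpy_avx2(long n, float a, const float *x, float *y) {
    saxpy_scalar(n, a, x, y);
}

saxpy_fn select_saxpy(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  /* populate the CPU feature cache */
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma"))
        return saxpy_avx2;
#endif
    return saxpy_scalar;
}
```

Baseline selection (which kernels get compiled at all) stays a build-time decision; `select_saxpy` only picks among the variants that were built.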

CPU Requirements

  • High core count - 64-128+ cores per socket
  • Large L3 cache - Good core-to-cache ratio (1-2MB/core)
  • Vector width - 256-bit minimum (AVX)
  • FMA - Recommended for 2x throughput
  • Multiple sockets - NUMA-aware memory access

Memory Requirements

  • DDR5 - Higher bandwidth per channel than DDR4
  • Multi-channel - 8-12 channels per socket
  • Large capacity - 512GB - 2TB+ per node
  • ECC - Error correction for reliability
  • NUMA-local - Pin threads to local memory
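The NUMA-local bullet is the one that takes code rather than a purchase order. A minimal sketch of thread pinning on Linux; libnuma offers finer-grained placement, while this uses only the raw affinity syscall:

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to one CPU. Under Linux's default first-touch
 * policy, memory the thread subsequently allocates and touches lands on
 * that CPU's local NUMA node. Returns 0 on success. */
int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set); /* pid 0 = calling thread */
}
```

A worker pool would call this once per thread at startup, before touching its share of the weights.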

Accelerators

  • Intel DSA - Data Streaming Accelerator for memory copies
  • Intel IAA - Analytics Accelerator for compression
  • Intel QAT - QuickAssist for crypto (if needed)
  • CXL - Memory expansion and pooling (future)

Networking

  • RDMA - InfiniBand or RoCEv2
  • 100-400 Gbps - High bandwidth interconnect
  • Low latency - 1-2 ΞΌs for RDMA operations
  • Kernel bypass - Zero-copy transfers

Operating System

Linux-only. The engine leans on Linux-specific features throughout: NUMA memory placement, RDMA kernel modules, and the perf tooling stack.

C-Kernel-Engine targets by instruction set capability, not CPU model. Any server-grade CPU with AVX2+FMA or better is a valid target β€” specific models change, the instruction sets don't. See include/ck_features.h for detection logic.

Why CPU-Only?

GPUs dominate when you can keep them highly utilized. Large batches, dense GEMMs, and well-packed workloads that fit comfortably in VRAM let GPUs exercise their theoretical FLOPS advantage. C-Kernel-Engine isn't anti-GPU — we're anti-waste: wasted money on unused capacity, wasted energy at low utilization, and wasted coordination overhead at scale.

Advantages

  • No vendor lock-in - Works on any x86/ARM CPU
  • Commodity hardware - Standard servers, not $40K GPUs
  • Larger memory - 2TB RAM per node, no 80GB VRAM limit
  • Better debugging - GDB, Valgrind, perf all work
  • Simpler deployment - No CUDA, no driver hell
  • Open ecosystem - GCC, Linux, standard tools

The Trade-off

  • GPUs have higher peak FLOPS
  • But: memory bandwidth often bottlenecks anyway
  • But: PCIe transfer overhead for large models
  • But: multi-GPU coordination is complex
  • But: CPU memory is 10-100x larger and cheaper

For inference: CPUs are often faster for batch=1

For training: Scale horizontally with RDMA

The Fundamental Math: 0 × ∞ = 0

The Memory Constraint
It doesn't matter how fast your compute is if your model won't fit in memory. Being 10x faster at computing doesn't help if you're limited by 0 × ∞ = 0.
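The gate can be written as a two-line check. The bytes-per-parameter values are the usual ones (2 for FP16/BF16, 1 for Q8, roughly 0.5 for Q4):

```c
#include <stdint.h>

/* The 0 x infinity = 0 check, as code: before comparing FLOPS, ask
 * whether the weights load at all. */
uint64_t model_bytes(uint64_t n_params, double bytes_per_param) {
    return (uint64_t)((double)n_params * bytes_per_param);
}

int model_fits(uint64_t n_params, double bytes_per_param, uint64_t mem_bytes) {
    return model_bytes(n_params, bytes_per_param) <= mem_bytes;
}
```

A 70B model in FP16 needs 140 GB: it fails this check against a single 80GB accelerator but passes against a 4TB server.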

CPU: Memory Wins

  • Dual-socket server: 4-6TB DDR5
  • Can train: 1TB model in BF16
  • Math: 4-6TB × dual socket = non-zero, model loads
  • Result: Actually trains the model

GPU: Compute Fast, Memory Fails

  • Single GPU VRAM: tops out well below what large models need
  • Each GPU generation: more HBM — at exponentially higher price per unit
  • The constraint: 1TB model ÷ per-GPU VRAM = many GPUs, minimum, just for weights
  • Math: 0 utility per GPU (model won't fit alone) × fast FLOPS = 0 — compute speed doesn't solve a memory problem
  • Result: You buy more GPUs. Cost compounds. Complexity compounds. Memory is still the bottleneck.
CPU vs GPU Memory Analysis

The GPU Cluster Reality

GPUs Require Clusters

Here's the fundamental problem: no single GPU can handle large models. You need a cluster.

The VRAM Wall
Every GPU hits the same VRAM wall. Whether 24GB or 80GB per GPU, large models require massive GPU counts in coordinated clusters. This is the fundamental constraint that CPU-only architecture bypasses entirely.

Energy Efficiency: The CPU Advantage at Realistic Utilization

"CPUs burn too much power per token"
This is the final argument GPU advocates use. But the math changes dramatically when we look at realistic utilization, not theoretical peak FLOPS.

The Utilization Problem

GPU efficiency claims assume 100% compute utilization. Real inference doesn't work that way: traffic arrives in bursts, batch=1 decode is memory-bound rather than compute-bound, and the hardware spends long stretches waiting on I/O between requests.

The dirty secret: GPUs spend most of their time idle, still drawing significant power.

The Idle Power Reality

GPU: Always Hungry

  • High-end GPU at idle: ~150W (just sitting there)
  • High-end GPU at compute: ~700W
  • PCIe overhead: +50W for data transfer
  • VRAM stays powered: Weights must remain loaded

Real-world: If your GPU is only computing 20% of the time, you're wasting 80% of that 700W.

CPU: Scales Down

  • Dual Xeon at idle: ~100-150W (bare OS, minimal load)
  • Dual Xeon at compute: ~800-1000W (full load)
  • DVFS: Scales from 0.8GHz to 3.5GHz dynamically
  • C-states: Deep sleep cores when waiting for I/O

Real-world: Enterprise server with 2TB RAM typically draws 200-400W average for inference workloads.

Power-Per-Token Analysis

| Scenario | GPU Power | CPU Power | Winner |
| --- | --- | --- | --- |
| Theoretical peak FLOPS | 700W / 989 TFLOPS = 0.71 W/TFLOPS | 1000W / 6 TFLOPS = 167 W/TFLOPS | GPU (theoretical) |
| Memory-bound (typical inference) | 700W (can't scale down) | 200-400W (scales with load) | CPU (2-3.5x less) |
| Batch=1, high I/O wait | 300W average (60% idle) | 150-200W average (70% idle) | CPU (1.5-2x less) |
| Multi-tenant (6 models) | 6 × 700W = 4,200W (all active) | 800-1000W (all on one server) | CPU (4-5x less) |

The Utilization Math

GPU Cluster (6× high-end GPUs) for 6-department enterprise:
  6 departments × 1 GPU each = 4,200W continuous
  Even when only 1-2 departments are active.
  Plus: $240,000+ in hardware, NVLink complexity.

CPU (1× Dual Xeon Platinum) for 6-department enterprise:
  All 6 models resident in 2TB RAM = ~1000W max
  Each department waits its turn = efficient time-sharing
  Scales power with actual compute load (not fixed at max)

Net difference: 4-5x less power, 10x lower hardware cost
    

Watts Per Token: The Real Numbers

Enterprise Deployment Comparison

Scenario: 6 models, 24/7 operation, mixed workload

GPU Cluster (6× high-end GPUs):
  Idle power: 6 × 150W = 900W
  Compute power: 6 × 700W = 4,200W (when all busy)
  Average (typical 20% compute): ~1,500W
  Power/24hr: 36 kWh
  Power/year: 13,140 kWh
  @ $0.10/kWh: $1,314/year

CPU Server (1× Dual Xeon Platinum, 2TB RAM):
  Idle power: ~150W (bare OS, all models in RAM)
  Compute power: ~1000W (all models active)
  Average (typical 20% compute): ~320W
  Power/24hr: 7.7 kWh
  Power/year: 2,800 kWh
  @ $0.10/kWh: $280/year

Net difference: 4-5x less power = ~$1,000+/year savings
    
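The averages above come from a simple duty-cycle model; a sketch using the idle and full-load figures from this page (the GPU average works out to 1,560W, which the comparison rounds to ~1,500W):

```c
/* Duty-cycle power model: average draw is idle power weighted by idle
 * time plus full-load power weighted by compute time. */
double avg_power_w(double idle_w, double busy_w, double duty_cycle) {
    return idle_w * (1.0 - duty_cycle) + busy_w * duty_cycle;
}

/* Convert an average draw in watts to energy per year in kWh. */
double yearly_kwh(double avg_w) {
    return avg_w * 24.0 * 365.0 / 1000.0;
}
```

At 20% duty cycle: `avg_power_w(150, 1000, 0.2)` gives the 320W CPU figure, and `yearly_kwh(320)` gives the ~2,800 kWh/year above.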

Carbon Footprint: Real-World Impact

CPU Advantage

  • 4-5x less electricity for multi-tenant inference
  • No added accelerator manufacturing footprint - reuses server-class silicon already in production
  • Lower cooling due to lower heat output
  • Uses existing infrastructure - no new hardware needed
  • 10x lower hardware cost ($60K vs $600K+)

When GPUs Make Sense

  • Training large models (100B+) at 100% utilization
  • Very high throughput with batching
  • Research where peak FLOPS matter more than efficiency
The Bottom Line
GPU efficiency claims assume 100% compute utilization. Real inference workloads are typically 10-30% compute-bound. At realistic utilization, CPUs consume 3-5x less power for multi-tenant inference. Plus: 10x lower hardware cost. This isn't theory - it's simple duty-cycle arithmetic based on how often your hardware is actually doing work.

The Hidden Cost: Power Delivery and Signal Integrity

Nobody Talks About This
GPU marketing quotes peak TFLOPS. What they don't mention is the electrical engineering nightmare required to actually deliver those peaks.

The GPU Power Profile: Burst-Idle-Burst

GPUs don't draw steady power. They spike to peak compute (hundreds of watts), then drop when waiting for data transfer, then spike again. This burst-idle-burst pattern creates massive di/dt (rate of current change) that cascades into real electrical engineering problems:

Result: Data centers running GPU clusters need specialized power infrastructure — substations, high-capacity PDUs, and overprovisioned power delivery — to handle these peak bursts that occur for fractions of a second.

GPU: Spiky, Unpredictable Power

Power draw over time:
  ████████░░░░████████░░████████░░░░
  700W     150W  700W  150W 700W  150W
  compute  wait  compute wait compute wait

Peak-to-idle ratio: ~5:1
di/dt: Extreme
Infrastructure: Substation-grade power delivery
        

CPU: Steady, Predictable Power

Power draw over time:
  ████████████████████████████████
  200-400W consistent draw

Peak-to-idle ratio: ~2:1
di/dt: Minimal (DVFS transitions are gradual)
Infrastructure: Standard data center power
        

Theory of Constraints Applied to Power

The fastest moving part and the slowest moving part of a system should be as close together as possible. GPUs violate this principle at the electrical level:

CPUs don't have GPU-level peak FLOPS, but they also don't need substation-grade power infrastructure to handle those peaks. The power draw is consistent and predictable. Standard, well-designed data center power delivery handles it without complication. No specialized substations. No overprovisioned PDUs. No capacitor banks for transient spikes.

The peak FLOPS that GPUs advertise are real — but the cost of actually delivering that power is hidden from every benchmark and every marketing slide.

Power Delivery Reality - GPU burst-idle-burst spikes vs CPU steady state power draw, showing signal integrity problems, Theory of Constraints applied to power, and hidden infrastructure costs

Principle 2: The Ethernet Equalizer

NVLink Doesn't Scale Infinitely
NVLink is 900 GB/s within a single node. But you can only fit 8 GPUs per node. Go beyond that? You hit Ethernet. And at the Ethernet boundary, CPUs and GPUs face the exact same constraint.

The Bandwidth Reality

Let's be precise about the numbers:

| Connection Type | Bandwidth | Where It Applies | Scales To |
| --- | --- | --- | --- |
| NVLink 4.0 | 900 GB/s | GPU-to-GPU within 1 node | 8 GPUs max |
| DDR5 (12-channel) | 460 GB/s | CPU-to-RAM within 1 socket | Per socket |
| 400GbE Ethernet | 50 GB/s | Node-to-node | Infinite nodes |
| 100GbE Ethernet | 12.5 GB/s | Node-to-node | Infinite nodes |
| InfiniBand HDR | 25 GB/s | Node-to-node | Thousands of nodes |

The Theory of Constraints Applied

GPU Cluster at Scale:
  Within node:  NVLink 900 GB/s (fast!)
  Between nodes: Ethernet 50 GB/s (constraint!)
  System speed = 50 GB/s (bottleneck)

CPU Cluster at Scale:
  Within node:  DDR5 460 GB/s
  Between nodes: Ethernet 50 GB/s (same constraint!)
  System speed = 50 GB/s (same bottleneck)

AT SCALE, THEY HIT THE SAME WALL.
    
Without NVLink, GPUs Are Often Slower
If you can't afford NVLink switches ($200K+), your GPUs communicate over PCIe → Ethernet. That's 12.5-50 GB/s. A server-grade CPU with DDR5 has 460 GB/s to its own memory. For memory-bound workloads, the CPU wins.

The Compute-to-Bandwidth Chasm

The Ethernet Equalizer Shows Both Hit the Same Wall — But the Pain Is Wildly Different
Every system has a fastest thing (compute) and a slowest thing (cross-node data movement). C-Kernel-Engine's entire goal is to bring these two into sync. On CPUs, they're close enough that software can bridge the gap. On GPUs, they're orders of magnitude apart — no software fixes that.

On CPUs, the gap between the fastest thing the system can do (compute) and the slowest thing it must do (move data across the network) is close. Peak FLOPS and Ethernet bandwidth live in the same neighborhood. That means the remaining optimization work is pure engineering — tiling, prefetching, computation-communication overlap — real techniques that bring compute and data movement closer to sync. That's what C-Kernel-Engine is built to do.

On GPUs, the fastest and slowest are worlds apart. GPU peak compute is orders of magnitude faster than the Ethernet pipe that feeds it at cluster scale. It's the difference between the summit of Mt. Everest and the floor of the Mariana Trench. Most of that compute sits permanently idle, burning power, waiting for data that will never arrive fast enough. No amount of software engineering changes the physics.

On CPUs the gap between compute speed and network speed is small β€” C-Kernel-Engine bridges it. On GPUs the gap is orders of magnitude β€” unbridgeable at scale.

CPU: Rolling Hills — Bridgeable

  • Fastest (compute) and slowest (network) are close
  • Local memory bandwidth sits in between — a smooth gradient, not a cliff
  • Software can bridge the remaining gap: tiling, prefetch, overlap
  • Every hardware generation makes the gap smaller (more bandwidth, same physics)

The terrain is gentle enough to walk. C-Kernel-Engine's job is to build the bridge: bring compute throughput and data movement into sync through aggressive kernel engineering.

GPU: Everest to Mariana — Unbridgeable

  • Fastest (compute) and slowest (network) are orders of magnitude apart
  • Intra-node interconnects are fast, but only reach a handful of GPUs
  • At cluster scale, everything hits the same Ethernet wall
  • Most compute capacity sits permanently idle, starved for data

The terrain is a cliff face. No bridge spans from the peak of Everest to the bottom of the Mariana Trench. The gap is structural — physics, not engineering.

C-Kernel-Engine's Optimization Thesis
The goal is straightforward: bring the fastest-moving thing and the slowest-moving thing as close to sync as possible. On CPUs, they're already neighbors — the remaining work is tiling, cache management, prefetching, and overlapping computation with communication. That is a solvable engineering problem, and it's exactly what this project is built to solve. On GPUs at cluster scale, the gap between compute and data movement is structural. No kernel optimization closes it.
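Of those techniques, tiling is the most self-contained to show. A cache-blocked matmul sketch; the 64-element tile is an illustrative choice, not a tuned value:

```c
#include <stddef.h>
#include <string.h>

/* Cache-blocked matmul: traverse C in TILE x TILE blocks so each
 * block's working set stays resident in L1/L2 instead of streaming
 * repeatedly from DRAM. Same arithmetic as the naive triple loop,
 * different traversal order. */
#define TILE 64

void matmul_tiled(const float *A, const float *B, float *C,
                  size_t m, size_t k, size_t n) {
    memset(C, 0, m * n * sizeof(float));
    for (size_t i0 = 0; i0 < m; i0 += TILE)
        for (size_t p0 = 0; p0 < k; p0 += TILE)
            for (size_t j0 = 0; j0 < n; j0 += TILE) {
                /* clamp tile edges at the matrix boundary */
                size_t im = i0 + TILE < m ? i0 + TILE : m;
                size_t pm = p0 + TILE < k ? p0 + TILE : k;
                size_t jm = j0 + TILE < n ? j0 + TILE : n;
                for (size_t i = i0; i < im; i++)
                    for (size_t p = p0; p < pm; p++) {
                        float a = A[i*k + p];
                        for (size_t j = j0; j < jm; j++)
                            C[i*n + j] += a * B[p*n + j];
                    }
            }
}
```

Prefetching and compute-communication overlap follow the same principle: restructure the schedule so the slow resource is never the one everything waits on.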

Designing Your Ethernet Network

Ethernet Switch Topology for ML Clusters

Since Ethernet is the equalizer at scale, designing it well is critical.

Leaf-Spine Architecture

                ┌─────────────┐         ┌─────────────┐
                │   Spine 1   │         │   Spine 2   │
                │   400GbE    │         │   400GbE    │
                └──────┬──────┘         └──────┬──────┘
                       │                       │
       ┌───────────────┼───────────────────────┼───────────────┐
       │               │                       │               │
  ┌────┴────┐     ┌────┴────┐             ┌────┴────┐     ┌────┴────┐
  │ Leaf 1  │     │ Leaf 2  │             │ Leaf 3  │     │ Leaf 4  │
  │ 100GbE  │     │ 100GbE  │             │ 100GbE  │     │ 100GbE  │
  └────┬────┘     └────┬────┘             └────┬────┘     └────┬────┘
       │               │                       │               │
  ┌────┴────┐     ┌────┴────┐             ┌────┴────┐     ┌────┴────┐
  │Server 1 │     │Server 3 │             │Server 5 │     │Server 7 │
  │Server 2 │     │Server 4 │             │Server 6 │     │Server 8 │
  └─────────┘     └─────────┘             └─────────┘     └─────────┘
    

Switch Sizing Calculator

Bandwidth Requirements

For distributed training with data parallelism:
  Gradient size = Model parameters × bytes per param
  70B model in FP16 = 70B × 2 bytes = 140 GB

All-reduce bandwidth needed:
  Per iteration: 2 × gradient size (reduce-scatter + all-gather)
  70B model: 2 × 140 GB = 280 GB per iteration

With 100GbE (12.5 GB/s):
  All-reduce time = 280 GB ÷ 12.5 GB/s = 22.4 seconds

With 400GbE (50 GB/s):
  All-reduce time = 280 GB ÷ 50 GB/s = 5.6 seconds
    
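The same sizing arithmetic as a reusable function, using the 2× gradient-size cost of reduce-scatter plus all-gather from the calculation above:

```c
/* All-reduce time estimate: each iteration moves roughly twice the
 * gradient size over the node's link (reduce-scatter + all-gather). */
double allreduce_seconds(double gradient_gb, double link_gb_per_s) {
    return 2.0 * gradient_gb / link_gb_per_s;
}
```

For the 70B/FP16 case: `allreduce_seconds(140.0, 12.5)` reproduces the 22.4s figure for 100GbE, and `allreduce_seconds(140.0, 50.0)` the 5.6s figure for 400GbE.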

Small Cluster (8-16 servers)

  • Topology: Single 400GbE switch
  • Switch: Arista 7060X5 or similar
  • Ports: 32× 400GbE
  • Cost: ~$30,000
  • Bisection BW: 12.8 Tbps

Medium Cluster (32-64 servers)

  • Topology: Leaf-spine (2 spine, 4 leaf)
  • Spine: 2× 400GbE switches
  • Leaf: 4× 100GbE switches
  • Cost: ~$120,000
  • Bisection BW: 25.6 Tbps

Large Cluster (100+ servers)

Key insight: A GPU cluster at this scale needs the SAME network infrastructure. The Ethernet cost is equal. But CPUs don't need the $2M+ in NVLink switches.

RDMA Configuration

RoCEv2 Setup

# Enable RDMA over Converged Ethernet v2
# On each server with Mellanox/NVIDIA ConnectX NICs:

# 1. Enable PFC (Priority Flow Control) on switch
#    Required for lossless Ethernet

# 2. Configure ECN (Explicit Congestion Notification)
mlnx_qos -i eth0 --pfc 0,0,0,1,0,0,0,0  # Enable PFC on priority 3
mlnx_qos -i eth0 --trust=dscp

# 3. Set up RDMA
modprobe rdma_ucm
modprobe rdma_cm

# 4. Verify RDMA is working
ibv_devinfo
rdma link show
    

Scale Economics: The Real Comparison

Why CPU-Only Wins at Scale

Scale Economics Comparison

GPU Cluster at Scale

  • For a 1TB model: 100+ GPUs minimum; frontier clusters run to ~100,000 GPUs
  • Total cost at that scale: ~$4 Billion (at roughly $40K per GPU)
  • Who can afford: a handful of companies globally
  • Result: Elite-only access

CPU Cluster at Scale

  • Same scale: ~8,000 servers at ~$30K each
  • Total cost: ~$240 Million
  • Who can afford: 1000+ companies
  • Result: Accessible to everyone

The Economics: 16x Cheaper, 200x More Accessible

While GPU costs stay high and GPU access remains limited, CPU clusters deliver the same scale at a fraction of the cost. This makes large-scale ML accessible to 1000+ companies instead of just 5.

The Hybrid Trap: CPU+GPU = CPU-Bound

Common Question:
"Why not use CPU for memory and GPU for compute? Get the best of both!"

Answer: You're then AS FAST AS THE CPU anyway! You get the complexity of both with the performance of neither.
Hybrid CPU+GPU Bottleneck

The Hybrid Bottleneck

When the CPU holds the weights and streams them to the GPU for compute, every layer must cross PCIe at roughly 32-64 GB/s (Gen4/Gen5 x16), far below the CPU's own DDR5 memory bandwidth. The GPU spends its cycles waiting on the transfer, so the pipeline runs at CPU-memory speed at best.

Conclusion: If the GPU is limited by the CPU anyway, just use CPUs! Simpler, faster, cheaper.

The GPU Workaround Stack: "Innovations" That Are Actually Patches

The Uncomfortable Truth
The entire field has been optimizing around a hardware constraint and mistaking the workarounds for progress. Every major "breakthrough" in LLM architecture is actually compensating for GPU memory limitations.

Every "Innovation" Maps to a GPU Constraint

| "Innovation" | What It Actually Does | The GPU Constraint It Patches | Needed on CPU with 2-4TB RAM? |
| --- | --- | --- | --- |
| GQA (Grouped Query Attention) | Shares KV heads across query heads | KV cache blows up GPU VRAM | No — KV cache fits |
| MoE (Mixture of Experts) | Activates sparse subset of parameters | Dense model won't fit on one GPU | No — dense model fits |
| KV Caching | Stores past attention keys/values — literally a key-value database in every layer | GPU VRAM limits cache size, forces eviction strategies | CPU home turf — this is literally how databases work. CPUs have been running key-value stores for decades. |
| Gradient Checkpointing | Recomputes activations instead of storing them | Training activations don't fit in GPU VRAM | No — store everything |
| Tensor Parallelism | Shards weight matrices across GPUs | Single GPU can't hold the full matrix | No — full matrix fits in RAM |
| Pipeline Parallelism | Distributes layers across GPUs | All layers don't fit on one GPU | No — all layers fit |
| Flash Attention | Online softmax with tiled computation — streams through attention in blocks | Full attention matrix doesn't fit in GPU SRAM/VRAM | Brilliant for CPU — tiling maps naturally to CPU cache hierarchies. CPUs process data in cache-line-sized tiles inherently. |
| Quantization Research | Compresses model weights (Q4, Q8, etc.) | Model doesn't fit in GPU VRAM at full precision | Optional — use for bandwidth, not capacity |
The Punchline
On a CPU with 2-4TB RAM, half of these become unnecessary, and the other half become simpler. The entire research direction has been shaped by GPU limitations, and people have confused "optimizations forced by GPU memory walls" with "fundamental advances in model architecture."
GPU Workaround Convergence - Every GPU innovation (GQA, MoE, KV Caching, Flash Attention, Gradient Checkpointing, Tensor/Pipeline Parallelism, Quantization) converges toward capabilities CPUs have had natively for decades

The Sequential Reality: GPUs Were Never Designed for This

Transformers Are Sequential. Period.

A transformer forward pass is:

Layer 1 → Layer 2 → Layer 3 → ... → Layer 80

You cannot compute layer 10 without the output of layer 9. That is the definition of sequential dependency. There is no debate here — it's mathematical fact.

What people mean when they say "LLMs are parallel" is that within a single layer, the matrix multiplication can be parallelized across rows and columns. But that's not the model being parallel — that's a single operation within a sequential pipeline being decomposable. Every CPU has been decomposing matmuls across SIMD lanes and cores for decades.

Amdahl's Law: Why Parallel Hardware Still Hits a Ceiling

Strong-Scaling Limit

Amdahl's Law is the simplest way to say what the sequential transformer argument implies in practice:

speedup(N) = 1 / (S + (1 - S) / N)

Where S is the fraction of work that stays effectively serial or synchronization-bound. Even with infinite parallel hardware, the maximum speedup is still:

max_speedup = 1 / S
| Serial / sync fraction | Theoretical max speedup | What it means |
| --- | --- | --- |
| 10% | 10× | A surprisingly hard wall for giant clusters |
| 5% | 20× | Still not "infinite scaling" |
| 2% | 50× | Requires extraordinary system design |
| 1% | 100× | Already very difficult in real distributed systems |

For LLMs, that serial fraction is not just "the next token depends on the previous token." It also includes layer boundaries, synchronization, collective communication, optimizer steps, routing decisions, and all the places where the system must wait for the slowest participant.

CKE takeaway: do not optimize only for peak FLOPS. Reduce the effective serial fraction, reduce synchronization pressure, and keep compute, memory, and network in the same performance neighborhood.
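The formula is small enough to carry around as code:

```c
/* Amdahl's Law from the formula above: s is the effective serial/sync
 * fraction, n the number of parallel workers. */
double amdahl_speedup(double s, double n) {
    return 1.0 / (s + (1.0 - s) / n);
}

/* The n -> infinity limit: no hardware budget breaks the 1/s ceiling. */
double amdahl_ceiling(double s) {
    return 1.0 / s;
}
```

Plugging in the table's rows: a 10% serial fraction caps out at 10× no matter how many workers you add.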

What GPUs Were Designed For

Independent pixel computation.

Millions of pixels with zero data dependency on each other. Pixel (0,0) doesn't need the result of pixel (1920,1080) to compute its color. That IS embarrassingly parallel.

pixel(0,0)   → compute color → done
pixel(0,1)   → compute color → done
pixel(1,0)   → compute color → done
...
pixel(1920,1080) → compute color → done
(all independent, all simultaneous)

What LLM Inference Actually Is

Sequential token generation on a sequential layer stack.

Token N+1 depends on the attention computation over ALL previous tokens. The autoregressive decode loop is the opposite of pixel independence.

token 1 → 80 layers → token 2
token 2 → 80 layers → token 3
token 3 → 80 layers → token 4
...
(each waits for the previous)

The Vocabulary of Mismatch

The fact that the industry bolted LLM training onto hardware designed for independent pixel shading and then invented an entire vocabulary (tensor parallelism, pipeline parallelism, gradient checkpointing, KV cache eviction, capacity factors, expert dropout) to work around the mismatch should tell you something is wrong with the foundational assumption.

Every one of these is a workaround for the same problem: you're running a sequential, memory-hungry workload on hardware designed for embarrassingly parallel, compute-bound pixel shading.

MoE: The Right Hardware for Dynamic Routing

Native Fit vs. Forced Workaround

Yes, MoE runs on GPUs — frontier models do it at scale. But runs on is not the same as naturally fits. The same argument made against CPUs for AI ("it works, but it's not the right tool") applies equally to GPUs for MoE.

MoE routing is conditional: for each token, a gating function decides which experts activate. GPUs — designed for dense, predictable, lockstep computation — handle this through software workarounds: load balancing losses, capacity factors, expert dropout, and auxiliary training objectives just to keep GPU utilization from collapsing under sparse activation patterns.

Consider fixed-function inference chips. They prefer dense computation above all else. The minute you introduce dynamic routing — conditional branching, variable expert selection — they require architectural hacks. GPUs face the same underlying tension, just with more memory headroom to absorb it.

CPUs have handled conditional branching natively since the beginning: branch prediction, out-of-order execution, speculative paths. The hardware was built for exactly this. AVX-512 computes 16 FP32 multiply-accumulates per cycle per core — across 128 cores that's 2048 parallel operations — and the dynamic routing overhead that GPUs must paper over in software is just normal control flow on a CPU.

Why MoE All-to-All Makes Topology Matter

Jensen Huang's point about MoE is not that "one expert equals one GPU" in a literal one-expert-per-device sense. Real deployments usually place multiple experts per GPU, or shard very large experts across several GPUs. But the core communication pattern is the same:

tokens on many GPUs
   ↓ route top-k experts
dispatch token states to expert owners (all-to-all)
   ↓
run expert MLPs
   ↓
send expert outputs back / combine (all-to-all again)

That happens at every MoE layer. If the interconnect is a one-hop switched fabric, the communication cost is painful but bounded. If the topology is multi-hop, ring-like, torus-like, or otherwise forces traffic through several devices before reaching the destination, the waiting and queueing add up quickly.

This is where Amdahl's Law shows up in systems form: the expert GEMMs may parallelize beautifully, but the dispatch, synchronization, and combine phases become the part you cannot hide. The more often the model must do all-to-all, the more the communication fraction dominates the ceiling on real speedup.
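A rough cost model for one MoE layer's dispatch and combine phases. The hidden size, top-k, and link speed used below are illustrative assumptions, not figures from any particular model:

```c
#include <stdint.h>

/* One-way dispatch volume for a single MoE layer: every routed token
 * ships its hidden-state vector to top_k expert owners. */
uint64_t moe_dispatch_bytes(uint64_t tokens, uint64_t hidden,
                            uint64_t top_k, uint64_t bytes_per_elem) {
    return tokens * hidden * top_k * bytes_per_elem;
}

/* Dispatch plus combine cross the wire, so charge ~2x the one-way
 * bytes against the link bandwidth (GB/s). Queueing and multi-hop
 * topologies only make this worse. */
double moe_alltoall_seconds(uint64_t one_way_bytes, double link_gb_per_s) {
    return 2.0 * (double)one_way_bytes / (link_gb_per_s * 1e9);
}
```

Because this cost recurs at every MoE layer of every step, it lands squarely in the serial/sync fraction that Amdahl's Law says you cannot hide.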

Training: Batch Size Is Not What You Think

"But GPUs need large batches for efficiency!"
This is true — and that's a GPU problem, not a training requirement. Training does not need batch size greater than 1. There are many ways to simulate larger effective batch sizes without holding multiple sequences in memory simultaneously.

GPU: The Batch Balancing Act

On a GPU, you must balance three competing constraints:

  • Model size — weights consume VRAM
  • Context length — KV cache consumes VRAM
  • Batch size — activations consume VRAM

Increase any one, and you must decrease the others. Want longer context? Reduce batch size. Want larger batch? Reduce context. Want a bigger model? Reduce both.

GPU VRAM budget (80GB):
  Model weights:    40GB (fixed)
  KV cache:         20GB (varies with context)
  Activations:      20GB (varies with batch)

  Longer context = less batch
  Larger batch   = less context
  ALWAYS a tradeoff.
        

CPU: No Tradeoff Required

With 2-4TB of RAM, all three fit simultaneously:

  • Model size — even 400B+ models fit (800GB+ in FP16)
  • Context length — variable, up to 1M+ tokens
  • Batch size — whatever you need

You can have variable context length AND batch greater than one at the same time. No balancing act required.

CPU RAM budget (4TB):
  Model weights:    810GB (400B+ FP16)
  KV cache:         500GB (long context)
  Activations:      200GB (large batch)
  Remaining:        2,586GB (room to spare)

  Variable context + variable batch
  NO tradeoffs.
        

You Don't Need Large Batches — You Can Simulate Them

Training with batch=1 works. The gradient is noisier, but there are well-established techniques to get the benefits of large batches without the memory cost:

Technique             | How It Works                                            | GPU Benefit                           | CPU Benefit
Gradient Accumulation | Accumulate gradients over N forward passes, update once | Simulates batch=N with batch=1 memory | Same, but can also do actual batch=N
Micro-batching        | Process small sub-batches, aggregate gradients          | Fits in VRAM per micro-batch          | Can use larger micro-batches
Online Learning       | Update after every single example (batch=1)             | Works but GPU underutilized           | Natural fit for CPU sequential processing
Data Parallelism      | Each node processes different batch, average gradients  | Requires NVLink/IB for gradient sync  | RDMA gradient sync, same principle
The C-Kernel-Engine is actively being developed to prove this.
Our kernel architecture supports variable context lengths and flexible batch sizes natively. The quantized GEMV/GEMM kernels (Q4_K, Q5_0, Q5_1, Q5_K, Q6_K, Q8_0) are designed from the ground up for CPU-native inference and training — not as ports of GPU code. We're building the evidence that CPU-only training and inference at scale isn't theoretical — it's practical, and the kernel-level work is happening now.

Why CPU-Only is the Future

The Strategic Advantage

Technical Wins

  • Memory capacity: 4-6TB per node vs 80GB VRAM
  • No transfer bottleneck: CPU memory + CPU compute
  • Optimize interconnect: RDMA, core pinning, cache
  • Get close to theoretical FLOPS

Market Wins

  • 16x cheaper at scale
  • 200x more accessible to companies
  • Commodity hardware - no special requirements
  • Open ecosystem - standard Linux tools

Commodity Economics: The CPU Trajectory

CPUs Follow Commodity Price Curves

Unlike GPUs which are supply-constrained and vendor-controlled, CPUs are commodity hardware. Every improvement happens automatically, at scale, with competition driving prices down.

Component           | 2023       | 2025        | 2027 (projected) | Trend
Memory Standard     | DDR5-4800  | DDR5-6400   | DDR6             | +33% bandwidth per gen
Channels per Socket | 8          | 12          | 16               | +50% channels
Max RAM per Server  | 4TB        | 8TB         | 16TB+            | 2x per generation
Cores per Socket    | 64 (Genoa) | 128 (Turin) | 192+             | +50% per gen
L3 Cache            | 256MB      | 384MB       | 512MB+           | Growing
Ethernet Speed      | 100GbE     | 400GbE      | 800GbE           | 4x per 3 years
$/TFLOP (CPU)       | $500       | $300        | $150             | -50% per 2 years
$/TFLOP (GPU)       | $50        | $40         | $35              | -15% per 2 years
The Crossover Point
GPUs will always have higher peak FLOPS per dollar for raw compute. But for memory-bound ML inference (which is most real-world LLM usage), CPUs are already competitive. And the gap closes every year because CPU economics follow commodity curves.

Why CPU Economics Win

  • Competition: Intel vs AMD vs ARM
  • Volume: Billions of CPUs vs millions of GPUs
  • Supply: Multiple fabs, no bottlenecks
  • Standards: DDR5 is an industry standard, not proprietary
  • Software: Linux, GCC, standard tools (free)

Why GPU Economics Struggle

  • Monopoly: NVIDIA dominates (~90% market)
  • Supply-constrained: Artificial scarcity
  • Proprietary: CUDA lock-in, HBM limited fabs
  • High margins: No price competition pressure
  • Lead times: 12-24 month waits

Infinite CPU Scaling: The Internet Proof

CPUs Scale to Infinity

Unlike GPUs, which hit cost and availability ceilings, CPUs scale out through distributed computing across the internet. Every commodity machine, including yours, can contribute.

The Insight
Want more FLOPs? Add more CPUs and RDMA. Aggregate compute is an accumulation problem: keep adding commodity nodes until you reach 100 PFLOPS, with no waiting list and no vendor allocation. Distributed training turns that accumulation into usable throughput.

∞ Infinite Scaling

  • Internet-scale distributed computing
  • No ceiling: Add 10, 100, 1000 CPUs
  • Your computer contributes: Every device matters
  • Result: Unlimited FLOP accumulation

CPU Improvements Track

  • Cores: Increasing year over year
  • Cache: Growing bigger
  • Efficiency: Getting better
  • Cost: Decreasing over time

Memory Scaling

  • DDR5 → DDR6: higher bandwidth each generation
  • Memory slots: 8 → 12 → 16 slots per node
  • Total per node: 4TB → 8TB → 16TB RAM
  • Cost/GB: Constantly decreasing

Model Efficiency

  • Smaller models: Getting more efficient
  • Better quantization: Q4, Q8, hybrid methods
  • Architecture improvements: MoE (mixture of experts)
  • Training optimization: Better algorithms

The C-Kernel Engine Goal

Exploit this infinite CPU scaling on server-grade CPUs. As models get smaller and CPUs get more powerful, the combination becomes unbeatable.

Multi-Model Parallelism: Enterprise Real Workload

Unlike GPUs which are designed for single-model throughput, CPUs excel at running multiple models simultaneously on the same system.

Enterprise & Government: The Multi-Model Reality

The Real Enterprise Scenario
In government ministries, large enterprises, and multi-department organizations, different teams need to run different models simultaneously. This is where GPUs become a nightmare.

Government Ministry Example

  • Health Dept: Medical document analysis model
  • Finance Dept: Fraud detection model
  • Education Dept: Student assessment model
  • Transport Dept: Traffic prediction model
  • HR Dept: Resume screening model
  • Legal Dept: Contract analysis model

Reality: 6 departments, 6 different models, all need to run simultaneously.

Enterprise Example

  • Sales: Lead scoring + CRM assistant
  • Support: Customer chatbot + ticket routing
  • Engineering: Code review + documentation
  • Marketing: Content generation + analytics
  • Security: Threat detection + log analysis
  • Finance: Expense analysis + forecasting

Reality: Each department has specialized models for their domain.

The GPU Context-Switching Nightmare

GPU Model Switching: Slow and Expensive

GPU Workflow for Multi-Model Serving:
  1. Load Model A weights into VRAM (80GB)     → 30-60 seconds
  2. Run inference for Department A            → fast
  3. Unload Model A from VRAM                  → 10 seconds
  4. Load Model B weights into VRAM            → 30-60 seconds
  5. Run inference for Department B            → fast
  6. Repeat for each model...

Problem: Each context switch = 40-70 seconds of GPU sitting idle!
With 6 departments cycling through: most of the time is loading/unloading.

GPU Multi-Model Issues

  • VRAM limit: 80GB can only hold 1-2 large models
  • Context switching: Load/unload weights takes 30-60s per model
  • No sharing: GPU can't easily share between containers
  • MIG complexity: Multi-Instance GPU splits VRAM, reduces per-model capacity
  • Kubernetes pain: GPU scheduling requires special plugins, node affinity
  • Cost: Need dedicated GPU per department = $40K × 6 = $240K

CPU Multi-Model Solution

  • RAM abundance: 4TB holds ALL models simultaneously
  • No context switching: All models resident in RAM, instant access
  • Native sharing: Standard Linux process isolation works
  • Docker native: Just containers, no special GPU passthrough
  • Kubernetes native: Standard pod scheduling, no GPU plugins
  • Cost: One $60K server runs all 6 departments

Kubernetes & Docker: CPUs Are First-Class Citizens

The Container Orchestration Reality

GPU Kubernetes Deployment:
  - Install NVIDIA device plugin
  - Configure GPU node pools
  - Set resource limits: nvidia.com/gpu: 1
  - Deal with GPU memory fragmentation
  - Handle driver version mismatches
  - Manage MIG partitioning (if sharing)
  - Debug CUDA OOM errors in production

CPU Kubernetes Deployment:
  - Just deploy your container
  - Set resource limits: cpu: "32", memory: "64Gi"
  - Done. Standard k8s scheduling handles the rest.
    

CPU Multi-Model Architecture

┌──────────────────────────────────────────────────────────────────┐
│                    Single CPU Server (4TB RAM)                   │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐            │
│  │ Container 1  │  │ Container 2  │  │ Container 3  │            │
│  │ Health Model │  │ Finance Model│  │ Education    │            │
│  │ 70B (140GB)  │  │ 13B (26GB)   │  │ 7B (14GB)    │            │
│  │ 32 cores     │  │ 16 cores     │  │ 8 cores      │            │
│  └──────────────┘  └──────────────┘  └──────────────┘            │
│                                                                  │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐            │
│  │ Container 4  │  │ Container 5  │  │ Container 6  │            │
│  │ Transport    │  │ HR Model     │  │ Legal Model  │            │
│  │ 7B (14GB)    │  │ 3B (6GB)     │  │ 13B (26GB)   │            │
│  │ 8 cores      │  │ 4 cores      │  │ 16 cores     │            │
│  └──────────────┘  └──────────────┘  └──────────────┘            │
│                                                                  │
│  Total: 226GB used / 4TB available = 5.6% RAM utilization        │
│  All 6 models running simultaneously, no context switching       │
└──────────────────────────────────────────────────────────────────┘

Key insight: 6 models totaling 226GB fit easily in 4TB RAM. All running 24/7, no loading/unloading, instant response for any department.

Enterprise Reality Check
Most enterprise/government deployments need to run 5-20 different models for different use cases. CPUs handle this natively with standard containers. GPUs require expensive, complex multi-GPU setups with painful orchestration. This alone makes CPUs the practical choice for most real-world deployments.

CPU-Only: Complete Solution for All Model Sizes

One Architecture, All Scales

CPU-only isn't just for large models. It's the complete solution for every model size, for both training and inference.

Small Models (1-7B)

  • Training: Single CPU server
  • Inference: Laptop or edge device
  • Latency: interactive (milliseconds per token)
  • Cost: < $100/month

Use cases: Chatbots, code completion, mobile AI

Medium Models (7-70B)

  • Training: 2-8 CPU servers
  • Inference: Single CPU server
  • Throughput: High batch processing
  • Cost: $500-2000/month

Use cases: Enterprise AI, content generation, analysis

Large Models (70B+)

  • Training: 10-100 CPU servers
  • Inference: 2-20 CPU servers
  • Scale: Distributed across RDMA
  • Cost: $5000-50000/month

Use cases: Foundation models, research, large-scale analytics

The Unified Approach
Same CPU-only architecture scales from your laptop (1B model) to global clusters (1T+ parameters). No architecture changes, no GPU lock-in, no complexity multiplication.

Training at Every Scale

  • Small: Fine-tune on laptop (1-7B)
  • Medium: Train on workstation (7-70B)
  • Large: Distributed training (70B+)
  • Methodology: Same principles, same tools

Inference at Every Scale

  • Small: Edge deployment (1-7B)
  • Medium: Server deployment (7-70B)
  • Large: Distributed inference (70B+)
  • Performance: Optimized for each tier

Why This Matters

Organizations can start small and scale infinitely without ever switching architectures:

  1. Start: Train small model on laptop (fine-tuning)
  2. Grow: Move to server as model size increases
  3. Scale: Add more servers when needed (distributed training)
  4. Enterprise: Run multiple models on same hardware

No lock-in. No architecture migrations. No GPU dependency. Just scale as you grow.

Memory Reality: What NVIDIA Marketing Won't Tell You

The Hidden Truth About LLM Memory
GPU marketing focuses on FLOPS. Real-world LLM inference is dominated by memory bandwidth and capacity. Here are the actual numbers.

Activation Memory Per Token (Decode)

Memory Writes Per Token

For a typical 0.5B parameter model (hidden=896, intermediate=4864, 24 layers):

Operation            | Per Layer | 24 Layers
RMSNorm output       | 3.5 KB    | 84 KB
Q, K, V projections  | 10.5 KB   | 252 KB
Attention output     | 3.5 KB    | 84 KB
O projection         | 3.5 KB    | 84 KB
MLP gate + up        | 38 KB     | 912 KB
MLP down             | 3.5 KB    | 84 KB
KV cache (new token) | 1 KB      | 24 KB
Final logits (once)  | -         | 600 KB
Total per token      | ~63 KB    | ~2.1 MB

This is memory bandwidth consumed per generated token. Reducing this is key for decode performance.

Real-world profile (0.6B model, Q8_0 quantization, AVX-only CPU): decode time is dominated by memory-moving kernels -- ~48% MLP (gate/up + down), ~21% logits, ~29% attention projections (q/k/v + out), and ~2% attention core. This matches the bandwidth thesis: most time is spent streaming weights/activations, not math.

Context Length: The Memory Multiplier

KV Cache Formula

KV Cache = 2 Γ— n_layers Γ— n_kv_heads Γ— head_dim Γ— context_length Γ— bytes_per_element
           ↑
         (K and V)
Context | KV Cache (FP16) | + Model (140GB) | Fits 80GB GPU? | Fits 2TB Server?
8K      | 2.6 GB          | 143 GB          | No (need 2×)   | Yes
32K     | 10.5 GB         | 150 GB          | No (need 2×)   | Yes
128K    | 42 GB           | 182 GB          | No (need 3×)   | Yes
1M      | 335 GB          | 475 GB          | No (need 6×)   | Yes
πŸ“ Show the Math: How We Calculated This

Model Architecture (70B Dense Model)

Parameter         | Value | Explanation
n_layers          | 80    | Number of transformer layers
n_attention_heads | 64    | Query heads per layer
n_kv_heads        | 8     | KV heads (GQA: 8 groups, each serves 8 Q heads)
hidden_dim        | 8192  | Model hidden dimension
head_dim          | 128   | = hidden_dim / n_attention_heads = 8192/64

Step 1: KV Cache Per Token

// FP16 = 2 bytes per element
KV_per_token = 2 × n_layers × n_kv_heads × head_dim × bytes
             = 2 × 80 × 8 × 128 × 2
             = 327,680 bytes
             = 320 KB per token

Step 2: Scale by Context Length

Context | Calculation        | KV Cache
8K      | 8,192 × 320 KB     | 2.62 GB
32K     | 32,768 × 320 KB    | 10.5 GB
128K    | 131,072 × 320 KB   | 41.9 GB
1M      | 1,048,576 × 320 KB | 335 GB

Step 3: Add Model Weights

Model weights (70B params × 2 bytes FP16) = 140 GB

Total Memory = Model Weights + KV Cache
  8K context:   140 + 2.6  = 143 GB  → need 2× 80GB GPUs
  32K context:  140 + 10.5 = 150 GB  → need 2× 80GB GPUs
  128K context: 140 + 42   = 182 GB  → need 3× 80GB GPUs
  1M context:   140 + 335  = 475 GB  → need 6× 80GB GPUs

Why GQA Matters

Grouped Query Attention (GQA) reduces KV cache by sharing KV heads across Q heads:

Attention Type        | KV Heads | KV per Token | 128K KV Cache
Multi-Head (MHA)      | 64       | 2.56 MB      | 335 GB
Grouped Query (GQA-8) | 8        | 320 KB       | 42 GB
Multi-Query (MQA)     | 1        | 40 KB        | 5.2 GB

GQA-8 gives 8× memory savings over MHA with minimal quality loss.

The Long Context Reality
70B model + 1M context needs ~475 GB. That's 6× 80GB GPUs ($240K+) vs one 2TB CPU server ($30K). This isn't about FLOPS - it's about where the data can physically exist.

Prefill Memory Scaling

Prefill (processing the input prompt) requires storing activations for ALL tokens simultaneously:

Prefill Length | Activation Memory | + KV Cache
256 tokens     | ~400 MB           | ~6 MB
1K tokens      | ~1.5 GB           | ~24 MB
4K tokens      | ~6 GB             | ~96 MB
16K tokens     | ~24 GB            | ~384 MB

Prefill is compute-bound but still needs memory for activations. Long prompts can exceed GPU VRAM.

Visual Guide

Memory Reality Infographic

GPU vs CPU memory comparison for LLM inference:

Memory Reality - GPU vs CPU comparison showing memory capacity, 70B model fit analysis, and cost comparison


Training Advantages: Where CPUs Dominate

Critical Training Scenarios

CPU-only architecture solves training problems that GPU-only simply can't handle.

1. Long Context Training: KV Cache Explosion

The Problem: Long context windows make the KV cache explode. The cache grows linearly with context length (it is attention compute that grows quadratically), so at training batch sizes:

  • 4K context: ~32GB KV cache
  • 32K context: ~256GB KV cache
  • 128K context: ~1TB KV cache
  • 1M context: ~8TB KV cache

CPU Advantage:

  • 4-16TB RAM: Fits massive KV caches
  • No transfer bottleneck: Everything in memory
  • GPU Reality: KV cache doesn't fit = can't train
Result: GPU training hits the memory wall around 32K context. CPUs with multi-TB RAM push toward 1M+ context.

2. Massive Batch Training

The Need: Large batch sizes improve convergence and throughput

  • Standard batch: 32-128 samples
  • Large batch: 512-2048 samples
  • Massive batch: 8192+ samples
  • Enterprise batch: 100K+ samples

CPU Scaling Strategy:

  • 32 CPUs: 32 batches in parallel
  • Each batch: 1-10TB RAM available
  • Total effective batch: 32x larger
  • Minimal communication overhead: one gradient all-reduce per step
Result: Scale to any batch size by adding more CPUs. No GPU memory constraints!

The Training Reality Check

GPU Training Limits

  • ❌ KV cache must fit in VRAM → max context ~32K
  • ❌ Batch size limited by VRAM
  • ❌ Large batches require model parallelism
  • ❌ Complex coordination across GPUs
  • ❌ Expensive hardware (NVLink required)

CPU Training Advantages

  • ✅ KV cache fits → context up to 1M+
  • ✅ Batch size scales with CPUs
  • ✅ Data parallelism = simple scaling
  • ✅ RDMA for necessary communication
  • ✅ Commodity hardware (Ethernet)
Why This Matters
Real-world training needs long context (RAG, document analysis) and large batches (throughput, convergence). CPU-only architecture handles both naturally. GPU-only hits walls that require expensive workarounds.

Distributed Architecture

RDMA-Connected CPU Cluster

┌─────────────────────────────────────────────────────────┐
│                  RDMA Fabric (100Gbps+)                 │
└────────┬───────────────────┬───────────────────┬────────┘
         │                   │                   │
         ▼                   ▼                   ▼
   ┌───────────┐       ┌───────────┐       ┌───────────┐
   │ Node 0    │       │ Node 1    │       │ Node 2    │  ...  Node N
   │ 128 cores │       │ 128 cores │       │ 128 cores │
   │ 2TB RAM   │       │ 2TB RAM   │       │ 2TB RAM   │
   │           │       │           │       │           │
   │ Layers    │       │ Layers    │       │ Layers    │
   │ 0-15      │       │ 16-31     │       │ 32-47     │
   └───────────┘       └───────────┘       └───────────┘

Parallelism Strategies

1. Pipeline Parallelism

Different layers on different nodes. Activations flow through the pipeline.

Node 0: Layers 0-15   →  activations  →  Node 1: Layers 16-31  →  ...
        (forward)           (RDMA)              (forward)

Communication: Send activations between pipeline stages via RDMA.

2. Tensor Parallelism

Large matrices split across nodes. Each node computes a shard.

// 16384 x 16384 weight matrix split across 4 nodes
Node 0: W[0:4096, :]      // Shard 0
Node 1: W[4096:8192, :]   // Shard 1
Node 2: W[8192:12288, :]  // Shard 2
Node 3: W[12288:16384, :] // Shard 3

// After local GEMM, all-reduce to combine

Communication: RDMA all-reduce after each sharded operation.

3. Data Parallelism

Same model replicated. Different batches. Gradient averaging.

Node 0: Model copy, Batch 0  →  gradients  ─┐
Node 1: Model copy, Batch 1  →  gradients  ─┼→  All-reduce  →  Update all
Node 2: Model copy, Batch 2  →  gradients  ─┤
Node 3: Model copy, Batch 3  →  gradients  ─┘

RDMA: The Key Enabler

Why RDMA?

Remote Direct Memory Access - Zero-copy, kernel-bypass networking.

Metric        | TCP/IP                | RDMA
Latency       | ~50-100 μs            | ~1-2 μs
Bandwidth     | 10-25 Gbps            | 100-400 Gbps
CPU overhead  | High (kernel, copies) | Near zero
Memory copies | Multiple              | Zero (DMA)

RDMA Primitives We Need

// One-sided operations (no remote CPU involvement)
rdma_write(remote_addr, local_buf, size);  // Write to remote memory
rdma_read(local_buf, remote_addr, size);   // Read from remote memory

// Collective operations (built on one-sided)
rdma_allreduce(buf, size, SUM);  // Gradient averaging
rdma_broadcast(buf, size, root); // Weight distribution
rdma_barrier();                   // Synchronization

Implementation Roadmap

Phase | Feature                           | Status
1     | Single-node training (current)    | Done
2     | Multi-core parallelism (OpenMP)   | Done
3     | RDMA communication primitives     | Planned
4     | Pipeline parallelism              | Planned
5     | Tensor parallelism (sharded GEMM) | Planned
6     | Encoder + cross-attention         | Planned
7     | 600B+ training                    | Future

The Math Doesn't Change

From Tiny to Massive: Same Operations

Forward pass (any size model):
1. embed_tokens()           // Lookup: tokens β†’ vectors
2. for each layer:
   a. rmsnorm()             // Normalize
   b. linear() × 3          // Q, K, V projections
   c. rope()                // Rotary embeddings
   d. attention()           // Softmax(QK^T)V
   e. linear()              // Output projection
   f. residual_add()        // Skip connection
   g. rmsnorm()             // Normalize
   h. mlp_swiglu()          // FFN with gating
   i. residual_add()        // Skip connection
3. rmsnorm()                // Final norm
4. lm_head()                // Logits

Backward pass: Same operations in reverse.
SGD: weights -= lr * gradients

That's it. For any model size.
    

Hardware Recommendations

For Different Scales

Model Size  | Recommended Setup
< 7B        | Single server, 32+ cores, 128GB+ RAM
7B - 70B    | Single server, 128 cores, 512GB-2TB RAM
70B - 200B  | 2-4 nodes, RDMA interconnect, 2TB RAM each
200B - 600B | 8-16 nodes, 100Gbps+ RDMA fabric
600B+       | 32+ nodes, 400Gbps RDMA, pipeline + tensor parallel

Recommended CPUs by Instruction Set

Minimum requirement: AVX (256-bit) or NEON (128-bit). For best performance, AVX-512 or AMX.

RDMA Options

Why Not GPU?

"Nvidia, f**k you."
— Linus Torvalds, 2012, regarding NVIDIA's closed-source Linux drivers

Beyond the open-source concerns:

Further Reading
