CKE Throughput Unit
The 1 PB/s North-Star Unit
C-Kernel-Engine cares about a practical systems question: how fast can the cumulative compute system cycle active model data to produce tokens? FLOPS and TOPS are useful hardware numbers, but they do not directly answer whether a real model can move its weights, activations, KV cache, intermediate buffers, and network payloads quickly enough to generate the next token.
The proposed unit, CKU, is the aggregate number of bytes cycled per second across the full token path. The north-star target is 1 petabyte/sec in decimal units, which is the same as 1000 terabytes/sec, or 1 terabyte every millisecond.
The Unit
CKU = active_bytes_per_token / seconds_per_token
CKU stands for C-Kernel throughput unit. It is not just DRAM bandwidth from one socket, and it is not just matrix FLOPS. It is an end-to-end systems rate: how many bytes per second the runtime touches, transforms, caches, transmits, or reuses in order to produce tokens. Depending on the model and phase, those bytes can include:
- active weights read for the token path
- activation buffers and scratch buffers
- KV cache reads and writes, or recurrent state updates for SSM-style models
- prefill token blocks, decode token steps, and batching effects
- NUMA traffic, cache movement, and inter-node communication
- quantized layout conversion, repacking, and dequantization overhead
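As one concrete instance of the terms above, the KV-cache traffic for a single decode step can be estimated from the model shape. The sketch below assumes standard attention with an uncompressed per-layer K and V cache and no sliding window. The function name and the model shape are hypothetical and chosen only for illustration.

```python
def kv_cache_bytes_per_decode_step(n_layers, n_kv_heads, head_dim,
                                   context_len, bytes_per_elem):
    """Estimate KV-cache bytes read and written to decode one token.

    Assumes standard attention with an uncompressed per-layer K and V
    cache, no sliding window, and no paging overhead.
    """
    # Size of one position's K (or V) vector in one layer.
    per_position = n_kv_heads * head_dim * bytes_per_elem
    # Every layer reads K and V for all cached positions.
    read_bytes = 2 * n_layers * context_len * per_position
    # Every layer appends K and V for the new token.
    write_bytes = 2 * n_layers * per_position
    return read_bytes, write_bytes


# Hypothetical model shape, not taken from any specific model.
read_b, write_b = kv_cache_bytes_per_decode_step(
    n_layers=32, n_kv_heads=8, head_dim=128, context_len=4096, bytes_per_elem=2
)
print(f"KV read per token:  {read_b / 1e9:.3f} GB")
print(f"KV write per token: {write_b / 1e3:.1f} KB")
```

The read term grows linearly with context length, which is why the same model can have a very different active-byte total early and late in a long generation.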
Why 1 PB/s?
A very large modern model can easily imply hundreds of gigabytes to terabytes of active weight and state movement across the token path, even before accounting for prefill, KV cache, MoE routing, recurrent state, or distributed execution. Quantization and MoE reduce the active bytes touched per token, but the optimization problem remains the same: the faster the system can cycle the active bytes, the faster it can produce useful output.
| Active bytes per token | Token latency | Required aggregate rate | Implied token rate |
|---|---|---|---|
| 1 TB | 1 second | 1 TB/s | 1 token/sec |
| 1 TB | 10 ms | 100 TB/s | 100 tokens/sec |
| 1 TB | 1 ms | 1 PB/s | 1000 tokens/sec |
| 100 GB | 1 ms | 100 TB/s | 1000 tokens/sec |
This is why the 1 PB/s unit matters. It turns a vague performance goal into a concrete systems target: if the active token path is one terabyte and the target is one millisecond per token, the cumulative runtime has to behave like a one-petabyte-per-second machine. That machine does not have to be one CPU. It can be a coordinated set of CPU sockets, memory channels, NUMA domains, and Linux nodes.
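The arithmetic behind the table can be reproduced in a few lines. This is a minimal sketch: `required_rate` and `units_needed` are illustrative names, the per-socket rate is a made-up placeholder rather than a measured CKE figure, and the unit count assumes perfect linear scaling, which real multi-socket and multi-node systems will not reach.

```python
import math

TB = 1e12  # decimal terabyte, so that 1 PB/s equals 1000 TB/s


def required_rate(active_bytes_per_token, seconds_per_token):
    """Aggregate bytes/sec needed to hit a per-token latency target."""
    return active_bytes_per_token / seconds_per_token


def units_needed(rate_bytes_per_s, per_unit_bytes_per_s):
    """Minimum count of equal units, assuming perfect linear scaling."""
    return math.ceil(rate_bytes_per_s / per_unit_bytes_per_s)


# Same (active bytes, latency) pairs as the table rows.
rows = [(1 * TB, 1.0), (1 * TB, 10e-3), (1 * TB, 1e-3), (0.1 * TB, 1e-3)]
for active, latency in rows:
    rate = required_rate(active, latency)
    print(f"{active / TB:5.1f} TB  {latency * 1e3:7.1f} ms  "
          f"{rate / TB:7.1f} TB/s  {1 / latency:6.0f} tokens/sec")

# Hypothetical sustained rate per socket; an assumption for illustration.
per_socket = 0.5 * TB
sockets = units_needed(required_rate(1 * TB, 1e-3), per_socket)
print(f"{sockets} sockets at {per_socket / TB} TB/s each")
```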
What Counts as Active Bytes?
Active bytes are the bytes that matter for the current token path. They are not the total size of every file on disk, and they are not necessarily the full parameter footprint of the model. For a dense model, the active path may touch most weights every token. For a Mixture-of-Experts model, only the selected experts and the shared (non-expert) layers are active. For recurrent or SSM-style models, the KV-cache term may shrink or disappear for many layers, but recurrent state updates still count.
This makes CKU useful across model families. A 5 TB model with quantization, MoE routing, and sparse activation may have a much smaller active-byte path per token than its full stored footprint. CKE cares about the measured active path: which weights are actually read, which activations are produced, which cache/state buffers are updated, and which bytes cross sockets or nodes.
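The gap between stored footprint and active path can be sketched with a simple accounting of weight bytes. The function below is an illustration under stated assumptions: it ignores quantization metadata such as scales and zero points, it models a single token in isolation, and every parameter count is hypothetical.

```python
def weight_bytes(shared_params, params_per_expert, n_experts,
                 experts_per_token, bits_per_weight):
    """Return (stored_bytes, active_bytes_per_token) for the weights.

    shared_params:     parameters every token touches (attention, embeddings)
    params_per_expert: parameters in one expert, summed over all layers
    A dense model is the case n_experts == experts_per_token == 1.
    Quantization scales and other metadata are ignored.
    """
    bytes_per_weight = bits_per_weight / 8
    stored = (shared_params + n_experts * params_per_expert) * bytes_per_weight
    active = (shared_params + experts_per_token * params_per_expert) * bytes_per_weight
    return stored, active


# Hypothetical MoE shape at 4 bits per weight.
stored, active = weight_bytes(
    shared_params=10_000_000_000,
    params_per_expert=5_000_000_000,
    n_experts=64,
    experts_per_token=2,
    bits_per_weight=4,
)
print(f"stored: {stored / 1e9:.0f} GB, active per token: {active / 1e9:.0f} GB")
```

With batching, different tokens may select different experts, so the set of experts touched per step can be larger than `experts_per_token`. That is one reason the active path should be measured rather than derived from the model card alone.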
What CKE Optimizes
CKE does not optimize a single number in isolation. It tries to organize the whole token path so useful bytes stay close to useful compute:
- Kernel layout: Q4/Q5/Q6/Q8 formats, packed layouts, SIMD-friendly loops, and shape-gated dispatch.
- Memory layout: contiguous bump files, planned activation buffers, cache-aligned sections, and deterministic offsets.
- Runtime scheduling: prefill vs decode kernels, persistent thread pools, batching, and pipeline staging.
- Linux tuning: CPU affinity, huge pages, NUMA placement, cache behavior, and perf/VTune/Advisor measurement.
- Cluster scaling: model/layer ownership, activation movement, gradient movement, MPI/RDMA paths, and topology-aware placement.
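To make the memory-layout item above concrete, here is a generic bump-allocation sketch that assigns cache-aligned, deterministic offsets to sections inside one contiguous buffer. It is not CKE's layout code. The section names, sizes, and the 64-byte alignment are assumptions for illustration.

```python
def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment (a power of two)."""
    return (offset + alignment - 1) & ~(alignment - 1)


def plan_layout(sections, alignment=64):
    """Assign each (name, size_bytes) section an aligned offset.

    Returns (offsets, total_size). The same input always produces the
    same offsets, so kernels can address sections without runtime
    allocation.
    """
    offsets = {}
    cursor = 0
    for name, size in sections:
        cursor = align_up(cursor, alignment)
        offsets[name] = cursor
        cursor += size
    return offsets, align_up(cursor, alignment)


# Hypothetical section names and sizes.
sections = [("tok_embed", 1000), ("attn_qkv", 4096), ("mlp_scratch", 130)]
offsets, total = plan_layout(sections)
for name, off in offsets.items():
    print(f"{name:12s} offset={off}")
print("total bytes:", total)
```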
How to Measure It
CKU is meant to be measured, not asserted. A practical report should show the model, phase, context length, batch size, active-byte estimate, token latency, and the resulting aggregate rate. The same run should also show CPU utilization, memory bandwidth, cache behavior, and network transfer if the run spans nodes.
active_bytes_per_token = weights_touched
                       + activations_touched
                       + cache_or_state_touched
                       + communication_touched
seconds_per_token = measured_wall_time / generated_tokens
CKU = active_bytes_per_token / seconds_per_token
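The formulas above map directly onto a small record type. This is a minimal sketch, not a CKE API: the class name is hypothetical, the field names mirror the terms in the formula, the first four fields are assumed to be bytes per token, and the sample values are invented rather than measured.

```python
from dataclasses import dataclass


@dataclass
class TokenPathSample:
    # Bytes per token, one field per term in the formula.
    weights_touched: int
    activations_touched: int
    cache_or_state_touched: int
    communication_touched: int
    # Wall-clock measurement for the run.
    measured_wall_time: float  # seconds
    generated_tokens: int

    @property
    def active_bytes_per_token(self):
        return (self.weights_touched + self.activations_touched
                + self.cache_or_state_touched + self.communication_touched)

    @property
    def seconds_per_token(self):
        return self.measured_wall_time / self.generated_tokens

    @property
    def cku(self):
        """Aggregate bytes per second across the token path."""
        return self.active_bytes_per_token / self.seconds_per_token


# Invented numbers for a single-node run (no communication term).
sample = TokenPathSample(
    weights_touched=8_000_000_000,
    activations_touched=200_000_000,
    cache_or_state_touched=500_000_000,
    communication_touched=0,
    measured_wall_time=12.8,
    generated_tokens=256,
)
print(f"{sample.active_bytes_per_token / 1e9:.1f} GB/token, "
      f"{sample.seconds_per_token * 1e3:.1f} ms/token, "
      f"CKU = {sample.cku / 1e12:.3f} TB/s")
```

Keeping the four terms separate in the record, rather than storing only the sum, makes it easier to see which part of the token path dominates when comparing runs.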
This is the reason CKE keeps investing in visualizers, memory layouts, kernel maps, perf counters, VTune/Advisor profiles, and future MPI lanes. The generated C code is only one part of the system. The larger optimization problem is arranging the full data path so the machine spends less time waiting for bytes and more time turning those bytes into useful output.
Not a FLOPS Replacement
FLOPS still matter. Matrix units still matter. SIMD still matters. But for LLM inference and training, raw arithmetic throughput is only useful when the data reaches the execution units at the right time and in the right layout. CKU is a complementary metric: it asks whether the whole system can cycle the bytes that the model actually needs.
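One common way to relate the two views is a roofline-style bound, used here as an illustration rather than as part of the CKU definition. The sketch assumes data movement and arithmetic overlap perfectly, so the slower of the two sets the floor on token latency. All hardware and model numbers are hypothetical.

```python
def token_time_lower_bound(active_bytes, flops,
                           bandwidth_bytes_per_s, peak_flops_per_s):
    """Roofline-style lower bound on seconds per token.

    Assumes perfect overlap of data movement and arithmetic, so the
    larger of the two times is the floor. Returns (seconds, limiter).
    """
    memory_time = active_bytes / bandwidth_bytes_per_s
    compute_time = flops / peak_flops_per_s
    limiter = "memory" if memory_time >= compute_time else "compute"
    return max(memory_time, compute_time), limiter


# Hypothetical decode step: 10 GB of active bytes, 40 GFLOP of work,
# on a machine with 200 GB/s of bandwidth and 2 TFLOP/s of peak compute.
seconds, limiter = token_time_lower_bound(
    active_bytes=10e9, flops=40e9,
    bandwidth_bytes_per_s=200e9, peak_flops_per_s=2e12,
)
print(f"at least {seconds * 1e3:.0f} ms per token, {limiter}-bound")
```

In this made-up case, adding arithmetic throughput would not lower the bound until the byte rate improves.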
A single workstation is not expected to cycle 1 PB/s today. The point is to make the long-horizon optimization problem explicit: organize CPU kernels, memory hierarchy, Linux scheduling, and distributed nodes so aggregate active-byte throughput keeps rising.