GGUF to Bump Allocator
This document explains how C-Kernel-Engine handles quantized weights from GGUF files (llama.cpp format) and converts them to our bump allocator layout for efficient inference and training.
Why Quantization Matters
- Q4_K uses ~4.5 bits/weight vs 32 bits for FP32
- Less data to move from DRAM = faster inference
- Q4_K_M preserves model quality remarkably well
GGUF File Format Overview
GGUF (GGML Universal Format) is llama.cpp's binary format for storing quantized models. Understanding it is essential for our conversion pipeline.
Q4_K Block Structure
Q4_K is a "K-quant" format with nested scales. Each super-block contains 256 weights organized into 8 sub-blocks of 32 weights each. This two-level scaling provides better accuracy than simple 4-bit quantization.
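One plausible C rendering of this layout, assuming the two fp16 scale fields are stored as raw `uint16_t` (the engine's actual definition lives in `include/ckernel_quant.h` and may differ in naming):

```c
#include <stdint.h>

// Sketch of one Q4_K super-block, assuming fp16 fields stored as raw uint16_t.
// 2 + 2 + 12 + 128 = 144 bytes per 256 weights = 4.5 bits/weight.
typedef struct {
    uint16_t d;          // super-block scale (fp16)
    uint16_t dmin;       // super-block min scale (fp16)
    uint8_t  scales[12]; // 8 sub-block scales + 8 mins, 6-bit packed
    uint8_t  qs[128];    // 256 weights x 4 bits
} block_q4_K;
```

The 144-byte size is what produces the 4.5 bits/weight figure quoted above.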
Scale Unpacking (The Tricky Part)
The 12-byte scales array packs 8 scales and 8 mins in 6-bit format. Unpacking requires careful bit manipulation:
6-bit Packed Layout
```c
// scales[12] = 96 bits = 16 x 6-bit values
// First 8 values: scales (sc[0..7])
// Last 8 values:  mins (m[0..7])

// Unpacking sc[0..7] from bytes 0-5:
sc[0] = scales[0] & 0x3F;                             // bits 0-5 of byte 0
sc[1] = (scales[0] >> 6) | ((scales[1] & 0x0F) << 2); // bits 6-7 of byte 0, bits 0-3 of byte 1
sc[2] = (scales[1] >> 4) | ((scales[2] & 0x03) << 4); // bits 4-7 of byte 1, bits 0-1 of byte 2
sc[3] = scales[2] >> 2;                               // bits 2-7 of byte 2
// ... pattern repeats for sc[4..7] using bytes 3-5

// Unpacking m[0..7] from bytes 6-11: same pattern as sc[0..7]
```
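The hand-unrolled form above can also be written as a loop over a little-endian stream of 16 six-bit values. This is a sketch under that assumption (`get_6bit` is an illustrative helper, not an engine API); it produces the same results as the unrolled code for the layout described:

```c
#include <stdint.h>

// Illustrative helper: read the idx-th 6-bit value from a 12-byte
// little-endian bitstream. A 6-bit field spans at most two bytes.
static uint8_t get_6bit(const uint8_t *p, int idx) {
    int bit   = idx * 6;
    int byte  = bit >> 3;
    int shift = bit & 7;
    uint16_t v = p[byte] | ((uint16_t)(byte + 1 < 12 ? p[byte + 1] : 0) << 8);
    return (v >> shift) & 0x3F;
}

// Unpack all 8 scales (bytes 0-5) and 8 mins (bytes 6-11).
void unpack_q4_k_scales(const uint8_t scales[12], uint8_t sc[8], uint8_t m[8]) {
    for (int i = 0; i < 8; i++) {
        sc[i] = get_6bit(scales, i);     // values 0-7:  scales
        m[i]  = get_6bit(scales, 8 + i); // values 8-15: mins
    }
}
```

For example, value 1 occupies bits 6-7 of byte 0 and bits 0-3 of byte 1, matching `sc[1]` in the unrolled version.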
GGUF to Bump Conversion Pipeline
Bump File Header Format
The bump allocator uses a 128-byte header that contains all model configuration needed at runtime:
```
Bump Header Structure (128 bytes total)
=======================================

Offset  Size  Field              Description
------  ----  -----              -----------
0x00    8     magic              "BUMPWGT3" (8 chars)
0x08    4     version            3 (current version)
0x0C    4     model_type         1 = decoder-only LLM
0x10    4     num_layers         Number of transformer layers
0x14    4     vocab_size         Vocabulary size
0x18    4     embed_dim          Hidden dimension (hidden_size)
0x1C    4     context_len        Max sequence length
0x20    4     num_heads          Number of attention heads
0x24    4     head_dim           Per-head dimension (embed_dim / num_heads)
0x28    8     aligned_embed_dim  64-byte aligned embed dim
0x30    8     aligned_head_dim   64-byte aligned head dim
0x38    8     aligned_context    64-byte aligned context length
0x40    32    checksum           SHA-256 of payload (after header)
0x60    32    reserved           Future use (zeros)
```
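The offsets in the table happen to satisfy natural alignment, so the header can be expressed as a padding-free C struct. A sketch (field names follow the table; the engine's actual struct may differ):

```c
#include <stdint.h>
#include <stddef.h>

// Sketch of the 128-byte bump header. All fields fall on their natural
// alignment, so no compiler padding is introduced on common ABIs.
typedef struct {
    char     magic[8];           // "BUMPWGT3"
    uint32_t version;            // 3
    uint32_t model_type;         // 1 = decoder-only LLM
    uint32_t num_layers;
    uint32_t vocab_size;
    uint32_t embed_dim;
    uint32_t context_len;
    uint32_t num_heads;
    uint32_t head_dim;
    uint64_t aligned_embed_dim;  // 64-byte aligned
    uint64_t aligned_head_dim;
    uint64_t aligned_context;
    uint8_t  checksum[32];       // SHA-256 of payload
    uint8_t  reserved[32];       // zeros
} bump_header_t;
```

A loader can read these 128 bytes, check `magic` and `version`, then verify the SHA-256 of everything after the header against `checksum`.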
Why Aligned Dimensions?
Cache Line Alignment
Modern CPUs fetch data in 64-byte cache lines. Misaligned access causes:
- Two cache line fetches instead of one
- Cache line splitting penalties
- Reduced SIMD throughput
AVX-512 Requirements
AVX-512 processes 16 floats (64 bytes) per instruction:
- Aligned loads: `_mm512_load_ps` (fast)
- Unaligned loads: `_mm512_loadu_ps` (slower)
- Padding zeros don't affect results
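The `aligned_*` header fields presumably come from a round-up-to-cache-line rule. A sketch for float-valued dimensions (16 floats per 64-byte line); the rounding rule is assumed, not taken from the engine source:

```c
#include <stddef.h>

// Round a float count up to a multiple of 16 so each row starts on a
// 64-byte cache line boundary; assumed rule behind the aligned_* fields.
static inline size_t align_up_floats(size_t n) {
    return (n + 15) & ~(size_t)15;
}
```

Dimensions that are already multiples of 16 (e.g. embed_dim 3584) pass through unchanged; the padded tail is filled with zeros, which the bullet above notes is harmless to SIMD results.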
Per-Layer Weight Layout
Each transformer layer's weights are stored contiguously in the bump file:
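A minimal sketch of the bump-allocation pattern behind this layout (names illustrative, not the engine's API): each tensor is carved from one preallocated arena at the next 64-byte boundary, so tensors allocated back-to-back for a layer end up contiguous.

```c
#include <stddef.h>
#include <stdint.h>

// Minimal bump allocator sketch: a single arena, a cursor, no free().
typedef struct {
    uint8_t *base;
    size_t   used;
    size_t   cap;
} bump_t;

// Return a 64-byte-aligned region of `size` bytes, or NULL if the arena
// is exhausted. Allocation order determines layout, so per-layer weights
// allocated consecutively are stored consecutively.
static void *bump_alloc(bump_t *b, size_t size) {
    size_t off = (b->used + 63) & ~(size_t)63; // align cursor to cache line
    if (off + size > b->cap) return NULL;
    b->used = off + size;
    return b->base + off;
}
```

Because nothing is ever freed, loading is a single pass: reserve the arena, then walk the file's tensors in order, bumping the cursor for each.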
Dequantization Kernels
At inference time, we dequantize weights on the fly during GEMM operations. This trades compute for memory bandwidth, a good trade on modern CPUs where memory, not arithmetic, is the bottleneck.
Q4_K Dequantization (Scalar Reference)
```c
void dequant_q4_k_block(const block_q4_K *block, float *output) {
    const float d    = fp16_to_fp32(block->d);
    const float dmin = fp16_to_fp32(block->dmin);

    // Unpack 6-bit scales and mins
    uint8_t sc[8], m[8];
    unpack_q4_k_scales(block->scales, sc, m);

    // Process 8 sub-blocks of 32 weights each
    for (int sub = 0; sub < 8; sub++) {
        const float scale   = d * (float)sc[sub];
        const float min_val = dmin * (float)m[sub];
        const uint8_t *qs = &block->qs[sub * 16];
        float *out = &output[sub * 32];

        for (int i = 0; i < 16; i++) {
            const uint8_t packed = qs[i];
            const int8_t q0 = (packed & 0x0F) - 8; // Lower nibble
            const int8_t q1 = (packed >> 4) - 8;   // Upper nibble
            out[2*i + 0] = scale * (float)q0 + min_val;
            out[2*i + 1] = scale * (float)q1 + min_val;
        }
    }
}
```
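The kernel above relies on an `fp16_to_fp32` helper. A portable scalar sketch is below; an optimized build would more likely use the F16C `_mm_cvtph_ps` intrinsic or `_Float16`, so treat this as a reference implementation, not the engine's actual code:

```c
#include <stdint.h>
#include <string.h>

// Reference IEEE 754 half -> float conversion (normals, subnormals,
// zeros, inf and NaN). fp16: 1 sign, 5 exponent (bias 15), 10 mantissa.
static float fp16_to_fp32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t mant = h & 0x3FF;
    uint32_t bits;
    if (exp == 0x1F) {                       // inf / NaN
        bits = sign | 0x7F800000u | (mant << 13);
    } else if (exp == 0) {
        if (mant == 0) {
            bits = sign;                     // +/- zero
        } else {                             // subnormal: renormalize
            exp = 127 - 15 + 1;
            while (!(mant & 0x400)) { mant <<= 1; exp--; }
            mant &= 0x3FF;
            bits = sign | (exp << 23) | (mant << 13);
        }
    } else {                                 // normal: rebias exponent
        bits = sign | ((exp - 15 + 127) << 23) | (mant << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);             // safe type pun
    return f;
}
```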
AVX-512 Optimized Dequantization
```c
// Requires <immintrin.h> and a build with AVX-512F enabled (-mavx512f).
static inline void dequant_q4_k_subblock_avx512(
    const uint8_t *qs,  // 16 bytes = 32 x 4-bit weights
    float scale,        // d * sc[sub]
    float min_val,      // dmin * m[sub]
    __m512 *out0,       // Output: lower-nibble weights (even indices 0,2,...,30)
    __m512 *out1)       // Output: upper-nibble weights (odd indices 1,3,...,31)
{
    const __m512 vscale   = _mm512_set1_ps(scale);
    const __m512 vmin     = _mm512_set1_ps(min_val);
    const __m512i offset  = _mm512_set1_epi32(8);
    const __m512i mask_lo = _mm512_set1_epi32(0x0F);

    // Load 16 bytes, zero-extend each to a 32-bit lane
    __m128i packed = _mm_loadu_si128((const __m128i *)qs);
    __m512i bytes  = _mm512_cvtepu8_epi32(packed);

    // Extract lower nibbles: (byte & 0x0F) - 8
    __m512i lo = _mm512_and_epi32(bytes, mask_lo);
    lo = _mm512_sub_epi32(lo, offset);

    // Extract upper nibbles: (byte >> 4) - 8
    __m512i hi = _mm512_srli_epi32(bytes, 4);
    hi = _mm512_sub_epi32(hi, offset);

    // Convert to float, then apply scale and min in a single FMA each
    *out0 = _mm512_fmadd_ps(_mm512_cvtepi32_ps(lo), vscale, vmin);
    *out1 = _mm512_fmadd_ps(_mm512_cvtepi32_ps(hi), vscale, vmin);
}
```
Make Commands
Use these commands to work with GGUF files:
| Command | Description |
|---|---|
| `make test-quant` | Run quantization kernel tests (dequant + q4/q8 gemm) |
| `make gguf-inspect GGUF=path` | Inspect GGUF tensor dtypes (what is quantized?) |
| `make gguf-list GGUF=path` | List all GGUF tensors (name/type/shape) |
| `make gguf-to-bump GGUF=path` | Convert GGUF to bump weights (outputs to GGUF_OUT/) |
Example Usage
```sh
# Inspect a GGUF file to see what's quantized
make gguf-inspect GGUF=models/qwen2.5-3b-instruct-q4_k_m.gguf

# Output:
#   [gguf] version=3 arch=llama tensors=291 kv=26 alignment=32
#   [gguf] tensor types:
#     - Q4_K: 225 tensors, bytes=1.71 GiB
#     - F32: 66 tensors, bytes=48.05 MiB
#   [gguf] key tensors:
#     - token_embd.weight: Q4_K dims=(3584, 151936)
#     - blk.0.attn_q.weight: Q4_K dims=(3584, 3584)

# Convert to bump format
make gguf-to-bump GGUF=models/qwen2.5-3b-instruct-q4_k_m.gguf

# Output files:
#   GGUF_OUT/weights.bump - Quantized weights
#   GGUF_OUT/config.json  - HuggingFace-style config
```
Quantization Format Comparison
| Format | Block Size | Bytes/Block | Bits/Weight | Quality | Use Case |
|---|---|---|---|---|---|
| FP32 | 1 | 4 | 32.0 | Baseline | Training, norms |
| FP16 | 1 | 2 | 16.0 | ~Same | GPU inference |
| Q8_0 | 32 | 34 | 8.5 | Excellent | Activations |
| Q4_0 | 32 | 18 | 4.5 | Good | Simple quant |
| Q4_K | 256 | 144 | 4.5 | Very Good | Weights (primary) |
| Q6_K | 256 | 210 | 6.6 | Excellent | Quality-sensitive |
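The Bits/Weight column follows directly from block geometry: bytes per block times 8, divided by weights per block. A quick check:

```c
// Bits per weight from block geometry: bytes_per_block * 8 / block_size.
static double bits_per_weight(int bytes_per_block, int block_size) {
    return bytes_per_block * 8.0 / block_size;
}
```

Q6_K works out to exactly 6.5625 bits/weight, which the table rounds to 6.6.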
Integration with Bump Allocator
The bump allocator loads weights once at startup and keeps them in memory. The key insight is that quantized weights stay quantized: we never allocate space for dequantized FP32 copies of the weights.
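This is what makes the footprint small. For a Q4_K matrix whose column count is a multiple of the 256-weight super-block, the arena bytes follow directly from the block geometry (the helper name here is illustrative):

```c
#include <stddef.h>

// Arena bytes for a [rows x cols] Q4_K matrix, assuming cols is a
// multiple of the 256-weight super-block (144 bytes per block).
static size_t q4k_tensor_bytes(size_t rows, size_t cols) {
    return rows * (cols / 256) * 144;
}
```

For a 3584 x 3584 attention projection this is about 7.2 MB, versus roughly 51 MB if the same tensor were held as FP32.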
Source Files
| File | Description |
|---|---|
| `scripts/convert_gguf_to_bump.py` | GGUF to bump conversion tool |
| `include/ckernel_quant.h` | Block structures (block_q4_K, etc.) |
| `src/kernels/dequant_kernels.c` | Dequantization implementations |
| `include/ckernel_dtype.h` | Data type enums and utilities |