GGUF to Bump Allocator

This document explains how C-Kernel-Engine handles quantized weights from GGUF files (llama.cpp format) and converts them to our bump allocator layout for efficient inference and training.

📖 Prerequisites: For quantization fundamentals (why quantize, block formats, grouping), see Quantization Deep Dive. For step-by-step byte-level GGUF parsing, see GGUF Conversion Guide.

💡 Key Insight: We don't dequantize weights at load time. We keep them quantized in memory and dequantize on the fly during GEMM operations. This saves 4x memory for Q4_K models.

Why Quantization Matters

  • 4x memory reduction: Q4_K uses ~4.5 bits/weight vs 32 bits for FP32
  • 2x bandwidth savings: less data to move from DRAM means faster inference
  • <1% quality loss: Q4_K_M preserves model quality remarkably well

GGUF File Format Overview

GGUF (GGML Universal Format) is llama.cpp's binary format for storing quantized models. Understanding it is essential for our conversion pipeline.

GGUF file structure:

  • Magic: "GGUF" (4 bytes)
  • Version: u32
  • n_tensors: u64
  • n_kv: u64
  • Metadata (key-value pairs), e.g.:
      llama.block_count: 32
      llama.embedding_length: 4096
      llama.attention.head_count: 32
      llama.rope.freq_base: 10000.0
      llama.norm_rms_eps: 1e-5
  • Tensor info (headers only, no data), e.g.:
      name: "blk.0.attn_q.weight", dims: [4096, 4096], type: Q4_K, offset: 0x1000
      name: "blk.0.attn_k.weight", dims: [4096, 1024], type: Q4_K, offset: 0x801000
  • Alignment padding (to a 32-byte boundary)
  • Tensor data (quantized blocks): block_q4_K[...] for each tensor
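The fixed preamble above (magic, version, tensor and KV counts) can be read with a few sequential reads. A minimal sketch, assuming a little-endian host and with error handling trimmed; the function name `read_gguf_preamble` is illustrative, not from the codebase:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

// Read the fixed GGUF preamble: "GGUF" magic, u32 version, u64 n_tensors, u64 n_kv.
// Returns 0 on success, -1 on a short read or bad magic.
int read_gguf_preamble(FILE *f, uint32_t *version,
                       uint64_t *n_tensors, uint64_t *n_kv) {
    char magic[4];
    if (fread(magic, 1, 4, f) != 4 || memcmp(magic, "GGUF", 4) != 0)
        return -1;
    if (fread(version, sizeof(*version), 1, f) != 1) return -1;
    if (fread(n_tensors, sizeof(*n_tensors), 1, f) != 1) return -1;
    if (fread(n_kv, sizeof(*n_kv), 1, f) != 1) return -1;
    return 0;
}
```

After the preamble come the n_kv metadata pairs and n_tensors tensor-info records, which are variable-length and parsed field by field.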

Q4_K Block Structure

Q4_K is a "K-quant" format with nested scales. Each super-block contains 256 weights organized into 8 sub-blocks of 32 weights each. This two-level scaling provides better accuracy than simple 4-bit quantization.

Q4_K super-block: 144 bytes = 256 weights

  • d (FP16, 2 bytes): super-block scale
  • dmin (FP16, 2 bytes): super-block min
  • scales[12] (12 bytes): 8 scales + 8 mins, 6-bit packed
  • qs[128] (128 bytes): 256 x 4-bit weights, 2 per byte

The 256 weights are organized as 8 sub-blocks of 32 weights each (Sub 0 through Sub 7, 16 bytes of qs per sub-block).

Dequantization formula:

  w_fp32 = (q - 8) * d * sc[sub] + dmin * m[sub]

where q is the 4-bit value (0-15), sc[sub] is the sub-block scale, and m[sub] is the sub-block min.

Total: 2 + 2 + 12 + 128 = 144 bytes for 256 weights = 4.5 bits/weight. The same 256 weights in FP32 take 1024 bytes (7x more).
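The 144-byte layout above maps directly onto a packed C struct. A sketch of how such a type might look (the real definition lives in include/ckernel_quant.h and may differ in naming):

```c
#include <stdint.h>

// One Q4_K super-block: 2 + 2 + 12 + 128 = 144 bytes, no padding needed
// since the two FP16 fields come first and the rest are byte arrays.
typedef struct {
    uint16_t d;          // FP16 super-block scale
    uint16_t dmin;       // FP16 super-block min
    uint8_t  scales[12]; // 8 scales + 8 mins, 6-bit packed
    uint8_t  qs[128];    // 256 x 4-bit quantized weights, 2 per byte
} block_q4_K;

_Static_assert(sizeof(block_q4_K) == 144, "Q4_K super-block must be 144 bytes");
```

Because every field after the two 2-byte halves is a byte array, the struct is naturally packed and can be overlaid directly on the mmap'd tensor data.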

Scale Unpacking (The Tricky Part)

The 12-byte scales array packs 8 scales and 8 mins in 6-bit format. Unpacking requires careful bit manipulation:

6-bit Packed Layout

// scales[12] = 96 bits = 16 x 6-bit values
// First 8 values: scales (sc[0..7])
// Last 8 values: mins (m[0..7])

// Unpacking sc[0..7] from bytes 0-5:
sc[0] = scales[0] & 0x3F;                              // bits 0-5 of byte 0
sc[1] = (scales[0] >> 6) | ((scales[1] & 0x0F) << 2);  // bits 6-7 of byte 0, bits 0-3 of byte 1
sc[2] = (scales[1] >> 4) | ((scales[2] & 0x03) << 4);  // bits 4-7 of byte 1, bits 0-1 of byte 2
sc[3] = scales[2] >> 2;                                // bits 2-7 of byte 2
// ... pattern repeats for sc[4..7] using bytes 3-5

// Unpacking m[0..7] from bytes 6-11:
// Same pattern as scales
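The bit arithmetic above generalizes to a loop over all 16 values. A self-contained sketch of the sequential LSB-first 6-bit packing described here, with a matching packer for round-trip testing; the helper names are illustrative:

```c
#include <stdint.h>

// Unpack 16 sequential 6-bit values (8 scales, then 8 mins) from 12 bytes,
// LSB-first: value i occupies bits [6*i, 6*i+6).
static void unpack_scales_6bit(const uint8_t scales[12],
                               uint8_t sc[8], uint8_t m[8]) {
    for (int i = 0; i < 16; i++) {
        int bit   = i * 6;        // starting bit offset
        int byte  = bit >> 3;     // byte holding the low bits
        int shift = bit & 7;      // offset within that byte
        uint16_t two = (uint16_t)scales[byte]
                     | (uint16_t)((byte + 1 < 12 ? scales[byte + 1] : 0) << 8);
        uint8_t v = (uint8_t)((two >> shift) & 0x3F);
        if (i < 8) sc[i] = v; else m[i - 8] = v;
    }
}

// Inverse operation, useful for testing the unpacker.
static void pack_scales_6bit(const uint8_t vals[16], uint8_t out[12]) {
    for (int i = 0; i < 12; i++) out[i] = 0;
    for (int i = 0; i < 16; i++) {
        int bit = i * 6, byte = bit >> 3, shift = bit & 7;
        uint16_t two = (uint16_t)((vals[i] & 0x3F) << shift);
        out[byte] |= (uint8_t)(two & 0xFF);
        if (byte + 1 < 12) out[byte + 1] |= (uint8_t)(two >> 8);
    }
}
```

For i = 1 the loop reproduces the hand-unrolled sc[1] line: bits 6-11 span bytes 0 and 1, so both bytes are combined before shifting.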

GGUF to Bump Conversion Pipeline

GGUF to bump allocator conversion:

1. Parse GGUF: read magic and version, extract metadata, build the tensor index.
2. Validate shapes: check tensor dims, verify Q4_K alignment (ne0 % 256 == 0).
3. Build dtype table: per-tensor type codes (Q4_K=6, Q6_K=7, FP32=0); this enables mixed precision.
4. Write the bump header (128 bytes): "BUMPWGT3", version=3; num_layers, vocab_size, embed_dim; context_len, num_heads, head_dim; SHA-256 checksum (32 bytes).
5. Stream tensor data (no dequantization!): token_embd (Q4_K blocks), then layers 0 through N-1 (Q/K/V/O + MLP), then output_norm (FP32 gamma).

The result is weights.bump, ready for inference.

Bump File Header Format

The bump allocator uses a 128-byte header that contains all model configuration needed at runtime:

// Bump Header Structure (128 bytes total)
// ========================================

Offset  Size    Field                   Description
------  ----    -----                   -----------
0x00    8       magic                   "BUMPWGT3" (8 chars)
0x08    4       version                 3 (current version)
0x0C    4       model_type              1 = decoder-only LLM
0x10    4       num_layers              Number of transformer layers
0x14    4       vocab_size              Vocabulary size
0x18    4       embed_dim               Hidden dimension (hidden_size)
0x1C    4       context_len             Max sequence length
0x20    4       num_heads               Number of attention heads
0x24    4       head_dim                Per-head dimension (embed_dim / num_heads)
0x28    8       aligned_embed_dim       64-byte aligned embed dim
0x30    8       aligned_head_dim        64-byte aligned head dim
0x38    8       aligned_context         64-byte aligned context length
0x40    32      checksum                SHA-256 of payload (after header)
0x60    32      reserved                Future use (zeros)
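The table above maps onto a 128-byte packed struct. A hypothetical C mirror of that layout (field names follow the table; the actual header struct in the codebase may be named differently):

```c
#include <stdint.h>
#include <stddef.h>

// Sketch of the 128-byte bump header described above. The on-disk layout
// is little-endian; fields are ordered so no compiler padding is inserted.
typedef struct {
    char     magic[8];          // "BUMPWGT3"
    uint32_t version;           // 3 (current version)
    uint32_t model_type;        // 1 = decoder-only LLM
    uint32_t num_layers;
    uint32_t vocab_size;
    uint32_t embed_dim;
    uint32_t context_len;
    uint32_t num_heads;
    uint32_t head_dim;          // embed_dim / num_heads
    uint64_t aligned_embed_dim; // 64-byte aligned embed dim
    uint64_t aligned_head_dim;  // 64-byte aligned head dim
    uint64_t aligned_context;   // 64-byte aligned context length
    uint8_t  checksum[32];      // SHA-256 of payload (after header)
    uint8_t  reserved[32];      // future use (zeros)
} bump_header_t;

_Static_assert(sizeof(bump_header_t) == 128, "bump header must be 128 bytes");
```

Every offset in the table can be checked against offsetof(), which is a cheap way to catch accidental layout drift between the converter and the loader.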

Why Aligned Dimensions?

Cache Line Alignment

Modern CPUs fetch data in 64-byte cache lines. Misaligned access causes:

  • Two cache line fetches instead of one
  • Cache line splitting penalties
  • Reduced SIMD throughput

AVX-512 Requirements

AVX-512 processes 16 floats (64 bytes) per instruction:

  • Aligned loads: _mm512_load_ps (fast)
  • Unaligned loads: _mm512_loadu_ps (slower)
  • Padding zeros don't affect results
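The aligned dimensions in the header are precomputed with a standard power-of-two round-up. A minimal sketch; whether the engine rounds element counts or byte counts to the 64-byte boundary is an assumption here, since the header only stores the already-aligned values:

```c
#include <stdint.h>

// Round n up to the next multiple of align (align must be a power of two).
static inline uint64_t round_up(uint64_t n, uint64_t align) {
    return (n + align - 1) & ~(align - 1);
}
```

For example, a 4096-wide embedding is already 64-aligned and passes through unchanged, while 4097 would round up to 4160; the extra slots are zero-padded so SIMD results are unaffected.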

Per-Layer Weight Layout

Each transformer layer's weights are stored contiguously in the bump file:

Per-layer weight layout in the bump file (E = embed_dim, kv = num_kv_heads * head_dim):

  • ln1_gamma: FP32 [embed_dim]
  • ln2_gamma: FP32 [embed_dim]
  • Attention projections (Q4_K): Wq [E x E], Wk [E x kv], Wv [E x kv], Wo [E x E], each with a zero bias placeholder (bq = bk = bv = bo = 0)
  • MLP projections (Q4_K): W_gate [E x intermediate], W_up [E x intermediate], W_down [intermediate x E], with zero bias placeholders (b1 = b2 = 0)

Norm weights stay FP32; all projection matrices stay Q4_K quantized.

Dequantization Kernels

At inference time, we dequantize weights on the fly during GEMM operations. This trades compute for memory bandwidth, a good trade on modern CPUs where memory is the bottleneck.

Q4_K Dequantization (Scalar Reference)

void dequant_q4_k_block(const block_q4_K *block, float *output) {
    const float d = fp16_to_fp32(block->d);
    const float dmin = fp16_to_fp32(block->dmin);

    // Unpack 6-bit scales and mins
    uint8_t sc[8], m[8];
    unpack_q4_k_scales(block->scales, sc, m);

    // Process 8 sub-blocks of 32 weights each
    for (int sub = 0; sub < 8; sub++) {
        const float scale = d * (float)sc[sub];
        const float min_val = dmin * (float)m[sub];

        const uint8_t *qs = &block->qs[sub * 16];
        float *out = &output[sub * 32];

        for (int i = 0; i < 16; i++) {
            const uint8_t packed = qs[i];
            const int8_t q0 = (packed & 0x0F) - 8;  // Lower nibble
            const int8_t q1 = (packed >> 4) - 8;    // Upper nibble

            out[2*i + 0] = scale * (float)q0 + min_val;
            out[2*i + 1] = scale * (float)q1 + min_val;
        }
    }
}
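The reference kernel above calls fp16_to_fp32() to decode the super-block's d and dmin fields. A portable bit-manipulation sketch of that conversion, shown here for completeness; the real build may use F16C hardware instructions instead, and the actual helper in the codebase may differ:

```c
#include <stdint.h>

// Convert an IEEE 754 half-precision value (as raw bits) to float.
// Handles normals, subnormals, zero, infinity, and NaN.
static float fp16_to_fp32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t mant = (uint32_t)(h & 0x3FF);
    uint32_t bits;
    if (exp == 0) {
        if (mant == 0) {
            bits = sign;                       // signed zero
        } else {
            // Subnormal half: renormalize into a normal float.
            uint32_t e = 113;                  // 127 - 15 + 1
            while ((mant & 0x400) == 0) { mant <<= 1; e--; }
            mant &= 0x3FF;
            bits = sign | (e << 23) | (mant << 13);
        }
    } else if (exp == 31) {
        bits = sign | 0x7F800000 | (mant << 13); // inf or NaN
    } else {
        bits = sign | ((exp + 112) << 23) | (mant << 13); // rebias: 127 - 15
    }
    union { uint32_t u; float f; } u = { .u = bits };
    return u.f;
}
```

Since every half value is exactly representable in float, this conversion is lossless.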

AVX-512 Optimized Dequantization

static inline void dequant_q4_k_subblock_avx512(
    const uint8_t *qs,      // 16 bytes = 32 x 4-bit weights
    float scale,            // d * sc[sub]
    float min_val,          // dmin * m[sub]
    __m512 *out0,           // Output: weights 0-15
    __m512 *out1)           // Output: weights 16-31
{
    const __m512 vscale = _mm512_set1_ps(scale);
    const __m512 vmin = _mm512_set1_ps(min_val);
    const __m512i offset = _mm512_set1_epi32(8);
    const __m512i mask_lo = _mm512_set1_epi32(0x0F);

    // Load 16 bytes, expand to 32-bit integers
    __m128i packed = _mm_loadu_si128((const __m128i *)qs);
    __m512i bytes = _mm512_cvtepu8_epi32(packed);

    // Extract lower nibbles: (byte & 0x0F) - 8
    __m512i lo = _mm512_and_epi32(bytes, mask_lo);
    lo = _mm512_sub_epi32(lo, offset);

    // Extract upper nibbles: (byte >> 4) - 8
    __m512i hi = _mm512_srli_epi32(bytes, 4);
    hi = _mm512_sub_epi32(hi, offset);

    // Convert to float, scale, add min: FMA!
    *out0 = _mm512_fmadd_ps(_mm512_cvtepi32_ps(lo), vscale, vmin);
    *out1 = _mm512_fmadd_ps(_mm512_cvtepi32_ps(hi), vscale, vmin);
}

Make Commands

Use these commands to work with GGUF files:

Command                        Description
-------                        -----------
make test-quant                Run quantization kernel tests (dequant + q4/q8 gemm)
make gguf-inspect GGUF=path    Inspect GGUF tensor dtypes (what is quantized?)
make gguf-list GGUF=path       List all GGUF tensors (name/type/shape)
make gguf-to-bump GGUF=path    Convert GGUF to bump weights (outputs to GGUF_OUT/)

Example Usage

# Inspect a GGUF file to see what's quantized
make gguf-inspect GGUF=models/qwen2.5-3b-instruct-q4_k_m.gguf

# Output:
# [gguf] version=3 arch=llama tensors=291 kv=26 alignment=32
# [gguf] tensor types:
#   -      Q4_K:   225 tensors, bytes=1.71 GiB
#   -        F32:    66 tensors, bytes=48.05 MiB
# [gguf] key tensors:
#   - token_embd.weight: Q4_K dims=(3584, 151936)
#   - blk.0.attn_q.weight: Q4_K dims=(3584, 3584)

# Convert to bump format
make gguf-to-bump GGUF=models/qwen2.5-3b-instruct-q4_k_m.gguf

# Output files:
#   GGUF_OUT/weights.bump     - Quantized weights
#   GGUF_OUT/config.json      - HuggingFace-style config

Quantization Format Comparison

Format  Block Size  Bytes/Block  Bits/Weight  Quality    Use Case
------  ----------  -----------  -----------  -------    --------
FP32    1           4            32.0         Baseline   Training, norms
FP16    1           2            16.0         ~Same      GPU inference
Q8_0    32          34           8.5          Excellent  Activations
Q4_0    32          18           4.5          Good       Simple quant
Q4_K    256         144          4.5          Very good  Weights (primary)
Q6_K    256         210          6.6          Excellent  Quality-sensitive

Recommendation: Use Q4_K_M models for the best balance of quality and memory savings. The "M" in Q4_K_M means "medium" quantization: it keeps more precision in attention layers.
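The bits/weight column follows directly from the block geometry: bits = bytes_per_block * 8 / weights_per_block. A trivial helper makes the arithmetic checkable:

```c
// Effective bits per weight for a block quantization format.
static double bits_per_weight(int bytes_per_block, int weights_per_block) {
    return (double)bytes_per_block * 8.0 / (double)weights_per_block;
}
```

For Q4_K this gives 144 * 8 / 256 = 4.5, and for Q6_K 210 * 8 / 256 = 6.5625, which the table rounds to 6.6.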

Integration with Bump Allocator

The bump allocator loads weights once at startup and keeps them in memory. The key insight is that quantized weights stay quantized - we never allocate space for dequantized FP32 weights.

Memory layout: quantized weights + FP32 activations. The Q4_K weight region is loaded once and stays quantized; the FP32 activation buffers are reused per layer. Memory comparison for a 7B model: ~28 GB of FP32 weights vs ~3.5 GB as Q4_K, an 8x reduction.

Source Files

File                              Description
----                              -----------
scripts/convert_gguf_to_bump.py   GGUF to bump conversion tool
include/ckernel_quant.h           Block structures (block_q4_K, etc.)
src/kernels/dequant_kernels.c     Dequantization implementations
include/ckernel_dtype.h           Data type enums and utilities