GGUF to Bump Allocator

This document explains how C-Kernel-Engine handles quantized weights from GGUF files (llama.cpp format) and converts them to our bump allocator layout for efficient inference and training.

📖 Prerequisites: For quantization fundamentals (why quantize, block formats, grouping), see Quantization Deep Dive. For step-by-step byte-level GGUF parsing, see GGUF Conversion Guide.

💡 Key Insight: We don't dequantize weights at load time. We keep them quantized in memory and dequantize on the fly during GEMM operations. This saves 4x memory for Q4_K models.

Why Quantization Matters

  • 4x memory reduction: Q4_K uses ~4.5 bits/weight vs 32 bits for FP32
  • 2x bandwidth savings: less data to move from DRAM means faster inference
  • <1% quality loss: Q4_K_M preserves model quality remarkably well

GGUF File Format Overview

GGUF (GGML Universal Format) is llama.cpp's binary format for storing quantized models. Understanding it is essential for our conversion pipeline.

GGUF file structure:

  • Magic: "GGUF" (4 bytes)
  • Version: u32
  • n_tensors: u64
  • n_kv: u64
  • Metadata (key-value pairs), e.g.:
      llama.block_count: 32
      llama.embedding_length: 4096
      llama.attention.head_count: 32
      llama.rope.freq_base: 10000.0
      llama.norm_rms_eps: 1e-5
  • Tensor info (headers only, no data), e.g.:
      name: "blk.0.attn_q.weight", dims: [4096, 4096], type: Q4_K, offset: 0x1000
      name: "blk.0.attn_k.weight", dims: [4096, 1024], type: Q4_K, offset: 0x801000
  • Alignment padding (to a 32-byte boundary)
  • Tensor data (quantized blocks): block_q4_K[...] for each tensor
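The fixed preamble above (magic, version, tensor and KV counts) can be read with a few sequential reads. A minimal sketch, assuming a little-endian host and with error handling trimmed; the function name `read_gguf_preamble` is illustrative, not from the codebase:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

// Read the fixed GGUF preamble: "GGUF" magic, u32 version, u64 n_tensors, u64 n_kv.
// Returns 0 on success, -1 on a short read or bad magic.
int read_gguf_preamble(FILE *f, uint32_t *version,
                       uint64_t *n_tensors, uint64_t *n_kv) {
    char magic[4];
    if (fread(magic, 1, 4, f) != 4 || memcmp(magic, "GGUF", 4) != 0)
        return -1;
    if (fread(version, sizeof(*version), 1, f) != 1) return -1;
    if (fread(n_tensors, sizeof(*n_tensors), 1, f) != 1) return -1;
    if (fread(n_kv, sizeof(*n_kv), 1, f) != 1) return -1;
    return 0;
}
```

After the preamble come the n_kv metadata pairs and n_tensors tensor-info records, which are variable-length and parsed field by field.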

Q4_K Block Structure

Q4_K is a "K-quant" format with nested scales. Each super-block contains 256 weights organized into 8 sub-blocks of 32 weights each. This two-level scaling provides better accuracy than simple 4-bit quantization.

Q4_K super-block: 144 bytes = 256 weights

  • d (FP16, 2 bytes): super-block scale
  • dmin (FP16, 2 bytes): super-block min
  • scales[12] (12 bytes): 8 scales + 8 mins, 6-bit packed
  • qs[128] (128 bytes): 256 x 4-bit weights, 2 per byte

The 256 weights are organized as 8 sub-blocks of 32 weights each (Sub 0 through Sub 7, 16 bytes of qs per sub-block).

Dequantization formula:

  w_fp32 = (q - 8) * d * sc[sub] + dmin * m[sub]

where q is the 4-bit value (0-15), sc[sub] is the sub-block scale, and m[sub] is the sub-block min.

Total: 2 + 2 + 12 + 128 = 144 bytes for 256 weights = 4.5 bits/weight. The same 256 weights in FP32 take 1024 bytes (7x more).
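The 144-byte layout above maps directly onto a packed C struct. A sketch of how such a type might look (the real definition lives in include/ckernel_quant.h and may differ in naming):

```c
#include <stdint.h>

// One Q4_K super-block: 2 + 2 + 12 + 128 = 144 bytes, no padding needed
// since the two FP16 fields come first and the rest are byte arrays.
typedef struct {
    uint16_t d;          // FP16 super-block scale
    uint16_t dmin;       // FP16 super-block min
    uint8_t  scales[12]; // 8 scales + 8 mins, 6-bit packed
    uint8_t  qs[128];    // 256 x 4-bit quantized weights, 2 per byte
} block_q4_K;

_Static_assert(sizeof(block_q4_K) == 144, "Q4_K super-block must be 144 bytes");
```

Because every field after the two 2-byte halves is a byte array, the struct is naturally packed and can be overlaid directly on the mmap'd tensor data.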

Scale Unpacking (The Tricky Part)

The 12-byte scales array packs 8 scales and 8 mins in 6-bit format. Unpacking requires careful bit manipulation:

6-bit Packed Layout

// scales[12] = 96 bits = 16 x 6-bit values
// First 8 values: scales (sc[0..7])
// Last 8 values: mins (m[0..7])

// Unpacking sc[0..7] from bytes 0-5:
sc[0] = scales[0] & 0x3F;                              // bits 0-5 of byte 0
sc[1] = (scales[0] >> 6) | ((scales[1] & 0x0F) << 2);  // bits 6-7 of byte 0, bits 0-3 of byte 1
sc[2] = (scales[1] >> 4) | ((scales[2] & 0x03) << 4);  // bits 4-7 of byte 1, bits 0-1 of byte 2
sc[3] = scales[2] >> 2;                                // bits 2-7 of byte 2
// ... pattern repeats for sc[4..7] using bytes 3-5

// Unpacking m[0..7] from bytes 6-11:
// Same pattern as scales
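The bit arithmetic above generalizes to a loop over all 16 values. A self-contained sketch of the sequential LSB-first 6-bit packing described here, with a matching packer for round-trip testing; the helper names are illustrative:

```c
#include <stdint.h>

// Unpack 16 sequential 6-bit values (8 scales, then 8 mins) from 12 bytes,
// LSB-first: value i occupies bits [6*i, 6*i+6).
static void unpack_scales_6bit(const uint8_t scales[12],
                               uint8_t sc[8], uint8_t m[8]) {
    for (int i = 0; i < 16; i++) {
        int bit   = i * 6;        // starting bit offset
        int byte  = bit >> 3;     // byte holding the low bits
        int shift = bit & 7;      // offset within that byte
        uint16_t two = (uint16_t)scales[byte]
                     | (uint16_t)((byte + 1 < 12 ? scales[byte + 1] : 0) << 8);
        uint8_t v = (uint8_t)((two >> shift) & 0x3F);
        if (i < 8) sc[i] = v; else m[i - 8] = v;
    }
}

// Inverse operation, useful for testing the unpacker.
static void pack_scales_6bit(const uint8_t vals[16], uint8_t out[12]) {
    for (int i = 0; i < 12; i++) out[i] = 0;
    for (int i = 0; i < 16; i++) {
        int bit = i * 6, byte = bit >> 3, shift = bit & 7;
        uint16_t two = (uint16_t)((vals[i] & 0x3F) << shift);
        out[byte] |= (uint8_t)(two & 0xFF);
        if (byte + 1 < 12) out[byte + 1] |= (uint8_t)(two >> 8);
    }
}
```

For i = 1 the loop reproduces the hand-unrolled sc[1] line: bits 6-11 span bytes 0 and 1, so both bytes are combined before shifting.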

GGUF to Bump Conversion Pipeline

GGUF to bump allocator conversion:

1. Parse GGUF: read magic and version, extract metadata, build the tensor index.
2. Validate shapes: check tensor dims, verify Q4_K alignment (ne0 % 256 == 0).
3. Build dtype table: per-tensor type codes (Q4_K=6, Q6_K=7, FP32=0); this enables mixed precision.
4. Write the bump header (128 bytes): "BUMPWGT3", version=3; num_layers, vocab_size, embed_dim; context_len, num_heads, head_dim; SHA-256 checksum (32 bytes).
5. Stream tensor data (no dequantization!): token_embd (Q4_K blocks), then layers 0 through N-1 (Q/K/V/O + MLP), then output_norm (FP32 gamma).

The result is weights.bump, ready for inference.

Bump File Header Format

The bump allocator uses a 128-byte header that contains all model configuration needed at runtime:

// Bump Header Structure (128 bytes total)
// ========================================

Offset  Size    Field                   Description
------  ----    -----                   -----------
0x00    8       magic                   "BUMPWGT3" (8 chars)
0x08    4       version                 3 (current version)
0x0C    4       model_type              1 = decoder-only LLM
0x10    4       num_layers              Number of transformer layers
0x14    4       vocab_size              Vocabulary size
0x18    4       embed_dim               Hidden dimension (hidden_size)
0x1C    4       context_len             Max sequence length
0x20    4       num_heads               Number of attention heads
0x24    4       head_dim                Per-head dimension (embed_dim / num_heads)
0x28    8       aligned_embed_dim       64-byte aligned embed dim
0x30    8       aligned_head_dim        64-byte aligned head dim
0x38    8       aligned_context         64-byte aligned context length
0x40    32      checksum                SHA-256 of payload (after header)
0x60    32      reserved                Future use (zeros)
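The table above maps onto a 128-byte packed struct. A hypothetical C mirror of that layout (field names follow the table; the actual header struct in the codebase may be named differently):

```c
#include <stdint.h>
#include <stddef.h>

// Sketch of the 128-byte bump header described above. The on-disk layout
// is little-endian; fields are ordered so no compiler padding is inserted.
typedef struct {
    char     magic[8];          // "BUMPWGT3"
    uint32_t version;           // 3 (current version)
    uint32_t model_type;        // 1 = decoder-only LLM
    uint32_t num_layers;
    uint32_t vocab_size;
    uint32_t embed_dim;
    uint32_t context_len;
    uint32_t num_heads;
    uint32_t head_dim;          // embed_dim / num_heads
    uint64_t aligned_embed_dim; // 64-byte aligned embed dim
    uint64_t aligned_head_dim;  // 64-byte aligned head dim
    uint64_t aligned_context;   // 64-byte aligned context length
    uint8_t  checksum[32];      // SHA-256 of payload (after header)
    uint8_t  reserved[32];      // future use (zeros)
} bump_header_t;

_Static_assert(sizeof(bump_header_t) == 128, "bump header must be 128 bytes");
```

Every offset in the table can be checked against offsetof(), which is a cheap way to catch accidental layout drift between the converter and the loader.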

Why Aligned Dimensions?

Cache Line Alignment

Modern CPUs fetch data in 64-byte cache lines. Misaligned access causes:

  • Two cache line fetches instead of one
  • Cache line splitting penalties
  • Reduced SIMD throughput

AVX-512 Requirements

AVX-512 processes 16 floats (64 bytes) per instruction:

  • Aligned loads: _mm512_load_ps (fast)
  • Unaligned loads: _mm512_loadu_ps (slower)
  • Padding zeros don't affect results
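The aligned dimensions in the header are precomputed with a standard power-of-two round-up. A minimal sketch; whether the engine rounds element counts or byte counts to the 64-byte boundary is an assumption here, since the header only stores the already-aligned values:

```c
#include <stdint.h>

// Round n up to the next multiple of align (align must be a power of two).
static inline uint64_t round_up(uint64_t n, uint64_t align) {
    return (n + align - 1) & ~(align - 1);
}
```

For example, a 4096-wide embedding is already 64-aligned and passes through unchanged, while 4097 would round up to 4160; the extra slots are zero-padded so SIMD results are unaffected.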

Per-Layer Weight Layout

Each transformer layer's weights are stored contiguously in the bump file:

Per-layer weight layout in the bump file (E = embed_dim, kv = num_kv_heads * head_dim):

  • ln1_gamma: FP32 [embed_dim]
  • ln2_gamma: FP32 [embed_dim]
  • Attention projections (Q4_K): Wq [E x E], Wk [E x kv], Wv [E x kv], Wo [E x E], each with a zero bias placeholder (bq = bk = bv = bo = 0)
  • MLP projections (Q4_K): W_gate [E x intermediate], W_up [E x intermediate], W_down [intermediate x E], with zero bias placeholders (b1 = b2 = 0)

Norm weights stay FP32; all projection matrices stay Q4_K quantized.

Dequantization Kernels

At inference time, we dequantize weights on the fly during GEMM operations. This trades compute for memory bandwidth, a good trade on modern CPUs where memory is the bottleneck.

Q4_K Dequantization (Scalar Reference)

void dequant_q4_k_block(const block_q4_K *block, float *output) {
    const float d = fp16_to_fp32(block->d);
    const float dmin = fp16_to_fp32(block->dmin);

    // Unpack 6-bit scales and mins
    uint8_t sc[8], m[8];
    unpack_q4_k_scales(block->scales, sc, m);

    // Process 8 sub-blocks of 32 weights each
    for (int sub = 0; sub < 8; sub++) {
        const float scale = d * (float)sc[sub];
        const float min_val = dmin * (float)m[sub];

        const uint8_t *qs = &block->qs[sub * 16];
        float *out = &output[sub * 32];

        for (int i = 0; i < 16; i++) {
            const uint8_t packed = qs[i];
            const int8_t q0 = (packed & 0x0F) - 8;  // Lower nibble
            const int8_t q1 = (packed >> 4) - 8;    // Upper nibble

            out[2*i + 0] = scale * (float)q0 + min_val;
            out[2*i + 1] = scale * (float)q1 + min_val;
        }
    }
}
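The reference kernel above calls fp16_to_fp32() to decode the super-block's d and dmin fields. A portable bit-manipulation sketch of that conversion, shown here for completeness; the real build may use F16C hardware instructions instead, and the actual helper in the codebase may differ:

```c
#include <stdint.h>

// Convert an IEEE 754 half-precision value (as raw bits) to float.
// Handles normals, subnormals, zero, infinity, and NaN.
static float fp16_to_fp32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t mant = (uint32_t)(h & 0x3FF);
    uint32_t bits;
    if (exp == 0) {
        if (mant == 0) {
            bits = sign;                       // signed zero
        } else {
            // Subnormal half: renormalize into a normal float.
            uint32_t e = 113;                  // 127 - 15 + 1
            while ((mant & 0x400) == 0) { mant <<= 1; e--; }
            mant &= 0x3FF;
            bits = sign | (e << 23) | (mant << 13);
        }
    } else if (exp == 31) {
        bits = sign | 0x7F800000 | (mant << 13); // inf or NaN
    } else {
        bits = sign | ((exp + 112) << 23) | (mant << 13); // rebias: 127 - 15
    }
    union { uint32_t u; float f; } u = { .u = bits };
    return u.f;
}
```

Since every half value is exactly representable in float, this conversion is lossless.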

AVX-512 Optimized Dequantization

static inline void dequant_q4_k_subblock_avx512(
    const uint8_t *qs,      // 16 bytes = 32 x 4-bit weights
    float scale,            // d * sc[sub]
    float min_val,          // dmin * m[sub]
    __m512 *out0,           // Output: weights 0-15
    __m512 *out1)           // Output: weights 16-31
{
    const __m512 vscale = _mm512_set1_ps(scale);
    const __m512 vmin = _mm512_set1_ps(min_val);
    const __m512i offset = _mm512_set1_epi32(8);
    const __m512i mask_lo = _mm512_set1_epi32(0x0F);

    // Load 16 bytes, expand to 32-bit integers
    __m128i packed = _mm_loadu_si128((const __m128i *)qs);
    __m512i bytes = _mm512_cvtepu8_epi32(packed);

    // Extract lower nibbles: (byte & 0x0F) - 8
    __m512i lo = _mm512_and_epi32(bytes, mask_lo);
    lo = _mm512_sub_epi32(lo, offset);

    // Extract upper nibbles: (byte >> 4) - 8
    __m512i hi = _mm512_srli_epi32(bytes, 4);
    hi = _mm512_sub_epi32(hi, offset);

    // Convert to float, scale, add min: FMA!
    *out0 = _mm512_fmadd_ps(_mm512_cvtepi32_ps(lo), vscale, vmin);
    *out1 = _mm512_fmadd_ps(_mm512_cvtepi32_ps(hi), vscale, vmin);
}

Make Commands

Use these commands to work with GGUF files:

Command                        Description
-------                        -----------
make test-quant                Run quantization kernel tests (dequant + q4/q8 gemm)
make gguf-inspect GGUF=path    Inspect GGUF tensor dtypes (what is quantized?)
make gguf-list GGUF=path       List all GGUF tensors (name/type/shape)
make gguf-to-bump GGUF=path    Convert GGUF to bump weights (outputs to GGUF_OUT/)

Example Usage

# Inspect a GGUF file to see what's quantized
make gguf-inspect GGUF=models/qwen2.5-3b-instruct-q4_k_m.gguf

# Output:
# [gguf] version=3 arch=llama tensors=291 kv=26 alignment=32
# [gguf] tensor types:
#   -      Q4_K:   225 tensors, bytes=1.71 GiB
#   -        F32:    66 tensors, bytes=48.05 MiB
# [gguf] key tensors:
#   - token_embd.weight: Q4_K dims=(3584, 151936)
#   - blk.0.attn_q.weight: Q4_K dims=(3584, 3584)

# Convert to bump format
make gguf-to-bump GGUF=models/qwen2.5-3b-instruct-q4_k_m.gguf

# Output files:
#   GGUF_OUT/weights.bump     - Quantized weights
#   GGUF_OUT/config.json      - HuggingFace-style config

Quantization Format Comparison

Format  Block Size  Bytes/Block  Bits/Weight  Quality    Use Case
------  ----------  -----------  -----------  -------    --------
FP32    1           4            32.0         Baseline   Training, norms
FP16    1           2            16.0         ~Same      GPU inference
Q8_0    32          34           8.5          Excellent  Activations
Q4_0    32          18           4.5          Good       Simple quant
Q4_K    256         144          4.5          Very good  Weights (primary)
Q6_K    256         210          6.6          Excellent  Quality-sensitive

Recommendation: Use Q4_K_M models for the best balance of quality and memory savings. The "M" in Q4_K_M means "medium" quantization: it keeps more precision in attention layers.
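The bits/weight column follows directly from the block geometry: bits = bytes_per_block * 8 / weights_per_block. A trivial helper makes the arithmetic checkable:

```c
// Effective bits per weight for a block quantization format.
static double bits_per_weight(int bytes_per_block, int weights_per_block) {
    return (double)bytes_per_block * 8.0 / (double)weights_per_block;
}
```

For Q4_K this gives 144 * 8 / 256 = 4.5, and for Q6_K 210 * 8 / 256 = 6.5625, which the table rounds to 6.6.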

Integration with Bump Allocator

The bump allocator loads weights once at startup and keeps them in memory. The key insight is that quantized weights stay quantized - we never allocate space for dequantized FP32 weights.

Memory layout: quantized weights + FP32 activations. The Q4_K weight region is loaded once and stays quantized; the FP32 activation buffers are reused per layer. Memory comparison for a 7B model: ~28 GB of FP32 weights vs ~3.5 GB as Q4_K, an 8x reduction.

Source Files

File                              Description
----                              -----------
scripts/convert_gguf_to_bump.py   GGUF to bump conversion tool
include/ckernel_quant.h           Block structures (block_q4_K, etc.)
src/kernels/dequant_kernels.c     Dequantization implementations
include/ckernel_dtype.h           Data type enums and utilities