GGUF to Bump: Step-by-Step Conversion

This document walks through every step of converting a GGUF file to C-Kernel-Engine's bump allocator format. We'll trace through the actual bytes, show the parsing logic, and explain how GGUF tensor names map to our weight layout.

🔍 Source Code: scripts/convert_gguf_to_bump.py implements this entire pipeline. This document explains the "why" behind each step.

Overview: The 8-Step Pipeline

GGUF → Bump Conversion Pipeline
Input: model.Q4_K_M.gguf → Output: weights.bump + config.json

  Step 1: Read Magic
  Step 2: Parse Header
  Step 3: Read Metadata
  Step 4: Tensor Info
  Step 5: Align & Seek
  Step 6: Dtype Table
  Step 7: Write Header
  Step 8: Stream Data

Step 1: Read and Validate Magic

Every GGUF file starts with the 4-byte magic string "GGUF". This identifies the file format and catches truncated or corrupted files early.

GGUF Magic Bytes (Offset 0x00)

  Offset  Byte  ASCII
  +0      0x47  G
  +1      0x47  G
  +2      0x55  U
  +3      0x46  F

Code: Magic Validation

magic = f.read(4)
if magic != b"GGUF":
    raise GGUFError(f"Invalid magic {magic!r} (expected b'GGUF')")
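
The snippets in the following steps read through a helper object `r`. The converter's actual reader class isn't shown in this document, but a minimal little-endian version (hypothetical class name `Reader`) could look like this:

```python
import struct

class Reader:
    """Minimal little-endian reader over a binary file object.

    Hypothetical sketch of the `r` helper used in later snippets;
    the real converter's reader may differ in detail.
    """
    def __init__(self, f):
        self.f = f

    def u32(self) -> int:
        return struct.unpack("<I", self.f.read(4))[0]

    def u64(self) -> int:
        return struct.unpack("<Q", self.f.read(8))[0]

    def key_str(self) -> str:
        n = self.u64()                      # u64 length prefix
        return self.f.read(n).decode("utf-8")

    def tell(self) -> int:
        return self.f.tell()

    def seek(self, pos: int) -> None:
        self.f.seek(pos)
```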

Step 2: Parse GGUF Header

After the magic, we read the version and counts. GGUF v2+ uses 64-bit counts (v1 used 32-bit).

GGUF Header Structure (After Magic)

  Offset  Field           Type
  0x00    magic           4 bytes ("GGUF")
  0x04    version         u32 (4 bytes)
  0x08    n_tensors       u64 (8 bytes, v2+)
  0x10    n_kv            u64 (8 bytes, v2+)
  0x18    metadata[n_kv]  variable length...

Example (Qwen2.5-3B): version=3, n_tensors=291, n_kv=26

Code: Header Parsing

version = r.u32()  # Read 4-byte unsigned int

if version >= 2:
    n_tensors = r.u64()  # 8-byte count
    n_kv = r.u64()
else:
    n_tensors = r.u32()  # v1: 4-byte count
    n_kv = r.u32()

# Sanity check (catches corrupt headers)
if n_tensors > 1_000_000 or n_kv > 1_000_000:
    raise GGUFError(f"Header counts look corrupt: {n_tensors=}, {n_kv=}")

Step 3: Read Metadata Key-Value Pairs

GGUF metadata contains model configuration. We only parse the keys we need and skip the rest for performance (especially important for large tokenizer arrays).

Metadata Key-Value Structure

Each metadata entry:
  key_len   u64
  key       string[key_len]
  val_type  u32
  value     data (type-dependent)

Value Types:
  Type 0-6:   u8, i8, u16, i16, u32, i32, f32
  Type 7:     bool (u8)
  Type 8:     string (u64 len + bytes)
  Type 9:     array (u32 elem_type + u64 len + elements) ← tokenizer vocab!
  Type 10-12: u64, i64, f64

⚠ The array type can be huge (100K+ tokens). We skip reading array contents.

Keys We Extract

GGUF Key                       Type    Maps To                   Example
general.architecture           string  model_type                "llama"
llama.block_count              u32     num_layers                36
llama.embedding_length         u32     hidden_size               3584
llama.feed_forward_length      u32     intermediate_size         18944
llama.attention.head_count     u32     num_heads                 28
llama.attention.head_count_kv  u32     num_kv_heads              4
llama.context_length           u32     max_position_embeddings   32768
llama.rope.freq_base           f32     rope_theta                1000000.0
llama.norm_rms_eps             f32     rms_norm_eps              1e-6
general.alignment              u32     (internal)                32

Code: Selective Metadata Parsing

wanted_meta = {
    "general.architecture",
    "llama.block_count",
    "llama.embedding_length",
    # ... other keys we need
}

meta = {}
for _ in range(n_kv):
    key = r.key_str()      # Read key length + key bytes
    vtype = r.u32()        # Read value type

    if key in wanted_meta:
        meta[key] = _gguf_read_value(r, vtype)  # Parse value
    else:
        _gguf_skip_value(r, vtype)              # Skip unknown keys (fast!)

Step 4: Read Tensor Info (Headers Only)

For each tensor, we read its name, dimensions, type, and offset within the data section. The actual tensor data comes later - this is just the index.

Tensor Info Structure (Per Tensor)

  name_len     u64
  name         string[name_len]
  n_dims       u32
  dims         u64 × n_dims
  ggml_type    u32
  data_offset  u64

Example Tensor:
  name      = "blk.0.attn_q.weight"
  n_dims    = 2
  dims      = [3584, 3584]
  ggml_type = 12 (Q4_K)
  offset    = 0x00123000 (relative to data_start)

⚠ GGML dims are [ne0, ne1, ...] = [inner_dim, outer_dim].
  For weight [3584, 3584]: ne0 = 3584 cols, ne1 = 3584 rows.

Code: Building Tensor Index

@dataclass
class TensorInfo:
    name: str
    dims: Tuple[int, ...]  # GGML order: ne0, ne1, ...
    ggml_type: int
    offset: int            # Relative to data section start

tensors: Dict[str, TensorInfo] = {}

for _ in range(n_tensors):
    name = r.key_str()           # Read name
    n_dims = r.u32()             # Read dimension count
    dims = tuple(r.u64() for _ in range(n_dims))
    ggml_type = r.u32()          # Q4_K=12, F32=0, etc.
    offset = r.u64()             # Byte offset in data section

    tensors[name] = TensorInfo(name=name, dims=dims,
                               ggml_type=ggml_type, offset=offset)

Step 5: Calculate Data Section Start (Alignment)

GGUF aligns the data section to a boundary (typically 32 bytes). This ensures tensor data is properly aligned for SIMD operations.

Finding Data Section Start

  Header | Metadata (variable) | Tensor Info (variable) | pad | Tensor Data
                                             current_pos ↑      ↑ data_start

Formula:
  data_start = align_up(current_pos, alignment)
  where align_up(n, a) = ((n + a - 1) // a) * a

Code: Calculate and Seek to Data

alignment = meta.get("general.alignment", 32)  # Default 32 bytes

# Current position after reading all tensor info
current_pos = r.tell()

# Align to boundary
data_start = ((current_pos + alignment - 1) // alignment) * alignment

# Seek to data section (some writers already align, this handles both)
r.seek(data_start)
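
The rounding arithmetic is easy to verify on concrete numbers. A standalone helper (hypothetical name `align_up`, matching the formula used above):

```python
def align_up(n: int, a: int) -> int:
    """Round n up to the next multiple of a (a > 0)."""
    return ((n + a - 1) // a) * a
```

With the default 32-byte alignment, a tensor-info section ending at byte 1000 puts data_start at 1024, while a position already on a boundary is left unchanged.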

Step 6: Build Dtype Table

The bump format stores a per-tensor dtype table so the runtime knows which dequantization kernel to use. Each tensor gets a 1-byte type code.

Dtype Table: Per-Tensor Type Codes

CK Dtype Codes:
  CK_DT_FP32 = 0   32-bit float
  CK_DT_BF16 = 1   Brain float 16
  CK_DT_FP16 = 2   IEEE float 16
  CK_DT_Q4_K = 6   K-quant 4-bit
  CK_DT_Q6_K = 7   K-quant 6-bit

GGML → CK Mapping:
  GGML_TYPE_F32 (0)   → CK_DT_FP32
  GGML_TYPE_F16 (1)   → CK_DT_FP16
  GGML_TYPE_BF16 (16) → CK_DT_BF16
  GGML_TYPE_Q4_K (12) → CK_DT_Q4_K
  GGML_TYPE_Q6_K (14) → CK_DT_Q6_K

Dtype Table Layout (written to bump file):
  tok_emb, pos_emb, [ln1, ln2, Wq, bq, ..., W_down, b2] × num_layers, final_ln, final_b

Code: Build Dtype Table

# Start with token embedding dtype
token_dtype = ck_dtype_from_ggml_type(tok.ggml_type)
dtype_table = [token_dtype, CK_DT_FP32]  # tok_emb, pos_emb

# Per-layer dtypes (14 entries per layer)
for layer in range(num_layers):
    wq = tensors[f"blk.{layer}.attn_q.weight"]
    wk = tensors[f"blk.{layer}.attn_k.weight"]
    # ... get other tensors

    dtype_table.extend([
        CK_DT_FP32,  # ln1_gamma (always FP32)
        CK_DT_FP32,  # ln2_gamma (always FP32)
        ck_dtype_from_ggml_type(wq.ggml_type),  # Wq
        CK_DT_FP32,  # bq (bias placeholder)
        ck_dtype_from_ggml_type(wk.ggml_type),  # Wk
        CK_DT_FP32,  # bk
        # ... etc
    ])

# Final layer
dtype_table.extend([CK_DT_FP32, CK_DT_FP32])  # final_ln, final_bias

# Write as bytes
out_f.write(struct.pack("<I", len(dtype_table)))
out_f.write(bytes(dtype_table))
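
`ck_dtype_from_ggml_type` is a table lookup over the mapping above. A sketch, with the constant values taken from the dtype-code tables in this step:

```python
# GGML type codes (subset) and CK dtype codes, per the tables above
GGML_TYPE_F32, GGML_TYPE_F16, GGML_TYPE_BF16 = 0, 1, 16
GGML_TYPE_Q4_K, GGML_TYPE_Q6_K = 12, 14
CK_DT_FP32, CK_DT_BF16, CK_DT_FP16, CK_DT_Q4_K, CK_DT_Q6_K = 0, 1, 2, 6, 7

_GGML_TO_CK = {
    GGML_TYPE_F32:  CK_DT_FP32,
    GGML_TYPE_F16:  CK_DT_FP16,
    GGML_TYPE_BF16: CK_DT_BF16,
    GGML_TYPE_Q4_K: CK_DT_Q4_K,
    GGML_TYPE_Q6_K: CK_DT_Q6_K,
}

def ck_dtype_from_ggml_type(ggml_type: int) -> int:
    """Map a GGML tensor type to a CK dtype code, failing fast on anything unsupported."""
    try:
        return _GGML_TO_CK[ggml_type]
    except KeyError:
        raise ValueError(f"Unsupported GGML type {ggml_type}") from None
```

Failing fast here is deliberate: an unsupported quantization should abort the conversion rather than produce a bump file the runtime cannot interpret.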

Step 7: Write Bump Header

The bump header is 128 bytes containing everything the runtime needs to interpret the weight data. We write a placeholder first, then fill it in after computing the SHA-256 checksum.

Bump File Header (128 bytes)

  Offset  Field          Size / Value
  0x00    magic          8 bytes, "BUMPWGT3"
  0x08    version        u32 = 3
  0x0C    model_type     u32 = 1
  0x10    num_layers     u32
  0x14    vocab_size     u32
  0x18    embed_dim      u32
  0x1C    context_len    u32
  0x20    num_heads      u32
  0x24    head_dim       u32
  0x28    aligned_embed  u64
  0x30    aligned_head   u64
  0x38    aligned_ctx    u64
  0x40    checksum       32 bytes, SHA-256 computed over all data after the header
  0x60    reserved       32 bytes, zeros
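
The fixed layout maps directly onto a single `struct.pack` call. A sketch (hypothetical function name; per the note above, the real writer fills the checksum in a second pass):

```python
import struct

def pack_bump_header(num_layers, vocab_size, embed_dim, context_len,
                     num_heads, head_dim,
                     aligned_embed, aligned_head, aligned_ctx,
                     checksum=b"\x00" * 32):
    """Pack the 128-byte bump header per the offset table above."""
    hdr = struct.pack(
        "<8s2I6I3Q32s32s",
        b"BUMPWGT3",        # 0x00 magic
        3,                  # 0x08 version
        1,                  # 0x0C model_type
        num_layers,         # 0x10
        vocab_size,         # 0x14
        embed_dim,          # 0x18
        context_len,        # 0x1C
        num_heads,          # 0x20
        head_dim,           # 0x24
        aligned_embed,      # 0x28 (u64)
        aligned_head,       # 0x30 (u64)
        aligned_ctx,        # 0x38 (u64)
        checksum,           # 0x40 (32 bytes)
        b"\x00" * 32,       # 0x60 reserved
    )
    assert len(hdr) == 128
    return hdr
```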

Step 8: Stream Tensor Data

This is where the actual conversion happens. We read tensors from GGUF and write them to bump in a specific order. Importantly, we don't dequantize - we copy the quantized blocks directly.

GGUF → Bump Tensor Name Mapping

GGUF Tensor Names → Bump Layout Order

GGUF Tensor Names:
  Global:
    token_embd.weight (no pos_emb - RoPE models)
  Per-Layer (blk.{N}.*):
    blk.{N}.attn_norm.weight
    blk.{N}.ffn_norm.weight
    blk.{N}.attn_q.weight
    blk.{N}.attn_k.weight
    blk.{N}.attn_v.weight
    blk.{N}.attn_output.weight
    blk.{N}.ffn_gate.weight
    blk.{N}.ffn_up.weight
    blk.{N}.ffn_down.weight
  Final:
    output_norm.weight (output.weight often tied to embed)

Bump Layout Order:
  1. Embeddings:
       tok_emb (Q4_K blocks copied)
       pos_emb (zeros - RoPE in kernel)
  2. Per-Layer (× num_layers):
       ln1_gamma (FP32, read from F16/BF16)
       ln2_gamma (FP32)
       Wq (Q4_K, head-packed), bq (zeros - no bias)
       Wk (Q4_K, head-packed), bk (zeros)
       Wv, bv, Wo, bo, ...
       W_gate (Q4_K), W_up (Q4_K), b1 (zeros)
       W_down, b2, ...
  3. Final:
       final_ln_gamma, final_bias

Key Transformations During Copy

1. Norm Vectors: FP16/BF16 → FP32

GGUF stores RMSNorm gammas as FP16 or BF16. We convert to FP32 for the kernel:

def read_vector_f32(f, info):
    """Read a 1-D norm vector and promote it to FP32."""
    n = info.dims[0]  # number of elements (ne0)
    if info.ggml_type == GGML_TYPE_F16:
        return np.frombuffer(f.read(n * 2), np.float16).astype(np.float32)
    elif info.ggml_type == GGML_TYPE_BF16:
        u16 = np.frombuffer(f.read(n * 2), np.uint16)
        # BF16 → FP32: place the 16 stored bits in the high half
        return (u16.astype(np.uint32) << 16).view(np.float32)
    return np.frombuffer(f.read(n * 4), np.float32)

2. Q/K/V: Head-Packed Layout

GGUF stores Q/K/V as flat [in_dim × out_dim]. We repack per-head for aligned access:

# For each head, copy head_dim rows
# Then pad to aligned_head_dim if needed
for head in range(num_heads):
    for row in range(head_dim):
        copy_row(src, dst)
    # Zero-pad extra rows
    for row in range(head_dim, aligned_head_dim):
        write_zero_row(dst)
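
The effect of head-packing is easier to see on a toy FP32 matrix. The sketch below (numpy, made-up sizes) shows the layout transformation; the real converter streams quantized row bytes instead of numpy rows:

```python
import numpy as np

def head_pack(w, num_heads, head_dim, aligned_head_dim):
    """Repack a flat [num_heads*head_dim, in_dim] weight into per-head
    groups, zero-padding each head up to aligned_head_dim rows."""
    in_dim = w.shape[1]
    out = np.zeros((num_heads * aligned_head_dim, in_dim), dtype=w.dtype)
    for h in range(num_heads):
        dst = h * aligned_head_dim
        out[dst:dst + head_dim] = w[h * head_dim:(h + 1) * head_dim]
    return out
```

With `num_heads=2, head_dim=3, aligned_head_dim=4`, a 6-row input becomes an 8-row output: rows 0-2 hold head 0, row 3 is zero padding, rows 4-6 hold head 1, row 7 is padding.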

3. Q4_K Blocks: Direct Copy (No Dequant!)

The key optimization: we copy quantized blocks byte-for-byte. Dequantization happens at inference time in the GEMM kernel.

def copy_bytes_stream(f_in, src_pos, nbytes, w_out, chunk=1 << 20):
    """Stream copy without loading the entire tensor into memory (1 MiB chunks)."""
    f_in.seek(src_pos)
    remaining = nbytes
    while remaining > 0:
        buf = f_in.read(min(remaining, chunk))
        if not buf:
            raise IOError("Unexpected EOF while copying tensor data")
        w_out.write(buf)
        remaining -= len(buf)

Complete Data Flow Example

Let's trace a single weight tensor through the entire pipeline:

Example: blk.0.attn_q.weight (Q4_K)

GGUF File, tensor info (from header):
  name   = "blk.0.attn_q.weight"
  dims   = [3584, 3584] (ne0=cols, ne1=rows)
  type   = Q4_K (12)
  offset = 0x1A2B3C

Raw data (in data section):
  total_elements = 3584 × 3584 = 12,845,056
  Q4_K blocks    = 12,845,056 / 256 = 50,176
  bytes          = 50,176 × 144 = 7,225,344 ≈ 6.9 MB (vs 49 MB for FP32!)
  Each Q4_K block: [d:2][dmin:2][scales:12][qs:128] = 144 bytes,
  holding 256 weights at ~4.5 bits each.

    ↓ Copy (no dequant)

Bump File:
  Position: after tok_emb, pos_emb, ln1, ln2
  Layer 0, weight index: Wq (4th tensor)
  dtype_table[4] = CK_DT_Q4_K (6)
  Head-packed layout: num_heads = 28, head_dim = 128
    For each head (0..27): copy 128 rows of [3584] Q4_K blocks
    (padding rows if aligned_head > head_dim)

At Runtime:
  1. Check dtype_table → CK_DT_Q4_K
  2. Call dequant_q4_k_* during GEMM
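
The byte counts in this trace can be reproduced directly from the Q4_K block geometry:

```python
Q4_K_BLOCK_ELEMS = 256   # weights per Q4_K super-block
Q4_K_BLOCK_BYTES = 144   # d(2) + dmin(2) + scales(12) + qs(128)

def q4_k_nbytes(n_elements: int) -> int:
    """Byte size of a Q4_K tensor; element count must be a multiple of 256."""
    assert n_elements % Q4_K_BLOCK_ELEMS == 0
    return (n_elements // Q4_K_BLOCK_ELEMS) * Q4_K_BLOCK_BYTES
```

For blk.0.attn_q.weight, `q4_k_nbytes(3584 * 3584)` gives 7,225,344 bytes, against 51,380,224 bytes (about 49 MiB) for the same tensor stored as FP32.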

Checksumming

After writing all data, we compute a SHA-256 checksum over the payload (everything after the 128-byte header). This catches corruption during download or storage.

Code: Checksum with HashingWriter

class HashingWriter:
    """Wraps file writes to compute running SHA-256"""
    def __init__(self, f):
        self._f = f
        self._h = hashlib.sha256()
        self.bytes_written = 0

    def write(self, data):
        self._f.write(data)
        self._h.update(data)
        self.bytes_written += len(data)

    def digest(self):
        return self._h.digest()  # 32 bytes

# Usage:
with open(output_path, "w+b") as f:
    f.write(b"\x00" * 128)  # Placeholder header
    w = HashingWriter(f)

    # ... write all tensor data through w ...

    checksum = w.digest()

    # Go back and fill in header
    f.seek(0)
    f.write(b"BUMPWGT3")
    # ... write other fields ...
    f.seek(0x40)
    f.write(checksum)  # 32 bytes at offset 0x40
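
On the loading side, the same digest can be verified by re-hashing everything after the header. A sketch (hypothetical function name, reading the checksum from offset 0x40 as laid out in Step 7):

```python
import hashlib

def verify_bump_checksum(path: str, chunk: int = 1 << 20) -> bool:
    """Re-hash the payload after the 128-byte header and compare it with
    the SHA-256 digest stored at offset 0x40."""
    with open(path, "rb") as f:
        header = f.read(128)
        stored = header[0x40:0x60]
        h = hashlib.sha256()
        while True:
            buf = f.read(chunk)
            if not buf:
                break
            h.update(buf)
    return h.digest() == stored
```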

Validation Checklist

The converter validates many conditions before writing:

Shape Validation

  • Q4_K tensors must have ne0 % 256 == 0
  • embed_dim must be divisible by num_heads
  • K/V dimensions must match GQA ratios
  • MLP gate/up must have same dtype

Required Tensors

  • token_embd.weight (embeddings)
  • output_norm.weight (final norm)
  • All per-layer attention tensors (q/k/v/o)
  • All per-layer MLP tensors (gate/up/down)
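
These checks are cheap because they only consult the tensor index from Step 4, never the data. A sketch of a few of them (hypothetical function; the real converter validates more conditions):

```python
GGML_TYPE_Q4_K = 12

def validate_shapes(meta, tensors):
    """Spot-check shape invariants; `meta` keys follow the Step 3 table,
    `tensors` maps names to objects with .ggml_type and .dims."""
    embed_dim = meta["llama.embedding_length"]
    num_heads = meta["llama.attention.head_count"]
    num_kv_heads = meta["llama.attention.head_count_kv"]

    if embed_dim % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
    if num_heads % num_kv_heads != 0:
        raise ValueError("GQA requires num_heads % num_kv_heads == 0")

    for name, info in tensors.items():
        if info.ggml_type == GGML_TYPE_Q4_K and info.dims[0] % 256 != 0:
            raise ValueError(f"{name}: Q4_K ne0 must be a multiple of 256")

    return embed_dim // num_heads  # head_dim
```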

Output Files

The converter produces two files:

File          Purpose
weights.bump  Binary weight data (header + dtype table + tensors)
config.json   HuggingFace-compatible model config (for tooling)

Example config.json Output

{
  "architectures": ["LlamaForCausalLM"],
  "model_type": "llama",
  "num_hidden_layers": 36,
  "hidden_size": 3584,
  "intermediate_size": 18944,
  "num_attention_heads": 28,
  "num_key_value_heads": 4,
  "vocab_size": 151936,
  "max_position_embeddings": 32768,
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0
}
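
The config is assembled from the metadata extracted in Step 3. A sketch of that mapping (hypothetical function; vocab_size comes from the token_embd tensor shape rather than metadata, so it is omitted here):

```python
import json

def write_config(meta: dict, path: str) -> dict:
    """Emit a HuggingFace-style config.json from GGUF metadata,
    following the key mapping in Step 3."""
    cfg = {
        "architectures": ["LlamaForCausalLM"],
        "model_type": meta["general.architecture"],
        "num_hidden_layers": meta["llama.block_count"],
        "hidden_size": meta["llama.embedding_length"],
        "intermediate_size": meta["llama.feed_forward_length"],
        "num_attention_heads": meta["llama.attention.head_count"],
        "num_key_value_heads": meta["llama.attention.head_count_kv"],
        "max_position_embeddings": meta["llama.context_length"],
        "rms_norm_eps": meta["llama.norm_rms_eps"],
        "rope_theta": meta["llama.rope.freq_base"],
        # vocab_size is derived from token_embd.weight's dims, not metadata
    }
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)
    return cfg
```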

Summary

Key Insight: The GGUF→Bump conversion is a layout transformation, not a dequantization. Quantized blocks are copied byte-for-byte. The only conversions are:
  • Norm gammas: FP16/BF16 → FP32 (for kernel compatibility)
  • Attention weights: Flat → Head-packed (for aligned access)
  • Biases: Generate zeros (GGUF models typically have no bias)