GGUF to Bump: Step-by-Step Conversion
This document walks through every step of converting a GGUF file to C-Kernel-Engine's bump allocator format. We'll trace through the actual bytes, show the parsing logic, and explain how GGUF tensor names map to our weight layout.
scripts/convert_gguf_to_bump.py implements this entire pipeline. This document explains the "why" behind each step.
Overview: The 8-Step Pipeline
Step 1: Read and Validate Magic
Every GGUF file starts with the 4-byte magic string "GGUF". This identifies the file format and catches truncated or corrupted files early.
Code: Magic Validation
magic = f.read(4)
if magic != b"GGUF":
    raise GGUFError(f"Invalid magic {magic!r} (expected b'GGUF')")
Step 2: Parse GGUF Header
After the magic, we read the version and counts. GGUF v2+ uses 64-bit counts (v1 used 32-bit).
Code: Header Parsing
version = r.u32()  # Read 4-byte unsigned int
if version >= 2:
    n_tensors = r.u64()  # 8-byte count
    n_kv = r.u64()
else:
    n_tensors = r.u32()  # v1: 4-byte count
    n_kv = r.u32()

# Sanity check (catches corrupt headers)
if n_tensors > 1_000_000 or n_kv > 1_000_000:
    raise GGUFError(f"Header counts look corrupt: {n_tensors=}, {n_kv=}")
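The snippets above use a reader object `r` with `u32`/`u64`/`key_str` helpers that this document never defines. A minimal sketch of what such a reader could look like (an assumption for illustration, not the script's actual class; GGUF strings are a u64 length followed by UTF-8 bytes):

```python
import struct

class Reader:
    """Minimal little-endian binary reader for GGUF parsing."""

    def __init__(self, f):
        self.f = f

    def u32(self):
        return struct.unpack("<I", self.f.read(4))[0]

    def u64(self):
        return struct.unpack("<Q", self.f.read(8))[0]

    def key_str(self):
        # GGUF strings: u64 byte length, then UTF-8 payload
        n = self.u64()
        return self.f.read(n).decode("utf-8")

    def tell(self):
        return self.f.tell()

    def seek(self, pos):
        self.f.seek(pos)
```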
Step 3: Read Metadata Key-Value Pairs
GGUF metadata contains model configuration. We only parse the keys we need and skip the rest for performance (especially important for large tokenizer arrays).
Keys We Extract
| GGUF Key | Type | Maps To | Example |
|---|---|---|---|
| general.architecture | string | model_type | "llama" |
| llama.block_count | u32 | num_layers | 36 |
| llama.embedding_length | u32 | hidden_size | 3584 |
| llama.feed_forward_length | u32 | intermediate_size | 18944 |
| llama.attention.head_count | u32 | num_heads | 28 |
| llama.attention.head_count_kv | u32 | num_kv_heads | 4 |
| llama.context_length | u32 | max_position_embeddings | 32768 |
| llama.rope.freq_base | f32 | rope_theta | 1000000.0 |
| llama.norm_rms_eps | f32 | rms_norm_eps | 1e-6 |
| general.alignment | u32 | (internal) | 32 |
Code: Selective Metadata Parsing
wanted_meta = {
    "general.architecture",
    "llama.block_count",
    "llama.embedding_length",
    # ... other keys we need
}

meta = {}
for _ in range(n_kv):
    key = r.key_str()  # Read key length + key bytes
    vtype = r.u32()    # Read value type
    if key in wanted_meta:
        meta[key] = _gguf_read_value(r, vtype)  # Parse value
    else:
        _gguf_skip_value(r, vtype)  # Skip unwanted keys (fast!)
Step 4: Read Tensor Info (Headers Only)
For each tensor, we read its name, dimensions, type, and offset within the data section. The actual tensor data comes later - this is just the index.
Code: Building Tensor Index
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class TensorInfo:
    name: str
    dims: Tuple[int, ...]  # GGML order: ne0, ne1, ...
    ggml_type: int
    offset: int            # Relative to data section start

tensors: Dict[str, TensorInfo] = {}
for _ in range(n_tensors):
    name = r.key_str()   # Read name
    n_dims = r.u32()     # Read dimension count
    dims = tuple(r.u64() for _ in range(n_dims))
    ggml_type = r.u32()  # Q4_K=12, F32=0, etc.
    offset = r.u64()     # Byte offset in data section
    tensors[name] = TensorInfo(name=name, dims=dims,
                               ggml_type=ggml_type, offset=offset)
Step 5: Calculate Data Section Start (Alignment)
GGUF aligns the data section to a boundary (typically 32 bytes). This ensures tensor data is properly aligned for SIMD operations.
Code: Calculate and Seek to Data
alignment = meta.get("general.alignment", 32) # Default 32 bytes
# Current position after reading all tensor info
current_pos = r.tell()
# Align to boundary
data_start = ((current_pos + alignment - 1) // alignment) * alignment
# Seek to data section (some writers already align, this handles both)
r.seek(data_start)
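The round-up-to-multiple formula is easy to sanity-check in isolation:

```python
def align_up(pos, alignment=32):
    """Round pos up to the next multiple of alignment (no-op if already aligned)."""
    return ((pos + alignment - 1) // alignment) * alignment

# A position of 1000 rounds up to the next 32-byte boundary, 1024;
# an already-aligned position is unchanged.
```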
Step 6: Build Dtype Table
The bump format stores a per-tensor dtype table so the runtime knows which dequantization kernel to use. Each tensor gets a 1-byte type code.
Code: Build Dtype Table
# Start with token embedding dtype
token_dtype = ck_dtype_from_ggml_type(tok.ggml_type)
dtype_table = [token_dtype, CK_DT_FP32] # tok_emb, pos_emb
# Per-layer dtypes (14 entries per layer)
for layer in range(num_layers):
wq = tensors[f"blk.{layer}.attn_q.weight"]
wk = tensors[f"blk.{layer}.attn_k.weight"]
# ... get other tensors
dtype_table.extend([
CK_DT_FP32, # ln1_gamma (always FP32)
CK_DT_FP32, # ln2_gamma (always FP32)
ck_dtype_from_ggml_type(wq.ggml_type), # Wq
CK_DT_FP32, # bq (bias placeholder)
ck_dtype_from_ggml_type(wk.ggml_type), # Wk
CK_DT_FP32, # bk
# ... etc
])
# Final layer
dtype_table.extend([CK_DT_FP32, CK_DT_FP32]) # final_ln, final_bias
# Write as bytes
out_f.write(struct.pack("<I", len(dtype_table)))
out_f.write(bytes(dtype_table))
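The snippet above calls `ck_dtype_from_ggml_type` without showing it. A minimal sketch, assuming a small lookup table: the GGML ids F32=0 and Q4_K=12 appear in the tensor-info snippet and F16=1 is standard GGML numbering, but the CK_DT_* numeric values here are hypothetical placeholders, not the runtime's real codes.

```python
# GGML type ids (F32=0 and Q4_K=12 per the tensor-info snippet above;
# F16=1 is standard GGML numbering).
GGML_TYPE_F32, GGML_TYPE_F16, GGML_TYPE_Q4_K = 0, 1, 12

# CK dtype codes: hypothetical values for illustration only.
CK_DT_FP32, CK_DT_FP16, CK_DT_Q4_K = 0, 1, 2

def ck_dtype_from_ggml_type(t):
    """Translate a GGML tensor type id to a 1-byte CK dtype code."""
    table = {
        GGML_TYPE_F32: CK_DT_FP32,
        GGML_TYPE_F16: CK_DT_FP16,
        GGML_TYPE_Q4_K: CK_DT_Q4_K,
    }
    try:
        return table[t]
    except KeyError:
        raise ValueError(f"Unsupported GGML type {t}")
```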
Step 7: Write Bump Header
The bump header is 128 bytes containing everything the runtime needs to interpret the weight data. We write a placeholder first, then fill it in after computing the SHA-256 checksum.
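No field-by-field layout is given here, so the sketch below is illustrative only: the `b"BUMPWGT3"` magic at offset 0 and the 32-byte SHA-256 at offset 0x40 come from the checksumming section later in this document; every other field name, width, and offset is a hypothetical placeholder.

```python
import struct

def pack_bump_header(n_layers, hidden_size, payload_bytes, checksum):
    """Sketch of a 128-byte bump header: magic, a few illustrative
    fields, zero padding, and the SHA-256 at offset 0x40."""
    hdr = bytearray(128)                 # zero-filled placeholder
    hdr[0:8] = b"BUMPWGT3"               # magic (from this document)
    struct.pack_into("<II", hdr, 8, n_layers, hidden_size)  # hypothetical fields
    struct.pack_into("<Q", hdr, 16, payload_bytes)          # hypothetical field
    hdr[0x40:0x60] = checksum            # 32 bytes at 0x40 (from this document)
    return bytes(hdr)
```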
Step 8: Stream Tensor Data
This is where the actual conversion happens. We read tensors from GGUF and write them to bump in a specific order. Importantly, we don't dequantize - we copy the quantized blocks directly.
GGUF → Bump Tensor Name Mapping
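The copy order follows the dtype table from Step 6: embeddings first, then each layer's tensors, then the final norm. The per-layer GGUF names can be sketched as below; the `attn_q`/`attn_k` names appear in the snippets in this document, while `attn_norm`, `ffn_norm`, `attn_v`, `attn_output`, and the `ffn_*` names follow the standard GGUF llama convention and are assumptions about this converter.

```python
def gguf_layer_tensor_names(layer):
    """GGUF tensor names consumed for one transformer layer,
    in the order the dtype table implies."""
    p = f"blk.{layer}."
    return [
        p + "attn_norm.weight",    # ln1_gamma
        p + "ffn_norm.weight",     # ln2_gamma
        p + "attn_q.weight",       # Wq
        p + "attn_k.weight",       # Wk
        p + "attn_v.weight",       # Wv
        p + "attn_output.weight",  # Wo
        p + "ffn_gate.weight",     # MLP gate
        p + "ffn_up.weight",       # MLP up
        p + "ffn_down.weight",     # MLP down
    ]
```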
Key Transformations During Copy
1. Norm Vectors: FP16/BF16 → FP32
GGUF stores RMSNorm gammas as FP16 or BF16. We convert to FP32 for the kernel:
import numpy as np

def read_vector_f32(f, info):
    """Read a 1-D norm vector and return it as FP32."""
    # F16/BF16 elements are 2 bytes each; F32 elements are 4
    row_bytes = info.dims[0] * (4 if info.ggml_type == GGML_TYPE_F32 else 2)
    raw = f.read(row_bytes)
    if info.ggml_type == GGML_TYPE_F16:
        return np.frombuffer(raw, np.float16).astype(np.float32)
    elif info.ggml_type == GGML_TYPE_BF16:
        u16 = np.frombuffer(raw, np.uint16)
        u32 = u16.astype(np.uint32) << 16
        return u32.view(np.float32)
    return np.frombuffer(raw, np.float32)
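The BF16 branch works because BF16 is simply the top 16 bits of an IEEE FP32: widening to u32 and shifting left by 16 reconstructs the original float exactly for any value representable in BF16. A standalone round-trip check of just that trick (assumes NumPy):

```python
import numpy as np

def bf16_to_f32(raw):
    """Expand raw BF16 bytes to FP32 by placing each u16 in the
    high half of a u32 and reinterpreting as float32."""
    u16 = np.frombuffer(raw, np.uint16)
    return (u16.astype(np.uint32) << 16).view(np.float32)

# Round-trip: truncate FP32 values to BF16 bytes, then expand back.
# These values are exactly representable in BF16, so no precision is lost.
x = np.array([1.0, -2.5, 0.15625], np.float32)
bf16_bytes = (x.view(np.uint32) >> 16).astype(np.uint16).tobytes()
y = bf16_to_f32(bf16_bytes)
```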
2. Q/K/V: Head-Packed Layout
GGUF stores Q/K/V as flat [in_dim × out_dim]. We repack per-head for aligned access:
# For each head, copy head_dim rows,
# then pad to aligned_head_dim if needed
for head in range(num_heads):
    for row in range(head_dim):
        copy_row(src, dst)
    # Zero-pad extra rows
    for row in range(head_dim, aligned_head_dim):
        write_zero_row(dst)
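For an unquantized weight this repacking can be written as a few NumPy slices. A minimal sketch, assuming a row-major [num_heads * head_dim, in_dim] matrix (hypothetical helper for illustration, not the script's code):

```python
import numpy as np

def head_pack(w, num_heads, head_dim, aligned_head_dim):
    """Repack [num_heads*head_dim, in_dim] rows so each head owns
    aligned_head_dim rows, zero-padding the extra rows per head."""
    in_dim = w.shape[1]
    out = np.zeros((num_heads * aligned_head_dim, in_dim), w.dtype)
    for h in range(num_heads):
        dst = h * aligned_head_dim
        out[dst:dst + head_dim] = w[h * head_dim:(h + 1) * head_dim]
    return out
```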
3. Q4_K Blocks: Direct Copy (No Dequant!)
The key optimization: we copy quantized blocks byte-for-byte. Dequantization happens at inference time in the GEMM kernel.
def copy_bytes_stream(f_in, src_pos, nbytes, w_out, chunk=1 << 20):
    """Stream copy without loading the entire tensor into memory (1 MiB chunks)"""
    f_in.seek(src_pos)
    remaining = nbytes
    while remaining > 0:
        buf = f_in.read(min(remaining, chunk))
        w_out.write(buf)
        remaining -= len(buf)
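Chunked copying caps peak memory at the chunk size regardless of tensor size. The pattern can be exercised with in-memory streams; the sketch below re-states the helper so it runs standalone (same logic as above, with a tiny chunk to force multiple iterations):

```python
import io

def copy_bytes_stream(f_in, src_pos, nbytes, w_out, chunk=1 << 20):
    """Stream-copy nbytes from f_in (starting at src_pos) to w_out in chunks."""
    f_in.seek(src_pos)
    remaining = nbytes
    while remaining > 0:
        buf = f_in.read(min(remaining, chunk))
        w_out.write(buf)
        remaining -= len(buf)

# Copy 10 payload bytes past a 3-byte prefix, 4 bytes at a time.
src = io.BytesIO(b"HDR" + b"x" * 10)
dst = io.BytesIO()
copy_bytes_stream(src, src_pos=3, nbytes=10, w_out=dst, chunk=4)
```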
Complete Data Flow Example
Let's trace a single weight tensor through the entire pipeline:
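As a back-of-the-envelope version for one Q4_K attention weight: the 3584 × 3584 shape is taken from the model in the metadata table above, and the 144-byte / 256-weight super-block is the standard GGML Q4_K layout (consistent with the `ne0 % 256 == 0` constraint in the validation checklist), so treat both as illustrative assumptions.

```python
# Trace blk.0.attn_q.weight (3584 x 3584, Q4_K) through the copy step.
# Q4_K packs 256 weights into one 144-byte super-block (standard GGML layout).
QK_K = 256
BLOCK_BYTES = 144

n_elems = 3584 * 3584        # 12,845,056 weights
n_blocks = n_elems // QK_K   # super-blocks in the tensor
n_bytes = n_blocks * BLOCK_BYTES

# Roughly 4.5 bits per weight; the converter copies these bytes
# verbatim into the bump payload without dequantizing.
```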
Checksumming
After writing all data, we compute a SHA-256 checksum over the payload (everything after the 128-byte header). This catches corruption during download or storage.
Code: Checksum with HashingWriter
import hashlib

class HashingWriter:
    """Wraps file writes to compute a running SHA-256"""
    def __init__(self, f):
        self._f = f
        self._h = hashlib.sha256()
        self.bytes_written = 0

    def write(self, data):
        self._f.write(data)
        self._h.update(data)
        self.bytes_written += len(data)

    def digest(self):
        return self._h.digest()  # 32 bytes

# Usage:
with open(output_path, "w+b") as f:
    f.write(b"\x00" * 128)  # Placeholder header
    w = HashingWriter(f)
    # ... write all tensor data through w ...
    checksum = w.digest()

    # Go back and fill in the header
    f.seek(0)
    f.write(b"BUMPWGT3")
    # ... write other fields ...
    f.seek(0x40)
    f.write(checksum)  # 32 bytes at offset 0x40
Validation Checklist
The converter validates many conditions before writing:
Shape Validation
- Q4_K tensors must have ne0 % 256 == 0
- embed_dim must be divisible by num_heads
- K/V dimensions must match GQA ratios
- MLP gate/up must have the same dtype
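The shape checks above can be sketched as a helper (hypothetical function for illustration; the real converter's signature and error messages may differ):

```python
def validate_shapes(ne0, embed_dim, num_heads, num_kv_heads, is_q4k):
    """Raise ValueError on any violated shape constraint."""
    if is_q4k and ne0 % 256 != 0:
        raise ValueError(f"Q4_K tensor ne0={ne0} is not a multiple of 256")
    if embed_dim % num_heads != 0:
        raise ValueError(f"embed_dim={embed_dim} not divisible by num_heads={num_heads}")
    if num_heads % num_kv_heads != 0:
        raise ValueError(f"Bad GQA ratio: {num_heads} heads, {num_kv_heads} KV heads")
```

With the example model from the metadata table (3584 hidden, 28 heads, 4 KV heads) all checks pass; a Q4_K row width that is not a multiple of 256 fails.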
Required Tensors
- token_embd.weight (embeddings)
- output_norm.weight (final norm)
- All per-layer attention tensors (q/k/v/o)
- All per-layer MLP tensors (gate/up/down)
Output Files
The converter produces two files:
| File | Purpose |
|---|---|
| weights.bump | Binary weight data (header + dtype table + tensors) |
| config.json | HuggingFace-compatible model config (for tooling) |
Example config.json Output
{
  "architectures": ["LlamaForCausalLM"],
  "model_type": "llama",
  "num_hidden_layers": 36,
  "hidden_size": 3584,
  "intermediate_size": 18944,
  "num_attention_heads": 28,
  "num_key_value_heads": 4,
  "vocab_size": 151936,
  "max_position_embeddings": 32768,
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0
}
Summary
- Norm gammas: FP16/BF16 → FP32 (for kernel compatibility)
- Attention weights: Flat → Head-packed (for aligned access)
- Biases: Generate zeros (GGUF models typically have no bias)