GGUF to Bump: Step-by-Step Conversion
This document walks through every step of converting a GGUF file to C-Kernel-Engine's bump allocator format. We'll trace through the actual bytes, show the parsing logic, and explain how GGUF tensor names map to our weight layout.
scripts/convert_gguf_to_bump.py implements this entire pipeline. This document explains the "why" behind each step.
Overview: The 8-Step Pipeline
Step 1: Read and Validate Magic
Every GGUF file starts with the 4-byte magic string "GGUF". This identifies the file format and catches truncated or corrupted files early.
Code: Magic Validation
magic = f.read(4)
if magic != b"GGUF":
    raise GGUFError(f"Invalid magic {magic!r} (expected b'GGUF')")
Step 2: Parse GGUF Header
After the magic, we read the version and counts. GGUF v2+ uses 64-bit counts (v1 used 32-bit).
Code: Header Parsing
version = r.u32()  # Read 4-byte unsigned int
if version >= 2:
    n_tensors = r.u64()  # 8-byte count
    n_kv = r.u64()
else:
    n_tensors = r.u32()  # v1: 4-byte count
    n_kv = r.u32()

# Sanity check (catches corrupt headers)
if n_tensors > 1_000_000 or n_kv > 1_000_000:
    raise GGUFError(f"Header counts look corrupt: {n_tensors=}, {n_kv=}")
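The snippets above use a reader object `r` with `u32`/`u64`/`key_str` helpers that this document never defines. A minimal sketch of what such a reader could look like (an assumption for illustration, not the script's actual class; GGUF strings are a u64 length followed by UTF-8 bytes):

```python
import struct

class Reader:
    """Minimal little-endian binary reader for GGUF parsing."""

    def __init__(self, f):
        self.f = f

    def u32(self):
        return struct.unpack("<I", self.f.read(4))[0]

    def u64(self):
        return struct.unpack("<Q", self.f.read(8))[0]

    def key_str(self):
        # GGUF strings: u64 byte length, then UTF-8 payload
        n = self.u64()
        return self.f.read(n).decode("utf-8")

    def tell(self):
        return self.f.tell()

    def seek(self, pos):
        self.f.seek(pos)
```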
Step 3: Read Metadata Key-Value Pairs
GGUF metadata contains model configuration. We only parse the keys we need and skip the rest for performance (especially important for large tokenizer arrays).
Keys We Extract
| GGUF Key | Type | Maps To | Example |
|---|---|---|---|
| general.architecture | string | model_type | "llama" |
| llama.block_count | u32 | num_layers | 36 |
| llama.embedding_length | u32 | hidden_size | 3584 |
| llama.feed_forward_length | u32 | intermediate_size | 18944 |
| llama.attention.head_count | u32 | num_heads | 28 |
| llama.attention.head_count_kv | u32 | num_kv_heads | 4 |
| llama.context_length | u32 | max_position_embeddings | 32768 |
| llama.rope.freq_base | f32 | rope_theta | 1000000.0 |
| llama.norm_rms_eps | f32 | rms_norm_eps | 1e-6 |
| general.alignment | u32 | (internal) | 32 |
Code: Selective Metadata Parsing
wanted_meta = {
    "general.architecture",
    "llama.block_count",
    "llama.embedding_length",
    # ... other keys we need
}

meta = {}
for _ in range(n_kv):
    key = r.key_str()  # Read key length + key bytes
    vtype = r.u32()    # Read value type
    if key in wanted_meta:
        meta[key] = _gguf_read_value(r, vtype)  # Parse value
    else:
        _gguf_skip_value(r, vtype)  # Skip unwanted keys (fast!)
Step 4: Read Tensor Info (Headers Only)
For each tensor, we read its name, dimensions, type, and offset within the data section. The actual tensor data comes later - this is just the index.
Code: Building Tensor Index
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class TensorInfo:
    name: str
    dims: Tuple[int, ...]  # GGML order: ne0, ne1, ...
    ggml_type: int
    offset: int            # Relative to data section start

tensors: Dict[str, TensorInfo] = {}
for _ in range(n_tensors):
    name = r.key_str()   # Read name
    n_dims = r.u32()     # Read dimension count
    dims = tuple(r.u64() for _ in range(n_dims))
    ggml_type = r.u32()  # Q4_K=12, F32=0, etc.
    offset = r.u64()     # Byte offset in data section
    tensors[name] = TensorInfo(name=name, dims=dims,
                               ggml_type=ggml_type, offset=offset)
Step 5: Calculate Data Section Start (Alignment)
GGUF aligns the data section to a boundary (typically 32 bytes). This ensures tensor data is properly aligned for SIMD operations.
Code: Calculate and Seek to Data
alignment = meta.get("general.alignment", 32) # Default 32 bytes
# Current position after reading all tensor info
current_pos = r.tell()
# Align to boundary
data_start = ((current_pos + alignment - 1) // alignment) * alignment
# Seek to data section (some writers already align, this handles both)
r.seek(data_start)
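The round-up-to-multiple formula is easy to sanity-check in isolation:

```python
def align_up(pos, alignment=32):
    """Round pos up to the next multiple of alignment (no-op if already aligned)."""
    return ((pos + alignment - 1) // alignment) * alignment

# A position of 1000 rounds up to the next 32-byte boundary, 1024;
# an already-aligned position is unchanged.
```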
Step 6: Build Dtype Table
The bump format stores a per-tensor dtype table so the runtime knows which dequantization kernel to use. Each tensor gets a 1-byte type code.
Code: Build Dtype Table
# Start with token embedding dtype
token_dtype = ck_dtype_from_ggml_type(tok.ggml_type)
dtype_table = [token_dtype, CK_DT_FP32] # tok_emb, pos_emb
# Per-layer dtypes (14 entries per layer)
for layer in range(num_layers):
wq = tensors[f"blk.{layer}.attn_q.weight"]
wk = tensors[f"blk.{layer}.attn_k.weight"]
# ... get other tensors
dtype_table.extend([
CK_DT_FP32, # ln1_gamma (always FP32)
CK_DT_FP32, # ln2_gamma (always FP32)
ck_dtype_from_ggml_type(wq.ggml_type), # Wq
CK_DT_FP32, # bq (bias placeholder)
ck_dtype_from_ggml_type(wk.ggml_type), # Wk
CK_DT_FP32, # bk
# ... etc
])
# Final layer
dtype_table.extend([CK_DT_FP32, CK_DT_FP32]) # final_ln, final_bias
# Write as bytes
out_f.write(struct.pack("<I", len(dtype_table)))
out_f.write(bytes(dtype_table))
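The snippet above calls `ck_dtype_from_ggml_type` without showing it. A minimal sketch, assuming a small lookup table: the GGML ids F32=0 and Q4_K=12 appear in the tensor-info snippet and F16=1 is standard GGML numbering, but the CK_DT_* numeric values here are hypothetical placeholders, not the runtime's real codes.

```python
# GGML type ids (F32=0 and Q4_K=12 per the tensor-info snippet above;
# F16=1 is standard GGML numbering).
GGML_TYPE_F32, GGML_TYPE_F16, GGML_TYPE_Q4_K = 0, 1, 12

# CK dtype codes: hypothetical values for illustration only.
CK_DT_FP32, CK_DT_FP16, CK_DT_Q4_K = 0, 1, 2

def ck_dtype_from_ggml_type(t):
    """Translate a GGML tensor type id to a 1-byte CK dtype code."""
    table = {
        GGML_TYPE_F32: CK_DT_FP32,
        GGML_TYPE_F16: CK_DT_FP16,
        GGML_TYPE_Q4_K: CK_DT_Q4_K,
    }
    try:
        return table[t]
    except KeyError:
        raise ValueError(f"Unsupported GGML type {t}")
```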
Step 7: Write Bump Header
The bump header is 128 bytes containing everything the runtime needs to interpret the weight data. We write a placeholder first, then fill it in after computing the SHA-256 checksum.
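No field-by-field layout is given here, so the sketch below is illustrative only: the `b"BUMPWGT3"` magic at offset 0 and the 32-byte SHA-256 at offset 0x40 come from the checksumming section later in this document; every other field name, width, and offset is a hypothetical placeholder.

```python
import struct

def pack_bump_header(n_layers, hidden_size, payload_bytes, checksum):
    """Sketch of a 128-byte bump header: magic, a few illustrative
    fields, zero padding, and the SHA-256 at offset 0x40."""
    hdr = bytearray(128)                 # zero-filled placeholder
    hdr[0:8] = b"BUMPWGT3"               # magic (from this document)
    struct.pack_into("<II", hdr, 8, n_layers, hidden_size)  # hypothetical fields
    struct.pack_into("<Q", hdr, 16, payload_bytes)          # hypothetical field
    hdr[0x40:0x60] = checksum            # 32 bytes at 0x40 (from this document)
    return bytes(hdr)
```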
Step 8: Stream Tensor Data
This is where the actual conversion happens. We read tensors from GGUF and write them to bump in a specific order. Importantly, we don't dequantize - we copy the quantized blocks directly.
GGUF → Bump Tensor Name Mapping
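The copy order follows the dtype table from Step 6: embeddings first, then each layer's tensors, then the final norm. The per-layer GGUF names can be sketched as below; the `attn_q`/`attn_k` names appear in the snippets in this document, while `attn_norm`, `ffn_norm`, `attn_v`, `attn_output`, and the `ffn_*` names follow the standard GGUF llama convention and are assumptions about this converter.

```python
def gguf_layer_tensor_names(layer):
    """GGUF tensor names consumed for one transformer layer,
    in the order the dtype table implies."""
    p = f"blk.{layer}."
    return [
        p + "attn_norm.weight",    # ln1_gamma
        p + "ffn_norm.weight",     # ln2_gamma
        p + "attn_q.weight",       # Wq
        p + "attn_k.weight",       # Wk
        p + "attn_v.weight",       # Wv
        p + "attn_output.weight",  # Wo
        p + "ffn_gate.weight",     # MLP gate
        p + "ffn_up.weight",       # MLP up
        p + "ffn_down.weight",     # MLP down
    ]
```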
Key Transformations During Copy
1. Norm Vectors: FP16/BF16 → FP32
GGUF stores RMSNorm gammas as FP16 or BF16. We convert to FP32 for the kernel:
import numpy as np

def read_vector_f32(f, info):
    """Read a 1-D norm vector and return it as FP32."""
    # F16/BF16 elements are 2 bytes each; F32 elements are 4
    row_bytes = info.dims[0] * (4 if info.ggml_type == GGML_TYPE_F32 else 2)
    raw = f.read(row_bytes)
    if info.ggml_type == GGML_TYPE_F16:
        return np.frombuffer(raw, np.float16).astype(np.float32)
    elif info.ggml_type == GGML_TYPE_BF16:
        u16 = np.frombuffer(raw, np.uint16)
        u32 = u16.astype(np.uint32) << 16
        return u32.view(np.float32)
    return np.frombuffer(raw, np.float32)
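The BF16 branch works because BF16 is simply the top 16 bits of an IEEE FP32: widening to u32 and shifting left by 16 reconstructs the original float exactly for any value representable in BF16. A standalone round-trip check of just that trick (assumes NumPy):

```python
import numpy as np

def bf16_to_f32(raw):
    """Expand raw BF16 bytes to FP32 by placing each u16 in the
    high half of a u32 and reinterpreting as float32."""
    u16 = np.frombuffer(raw, np.uint16)
    return (u16.astype(np.uint32) << 16).view(np.float32)

# Round-trip: truncate FP32 values to BF16 bytes, then expand back.
# These values are exactly representable in BF16, so no precision is lost.
x = np.array([1.0, -2.5, 0.15625], np.float32)
bf16_bytes = (x.view(np.uint32) >> 16).astype(np.uint16).tobytes()
y = bf16_to_f32(bf16_bytes)
```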
2. Q/K/V: Head-Packed Layout
GGUF stores Q/K/V as flat [in_dim × out_dim]. We repack per-head for aligned access:
# For each head, copy head_dim rows,
# then pad to aligned_head_dim if needed
for head in range(num_heads):
    for row in range(head_dim):
        copy_row(src, dst)
    # Zero-pad extra rows
    for row in range(head_dim, aligned_head_dim):
        write_zero_row(dst)
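For an unquantized weight this repacking can be written as a few NumPy slices. A minimal sketch, assuming a row-major [num_heads * head_dim, in_dim] matrix (hypothetical helper for illustration, not the script's code):

```python
import numpy as np

def head_pack(w, num_heads, head_dim, aligned_head_dim):
    """Repack [num_heads*head_dim, in_dim] rows so each head owns
    aligned_head_dim rows, zero-padding the extra rows per head."""
    in_dim = w.shape[1]
    out = np.zeros((num_heads * aligned_head_dim, in_dim), w.dtype)
    for h in range(num_heads):
        dst = h * aligned_head_dim
        out[dst:dst + head_dim] = w[h * head_dim:(h + 1) * head_dim]
    return out
```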
3. Q4_K Blocks: Direct Copy (No Dequant!)
The key optimization: we copy quantized blocks byte-for-byte. Dequantization happens at inference time in the GEMM kernel.
def copy_bytes_stream(f_in, src_pos, nbytes, w_out, chunk=1 << 20):
    """Stream copy without loading the entire tensor into memory (1 MiB chunks)"""
    f_in.seek(src_pos)
    remaining = nbytes
    while remaining > 0:
        buf = f_in.read(min(remaining, chunk))
        w_out.write(buf)
        remaining -= len(buf)
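Chunked copying caps peak memory at the chunk size regardless of tensor size. The pattern can be exercised with in-memory streams; the sketch below re-states the helper so it runs standalone (same logic as above, with a tiny chunk to force multiple iterations):

```python
import io

def copy_bytes_stream(f_in, src_pos, nbytes, w_out, chunk=1 << 20):
    """Stream-copy nbytes from f_in (starting at src_pos) to w_out in chunks."""
    f_in.seek(src_pos)
    remaining = nbytes
    while remaining > 0:
        buf = f_in.read(min(remaining, chunk))
        w_out.write(buf)
        remaining -= len(buf)

# Copy 10 payload bytes past a 3-byte prefix, 4 bytes at a time.
src = io.BytesIO(b"HDR" + b"x" * 10)
dst = io.BytesIO()
copy_bytes_stream(src, src_pos=3, nbytes=10, w_out=dst, chunk=4)
```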
Complete Data Flow Example
Let's trace a single weight tensor through the entire pipeline:
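As a back-of-the-envelope version for one Q4_K attention weight: the 3584 × 3584 shape is taken from the model in the metadata table above, and the 144-byte / 256-weight super-block is the standard GGML Q4_K layout (consistent with the `ne0 % 256 == 0` constraint in the validation checklist), so treat both as illustrative assumptions.

```python
# Trace blk.0.attn_q.weight (3584 x 3584, Q4_K) through the copy step.
# Q4_K packs 256 weights into one 144-byte super-block (standard GGML layout).
QK_K = 256
BLOCK_BYTES = 144

n_elems = 3584 * 3584        # 12,845,056 weights
n_blocks = n_elems // QK_K   # super-blocks in the tensor
n_bytes = n_blocks * BLOCK_BYTES

# Roughly 4.5 bits per weight; the converter copies these bytes
# verbatim into the bump payload without dequantizing.
```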
Checksumming
After writing all data, we compute a SHA-256 checksum over the payload (everything after the 128-byte header). This catches corruption during download or storage.
Code: Checksum with HashingWriter
import hashlib

class HashingWriter:
    """Wraps file writes to compute a running SHA-256"""
    def __init__(self, f):
        self._f = f
        self._h = hashlib.sha256()
        self.bytes_written = 0

    def write(self, data):
        self._f.write(data)
        self._h.update(data)
        self.bytes_written += len(data)

    def digest(self):
        return self._h.digest()  # 32 bytes

# Usage:
with open(output_path, "w+b") as f:
    f.write(b"\x00" * 128)  # Placeholder header
    w = HashingWriter(f)
    # ... write all tensor data through w ...
    checksum = w.digest()

    # Go back and fill in the header
    f.seek(0)
    f.write(b"BUMPWGT3")
    # ... write other fields ...
    f.seek(0x40)
    f.write(checksum)  # 32 bytes at offset 0x40
Validation Checklist
The converter validates many conditions before writing:
Shape Validation
- Q4_K tensors must have ne0 % 256 == 0
- embed_dim must be divisible by num_heads
- K/V dimensions must match GQA ratios
- MLP gate/up must have the same dtype
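The shape checks above can be sketched as a helper (hypothetical function for illustration; the real converter's signature and error messages may differ):

```python
def validate_shapes(ne0, embed_dim, num_heads, num_kv_heads, is_q4k):
    """Raise ValueError on any violated shape constraint."""
    if is_q4k and ne0 % 256 != 0:
        raise ValueError(f"Q4_K tensor ne0={ne0} is not a multiple of 256")
    if embed_dim % num_heads != 0:
        raise ValueError(f"embed_dim={embed_dim} not divisible by num_heads={num_heads}")
    if num_heads % num_kv_heads != 0:
        raise ValueError(f"Bad GQA ratio: {num_heads} heads, {num_kv_heads} KV heads")
```

With the example model from the metadata table (3584 hidden, 28 heads, 4 KV heads) all checks pass; a Q4_K row width that is not a multiple of 256 fails.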
Required Tensors
- token_embd.weight (embeddings)
- output_norm.weight (final norm)
- All per-layer attention tensors (q/k/v/o)
- All per-layer MLP tensors (gate/up/down)
Output Files
The converter produces two files:
| File | Purpose |
|---|---|
| weights.bump | Binary weight data (header + dtype table + tensors) |
| config.json | HuggingFace-compatible model config (for tooling) |
Example config.json Output
{
  "architectures": ["LlamaForCausalLM"],
  "model_type": "llama",
  "num_hidden_layers": 36,
  "hidden_size": 3584,
  "intermediate_size": 18944,
  "num_attention_heads": 28,
  "num_key_value_heads": 4,
  "vocab_size": 151936,
  "max_position_embeddings": 32768,
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0
}
Summary
- Norm gammas: FP16/BF16 → FP32 (for kernel compatibility)
- Attention weights: Flat → Head-packed (for aligned access)
- Biases: Generate zeros (GGUF models typically have no bias)