C-Kernel BPE Tokenizer
From raw bytes to token IDs — one beautiful merge at a time. A visual, Feynman-style walkthrough of Byte Pair Encoding.
The One-Sentence Idea
Find the most common pair of adjacent symbols, replace every occurrence with a single new symbol, and repeat. That's the entire algorithm.
Step-by-Step: BPE on Our SVG Dataset
Our BPE trainer (ck-bpe-train) feeds on .svg, .txt, .xml, .html, .md, and .json files from the corpus. Let's trace the algorithm on a tiny SVG fragment:
<svg viewBox="0 0 24 24">
<path d="M12 2L2 22h20z"/>
</svg>
Step 1 — Split Into Bytes
Every byte of every file in the corpus becomes a base symbol. IDs 0–3 are reserved for special tokens; IDs 4–259 cover all 256 byte values.
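In code, the base alphabet is just an offset. A minimal Python sketch (illustrative only; the real trainer does this in C):

```python
SPECIAL_COUNT = 4   # <unk>, <bos>, <eos>, <pad> occupy IDs 0-3
BASE_BYTE_ID = 4    # byte b becomes base symbol 4 + b

def bytes_to_ids(data: bytes) -> list[int]:
    """Map every raw byte of a corpus file to its base-symbol ID."""
    return [BASE_BYTE_ID + b for b in data]

ids = bytes_to_ids(b'<svg viewBox="0 0 24 24">')
# '<' is byte 0x3C (60), so ids[0] == 64
```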
Step 2 — Count Every Adjacent Pair
Scan the entire corpus (multi-threaded, one shard per file). For every position i, record the pair (seq[i], seq[i+1]) in a hash table.
<path d=", stroke=", fill="# thousands of times. BPE naturally discovers these — the most frequent byte pairs in SVGs become the first merged tokens.
Step 3 — Pick the Top Pair & Merge It
The pair with the highest count wins (ties broken by smallest ID pair for determinism). It gets assigned the next available token ID, and every occurrence in the corpus is replaced in one pass.
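A Python sketch of pick-and-merge, with the tie-break rule from the text (highest count, then smallest ID pair). Function names are illustrative, not the C API:

```python
def pick_top_pair(pairs: dict) -> tuple:
    """Highest count wins; ties broken by the smallest (L, R) pair."""
    return max(pairs, key=lambda p: (pairs[p], (-p[0], -p[1])))

def replace_all(seq: list[int], left: int, right: int, merged: int) -> list[int]:
    """Rewrite every (left, right) occurrence as merged in one pass."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
            out.append(merged)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

# merging the winning pair (1, 2) into new token ID 260:
# replace_all([1, 2, 2, 3, 1, 2], 1, 2, 260) -> [260, 2, 3, 260]
```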
Step 4 — Repeat Until vocab_size
Each iteration: count → pick → merge → shrink. Pairs can now include previously merged tokens, so the algorithm builds longer and longer pieces.
The Algorithm — Clean Pseudocode
BPE Training
symbols ← {0: <unk>, 1: <bos>, 2: <eos>, 3: <pad>}
symbols ← symbols ∪ {4..259: all 256 bytes}
merges ← []
corpus ← load all files as byte-ID sequences
while |symbols| < vocab_size:
    pairs ← count_all_adjacent_pairs(corpus)
    (L, R) ← argmax pairs by count     # tiebreak: smallest ID pair
    if pairs[(L, R)] < min_freq:
        break
    M ← |symbols|                      # next available ID
    symbols[M] ← (L, R)                # new merged symbol
    merges.append((L, R, M))
    corpus ← replace_all(corpus, L, R → M)
emit tokenizer.json with vocab + merges
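The training loop above can be realized as a short executable sketch (Python for clarity; the production trainer is C). The corpus here is a single byte-ID sequence, and 260 is the first free ID after the 4 specials and 256 base bytes:

```python
from collections import Counter

def train_bpe(corpus: list[int], vocab_size: int, min_freq: int = 2):
    """Return (merges, final corpus), following the pseudocode above."""
    next_id = 260                    # 4 specials + 256 byte symbols
    merges = []
    while next_id < vocab_size:
        pairs = Counter(zip(corpus, corpus[1:]))
        if not pairs:
            break
        # highest count first; ties broken by smallest (L, R) pair
        (L, R), n = min(pairs.items(), key=lambda kv: (-kv[1], kv[0]))
        if n < min_freq:
            break
        merges.append((L, R, next_id))
        out, i = [], 0
        while i < len(corpus):       # replace_all in one pass
            if i + 1 < len(corpus) and (corpus[i], corpus[i + 1]) == (L, R):
                out.append(next_id)
                i += 2
            else:
                out.append(corpus[i])
                i += 1
        corpus = out
        next_id += 1
    return merges, corpus
```

On the toy sequence [101, 102, 101, 102, 101, 102] (the bytes of "ab" three times) with vocab_size=262, the first merge fuses (101, 102) into 260 and the second fuses (260, 260) into 261.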
BPE Encoding (at inference)
function encode(text):
    # 1. Pretokenize (GPT-2 regex)
    words ← split text by regex
    ids ← []
    for each word:
        # 2. Byte-level encode
        symbols ← map bytes to GPT-2 chars
        # 3. Apply merges in priority order
        for (L, R, M) in merge_rules:
            while (L, R) adjacent in symbols:
                replace L, R with M
        # 4. Look up final token IDs
        ids ← ids + [vocab[s] for s in symbols]
    return ids
GPT-2 Byte-Level Encoding
Raw bytes can't go directly into tokenizer JSON (control characters break JSON/display). GPT-2 defines a bijective mapping from every byte (0x00–0xFF) to a printable Unicode codepoint.
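This is the standard GPT-2 bytes_to_unicode construction, sketched in Python: printable bytes keep their own codepoint, and the remaining 68 bytes are shifted to codepoints 256, 257, ... in byte order:

```python
def gpt2_byte_to_unicode() -> dict[int, str]:
    """Bijective byte -> printable-codepoint map (GPT-2 style)."""
    # bytes that are already printable map to themselves
    keep = set(range(0x21, 0x7F)) | set(range(0xA1, 0xAD)) | set(range(0xAE, 0x100))
    table, shifted = {}, 0
    for b in range(256):
        if b in keep:
            table[b] = chr(b)
        else:
            table[b] = chr(256 + shifted)  # U+0100, U+0101, ...
            shifted += 1
    return table

m = gpt2_byte_to_unicode()
# m[0x00] == 'Ā' (U+0100), m[0x20] == 'Ġ' (U+0120), m[ord('a')] == 'a'
```

Byte 0x20 (space) lands on Ġ, which is why merged tokens that begin with a space show up starting with Ġ in tokenizer.json.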
ASCII — The Foundation Layer
Before BPE can merge anything, it needs a base alphabet. That alphabet is bytes — and for English text, SVG, HTML, JSON, and code, almost every byte is ASCII. Understanding ASCII is understanding the ground floor on which all BPE merges are built.
The --ascii-only flag skips the 158 byte values that never appear in your data, giving the merge algorithm 158 extra vocab slots to discover useful patterns like <path, stroke=, and fill="#. For a 320-token vocab, that frees nearly half the budget for extra merges.
When to use each mode

Default byte-level mode:
- Multi-language text (Japanese, Arabic, etc.)
- Binary data or UTF-8-heavy corpora
- HuggingFace ecosystem compatibility

./ck-bpe-train --corpus-dir data/

ASCII-only mode:
- SVG, HTML, XML, JSON corpora
- Source code (C, Python, JavaScript)
- English-only text

./ck-bpe-train --ascii-only --corpus-dir data/
Visualizing the Merge Tree
Every token in the final vocabulary is a binary tree of merges. The leaves are base bytes. Here's the token " <path" (common in our SVG dataset):
Our Implementation: ck-bpe-train
Written in pure C with pthreads parallelism. Key optimizations:
Evolution: Hash Table → Trie → True BPE
The hash-table and trie versions both use greedy longest-match, which picks the longest vocab token at each position. True BPE applies merges in priority order, which can produce different (and correct) tokenizations:
Why Greedy Longest-Match Fails
Text: "lowest"
Greedy Longest-Match:
"low" is in vocab → [7] then "est" → [42] → [7, 42]
True BPE (merge rules applied in order):
Initial: ['l', 'o', 'w', 'e', 's', 't']
Merge 'e'+'s' (priority 12) → ['l', 'o', 'w', 'es', 't']
Merge 'es'+'t' (priority 45) → ['l', 'o', 'w', 'est']
Merge 'l'+'o' (priority 80) → ['lo', 'w', 'est']
Merge 'lo'+'w' (priority 120)→ ['low', 'est']
→ Same result HERE, but on other words the order matters!
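The trace above runs as-is when the rules are applied in training order. A Python sketch, using the four hypothetical merge rules and priorities from the example (not real trained values):

```python
# hypothetical rules from the trace, already sorted by priority
MERGES = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]

def bpe_encode(word: str, merges=MERGES) -> list[str]:
    """Apply merge rules strictly in training order, never greedily."""
    symbols = list(word)
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]  # re-check at same position
            else:
                i += 1
    return symbols

print(bpe_encode("lowest"))  # ['low', 'est']
```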
Output: tokenizer.json
The trainer emits a HuggingFace-compatible tokenizer.json:
{
"version": "1.0",
"model": {
"type": "BPE",
"vocab": {
"<|unk|>": 0, "<|bos|>": 1, "<|eos|>": 2, "<|pad|>": 3,
"Ā": 4, // byte 0x00 → U+0100
"ā": 5, // byte 0x01 → U+0101
...
"Ġ": 36, // byte 0x20 (space) → U+0120
...
"st": 261, // merged token
...
},
"merges": [
"Ġ <", // merge #1: space + '<'
"s t", // merge #2
"= \"", // merge #3
"p a", // merge #4
...
]
},
"pre_tokenizer": {
"type": "ByteLevel",
"add_prefix_space": false,
"use_regex": true
}
}
Also emits binary exports for the C runtime: vocab_offsets.bin, vocab_strings.bin, vocab_merges.bin, and tokenizer_meta.json.
Quick Reference
Train a tokenizer
./build/ck-bpe-train \
--corpus-dir docs/site/assets \
--vocab-size 320 \
--min-freq 2 \
--out tokenizer.json
Constants
SPECIAL_COUNT    4
BASE_BYTE_ID     4
BASE_BYTE_COUNT  256
Initial vocab    260 (4 specials + 256 bytes)
Supported corpus extensions
.svg .txt .xml .html .md .json
Files discovered recursively, sorted alphabetically for determinism.
Related Documentation
- SentencePiece Deep Dive — Viterbi DP, byte fallback
- Core Concepts — Tokenization fundamentals
- Training Intuition — How tokens flow into the model
- Kernel Catalog — Available operations