100% HuggingFace Parity

C-Kernel BPE Tokenizer

From raw bytes to token IDs — one beautiful merge at a time. A visual, Feynman-style walkthrough of Byte Pair Encoding.

The One-Sentence Idea

Find the most common pair of adjacent symbols, replace every occurrence with a single new symbol, and repeat. That's the entire algorithm.

1. Raw Text: "low lower newest" → UTF-8 bytes
2. Byte Symbols: l o w · l o w e r … (256 base IDs, 0x00–0xFF)
3. Merge Loop: count pairs → pick top → merge → repeat until vocab_size
4. Vocab: 320 tokens → tokenizer.json
5. IDs: [42, 7, 91 …]

Step-by-Step: BPE on Our SVG Dataset

Our BPE trainer (ck-bpe-train) feeds on .svg, .txt, .xml, .html, .md, and .json files from the corpus. Let's trace the algorithm on a tiny SVG fragment:

<svg viewBox="0 0 24 24">
  <path d="M12 2L2 22h20z"/>
</svg>

Step 1 — Split Into Bytes

Every byte of every file in the corpus becomes a base symbol. IDs 0–3 are reserved for special tokens; IDs 4–259 cover all 256 byte values.

Initial Symbol Table (260 entries: 4 special + 256 bytes)

    ID 0  <unk>    ID 1  <bos>    ID 2  <eos>    ID 3  <pad>

Byte-level tokens (IDs 4–259), where the token ID for byte b = BASE_BYTE_ID (4) + b:

    ID 4    0x00           ID 36   ' ' (space)    ID 52   '0'
    ID 69   'A'            ID 101  'a'            ID 126  'z'
    ID 64   '<'            ID 66   '>'

So byte 0x3C ('<') → ID 4 + 60 = 64, and our SVG snippet becomes [64, 119, 122, 107, …]

Step 2 — Count Every Adjacent Pair

Scan the entire corpus (multi-threaded, one shard per file). For every position i, record the pair (seq[i], seq[i+1]) in a hash table.

Pair Counting — Iteration 1

Corpus: all .svg files from docs/site/assets/ (100+ SVGs).

Sequence fragment from tokenizer-architecture.svg:

    < s v g   v i e w B o x = "
    adjacent pairs: <s  sv  vg  vi  ie  ew  Bo  ox  …

Global pair table (top pairs from the SVG corpus):

    Rank  Pair                Count   Note
    1     ' ' + '<'  (sp <)   4,217   SVGs have lots of sp+<
    2     's' + 't'  (st)     3,841   "stroke", "style", "text"
    3     '=' + '"'  (=")     3,502   attribute values everywhere
    4     'a' + 't'  (at)     2,987
Why does this matter for SVGs? SVG files repeat patterns like <path d=", stroke=", fill="# thousands of times. BPE naturally discovers these — the most frequent byte pairs in SVGs become the first merged tokens.

Step 3 — Pick the Top Pair & Merge It

The pair with the highest count wins (ties broken by smallest ID pair for determinism). It gets assigned the next available token ID, and every occurrence in the corpus is replaced in one pass.

Merge #1: (' ', '<') → new token ID 260

    BEFORE (fragment):   . . g  ' '  '<'  p a t h  ' '  '<'  t
    AFTER (replace all): . . g  " <"      p a t h  " <"      t
    ← sequence shortened: 2 tokens → 1 at each occurrence

    Symbol tree (binary merge log):  260 " <"  =  36 ' '  +  64 '<'

Merge #2: ('s', 't') → ID 261 — "stroke", "style", "text", "start" — an SVG attribute goldmine
Merge #3: ('=', '"') → ID 262 — every SVG attribute assignment: fill="...", stroke="...", d="..."
Merge #4: ('p', 'a') → ID 263 — "path", "padding", "param" — the 'pa' pair is everywhere in SVG + HTML

Step 4 — Repeat Until vocab_size

Each iteration: count → pick → merge → shrink. Pairs can now include previously merged tokens, so the algorithm builds longer and longer pieces.

How merges compose — building "path" from bytes:

    initial bytes:  p  a  t  h
    early merges:   "pa" (ID 263, merge #4)   "th" (ID 271)
    merge #39:      "pa" + "th" → "path" (ID 299)

What the algorithm discovers from SVGs:

    ID 260   " <"      space + open tag
    ID 261   "st"      stroke, style, start
    ID 262   '="'      every attribute
    ID 263   "pa"      path, padding
    ID 299   "path"    full word
    ID 315   "</svg"   common close tag

The Algorithm — Clean Pseudocode

BPE Training

symbols ← {0: <unk>, 1: <bos>, 2: <eos>, 3: <pad>}
symbols ← symbols ∪ {4..259: all 256 bytes}
merges  ← []

corpus ← load all files as byte-ID sequences

while |symbols| < vocab_size:
    pairs ← count_all_adjacent_pairs(corpus)
    (L, R) ← argmax pairs by count
                  (tiebreak: smallest ID pair)
    if pairs[(L,R)] < min_freq:
        break
    M ← |symbols|          # next available ID
    symbols[M] ← (L, R)    # new merged symbol
    merges.append((L, R, M))
    corpus ← replace_all(corpus, L, R → M)

emit tokenizer.json with vocab + merges

BPE Encoding (at inference)

function encode(text):
    # 1. Pretokenize (GPT-2 regex)
    words ← split text by regex

    for each word:
        # 2. Byte-level encode
        symbols ← map bytes to GPT-2 chars

        # 3. Apply merges in priority order
        for (L, R, M) in merge_rules:
            while (L, R) adjacent in symbols:
                replace L, R with M

        # 4. Look up final token IDs
        ids ← [vocab[s] for s in symbols]

    return ids

GPT-2 Byte-Level Encoding

Raw bytes can't go directly into tokenizer JSON (control characters break JSON/display). GPT-2 defines a bijective mapping from every byte (0x00–0xFF) to a printable Unicode codepoint:

byte_to_gpt2_piece() — the bijective mapping. Every byte gets a visible Unicode codepoint; this is how tokenizer.json stores byte-level tokens.

    Byte range                           Mapping                             Example
    0x21–0x7E (printable ASCII)          identity: byte = codepoint          'A' (0x41) → 'A' (U+0041)
    0xA1–0xFF except 0xAD (Latin-1)      identity: already valid Unicode     'é' (0xE9) → 'é' (U+00E9)
    everything else (0x00–0x20,          shifted: codepoints U+0100,         ' ' (0x20) → 'Ġ' (U+0120)
    0x7F–0xA0, and 0xAD)                 U+0101, … assigned in byte order    DEL (0x7F) → 'ġ' (U+0121)

Key Insight: space (0x20) becomes Ġ in tokenizer.json. So when you see token "Ġhello", it means " hello" (with leading space). This is why HF tokenizers show "Ġ" everywhere.

ASCII — The Foundation Layer

Before BPE can merge anything, it needs a base alphabet. That alphabet is bytes — and for English text, SVG, HTML, JSON, and code, almost every byte is ASCII. Understanding ASCII is understanding the ground floor on which all BPE merges are built.

ASCII — American Standard Code for Information Interchange (1963)
7 bits → 128 characters → the universal text foundation

What is ASCII? A 7-bit encoding (values 0–127) that maps numbers to characters. Every computer since the 1970s speaks ASCII. It defines 95 printable characters (letters, digits, symbols) plus 33 control characters (tab, newline, null, etc.).

The 128 characters fall into ranges:

    Control chars (0x00–0x1F): 32 non-printable characters
        0x00 NUL   0x09 TAB   0x0A LF   0x0D CR   0x1B ESC   …
    Printable ASCII (0x20–0x7E): 95 characters — letters, digits, symbols, space
        0x20 SP   0x30 '0'   0x41 'A'   0x61 'a'   0x21 '!'   0x3C '<'
        0x3D '='  0x22 '"'   0x2F '/'   0x3E '>'   0x7B '{'   0x7D '}'
    DEL (0x7F): 1 control char
    Extended (0x80–0xFF): 128 more byte values — NOT ASCII; used by UTF-8
        continuation bytes, the Latin-1 supplement, etc.

[Byte heat map omitted: in a typical SVG corpus the hottest bytes are space, '<', '=', 'e', 't'; common letters, digits, and '"' run hot; less common printables are warm; the 0x80–0xFF range is unused in pure-ASCII corpora.]

--ascii-only mode — built into ck-bpe-train, activated with one flag:

    Default (full byte mode): 4 special + 256 bytes = 260 base tokens.
        All 256 byte values get a slot; GPT-2 Unicode remapping for JSON.
    ASCII-only mode: 4 special + 98 base = 102 base tokens.
        \t (0x09) + \n (0x0A) + \r (0x0D) + printable 0x20–0x7E (95 chars);
        identity mapping (no GPT-2 remap) → 158 fewer wasted slots = more
        room for merges!
Why ASCII matters for BPE: In an SVG/HTML/code corpus, 99%+ of your bytes are printable ASCII. The --ascii-only flag skips the 158 byte values that never appear in your data — giving the merge algorithm 158 extra vocab slots to discover useful patterns like <path, stroke=, and fill="#. For a 320-token vocab, that raises the merge budget from 60 to 218 merges: the freed slots amount to nearly half the vocabulary.

When to use each mode

Default (256 bytes)
  • Multi-language text (Japanese, Arabic, etc.)
  • Binary data or UTF-8 heavy corpora
  • HuggingFace ecosystem compatibility
./ck-bpe-train --corpus-dir data/
ASCII-only (98 bytes)
  • SVG, HTML, XML, JSON corpora
  • Source code (C, Python, JavaScript)
  • English-only text
./ck-bpe-train --ascii-only --corpus-dir data/

Visualizing the Merge Tree

Every token in the final vocabulary is a binary tree of merges. The leaves are base bytes. Here's the token " <path" (common in our SVG dataset):

" <path" ID 312 " <" ID 260 "path" ID 299 ' ' ID 36 '<' ID 64 "pa" ID 263 "th" ID 271 'p' ID 116 'a' ID 101 't' ID 120 'h' ID 108 Final token (used at encode time) Intermediate merge Base byte (leaf) Every token is a SymbolNode binary tree — left child + right child = new token

Our Implementation: ck-bpe-train

Written in pure C with pthreads parallelism. Key optimizations:

ck_bpe_train.c — Architecture Overview

    Arena allocator: a single calloc() holds all state — sequences, symbols,
        merges, hash tables, threads. Zero fragmentation.
    Double-buffered sequences: seq_a (read) / seq_b (write), ping-ponged
        each iteration.
    Persistent thread pool: T0 T1 T2 T3, coordinated with condvar signaling.

Each iteration (repeated vocab_size − 260 times):

    COUNT (threaded, per-file) → MERGE TABLES (local → global)
    → SELECT BEST (max count, tiebreak) → REPLACE (threaded rewrite)
    → FLIP BUFFER (a ↔ b) → loop until vocab full

Key data structures: SymbolNode (merge tree), MergeRule (L, R, M, freq), PairSlot (hash-table entry), splitmix64 (hash function).

Evolution: Hash Table → Trie → True BPE

    Version  Strategy       Lookup     Throughput         Parity with HF
    v1       Hash table     O(n × k)   ~2K chars/ms       ~85%
    v2       256-way trie   O(k)       ~100K chars/ms     ~85% (50× faster)
    v3       True BPE       merge-rule based, 10–80× faster than HF
                                                          100% — exact parity ✓

The hash table and the trie both use greedy longest-match: at each position they take the longest vocab entry that matches. True BPE instead applies merges in priority order, which can produce different (and correct) tokenizations:

Why Greedy Longest-Match Fails

Text: "lowest"

Greedy Longest-Match:
  "low" is in vocab → [7]  then "est" → [42]  → [7, 42]

True BPE (merge rules applied in order):
  Initial: ['l', 'o', 'w', 'e', 's', 't']
  Merge 'e'+'s' (priority 12)  → ['l', 'o', 'w', 'es', 't']
  Merge 'es'+'t' (priority 45) → ['l', 'o', 'w', 'est']
  Merge 'l'+'o' (priority 80)  → ['lo', 'w', 'est']
  Merge 'lo'+'w' (priority 120)→ ['low', 'est']
  → Same result HERE, but on other words the order matters!

Output: tokenizer.json

The trainer emits a HuggingFace-compatible tokenizer.json:

{
  "version": "1.0",
  "model": {
    "type": "BPE",
    "vocab": {
      "<|unk|>": 0, "<|bos|>": 1, "<|eos|>": 2, "<|pad|>": 3,
      "Ā": 4,     // byte 0x00 → U+0100
      "ā": 5,     // byte 0x01 → U+0101
      ...
      "Ġ": 36,    // byte 0x20 (space) → U+0120
      ...
      "st": 261,  // merged token
      ...
    },
    "merges": [
      "Ġ <",       // merge #1: space + '<'
      "s t",        // merge #2
      "= \"",       // merge #3
      "p a",        // merge #4
      ...
    ]
  },
  "pre_tokenizer": {
    "type": "ByteLevel",
    "add_prefix_space": false,
    "use_regex": true
  }
}

Also emits binary exports for the C runtime: vocab_offsets.bin, vocab_strings.bin, vocab_merges.bin, and tokenizer_meta.json.

Quick Reference

Train a tokenizer

./build/ck-bpe-train \
  --corpus-dir docs/site/assets \
  --vocab-size 320 \
  --min-freq 2 \
  --out tokenizer.json

Constants

SPECIAL_COUNT   = 4
BASE_BYTE_ID    = 4
BASE_BYTE_COUNT = 256
Initial vocab   = 260

Supported corpus extensions

.svg .txt .xml .html .md .json

Files discovered recursively, sorted alphabetically for determinism.
