100% HuggingFace Parity

C-Kernel BPE Tokenizer

From raw bytes to token IDs — one beautiful merge at a time. A visual, Feynman-style walkthrough of Byte Pair Encoding.

The One-Sentence Idea

Find the most common pair of adjacent symbols, replace every occurrence with a single new symbol, and repeat. That's the entire algorithm.

1. Raw Text: "low lower newest" → UTF-8 bytes
2. Byte Symbols: l o w · l o w e r … (256 base IDs, 0x00–0xFF)
3. Merge Loop: count pairs → pick top → merge → repeat until vocab_size
4. Vocab: 320 tokens → tokenizer.json
5. IDs: [42, 7, 91 …]

Step-by-Step: BPE on Our SVG Dataset

Our BPE trainer (ck-bpe-train) feeds on .svg, .txt, .xml, .html, .md, and .json files from the corpus. Let's trace the algorithm on a tiny SVG fragment:

<svg viewBox="0 0 24 24">
  <path d="M12 2L2 22h20z"/>
</svg>

Step 1 — Split Into Bytes

Every byte of every file in the corpus becomes a base symbol. IDs 0–3 are reserved for special tokens; IDs 4–259 cover all 256 byte values.

Initial Symbol Table (260 entries: 4 special + 256 bytes)

    ID 0  <unk>    ID 1  <bos>    ID 2  <eos>    ID 3  <pad>

Byte-level tokens (IDs 4–259), where the token ID for byte b = BASE_BYTE_ID (4) + b:

    ID 4    0x00           ID 36   ' ' (space)    ID 52   '0'
    ID 69   'A'            ID 101  'a'            ID 126  'z'
    ID 64   '<'            ID 66   '>'

So byte 0x3C ('<') → ID 4 + 60 = 64, and our SVG snippet becomes [64, 119, 122, 107, …]

Step 2 — Count Every Adjacent Pair

Scan the entire corpus (multi-threaded, one shard per file). For every position i, record the pair (seq[i], seq[i+1]) in a hash table.

Pair Counting — Iteration 1

Corpus: all .svg files from docs/site/assets/ (100+ SVGs).

Sequence fragment from tokenizer-architecture.svg:

    < s v g   v i e w B o x = "
    adjacent pairs: <s  sv  vg  vi  ie  ew  Bo  ox  …

Global pair table (top pairs from the SVG corpus):

    Rank  Pair                Count   Note
    1     ' ' + '<'  (sp <)   4,217   SVGs have lots of sp+<
    2     's' + 't'  (st)     3,841   "stroke", "style", "text"
    3     '=' + '"'  (=")     3,502   attribute values everywhere
    4     'a' + 't'  (at)     2,987
Why does this matter for SVGs? SVG files repeat patterns like <path d=", stroke=", fill="# thousands of times. BPE naturally discovers these — the most frequent byte pairs in SVGs become the first merged tokens.

Step 3 — Pick the Top Pair & Merge It

The pair with the highest count wins (ties broken by smallest ID pair for determinism). It gets assigned the next available token ID, and every occurrence in the corpus is replaced in one pass.

Merge #1: (' ', '<') → new token ID 260

    BEFORE (fragment):   . . g  ' '  '<'  p a t h  ' '  '<'  t
    AFTER (replace all): . . g  " <"      p a t h  " <"      t
    ← sequence shortened: 2 tokens → 1 at each occurrence

    Symbol tree (binary merge log):  260 " <"  =  36 ' '  +  64 '<'

Merge #2: ('s', 't') → ID 261 — "stroke", "style", "text", "start" — an SVG attribute goldmine
Merge #3: ('=', '"') → ID 262 — every SVG attribute assignment: fill="...", stroke="...", d="..."
Merge #4: ('p', 'a') → ID 263 — "path", "padding", "param" — the 'pa' pair is everywhere in SVG + HTML

Step 4 — Repeat Until vocab_size

Each iteration: count → pick → merge → shrink. Pairs can now include previously merged tokens, so the algorithm builds longer and longer pieces.

How merges compose — building "path" from bytes:

    initial bytes:  p  a  t  h
    early merges:   "pa" (ID 263, merge #4)   "th" (ID 271)
    merge #39:      "pa" + "th" → "path" (ID 299)

What the algorithm discovers from SVGs:

    ID 260   " <"      space + open tag
    ID 261   "st"      stroke, style, start
    ID 262   '="'      every attribute
    ID 263   "pa"      path, padding
    ID 299   "path"    full word
    ID 315   "</svg"   common close tag

The Algorithm — Clean Pseudocode

BPE Training

symbols ← {0: <unk>, 1: <bos>, 2: <eos>, 3: <pad>}
symbols ← symbols ∪ {4..259: all 256 bytes}
merges  ← []

corpus ← load all files as byte-ID sequences

while |symbols| < vocab_size:
    pairs ← count_all_adjacent_pairs(corpus)
    (L, R) ← argmax pairs by count
                  (tiebreak: smallest ID pair)
    if pairs[(L,R)] < min_freq:
        break
    M ← |symbols|          # next available ID
    symbols[M] ← (L, R)    # new merged symbol
    merges.append((L, R, M))
    corpus ← replace_all(corpus, L, R → M)

emit tokenizer.json with vocab + merges

BPE Encoding (at inference)

function encode(text):
    # 1. Pretokenize (GPT-2 regex)
    words ← split text by regex

    for each word:
        # 2. Byte-level encode
        symbols ← map bytes to GPT-2 chars

        # 3. Apply merges in priority order
        for (L, R, M) in merge_rules:
            while (L, R) adjacent in symbols:
                replace L, R with M

        # 4. Look up final token IDs
        ids ← [vocab[s] for s in symbols]

    return ids

GPT-2 Byte-Level Encoding

Raw bytes can't go directly into tokenizer JSON (control characters break JSON/display). GPT-2 defines a bijective mapping from every byte (0x00–0xFF) to a printable Unicode codepoint:

byte_to_gpt2_piece() — the bijective mapping. Every byte gets a visible Unicode codepoint; this is how tokenizer.json stores byte-level tokens.

    Byte range                           Mapping                             Example
    0x21–0x7E (printable ASCII)          identity: byte = codepoint          'A' (0x41) → 'A' (U+0041)
    0xA1–0xFF except 0xAD (Latin-1)      identity: already valid Unicode     'é' (0xE9) → 'é' (U+00E9)
    everything else (0x00–0x20,          shifted: codepoints U+0100,         ' ' (0x20) → 'Ġ' (U+0120)
    0x7F–0xA0, and 0xAD)                 U+0101, … assigned in byte order    DEL (0x7F) → 'ġ' (U+0121)

Key Insight: space (0x20) becomes Ġ in tokenizer.json. So when you see token "Ġhello", it means " hello" (with leading space). This is why HF tokenizers show "Ġ" everywhere.

ASCII — The Foundation Layer

Before BPE can merge anything, it needs a base alphabet. That alphabet is bytes — and for English text, SVG, HTML, JSON, and code, almost every byte is ASCII. Understanding ASCII is understanding the ground floor on which all BPE merges are built.

ASCII — American Standard Code for Information Interchange (1963)
7 bits → 128 characters → the universal text foundation

What is ASCII? A 7-bit encoding (values 0–127) that maps numbers to characters. Every computer since the 1970s speaks ASCII. It defines 95 printable characters (letters, digits, symbols) plus 33 control characters (tab, newline, null, etc.).

The 128 characters fall into ranges:

    Control chars (0x00–0x1F): 32 non-printable characters
        0x00 NUL   0x09 TAB   0x0A LF   0x0D CR   0x1B ESC   …
    Printable ASCII (0x20–0x7E): 95 characters — letters, digits, symbols, space
        0x20 SP   0x30 '0'   0x41 'A'   0x61 'a'   0x21 '!'   0x3C '<'
        0x3D '='  0x22 '"'   0x2F '/'   0x3E '>'   0x7B '{'   0x7D '}'
    DEL (0x7F): 1 control char
    Extended (0x80–0xFF): 128 more byte values — NOT ASCII; used by UTF-8
        continuation bytes, the Latin-1 supplement, etc.

[Byte heat map omitted: in a typical SVG corpus the hottest bytes are space, '<', '=', 'e', 't'; common letters, digits, and '"' run hot; less common printables are warm; the 0x80–0xFF range is unused in pure-ASCII corpora.]

--ascii-only mode — built into ck-bpe-train, activated with one flag:

    Default (full byte mode): 4 special + 256 bytes = 260 base tokens.
        All 256 byte values get a slot; GPT-2 Unicode remapping for JSON.
    ASCII-only mode: 4 special + 98 base = 102 base tokens.
        \t (0x09) + \n (0x0A) + \r (0x0D) + printable 0x20–0x7E (95 chars);
        identity mapping (no GPT-2 remap) → 158 fewer wasted slots = more
        room for merges!
Why ASCII matters for BPE: In an SVG/HTML/code corpus, 99%+ of your bytes are printable ASCII. The --ascii-only flag skips the 158 byte values that never appear in your data — giving the merge algorithm 158 extra vocab slots to discover useful patterns like <path, stroke=, and fill="#. For a 320-token vocab, that raises the merge budget from 60 to 218 merges: the freed slots amount to nearly half the vocabulary.

When to use each mode

Default (256 bytes)
  • Multi-language text (Japanese, Arabic, etc.)
  • Binary data or UTF-8 heavy corpora
  • HuggingFace ecosystem compatibility
./ck-bpe-train --corpus-dir data/
ASCII-only (98 bytes)
  • SVG, HTML, XML, JSON corpora
  • Source code (C, Python, JavaScript)
  • English-only text
./ck-bpe-train --ascii-only --corpus-dir data/

Visualizing the Merge Tree

Every token in the final vocabulary is a binary tree of merges. The leaves are base bytes. Here's the token " <path" (common in our SVG dataset):

" <path" ID 312 " <" ID 260 "path" ID 299 ' ' ID 36 '<' ID 64 "pa" ID 263 "th" ID 271 'p' ID 116 'a' ID 101 't' ID 120 'h' ID 108 Final token (used at encode time) Intermediate merge Base byte (leaf) Every token is a SymbolNode binary tree — left child + right child = new token

Our Implementation: ck-bpe-train

Written in pure C with pthreads parallelism. Key optimizations:

ck_bpe_train.c — Architecture Overview

    Arena allocator: a single calloc() holds all state — sequences, symbols,
        merges, hash tables, threads. Zero fragmentation.
    Double-buffered sequences: seq_a (read) / seq_b (write), ping-ponged
        each iteration.
    Persistent thread pool: T0 T1 T2 T3, coordinated with condvar signaling.

Each iteration (repeated vocab_size − 260 times):

    COUNT (threaded, per-file) → MERGE TABLES (local → global)
    → SELECT BEST (max count, tiebreak) → REPLACE (threaded rewrite)
    → FLIP BUFFER (a ↔ b) → loop until vocab full

Key data structures: SymbolNode (merge tree), MergeRule (L, R, M, freq), PairSlot (hash-table entry), splitmix64 (hash function).

Evolution: Hash Table → Trie → True BPE

    Version  Strategy       Lookup     Throughput         Parity with HF
    v1       Hash table     O(n × k)   ~2K chars/ms       ~85%
    v2       256-way trie   O(k)       ~100K chars/ms     ~85% (50× faster)
    v3       True BPE       merge-rule based, 10–80× faster than HF
                                                          100% — exact parity ✓

The hash table and the trie both use greedy longest-match: at each position they take the longest vocab entry that matches. True BPE instead applies merges in priority order, which can produce different (and correct) tokenizations:

Why Greedy Longest-Match Fails

Text: "lowest"

Greedy Longest-Match:
  "low" is in vocab → [7]  then "est" → [42]  → [7, 42]

True BPE (merge rules applied in order):
  Initial: ['l', 'o', 'w', 'e', 's', 't']
  Merge 'e'+'s' (priority 12)  → ['l', 'o', 'w', 'es', 't']
  Merge 'es'+'t' (priority 45) → ['l', 'o', 'w', 'est']
  Merge 'l'+'o' (priority 80)  → ['lo', 'w', 'est']
  Merge 'lo'+'w' (priority 120)→ ['low', 'est']
  → Same result HERE, but on other words the order matters!

Output: tokenizer.json

The trainer emits a HuggingFace-compatible tokenizer.json:

{
  "version": "1.0",
  "model": {
    "type": "BPE",
    "vocab": {
      "<|unk|>": 0, "<|bos|>": 1, "<|eos|>": 2, "<|pad|>": 3,
      "Ā": 4,     // byte 0x00 → U+0100
      "ā": 5,     // byte 0x01 → U+0101
      ...
      "Ġ": 36,    // byte 0x20 (space) → U+0120
      ...
      "st": 261,  // merged token
      ...
    },
    "merges": [
      "Ġ <",       // merge #1: space + '<'
      "s t",        // merge #2
      "= \"",       // merge #3
      "p a",        // merge #4
      ...
    ]
  },
  "pre_tokenizer": {
    "type": "ByteLevel",
    "add_prefix_space": false,
    "use_regex": true
  }
}

Also emits binary exports for the C runtime: vocab_offsets.bin, vocab_strings.bin, vocab_merges.bin, and tokenizer_meta.json.

Quick Reference

Train a tokenizer

./build/ck-bpe-train \
  --corpus-dir docs/site/assets \
  --vocab-size 320 \
  --min-freq 2 \
  --out tokenizer.json

Constants

SPECIAL_COUNT   = 4
BASE_BYTE_ID    = 4
BASE_BYTE_COUNT = 256
Initial vocab   = 260

Supported corpus extensions

.svg .txt .xml .html .md .json

Files discovered recursively, sorted alphabetically for determinism.
