true_bpe.h File Reference
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>
#include "tokenizer/hash_table.h"


Data Structures

struct  CKBPEConfig
 

Macros

#define CK_TRUE_BPE_API   __attribute__((visibility("default")))
 

Enumerations

enum  CKSpacePrefixStyle {
  CK_SPACE_PREFIX_AUTO = 0 , CK_SPACE_PREFIX_GPT2 = 1 , CK_SPACE_PREFIX_SPM = 2
}
 

Functions

CK_TRUE_BPE_API CKTrueBPE * ck_true_bpe_create (void)
 

Variables

int32_t bos
const CKBPEConfig * config
int32_t eos
int32_t id
int32_t * ids
const char * left
int32_t left_id
int max_ids
int max_len
int32_t merged_id
const int32_t * merges
int num_ids
int num_merges
const int32_t * offsets
int32_t pad
int32_t priority
const char * right
int32_t right_id
float score
const char * strings
const char * text
int text_len
const char * token
int32_t unk
int vocab_size
 

Macro Definition Documentation

◆ CK_TRUE_BPE_API

#define CK_TRUE_BPE_API   __attribute__((visibility("default")))

Definition at line 37 of file true_bpe.h.

Enumeration Type Documentation

◆ CKSpacePrefixStyle

Enumerator
CK_SPACE_PREFIX_AUTO 
CK_SPACE_PREFIX_GPT2 
CK_SPACE_PREFIX_SPM 

Definition at line 45 of file true_bpe.h.

enum CKSpacePrefixStyle {
    CK_SPACE_PREFIX_AUTO = 0, /* Auto-detect from vocabulary */
    CK_SPACE_PREFIX_GPT2 = 1, /* GPT-2 style: Ġ (U+0120, bytes 0xC4 0xA0) */
    CK_SPACE_PREFIX_SPM = 2   /* SentencePiece style: ▁ (U+2581, bytes 0xE2 0x96 0x81) */
}

Function Documentation

◆ ck_true_bpe_create()

CK_TRUE_BPE_API CKTrueBPE * ck_true_bpe_create (void)

Create a new True BPE tokenizer.

Returns
Newly allocated tokenizer, or NULL on error

Free a True BPE tokenizer.

Parameters
bpe    Tokenizer to free

Add a token to the vocabulary.

Parameters
bpe      Tokenizer
token    Token string (UTF-8)
id       Token ID
score    Token score (for unigram models, 0.0 for BPE)
Returns
0 on success, -1 on error

Add a BPE merge rule by token IDs.

Merge rules define how tokens are combined during encoding. Rules with lower priority numbers are applied first.

Parameters
bpe          Tokenizer
left_id      Left token ID
right_id     Right token ID
merged_id    Resulting merged token ID
priority     Merge priority (lower = applied first)
Returns
0 on success, -1 on error
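
The rule fields above can be pictured as rows in a small table. The following is a minimal sketch, not the library's implementation: `MergeRule`, `MergeTable`, `table_add`, `table_lookup`, and `MAX_RULES` are hypothetical names, the lookup is a linear scan rather than the hash table the header includes, and the error conditions (table full, ID outside the vocabulary) are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RULES 64  /* hypothetical fixed capacity for the sketch */

/* Hypothetical rule record; fields mirror the documented parameters. */
typedef struct {
    int32_t left_id;
    int32_t right_id;
    int32_t merged_id;
    int32_t priority;  /* lower = applied first */
} MergeRule;

typedef struct {
    MergeRule rules[MAX_RULES];
    int count;
} MergeTable;

/* Add a rule. Returns 0 on success, -1 if the table is full or any ID
 * falls outside [0, vocab_size). These checks are assumptions. */
static int table_add(MergeTable *t, int vocab_size, int32_t left_id,
                     int32_t right_id, int32_t merged_id, int32_t priority)
{
    assert(t != NULL);
    if (t->count >= MAX_RULES)
        return -1;
    if (left_id < 0 || left_id >= vocab_size ||
        right_id < 0 || right_id >= vocab_size ||
        merged_id < 0 || merged_id >= vocab_size)
        return -1;
    t->rules[t->count++] = (MergeRule){left_id, right_id, merged_id, priority};
    return 0;
}

/* Find the rule for the pair (left_id, right_id). If several rules match,
 * the one with the lowest priority number wins. NULL if none matches. */
static const MergeRule *table_lookup(const MergeTable *t,
                                     int32_t left_id, int32_t right_id)
{
    const MergeRule *best = NULL;
    assert(t != NULL);
    for (int i = 0; i < t->count; i++) {
        const MergeRule *r = &t->rules[i];
        if (r->left_id == left_id && r->right_id == right_id &&
            (best == NULL || r->priority < best->priority))
            best = r;
    }
    return best;
}
```

A real tokenizer would normally key the pair into a hash table so that each lookup is constant time; the linear scan keeps the sketch short.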

Add a BPE merge rule by token strings.

This looks up the token IDs automatically and determines the merged token. The merged token must already exist in the vocabulary.

Parameters
bpe         Tokenizer
left        Left token string
right       Right token string
priority    Merge priority (lower = applied first)
Returns
0 on success, -1 on error
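
A string-based rule has to be turned into three IDs before it can be stored. The sketch below shows one way that resolution could work, assuming the merged token is the byte-wise concatenation of `left` and `right` (the usual BPE convention; the text above only says the merged token must already exist). `vocab_find` and `resolve_merge` are hypothetical helpers and the vocabulary is a plain array of strings.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical linear vocabulary lookup; returns -1 when the token is absent. */
static int32_t vocab_find(const char *const *vocab, int vocab_size,
                          const char *token)
{
    for (int i = 0; i < vocab_size; i++)
        if (strcmp(vocab[i], token) == 0)
            return (int32_t)i;
    return -1;
}

/* Resolve (left, right) to IDs, assuming merged = left + right.
 * Returns 0 on success, -1 if any of the three tokens is missing or the
 * concatenation does not fit the scratch buffer. */
static int resolve_merge(const char *const *vocab, int vocab_size,
                         const char *left, const char *right,
                         int32_t *left_id, int32_t *right_id,
                         int32_t *merged_id)
{
    char merged[256];
    int n;

    assert(left && right && left_id && right_id && merged_id);
    n = snprintf(merged, sizeof merged, "%s%s", left, right);
    if (n < 0 || (size_t)n >= sizeof merged)
        return -1;

    *left_id = vocab_find(vocab, vocab_size, left);
    *right_id = vocab_find(vocab, vocab_size, right);
    *merged_id = vocab_find(vocab, vocab_size, merged);
    if (*left_id < 0 || *right_id < 0 || *merged_id < 0)
        return -1;
    return 0;
}
```

With a vocabulary of `"h"`, `"e"`, `"he"`, resolving `("h", "e")` succeeds, while `("e", "h")` fails because `"eh"` is not in the vocabulary.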

Set special token IDs.

Parameters
bpe    Tokenizer
unk    Unknown token ID (-1 to disable)
bos    Beginning-of-sequence token ID (-1 to disable)
eos    End-of-sequence token ID (-1 to disable)
pad    Padding token ID (-1 to disable)

Add a special token that should be matched BEFORE BPE encoding.

Special tokens like <|im_start|>, <|im_end|>, <|endoftext|> are matched literally in the input text before BPE processing. Without this, BPE would break them into individual characters.

Parameters
bpe      Tokenizer
token    Token string to match literally (e.g., "<|im_end|>")
id       Token ID to output when matched
Returns
0 on success, -1 on error
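
To illustrate why special tokens are matched before BPE runs, here is a minimal sketch of the pre-pass. `SpecialToken`, `find_special`, and `encode_with_specials` are hypothetical names. As a stand-in for the real BPE step, ordinary text is emitted one byte per ID; only the literal matching of special tokens is the point of the example. Special token strings are assumed to be non-empty.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical special-token record. */
typedef struct {
    const char *token;  /* literal string, e.g. "<|im_end|>" */
    int32_t id;         /* ID to output when matched */
} SpecialToken;

/* Find the earliest special token in text. On a match, store its byte
 * offset in *pos and its table index in *index, and return 1. */
static int find_special(const SpecialToken *specials, int num_specials,
                        const char *text, size_t *pos, int *index)
{
    int found = 0;
    for (int i = 0; i < num_specials; i++) {
        const char *hit = strstr(text, specials[i].token);
        if (hit != NULL && (!found || (size_t)(hit - text) < *pos)) {
            *pos = (size_t)(hit - text);
            *index = i;
            found = 1;
        }
    }
    return found;
}

/* Split text around special tokens. Ordinary bytes are written as their
 * byte value (placeholder for BPE); special tokens are written as one ID.
 * Returns the number of IDs written, at most max_ids. */
static int encode_with_specials(const SpecialToken *specials, int num_specials,
                                const char *text, int32_t *ids, int max_ids)
{
    int n = 0;
    assert(text != NULL && ids != NULL);
    while (*text != '\0' && n < max_ids) {
        size_t pos = 0;
        int index = -1;
        size_t plain_len = find_special(specials, num_specials, text, &pos, &index)
                               ? pos
                               : strlen(text);
        /* Ordinary text before the next special token: BPE would run here. */
        for (size_t k = 0; k < plain_len && n < max_ids; k++)
            ids[n++] = (unsigned char)text[k];
        text += plain_len;
        if (index >= 0 && n < max_ids) {
            ids[n++] = specials[index].id;
            text += strlen(specials[index].token);
        }
    }
    return n;
}
```

With `<|im_end|>` registered under a hypothetical ID of 1000, the input `a<|im_end|>b` yields three IDs: one for `a`, 1000, and one for `b`. Without the pre-pass, the ten bytes of the marker would be tokenized like any other text.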

Set tokenizer configuration.

Parameters
bpe       Tokenizer
config    Configuration to apply

Load vocabulary + merges from binary buffers.

Parameters
bpe           Tokenizer
vocab_size    Number of tokens
offsets       Offsets array (length vocab_size)
strings       Null-terminated token strings blob
num_merges    Number of merge rules
merges        Merge triples [left_id, right_id, merged_id] (length num_merges*3)
Returns
0 on success, -1 on error
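
The buffer layout described above can be read with two small accessors. This is a sketch under one stated assumption: `offsets[i]` is the byte offset of token `i` inside the `strings` blob. The helper names `blob_token`, `blob_merge`, and `MergeTriple` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return token string `id`, assuming offsets[id] is a byte offset into the
 * null-terminated strings blob. NULL if id is out of range. */
static const char *blob_token(const int32_t *offsets, const char *strings,
                              int vocab_size, int32_t id)
{
    assert(offsets != NULL && strings != NULL);
    if (id < 0 || id >= vocab_size)
        return NULL;
    return strings + offsets[id];
}

/* One merge rule unpacked from the flat triples array. */
typedef struct {
    int32_t left_id;
    int32_t right_id;
    int32_t merged_id;
} MergeTriple;

/* Unpack merge rule `index` from merges[3*index .. 3*index+2].
 * Returns 0 on success, -1 if index is out of range. */
static int blob_merge(const int32_t *merges, int num_merges, int index,
                      MergeTriple *out)
{
    assert(merges != NULL && out != NULL);
    if (index < 0 || index >= num_merges)
        return -1;
    out->left_id = merges[3 * index];
    out->right_id = merges[3 * index + 1];
    out->merged_id = merges[3 * index + 2];
    return 0;
}
```

For a three-token vocabulary `h`, `e`, `he`, the blob would be the bytes `h\0e\0he\0` with offsets `{0, 2, 4}`, and a single merge rule would be stored as `{0, 1, 2}`.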

Look up a token ID by string.

Parameters
bpe      Tokenizer
token    Token string
Returns
Token ID, or unk_id if not found

Get a token string by ID.

Parameters
bpe    Tokenizer
id     Token ID
Returns
Token string, or NULL if invalid

Get vocabulary size.

Parameters
bpe    Tokenizer
Returns
Number of tokens in vocabulary

Get number of merge rules.

Parameters
bpe    Tokenizer
Returns
Number of merge rules

Auto-detect space prefix style from vocabulary.

Counts tokens starting with Ġ (GPT-2) vs ▁ (SentencePiece) to determine style. The detected style is cached in the config.

Parameters
bpe    Tokenizer
Returns
Detected style (GPT2 or SPM)
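
The counting step can be sketched as follows. The enum is repeated from this file's listing so the block compiles on its own (the `typedef` form is an assumption); `detect_style` is a hypothetical helper, the vocabulary is a plain array of strings, and resolving a tie in favor of GPT-2 is an assumption, since the text above does not say how ties are handled.

```c
#include <assert.h>
#include <string.h>

/* Repeated from the enumeration documented above. */
typedef enum {
    CK_SPACE_PREFIX_AUTO = 0,
    CK_SPACE_PREFIX_GPT2 = 1,
    CK_SPACE_PREFIX_SPM = 2
} CKSpacePrefixStyle;

/* Count tokens that start with each space-prefix marker and return the
 * more frequent style. Ties go to GPT-2 (assumption). */
static CKSpacePrefixStyle detect_style(const char *const *vocab, int vocab_size)
{
    static const char gpt2_prefix[] = "\xC4\xA0";     /* U+0120 in UTF-8 */
    static const char spm_prefix[] = "\xE2\x96\x81";  /* U+2581 in UTF-8 */
    int n_gpt2 = 0;
    int n_spm = 0;

    assert(vocab != NULL || vocab_size == 0);
    for (int i = 0; i < vocab_size; i++) {
        if (strncmp(vocab[i], gpt2_prefix, 2) == 0)
            n_gpt2++;
        else if (strncmp(vocab[i], spm_prefix, 3) == 0)
            n_spm++;
    }
    return n_spm > n_gpt2 ? CK_SPACE_PREFIX_SPM : CK_SPACE_PREFIX_GPT2;
}
```

The comparison works on raw bytes, so no UTF-8 decoding is needed: the two markers have different lead bytes (0xC4 and 0xE2) and cannot be confused with each other.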

Encode text to token IDs using true BPE algorithm.

This applies merge rules in priority order (not greedy longest-match).

Parameters
bpe         Tokenizer
text        Input text (UTF-8)
text_len    Text length in bytes, or -1 for null-terminated
ids         Output token IDs array
max_ids     Maximum IDs to write
Returns
Number of tokens written
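
The difference between priority-ordered merging and greedy longest-match is easiest to see on a small case. The sketch below is a quadratic reference loop, not the library's encoder: `MergeRule` and `apply_merges` are hypothetical, and the input is assumed to be already split into initial token IDs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical rule record; fields mirror the documented merge parameters. */
typedef struct {
    int32_t left_id;
    int32_t right_id;
    int32_t merged_id;
    int32_t priority;  /* lower = applied first */
} MergeRule;

/* Repeatedly merge the adjacent pair whose rule has the lowest priority
 * number (leftmost pair on ties) until no rule applies. Merges in place
 * and returns the new number of IDs. */
static int apply_merges(const MergeRule *rules, size_t num_rules,
                        int32_t *ids, int num_ids)
{
    assert(ids != NULL || num_ids == 0);
    for (;;) {
        const MergeRule *best = NULL;
        int best_pos = -1;

        for (int i = 0; i + 1 < num_ids; i++) {
            for (size_t r = 0; r < num_rules; r++) {
                if (rules[r].left_id == ids[i] &&
                    rules[r].right_id == ids[i + 1] &&
                    (best == NULL || rules[r].priority < best->priority)) {
                    best = &rules[r];
                    best_pos = i;
                }
            }
        }
        if (best == NULL)
            return num_ids;

        /* Replace the pair with the merged ID and close the gap. */
        ids[best_pos] = best->merged_id;
        memmove(&ids[best_pos + 1], &ids[best_pos + 2],
                (size_t)(num_ids - best_pos - 2) * sizeof ids[0]);
        num_ids--;
    }
}
```

Take IDs a=0, b=1, c=2, ab=3, bc=4, abc=5 with rules (b, c) → bc at priority 0, (a, b) → ab at priority 1, and (ab, c) → abc at priority 2. On the input a b c, the loop merges b and c first, leaving a bc, and then stops because no rule covers that pair. A greedy longest-match tokenizer would instead pick abc directly.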

Decode token IDs to text.

Parameters
bpe        Tokenizer
ids        Input token IDs
num_ids    Number of IDs
text       Output text buffer
max_len    Maximum text length
Returns
Number of bytes written (excluding null terminator)
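
Decoding is essentially bounded concatenation. The sketch below makes several assumptions that the text above does not state: `max_len` counts the null terminator, invalid IDs are skipped, output stops at a token boundary when the next token would not fit, and no space-prefix marker is translated back to a space. `decode_ids` is a hypothetical helper over a plain string array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Concatenate the strings for ids[0..num_ids) into text.
 * Returns the number of bytes written, excluding the null terminator. */
static int decode_ids(const char *const *vocab, int vocab_size,
                      const int32_t *ids, int num_ids,
                      char *text, int max_len)
{
    int written = 0;

    assert(text != NULL && max_len > 0);
    for (int i = 0; i < num_ids; i++) {
        size_t len;
        if (ids[i] < 0 || ids[i] >= vocab_size)
            continue;  /* skip invalid IDs (assumption) */
        len = strlen(vocab[ids[i]]);
        if (len > (size_t)(max_len - 1 - written))
            break;     /* next token would overflow: stop at a token boundary */
        memcpy(text + written, vocab[ids[i]], len);
        written += (int)len;
    }
    text[written] = '\0';
    return written;
}
```

Stopping at a token boundary avoids cutting a multi-byte UTF-8 sequence in half when the buffer is too small.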

Variable Documentation

◆ bos

int32_t bos

Definition at line 145 of file true_bpe.h.

◆ config

const CKBPEConfig* config

Definition at line 171 of file true_bpe.h.

Referenced by ck_tokenizer_encode() and ck_true_bpe_set_config().

◆ eos

int32_t eos

Definition at line 146 of file true_bpe.h.

◆ id

int32_t id

Definition at line 95 of file true_bpe.h.

◆ ids

const int32_t* ids

Definition at line 263 of file true_bpe.h.

◆ left

◆ left_id

int32_t left_id

Definition at line 112 of file true_bpe.h.

Referenced by ck_true_bpe_add_merge(), find_best_merge(), merge_key(), and merge_table_lookup().

◆ max_ids

◆ max_len

◆ merged_id

int32_t merged_id

◆ merges

const int32_t* merges

◆ num_ids

int num_ids

Definition at line 278 of file true_bpe.h.

◆ num_merges

int num_merges

◆ offsets

◆ pad

int32_t pad

Definition at line 147 of file true_bpe.h.

◆ priority

int32_t priority

◆ right

◆ right_id

int32_t right_id

Definition at line 113 of file true_bpe.h.

Referenced by ck_true_bpe_add_merge(), find_best_merge(), merge_key(), and merge_table_lookup().

◆ score

float score

Definition at line 96 of file true_bpe.h.

◆ strings

const char* strings

◆ text

char* text

Definition at line 261 of file true_bpe.h.

◆ text_len

◆ token

const char* token

Definition at line 94 of file true_bpe.h.

◆ unk

int32_t unk

Definition at line 144 of file true_bpe.h.

◆ vocab_size