Data Structures | |
| struct | CKBPEConfig |
Macros | |
| #define | CK_TRUE_BPE_API __attribute__((visibility("default"))) |
Enumerations | |
| enum | CKSpacePrefixStyle { CK_SPACE_PREFIX_AUTO = 0 , CK_SPACE_PREFIX_GPT2 = 1 , CK_SPACE_PREFIX_SPM = 2 } |
Functions | |
| __attribute__ ((visibility("default"))) CKTrueBPE *ck_true_bpe_create(void) | |
Variables | |
| int32_t | bos |
| const CKBPEConfig * | config |
| int32_t | eos |
| int32_t | id |
| int32_t * | ids |
| const char * | left |
| int32_t | left_id |
| int | max_ids |
| int | max_len |
| int32_t | merged_id |
| const int32_t * | merges |
| int | num_ids |
| int | num_merges |
| const int32_t * | offsets |
| int32_t | pad |
| int32_t | priority |
| const char * | right |
| int32_t | right_id |
| float | score |
| const char * | strings |
| const char * | text |
| int | text_len |
| const char * | token |
| int32_t | unk |
| int | vocab_size |
| #define CK_TRUE_BPE_API __attribute__((visibility("default"))) |
Definition at line 37 of file true_bpe.h.
| enum CKSpacePrefixStyle |
| Enumerator | |
|---|---|
| CK_SPACE_PREFIX_AUTO | |
| CK_SPACE_PREFIX_GPT2 | |
| CK_SPACE_PREFIX_SPM | |
Definition at line 45 of file true_bpe.h.
| CKTrueBPE * | ck_true_bpe_create | ( | void | ) |
Create a new True BPE tokenizer.
Free a True BPE tokenizer.
| bpe | Tokenizer to free |
Add a token to the vocabulary.
| bpe | Tokenizer |
| token | Token string (UTF-8) |
| id | Token ID |
| score | Token score (for unigram models, 0.0 for BPE) |
Add a BPE merge rule by token IDs.
Merge rules define how tokens are combined during encoding. Rules with lower priority numbers are applied first.
| bpe | Tokenizer |
| left_id | Left token ID |
| right_id | Right token ID |
| merged_id | Resulting merged token ID |
| priority | Merge priority (lower = applied first) |
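As a hedged sketch of the selection rule only (in the spirit of the `find_best_merge()` helper referenced below, but with an assumed signature and body, not the library's actual code): among all adjacent pairs in the ID sequence that have a registered merge, the rule with the lowest priority number wins.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical merge-rule record; field names mirror the documented
 * parameters but the struct itself is an illustration. */
typedef struct { int32_t left_id, right_id, merged_id, priority; } MergeRule;

/* Return the index of the applicable rule with the lowest priority
 * number across all adjacent pairs in ids[0..n), or -1 if none apply. */
static int find_best_merge(const int32_t *ids, int n,
                           const MergeRule *rules, int num_rules) {
    int best = -1;
    for (int i = 0; i + 1 < n; i++)
        for (int r = 0; r < num_rules; r++)
            if (rules[r].left_id == ids[i] && rules[r].right_id == ids[i + 1] &&
                (best < 0 || rules[r].priority < rules[best].priority))
                best = r;
    return best;
}
```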
Add a BPE merge rule by token strings.
This looks up the token IDs automatically and determines the merged token. The merged token must already exist in the vocabulary.
| bpe | Tokenizer |
| left | Left token string |
| right | Right token string |
| priority | Merge priority (lower = applied first) |
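The string-based lookup can be sketched as follows. This is an illustration under the assumption that the merged token is the plain concatenation of the two halves; the helper names and the array-indexed vocabulary are hypothetical, not the library's internals.

```c
#include <assert.h>
#include <string.h>

/* Linear vocabulary lookup: index doubles as the token ID here. */
static int find_token(const char **vocab, int n, const char *s) {
    for (int i = 0; i < n; i++)
        if (strcmp(vocab[i], s) == 0) return i;
    return -1;
}

/* Concatenate left + right and require that the result already exists
 * in the vocabulary, as the documentation above states. */
static int find_merged(const char **vocab, int n,
                       const char *left, const char *right) {
    char buf[256];
    if (strlen(left) + strlen(right) + 1 > sizeof buf) return -1;
    strcpy(buf, left);
    strcat(buf, right);
    return find_token(vocab, n, buf);
}
```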
Set special token IDs.
| bpe | Tokenizer |
| unk | Unknown token ID (-1 to disable) |
| bos | Beginning-of-sequence token ID (-1 to disable) |
| eos | End-of-sequence token ID (-1 to disable) |
| pad | Padding token ID (-1 to disable) |
Add a special token that should be matched BEFORE BPE encoding.
Special tokens like <|im_start|>, <|im_end|>, <|endoftext|> are matched literally in the input text before BPE processing. Without this, BPE would break them into individual characters.
| bpe | Tokenizer |
| token | Token string to match literally (e.g., "<|im_end|>") |
| id | Token ID to output when matched |
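The literal pre-match step can be sketched like this (an assumed implementation, not the library's code): at each input position, try every registered special token with a prefix comparison before handing the text to BPE.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical registry entry pairing a literal string with its ID. */
typedef struct { const char *token; int32_t id; } Special;

/* Return the index of the special token matching literally at
 * text[pos], or -1 if none matches there. */
static int match_special(const char *text, size_t pos,
                         const Special *specials, int num_specials) {
    for (int i = 0; i < num_specials; i++) {
        size_t len = strlen(specials[i].token);
        if (strncmp(text + pos, specials[i].token, len) == 0)
            return i;
    }
    return -1;
}
```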
Set tokenizer configuration.
| bpe | Tokenizer |
| config | Configuration to apply |
Load vocabulary + merges from binary buffers.
| bpe | Tokenizer |
| vocab_size | Number of tokens |
| offsets | Offsets array (length vocab_size) |
| strings | Null-terminated token strings blob |
| num_merges | Number of merge rules |
| merges | Merge triples [left_id, right_id, merged_id] (length num_merges*3) |
Look up a token ID by string.
| bpe | Tokenizer |
| token | Token string |
Get a token string by ID.
| bpe | Tokenizer |
| id | Token ID |
Get vocabulary size.
| bpe | Tokenizer |
Get number of merge rules.
| bpe | Tokenizer |
Auto-detect space prefix style from vocabulary.
Counts tokens starting with Ġ (GPT-2) vs ▁ (SentencePiece) to determine style. The detected style is cached in the config.
| bpe | Tokenizer |
Encode text to token IDs using true BPE algorithm.
This applies merge rules in priority order (not greedy longest-match).
| bpe | Tokenizer |
| text | Input text (UTF-8) |
| text_len | Text length in bytes, or -1 for null-terminated |
| ids | Output token IDs array |
| max_ids | Maximum IDs to write |
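An end-to-end sketch of the true-BPE encode loop, under stated assumptions (byte b seeds ID b, and a naive O(n·rules) scan; the real `ck_true_bpe_encode` is not shown here): seed one ID per input byte, then repeatedly apply the applicable merge with the lowest priority number until none remains.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical rule record for this sketch. */
typedef struct { int32_t left, right, merged, priority; } Rule;

static int bpe_encode(const unsigned char *text, int text_len,
                      const Rule *rules, int num_rules,
                      int32_t *ids, int max_ids) {
    int n = text_len < max_ids ? text_len : max_ids;
    for (int i = 0; i < n; i++) ids[i] = text[i];   /* byte-level seed */
    for (;;) {
        int best = -1, pos = -1;
        for (int i = 0; i + 1 < n; i++)
            for (int r = 0; r < num_rules; r++)
                if (rules[r].left == ids[i] && rules[r].right == ids[i + 1] &&
                    (best < 0 || rules[r].priority < rules[best].priority)) {
                    best = r; pos = i;
                }
        if (best < 0) break;                    /* no applicable rule left */
        ids[pos] = rules[best].merged;          /* replace the pair */
        for (int j = pos + 1; j + 1 < n; j++) ids[j] = ids[j + 1];
        n--;
    }
    return n;  /* number of IDs written */
}
```

Note that this applies rules strictly by priority, matching the documented behavior, rather than greedily taking the longest vocabulary match at each position.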
Decode token IDs to text.
| bpe | Tokenizer |
| ids | Input token IDs |
| num_ids | Number of IDs |
| text | Output text buffer |
| max_len | Maximum text length |
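The decode direction can be sketched as simple concatenation of token strings, here with the added assumption of GPT-2 style space handling (Ġ, UTF-8 `C4 A0`, turned back into a plain space). The function and the array-based vocabulary are illustrative, not the library's `ck_true_bpe_decode`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Concatenate each token string into text[], mapping the GPT-2 space
 * marker back to ' '. Returns the number of bytes written. */
static int bpe_decode(const int32_t *ids, int num_ids,
                      const char **vocab, int vocab_size,
                      char *text, int max_len) {
    int out = 0;
    for (int i = 0; i < num_ids; i++) {
        if (ids[i] < 0 || ids[i] >= vocab_size) continue;  /* skip bad IDs */
        const char *tok = vocab[ids[i]];
        for (size_t j = 0; tok[j] && out + 1 < max_len; ) {
            if (tok[j] == '\xC4' && tok[j + 1] == '\xA0') {
                text[out++] = ' ';   /* Ġ marks a leading space */
                j += 2;
            } else {
                text[out++] = tok[j++];
            }
        }
    }
    text[out] = '\0';
    return out;
}
```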
| int32_t bos |
Definition at line 145 of file true_bpe.h.
| const CKBPEConfig* config |
Definition at line 171 of file true_bpe.h.
Referenced by ck_tokenizer_encode(), and ck_true_bpe_set_config().
| int32_t eos |
Definition at line 146 of file true_bpe.h.
| int32_t id |
Definition at line 95 of file true_bpe.h.
| int32_t* ids |
Definition at line 263 of file true_bpe.h.
| const char* left |
Definition at line 130 of file true_bpe.h.
Referenced by ck_tokenizer_add_merge(), ck_tokenizer_encode_spm_llama_impl(), ck_tokenizer_load_binary(), ck_tokenizer_lookup_merge(), ck_true_bpe_add_merge_by_tokens(), ck_true_bpe_load_binary(), and hash_pair().
| int32_t left_id |
Definition at line 112 of file true_bpe.h.
Referenced by ck_true_bpe_add_merge(), find_best_merge(), merge_key(), and merge_table_lookup().
| int max_ids |
Definition at line 264 of file true_bpe.h.
Referenced by ck_tokenizer_encode(), ck_tokenizer_encode_spm_impl(), ck_tokenizer_encode_spm_llama_impl(), ck_true_bpe_encode(), encode_chunk(), encode_text_segment(), main(), spm_encode_byte_fallback(), and spm_llama_resegment_node().
| int max_len |
Definition at line 280 of file true_bpe.h.
Referenced by ck_tokenizer_decode(), ck_true_bpe_decode(), find_longest_match_hash(), json_parse_string(), spm_count_unknown_run(), and spm_find_candidates_at_pos().
| int32_t merged_id |
Definition at line 114 of file true_bpe.h.
Referenced by ck_tokenizer_load(), ck_true_bpe_add_merge(), ck_true_bpe_add_merge_by_tokens(), and token_list_merge_at().
| const int32_t* merges |
Definition at line 189 of file true_bpe.h.
Referenced by ck_tokenizer_load_binary(), ck_tokenizer_load_binary_with_scores(), ck_true_bpe_load_binary(), and main().
| int num_ids |
Definition at line 278 of file true_bpe.h.
| int num_merges |
Definition at line 188 of file true_bpe.h.
Referenced by ck_tokenizer_load_binary(), ck_tokenizer_load_binary_with_scores(), ck_true_bpe_load_binary(), main(), and run_inference().
| const int32_t* offsets |
Definition at line 186 of file true_bpe.h.
Referenced by ck_tokenizer_load_binary(), ck_tokenizer_load_binary_with_scores(), ck_true_bpe_load_binary(), main(), and spm_build_byte_lookup().
| int32_t pad |
Definition at line 147 of file true_bpe.h.
| int32_t priority |
Definition at line 115 of file true_bpe.h.
Referenced by ck_tokenizer_add_merge(), ck_trie_insert(), ck_true_bpe_add_merge(), and ck_true_bpe_add_merge_by_tokens().
| const char* right |
Definition at line 131 of file true_bpe.h.
Referenced by ck_tokenizer_add_merge(), ck_tokenizer_encode_spm_llama_impl(), ck_tokenizer_load_binary(), ck_tokenizer_lookup_merge(), ck_true_bpe_add_merge_by_tokens(), ck_true_bpe_load_binary(), and hash_pair().
| int32_t right_id |
Definition at line 113 of file true_bpe.h.
Referenced by ck_true_bpe_add_merge(), find_best_merge(), merge_key(), and merge_table_lookup().
| float score |
Definition at line 96 of file true_bpe.h.
| const char* strings |
Definition at line 187 of file true_bpe.h.
Referenced by ck_tokenizer_load_binary(), ck_tokenizer_load_binary_with_scores(), ck_true_bpe_load_binary(), main(), and spm_build_byte_lookup().
| char* text |
Definition at line 261 of file true_bpe.h.
| int text_len |
Definition at line 262 of file true_bpe.h.
Referenced by ck_tokenizer_encode(), ck_tokenizer_encode_spm_impl(), ck_tokenizer_encode_spm_llama_impl(), ck_tokenizer_lookup_exact_n(), ck_trie_find_longest(), ck_trie_has_prefix(), ck_true_bpe_encode(), encode_text_segment(), find_longest_match(), find_longest_match_hash(), find_longest_match_trie(), gpt2_pretokenize(), init_tokens_from_text(), match_special_token(), preprocess_bpe_spaces(), preprocess_spm_llama_text(), preprocess_spm_text(), preprocess_text(), spm_count_unknown_run(), spm_encode_byte_fallback(), and spm_find_candidates_at_pos().
| const char* token |
Definition at line 94 of file true_bpe.h.
| int32_t unk |
Definition at line 144 of file true_bpe.h.
| int vocab_size |
Definition at line 185 of file true_bpe.h.
Referenced by ck_tokenizer_load_binary(), ck_tokenizer_load_binary_with_scores(), ck_true_bpe_load_binary(), embedding_backward(), embedding_backward_bf16(), embedding_forward(), embedding_forward_bf16(), embedding_forward_q4_k(), embedding_forward_q6_k(), embedding_forward_q8_0(), logits_copy_to_position(), main(), run_inference(), run_prompt(), sample_argmax(), sample_token(), sample_top_p(), sample_topk(), simple_embedding(), softmax_cross_entropy_loss(), softmax_cross_entropy_loss_bf16(), and spm_build_byte_lookup().