Hello. It is common to give the LLM a system prompt that specifies an output format, for example:
You are summarizing the user message and writing this summarization between the tags [SUM_BEGIN] and [SUM_END].
I parse the output to extract the model's summarization. Since I receive the tokens one by one, e.g. (token1: `[SU`, token2: `M_`, ..., tokenN: `D]`), I have to accumulate them and parse the resulting message all at once.
However, is it possible to add custom tokens such as "[SUM_BEGIN]" and "[SUM_END]" to the model's vocabulary, so that each arrives as a single token, the way <|im_start|> and <|im_end|> do? I tried editing some key/value pairs in the GGUF file to achieve this, but it didn't work.
I appended my tokens to `tokenizer.ggml.tokens` and their corresponding token types to `tokenizer.ggml.token_type`.
If this is possible, it would make things a lot easier for me. All I want is to add custom tokens to the model's vocabulary so that I can make calls like:
if (llama_token_get_attr(mModel, generatedToken) & LLAMA_TOKEN_ATTR_CONTROL) { // Do stuff }
or
if (llama_token_get_attr(mModel, generatedToken) & LLAMA_TOKEN_ATTR_USER_DEFINED) { // Do stuff }
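Until the markers exist as single vocabulary tokens, they can still be detected reliably in the text stream with a small buffer, since a marker may be split across several generated tokens. Below is a minimal sketch of that approach; `SummaryExtractor` is a hypothetical helper name (not part of llama.cpp), and the idea is simply to accumulate detokenized pieces and search for both markers in the buffer:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a llama.cpp API): accumulates streamed token
// text and extracts whatever appears between [SUM_BEGIN] and [SUM_END],
// even when a marker is split across several tokens.
class SummaryExtractor {
public:
    // Feed the detokenized text of one generated token.
    void feed(const std::string &piece) { mBuffer += piece; }

    // Returns true and fills `out` once both markers are present.
    bool done(std::string &out) const {
        const std::string begin = "[SUM_BEGIN]";
        const std::string end   = "[SUM_END]";
        size_t b = mBuffer.find(begin);
        if (b == std::string::npos) return false;
        size_t e = mBuffer.find(end, b + begin.size());
        if (e == std::string::npos) return false;
        out = mBuffer.substr(b + begin.size(), e - (b + begin.size()));
        return true;
    }

private:
    std::string mBuffer;  // everything generated so far
};
```

This avoids depending on token boundaries at all; the trade-off versus real vocabulary tokens is that the summary can only be emitted once the closing marker has fully arrived, rather than being recognized from a single control-token ID.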