
[BUG]: Tokenization in 0.14.0 adds spaces #856

Open
newsletternewsletter opened this issue Jul 18, 2024 · 6 comments
Labels
bug (Something isn't working) · Upstream (Tracking an issue in llama.cpp)

Comments

@newsletternewsletter

Description

When tokenizing a text and then decoding the resulting tokens, one can see that tokenization now (as of version 0.14.0) adds one additional leading space to the text for every call of Context.Tokenize(text, addBos, special). This is especially bad if a text is tokenized with more than one call.
Version 0.13.0 did not exhibit this behavior; at least, it did not add spaces at the start of words, changing their token ids.
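
For illustration, a minimal repro sketch (the model path is a placeholder; the using directives and the LoadFromFile/CreateContext calls are the usual LLamaSharp surface and are assumed here, not quoted from this issue):

using LLama;
using LLama.Common;

// Placeholder model path, for illustration only.
var modelParams = new ModelParams("gemma-1.1-2b-it.Q6_K.gguf");
using var weights = LLamaWeights.LoadFromFile(modelParams);
using var context = weights.CreateContext(modelParams);

// In 0.14.0 every call prepends a space, so with Gemma the fragment
// "user" comes back as ' user' (id 2425) instead of 'user' (id 1645).
var first = context.Tokenize("user", addBos: false, special: true);
var second = context.Tokenize("user", addBos: false, special: true);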

Most models seem to tolerate this (I saw it when using trollek/NinjaMouse-2.4B-32L-danube), but when I use gemma-1.1-2b-it-Q6_K.gguf (from bartowski/gemma-1.1-2b-it-GGUF), it no longer works. The prompt was:

<start_of_turn>user
Who are you?<end_of_turn>
<start_of_turn>model

Validating with tokenize from llama.cpp b2985 (used in LLamaSharp version 0.13.0):

     2 -> '<bos>'
   106 -> '<start_of_turn>'
  2425 -> ' user'
235286 -> '\'
235254 -> 'n'
  6571 -> 'Who'
   708 -> ' are'
   692 -> ' you'
235336 -> '?'
   107 -> '<end_of_turn>'
   730 -> ' \'
235254 -> 'n'
   106 -> '<start_of_turn>'
  2091 -> ' model'
235286 -> '\'
235254 -> 'n'

Interestingly, the token at position 2 with id 2425 (' user') adds a leading space to 'user' (id 1645).

But even the latest llama.cpp b3412 does not tokenize this correctly; look at the token at position 2 with id 968 (' <'):

     2 -> '<bos>'
   968 -> ' <'
  2997 -> 'start'
235298 -> '_'
   559 -> 'of'
235298 -> '_'
 15508 -> 'turn'
235313 -> '>'
  1645 -> 'user'
   108 -> '
'
  6571 -> 'Who'
   708 -> ' are'
   692 -> ' you'
181537 -> '?<'
   615 -> 'end'
235298 -> '_'
   559 -> 'of'
235298 -> '_'
 15508 -> 'turn'
235313 -> '>'
   108 -> '
'
235322 -> '<'
  2997 -> 'start'
235298 -> '_'
   559 -> 'of'
235298 -> '_'
 15508 -> 'turn'
235313 -> '>'
  2516 -> 'model'
   108 -> '
'

Is there a way to completely prevent tokenization from adding extra spaces anywhere? I will tokenize by hand if necessary. 😉

Reproduction Steps

Write the prompt (see above) to prompt.txt and run:

for llama.cpp b2985:

tokenize.exe "gemma-1.1-2b-it.Q6_K.gguf" "<start_of_turn>user\nWho are you?<end_of_turn>\n<start_of_turn>model\n"

or for llama.cpp b3412:

llama-tokenize.exe -m "gemma-1.1-2b-it.Q6_K.gguf" -f "prompt.txt"

Environment & Configuration

  • Operating system: Windows 10
  • .NET runtime version: 8.0
  • LLamaSharp version: 0.14.0
  • CPU device: Intel Core i7

Known Workarounds

I would love to know!

@martindevans (Member)

If you're seeing the wrong behaviour in llama-tokenize.exe, this looks like it's probably an upstream bug?

@newsletternewsletter (Author)

> If you're seeing the wrong behaviour in llama-tokenize.exe, this looks like it's probably an upstream bug?

Yes, indeed!
I opened a bug ticket there: ggerganov/llama.cpp#8584.

@martindevans added the bug (Something isn't working) and Upstream (Tracking an issue in llama.cpp) labels on Jul 19, 2024
@newsletternewsletter (Author)

The behavior where the tokenizer adds a space to the first non-special token can be customized via the metadata key tokenizer.ggml.add_space_prefix.
There are two workarounds (ggerganov/llama.cpp#8584 (comment)):

  1. Using a KV override: tokenizer.ggml.add_space_prefix=bool:false (see the command sketch after this list).
  2. Changing the model's KV metadata: add this key and set its value to false.
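
As a sketch of workaround 1 with llama.cpp's own CLI: the --override-kv flag takes exactly the KEY=TYPE:VALUE string from above (model path and prompt are placeholders):

llama-cli.exe -m "gemma-1.1-2b-it.Q6_K.gguf" --override-kv tokenizer.ggml.add_space_prefix=bool:false -p "Who are you?"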

An acceptable workaround: changing the KV metadata in the GGUF file via a Python script works wonders (using a modified gguf-py/scripts/gguf_new_metadata.py from llama.cpp).

However, a KV override via ModelParams.MetadataOverrides does not seem to work. When adding modelParams.MetadataOverrides.Add(new MetadataOverride("tokenizer.ggml.add_space_prefix", false)) before loading the model via LLamaWeights.LoadFromFileAsync, the KV override is ignored and the tokenizer still adds a space.
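
For reference, the failing attempt looks roughly like this (a minimal sketch inside an async context; the using directives and the CreateContext call are assumed, the MetadataOverrides call is quoted from above):

using LLama;
using LLama.Abstractions;
using LLama.Common;

var modelParams = new ModelParams("gemma-1.1-2b-it.Q6_K.gguf");

// Ask llama.cpp to behave as if the GGUF metadata contained
// tokenizer.ggml.add_space_prefix = false. The override is ignored.
modelParams.MetadataOverrides.Add(
    new MetadataOverride("tokenizer.ggml.add_space_prefix", false));

using var weights = await LLamaWeights.LoadFromFileAsync(modelParams);
using var context = weights.CreateContext(modelParams);

// Despite the override, the tokenizer still prepends a space here.
var tokens = context.Tokenize("user", addBos: false, special: true);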

This is an upstream bug as it is reproducible with llama-cli.

@newsletternewsletter (Author)

It is being fixed upstream: ggerganov/llama.cpp#8614

@Oceania2018 (Member)

Gemma 2 2B is released; it's even surpassing GPT-3.5-turbo.
[image]

@newsletternewsletter (Author)

> However, a KV override via ModelParams.MetadataOverrides does not seem to work. When adding modelParams.MetadataOverrides.Add(new MetadataOverride("tokenizer.ggml.add_space_prefix", false)) before loading the model via LLamaWeights.LoadFromFileAsync, the KV override is ignored and the tokenizer still adds a space.

I tried again with LLamaSharp 0.15.0, and although it has been fixed upstream (ggerganov/llama.cpp#8614), the KV override in LLamaSharp via ModelParams.MetadataOverrides does not work, neither with models that have tokenizer.ggml.add_space_prefix set to true (e.g. Lite-Mistral-150M-v2-Instruct), nor with ones that lack the key tokenizer.ggml.add_space_prefix (many old quants).
