
[BUG]: Tokenization in 0.14.0 adds spaces #856

Open
newsletternewsletter opened this issue Jul 18, 2024 · 6 comments
Labels
bug (Something isn't working) · Upstream (Tracking an issue in llama.cpp)

Comments

@newsletternewsletter

Description

When tokenizing a text and then decoding the resulting tokens, one can see that tokenization now (as of version 0.14.0) adds one additional leading space to the text for every call of Context.Tokenize(text, addBos, special). This is especially bad if a text is tokenized with more than one call.
Version 0.13.0 did not exhibit this behavior; at least, it did not add spaces at the start of words, changing their token ids.
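
For illustration, a minimal repro sketch (the model path is a placeholder; the using directives and the LoadFromFile/CreateContext calls are the usual LLamaSharp surface and are assumed here, not quoted from this issue):

using LLama;
using LLama.Common;

// Placeholder model path, for illustration only.
var modelParams = new ModelParams("gemma-1.1-2b-it.Q6_K.gguf");
using var weights = LLamaWeights.LoadFromFile(modelParams);
using var context = weights.CreateContext(modelParams);

// In 0.14.0 every call prepends a space, so with Gemma the fragment
// "user" comes back as ' user' (id 2425) instead of 'user' (id 1645).
var first = context.Tokenize("user", addBos: false, special: true);
var second = context.Tokenize("user", addBos: false, special: true);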

Most models seem to tolerate this (I saw it when using trollek/NinjaMouse-2.4B-32L-danube), but when I use gemma-1.1-2b-it-Q6_K.gguf (from bartowski/gemma-1.1-2b-it-GGUF), it no longer works. The prompt was:

<start_of_turn>user
Who are you?<end_of_turn>
<start_of_turn>model

Validating with tokenize from llama.cpp b2985 (used in LLamaSharp version 0.13.0):

     2 -> '<bos>'
   106 -> '<start_of_turn>'
  2425 -> ' user'
235286 -> '\'
235254 -> 'n'
  6571 -> 'Who'
   708 -> ' are'
   692 -> ' you'
235336 -> '?'
   107 -> '<end_of_turn>'
   730 -> ' \'
235254 -> 'n'
   106 -> '<start_of_turn>'
  2091 -> ' model'
235286 -> '\'
235254 -> 'n'

Interestingly, the token at position 2 with id 2425 (' user') adds a leading space to 'user' (id 1645).

But even the latest llama.cpp b3412 does not tokenize this correctly; look at the token at position 2 with id 968 (' <'):

     2 -> '<bos>'
   968 -> ' <'
  2997 -> 'start'
235298 -> '_'
   559 -> 'of'
235298 -> '_'
 15508 -> 'turn'
235313 -> '>'
  1645 -> 'user'
   108 -> '
'
  6571 -> 'Who'
   708 -> ' are'
   692 -> ' you'
181537 -> '?<'
   615 -> 'end'
235298 -> '_'
   559 -> 'of'
235298 -> '_'
 15508 -> 'turn'
235313 -> '>'
   108 -> '
'
235322 -> '<'
  2997 -> 'start'
235298 -> '_'
   559 -> 'of'
235298 -> '_'
 15508 -> 'turn'
235313 -> '>'
  2516 -> 'model'
   108 -> '
'

Is there a way to completely prevent tokenization from adding extra spaces anywhere? I will tokenize by hand if necessary. 😉

Reproduction Steps

Write the prompt (see above) to prompt.txt and run:

for llama.cpp b2985:

tokenize.exe "gemma-1.1-2b-it.Q6_K.gguf" "<start_of_turn>user\nWho are you?<end_of_turn>\n<start_of_turn>model\n"

or for llama.cpp b3412:

llama-tokenize.exe -m "gemma-1.1-2b-it.Q6_K.gguf" -f "prompt.txt"

Environment & Configuration

  • Operating system: Windows 10
  • .NET runtime version: 8.0
  • LLamaSharp version: 0.14.0
  • CPU device: Intel Core i7

Known Workarounds

I would love to know!

@martindevans (Member)

If you're seeing the wrong behaviour in llama-tokenize.exe, this looks like it's probably an upstream bug?

@newsletternewsletter (Author)

> If you're seeing the wrong behaviour in llama-tokenize.exe, this looks like it's probably an upstream bug?

Yes, indeed!
I opened a bug ticket there: ggerganov/llama.cpp#8584.

@martindevans added the bug (Something isn't working) and Upstream (Tracking an issue in llama.cpp) labels on Jul 19, 2024
@newsletternewsletter (Author)

The behavior where the tokenizer adds a space to the first non-special token can be customized via the metadata key tokenizer.ggml.add_space_prefix.
There are two workarounds (ggerganov/llama.cpp#8584 (comment)):

  1. Using a KV override: tokenizer.ggml.add_space_prefix=bool:false (see the command sketch after this list).
  2. Changing the model's KV metadata: add this key and set its value to false.
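
As a sketch of workaround 1 with llama.cpp's own CLI: the --override-kv flag takes exactly the KEY=TYPE:VALUE string from above (model path and prompt are placeholders):

llama-cli.exe -m "gemma-1.1-2b-it.Q6_K.gguf" --override-kv tokenizer.ggml.add_space_prefix=bool:false -p "Who are you?"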

An acceptable workaround: changing the KV metadata in the GGUF file via a Python script works wonders (using a modified gguf-py/scripts/gguf_new_metadata.py from llama.cpp).

However, a KV override via ModelParams.MetadataOverrides does not seem to work. When adding modelParams.MetadataOverrides.Add(new MetadataOverride("tokenizer.ggml.add_space_prefix", false)) before loading the model via LLamaWeights.LoadFromFileAsync, the KV override is ignored and the tokenizer still adds a space.
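
For reference, the failing attempt looks roughly like this (a minimal sketch inside an async context; the using directives and the CreateContext call are assumed, the MetadataOverrides call is quoted from above):

using LLama;
using LLama.Abstractions;
using LLama.Common;

var modelParams = new ModelParams("gemma-1.1-2b-it.Q6_K.gguf");

// Ask llama.cpp to behave as if the GGUF metadata contained
// tokenizer.ggml.add_space_prefix = false. The override is ignored.
modelParams.MetadataOverrides.Add(
    new MetadataOverride("tokenizer.ggml.add_space_prefix", false));

using var weights = await LLamaWeights.LoadFromFileAsync(modelParams);
using var context = weights.CreateContext(modelParams);

// Despite the override, the tokenizer still prepends a space here.
var tokens = context.Tokenize("user", addBos: false, special: true);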

This is an upstream bug as it is reproducible with llama-cli.

@newsletternewsletter (Author)

It is being fixed upstream: ggerganov/llama.cpp#8614

@Oceania2018 (Member)

Gemma 2 2B is released; it's even surpassing GPT-3.5-turbo.
[image]

@newsletternewsletter (Author)

> However, a KV override via ModelParams.MetadataOverrides does not seem to work. When adding modelParams.MetadataOverrides.Add(new MetadataOverride("tokenizer.ggml.add_space_prefix", false)) before loading the model via LLamaWeights.LoadFromFileAsync, the KV override is ignored and the tokenizer still adds a space.

I tried again with LLamaSharp 0.15.0, and although it has been fixed upstream (ggerganov/llama.cpp#8614), the KV override in LLamaSharp via ModelParams.MetadataOverrides does not work, neither with models that have tokenizer.ggml.add_space_prefix set to true (e.g. Lite-Mistral-150M-v2-Instruct), nor with ones that lack the key tokenizer.ggml.add_space_prefix (many old quants).
