Infill Incorrect Tokenization #3503

Closed
kherud opened this issue Oct 6, 2023 · 15 comments · Fixed by #3508
Labels: bug (Something isn't working), high priority (Very important issue)

Comments

kherud (Contributor) commented Oct 6, 2023

I am comparing the tokenization of the codellama repository with the infill example of this repository.

The first example prompt from the codellama repository consists of the strings:

  • Prefix: 'def remove_non_ascii(s: str) -> str:\n    """ '
  • Suffix: '\n    return result\n'

Comparing the tokenization of both implementations results in:

  • CodeLlama: 1 32007 822 3349 29918 5464 29918 294 18869 29898 29879 29901 851 29897 1599 851 29901 13 1678 9995 29871 32008 13 1678 736 1121 13 32009
  • Llama.cpp: 32007 1 822 3349 29918 5464 29918 294 18869 29898 29879 29901 851 29897 1599 851 29901 13 1678 9995 29871 32008 1 29871 13 1678 736 1121 13 32009

There are two differences:

  • The first two tokens are swapped (those are prefix_id and bos I think)
  • Llama.cpp adds a bos token again after the suffix_id token and an additional 29871 (is this a space?)

I believe the latter is definitely wrong, as the paper states on page 4:

To limit the distribution shift between autoregressive and infilling training, we suppress the implicit leading space that SentencePiece tokenizers add upon encoding the middle part and the suffix
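For reference, the expected layout corresponds roughly to this construction (a sketch with hypothetical helper names, not the actual codellama code; prefix_id, suffix_id and middle_id stand for tokens 32007, 32008 and 32009, and encode_infilling is the space-suppressing encoder quoted further down in this thread):

def build_infill_tokens(tok, prefix, suffix):
    return (
        [tok.bos_id, tok.prefix_id]     # 1, 32007
        + tok.encode(prefix)            # prefix tokens; the implicit leading space merges into ' def'
        + [tok.suffix_id]               # 32008
        + tok.encode_infilling(suffix)  # suffix tokens, implicit leading space suppressed
        + [tok.middle_id]               # 32009
    )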

ggerganov added the bug and high priority labels on Oct 6, 2023
vvhg1 (Contributor) commented Oct 6, 2023

Could you please try this and see if the issue is fixed there? It would be great if you could also test the server with infill.

PS: for some reason I don't seem to be getting an additional space. Or could it be a newline that is also removed in the paper?

kherud (Contributor, Author) commented Oct 6, 2023

Here are the additional tests:

codellama:
1 32007 822 3349 29918 5464 29918 294 18869 29898 29879 29901 851 29897 1599 851 29901 13 1678 9995 29871 32008 13 1678 736 1121 13 32009

vvhg1/llama.cpp infill.cpp:
1 32007 822 3349 29918 5464 29918 294 18869 29898 29879 29901 851 29897 1599 851 29901 13 1678 9995 29871 32008 29871 13 1678 736 1121 13 32009

vvhg1/llama.cpp server.cpp:
1 32007 822 3349 29918 5464 29918 294 18869 29898 29879 29901 851 29897 1599 851 29901 13 1678 9995 29871 32008 29871 13 1678 736 1121 13 32009

ggerganov/llama.cpp server.cpp:
32007 1 822 3349 29918 5464 29918 294 18869 29898 29879 29901 851 29897 1599 851 29901 13 1678 9995 29871 32008 1 29871 13 1678 736 1121 13 32009

Apart from the additional 29871, your fix seems correct. I think it is a space, since the prefix string ends in a space and the same token appears right before the suffix token (32008).

ggerganov (Owner) commented:

Here are some sample results using tests/test-tokenizer-0-llama.py which tokenizes strings using the original Python tokenizer:

{ "\t"                    , {   29871,     12, }, },
{ "\n"                    , {   29871,     13, }, },
{ "\n return"             , {   29871,     13,    736, }, },
{ "\t\n"                  , {   29871,     12,     13, }, },
{ "Hello world"           , {   15043,   3186, }, },
{ " Hello world"          , {   29871,  15043,   3186, }, },
{ "Hello World"           , {   15043,   2787, }, },
{ " Hello World"          , {   29871,  15043,   2787, }, },
{ " Hello World!"         , {   29871,  15043,   2787,  29991, }, },
{ "Hello, world!"         , {   15043,  29892,   3186,  29991, }, },
{ " Hello, world!"        , {   29871,  15043,  29892,   3186,  29991, }, },

It looks like codellama requires some extra logic to remove the leading spaces when they occur.
This logic should be implemented as part of the infill example.

vvhg1 (Contributor) commented Oct 6, 2023

Agreed, I'm on it. I guess we limit the cleaning to leading spaces, as it seems newlines are not removed in the original?

kherud (Contributor, Author) commented Oct 6, 2023

I think the additional space gets introduced by the llama.cpp tokenizer; a quick look suggests these lines are responsible:

llama.cpp/llama.cpp

Lines 5220 to 5221 in 9ca79d5

// without adding this leading whitespace, we do not get the same results as the original tokenizer
raw_text = " " + raw_text;

The code probably relies on the assumption that a prompt always consists of a single part, which doesn't hold here.

vvhg1 (Contributor) commented Oct 6, 2023

for some reason I get different tokens for newlines:
'':1, ' <PRE>':32007, ' def':822, ' remove':3349, '_':29918, 'non':5464, '_':29918, 'as':294, 'cii':18869, '(':29898, 's':29879, ':':29901, ' str':851, ')':29897, ' ->':1599, ' str':851, ':\':3583, 'n':29876, ' """':9995, ' ':29871, ' <SUF>':32008, ' \':320, 'n':29876, ' return':736, ' result':1121, '\':29905, 'n':29876, ' <MID>':32009
exact command:

./infill -t 10 -ngl 0 -m models/codellama-13b.Q5_K_S.gguf -c 4096 --temp 0.7 --repeat_penalty 1.1 -n 20 --in-prefix 'def remove_non_ascii(s: str) -> str:\n """ ' --in-suffix '\n return result\n'

I only see the 29871 when I add leading whitespace to the suffix (which we want to clean in any case I guess)

kherud (Contributor, Author) commented Oct 6, 2023

I used CodeLlama-7b-Instruct. Maybe that makes the difference?

ggerganov (Owner) commented:

@vvhg1 You are missing -e

@kherud Without this code, we don't get the same output as the Python tokenizer. The info in my comment is generated with the Python tokenizer, so it is ground truth. Any extra handling of the tokens should happen outside the tokenizer implementation

vvhg1 (Contributor) commented Oct 6, 2023

Thanks, the -e now gives me identical results.

What I do now is:

  • remove leading spaces from the suffix input
  • if -e is set, remove the leading space token from the tokenized suffix

All of this is done only in the infill case, so it doesn't touch anything else. Please test; I hope it works as intended now...

kherud (Contributor, Author) commented Oct 6, 2023

I think cleaning any leading spaces would be wrong, since they might legitimately be part of the suffix.

Codellama uses this little method:

def encode_infilling(self, s: str) -> List[int]:
    """Encode a string without an implicit leading space."""
    return self.sp_model.encode("☺" + s)[2:]

With an explanation here:

We could use any prefix to add under the condition that its last token does not merge to the right with any other token. In our vocabulary, the emoji achieves that. Obviously this does not extend to other tokenizers, and the clean solution would be to add a switch to SentencePiece whether to suppress the implicit leading space upon a single encoding (there is a global switch but not a per-prompt one).
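A usage sketch (assuming tok is a codellama Tokenizer instance exposing sp_model and the encode_infilling method quoted above; the expected ids are taken from the token dumps in this thread):

suffix = "\n    return result\n"
with_space = tok.sp_model.encode(suffix)      # [29871, 13, 1678, 736, 1121, 13] - implicit leading space
without_space = tok.encode_infilling(suffix)  # [13, 1678, 736, 1121, 13] - leading space token suppressed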

The issue also points to the transformers library's solution; I think this is the relevant part:

https://github.com/huggingface/transformers/blob/5af2c6269672cda01c24ad48fab13f14a3ffb746/src/transformers/models/code_llama/tokenization_code_llama.py#L287C14-L301

I'm not too familiar with the tokenization, but might there be situations where " " + raw_text gives undesired results, depending on whether the space merges with the first characters of raw_text into a different token? If there is always just a single 29871 added, it should suffice to simply remove this individual token, right?

vvhg1 (Contributor) commented Oct 6, 2023

Yes, I think the topic is slightly more complicated than I initially thought.

What we probably need to do is first look at how we encode, because only -e adds a space (at least that's how it looks to me right now). Then, in the -e case, we either cut one space at the beginning of the string if it starts with a space (that space then gets added again by the tokenizer), or, if there is none, we remove the single space token after tokenization.

The rationale behind this is that multiple spaces are tokens in their own right, and I guess it would be more complicated (at least to maintain) to translate an n-space token at the beginning of the string into an (n-1)-space token.

vvhg1 (Contributor) commented Oct 7, 2023

What I do now, if params.escape is set (only then does the tokenizer prepend an extra space; in the server this seems to be the default behaviour, and there is no option for it either):

Case 1: the suffix string starts with a space

  • we remove it
  • the tokenizer adds an extra space again, giving us the original suffix (I confirmed that the tokenizer first adds the space and then tokenizes, so e.g. a double space is not tokenized as 29871, 29871 but correctly as 259)
  • all done

Case 2: no leading space in the original suffix string

  • we tokenize the string; the tokenizer prepends a 29871
  • we remove the prepended 29871
  • all done
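In Python-style pseudocode the logic is roughly this (a sketch only; the real change lives in the C++ infill example, and tokenize stands in for llama.cpp tokenization with escaping enabled, i.e. it prepends the 29871 space token):

SPACE_TOKEN = 29871

def tokenize_suffix(suffix, tokenize):
    if suffix.startswith(" "):
        # case 1: drop the literal space; the tokenizer re-adds it,
        # so the token stream still encodes the original suffix
        return tokenize(suffix[1:])
    # case 2: tokenize as-is, then drop the prepended space token
    tokens = tokenize(suffix)
    if tokens and tokens[0] == SPACE_TOKEN:
        tokens = tokens[1:]
    return tokens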

Please test thoroughly

shibe2 (Contributor) commented Oct 9, 2023

Since SPM started inserting a space unconditionally (#2810), I have to use a hack similar to Code Llama's. I've found that in the models I use, the newline (LF) character does not stick to any other tokens, and its identifier is easily queryable with llama_token_nl.
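Roughly, the idea is something like this (a sketch with illustrative names, not an actual llama.cpp binding; the exact details may differ):

def encode_without_leading_space(text, tokenize, token_nl):
    tokens = tokenize("\n" + text)      # the newline does not merge with neighbouring tokens
    sentinel = tokens.index(token_nl)   # the first occurrence is the newline we prepended
    return tokens[sentinel + 1:]        # drop the implicit leading space and the sentinel newline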

ggerganov (Owner) commented:

Seems to work now. One minor thing I noticed that I don't understand:

I ran the original test from this issue:

make -j && ./infill -t 4 -m models/codellama-7b/ggml-model-f16.gguf -c 4096 --temp 0.0 --repeat_penalty 1.1 --in-prefix 'def remove_non_ascii(s: str) -> str:\n """ ' --in-suffix '\n return result\n' --verbose-prompt -e -ngl 1

ggml_metal_add_buffer: allocated 'data            ' buffer, size = 12853.98 MB, (12854.61 / 147456.00)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  2050.00 MB, (14904.61 / 147456.00)
ggml_metal_add_buffer: allocated 'alloc           ' buffer, size =   288.02 MB, (15192.62 / 147456.00)

system_info: n_threads = 4 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | 

main: prompt: ''
main: number of tokens in prompt = 26
     1 -> ''
 32007 -> ' <PRE>'
   822 -> ' def'
  3349 -> ' remove'
 29918 -> '_'
  5464 -> 'non'
 29918 -> '_'
   294 -> 'as'
 18869 -> 'cii'
 29898 -> '('
 29879 -> 's'
 29901 -> ':'
   851 -> ' str'
 29897 -> ')'
  1599 -> ' ->'
   851 -> ' str'
 29901 -> ':'
    13 -> '
'
  9995 -> ' """'
 29871 -> ' '
 32008 -> ' <SUF>'
    13 -> '
'
   736 -> ' return'
  1121 -> ' result'
    13 -> '
'
 32009 -> ' <MID>'

sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = -1, n_keep = 0



#####  Infill mode  #####

 <PRE> def remove_non_ascii(s: str) -> str:
 """  <SUF>
 return result
 <MID>Remove non-ASCII characters from a string.

 Args:
     s (str): The input string.

 Returns:
     str: The output string.
 """
 import re
 pattern = re.compile(r'[^\x00-\x7F]+', re.UNICODE)
 result = pattern.sub('', s) <EOT> <EOT>
llama_print_timings:        load time =     720.32 ms
llama_print_timings:      sample time =      50.62 ms /    81 runs   (    0.62 ms per token,  1600.19 tokens per second)
llama_print_timings: prompt eval time =      54.77 ms /    26 tokens (    2.11 ms per token,   474.72 tokens per second)
llama_print_timings:        eval time =    1935.91 ms /    80 runs   (   24.20 ms per token,    41.32 tokens per second)
llama_print_timings:       total time =    2089.48 ms

I don't have the token 1678 in the suffix tokenization:

32008 13 736 1121 13

In contrast, OP gets:

32008 13 1678 736 1121 13

So I'm not sure why you guys got the extra 1678 there in your tests.

vvhg1 (Contributor) commented Oct 10, 2023

I think it should be ok since it is the same behaviour here and in Python.
1678 is three spaces btw.
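The extra 1678 likely comes from the indentation: the original example prompt indents the docstring and the return statement with four spaces, while the command above uses a single space. A sketch, assuming the sentencepiece setup from the earlier snippet (expected ids taken from the token dumps in this thread):

print(sp.encode("\n return result\n"))     # [29871, 13, 736, 1121, 13]       - single-space indent
print(sp.encode("\n    return result\n"))  # [29871, 13, 1678, 736, 1121, 13] - four-space indent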
