
Qwen-72B-Chat conversion script does not treat <|im_start|> and <|im_end|> correctly. #4331

Closed
Noeda opened this issue Dec 4, 2023 · 13 comments

Noeda commented Dec 4, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Description

(This is specifically for the latest 72B models. I have never tried the smaller ones).

I'm using this model: https://huggingface.co/Qwen/Qwen-72B-Chat

Commit: 33e171d1e9fc4903f9314b490d77fb8d58331b63

I think the current convert-hf-to-gguf.py does not produce a .gguf file that treats the <|im_start|> and <|im_end|> tokens correctly.

The prompt I used is "<|im_start|>system" for the examples below.

Following the steps in #4281 to produce some .gguf files (I personally used the Q6_K on a Mac Studio), I tried the tokenize tool:

    27 -> '<'
    91 -> '|'
   318 -> 'im'
  4906 -> '_start'
    91 -> '|'
    29 -> '>'
  8948 -> 'system'

Compare this to a Yi model with the exact same prompt:

     6 -> '<|im_start|>'
 10707 -> 'system'

I looked at the Qwen tokenizer code (https://huggingface.co/Qwen/Qwen-72B/blob/main/tokenization_qwen.py#L37) and I think these are intended to be single tokens, but the current script does not handle them properly.

Steps to Reproduce

  1. Download the Qwen models. (https://huggingface.co/Qwen/Qwen-72B-Chat)
  2. Use the convert-hf-to-gguf.py script to convert one into a .gguf file. (This is the exact command I found on my Mac Studio: python3 convert-hf-to-gguf.py --outfile /Volumes/T9/qwen_72b_chat_v3_f16.gguf --outtype f16 ~/text-generation-webui/models/Qwen_Qwen-72B-Chat)
  3. Run tokenize on them to see what tokens are interpreted.

To be honest, I'm not sure whether this is a bug for the llama.cpp repository or something the Qwen team might want to fix in their repo. But I'm submitting it here for awareness.

Also, the model seems to work fine despite this. But maybe it would work better if these tokens were interpreted correctly? No idea.

@Noeda Noeda changed the title Qwen conversion script does not treat <|im_start|> and <|im_end|> correctly. Qwen-72B-Chat conversion script does not treat <|im_start|> and <|im_end|> correctly. Dec 4, 2023

cmp-nct commented Dec 4, 2023

You should link the exact model you used and make sure it's what you have locally, not something else. Make sure you have ALL files from the HF dump in your local directory and that nothing is truncated or missing.

What you describe sounds like an error in the tokenizer import: either the wrong tokenizer is used or the special tokens were not read in.
For the usual models (all the ones I've tested in the past month) this has always worked.

Just as you noticed: if that tokenization is not working then the finetune is broken, so that really has to be a single token.


Noeda commented Dec 4, 2023

@cmp-nct I used this model: https://huggingface.co/Qwen/Qwen-72B-Chat (I edited the original post to include it)

I used this command to convert it to .gguf (also in my original comment):

python3 convert-hf-to-gguf.py --outfile /Volumes/T9/qwen_72b_chat_v3_f16.gguf --outtype f16 ~/text-generation-webui/models/Qwen_Qwen-72B-Chat

I have verified that the Qwen_Qwen-72B-Chat directory I used is exactly the model from the HuggingFace repo I linked (well, at least by file names; it's not obviously diverging, but I admit I didn't do a super thorough check like comparing md5sums over all the files).

The model works fine. I just think it may not be working optimally because the tokens are not treated specially.


cmp-nct commented Dec 4, 2023

Okay that looks bad at first glance.

    ENDOFTEXT = "<|endoftext|>"
    IMSTART = "<|im_start|>"
    IMEND = "<|im_end|>"
    # as the default behavior is changed to allow special tokens in
    # regular texts, the surface forms of special tokens need to be
    # as different as possible to minimize the impact
    EXTRAS = tuple((f"<|extra_{i}|>" for i in range(205)))
    # changed to use actual index to avoid misconfiguration with vocabulary expansion
    SPECIAL_START_ID = 151643
    SPECIAL_TOKENS = tuple(
        enumerate(
            (
                (
                    ENDOFTEXT,
                    IMSTART,
                    IMEND,
                )

So I am guessing here; I do not know whether we have special handling code for QWEN implemented, but it would be required.
There are two important differences:

  1. It uses a custom tokenizer! It has similarities to GPT-2 tokenization, but the regex looks different (it would need to be checked).
    If the regex is different, the tokens would be split in a different way - most likely similar for English and different for Chinese.

  2. It assigns the 3 special tokens manually, with IDs 151643, 151644, 151645, in that tokenizer Python file.
    So what you can do as a quick fix: you can hack the JSON files and add the 3 tokens with their correct IDs.
    Usually you find them in two JSON files; you can edit them with any proper text editor (compare with the Yi files).

Once that is done you'd have correct IDs for the 3 special tokens; however, there will be more issues.
I've not looked into Qwen, so take what I said with a grain of salt and with the assumption that we do not have QWEN handling code somewhere. Maybe there are open threads here that discuss it, or maybe you'll find a PR handling it.

If you decide to continue and hack those 3 tokens into the files, make sure you run the Python demo implementation, then add a print of the tokenizer output (that's what is fed into the generation, one of the main lines).
This way you can compare the tokenization of a prompt and check whether the llama.cpp one produces similar results.
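
Here is a rough sketch of that comparison step (untested; it assumes the model files are in ./Qwen-72B-Chat and that transformers can load the custom tokenizer with trust_remote_code=True):

    # Print the reference tokenization from the HF Qwen tokenizer so it can be
    # compared against the output of llama.cpp's tokenize tool for the same prompt.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("./Qwen-72B-Chat", trust_remote_code=True)
    ids = tok("<|im_start|>system")["input_ids"]
    print(ids)
    # If <|im_start|> is handled as one special token, the list should start with
    # 151644 rather than the pieces '<', '|', 'im', '_start', '|', '>'.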


Noeda commented Dec 5, 2023

Also, for the record, I am on commit 33e171d1e9fc4903f9314b490d77fb8d58331b63 (Sun Dec 3 01:10:43 2023 -0800) for this. There are some later commits as of writing this comment, but they don't look relevant to this particular issue.


cmp-nct commented Dec 5, 2023

https://gist.github.com/xenova/a452a6474428de0182b17605a98631ee

Try that; it might convert their tokenizer format to HF, if that was not already done.
I checked the pull requests a bit and it looks like QWEN is likely working; possibly a bias parameter is missing (though if that hadn't been added yet, I doubt you'd get any good results from it).

So that script is worth a check. If that doesn't help, try hacking in the 3 special tokens with a text editor, then convert the model again and it might be good.


sorasoras commented Jan 17, 2024

I encounter this problem as well with Qwen-13B models; it appears randomly during the conversation. (screenshot attached)


xd2333 commented Jan 18, 2024

In my case, [PAD151645], [PAD151644], and [PAD151643] show up in the output (Qwen-14B-Chat q4k). (screenshot attached)


sakura-umi commented Jan 18, 2024

I encounter this problem as well with Qwen-13B models; it appears randomly during the conversation.

In my case, [PAD151645], [PAD151644], and [PAD151643] show up in the output (Qwen-14B-Chat q4k).

Your situation may not be the same as the author's. In my case, b1892 reproduces the bug you got but b1833 doesn't, and the special tokens are correctly treated in both, since the tokenize tool shows the expected behaviour below.

(screenshot: left for b1833 and right for b1892.)

But the difference between b1833 and b1892 is that a special token detokenizes to its text in b1892 (here '<|im_end|>'), while in b1833 it does not (here '').

So I think this is why you got this bug.


Conclusion: b1859 fixed a detokenization bug for non-special added tokens, but the convert script does not treat these special tokens (like 151643, 151644, 151645) as control tokens.

Solution: just make the reverse_vocab dict contain the tokenizer's special tokens, and everything is OK.
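
A minimal sketch of that idea (illustrative only; the actual convert-hf-to-gguf.py code differs in detail, and the Qwen-1_8B-Chat path here is just an example):

    # Build reverse_vocab from the base vocab plus the added/special tokens, so
    # that IDs like 151643-151645 map to their real surface forms instead of
    # being written out as [PAD{id}] placeholders.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Qwen-1_8B-Chat", trust_remote_code=True)
    merged_vocab = {**tokenizer.get_vocab(), **tokenizer.get_added_vocab()}
    reverse_vocab = {token_id: text for text, token_id in merged_vocab.items()}
    print(reverse_vocab.get(151644))  # ideally prints '<|im_start|>'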


compilade commented Jan 22, 2024

In my case, [PAD151645], [PAD151644], and [PAD151643] show up in the output (Qwen-14B-Chat q4k).

Regarding the [PAD{id}] tokens, I recently fixed this in #5052, but it requires re-converting existing Qwen models.

Solution: just make the reverse_vocab dict contain the tokenizer's special tokens, and everything is OK.

Funny, that's pretty much exactly what I did in the aforementioned pull request.

I can reproduce the original problem (described by @Noeda) when converting Qwen-1.8B-Chat with llama.cpp commit 152d9d0 (tag b1941):

$ # Convert the model (well, only the vocab, since that's the only thing used in the following test)
$ python3 llama.cpp/convert-hf-to-gguf.py --vocab-only --outfile qwen-1.8b-chat-b1941-vocab.gguf Qwen-1_8B-Chat/
...
$ # Test the tokenizer
$ ./llama.cpp/result/bin/tokenize qwen-1.8b-chat-b1941-vocab.gguf "<|im_start|>system"
...
llm_load_print_meta: BOS token        = 151643 '[PAD151643]'
llm_load_print_meta: EOS token        = 151643 '[PAD151643]'
llm_load_print_meta: UNK token        = 151643 '[PAD151643]'
...
    27 -> '<'
    91 -> '|'
   318 -> 'im'
  4906 -> '_start'
    91 -> '|'
    29 -> '>'
  8948 -> 'system'

When I instead convert it with the more recent d6bd4d4 (tag b1942, which is when #5052 was merged), the tokenization works correctly 🎉 :

$ # Convert the model's vocab
$ python3 llama.cpp/convert-hf-to-gguf.py --vocab-only --outfile qwen-1.8b-chat-b1942-vocab.gguf Qwen-1_8B-Chat/
...
$ # Test the tokenizer
$ ./llama.cpp/result/bin/tokenize qwen-1.8b-chat-b1942-vocab.gguf "<|im_start|>system"
...
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: UNK token        = 151643 '<|endoftext|>'
...
151644 -> ''
  8948 -> 'system'

So, <|im_start|> is now properly understood as token 151644. Also notice that the various [PAD{id}] tokens now have their correct names (in this case <|endoftext|>).

In both cases, I used the same tokenize binary (from an even more recent commit, 6f9939d), so the problem really is inside the previously-badly-converted model files.

I don't know to what extent this affects the existing GGUF-converted Qwen models on HuggingFace, but I think most of them will need to be reconverted with a recent llama.cpp version.

If anyone reading this is unsure whether their Qwen GGUF model(s) needs re-conversion, run the tokenize example on the model and check if <|im_start|> is properly read as a single token (151644) instead of multiple, and also check if the BOS, EOS, and UNK tokens are not called [PAD151643].


Noeda commented Jan 23, 2024

Regarding the [PAD{id}] tokens, I recently fixed this in #5052, but it requires re-converting existing Qwen models.

I don't know to what extent this affects the existing GGUF-converted Qwen models on HuggingFace, but I think most of them will need to be reconverted with a recent llama.cpp version.

If anyone reading this is unsure whether their Qwen GGUF model(s) needs re-conversion, run the tokenize example on the model and check if <|im_start|> is properly read as a single token (151644) instead of multiple, and also check if the BOS, EOS, and UNK tokens are not called [PAD151643].

Hey, thanks for the fix @compilade. I originally opened this issue not so much because I had a big interest in getting it fixed for myself, but because I thought the community might want to be aware that Qwen models seem to have issues with llama.cpp, and to have a placeholder to discuss it (I guess that worked out 😄).

I still seem to have the original Qwen-72B .gguf files and HuggingFace files on my computer from when I opened this issue. I'm going to rebake the .ggufs with the latest llama.cpp and test the tokenizer, to confirm it's all now working for me. I'll comment here soon to verify whether I can reproduce the fix with the original files I used when opening this. Stay tuned.

Small edit to clarify: specifically, I mean Qwen-72B-Chat for all of that. I can't remember off the top of my head whether the base Qwen-72B also treats these tokens as special (but I would expect so).


Noeda commented Jan 23, 2024

My llama.cpp commit for the tests below: 011e8ec577fd135cbc02993d3ea9840c516d6a1c

I can confirm that the fixes @compilade mentions are doing their job. I replicated the same steps as in my original post on this issue and created a qwen_72b_chat_v2_Q6_K.gguf from the HuggingFace Qwen-72B-Chat files, using the conversion script and then the quantize tool.

llm_load_print_meta: general.name     = Qwen
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: UNK token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llama_model_load: vocab only - skipping tensors
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
151644 -> ''
  8948 -> 'system'

I also did a quick and dirty test to confirm that the model itself is not broken.

~/llama.cpp/build/bin/main --model ./qwen_72b_chat_v2_Q6_K.gguf --prompt "<|im_start|>user\nplease repeat back: apple, orange, juice<|im_end|>"
...

sampling:
	repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


user\nplease repeat back: apple, orange, juice
="apple, orange, juice"

 [end of text]

llama_print_timings:        load time =    2651.02 ms
llama_print_timings:      sample time =       6.57 ms /    17 runs   (    0.39 ms per token,  2589.49 tokens per second)
llama_print_timings: prompt eval time =     427.73 ms /    13 tokens (   32.90 ms per token,    30.39 tokens per second)
llama_print_timings:        eval time =    1678.20 ms /    16 runs   (  104.89 ms per token,     9.53 tokens per second)
llama_print_timings:       total time =    2119.22 ms /    29 tokens
ggml_metal_free: deallocating
Log end

(The command line tool for generating text doesn't seem to print out <|im_start|> or <|im_end|> but maybe that's a separate issue or intended).

Thank you for the fixes!


compilade commented Jan 23, 2024

I originally opened this issue not so much because I had a big interest in getting it fixed for myself, but because I thought the community might want to be aware that Qwen models seem to have issues with llama.cpp, and to have a placeholder to discuss it (I guess that worked out 😄).

You did the right thing, apparently, since it was a real problem :)

user\nplease repeat back: apple, orange, juice

Just a quick note: you can use the -e flag so that a \n in the prompt (passed with -p or --prompt) is given as a newline to the model. Also, there's the --chatml flag which automatically starts an interactive chat using ChatML templates if you don't want to write them manually.
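
For example, something like this should work (untested sketch; same model file name as in your test above):

    ~/llama.cpp/build/bin/main --model ./qwen_72b_chat_v2_Q6_K.gguf -e --prompt "<|im_start|>user\nplease repeat back: apple, orange, juice<|im_end|>\n<|im_start|>assistant\n"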

(The command line tool for generating text doesn't seem to print out <|im_start|> or <|im_end|> but maybe that's a separate issue or intended).

It seems intended. The CONTROL tokens do not detokenize to their text. See the comment near the relevant code. It doesn't really explain why, though.

@github-actions github-actions bot added the stale label Mar 24, 2024

github-actions bot commented Apr 8, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
