
Fix UnicodeDecodeError permanently #118

Merged Apr 29, 2023 (9 commits)
Conversation

SagsMug (Contributor) commented Apr 26, 2023

https://docs.python.org/3/library/codecs.html#error-handlers

This detects a multi-byte UTF-8 character and doesn't return it while it's incomplete.
If there otherwise weren't enough tokens, or the model somehow returns invalid bytes, we use errors="ignore" to remove the invalid characters.
ASCII example:

'German ß, ♬'.encode(encoding='ascii', errors="ignore")
b'German , '

But a decode or encode error will never be thrown.
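A quick decode-side illustration of the same handler, using the UTF-8 bytes of 😀 (a truncated prefix raises UnicodeDecodeError under the default strict handler, but is silently dropped with errors="ignore"):

    b'\xf0\x9f\x98\x80'.decode("utf-8")
    '😀'
    b'\xf0\x9f'.decode("utf-8", errors="ignore")
    ''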

Fixes:
#36
#57
#100
#116

abetlen (Owner) commented Apr 26, 2023

Thanks @SagsMug, I'll take a look at this and weigh it against #55. I think this better preserves streaming functionality.

So is this bug just due to the fact that the Llama vocabulary includes tokens which are not valid UTF-8 strings?

SagsMug (Contributor, Author) commented Apr 26, 2023

> Thanks @SagsMug, I'll take a look at this and weigh it against #55. I think this better preserves streaming functionality.
>
> So is this bug just due to the fact that the Llama vocabulary includes tokens which are not valid UTF-8 strings?

According to this comment:
ggerganov/llama.cpp#73 (comment)
it does indeed stem from some characters needing multiple tokens.

They supposedly fixed this by changing the model converter in ggerganov/llama.cpp#79, but given that we still encounter this issue, maybe not.

> import llama_cpp
> lparams = llama_cpp.llama_context_default_params()
> ctx = llama_cpp.llama_init_from_file("./models/7B/ggml-model-q4_0.bin", lparams)
> def _tokenize(prompt, bos=True):
      _arr = (llama_cpp.llama_token * (len(prompt) + 1))()
      _n = llama_cpp.llama_tokenize(ctx, prompt.encode("utf8"), _arr, len(_arr), bos)
      return _arr[:_n]
> _tokenize("😀", False)
llama_tokenize: too many tokens
[]
> def _tokenize(prompt, bos=True):
      _arr = (llama_cpp.llama_token * (len(prompt) + 6))()
      _n = llama_cpp.llama_tokenize(ctx, prompt.encode("utf8"), _arr, len(_arr), bos)
      return _arr[:_n]
> _tokenize("😀", False)
[243, 162, 155, 131]
> [llama_cpp.llama_token_to_str(ctx, i) for i in [243, 162, 155, 131]]
[b'\xf0', b'\x9f', b'\x98', b'\x80']
> b"".join([llama_cpp.llama_token_to_str(ctx, i) for i in [243, 162, 155, 131]]).decode("utf8")
'😀'

(Also, yes, that's technically a bug in my main example; it'll get fixed when somebody submits an issue 😋)
This corresponds to what this site says: https://www.compart.com/en/unicode/U+1F600

The only real solution is to save invalid tokens until a valid output occurs (probably at most 4 tokens),
and to increase the _arr buffer 4 times over (for the worst case, if the max really is 4 tokens; I don't know that it is).
But this works alright until that happens.
(Unless you really need those characters, then it doesn't.)
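A minimal sketch of that save-until-valid idea, reusing llama_token_to_str from the transcript above (generated_tokens and emit are hypothetical stand-ins for the real generation loop and output path):

    # Sketch only: buffer raw token bytes until they form valid UTF-8.
    # generated_tokens and emit are hypothetical placeholders.
    buf = b""
    for token in generated_tokens:
        buf += llama_cpp.llama_token_to_str(ctx, token)
        try:
            text = buf.decode("utf-8")
        except UnicodeDecodeError:
            continue  # partial multi-byte character; wait for more tokens
        buf = b""
        emit(text)

A real implementation would also cap the buffer (at 4 bytes) so a genuinely invalid byte can't make it buffer forever.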

According to this:
https://en.wikipedia.org/wiki/UTF-8#Encoding
0xF0 is 11110000, which matches the lead-byte pattern 11110xxx.
Meaning we should be able to check whether a byte starts with the bits 110, 1110, or 11110 (decimal 192, 224, or 240), in which case it's the lead byte of a multi-byte sequence.
We can then tell how many bytes there should be from which of those cases matched,
and still need to save bytes until we have the full 2-, 3-, or 4-byte sequence.
So a bitwise AND against the decimals 192, 224, or 240 would detect it,
e.g.
if value & pattern == pattern
where 192 means a 2-byte, 224 a 3-byte, and 240 a 4-byte sequence (testing the widest pattern first, since e.g. 243 & 224 == 224 as well).

> 243 & 240
240
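As an illustrative sketch of that check (not the PR's actual code): mask each lead byte with the pattern's bits plus the next one, testing the widest pattern first so a 4-byte lead isn't misread as a 3-byte one:

    def expected_utf8_length(lead: int) -> int:
        # UTF-8 lead-byte patterns, widest first:
        # 11110xxx -> 4 bytes, 1110xxxx -> 3 bytes, 110xxxxx -> 2 bytes.
        if lead & 0b11111000 == 0b11110000:  # 240
            return 4
        if lead & 0b11110000 == 0b11100000:  # 224
            return 3
        if lead & 0b11100000 == 0b11000000:  # 192
            return 2
        return 1  # plain ASCII byte (or a 10xxxxxx continuation byte)

expected_utf8_length(243) returns 4, matching the 243 & 240 example above.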

SagsMug (Contributor, Author) commented Apr 28, 2023

> Thanks @SagsMug, I'll take a look at this and weigh it against #55. I think this better preserves streaming functionality.
>
> So is this bug just due to the fact that the Llama vocabulary includes tokens which are not valid UTF-8 strings?

I have tried to address UTF-8 properly by detecting multi-byte characters and waiting for their completion.
Please check that it's satisfactory.
I have also kept the errors="ignore" as a precaution.

abetlen (Owner) commented Apr 28, 2023

@SagsMug can we reduce the use of errors="ignore" to just the bare minimum needed to catch detokenization issues? I.e., we probably don't need it for prompts and such.

SagsMug (Contributor, Author) commented Apr 29, 2023

> @SagsMug can we reduce the use of errors="ignore" to just the bare minimum needed to catch detokenization issues? I.e., we probably don't need it for prompts and such.

I have removed a bunch of cases and added a test for this.
I'm pretty sure the rest are necessary, but please check.

mozzipa commented May 22, 2023

Because of errors from Llama.generate() and the low-level API, I use the code snippet below. How about integrating it?

    # Context: input_noecho, tokens, byte_list, gen_tokens and n_predict
    # come from the enclosing generation loop.
    if not input_noecho:
        for id in tokens:
            # A multi-byte UTF-8 character may be split across tokens,
            # so collect raw bytes until they decode cleanly.
            detoken = self.detokenize(tokens=[id])
            byte_list.append(detoken)
            gen_tokens += 1
            try:
                letter = b''.join(byte_list).decode("utf-8")
            except UnicodeDecodeError:
                # Incomplete multi-byte sequence; keep buffering.
                continue
            byte_list = []
            yield letter
            if gen_tokens > n_predict:
                gen_tokens = 0
                break
