
Fix UnicodeDecodeError permanently #118

Merged Apr 29, 2023 (9 commits)
Conversation

SagsMug (Contributor) commented Apr 26, 2023

https://docs.python.org/3/library/codecs.html#error-handlers

This detects a multi-byte UTF-8 character and doesn't return it while it's incomplete.
If there otherwise weren't enough tokens, or the model somehow returns invalid bytes, we use errors="ignore" to remove the invalid characters.
ASCII example:

'German ß, ♬'.encode(encoding='ascii', errors="ignore")
b'German , '

But a decode or encode error will never be thrown.
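A quick decode-side illustration of the same handler, using the UTF-8 bytes of 😀 (a truncated prefix raises UnicodeDecodeError under the default strict handler, but is silently dropped with errors="ignore"):

    b'\xf0\x9f\x98\x80'.decode("utf-8")
    '😀'
    b'\xf0\x9f'.decode("utf-8", errors="ignore")
    ''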

Fixes:
#36
#57
#100
#116

abetlen (Owner) commented Apr 26, 2023

Thanks @SagsMug, I'll take a look at this and weigh it against #55. I think this better preserves streaming functionality.

So is this bug just due to the fact that the Llama vocabulary includes tokens which are not valid UTF-8 strings?

SagsMug (Contributor, Author) commented Apr 26, 2023

> Thanks @SagsMug, I'll take a look at this and weigh it against #55. I think this better preserves streaming functionality.
>
> So is this bug just due to the fact that the Llama vocabulary includes tokens which are not valid UTF-8 strings?

According to this comment:
ggerganov/llama.cpp#73 (comment)
it does indeed stem from some characters needing multiple tokens.

They supposedly fixed this by changing the model converter in ggerganov/llama.cpp#79, but given that we still encounter this issue, maybe not.

> import llama_cpp
> lparams = llama_cpp.llama_context_default_params()
> ctx = llama_cpp.llama_init_from_file("./models/7B/ggml-model-q4_0.bin", lparams)
> def _tokenize(prompt, bos=True):
      _arr = (llama_cpp.llama_token * (len(prompt) + 1))()
      _n = llama_cpp.llama_tokenize(ctx, prompt.encode("utf8"), _arr, len(_arr), bos)
      return _arr[:_n]
> _tokenize("😀", False)
llama_tokenize: too many tokens
[]
> def _tokenize(prompt, bos=True):
      _arr = (llama_cpp.llama_token * (len(prompt) + 6))()
      _n = llama_cpp.llama_tokenize(ctx, prompt.encode("utf8"), _arr, len(_arr), bos)
      return _arr[:_n]
> _tokenize("😀", False)
[243, 162, 155, 131]
> [llama_cpp.llama_token_to_str(ctx, i) for i in [243, 162, 155, 131]]
[b'\xf0', b'\x9f', b'\x98', b'\x80']
> b"".join([llama_cpp.llama_token_to_str(ctx, i) for i in [243, 162, 155, 131]]).decode("utf8")
'😀'

(Also, yes, that's technically a bug in my main example; it'll get fixed when somebody submits an issue 😋)
This corresponds to what this site says: https://www.compart.com/en/unicode/U+1F600

The only real solution is to save invalid tokens until a valid output occurs (probably at most 4 tokens),
and to increase the _arr buffer 4 times over (for the worst case, if the max really is 4 tokens; I don't know that it is).
But this works alright until that happens.
(Unless you really need those characters, then it doesn't.)
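A minimal sketch of that save-until-valid idea, reusing llama_token_to_str from the transcript above (generated_tokens and emit are hypothetical stand-ins for the real generation loop and output path):

    # Sketch only: buffer raw token bytes until they form valid UTF-8.
    # generated_tokens and emit are hypothetical placeholders.
    buf = b""
    for token in generated_tokens:
        buf += llama_cpp.llama_token_to_str(ctx, token)
        try:
            text = buf.decode("utf-8")
        except UnicodeDecodeError:
            continue  # partial multi-byte character; wait for more tokens
        buf = b""
        emit(text)

A real implementation would also cap the buffer (at 4 bytes) so a genuinely invalid byte can't make it buffer forever.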

According to this:
https://en.wikipedia.org/wiki/UTF-8#Encoding
0xF0 is 11110000, which matches the lead-byte pattern 11110xxx.
Meaning we should be able to check whether a byte starts with the bits 110, 1110, or 11110 (decimal 192, 224, or 240), in which case it's the lead byte of a multi-byte sequence.
We can then tell how many bytes there should be from which of those cases matched,
and still need to save bytes until we have the full 2-, 3-, or 4-byte sequence.
So a bitwise AND against the decimals 192, 224, or 240 would detect it,
e.g.
if value & pattern == pattern
where 192 means a 2-byte, 224 a 3-byte, and 240 a 4-byte sequence (testing the widest pattern first, since e.g. 243 & 224 == 224 as well).

> 243 & 240
240
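As an illustrative sketch of that check (not the PR's actual code): mask each lead byte with the pattern's bits plus the next one, testing the widest pattern first so a 4-byte lead isn't misread as a 3-byte one:

    def expected_utf8_length(lead: int) -> int:
        # UTF-8 lead-byte patterns, widest first:
        # 11110xxx -> 4 bytes, 1110xxxx -> 3 bytes, 110xxxxx -> 2 bytes.
        if lead & 0b11111000 == 0b11110000:  # 240
            return 4
        if lead & 0b11110000 == 0b11100000:  # 224
            return 3
        if lead & 0b11100000 == 0b11000000:  # 192
            return 2
        return 1  # plain ASCII byte (or a 10xxxxxx continuation byte)

expected_utf8_length(243) returns 4, matching the 243 & 240 example above.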

SagsMug (Contributor, Author) commented Apr 28, 2023

> Thanks @SagsMug, I'll take a look at this and weigh it against #55. I think this better preserves streaming functionality.
>
> So is this bug just due to the fact that the Llama vocabulary includes tokens which are not valid UTF-8 strings?

I have tried to address UTF-8 properly by detecting multi-byte characters and waiting for their completion.
Please check that it's satisfactory.
I have also kept the errors="ignore" as a precaution.

abetlen (Owner) commented Apr 28, 2023

@SagsMug can we reduce the use of errors="ignore" to just the bare minimum needed to catch detokenization issues? I.e., we probably don't need it for prompts and such.

SagsMug (Contributor, Author) commented Apr 29, 2023

> @SagsMug can we reduce the use of errors="ignore" to just the bare minimum needed to catch detokenization issues? I.e., we probably don't need it for prompts and such.

I have removed a bunch of cases and added a test for this.
I'm pretty sure the rest are necessary, but please check.

mozzipa commented May 22, 2023

Because of errors from Llama.generate() and the low-level API, I use the code snippet below. How about integrating it?

    # Context: input_noecho, tokens, byte_list, gen_tokens and n_predict
    # come from the enclosing generation loop.
    if not input_noecho:
        for id in tokens:
            # A multi-byte UTF-8 character may be split across tokens,
            # so collect raw bytes until they decode cleanly.
            detoken = self.detokenize(tokens=[id])
            byte_list.append(detoken)
            gen_tokens += 1
            try:
                letter = b''.join(byte_list).decode("utf-8")
            except UnicodeDecodeError:
                # Incomplete multi-byte sequence; keep buffering.
                continue
            byte_list = []
            yield letter
            if gen_tokens > n_predict:
                gen_tokens = 0
                break
