Fix UnicodeDecodeError permanently #118
Conversation
According to this, they supposedly fixed this by changing the model converter in ggerganov/llama.cpp#79:

```python
>>> import llama_cpp
>>> lparams = llama_cpp.llama_context_default_params()
>>> ctx = llama_cpp.llama_init_from_file("./models/7B/ggml-model-q4_0.bin", lparams)
>>> def _tokenize(prompt, bos=True):
...     _arr = (llama_cpp.llama_token * (len(prompt) + 1))()
...     _n = llama_cpp.llama_tokenize(ctx, prompt.encode("utf8"), _arr, len(_arr), bos)
...     return _arr[:_n]
>>> _tokenize("😀", False)
llama_tokenize: too many tokens
[]
>>> def _tokenize(prompt, bos=True):
...     _arr = (llama_cpp.llama_token * (len(prompt) + 6))()
...     _n = llama_cpp.llama_tokenize(ctx, prompt.encode("utf8"), _arr, len(_arr), bos)
...     return _arr[:_n]
>>> _tokenize("😀", False)
[243, 162, 155, 131]
>>> [llama_cpp.llama_token_to_str(ctx, i) for i in [243, 162, 155, 131]]
[b'\xf0', b'\x9f', b'\x98', b'\x80']
>>> b"".join([llama_cpp.llama_token_to_str(ctx, i) for i in [243, 162, 155, 131]]).decode("utf8")
'😀'
```

(Also, yes, that's technically a bug in my main example; it'll get fixed when somebody submits an issue 😋.) The only real solution is to save invalid tokens until a valid output occurs (at most 4 tokens). According to this:

```python
>>> 243 & 240
240
```
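That bitmask trick works because the high bits of the first byte of a UTF-8 sequence encode its total length (`11110xxx` starts a 4-byte character such as 😀). A small, self-contained sketch of that classification — the helper name is illustrative, not part of this PR:

```python
def utf8_expected_length(first_byte: int) -> int:
    """Total bytes in the UTF-8 sequence that starts with `first_byte`:
    1 for ASCII, 2-4 for multibyte lead bytes, 0 for a continuation
    or invalid byte."""
    if first_byte & 0b10000000 == 0:            # 0xxxxxxx -> ASCII
        return 1
    if first_byte & 0b11100000 == 0b11000000:   # 110xxxxx -> 2-byte sequence
        return 2
    if first_byte & 0b11110000 == 0b11100000:   # 1110xxxx -> 3-byte sequence
        return 3
    if first_byte & 0b11111000 == 0b11110000:   # 11110xxx -> 4-byte sequence
        return 4
    return 0                                    # continuation byte or invalid

# 0xF0 (240) is the lead byte of the 4-byte sequence for '😀'
print(utf8_expected_length(0xF0))      # -> 4
print(utf8_expected_length(ord("a")))  # -> 1
```

Given the expected length, the generation loop knows exactly how many more tokens to buffer before the character can be decoded.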
I have tried to handle UTF-8 properly by detecting multibyte characters and waiting for their completion.
@SagsMug can we reduce the use of
I have removed a bunch of cases and added a test for this.
Because of errors from Llama.generate() and the low-level API, I use the code snippet below.
https://docs.python.org/3/library/codecs.html#error-handlers
This detects a multibyte UTF-8 character and doesn't return until it is complete.
If there otherwise weren't enough tokens, or the model somehow returns invalid bytes, we use `errors="ignore"` to strip the invalid characters.
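The codecs documentation linked above covers this case directly: the standard library's incremental UTF-8 decoder buffers incomplete multibyte sequences and, with `errors="ignore"`, silently drops invalid bytes. A minimal illustration of that behavior (standalone, not the PR's actual code):

```python
import codecs

# An incremental decoder holds back incomplete multibyte sequences
# instead of raising UnicodeDecodeError mid-character.
decoder = codecs.getincrementaldecoder("utf-8")(errors="ignore")

out = ""
for chunk in [b"\xf0", b"\x9f", b"\x98", b"\x80"]:  # '😀' arriving one byte at a time
    out += decoder.decode(chunk)  # returns "" until the character completes
out += decoder.decode(b"", final=True)  # flush; drops any dangling partial bytes

print(out)  # -> 😀
```

This is the same buffering-until-complete idea as the fix, expressed through the stdlib API rather than a hand-rolled byte counter.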
ASCII example:
But a decode or encode error will never be thrown.
Fixes:
#36
#57
#100
#116