
Bugfix: Fix broken: UnicodeDecodeError: 'utf-8' codec can't decode #55

Closed
riverzhou wants to merge 2 commits from the river branch

Conversation

riverzhou

No description provided.

@CyberTimon

Doesn't this just remove the error? Emojis still don't work.

@riverzhou
Author

> Doesn't this just remove the error? Emojis still don't work.

The test passed with Chinese text; I didn't test emoji.

@pjq

pjq commented Apr 10, 2023

Great to see the fix for Chinese.

@jmtatsch

Very funny indeed.
My vicuna prefers to answer me in Chinese.
With this fix, at least it can do so without erroring out.

@MillionthOdin16
Contributor

MillionthOdin16 commented Apr 11, 2023

Was this resolved upstream? ggerganov/llama.cpp@aaf3b23

@abetlen
Owner

abetlen commented Apr 11, 2023

@MillionthOdin16 I don't think so, because I've had this issue on Linux as well. I believe the problem is that UTF-8 encoding is variable-length, and certain tokens are not valid UTF-8 on their own: they're returned as raw bytes which may contain partial UTF-8 code points.


I think this needs some tests to ensure we're properly keeping track of the number of returned bytes.
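
A minimal sketch of the failure mode (the token split is contrived; this is not the library's actual detokenization code): the bytes of one multi-byte character can arrive across two chunks, each of which is invalid UTF-8 on its own, and buffering bytes until they decode cleanly avoids the crash.

# Minimal sketch, not llama-cpp-python's actual code: a 4-byte emoji
# split across two chunks is invalid UTF-8 in either half.
emoji = "🦙".encode("utf-8")              # b'\xf0\x9f\xa6\x99'
chunk_a, chunk_b = emoji[:2], emoji[2:]   # contrived token boundary

try:
    chunk_a.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # the same "'utf-8' codec can't decode ..." error as in the title

# Buffering raw bytes until they form a complete code point avoids it:
buf = b""
for chunk in (chunk_a, chunk_b):
    buf += chunk
    try:
        print(buf.decode("utf-8"))        # succeeds once all 4 bytes arrive
        buf = b""
    except UnicodeDecodeError:
        continue                          # partial code point: wait for more bytes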

@Niek
Contributor

Niek commented Apr 12, 2023

Fixes #57

@abetlen
Owner

abetlen commented Apr 12, 2023

@Niek can you confirm that this fixes the bug and gives the same result in streaming vs. regular mode? For example, compare streaming and regular mode for a completion that breaks in streaming mode, with a fixed seed and temperature=0.
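
A sketch of that check (the model path and prompt are placeholders, and exact keyword arguments may vary by llama-cpp-python version):

# Hedged sketch of the requested streaming vs. regular comparison.
from llama_cpp import Llama

llm = Llama(model_path="/models/ggml-vicuna-7b-4bit.bin", seed=1337)
prompt = "show me 3 emojis"  # placeholder prompt that triggers multi-byte output

# Regular (non-streaming) completion.
regular = llm(prompt, max_tokens=64, temperature=0)["choices"][0]["text"]

# Streaming completion, concatenated chunk by chunk.
streamed = "".join(
    chunk["choices"][0]["text"]
    for chunk in llm(prompt, max_tokens=64, temperature=0, stream=True)
)

# With greedy sampling (temperature=0) the two should match exactly.
assert streamed == regular, (streamed, regular)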

@Niek
Contributor

Niek commented Apr 12, 2023

I just tested, for my own reference:

docker run --rm -it -v /path/to/models:/models -p8000:8000 python:3-buster bash
git clone -b river https://github.com/riverzhou/llama-cpp-python.git /app
cd /app
sed -i -e 's/git@github.com:/https:\/\/github.com\//' -e 's/.git$//' .gitmodules
git submodule update --init --recursive
python -m pip install --upgrade pip pytest cmake scikit-build setuptools fastapi sse_starlette uvicorn
python setup.py develop
HOST=0.0.0.0 MODEL=/models/ggml-vicuna-7b-4bit.bin python3 -m llama_cpp.server

With a prompt like show me 3 emojis, I no longer get an error, but it seems the message returned is empty instead. So it doesn't look like a complete fix.
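
For context on the empty result, a small sketch: no proper prefix of a 4-byte UTF-8 sequence (such as an emoji) is itself valid UTF-8, so a fallback that only retries shorter byte slices never finds a decodable candidate and can end up yielding an empty string.

# Every proper prefix of a 4-byte UTF-8 emoji fails to decode, which is
# consistent with the fallback producing an empty message for emoji.
data = "😀".encode("utf-8")               # b'\xf0\x9f\x98\x80'
for n in range(1, 4):
    try:
        data[:n].decode("utf-8")
    except UnicodeDecodeError:
        print(f"{n}-byte prefix: still invalid UTF-8")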

@abetlen
Owner

abetlen commented Apr 12, 2023

@Niek Can you try changing for i in range(1,4): to for i in reversed(range(1,4)): so that we decode the longest possible sequence first? Not sure this would fix it either, but worth a try.
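
A sketch of the idea (the slice bounds are illustrative, not the library's exact code): trying the largest i first lets a complete multi-byte character decode on the first attempt instead of only after shorter candidates fail.

# Illustrative only -- trying i = 3 first lets a full 3-byte
# character decode immediately; the longest valid candidate wins.
buf = "中".encode("utf-8")                # 3 bytes: b'\xe4\xb8\xad'
start = 0

decoded = ""
for i in reversed(range(1, 4)):           # i = 3, 2, 1: longest first
    try:
        decoded = buf[start:start + i].decode("utf-8")
        break
    except UnicodeDecodeError:
        continue

print(decoded)                            # -> 中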

@wujb123

wujb123 commented Apr 26, 2023

I changed the code a little bit, and it works:

_text = ""
try:
    _text = text[start:].decode("utf-8")
except UnicodeDecodeError:
    for i in range(-2, 2):  # changed to (-2, 2)
        try:
            _text = text[start + i:].decode("utf-8")  # changed to [start + i:]
            break
        except UnicodeDecodeError:
            continue
yield {
    "id": completion_id,
    "object": "text_completion",
    "created": created,
    "model": self.model_path,
    "choices": [
        {
            "text": _text,
            "index": 0,
            "logprobs": None,
            "finish_reason": None,
        }
    ],
}
@abetlen
Owner

abetlen commented May 5, 2023

@riverzhou can you check if this bug still occurs now that #118 has been merged?

@gjmulder
Contributor

@riverzhou update?

@gjmulder added the bug label May 23, 2023
@riverzhou
Author

> @riverzhou can you check if this bug still occurs now that #118 has been merged?

Fine. It's OK now. Thanks.

@riverzhou riverzhou closed this Jun 27, 2023
@riverzhou riverzhou deleted the river branch July 8, 2023 15:14