🐛 [Bug]: New install - response keeps repeating the last line #1182

Open · 2 tasks done
DeadEnded opened this issue Mar 5, 2024 · 8 comments

Comments

@DeadEnded

Bug description

I just pulled the image and spun up a container with the default settings. I downloaded the Mistral-7B model and left everything at its defaults. I've tried a few short questions, and the answer repeats its last line until I stop the container.

Steps to reproduce

  1. Spin up a new container with default settings (from the repo)
  2. Download Mistral-7B
  3. Start a new chat and ask "what is the square root of nine"

Environment Information

Docker version: 25.0.3
OS: Ubuntu 22.04.4 LTS on kernel 5.15.0-97
CPU: AMD Ryzen 5 2400G
Browser: Firefox version 123.0

Screenshots

[screenshot]

Relevant log output

```
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  4165.37 MiB
...............................................................................................
llama_new_context_with_model: n_ctx      = 2153
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   269.13 MiB
llama_new_context_with_model: KV self size  =  269.12 MiB, K (f16):  134.56 MiB, V (f16):  134.56 MiB
llama_new_context_with_model:        CPU input buffer size   =    12.22 MiB
llama_new_context_with_model:        CPU compute buffer size =   174.42 MiB
llama_new_context_with_model: graph splits (measure): 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
Model metadata: {'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.architecture': 'llama', 'llama.rope.freq_base': '10000.000000', 'llama.context_length': '32768', 'general.name': 'mistralai_mistral-7b-v0.1', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '15'}
18:signal-handler (1709671894) Received SIGTERM scheduling shutdown...
Received termination signal!
++ _term
++ echo 'Received termination signal!'
++ kill -TERM 18
++ kill -TERM 19
18:signal-handler (1709671894) Received SIGTERM scheduling shutdown...
18:signal-handler (1709671894) Received SIGTERM scheduling shutdown...
```

Confirmations

  • I'm running the latest version of the main branch.
  • I checked existing issues to see if this has already been described.
@SolutionsKrezus

Hello, I have the same bug when using Mistral or Mixtral for text generation. It keeps repeating the last sentence over and over until I restart the container. I tried increasing the repeat penalty, but it does nothing.
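
For context, Serge's repeat-penalty setting presumably maps to llama-cpp-python's `repeat_penalty` sampling parameter, so the penalty can be exercised directly to rule out a UI plumbing problem. A minimal sketch, assuming a locally downloaded GGUF (the filename is a placeholder):

```python
# Minimal sketch: call llama-cpp-python directly to test repeat_penalty.
# The model filename is a placeholder; point it at your downloaded GGUF.
from llama_cpp import Llama

llm = Llama(model_path="mistral-7b-v0.1.Q4_K_M.gguf", n_ctx=2048)

out = llm(
    "What is the square root of nine?",
    max_tokens=64,
    repeat_penalty=1.3,  # > 1.0 penalizes recently generated tokens
    stop=["</s>"],
)
print(out["choices"][0]["text"])
```

If even aggressive values (1.5 or higher) change nothing here, the looping is probably not a sampling-parameter problem in the first place.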

@fishscene

fishscene commented Apr 15, 2024

I've noticed this for most, if not all, of the models I can test. This bug essentially makes Serge unusable.
Update: Reverting to ghcr.io/serge-chat/serge:0.8.2 appears to vastly improve, or eliminate altogether, the repeating issue. Still testing.

@gaby
Member

gaby commented Apr 16, 2024

This is probably a bug in llama-cpp-python. I will update it this week and do a new release.

Which specific model are you all using? @SolutionsKrezus @fishscene
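
A quick way to narrow this down would be to run the same GGUF through llama-cpp-python on its own, outside the Serge container. A rough sketch, with the model path again a placeholder:

```python
# Rough isolation test, assuming the same GGUF that Serge downloaded.
# If the answer loops here too, the regression is in llama-cpp-python /
# llama.cpp; if it only loops inside Serge, suspect Serge's generation loop.
from llama_cpp import Llama

llm = Llama(model_path="mistral-7b-v0.1.Q4_K_M.gguf", n_ctx=2048, verbose=False)

out = llm(
    "Q: What is the square root of nine?\nA:",
    max_tokens=256,
    stop=["</s>", "Q:"],
)
print(out["choices"][0]["text"])
```

Pinning different llama-cpp-python versions with pip (the one bundled in the 0.8.2 image versus the current one) would then bisect the regression.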

@SolutionsKrezus

I'm currently using Mistral 7B and Mixtral, @gaby.
I reverted to 0.8.0, and it works like a charm.

@fishscene

> This is probably a bug in llama-cpp-python. I will update it this week and do a new release.
>
> Which specific model are you all using? @SolutionsKrezus @fishscene

Apologies, I’m at work at the moment.
All the models I tested were affected to some degree, some more than others.

Off the top of my head: all current Mixtral models, at least two Mistral models, Neural Chat, one of the medical ones, and definitely a few more as well. I did not test anything above 13B, as those are beyond my hardware.

I would see random replies marked/flagged as code snippets… and if the model started repeating itself, that was the end of anything useful, as all subsequent replies would only repeat.

Of all the testing I did, getting 10 coherent replies was a major milestone, and even then it sometimes took multiple re-prompts (deleting my query and asking it slightly differently) to get to 10. A couple of models started spewing nonsense and repeats on the very first response.

All this to say: the issue should be very easy to reproduce.
When I reverted to the previous Serge release, I immediately saw improvement.

Curious, though: OP is using a Ryzen, and so am I (Ryzen 1700X, 32 GB RAM, no CUDA GPU in use; an NVIDIA T400, I think), running inference on the CPU.

Maybe this is isolated to Ryzen CPUs?

Another behavior to note:
When asking some censored models a question, they have no reply at all, and no detectable CPU usage either. It was as if some pre-inference step rejected my query and never passed it along to the model itself. There's a name for this pre-processing step, but it escapes me at the moment. Not sure if it is a clue either.

@SolutionsKrezus

I don't think it is a Ryzen-related issue, @fishscene.
I have the same problem with an Intel Xeon D-1540 with 32 GB RAM and no GPU.

@JuniperChris929

Same issue here. This pretty much renders the software completely useless :(

@gaby
Member

gaby commented Oct 10, 2024

Can you try ghcr.io/serge-chat/serge:main? Thanks!
