num_beams > 1 sometimes breaks inference #11

On longer prompts/inputs I am encountering the following error when num_beams is set to 2. Shorter prompts appear to have no issue with multiple beams.

Comments
Thank you for the bug report. I'll take a look to see what's going on.
I'm having difficulty replicating. Would you mind sharing your prompt length and generate call? Here's my attempt:
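Roughly the following, with placeholder paths and prompt (only num_beams=2 and temperature=0.7 are confirmed later in the thread; the rest is a sketch of a standard transformers generate call):

```python
import torch
from transformers import AutoTokenizer, LlamaForCausalLM

# Placeholder checkpoint path; substitute your own.
model = LlamaForCausalLM.from_pretrained(
    "path/to/llama-7b", torch_dtype=torch.float16
).cuda()
tokenizer = AutoTokenizer.from_pretrained("path/to/llama-7b")

prompt = "Write a story about a wizard."  # placeholder prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

# num_beams=2 and temperature=0.7 are the settings under discussion;
# the remaining arguments are illustrative.
output = model.generate(
    input_ids,
    num_beams=2,
    do_sample=True,
    temperature=0.7,
    max_new_tokens=128,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```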
I ran it a few times with a couple of different prompt lengths but never got it to error out. It's possible the issue only crops up on the 30B model, which I haven't tried yet.
I will isolate the setting/param that is causing this on my end and use your gen code above as a baseline.
Findings from my iterations so far, where my generate call is exactly the same as yours (num_beams=2, temp=0.7, etc.; 7 params in total).
Here are the differing results so far. Errors when:
The above two errors occur at the following input/output sizes:
No errors, normal results, at the following sizes:
My tokenizer uses the fast implementation, but I tested non-fast as well and nothing changes (see the sketch after this comment):
The above all use your latest HEAD, requantized to 30B 4-bit sequential with groupsize=128. Edit: I am still testing to weed out all other possibilities; the above is just what I have found so far.
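A sketch of the fast/non-fast tokenizer toggle mentioned above (checkpoint path is a placeholder):

```python
from transformers import AutoTokenizer

# Placeholder path. use_fast=True selects the Rust tokenizer,
# use_fast=False the Python/SentencePiece one; per the report,
# the error is the same either way.
tok_fast = AutoTokenizer.from_pretrained("path/to/llama-30b", use_fast=True)
tok_slow = AutoTokenizer.from_pretrained("path/to/llama-30b", use_fast=False)
```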
I have isolated the following as the source of the issue:
Edit: Added a transformers HEAD + stable test.
The qwop version has a very similar batch bug that was supposedly fixed with qwopqwop200/GPTQ-for-LLaMa@d1c6d72. I have not tested it yet, but since that Triton codebase had its origin here, perhaps the bug fix can be ported over.
Thank you for the details; I've got it reproducing now. It triggers when the generated token length is >= 256, so my 128-token test didn't hit it. It also triggers even on the 7B model, which makes testing easier. I'll start digging and see what's going on.
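A minimal repro sketch based on that threshold (same placeholder setup as the earlier snippet):

```python
import torch
from transformers import AutoTokenizer, LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(
    "path/to/llama-7b", torch_dtype=torch.float16
).cuda()
tokenizer = AutoTokenizer.from_pretrained("path/to/llama-7b")
input_ids = tokenizer("Once upon a time", return_tensors="pt").input_ids.cuda()

# Requesting >= 256 new tokens with num_beams=2 reproduces the failure;
# stopping at 128 new tokens (the earlier test length) does not.
model.generate(input_ids, num_beams=2, max_new_tokens=256)
```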
Occurs even with an FP16 Hugging Face LLaMA, so this appears to be an issue with the underlying transformers implementation rather than anything in this repo.
I've opened a bug report upstream. Closing this issue as it isn't related specifically to GPTQ-triton.