
num_beams > 1 sometimes breaks inference #11

Closed

Qubitium opened this issue Apr 17, 2023 · 9 comments

Comments

Qubitium commented Apr 17, 2023

env:
transformers [fpga PR performance-fix branch]
pytorch 2.0.0+cu118
Nvidia 4090
Model: 30B 4bit act-order sequential (quantized using GPTQ-triton script)

num_beams = 2
length_penalty  = 1.3

On longer prompts/inputs I am encountering the following error when num_beams is set to 2. Shorter prompts appear to have no issue with multiple beams.

  File "/root/test.py", line 223, in process
    gen_output = model.generate(
  File "/root/miniconda3/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/transformers/generation/utils.py", line 1585, in generate
    return self.beam_sample(
  File "/root/miniconda3/lib/python3.9/site-packages/transformers/generation/utils.py", line 3210, in beam_sample
    next_tokens = torch.multinomial(probs, num_samples=2 * num_beams)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
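
For context on the error itself: torch.multinomial rejects any probability tensor containing NaN/Inf or negative entries. A minimal standalone illustration of the same RuntimeError, using synthetic probabilities rather than the model's actual beam-sample scores:

import torch

# Synthetic probabilities with a NaN entry, standing in for sampling
# scores that have overflowed/underflowed during beam sampling.
probs = torch.tensor([[0.5, float('nan'), 0.25, 0.25]])

# Raises: RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
next_tokens = torch.multinomial(probs, num_samples=4)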
@fpgaminer (Owner)

Thank you for the bug report. I'll take a look to see what's going on.

@fpgaminer (Owner)

I'm having difficulty replicating. Would you mind sharing your prompt length and generate call? Here's my attempt:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
from gptq_triton import load_quant
from transformers import AutoTokenizer, LlamaForCausalLM
import random

model_path = 'weights/llama-7b-triton-4bit-c4-group-1-act-seq/'
model = load_quant(model_path, warmup_autotune=False)
model.eval()
model.to('cuda')
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

target_prompt_length = 2048
prompt = ''

while True:
	prompt = prompt + ''.join(random.choice('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,;:!?\n') for _ in range(2048 * 10))
	# Encode and crop down
	encoded_prompt = tokenizer.encode(prompt, add_special_tokens=False, return_tensors='pt')
	if encoded_prompt.shape[1] > target_prompt_length:
		encoded_prompt = encoded_prompt[:, :target_prompt_length]
		encoded_prompt = encoded_prompt.to('cuda')
		break

output_sequences = model.generate(
	input_ids=encoded_prompt,
	max_length=128 + len(encoded_prompt[0]),
	temperature=0.7,
	num_return_sequences=1,
	num_beams=2,
	length_penalty=1.3,
	do_sample=True,
)

I ran it a few times with a couple of different prompt lengths but never got it to error out. It's possible the issue only crops up on the 30B model, which I haven't tried yet.
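
If it helps narrow things down, a quick sanity check I would try on your side (sketch, reusing model and encoded_prompt from the script above) is to look for NaN/Inf in the raw logits of a single forward pass on one of the long prompts, before any sampling happens:

import torch

with torch.no_grad():
	logits = model(encoded_prompt).logits  # (batch, seq_len, vocab)

# If either of these prints True, the forward pass itself is producing bad
# values; if both are False, the problem is downstream in sampling.
print('NaN in logits:', torch.isnan(logits).any().item())
print('Inf in logits:', torch.isinf(logits).any().item())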

@Qubitium (Author)

I will isolate the setting/param that is causing this on my end, using your gen code above as a baseline.

Qubitium (Author) commented Apr 20, 2023

Findings from my iterations so far, where my generate call is exactly the same as yours (num_beams=2, temperature=0.7, etc., 7 params total):

output_sequences = model.generate(
	input_ids=encoded_prompt,
	max_length=128 + len(encoded_prompt[0]),
	temperature=0.7,
	num_return_sequences=1,
	num_beams=2,
	length_penalty=1.3,
	do_sample=True,
)

Here are the differing results so far:

Errors when:

  1. With max_length=512 set, it crashes constantly like before: RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
  2. With max_length unset and max_new_tokens=512 instead, it does not crash, but over 75% of the output is garbled junk (both call variants are sketched below).

The above 2 errors occur at the following input/output sizes:

Prompt tokens size: 156
Output tokens size (including prompt prefix):  327
New tokens: 171

No errors, normal result at the following sizes:

Prompt tokens size: 76
Output tokens size (including prompt prefix):  98
New tokens: 22
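
For reference, the two call variants compared in items 1 and 2 above (sketch; only the length argument differs, everything else matches the shared generate call):

# Variant 1: max_length caps prompt + new tokens -> crashes constantly
output_sequences = model.generate(
	input_ids=encoded_prompt,
	max_length=512,
	temperature=0.7,
	num_return_sequences=1,
	num_beams=2,
	length_penalty=1.3,
	do_sample=True,
)

# Variant 2: max_new_tokens caps only the new tokens -> no crash, but garbled output
output_sequences = model.generate(
	input_ids=encoded_prompt,
	max_new_tokens=512,
	temperature=0.7,
	num_return_sequences=1,
	num_beams=2,
	length_penalty=1.3,
	do_sample=True,
)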

My tokenizer uses the fast implementation, but I also tested non-fast and nothing changes:

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, padding_side="left", add_special_tokens=False, use_fast=True)

The above uses your latest head, requantized to 30B 4-bit sequential with groupsize=128.

Edit: I am still testing to weed out all other possibilities; the above is just what I have found so far.

Qubitium (Author) commented Apr 20, 2023

I have ruled out the following as the source of the issue:

  1. Tested PyTorch 2.1 nightly and 2.0 stable
  2. Tested OpenAI/Triton 2.0.0 and 2.0.0.post1 (nightly)
  3. Tested Transformers 4.29.0.dev (head as of this post) and 4.28.1 (release)

Edit: Added transformers head + stable test.
Edit 2: I have stopped testing, as I have run out of env/package differences that I think could affect the runtime.

@Qubitium (Author)

The qwop version has a very similar batch bug that was supposedly fixed with qwopqwop200/GPTQ-for-LLaMa@d1c6d72. I have not tested it yet, but since that triton codebase had its origin here, perhaps the bug fix can be ported over.


@fpgaminer (Owner)

Thank you for the details, I've got it to reproduce now. It triggers when the generated token length is >= 256, so my 128-token test didn't trigger it. It also triggers even on the 7B model, which makes testing easier.

I'll start digging and see what's going on.

@fpgaminer (Owner)

It occurs even with an FP16 Hugging Face LLaMA, so this appears to be an issue with the transformers library, not GPTQ-triton. Still, I'm curious enough to keep digging.
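
For reference, by "FP16 Hugging Face LLaMA" I mean loading the stock model in place of the quantized loader and then running the same generate call as in my earlier script. Roughly (sketch; the checkpoint path is illustrative):

import torch
from transformers import AutoTokenizer, LlamaForCausalLM

model_path = 'weights/llama-7b-hf/'  # illustrative path to a plain HF LLaMA checkpoint
model = LlamaForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)
model.eval()
model.to('cuda')
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

# ...then the same beam-sample generate call as before; the same
# `inf`/`nan` RuntimeError appears with no GPTQ-triton code involved.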

@fpgaminer (Owner)

I've opened a bug report at transformers with details of my analysis thus far: huggingface/transformers#22914

Closing this issue as it isn't related specifically to GPTQ-triton.
