
num_beams > 1 sometimes breaks inference #11

Closed

Qubitium opened this issue Apr 17, 2023 · 9 comments

Comments

Qubitium commented Apr 17, 2023

env:
transformers [fpga PR performance-fix branch]
pytorch 2.0.0+cu118
Nvidia 4090
Model: 30B 4bit act-order sequential (quantized using GPTQ-triton script)

num_beams = 2
length_penalty  = 1.3

On longer prompts/inputs I am encountering the following error when num_beams is set to 2. Shorter prompts appear to have no issue with multiple beams.

  File "/root/test.py", line 223, in process
    gen_output = model.generate(
  File "/root/miniconda3/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/transformers/generation/utils.py", line 1585, in generate
    return self.beam_sample(
  File "/root/miniconda3/lib/python3.9/site-packages/transformers/generation/utils.py", line 3210, in beam_sample
    next_tokens = torch.multinomial(probs, num_samples=2 * num_beams)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
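
For context on the error itself: torch.multinomial rejects any probability tensor containing NaN/Inf or negative entries. A minimal standalone illustration of the same RuntimeError, using synthetic probabilities rather than the model's actual beam-sample scores:

import torch

# Synthetic probabilities with a NaN entry, standing in for sampling
# scores that have overflowed/underflowed during beam sampling.
probs = torch.tensor([[0.5, float('nan'), 0.25, 0.25]])

# Raises: RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
next_tokens = torch.multinomial(probs, num_samples=4)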
@fpgaminer (Owner)

Thank you for the bug report. I'll take a look to see what's going on.

@fpgaminer (Owner)

I'm having difficulty replicating. Would you mind sharing your prompt length and generate call? Here's my attempt:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
from gptq_triton import load_quant
from transformers import AutoTokenizer, LlamaForCausalLM
import random

model_path = 'weights/llama-7b-triton-4bit-c4-group-1-act-seq/'
model = load_quant(model_path, warmup_autotune=False)
model.eval()
model.to('cuda')
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

target_prompt_length = 2048
prompt = ''

while True:
	prompt = prompt + ''.join(random.choice('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,;:!?\n') for _ in range(2048 * 10))
	# Encode and crop down
	encoded_prompt = tokenizer.encode(prompt, add_special_tokens=False, return_tensors='pt')
	if encoded_prompt.shape[1] > target_prompt_length:
		encoded_prompt = encoded_prompt[:, :target_prompt_length]
		encoded_prompt = encoded_prompt.to('cuda')
		break

output_sequences = model.generate(
	input_ids=encoded_prompt,
	max_length=128 + len(encoded_prompt[0]),
	temperature=0.7,
	num_return_sequences=1,
	num_beams=2,
	length_penalty=1.3,
	do_sample=True,
)

I ran it a few times with a couple of different prompt lengths but never got it to error out. It's possible the issue only crops up on the 30B model, which I haven't tried yet.
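
If it helps narrow things down, a quick sanity check I would try on your side (sketch, reusing model and encoded_prompt from the script above) is to look for NaN/Inf in the raw logits of a single forward pass on one of the long prompts, before any sampling happens:

import torch

with torch.no_grad():
	logits = model(encoded_prompt).logits  # (batch, seq_len, vocab)

# If either of these prints True, the forward pass itself is producing bad
# values; if both are False, the problem is downstream in sampling.
print('NaN in logits:', torch.isnan(logits).any().item())
print('Inf in logits:', torch.isinf(logits).any().item())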

@Qubitium (Author)

I will isolate the setting/param that is causing this on my end, using your gen code above as a baseline.

Qubitium (Author) commented Apr 20, 2023

Findings from my iterations so far, where my generate call is exactly the same as yours (num_beams=2, temperature=0.7, etc., 7 params total):

output_sequences = model.generate(
	input_ids=encoded_prompt,
	max_length=128 + len(encoded_prompt[0]),
	temperature=0.7,
	num_return_sequences=1,
	num_beams=2,
	length_penalty=1.3,
	do_sample=True,
)

Here are the differing results so far:

Errors when:

  1. With max_length=512 set, it crashes constantly like before: RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
  2. With max_length unset and max_new_tokens=512 instead, it does not crash, but over 75% of the output is garbled junk (both call variants are sketched below).

The above 2 errors occur at the following input/output sizes:

Prompt tokens size: 156
Output tokens size (including prompt prefix):  327
New tokens: 171

No errors, normal result at the following sizes:

Prompt tokens size: 76
Output tokens size (including prompt prefix):  98
New tokens: 22
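
For reference, the two call variants compared in items 1 and 2 above (sketch; only the length argument differs, everything else matches the shared generate call):

# Variant 1: max_length caps prompt + new tokens -> crashes constantly
output_sequences = model.generate(
	input_ids=encoded_prompt,
	max_length=512,
	temperature=0.7,
	num_return_sequences=1,
	num_beams=2,
	length_penalty=1.3,
	do_sample=True,
)

# Variant 2: max_new_tokens caps only the new tokens -> no crash, but garbled output
output_sequences = model.generate(
	input_ids=encoded_prompt,
	max_new_tokens=512,
	temperature=0.7,
	num_return_sequences=1,
	num_beams=2,
	length_penalty=1.3,
	do_sample=True,
)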

My tokenizer uses the fast implementation, but I also tested non-fast and nothing changes:

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, padding_side="left", add_special_tokens=False, use_fast=True)

The above uses your latest head, requantized to 30B 4-bit sequential with groupsize=128.

Edit: I am still testing to weed out all other possibilities; the above is just what I have found so far.

Qubitium (Author) commented Apr 20, 2023

I have ruled out the following as the source of the issue:

  1. Tested PyTorch 2.1 nightly and 2.0 stable
  2. Tested OpenAI/Triton 2.0.0 and 2.0.0.post1 (nightly)
  3. Tested Transformers 4.29.0.dev (head as of this post) and 4.28.1 (release)

Edit: Added transformers head + stable test.
Edit 2: I have stopped testing, as I have run out of env/package differences that I think could affect the runtime.

@Qubitium (Author)

The qwop version has a very similar batch bug that was supposedly fixed with qwopqwop200/GPTQ-for-LLaMa@d1c6d72. I have not tested it yet, but since that triton codebase had its origin here, perhaps the bug fix can be ported over.


@fpgaminer (Owner)

Thank you for the details, I've got it to reproduce now. It triggers when the generated token length is >= 256, so my 128-token test didn't trigger it. It also triggers even on the 7B model, which makes testing easier.

I'll start digging and see what's going on.

@fpgaminer (Owner)

It occurs even with an FP16 Hugging Face LLaMA, so this appears to be an issue with the transformers library, not GPTQ-triton. Still, I'm curious enough to keep digging.
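
For reference, by "FP16 Hugging Face LLaMA" I mean loading the stock model in place of the quantized loader and then running the same generate call as in my earlier script. Roughly (sketch; the checkpoint path is illustrative):

import torch
from transformers import AutoTokenizer, LlamaForCausalLM

model_path = 'weights/llama-7b-hf/'  # illustrative path to a plain HF LLaMA checkpoint
model = LlamaForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)
model.eval()
model.to('cuda')
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

# ...then the same beam-sample generate call as before; the same
# `inf`/`nan` RuntimeError appears with no GPTQ-triton code involved.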

@fpgaminer (Owner)

I've opened a bug report at transformers with details of my analysis thus far: huggingface/transformers#22914

Closing this issue as it isn't related specifically to GPTQ-triton.
