CUDA Error When Running Batch Inference with OpenLLama Model #93

Open

jcrangel opened this issue Oct 17, 2023 · 1 comment
I'm attempting to evaluate an OpenLlama model on a test dataset. Single-element inference is considerably slow, so I'm trying to use batching for efficiency. However, during batch inference I'm encountering a CUDA error.
Error Message

../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [277,0,0], thread: [125,0,0] Assertion 'srcIndex < srcSelectDimSize' failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [277,0,0], thread: [126,0,0] Assertion 'srcIndex < srcSelectDimSize' failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [277,0,0], thread: [127,0,0] Assertion 'srcIndex < srcSelectDimSize' failed.
...
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with 'TORCH_USE_CUDA_DSA' to enable device-side assertions.
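As the message says, the assertion is reported asynchronously, so the Python stack trace may not point at the real failing call. A minimal sketch of how to get a synchronous trace (my assumption: the script is restarted, since the variable has to be set before the first CUDA call):

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # force synchronous kernel launches; must be set before any CUDA work

import torch  # import torch only after the variable is set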

Code for Batch Inference

import torch
from tqdm import tqdm

def make_batch_inference(dataset, batch_size=8):
    all_out = []

    progress_bar = tqdm(range(0, len(dataset), batch_size), desc="Inferencing")

    for start_idx in progress_bar:
        end_idx = start_idx + batch_size
        batch_questions = dataset['question'][start_idx:end_idx]

        # Tokenize the batch with padding/truncation so all sequences share one length
        batch = tokenizer(batch_questions, return_tensors='pt', padding=True, truncation=True, max_length=512)

        with torch.cuda.amp.autocast():
            output_tokens = model.generate(
                input_ids=batch["input_ids"].to("cuda:0"), max_new_tokens=2048
            )

        # Decode each generated sequence and extract the first SPARQL query
        batch_out = [extract_first_sparql(tokenizer.decode(tokens, skip_special_tokens=True)) for tokens in output_tokens]

        all_out.extend(batch_out)

    return all_out
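One side note on batch generation with a decoder-only model (this is an assumption about the setup, not something shown in the issue): left padding is usually required, otherwise shorter sequences generate from pad positions, and passing the attention mask tells generate() to ignore padding. A minimal sketch:

# Assumed tweak: pad on the left for decoder-only generation
tokenizer.padding_side = "left"
batch = tokenizer(batch_questions, return_tensors='pt', padding=True,
                  truncation=True, max_length=512)
output_tokens = model.generate(
    input_ids=batch["input_ids"].to("cuda:0"),
    attention_mask=batch["attention_mask"].to("cuda:0"),  # let generate() skip the pad positions
    max_new_tokens=2048,
)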

Loading Data and Dataset

# Assuming `Dataset` is the Hugging Face datasets class
from datasets import Dataset

test_data = load_data_from_file("data/kqapro_lcquad_test.json")
test_dataset = Dataset.from_dict(test_data)

results = make_batch_inference(test_dataset)

Additional Information

The model is originally "openlm-research/open_llama_7b_v2", but I fine-tuned it using PEFT, so I load it like this:

import os
import torch
from peft import AutoPeftModelForCausalLM
from transformers import LlamaTokenizer

device_map = {"": 0}
model = AutoPeftModelForCausalLM.from_pretrained(os.path.join(
    output_dir, 'saved_model'), device_map=device_map, torch_dtype=torch.bfloat16)
tokenizer = LlamaTokenizer.from_pretrained(model_id)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
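A note on the likely root cause (my assumption, not confirmed in the issue): adding '[PAD]' as a new special token gives it id 32000, which is outside the base LLaMA vocabulary of 32,000 rows, so the embedding lookup can trigger exactly this `srcIndex < srcSelectDimSize` assertion unless the embedding matrix is resized. A sketch of the two usual options (for a PEFT-wrapped model, the resize may need to happen on the base model before the adapter is attached):

# Option 1 (assumption): reuse the existing EOS token as the pad token,
# so no out-of-vocabulary id is ever produced
tokenizer.pad_token = tokenizer.eos_token

# Option 2 (assumption): keep the new '[PAD]' token but grow the embedding
# matrix so id 32000 becomes a valid row
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))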

Any assistance on this issue would be greatly appreciated. Thank you in advance!

jcrangel (Author) commented Nov 3, 2023

It was because I had some out-of-range token indices; I had to remove them:

def truncate_batch(batch, max_length=None):
    """
    Remove the padding columns (token id 32000), which cause
    Assertion `srcIndex < srcSelectDimSize` failed.
    """
    # Real (non-padding) length of each sequence in the batch
    lengths = batch['attention_mask'].sum(dim=1)

    # If max_length is not provided, truncate to the shortest sequence in the batch
    if not max_length:
        max_length = lengths.min().item()

    # Slice the tensors
    batch['input_ids'] = batch['input_ids'][:, :max_length]
    batch['attention_mask'] = batch['attention_mask'][:, :max_length]
    return batch
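The call site isn't shown in the comment; presumably it slots into the batch loop right after tokenization, along these lines (a sketch, not the author's exact code):

batch = tokenizer(batch_questions, return_tensors='pt', padding=True,
                  truncation=True, max_length=512)
batch = truncate_batch(batch)  # drop the padding columns before they reach the embedding lookup

with torch.cuda.amp.autocast():
    output_tokens = model.generate(
        input_ids=batch["input_ids"].to("cuda:0"), max_new_tokens=2048
    )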
