Hi @LukeLIN-web,
I was not able to reproduce this on an RTX 4090. That said, I would also expect it to work on a 2080 Ti, since that GPU is fully supported for 4-bit quantization with bitsandbytes.
I suspect your stack trace is not giving the full picture: we do not use cublasGemmEx in the 4-bit path, so the call likely comes from a PyTorch operation. You may get a clearer trace by setting CUDA_LAUNCH_BLOCKING=1 in your environment, for example as sketched below.
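A minimal sketch of one way to do this from a Python entry point (the variable must be set before the CUDA context is created, so before importing torch; setting it in the shell instead, e.g. `CUDA_LAUNCH_BLOCKING=1 python repro.py`, is equivalent):

```python
import os

# Force synchronous CUDA kernel launches so the stack trace points at the
# actual failing operation rather than a later, unrelated call. Must be set
# before torch initializes the CUDA context.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import only after the variable is set
```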
System Info
I am using CUDA 12.2, torch 2.1.0a0+29c30b1, bitsandbytes 0.43.3, and Python 3.10
Driver Version: 535.113.01
NVIDIA GeForce RTX 2080 Ti
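A quick way to confirm these versions and the device from Python (the commented outputs are what I would expect on this setup, assuming the build reports them as listed above):

```python
import torch
import bitsandbytes

# Versions and device info, matching the system info above.
print(torch.__version__)                    # 2.1.0a0+29c30b1
print(bitsandbytes.__version__)             # 0.43.3
print(torch.version.cuda)                   # 12.2
print(torch.cuda.get_device_name(0))        # NVIDIA GeForce RTX 2080 Ti
print(torch.cuda.get_device_capability(0))  # (7, 5) -- Turing
```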
Reproduction
Vchitect/Latte#125 (comment)
Expected behavior
https://huggingface.co/docs/bitsandbytes/v0.43.3/installation
What is the GPU requirement for 4-bit quantization? A minimal standalone check is sketched below.
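For testing 4-bit support in isolation from the linked reproduction, a minimal load-and-generate sketch (the model id here is just a small placeholder, not the model from the linked issue; substitute accordingly):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Placeholder model id for illustration only.
model_id = "facebook/opt-125m"

# Standard 4-bit NF4 quantization config for bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# A short generation exercises the 4-bit matmul kernels on the GPU.
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=8)[0]))
```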