Matrix Size Mismatch Error in 'make_captions.py' When Using --num_beams Not Equal to 1 #1149

Closed
INFCode opened this issue Mar 3, 2024 · 2 comments

INFCode commented Mar 3, 2024

I am using finetune/make_captions.py to create captions with BLIP. However, setting --num_beams to anything larger than 1 results in a matrix size mismatch.

Here's the error log:

> python3 "finetune/make_captions.py" --batch_size="1" --num_beams="5" --top_p="0.9" --max_length="75" --min_length="5" --beam_search \
                         --caption_extension=".txt" "/home/infcode/code/git/ai/dataset/hyh/all_original_caption/" \
                         --caption_weights="https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_caption.pth"
2024-03-02 22:58:18.607746: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-02 22:58:18.607782: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-02 22:58:18.607818: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-02 22:58:18.613707: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-02 22:58:19.716504: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
get_preferred_device() -> cuda
2024-03-02 22:58:20 INFO     Current Working Directory is: /home/infcode/code/git/kohya_ss                                                                         make_captions.py:85
                    INFO     load images from /home/infcode/code/git/ai/dataset/hyh/all_original_caption                                                           make_captions.py:90
                    INFO     found 53 images.                                                                                                                      make_captions.py:93
                    INFO     loading BLIP caption: https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_caption.pth                 make_captions.py:95
2024-03-02 22:58:29 INFO     load checkpoint from https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_caption.pth                          blip.py:242
                    INFO     BLIP loaded                                                                                                                           make_captions.py:99
  0%|                                                                                                                                                          | 0/53 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/infcode/code/git/kohya_ss/finetune/make_captions.py", line 210, in <module>
    main(args)
  File "/home/infcode/code/git/kohya_ss/finetune/make_captions.py", line 154, in main
    run_batch(b_imgs)
  File "/home/infcode/code/git/kohya_ss/finetune/make_captions.py", line 107, in run_batch
    captions = model.generate(
  File "/home/infcode/code/git/kohya_ss/finetune/blip/blip.py", line 162, in generate
    outputs = self.text_decoder.generate(input_ids=input_ids,
  File "/home/infcode/code/git/kohya_ss/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/infcode/code/git/kohya_ss/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1797, in generate
    return self.beam_search(
  File "/home/infcode/code/git/kohya_ss/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 3181, in beam_search
    outputs = self(
  File "/home/infcode/code/git/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/infcode/code/git/kohya_ss/finetune/blip/med.py", line 886, in forward
    outputs = self.bert(
  File "/home/infcode/code/git/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/infcode/code/git/kohya_ss/finetune/blip/med.py", line 781, in forward
    encoder_outputs = self.encoder(
  File "/home/infcode/code/git/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/infcode/code/git/kohya_ss/finetune/blip/med.py", line 445, in forward
    layer_outputs = layer_module(
  File "/home/infcode/code/git/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/infcode/code/git/kohya_ss/finetune/blip/med.py", line 361, in forward
    cross_attention_outputs = self.crossattention(
  File "/home/infcode/code/git/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/infcode/code/git/kohya_ss/finetune/blip/med.py", line 277, in forward
    self_outputs = self.self(
  File "/home/infcode/code/git/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/infcode/code/git/kohya_ss/finetune/blip/med.py", line 178, in forward
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: The size of tensor a (5) must match the size of tensor b (25) at non-singleton dimension 0

After trying different values of num_beams, it seems that the size of key_layer.transpose(-1, -2) at dimension 0 is the square of the size of query_layer at dimension 0 (num_beams² vs. num_beams), which explains why num_beams=1 does not trigger the error.
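
If I am reading finetune/blip/blip.py right, generate() already calls image_embeds.repeat_interleave(num_beams, dim=0) before handing the embeddings to text_decoder.generate(), and newer transformers releases appear to expand the encoder states again inside beam_search(), so the cross-attention keys would end up with num_beams² rows while the queries only have num_beams. Here is a toy sketch of that hypothesis (all shapes below are made up for illustration; only the batch dimension matters):

import torch

# Hypothetical shapes, for illustration only; dim 0 is what matters here.
batch_size, num_beams = 1, 5
heads, q_len, img_len, head_dim = 12, 1, 197, 64

# Decoder queries: beam search expands the batch once -> batch_size * num_beams
query_layer = torch.randn(batch_size * num_beams, heads, q_len, head_dim)

# Encoder (image) states expanded twice -- once by the repeat_interleave in
# blip.py and once inside beam_search -> batch_size * num_beams**2
key_layer = torch.randn(batch_size * num_beams**2, heads, img_len, head_dim)

# Same matmul as finetune/blip/med.py line 178; this raises
# "The size of tensor a (5) must match the size of tensor b (25) at non-singleton dimension 0"
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))

If that is really what happens, removing or version-guarding the manual repeat_interleave should make the shapes line up again, but I have not confirmed that this is the right fix.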

TeKett commented Mar 29, 2024

I got the exact same issue; I'm using the GUI version.

10:38:16-980477 INFO     Version: v22.3.1

10:38:16-987458 INFO     nVidia toolkit detected
10:38:18-560252 INFO     Torch 2.0.1+cu118
10:38:18-584160 INFO     Torch backend: nVidia CUDA 11.8 cuDNN 8700
10:38:18-586155 INFO     Torch detected GPU: NVIDIA GeForce RTX 4070 Ti VRAM 12281 Arch (8, 9) Cores 60
10:38:18-588150 INFO     Verifying modules installation status from requirements_windows_torch2.txt...
10:38:18-593163 INFO     Verifying modules installation status from requirements.txt...
10:38:21-600523 INFO     headless: False
10:38:21-607532 INFO     Load CSS...
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
10:51:20-341524 INFO     Captioning files in D:\Pictures\New folder (3)...
10:51:20-342520 INFO     ./venv/Scripts/python.exe "finetune/make_captions.py" --batch_size="1" --num_beams="2"
                         --top_p="0.9" --max_length="75" --min_length="35" --beam_search --caption_extension=".txt"
                         "D:\Pictures\New folder (3)"
                         --caption_weights="https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/mode
                         l_large_caption.pth"
Current Working Directory is:  C:\Train\Kohya
load images from D:\Pictures\New folder (3)
found 2 images.
loading BLIP caption: https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_caption.pth
load checkpoint from https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_caption.pth
BLIP loaded
  0%|                                                                                            | 0/2 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "C:\Train\Kohya\finetune\make_captions.py", line 200, in <module>
    main(args)
  File "C:\Train\Kohya\finetune\make_captions.py", line 144, in main
    run_batch(b_imgs)
  File "C:\Train\Kohya\finetune\make_captions.py", line 97, in run_batch
    captions = model.generate(
  File "C:\Train\Kohya\finetune\blip\blip.py", line 158, in generate
    outputs = self.text_decoder.generate(input_ids=input_ids,
  File "C:\Train\Kohya\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Train\Kohya\venv\lib\site-packages\transformers\generation\utils.py", line 1611, in generate
    return self.beam_search(
  File "C:\Train\Kohya\venv\lib\site-packages\transformers\generation\utils.py", line 2909, in beam_search
    outputs = self(
  File "C:\Train\Kohya\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Train\Kohya\finetune\blip\med.py", line 886, in forward
    outputs = self.bert(
  File "C:\Train\Kohya\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Train\Kohya\finetune\blip\med.py", line 781, in forward
    encoder_outputs = self.encoder(
  File "C:\Train\Kohya\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Train\Kohya\finetune\blip\med.py", line 445, in forward
    layer_outputs = layer_module(
  File "C:\Train\Kohya\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Train\Kohya\finetune\blip\med.py", line 361, in forward
    cross_attention_outputs = self.crossattention(
  File "C:\Train\Kohya\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Train\Kohya\finetune\blip\med.py", line 277, in forward
    self_outputs = self.self(
  File "C:\Train\Kohya\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Train\Kohya\finetune\blip\med.py", line 178, in forward
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: The size of tensor a (2) must match the size of tensor b (4) at non-singleton dimension 0

kohya-ss (Owner) commented

I finally found the fix for this :)
