Qwen2-VL fine-tuning: is inputting video and image at the same time unsupported? #5822

zhang122994917 opened this issue Oct 25, 2024 · 0 comments
Labels
pending This problem is yet to be addressed


@zhang122994917

Reminder

  • I have read the README and searched the existing issues.

System Info

transformers==4.45.1

Reproduction

When the input contains both video and image, tokenization fails with the following error:

Converting format of dataset (num_proc=128): 100%|_________________________________________________________________________| 49996/49996 [00:02<00:00, 20519.51 examples/s]
Running tokenizer on dataset (num_proc=128): 0%| | 0/49996 [02:06<?, ? examples/s]

[rank0]: result = (True, func(*args, **kwds))
[rank0]: File "/usr/local/lib/python3.8/dist-packages/datasets/utils/py_utils.py", line 678, in _write_generator_to_queue
[rank0]: for i, result in enumerate(func(**kwargs)):
[rank0]: File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 3558, in _map_single
[rank0]: batch = apply_function_on_filtered_inputs(
[rank0]: File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 3427, in apply_function_on_filtered_inputs
[rank0]: processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
[rank0]: File "./LLaMA-Factory/src/llamafactory/data/processors/supervised.py", line 105, in preprocess_supervised_dataset
[rank0]: input_ids, labels = _encode_supervised_example(
[rank0]: File "./LLaMA-Factory/src/llamafactory/data/processors/supervised.py", line 48, in _encode_supervised_example
[rank0]: messages = template.mm_plugin.process_messages(prompt + response, images, videos, processor)
[rank0]: File "./LLaMA-Factory/src/llamafactory/data/mm_plugin.py", line 496, in process_messages
[rank0]: raise ValueError("len(images) is less than the number of {} tokens.".format(IMAGE_PLACEHOLDER))
[rank0]: ValueError: len(images) is less than the number of <image> tokens.
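The raise at mm_plugin.py line 496 looks like a count of placeholder tokens against the media lists, along these lines (a minimal sketch, not the actual LLaMA-Factory code; the function name is assumed):

```python
# Hedged sketch of a placeholder-vs-media count check like the one that
# raises above; the placeholder strings mirror the error message.
IMAGE_PLACEHOLDER = "<image>"
VIDEO_PLACEHOLDER = "<video>"

def validate_media_counts(messages, images, videos):
    # Count placeholder tokens across all message contents.
    num_image_tags = sum(m["content"].count(IMAGE_PLACEHOLDER) for m in messages)
    num_video_tags = sum(m["content"].count(VIDEO_PLACEHOLDER) for m in messages)
    # Each placeholder must be backed by a media item of the same type.
    if num_image_tags > len(images):
        raise ValueError("len(images) is less than the number of {} tokens.".format(IMAGE_PLACEHOLDER))
    if num_video_tags > len(videos):
        raise ValueError("len(videos) is less than the number of {} tokens.".format(VIDEO_PLACEHOLDER))
```

The sample below has one `<image>` and one `<video>` with one entry in each list, so a per-type count like this would pass; the failure presumably comes from how the plugin pairs placeholders with media when both types appear in one example.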

The data format is as follows; the messages contain both `<video>` and `<image>` placeholders:

    {
        "messages": [
            {
                "content": "<video><image>\nThe above is a xxx",
                "role": "user"
            },
            {
                "content": "",
                "role": "assistant"
            }
        ],
        "images": [
            "xxxx"
        ],
        "videos": [
            "xxxx"
        ]
    },

After modifying the code as follows, it runs, but the tokenizer is extremely slow and I have not yet found out why:

        if image_processor != video_processor:
            if input_dict.get("images") is not None:
                mm_inputs.update(image_processor(input_dict["images"], return_tensors="pt"))
            if input_dict.get("videos") is not None:
                mm_inputs.update(video_processor(input_dict["videos"], return_tensors="pt"))
        elif input_dict.get("images") is not None or input_dict.get("videos") is not None:  # same processor (qwen2-vl)
            # Originally: mm_inputs.update(image_processor(**input_dict, return_tensors="pt"))
            # Split into two calls so images and videos can coexist in one example.
            images = input_dict.get("images")
            videos = input_dict.get("videos")

            if images is not None:
                image_inputs = image_processor(images=images, videos=None, return_tensors="pt")
                image_grid_thw = image_inputs["image_grid_thw"]
            else:
                image_inputs = {}
                image_grid_thw = None

            if videos is not None:
                videos_inputs = image_processor(images=None, videos=videos, return_tensors="pt")
                video_grid_thw = videos_inputs["video_grid_thw"]
            else:
                videos_inputs = {}
                video_grid_thw = None

            # Only the grid shapes are merged here; the remaining tensors in
            # image_inputs / videos_inputs are not copied into mm_inputs.
            mm_inputs["image_grid_thw"] = image_grid_thw
            mm_inputs["video_grid_thw"] = video_grid_thw

        return mm_inputs
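One thing worth double-checking in the patch above: only the `*_grid_thw` entries are copied into `mm_inputs`, while the pixel tensors inside `image_inputs` / `videos_inputs` are dropped. A hedged sketch of merging both feature dicts instead, assuming (as with Qwen2-VL's `pixel_values` vs. `pixel_values_videos`) the two dicts use disjoint keys:

```python
def merge_mm_inputs(image_inputs, videos_inputs):
    """Merge image and video feature dicts into one mm_inputs dict.

    Assumes disjoint keys (e.g. pixel_values / image_grid_thw vs.
    pixel_values_videos / video_grid_thw), so a plain dict union
    keeps every tensor from both processor calls.
    """
    mm_inputs = {}
    mm_inputs.update(image_inputs or {})
    mm_inputs.update(videos_inputs or {})
    return mm_inputs
```

Either dict may be `None`/empty when the example has only one media type.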

Expected behavior

No response

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Oct 25, 2024