Reminder
I have read the README and searched the existing issues.
System Info
transformers==4.45.1
Reproduction
When the input contains both video and image, tokenization fails with the following error:
Converting format of dataset (num_proc=128): 100%|_________________________________________________________________________| 49996/49996 [00:02<00:00, 20519.51 examples/s]
Running tokenizer on dataset (num_proc=128): 0%| | 0/49996 [02:06<?, ? examples/s]
[rank0]: result = (True, func(*args, **kwds))
[rank0]: File "/usr/local/lib/python3.8/dist-packages/datasets/utils/py_utils.py", line 678, in _write_generator_to_queue
[rank0]: for i, result in enumerate(func(**kwargs)):
[rank0]: File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 3558, in _map_single
[rank0]: batch = apply_function_on_filtered_inputs(
[rank0]: File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 3427, in apply_function_on_filtered_inputs
[rank0]: processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
[rank0]: File "./LLaMA-Factory/src/llamafactory/data/processors/supervised.py", line 105, in preprocess_supervised_dataset
[rank0]: input_ids, labels = _encode_supervised_example(
[rank0]: File "./LLaMA-Factory/src/llamafactory/data/processors/supervised.py", line 48, in _encode_supervised_example
[rank0]: messages = template.mm_plugin.process_messages(prompt + response, images, videos, processor)
[rank0]: File "./LLaMA-Factory/src/llamafactory/data/mm_plugin.py", line 496, in process_messages
[rank0]: raise ValueError("len(images) is less than the number of {} tokens.".format(IMAGE_PLACEHOLDER))
[rank0]: ValueError: len(images) is less than the number of <image> tokens.
The data format is as follows; the messages contain both `<image>` and `<video>` placeholders.
After modifying the code as follows, the run goes through, but tokenization is extremely slow and I have not yet found the cause.
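Independently of the modification mentioned above, a small pre-check over the raw dataset may help narrow down which samples trigger the `ValueError`. The sketch below is not LLaMA-Factory code. It assumes the placeholder strings are `<image>` and `<video>` and that each sample has `messages`, `images`, and `videos` fields; the helper names `count_placeholders` and `find_mismatches` are hypothetical. Adjust all of these to the actual dataset layout.

```python
from typing import Dict, List, Sequence

# Assumed placeholder strings; verify them against your LLaMA-Factory version.
IMAGE_PLACEHOLDER = "<image>"
VIDEO_PLACEHOLDER = "<video>"


def count_placeholders(messages: Sequence[Dict[str, str]], placeholder: str) -> int:
    """Count occurrences of `placeholder` across all message contents."""
    return sum(m.get("content", "").count(placeholder) for m in messages)


def find_mismatches(samples: Sequence[dict]) -> List[dict]:
    """Report samples whose placeholder counts differ from their media counts.

    The field names ("messages", "images", "videos") are assumptions about
    the dataset layout, not something confirmed by the report above.
    """
    bad = []
    for idx, sample in enumerate(samples):
        messages = sample.get("messages", [])
        image_tokens = count_placeholders(messages, IMAGE_PLACEHOLDER)
        video_tokens = count_placeholders(messages, VIDEO_PLACEHOLDER)
        images = len(sample.get("images") or [])
        videos = len(sample.get("videos") or [])
        if image_tokens != images or video_tokens != videos:
            bad.append(
                {
                    "index": idx,
                    "image_tokens": image_tokens,
                    "images": images,
                    "video_tokens": video_tokens,
                    "videos": videos,
                }
            )
    return bad
```

Running `find_mismatches` on the loaded JSON before launching training lists the index and counts of every inconsistent sample. If it reports nothing, the mismatch is more likely introduced during preprocessing than present in the data itself.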
Expected behavior
No response
Others
No response