Fine-tuning training stays stuck at 0% and never progresses #5824

czhcc · 2024-10-25T08:13:54Z

Reminder

I have read the README and searched the existing issues.

System Info

llamafactory version: 0.8.3
Platform: Linux-4.18.0-348.7.1.el8_5.x86_64-x86_64-with-glibc2.35
Python version: 3.10.12
PyTorch version: 2.3.0a0+ebedce2 (GPU)
Transformers version: 4.46.0
Datasets version: 3.0.2
Accelerate version: 1.0.1
PEFT version: 0.13.2
TRL version: 0.11.4
GPU type: NVIDIA A40-24Q

Reproduction

[INFO|trainer.py:2319] 2024-10-25 08:06:23,313 >> Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:2320] 2024-10-25 08:06:23,313 >> Gradient Accumulation steps = 8
[INFO|trainer.py:2321] 2024-10-25 08:06:23,313 >> Total optimization steps = 234
[INFO|trainer.py:2322] 2024-10-25 08:06:23,318 >> Number of trainable parameters = 9,232,384
0%| | 0/234 [00:00<?, ?it/s]

It stays stuck at this message and never moves. The OS is CentOS 8.5 and Docker is 25.0.1.

Expected behavior

No response

Others

No response


Arcmoon-Hu · 2024-10-25T09:32:17Z

Which model is it? Is the GPU doing any work?

czhcc · 2024-10-25T09:59:42Z

qwen2.5-1.5b, and the GPU shows no activity either.

alexlai2860 · 2024-10-25T17:13:45Z

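Hello, have you solved this? I ran into the same problem when testing llava: training does not start in the DPO stage, but the SFT stage works normally. GPU memory is allocated, but the GPU is not doing any work.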

czhcc · 2024-10-26T02:17:51Z

Not solved. For now all I can conclude is that it is a CentOS problem, because the same setup works normally under Ubuntu.
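
The thread never establishes where the process is actually blocked. One generic way to narrow that down, sketched here under assumptions, is to arm Python's standard `faulthandler` watchdog before training starts so that the stack of every thread is dumped if the run is still going after a timeout. The helper name `run_with_hang_watchdog` and the 300-second default are hypothetical and not part of LLaMA-Factory; the sketch only shows the mechanism and does not diagnose this particular hang.

```python
import faulthandler
import sys
import time


def run_with_hang_watchdog(fn, timeout_s=300.0, out=sys.stderr):
    """Run fn(); while it is still running, dump all thread stacks every timeout_s seconds.

    Hypothetical helper for illustration. `out` must be a real file
    (something with a file descriptor), because faulthandler writes to it directly.
    """
    # Arm the watchdog: after timeout_s seconds (and repeatedly after that),
    # the traceback of every Python thread is written to `out`.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=out, exit=False)
    try:
        return fn()
    finally:
        # Disarm once fn() returns or raises, so no further dumps are written.
        faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    # Stand-in for a training entry point: sleeps longer than the timeout,
    # so one or more stack dumps appear on stderr before it finishes.
    run_with_hang_watchdog(lambda: time.sleep(3), timeout_s=1.0)
```

If the dumped stacks end inside data loading, that points somewhere different than stacks ending inside a CUDA or communication call, which may help when comparing the CentOS and Ubuntu hosts.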

github-actions bot added the "pending" label (This problem is yet to be addressed) on Oct 25, 2024
