Fine-tuning training stays stuck at 0% and never progresses #5824

Open
1 task done
czhcc opened this issue Oct 25, 2024 · 4 comments
Labels
pending This problem is yet to be addressed

Comments

@czhcc

czhcc commented Oct 25, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.8.3
  • Platform: Linux-4.18.0-348.7.1.el8_5.x86_64-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • PyTorch version: 2.3.0a0+ebedce2 (GPU)
  • Transformers version: 4.46.0
  • Datasets version: 3.0.2
  • Accelerate version: 1.0.1
  • PEFT version: 0.13.2
  • TRL version: 0.11.4
  • GPU type: NVIDIA A40-24Q

Reproduction

[INFO|trainer.py:2319] 2024-10-25 08:06:23,313 >> Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:2320] 2024-10-25 08:06:23,313 >> Gradient Accumulation steps = 8
[INFO|trainer.py:2321] 2024-10-25 08:06:23,313 >> Total optimization steps = 234
[INFO|trainer.py:2322] 2024-10-25 08:06:23,318 >> Number of trainable parameters = 9,232,384
0%| | 0/234 [00:00<?, ?it/s]

Training stays stuck at this line and never advances. The system is CentOS 8.5 and Docker is version 25.0.1.
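When a run sits at 0% like this, one way to find out where the process is blocked is Python's standard `faulthandler` module, which can dump the stack of every thread on a timer. The following is only a minimal sketch: the helper names `enable_hang_dump` and `disable_hang_dump` are hypothetical, and it assumes you can add a call near the top of whatever script launches training.

```python
import faulthandler
import sys


def enable_hang_dump(timeout_s: float = 300.0, file=None) -> None:
    """Periodically dump all thread stacks so a hang location becomes visible.

    `timeout_s` and the helper name are illustrative choices, not part of
    any library. `file` must be a real file object with a file descriptor.
    """
    target = sys.stderr if file is None else file
    # repeat=True makes faulthandler dump again every `timeout_s` seconds.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)


def disable_hang_dump() -> None:
    """Stop the periodic dumps, for example once training is progressing."""
    faulthandler.cancel_dump_traceback_later()
```

If the same frame shows up at the top of every dump (a data loader call, a lock, a collective operation), that is a reasonable place to start looking. The dump only covers Python frames in the current process.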

Expected behavior

No response

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Oct 25, 2024
@Arcmoon-Hu

Which model is this? Is the GPU actually running?

@czhcc
Author

czhcc commented Oct 25, 2024

qwen2.5-1.5b, and the GPU shows no activity either.

@alexlai2860

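Hi, have you solved this? I ran into the same problem when testing llava: training does not run during DPO, but the SFT stage works fine. GPU memory is occupied, but the GPU is not doing any work.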
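The "memory occupied but GPU not running" symptom can be checked from a script rather than by eye. The sketch below parses the CSV output of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` and flags GPUs that hold memory while reporting near-zero utilization. The function names and thresholds are assumptions made for illustration, and it assumes the driver reports numeric values (some virtual GPU setups may not).

```python
import subprocess

# Query string for nvidia-smi; fields come back as "index, util, mem" per line.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text: str) -> list[dict]:
    """Turn nvidia-smi CSV lines into dicts (assumes numeric fields)."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        stats.append(
            {"index": int(index), "util_pct": int(util), "mem_used_mib": int(mem)}
        )
    return stats


def looks_idle(stats, util_threshold=5, mem_threshold_mib=1024):
    """Return indices of GPUs that hold memory but show almost no utilization.

    Both thresholds are arbitrary illustrative values.
    """
    return [
        s["index"]
        for s in stats
        if s["util_pct"] <= util_threshold and s["mem_used_mib"] >= mem_threshold_mib
    ]


def read_gpu_stats() -> list[dict]:
    """Run nvidia-smi once and parse its output (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(out)
```

Calling `looks_idle(read_gpu_stats())` a few times, some seconds apart, while the progress bar is stuck would show whether the GPU stays idle the whole time or only between bursts of work.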

@czhcc
Author

czhcc commented Oct 26, 2024

Not solved. For now I can only conclude that it is a CentOS problem, because the same setup works normally on Ubuntu.


3 participants