
During LoRA fine-tuning, the deepseek-moe model's loss suddenly drops to 0 and stays there until the end of training, causing abnormal inference. #27

Open
hangchen426926 opened this issue Feb 29, 2024 · 3 comments

Comments

@hangchen426926

Symptom 1: During LoRA fine-tuning of the deepseek-moe model, the loss suddenly drops to 0 and stays there until the end of training, which makes inference abnormal; the output is just: !!!.
[screenshot attached]

Symptom 2: Continuing LoRA fine-tuning of the deepseek-moe model from an existing checkpoint raises an error.
trainer.train(resume_from_checkpoint=resume_from_checkpoint_dir) has to be changed to
trainer.train() before training will start, but then the saved checkpoints start from scratch instead of continuing from the original checkpoint.

Looking forward to your reply, thanks!

@zwd003
Collaborator

zwd003 commented Mar 1, 2024

What is the error message? Resuming requires loading the LoRA adapter.
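
Below is a minimal sketch (not from the original comment) of one way to act on this suggestion: load the saved LoRA adapter from the checkpoint directory with PEFT before calling trainer.train(). The model id, checkpoint path, training arguments, and train_dataset are placeholders, and this continues only from the adapter weights; the optimizer and scheduler state in the checkpoint is not restored.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import PeftModel

base_model_path = "deepseek-ai/deepseek-moe-16b-base"  # placeholder model id
checkpoint_dir = "output/checkpoint-500"               # placeholder checkpoint path

tokenizer = AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Re-attach the previously trained LoRA adapter saved in the checkpoint,
# keeping it trainable so fine-tuning can continue from those weights.
model = PeftModel.from_pretrained(model, checkpoint_dir, is_trainable=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="output", num_train_epochs=1),
    train_dataset=train_dataset,  # assumed to be prepared elsewhere
)

# Train without resume_from_checkpoint: the adapter weights loaded above are
# the starting point, but optimizer/scheduler state is not resumed.
trainer.train()
```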

@hangchen426926
Author

hangchen426926 commented Mar 1, 2024

> What is the error message? Resuming requires loading the LoRA adapter.

Starting LoRA training from the base model raises no error, and inference also runs without error; the problem is that the loss suddenly drops to 0 after about one epoch, and the fine-tuned model's inference output is a string of exclamation marks. However, if I try to load the LoRA adapter when resuming LoRA fine-tuning, training with trainer.train(resume_from_checkpoint=resume_from_checkpoint_dir) fails with an error.
[screenshot attached]

@zyzyyy123

> What is the error message? Resuming requires loading the LoRA adapter.
>
> Starting LoRA training from the base model raises no error, and inference also runs without error; the problem is that the loss suddenly drops to 0 after about one epoch, and the fine-tuned model's inference output is a string of exclamation marks. However, if I try to load the LoRA adapter when resuming LoRA fine-tuning, training with trainer.train(resume_from_checkpoint=resume_from_checkpoint_dir) fails with an error. [screenshot attached]

Have you resolved this issue?
