
Training setup issues #13

Open
xuxuduoduomeiGithub opened this issue Jun 27, 2024 · 2 comments

@xuxuduoduomeiGithub

Hello, thank you for this wonderful work!
I have the following question and hope you can help:
I only have one RTX 3090 GPU with 24 GB of memory, so the settings in the RGT training configuration file:
num_worker_per_gpu: 16
batch_size_per_gpu: 8
need to be reduced before training will run. I plan to use:
num_worker_per_gpu: 6
batch_size_per_gpu: 4
Will this affect the metrics after training, and should I adjust other training parameters to compensate, such as the learning rate or the number of training iterations?
Looking forward to your reply~

@zhengchen1999
Owner

Yes, it will affect training. Model training is based on steps (iterations) rather than epochs, so changes to the batch size or GPU count require adjusting the number of iterations.

With batch_size_per_gpu=4 on 1 GPU, the effective batch size is 4, versus 32 in the original setup (batch_size_per_gpu=8 on 4 GPUs), so each iteration sees 8× less data. You would need to multiply the number of iterations by 8 and scale the learning-rate schedule accordingly. This may not exactly reproduce the reported results, but the difference shouldn't be significant.
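For concreteness, here is a rough sketch of how the iteration count and LR milestones would scale. It assumes a BasicSR-style config like RGT's; the original values (500000 iterations and the milestone list below) are illustrative assumptions, so substitute the actual numbers from your config file:

# original effective batch: 8 x 4 GPUs = 32; new: 4 x 1 GPU = 4, i.e. 8x less data per iteration
total_iter: 4000000   # assumed original 500000, multiplied by 8
scheduler:
  type: MultiStepLR
  milestones: [2000000, 3200000, 3600000, 3800000]   # assumed originals [250000, 400000, 450000, 475000], each x8
  gamma: 0.5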

PS: Given your computing resources, you could also consider using DAT-light.

@xuxuduoduomeiGithub
Author


Okay, thank you for your reply~
