
Training setup issues #13

Open
xuxuduoduomeiGithub opened this issue Jun 27, 2024 · 2 comments

@xuxuduoduomeiGithub

Hello, thank you for this wonderful work!
I have the following question and hope you can help:
I only have one RTX 3090 GPU with 24 GB of memory, so the settings in the RGT training configuration file:
num_worker_per_gpu: 16
batch_size_per_gpu: 8
need to be reduced before training will run. I plan to use:
num_worker_per_gpu: 6
batch_size_per_gpu: 4
Will this affect the metrics after training, and should I adjust other training parameters to compensate, such as the learning rate or the number of training iterations?
Looking forward to your reply~

@zhengchen1999
Owner

Yes, it will affect training. Model training is based on steps (iterations) rather than epochs, so changes to the batch size or GPU count require adjusting the number of iterations.

With batch_size_per_gpu=4 on 1 GPU, the effective batch size is 4, versus 32 in the original setup (batch_size_per_gpu=8 on 4 GPUs), so each iteration sees 8× less data. You would need to multiply the number of iterations by 8 and scale the learning-rate schedule accordingly. This may not exactly reproduce the reported results, but the difference shouldn't be significant.
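For concreteness, here is a rough sketch of how the iteration count and LR milestones would scale. It assumes a BasicSR-style config like RGT's; the original values (500000 iterations and the milestone list below) are illustrative assumptions, so substitute the actual numbers from your config file:

# original effective batch: 8 x 4 GPUs = 32; new: 4 x 1 GPU = 4, i.e. 8x less data per iteration
total_iter: 4000000   # assumed original 500000, multiplied by 8
scheduler:
  type: MultiStepLR
  milestones: [2000000, 3200000, 3600000, 3800000]   # assumed originals [250000, 400000, 450000, 475000], each x8
  gamma: 0.5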

PS: Given your computing resources, you could also consider using DAT-light.

@xuxuduoduomeiGithub
Author


Okay, thank you for your reply~
