Thanks for your excellent work and released code.

I have a question about the effective batch size, which is batch size 128 * accumulated_grad_batch 16 = 2048.
Does this mean the model sees 128 samples at a time, computes the gradient for that mini-batch, and then sums the gradients over the 16 mini-batches? Such an implementation differs from a true batch size of 2048, where the model sees 2048 samples at once and the InfoNCE loss is computed over all 2048 samples rather than over only 128.
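To make sure I am describing the same thing, here is a rough self-contained sketch of what I mean by gradient accumulation with a per-mini-batch InfoNCE loss (all names and sizes below are illustrative, not taken from this repository):

```python
import torch
import torch.nn.functional as F

# Illustrative sketch: each InfoNCE loss only contrasts within a 128-sample mini-batch;
# gradients from 16 such mini-batches are summed before a single optimizer step, so one
# update covers 16 * 128 = 2048 samples, but the loss never sees all 2048 negatives at once.
torch.manual_seed(0)
encoder = torch.nn.Linear(32, 16)                        # stand-in for the real encoder
optimizer = torch.optim.SGD(encoder.parameters(), lr=0.1)

def info_nce(z1, z2, temperature=0.1):
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                   # (128, 128): in-batch negatives only
    labels = torch.arange(z1.size(0))
    return F.cross_entropy(logits, labels)

accumulate_grad_batches = 16
optimizer.zero_grad()
for step in range(accumulate_grad_batches):
    x1, x2 = torch.randn(128, 32), torch.randn(128, 32)  # two "views" of 128 samples
    loss = info_nce(encoder(x1), encoder(x2)) / accumulate_grad_batches
    loss.backward()                                       # gradients accumulate in .grad
optimizer.step()                                          # one update per 2048 samples
optimizer.zero_grad()
```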
In addition, I noticed that the precision is set to 16 bit. Why was 32-bit precision not used here?
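For reference, this is roughly where the setting I am asking about sits in a Lightning Trainer (the values below are placeholders, not the repository's config):

```python
import pytorch_lightning as pl

# Illustrative only: precision=16 enables automatic mixed precision (AMP), which keeps
# master weights in FP32 but runs most ops in FP16 to reduce memory and speed up training;
# precision=32 would run everything in full FP32.
trainer = pl.Trainer(
    precision=16,                  # newer Lightning versions spell this "16-mixed"
    accumulate_grad_batches=16,    # the accumulation factor discussed above
    max_epochs=100,                # hypothetical value
)
```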
In src/models/base_model.py, I see that warmup_epochs and max_epochs are rescaled by a factor of self.train_iters_per_epoch // self.config.num_of_mini_batch. Why is this rescaling necessary? If this factor is not equal to 1, the max_epochs used by the learning rate scheduler no longer matches the max_epochs passed to the PL Trainer, which does not seem right to me.
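My current guess, sketched below with made-up numbers rather than the repository's code, is that the scheduler is stepped once per optimizer step, so epoch counts have to be converted into optimizer-step counts, and gradient accumulation divides the number of optimizer steps per epoch. Please correct me if I have misread it:

```python
import math

# Hypothetical illustration: converting epoch-based schedule lengths into optimizer-step
# counts when the LR scheduler is stepped once per optimizer step under gradient accumulation.
train_iters_per_epoch = 512        # dataloader batches per epoch (batch size 128)
num_of_mini_batch = 16             # accumulation factor
steps_per_epoch = train_iters_per_epoch // num_of_mini_batch  # optimizer steps per epoch

warmup_epochs, max_epochs = 10, 100
warmup_steps = warmup_epochs * steps_per_epoch
total_steps = max_epochs * steps_per_epoch

def lr_at_step(step, base_lr=1e-3):
    """Linear warmup followed by cosine decay, measured in optimizer steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```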