In the ResNet9_Barlow_Twins.ipynb notebook, the model and the training function compiled and trained successfully for ~13 epochs, but after that the loss abruptly becomes `nan`. This in turn was caused by the gradients becoming `nan`.
Debugging:
`torch.autograd.set_detect_anomaly(True)` was used to trace which part of the code was producing the `nan` values.
The error was traced to: `RuntimeError: Function 'PowBackward0' returned nan values in its 0th output.` (A `PowBackward0` `nan` typically arises when a fractional power, e.g. the square root inside a standard-deviation or normalization step, is backpropagated through a non-positive base.)
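For illustration, here is a minimal self-contained reproduction of this failure mode (not the notebook's code): a fractional power of a negative base has a `nan` gradient, and anomaly mode attributes it to `PowBackward0`.

```python
import torch

# Anomaly mode makes backward() raise at the first op whose gradient
# contains nan, naming the op (here, PowBackward0).
torch.autograd.set_detect_anomaly(True)

# d/dx x**1.5 = 1.5 * x**0.5, which is nan for x < 0.
x = torch.tensor([-1.0], requires_grad=True)
y = x.pow(1.5).sum()
y.backward()  # RuntimeError: Function 'PowBackward0' returned nan values in its 0th output.
```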
Gradient clipping was used to ensure that the gradients do not explode, and division by zero was prevented at all stages by adding a small positive constant wherever required (a sketch of both mitigations follows).
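A minimal sketch of both mitigations, assuming a placeholder model and optimizer and an assumed clipping threshold of 1.0 (the notebook's actual values may differ):

```python
import torch

model = torch.nn.Linear(8, 8)                     # stand-in for the actual encoder
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
EPS = 1e-8                                        # small positive constant guarding divisions

z = model(torch.randn(16, 8))
z_norm = (z - z.mean(dim=0)) / (z.std(dim=0) + EPS)  # EPS keeps the division finite
loss = z_norm.pow(2).mean()

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # cap the gradient norm
optimizer.step()
```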
We also tried using Facebook Research's implementation of the Barlow Twins loss function and the LARS optimizer.
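For context, here is a sketch of the Barlow Twins objective with the epsilon-stabilized normalization described above. It paraphrases the loss from the Barlow Twins paper (Zbontar et al., 2021); it is not the notebook's or Facebook Research's exact code, and the `lambda_coeff` and `eps` defaults are assumptions.

```python
import torch

def barlow_twins_loss(z1, z2, lambda_coeff=5e-3, eps=1e-8):
    """Cross-correlation loss on two (batch, dim) embedding views (sketch)."""
    n, _ = z1.shape
    # Standardize each feature over the batch; eps guards the division.
    # An unstabilized std() here is a plausible source of PowBackward0 nans.
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + eps)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + eps)
    c = (z1.T @ z2) / n                                   # (dim, dim) cross-correlation
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()        # drive diagonal toward 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # drive the rest toward 0
    return on_diag + lambda_coeff * off_diag

# Usage with random stand-in embeddings:
z1, z2 = torch.randn(32, 64), torch.randn(32, 64)
print(barlow_twins_loss(z1, z2))
```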
A simpler model (AlexNet) was also tried, in the AlexNet_Barlow_Twins.ipynb notebook, but the error persisted.