SM-G-SUM feeds backward() outputs of 1 and then uses the returned gradients unaltered (i.e., their sum across the states).
SM-G-ABS feeds backward() outputs of 1/sz and then manually takes the mean of the per-state gradients, whereas in SM-G-SUM they were already summed inside backward().
The result is that SM-G-SUM uses a scale that is sz^2 larger in magnitude than SM-G-ABS. This is hard to notice when the number of states is only 2, as in the example, especially since SM-G-ABS naturally returns a larger scale anyway: the absolute values keep per-state gradients of opposite sign from cancelling (no washout).
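To make the sz^2 claim concrete, here is a toy check of mine (not the repo's code): a single linear layer fed sz identical states, so the ratio between the two scales comes out exactly sz^2. It only illustrates the scaling, not the full per-output-unit SM-G procedure.

```python
# Toy check (my own example, not the repo's code): with sz identical states,
# the SM-G-SUM-style scale is exactly sz**2 times the SM-G-ABS-style one.
import torch

sz = 8
torch.manual_seed(0)
model = torch.nn.Linear(4, 3)
states = torch.randn(1, 4).repeat(sz, 1)      # sz copies of one state

# SM-G-SUM style: one backward() with grad_output of 1; backward() sums the
# per-state gradients internally.
model.zero_grad()
model(states).backward(torch.ones(sz, 3))
sum_scale = model.weight.grad.abs().mean().item()

# SM-G-ABS style: per-state backward() with grad_output of 1/sz, then a
# manual mean over the per-state gradients.
per_state = []
for i in range(sz):
    model.zero_grad()
    model(states[i:i + 1]).backward(torch.ones(1, 3) / sz)
    per_state.append(model.weight.grad.clone())
abs_scale = torch.stack(per_state).mean(0).abs().mean().item()

print(sum_scale / abs_scale)                  # ≈ sz**2 == 64.0
```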
Absolutely awesome work on your genetic and evolutionary research! Safe mutations are an incredible milestone in genetic optimization! Now just throw away TensorFlow and PyTorch and start coding in pure CUDA like you ought to :)
To further clarify: I believe both implementations are wrong in the sense that neither finds a scaling vector that is independent of the number of states.
SM-G-SUM should set:
`grad_output[:, i] = 1.0 / len(_states)`
since the gradients get summed across the states by the backward() pass.
SM-G-ABS should EITHER:
a) set `grad_output[:, i] = 1.0`, since these values are then averaged along axis 2,
or
b) use `mean_abs_jacobian = torch.abs(jacobian).sum(2)` to sum them instead of averaging them.
Either way, both methods end up on the same per-state scale; see the sketch below.
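For concreteness, here is a self-contained sketch of how the proposals above would look, using option (a) for SM-G-ABS. The names grad_output, _states, jacobian and mean_abs_jacobian follow the snippets quoted in this issue; the toy model and the flat_grad helper are mine, not the repo's code.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 3)                 # stand-in for the policy network
_states = torch.randn(5, 4)                   # 5 recorded states, 4 inputs each
num_states, out_dim = len(_states), 3

def flat_grad():
    # Concatenate the current parameter gradients into one flat vector.
    return torch.cat([p.grad.reshape(-1) for p in model.parameters()])

# SM-G-SUM with the proposed 1.0/len(_states) scaling: backward() sums the
# gradients over the states, so pre-dividing turns that sum into an average.
sum_sensitivity = []
for i in range(out_dim):
    model.zero_grad()
    output = model(_states)
    grad_output = torch.zeros_like(output)
    grad_output[:, i] = 1.0 / num_states      # proposed SM-G-SUM scaling
    output.backward(grad_output)
    sum_sensitivity.append(flat_grad())
sum_sensitivity = torch.stack(sum_sensitivity)        # (out_dim, n_params)

# SM-G-ABS, option (a): grad_output of 1.0 per state, then a mean over states.
jacobian = []
for s in range(num_states):
    rows = []
    for i in range(out_dim):
        model.zero_grad()
        output = model(_states[s:s + 1])
        grad_output = torch.zeros_like(output)
        grad_output[:, i] = 1.0               # option (a) for SM-G-ABS
        output.backward(grad_output)
        rows.append(flat_grad())
    jacobian.append(torch.stack(rows))
jacobian = torch.stack(jacobian, dim=2)       # (out_dim, n_params, num_states)
mean_abs_jacobian = torch.abs(jacobian).mean(2)

# Both sensitivities are now on a per-state scale, independent of num_states.
print(sum_sensitivity.abs().mean().item(), mean_abs_jacobian.mean().item())
```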
Note that torch.autograd.backward() computes the sum of the gradients over all states (at least in 0.4.1: https://pytorch.org/docs/stable/autograd.html?highlight=backward#torch.autograd.backward).
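A quick demonstration of that summing behaviour (a toy example of mine, not from the repo):

```python
import torch

w = torch.ones(3, requires_grad=True)
states = torch.tensor([[1.0, 2.0, 3.0],
                       [4.0, 5.0, 6.0]])
out = states @ w                        # one output per state
out.backward(torch.ones_like(out))      # grad_output of 1 for every state
print(w.grad)                           # tensor([5., 7., 9.]): the per-state
                                        # gradients are summed, not averaged
```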