Gaussian Squashed Gaussian #7609
Can one of the admins verify this patch? |
Test PASSed. |
Nice!
Is there anything else that I should do before this can be merged? |
Hey @matthewearl Any update on this? If all tests pass, I'm happy to merge it. We could then add another specific test for GaussianSquashedGaussian (entropy). |
Hi @sven1977. I noticed a stability issue when the mean deviates too far to either side and almost all of the mass concentrates around one of the limits. I have a quick fix, though: clip the mean value returned by the net, which in practice should have little effect. I'll upload that shortly. On the topic of |
Test FAILed. |
I've just added a numerical stability fix which bounds the loc between -3 and 3. Given the scale bounds, a loc at either limit already represents a pretty extreme distribution, with mass concentrated around either the high or low bound, so it shouldn't limit the behaviour space too much. |
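A minimal sketch of why bounding the loc helps, using only the Python standard library rather than the PR's actual code. It assumes the squashing function is the standard normal CDF onto (0, 1); `clip_loc` and `squash` are hypothetical helper names, and only the -3 to 3 bound comes from the comment above.

```python
from statistics import NormalDist

LOC_LIMIT = 3.0  # bound quoted in the comment above


def clip_loc(loc, limit=LOC_LIMIT):
    """Clamp the network's raw mean output to [-limit, limit]."""
    return max(-limit, min(limit, loc))


def squash(x):
    """Standard normal CDF, assumed here as the squashing function onto (0, 1)."""
    return NormalDist().cdf(x)


raw_loc = 10.0
# Without clipping, the squashed mean rounds to the bound itself in float64,
# so any log-density evaluated there would involve log(0).
print(squash(raw_loc))
# With clipping, the squashed mean stays strictly inside the interval.
print(squash(clip_loc(raw_loc)))
```

In lower precision such as float32 the saturation sets in at smaller magnitudes, which is one reason a fairly tight bound is attractive.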
Test FAILed. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
Test PASSed. |
Hi @sven1977 I've added some unit tests, fixed linter issues, and fixed issues with the existing squashed gaussian test. I think the remaining issues from Travis are in the baseline (although I am not certain). Is there anything else required for getting this merged? |
Test FAILed. |
Just wondering, why was this closed? As I see it, in the meantime a squashed gaussian has been added, but it seems to be only usable in SAC as it is not automatically chosen if bounds are given in the model catalog, correct? |
Actually, |
Just realized that there is no Torch implementation - is that the reason why this wasn't merged? I'd be happy to give it a try. |
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message. Please feel free to reopen or open a new issue if you'd still like it to be addressed. Again, you can always ask for help on our discussion forum or Ray's public slack channel. Thanks again for opening the issue! |
Why are these changes needed?
Currently, when PPO is used with a bounded (continuous) action space, action samples are simply drawn from an unbounded normal distribution and then clipped to the bounds. The entropy is calculated directly on the normal. Because PPO rewards higher entropy, there is a failure mode where the algorithm learns to push most of the mass outside of the action range and increase the variance, thus increasing entropy despite there being little change in the selected actions.
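The failure mode can be illustrated with a small standard-library sketch. It is not RLlib code: the upper bound of 1.0 and the two (mean, scale) pairs are arbitrary values chosen for illustration. Both settings clip almost every sample to the upper bound, yet the wider one reports a much larger Gaussian entropy.

```python
import math


def normal_entropy(sigma):
    """Differential entropy of an unbounded normal with standard deviation sigma."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)


def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def mass_clipped_to_high(mu, sigma, high=1.0):
    """Probability that a sample exceeds `high` and is therefore clipped to it."""
    return 1.0 - normal_cdf(high, mu, sigma)


# Two policies that both act at the upper bound almost all of the time.
for mu, sigma in [(5.0, 1.0), (50.0, 10.0)]:
    print(
        f"mu={mu}, sigma={sigma}: "
        f"clipped mass={mass_clipped_to_high(mu, sigma):.6f}, "
        f"entropy={normal_entropy(sigma):.3f}"
    )
```

The selected actions are practically identical in both cases, so the entropy bonus is being collected without any real increase in exploration.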
The direct way to fix this, i.e. calculating the entropy of the clipped distribution, doesn't work, since the clipped distribution places point masses at the bounds and so has undefined entropy. Another way to fix it is to use a "soft clip" such as the existing SquashedGaussian distribution, which maps samples through a (scaled) tanh function to ensure that samples lie within the desired range. The problem is that its entropy is hard (impossible?) to compute analytically, and PPO requires the entropy when using a non-zero entropy_coeff.
In this PR I have implemented the GaussianSquashedGaussian, which instead of mapping through tanh maps through the normal CDF. When scaled appropriately, the normal CDF closely approximates tanh. However, it has the benefit that the entropy is analytically tractable. In fact, the entropy is just -KL(N1 || N2), where N1 is the normal being squashed and N2 is the normal whose CDF is used for squashing.
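The entropy identity follows from a change of variables: if Y = CDF_N2(X) with X ~ N1, the density of Y is p_N1(x) / p_N2(x), so -E[log p_Y] = -KL(N1 || N2). The sketch below checks this numerically with the standard library only. It is not the PR's implementation: the function names are hypothetical, the log(high - low) term for rescaling to a general interval is an added assumption, and the tanh comparison assumes a slope-matching scale of sqrt(2/pi), since the PR does not state which scale it uses.

```python
import math
import random
from statistics import NormalDist


def kl_normal(mu1, sigma1, mu2, sigma2):
    """Closed-form KL(N(mu1, sigma1^2) || N(mu2, sigma2^2))."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2)
            - 0.5)


def squashed_entropy(mu1, sigma1, mu2=0.0, sigma2=1.0, low=0.0, high=1.0):
    """Entropy of low + (high - low) * CDF_N2(X) for X ~ N1.

    The log(high - low) term accounts for the affine rescaling (assumption).
    """
    return -kl_normal(mu1, sigma1, mu2, sigma2) + math.log(high - low)


def histogram_entropy(mu1, sigma1, mu2=0.0, sigma2=1.0,
                      n=200_000, bins=100, seed=0):
    """Sample-based entropy estimate of CDF_N2(X) on (0, 1), for comparison."""
    rng = random.Random(seed)
    squash = NormalDist(mu2, sigma2).cdf
    counts = [0] * bins
    for _ in range(n):
        y = squash(rng.gauss(mu1, sigma1))
        counts[min(int(y * bins), bins - 1)] += 1
    width = 1.0 / bins
    return -sum((c / n) * math.log((c / n) / width) for c in counts if c > 0)


def max_tanh_gap(scale=math.sqrt(2.0 / math.pi), lo=-6.0, hi=6.0, steps=1201):
    """Largest |tanh(x) - (2 * Phi(x / scale) - 1)| on an evenly spaced grid."""
    unit = NormalDist()
    xs = (lo + (hi - lo) * i / (steps - 1) for i in range(steps))
    return max(abs(math.tanh(x) - (2.0 * unit.cdf(x / scale) - 1.0)) for x in xs)


print("closed form:", squashed_entropy(0.3, 0.5))
print("histogram  :", histogram_entropy(0.3, 0.5))
print("max gap to tanh:", max_tanh_gap())
```

When N1 equals N2 the squashed variable is uniform on the interval, and the closed form correctly returns log(high - low).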
This should be considered a draft review for now, since I'd like to get a second opinion on how I've structured the catalog -> action space mapping, and I have also touched the existing SquashedGaussian and I'm unsure if these changes will break anything.
Related issue number
Checks
scripts/format.sh to lint the changes in this PR.