-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revert "[wingman -> rllib] IMPALA MultiDiscrete changes (#3967)" #4332
Conversation
Test PASSed. |
@robertnishihara is this misdetection due to a revert or is it some other bug? |
@ericl This is the new performance testing jenkins, which only kicks in if the PR string contains ray-core afaik (@devin-petersohn), i.e. if something performance-related has changed. ray-core should probably be changed to core. |
Ah ok, didn't realize it had a different name. |
log_rhos = get_log_rhos(target_action_log_probs, | ||
behaviour_action_log_probs) | ||
|
||
log_rhos = target_action_log_probs - behaviour_action_log_probs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One potential source of the bug is if this guy has a wrong shape.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked the shapes here and it looked reasonable (at least they match).
Test FAILed. |
…project#3967)" (ray-project#4332)" This reverts commit 3c41cb9.
What do these changes do?
There seems to be a bug here, isolated to the changes in vtrace.py. Reverting that file and patching up multi_from_logits to call from_logits fixes the regression. Unfortunately I don't have time to investigate this further so we should revert this change and re-land it once the bug is found.
Interestingly, vtrace.py still passes its original unit tests.
The regression is fairly straightforward to reproduce: on any GPU machine run:
Within a few minutes it gets to 1m+ timesteps. Before the regression, you can expect 10+ reward at this point. After the regression, the reward will top out at 1-2.
Related issue number
Atari performance regression originally reported in #4329