
[RLlib] disabled torch release test on APPO #30282

Closed

Conversation

kouroshHakha (Contributor) commented:

Signed-off-by: Kourosh Hakhamaneshi [email protected]

Why are these changes needed?

[Screenshot attached: Screen Shot 2022-11-14 at 11 27 15 PM]

This test has just never passed; the torch variant has always been broken for some reason. We should disable it for now until this gets fixed; otherwise the release-test notifications will be noisy.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
```diff
@@ -7,6 +7,8 @@ appo-pongnoframeskip-v4:
         timesteps_total: 5000000
     stop:
         time_total_s: 1800
+    # TODO (Kourosh): Torch and tf2 do not learn as good as tf. Why?
+    frameworks: ["tf"]
```
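For context, a minimal sketch of why this one-line restriction silences the torch failures — assuming (hypothetically) that the release-test harness simply iterates over the YAML's frameworks list when building the algorithm; this is not RLlib's actual runner code:

```python
# Hypothetical harness loop, for illustration only: with frameworks restricted
# to ["tf"], a torch variant of this release test is never built or run.
from ray.rllib.algorithms.appo import APPOConfig

stanza = {
    "frameworks": ["tf"],            # the line added by this PR
    "stop": {"time_total_s": 1800},  # stop criteria from the YAML above
}

for fw in stanza["frameworks"]:
    config = (
        APPOConfig()
        .environment("PongNoFrameskip-v4")
        .framework(fw)               # "torch" never appears in the list, so it is never exercised
    )
    algo = config.build()            # requires Ray + Atari deps; train until the stop criteria are met
```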
Contributor commented on this change:
The reason torch does not learn this task in this test is simply speed (a very similar issue affects the corresponding PPO test). After a deeper investigation into our PPO+GPU+Atari+CUDA11.6 tests, @smorad and I found the problem to be simply the .backward() call on the loss. We made sure it's not the GPU->CPU copying, not the optimizer.zero_grad, not the input dtypes (e.g. double instead of float), not the loss math itself, and not the model (same number of trainable params as its tf counterpart). We also checked the actual torch graph with the graphviz + torchviz packages and saw nothing suspicious. We might have to go back and try different torch+CUDA versions again (we did this once before and found that CUDA 10.2 was actually quite good for torch) to overcome these test failures. :/
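For reference, a rough sketch of the kind of isolation described above: timing only the loss.backward() call with explicit CUDA synchronization, and dumping the autograd graph via torchviz. The model and loss below are toy placeholders, not RLlib's APPO loss:

```python
# Toy reproduction of the debugging steps: autograd graph dump + backward() timing.
import time

import torch
import torch.nn as nn
from torchviz import make_dot  # pip install torchviz (also needs the graphviz binary)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 6)).to(device)
obs = torch.randn(1024, 512, device=device)
loss = model(obs).pow(2).mean()

# Render the autograd graph to look for anything structurally suspicious.
make_dot(loss, params=dict(model.named_parameters())).render("loss_graph")

# Time only the backward pass; synchronize so queued CUDA kernels are included.
if device == "cuda":
    torch.cuda.synchronize()
start = time.perf_counter()
loss.backward()
if device == "cuda":
    torch.cuda.synchronize()
print(f"backward() took {time.perf_counter() - start:.4f}s on {device}")
```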

Contributor (Author) replied:

Let's hold off on this PR until we reach a conclusion on @sven1977's investigation into the torch + CUDA issue.

kouroshHakha (Contributor, Author) commented:

Closing in favor of #30361.
