
[RLlib] disabled torch release test on APPO #30282

Closed

Conversation

kouroshHakha (Contributor) commented:

Signed-off-by: Kourosh Hakhamaneshi [email protected]

Why are these changes needed?

[Screenshot attached: Screen Shot 2022-11-14 at 11 27 15 PM]

This test has just never passed; the torch variant has always been broken for some reason. We should disable it for now until this gets fixed; otherwise the release-test notifications will be noisy.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
```diff
@@ -7,6 +7,8 @@ appo-pongnoframeskip-v4:
         timesteps_total: 5000000
     stop:
         time_total_s: 1800
+    # TODO (Kourosh): Torch and tf2 do not learn as good as tf. Why?
+    frameworks: ["tf"]
```
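For context, a minimal sketch of why this one-line restriction silences the torch failures — assuming (hypothetically) that the release-test harness simply iterates over the YAML's frameworks list when building the algorithm; this is not RLlib's actual runner code:

```python
# Hypothetical harness loop, for illustration only: with frameworks restricted
# to ["tf"], a torch variant of this release test is never built or run.
from ray.rllib.algorithms.appo import APPOConfig

stanza = {
    "frameworks": ["tf"],            # the line added by this PR
    "stop": {"time_total_s": 1800},  # stop criteria from the YAML above
}

for fw in stanza["frameworks"]:
    config = (
        APPOConfig()
        .environment("PongNoFrameskip-v4")
        .framework(fw)               # "torch" never appears in the list, so it is never exercised
    )
    algo = config.build()            # requires Ray + Atari deps; train until the stop criteria are met
```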
Contributor commented on this change:
The reason torch does not learn this task in this test is simply speed (a very similar issue affects the corresponding PPO test). After a deeper investigation into our PPO+GPU+Atari+CUDA11.6 tests, @smorad and I found the problem to be simply the .backward() call on the loss. We made sure it's not the GPU->CPU copying, not the optimizer.zero_grad, not the input dtypes (e.g. double instead of float), not the loss math itself, and not the model (same number of trainable params as its tf counterpart). We also checked the actual torch graph with the graphviz + torchviz packages and saw nothing suspicious. We might have to go back and try different torch+CUDA versions again (we did this once before and found that CUDA 10.2 was actually quite good for torch) to overcome these test failures. :/
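For reference, a rough sketch of the kind of isolation described above: timing only the loss.backward() call with explicit CUDA synchronization, and dumping the autograd graph via torchviz. The model and loss below are toy placeholders, not RLlib's APPO loss:

```python
# Toy reproduction of the debugging steps: autograd graph dump + backward() timing.
import time

import torch
import torch.nn as nn
from torchviz import make_dot  # pip install torchviz (also needs the graphviz binary)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 6)).to(device)
obs = torch.randn(1024, 512, device=device)
loss = model(obs).pow(2).mean()

# Render the autograd graph to look for anything structurally suspicious.
make_dot(loss, params=dict(model.named_parameters())).render("loss_graph")

# Time only the backward pass; synchronize so queued CUDA kernels are included.
if device == "cuda":
    torch.cuda.synchronize()
start = time.perf_counter()
loss.backward()
if device == "cuda":
    torch.cuda.synchronize()
print(f"backward() took {time.perf_counter() - start:.4f}s on {device}")
```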

Contributor (Author) replied:

Let's hold off on this PR until we reach a conclusion on @sven1977's investigation into the torch + CUDA issue.

kouroshHakha (Contributor, Author) commented:

Closing in favor of #30361.
