[RLlib] Fix A2C release test crash (rollout_fragment_length vs train_batch_size). #30361

Merged: 11 commits into ray-project:master on Nov 21, 2022

Conversation

@sven1977 (Contributor) commented on Nov 16, 2022

Signed-off-by: sven1977 [email protected]

torch + CUDA 11.x seems to slow down our torch algos considerably, to the point that most torch learning tests fail.

Adding an override pip3 install torch==..+cu102 torchvision==..+cu102 to our release tests' app-config fixes the problem.

However, we should also change our ML docker back to CUDA 10.2!

We will have to investigate further why this happens, starting with a simple SL + GPU + CNN workload. So far, however, we have not found any flaws in RLlib itself that would explain these slowdowns.

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@amogkam self-assigned this on Nov 17, 2022
@kouroshHakha (Contributor) commented:

@sven1977 Please separate the A2C PR from the CUDA changes.

@@ -2256,6 +2256,57 @@ def is_policy_to_train(pid, batch=None):

return policies, is_policy_to_train

def validate_train_batch_size_vs_rollout_fragment_length(self) -> None:
A contributor commented on this diff:

Stupid questions: 1) Why do we need to allow specifying both values such that they only roughly match each other? Why not just error out whenever train_batch_size does not match the value expected from rollout_fragment_length? 2) Is setting rollout_fragment_length = "auto" always recommended?

@sven1977 (Contributor Author) replied:

Well, for off-policy algos, rollout_fragment_length can be set to anything; it is not linked to the train batch size. For on-policy algos, I'm thinking that users would sometimes want to set the rollout fragment length manually to force a certain rollout behavior; however, through this new error, we force them to be aware that this also affects their train batch size.
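For context, here is a minimal sketch of the kind of consistency check being discussed. It is illustrative only and not the actual implementation merged in this PR; the attribute names follow RLlib's AlgorithmConfig conventions, but the exact tolerance logic is an assumption.

```python
# Illustrative sketch only -- NOT the exact check added by this PR.
# Idea: for on-policy algos, one sampling round collects roughly
#   num_rollout_workers * num_envs_per_worker * rollout_fragment_length
# timesteps, so an explicitly set rollout_fragment_length that is far off
# from train_batch_size points at a misconfiguration and should error out.

def validate_train_batch_size_vs_rollout_fragment_length(config) -> None:
    """Raise a ValueError if the two settings are clearly inconsistent."""
    # "auto" derives the fragment length from train_batch_size, so there is
    # nothing to validate in that case.
    if config.rollout_fragment_length == "auto":
        return

    samples_per_round = (
        max(config.num_rollout_workers, 1)
        * config.num_envs_per_worker
        * config.rollout_fragment_length
    )
    # Allow some slack (here: one extra fragment) rather than requiring an
    # exact match; the real check may use a different tolerance.
    if not (
        samples_per_round
        <= config.train_batch_size
        <= samples_per_round + config.rollout_fragment_length
    ):
        raise ValueError(
            f"train_batch_size ({config.train_batch_size}) does not match the "
            f"~{samples_per_round} timesteps collected per sampling round "
            "(num_rollout_workers * num_envs_per_worker * "
            "rollout_fragment_length). Either adjust one of these settings "
            "or set rollout_fragment_length='auto'."
        )
```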

@kouroshHakha (Contributor) commented:

Created a separate PR for the CUDA 10.2 downgrade: #30512. Please make this PR only about the validation of rollout_fragment_length in on-policy algos.
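As a usage note (hypothetical example, not part of this PR's diff): with a check like the one sketched above in place, an on-policy config either picks explicitly matching values or falls back to the "auto" setting discussed earlier. The API calls follow RLlib's config-builder pattern of that era; the concrete numbers are illustrative.

```python
from ray.rllib.algorithms.a2c import A2CConfig

# Explicit values: 2 workers * 1 env/worker * 100 steps = 200 timesteps per
# sampling round, matching train_batch_size=200, so such a check would pass.
config = (
    A2CConfig()
    .rollouts(num_rollout_workers=2, rollout_fragment_length=100)
    .training(train_batch_size=200)
)

# Alternatively, let RLlib derive the fragment length from train_batch_size.
config = (
    A2CConfig()
    .rollouts(num_rollout_workers=2, rollout_fragment_length="auto")
    .training(train_batch_size=200)
)
```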

@sven1977 changed the title from "[RLlib] Move back to torch + CUDA10.2 (for better release test performance)" to "[RLlib] Fix A2C release test crash (rollout_fragment_length vs train_batch_size)." on Nov 20, 2022
@kouroshHakha added the tests-ok label ("The tagger certifies test failures are unrelated and assumes personal liability.") on Nov 21, 2022
@kouroshHakha (Contributor) left a comment:

LGTM

@sven1977 merged commit b8b32f3 into ray-project:master on Nov 21, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
…eric check for different on-policy algos to use. (ray-project#30361)
