[RLlib] Fix A2C release test crash (rollout_fragment_length vs train_batch_size). #30361

Merged: 11 commits into ray-project:master on Nov 21, 2022

Conversation

@sven1977 (Contributor) commented on Nov 16, 2022

Signed-off-by: sven1977 [email protected]

torch + CUDA 11.x seems to slow down our torch algos considerably, to the point that most torch learning tests fail.

Adding an override pip3 install torch==..+cu102 torchvision==..+cu102 to our release tests' app-config fixes the problem.

However, we should also change our ML docker back to CUDA 10.2!

We will have to investigate further why this happens, starting with a simple SL + GPU + CNN workload. So far, however, we have not found any flaws in RLlib itself that would explain these slowdowns.

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@amogkam self-assigned this on Nov 17, 2022
@kouroshHakha (Contributor) commented:

@sven1977 Please separate the A2C PR from the CUDA changes.

@@ -2256,6 +2256,57 @@ def is_policy_to_train(pid, batch=None):

return policies, is_policy_to_train

def validate_train_batch_size_vs_rollout_fragment_length(self) -> None:
A contributor commented on this diff:

Stupid questions: 1) Why do we need to allow specifying both values such that they only roughly match each other? Why not just error out whenever train_batch_size does not match the value expected from rollout_fragment_length? 2) Is setting rollout_fragment_length = "auto" always recommended?

@sven1977 (Contributor Author) replied:

Well, for off-policy algos, rollout_fragment_length can be set to anything; it is not linked to the train batch size. For on-policy algos, I'm thinking that users would sometimes want to set the rollout fragment length manually to force a certain rollout behavior; however, through this new error, we force them to be aware that this also affects their train batch size.
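For context, here is a minimal sketch of the kind of consistency check being discussed. It is illustrative only and not the actual implementation merged in this PR; the attribute names follow RLlib's AlgorithmConfig conventions, but the exact tolerance logic is an assumption.

```python
# Illustrative sketch only -- NOT the exact check added by this PR.
# Idea: for on-policy algos, one sampling round collects roughly
#   num_rollout_workers * num_envs_per_worker * rollout_fragment_length
# timesteps, so an explicitly set rollout_fragment_length that is far off
# from train_batch_size points at a misconfiguration and should error out.

def validate_train_batch_size_vs_rollout_fragment_length(config) -> None:
    """Raise a ValueError if the two settings are clearly inconsistent."""
    # "auto" derives the fragment length from train_batch_size, so there is
    # nothing to validate in that case.
    if config.rollout_fragment_length == "auto":
        return

    samples_per_round = (
        max(config.num_rollout_workers, 1)
        * config.num_envs_per_worker
        * config.rollout_fragment_length
    )
    # Allow some slack (here: one extra fragment) rather than requiring an
    # exact match; the real check may use a different tolerance.
    if not (
        samples_per_round
        <= config.train_batch_size
        <= samples_per_round + config.rollout_fragment_length
    ):
        raise ValueError(
            f"train_batch_size ({config.train_batch_size}) does not match the "
            f"~{samples_per_round} timesteps collected per sampling round "
            "(num_rollout_workers * num_envs_per_worker * "
            "rollout_fragment_length). Either adjust one of these settings "
            "or set rollout_fragment_length='auto'."
        )
```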

@kouroshHakha (Contributor) commented:

Created a separate PR for the CUDA 10.2 downgrade: #30512. Please make this PR only about the validation of rollout_fragment_length in on-policy algos.
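As a usage note (hypothetical example, not part of this PR's diff): with a check like the one sketched above in place, an on-policy config either picks explicitly matching values or falls back to the "auto" setting discussed earlier. The API calls follow RLlib's config-builder pattern of that era; the concrete numbers are illustrative.

```python
from ray.rllib.algorithms.a2c import A2CConfig

# Explicit values: 2 workers * 1 env/worker * 100 steps = 200 timesteps per
# sampling round, matching train_batch_size=200, so such a check would pass.
config = (
    A2CConfig()
    .rollouts(num_rollout_workers=2, rollout_fragment_length=100)
    .training(train_batch_size=200)
)

# Alternatively, let RLlib derive the fragment length from train_batch_size.
config = (
    A2CConfig()
    .rollouts(num_rollout_workers=2, rollout_fragment_length="auto")
    .training(train_batch_size=200)
)
```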

@sven1977 changed the title from "[RLlib] Move back to torch + CUDA10.2 (for better release test performance)" to "[RLlib] Fix A2C release test crash (rollout_fragment_length vs train_batch_size)." on Nov 20, 2022
@kouroshHakha added the tests-ok label ("The tagger certifies test failures are unrelated and assumes personal liability.") on Nov 21, 2022
@kouroshHakha (Contributor) left a comment:

LGTM

@sven1977 merged commit b8b32f3 into ray-project:master on Nov 21, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
…eric check for different on-policy algos to use. (ray-project#30361)
