[RLlib] Fix A2C release test crash (rollout_fragment_length vs train_batch_size). #30361
Conversation
Signed-off-by: sven1977 <[email protected]>
@sven1977 Please separate the A2C PR from the CUDA one.
@@ -2256,6 +2256,57 @@ def is_policy_to_train(pid, batch=None):

        return policies, is_policy_to_train

    def validate_train_batch_size_vs_rollout_fragment_length(self) -> None:
Stupid questions: 1) Why do we need to allow specifying both values so that they only roughly match each other? Why not simply error out when train_batch_size does not match the value expected from rollout_fragment_length? 2) Is setting rollout_fragment_length = "auto" always recommended?
Well, for off-policy algos, rollout_fragment_length can be anything and is not linked to the train batch size. For on-policy algos, I'm thinking that users may sometimes want to set the rollout fragment length manually to force a certain rollout behavior; through this new error, we make them aware that this will affect their train batch size.
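For reference, here is a minimal, self-contained sketch of what such a cross-check could look like. This is an illustration only, not the exact code merged in this PR: the config fields (train_batch_size, rollout_fragment_length, num_rollout_workers, num_envs_per_worker) mirror RLlib's AlgorithmConfig settings, but the 10% tolerance, the error text, and the helper class are assumptions.

```python
# Hypothetical, self-contained sketch (not the exact code in this PR) of how an
# on-policy algorithm config could cross-check `train_batch_size` against
# `rollout_fragment_length`.
from dataclasses import dataclass
from typing import Union


@dataclass
class _OnPolicyConfigSketch:
    train_batch_size: int
    rollout_fragment_length: Union[int, str]  # an int or the string "auto"
    num_rollout_workers: int = 2
    num_envs_per_worker: int = 1

    def validate_train_batch_size_vs_rollout_fragment_length(self) -> None:
        # "auto" means the fragment length gets derived from `train_batch_size`,
        # so there is nothing to validate in that case.
        if self.rollout_fragment_length == "auto":
            return
        # One sampling step collects roughly this many timesteps across all
        # workers and vectorized sub-environments.
        collected = (
            max(self.num_rollout_workers, 1)
            * self.num_envs_per_worker
            * self.rollout_fragment_length
        )
        # The 10% tolerance here is an assumption for this sketch.
        if abs(collected - self.train_batch_size) > 0.1 * self.train_batch_size:
            suggested = self.train_batch_size // (
                max(self.num_rollout_workers, 1) * self.num_envs_per_worker
            )
            raise ValueError(
                f"`train_batch_size` ({self.train_batch_size}) does not match the "
                f"~{collected} timesteps your rollout settings will collect per "
                f"iteration. Set `rollout_fragment_length={suggested}` or use "
                f"`rollout_fragment_length='auto'`."
            )


# Usage example: this raises, because 2 workers * 50 timesteps = 100 != 500 (±10%).
# _OnPolicyConfigSketch(train_batch_size=500, rollout_fragment_length=50) \
#     .validate_train_batch_size_vs_rollout_fragment_length()
```

With rollout_fragment_length="auto", the sketch skips the check entirely, since the fragment length would be derived from train_batch_size in that case.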
Created a separate PR for the CUDA 10.2 downgrade: #30512. Please make this PR only about the validation of rollout_fragment_length in on-policy algos.
…h_cuda_10_2
Signed-off-by: sven1977 <[email protected]>
# Conflicts:
#	release/rllib_tests/app_config.yaml
LGTM
…eric check for different on-policy algos to use. (ray-project#30361)
Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: sven1977 <[email protected]>
torch + CUDA 11.x seems to slow down our torch algos considerably, to the point where most torch learning tests fail.
Adding an override
pip3 install torch==..+cu102 torchvision==..+cu102
to our release-test app config fixes the problem. However, we should also switch our ML docker back to CUDA 10.2!
We will have to investigate further why this happens, starting with a simple SL+GPU+CNN workload.
However, so far we cannot find any flaws in RLlib itself that would explain these slowdowns.
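To verify which CUDA build of torch actually ends up in the release-test environment, a quick diagnostic along these lines can help. This is not part of this PR; it only uses the public torch API.

```python
# Diagnostic sketch (not part of this PR): print the torch build and the CUDA
# toolkit it was compiled against, to confirm whether the +cu102 override took
# effect in the release-test environment.
import torch

print("torch version:", torch.__version__)       # e.g. "1.13.0+cu102"
print("built for CUDA:", torch.version.cuda)     # e.g. "10.2"
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```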
Why are these changes needed?
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.