[rllib][gcs][placementgroups] instability issues running tune/rllib #18003
Comments
Thank you so much for discovering these. I'll investigate further and might create child issues to track these individually after finding out the cause.
@krfricke, after running compact-regression-test.yaml, I am also getting: […]
I can't repro 5 and 6. Does this come up immediately? (It ran for ~1.5 hours without any problems.) If it still comes up for you, can you post some local environment information (Python version and …)?
Python 3.7
CC @wuisawesome, I think the placement groups are potentially leaking or not being cleaned up appropriately.
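As a hedged sketch, one way to check for leaked placement groups from a driver is to list what the GCS knows about via `ray.util.placement_group_table()` (a real Ray API); the interpretation of which states indicate a leak is an assumption:

```python
# Sketch: list all placement groups registered with the GCS and their states.
# Assumes a running Ray cluster reachable at the default address.
import ray
from ray.util import placement_group_table

ray.init(address="auto")

for pg_id, info in placement_group_table().items():
    # Assumption: a leaked group typically shows up as CREATED or PENDING
    # long after the trial that requested it has finished.
    print(pg_id, info["name"], info["state"])
```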
I believe this should be fixed in the master. Please reopen if you see the issue again.
When I run RLlib on Ray 1.5.2:
(pid=191) 2021-08-22 10:45:21,492 INFO tune.py:550 -- Total run time: 1095.71 seconds (1094.69 seconds for the tuning loop).
RLlib sometimes requests a lot of resources, and if the cluster cannot scale up to accommodate the request, it ends up adding nodes and then removing them for being idle, hanging forever. (For example, it requests resources that would need 200 nodes, but the cluster can only scale to 10 nodes, so it keeps adding 10 nodes and removing them while the trials stay "pending".)
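A minimal sketch of this over-request pattern; the worker count and agent choice are assumed for illustration and do not come from the original report:

```python
# Sketch: with num_workers=400, IMPALA asks for roughly 400 CPUs. On a
# cluster capped at 10 nodes this demand can never be satisfied, so the
# autoscaler keeps adding and removing idle nodes while the trial stays
# "pending".
import ray
from ray import tune

ray.init(address="auto")
tune.run(
    "IMPALA",
    config={
        "env": "CartPole-v0",
        "num_workers": 400,  # assumed value, far beyond a 10-node cluster
        "num_gpus": 0,
    },
)
```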
I think we should have e2e tests of RLlib with GPUs. These might already exist, but for some reason I am not able to run, for example (the cluster keeps adding and removing nodes, as in issue 3):
ANYSCALE_DEBUG=1 RAY_ADDRESS=anyscale://timeout_fix_cluster_final2_aws?cluster_env=riot:5 rllib train -f ../ray/rllib/tuned_examples/compact-regression-test.yaml
or
ANYSCALE_DEBUG=1 RAY_ADDRESS=anyscale://timeout_fix_cluster_final2_aws?cluster_env=riot:5 rllib train -f ../ray/rllib/tuned_examples/impala/atari-impala-large.yaml
When I run
ANYSCALE_DEBUG=1 RAY_ADDRESS=anyscale://timeout_fix_cluster_final2_aws?cluster_env=riot:5 rllib train -f ../ray/rllib/tuned_examples/compact-regression-test.yaml
I get a lot of: […]
CC @wuisawesome
What is the problem?
Ray version and other system information (Python version, TensorFlow version, OS):
Reproduction (REQUIRED)
Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):
If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".
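As a hedged illustration of the kind of self-contained snippet the template asks for (the trainable and values below are hypothetical, not taken from this issue):

```python
# Hypothetical minimal repro skeleton: no external dependencies beyond Ray,
# using a trivial mock trainable instead of a real RLlib environment.
import ray
from ray import tune

def mock_trainable(config):
    # Report a fake metric so Tune has something to track.
    tune.report(score=config["x"] ** 2)

ray.init(num_cpus=2)
tune.run(mock_trainable, config={"x": tune.grid_search([1, 2, 3])})
```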