[Bug] Possible tune / core bug related to correct resuming of trials on a dynamic cluster #21825
Comments
Could this be related to something with the GCS server (#11309) and the "too many files" problem (#12227) related to ulimit? I have no possibility to change ulimit in my case. However, I also only have 70 workers at a time... but since every hour 3 are crashing and 3 new workers are added, maybe after a while this still causes a problem because internally the crashed ones are not handled correctly? How does ulimit play into this anyway, and which part of the whole system opens so many files?
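For reference, a minimal diagnostic sketch (not Ray code; assumes Linux and only the standard library) for checking what `ulimit -n` allows for a process and how many descriptors it currently holds:

```python
import os
import resource

# Soft/hard limit on open file descriptors for this process
# (this is what `ulimit -n` reports for the shell that started it).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE: soft={soft}, hard={hard}")

# Count descriptors a process currently holds (Linux-only, via /proc).
def open_fd_count(pid=None):
    pid = os.getpid() if pid is None else pid
    return len(os.listdir(f"/proc/{pid}/fd"))

print("open fds (this process):", open_fd_count())
```

Running something like this against the raylet/GCS PIDs over time would show whether descriptors really accumulate as workers churn.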
Hey @thoglu, would it be possible to send over a reproducible script? Is this something that only occurs after multiple days of running, or can it be reproduced by killing trials after a few seconds?
And can you also try with the latest Ray 1.11 and see if this is still a problem for you?
I will try again with the latest version ASAP. It is potentially hard to make this reproducible; it only occurs after several days of running. If I kill jobs before that, they are always resumed correctly; only after a while (~7 days) are scripts not resumed correctly. My suspicion is that this is connected to ulimit and the logging directory, but I am not sure if this is possible.
Hi, I'm a bot from the Ray team :) To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months. If there is no further activity within the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public Slack channel.
Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message. Please feel free to reopen or open a new issue if you'd still like it to be addressed. Again, you can always ask for help on our discussion forum or Ray's public Slack channel. Thanks again for opening the issue!
Search before asking
Ray Component
Ray Core, Ray Tune
What happened + What you expected to happen
I've had a discussion on Slack with @krfricke about a recent Tune bugfix that came with this PR (#20536), which is, among other things, concerned with correct resuming of trials. I have always had problems in my setup with resuming after a few days, so I installed the latest master nightly build, which included that PR, and started running a tune session on my cluster. It went smoothly and I thought the fix had solved my problem, but then I got the following "reset" on the night from the 22nd to the 23rd.
Attached you find two plots of the validation loss for a tune job with 70 trials (I run 7 discrete settings with tune.grid_search() and num_samples=10, so 10 repeats per grid point and 70 trials total).
You can see that a bunch of trials get reset to the beginning and start over. The second plot, against walltime, shows that trials stopped making progress on Jan 22nd; they then stalled for a while and are only now being reset.
I should say something about my somewhat peculiar cluster setup: I run on an SGE grid engine with 24-hour worker slots. What I currently do is keep one Ray master node permanently on a separate machine. On this machine I start Ray via
subprocess.run(["ray", "start" ,"--head" ,"--num-cpus=1" , "--num-gpus=0", "--temp-dir=%s" % ray_logdir], stdout=subprocess.PIPE)
which works as the master node.
Additionally, I run a cron job which creates 3 new worker nodes as SGE slots every hour, while 3 workers crash every hour when their respective SGE slots end. So every hour 3 workers are basically crashing and 3 new workers are added to Ray.
On the worker nodes, I execute
os.system("ray start --address='%s' --redis-password='5241590000000000' --num-cpus=1 --num-gpus=0 --temp-dir=%s" % (head_ip, ray_logdir))
to connect to the master node.
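The hourly cron job itself is not shown above; a rough, hypothetical sketch of what it does (the `ray_worker.sh` wrapper and the qsub resource flags are placeholders for illustration, not taken from the actual setup):

```python
# Hypothetical sketch of the hourly cron job: submit 3 SGE worker jobs,
# each of which runs the `ray start --address=...` command shown above
# on the node it gets allocated. Script name and flags are placeholders.
import subprocess

N_NEW_WORKERS = 3

for _ in range(N_NEW_WORKERS):
    subprocess.run(
        ["qsub", "-l", "h_rt=24:00:00", "ray_worker.sh"],  # 24-hour SGE slot
        check=True,
    )
```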
Furthermore, on the master node, I then call a Python script which first executes
ray.init()
to connect to the cluster, and then tune.run()
to run the tune trials. Because I add 3 workers per hour, and each lives for 24 hours, I get a total of 72 nodes permanently on, which can handle my 70 trials. So I have continuous crashing and resuming, which means I rely on correct resuming. As you can see, resuming worked perfectly until the hiccup on Jan 22/23. Now, the remaining few trials are still at a standstill, so their progress basically stopped on Jan 22/23, with the exception of the one trial that is running on the head node. The cluster status tells me 70/70 trials are running, which is definitely not the case: the majority is still stopped, and the ones that are running have been reset to the beginning.
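For concreteness, a minimal sketch of what the driver script on the master node looks like in this setup (the trainable, metric, and experiment name are placeholders; only the grid-search/num_samples/resume structure reflects the description above):

```python
# Minimal sketch of the driver: 7 grid points x num_samples=10 -> 70 trials.
import ray
from ray import tune

ray.init(address="auto")  # attach to the already running cluster

def train_fn(config):
    # Placeholder training loop that reports a metric every "epoch".
    for epoch in range(100):
        tune.report(val_loss=config["setting"] / (epoch + 1))

tune.run(
    train_fn,
    name="my_experiment",  # fixed name so a later resume finds the same run
    config={"setting": tune.grid_search([1, 2, 3, 4, 5, 6, 7])},
    num_samples=10,        # 10 repeats per grid point
    resume=True,           # rely on Tune to restore trials after node churn
    log_to_file=True,      # later switched to False, see the updates below
)
```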
Any idea what this could be or how I could debug it? I really want to solve this. In my opinion, it should be possible to set up Ray to be stable even if the cluster is as dynamic as in my case.
UPDATE: I just killed the tune job on the head node, left Ray running, and restarted the tune job with -resume. It actually failed to restart with:
Failed to get the system config for raylet because it is dead. Worker will terminate. Status: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
I also found a few "too many open files" errors in raylet.err; could they have something to do with this? Does this relate to the log files that are created? My log folder size from this 1 week of running is 30 GB, with tons of files, and I have used log_to_file=True. I am currently running again with the option "log_to_file=False" to see what happens.
I have over 1900 log files in the logdir of the ray session, and over 900 files in the "old" subdirectory. The vast majority of the 30 GB comes from hundreds of raylet files which are each 80 MB.
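A quick, stdlib-only way to see which files dominate such a session directory (the path below is a placeholder for the actual temp dir passed via --temp-dir):

```python
# List the 20 largest files under the Ray session log directory.
import os

log_dir = "/path/to/ray_logdir/session_latest/logs"  # placeholder path
sizes = []
for root, _, files in os.walk(log_dir):
    for name in files:
        path = os.path.join(root, name)
        sizes.append((os.path.getsize(path), path))

for size, path in sorted(sizes, reverse=True)[:20]:
    print(f"{size / 1e6:10.1f} MB  {path}")
print(f"total: {sum(s for s, _ in sizes) / 1e9:.1f} GB")
```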
UPDATE2:
After running a second time for a week with "log_to_file=False" (but resume=True probably overrode this), the same errors appear and the system seems to be hanging again. Another error I just discovered is in dashboard.log, of the type:
2022-01-24 17:31:43,188 ERROR node_head.py:241 -- Error updating node stats of 6ae37dcaf2c024fba86ceb92c1d7b9dd6aed518a0b8c1cf17e956ce9.
Traceback (most recent call last):
  raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
  status = StatusCode.UNAVAILABLE
  details = "failed to connect to all addresses"
  debug_error_string = "{"created":"@1643041903.188611088","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1643041903.188609397","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
>
Any ideas?
Versions / Dependencies
Ray 2.0.0 nightly build (from ~17th of January 2022), Python 3.89, Scientific Linux 7
Reproduction script
Currently no script available.
Anything else
This issue has occurred multiple times, but I thought the latest bugfixes in the above-mentioned pull request would have fixed it.
Are you willing to submit a PR?