[Bug] Possible tune / core bug related to correct resuming of trials on a dynamic cluster #21825

Closed
1 of 2 tasks
thoglu opened this issue Jan 24, 2022 · 6 comments
Labels
  • bug: Something that is supposed to be working; but isn't
  • needs-repro-script: Issue needs a runnable script to be reproduced
  • stale: The issue is stale. It will be closed within 7 days unless there is further conversation
  • tune: Tune-related issues

Comments

thoglu commented Jan 24, 2022

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Core, Ray Tune

What happened + What you expected to happen

I've had a discussion on Slack with @krfricke about a recent Tune bugfix that came with this PR (#20536), which is, among other things, concerned with correct resuming of trials. I have always had problems with resuming in my setup after a few days of running, so I installed the latest nightly build of master, which includes that PR, and started a tune session on my cluster. It ran smoothly and I thought that fix had solved my problem, but then I got the following "reset" during the night from the 22nd to the 23rd.

Attached are two plots of the validation loss for a tune job with 70 trials (7 discrete settings via tune.grid_search() with num_samples=10, i.e. 10 repeats per grid point, 70 trials in total).
[Attached plots: training_run_step, training_run (validation loss vs. step and vs. walltime)]

You can see that a bunch of trials get reset to the beginning and start over. The second plot (vs. walltime) shows that these trials stopped making progress on Jan 22nd; they then stalled for a while and are only now being reset.

I should say something about my somewhat peculiar cluster setup: I run on an SGE grid engine where I have 24-hour worker slots. Currently I have one ray master node that is permanently on a separate machine. On this machine I start ray via
subprocess.run(["ray", "start", "--head", "--num-cpus=1", "--num-gpus=0", "--temp-dir=%s" % ray_logdir], stdout=subprocess.PIPE)
which acts as the master node.

Additionally, I run a cronjob which creates 3 new worker nodes as SGE slots every hour, and 3 workers crash every hour when their respective SGE slots end. So every hour 3 workers are effectively crashing and 3 new workers are added to ray.

On the worker nodes, I execute
os.system("ray start --address='%s' --redis-password='5241590000000000' --num-cpus=1 --num-gpus=0 --temp-dir=%s" % (head_ip, ray_logdir))
to connect to the master node.

Furthermore, on the master node, I then call a Python script which first executes ray.init() to connect to the cluster and then tune.run() to start the tune trials.
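
For context, that driver script is structured roughly like this (a stripped-down sketch, not my actual code: the trainable, the config values, the experiment name and extra arguments such as max_failures are placeholders for illustration):

    import ray
    from ray import tune

    def my_trainable(config):                  # placeholder trainable
        for step in range(1000):
            tune.report(val_loss=1.0 / (step + 1))

    ray.init(address="auto")                   # connect to the already-running head node

    tune.run(
        my_trainable,
        name="my_experiment",                                   # fixed name so resume can find the run
        config={"setting": tune.grid_search(list(range(7)))},   # 7 discrete settings (placeholder values)
        num_samples=10,                                         # 10 repeats per grid point -> 70 trials
        max_failures=-1,                                        # retry failed trials indefinitely, e.g. when their node went away
        resume=True,                                            # only set on restarts of the driver
        log_to_file=True,
    )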

Because I add 3 workers per hour and each lives for 24 hours, I have a total of 72 nodes permanently available, which can handle my 70 trials. So there is continuous crashing and resuming, which means I rely on resuming working correctly. As you can see, resuming worked perfectly until the hiccup on Jan 22/23. Now the remaining trials are still at a standstill, so their progress basically stopped on Jan 22/23, with the exception of the one trial that is running on the head node. The cluster status tells me 70/70 trials are running, which is definitely not the case: the majority is still stopped, and the ones that are running have been reset to the beginning.
Any idea what this could be or how I could debug it? I really want to solve this; in my opinion it should be possible to set up ray to be stable even when the cluster is as dynamic as in my case.

UPDATE: I just killed the tune job on the head node, left ray running, and restarted the tune job with -resume. It actually failed to restart with:
Failed to get the system config for raylet because it is dead. Worker will terminate. Status: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:

I also found a few "too many open files" errors in raylet.err. Could they have something to do with this? Does this relate to the log files that are created? My log folder from this one week of running is 30 GB with tons of files, and I had used log_to_file=True. I am now running again with log_to_file=False to see what happens.
I have over 1900 log files in the logdir of the ray session, and over 900 files in the "old" subdirectory. The vast majority of the 30 GB comes from hundreds of raylet files which are each 80 MB.
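
In case anyone wants to reproduce these numbers for their own session: a quick way to get them is a small walk over the ray temp dir, roughly like this (the path is a placeholder for whatever --temp-dir points to; ray keeps the per-session logs under session_latest/logs):

    import os

    ray_logdir = "/path/to/ray_logdir"          # placeholder: the --temp-dir passed to `ray start`
    logs = os.path.join(ray_logdir, "session_latest", "logs")

    n_files, total_bytes = 0, 0
    for root, _, files in os.walk(logs):
        for name in files:
            try:
                total_bytes += os.path.getsize(os.path.join(root, name))
                n_files += 1
            except OSError:                     # a log file may disappear while walking
                pass
    print("%d log files, %.1f GB" % (n_files, total_bytes / 1e9))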

UPDATE 2:
After running a second time for a week with "log_to_file=False" (though resume=True probably overwrote this), the same errors appear and the system seems to be hanging again. Another error I just discovered is in dashboard.log, of the type:

2022-01-24 17:31:43,188 ERROR node_head.py:241 -- Error updating node stats of 6ae37dcaf2c024fba86ceb92c1d7b9dd6aed518a0b8c1cf17e956ce9.
Traceback (most recent call last):
  raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
  status = StatusCode.UNAVAILABLE
  details = "failed to connect to all addresses"
  debug_error_string = "{"created":"@1643041903.188611088","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1643041903.188609397","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"

Any ideas?

Versions / Dependencies

2.0.0 nightly build (from ~17th of January 2022), Python 3.8.9, Scientific Linux 7

Reproduction script

Currently no script available.

Anything else

This issue has occurred multiple times, but I thought the latest bugfixes from the pull request mentioned above would have fixed it.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
thoglu added the bug and triage labels Jan 24, 2022
xwjiang2010 added the tune label and removed the triage label Jan 25, 2022
thoglu commented Jan 25, 2022

Could this be related to something with the GCS server (#11309) and the "too many files" problem (#12227) related to ulimit? I have no possibility to change ulimit in my case. However, I also only have 70 workers at a time, but since every hour 3 of them crash and 3 new ones are added, maybe after some time this still causes a problem because internally the crashed ones are not handled correctly? How does ulimit play into this anyway, i.e. which part of the whole system opens so many files?
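
For what it's worth, this is roughly how one could check the open-file limit and the descriptor usage of the ray processes on a node (Linux-only, via /proc; the raylet PID below is a placeholder and would have to be filled in, e.g. from pgrep raylet):

    import os
    import resource

    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("open-file limit: soft=%d, hard=%d" % (soft, hard))

    # file descriptors currently held by this process
    print("fds (this process):", len(os.listdir("/proc/self/fd")))

    # for the raylet / gcs_server, substitute their PIDs
    raylet_pid = 12345                          # placeholder
    fd_dir = "/proc/%d/fd" % raylet_pid
    if os.path.isdir(fd_dir):
        print("fds (raylet):", len(os.listdir(fd_dir)))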

amogkam commented Apr 11, 2022

Hey @thoglu would it be possible to send over a reproducible script? Is this something that only occurs after multiple days of running or can it be reproduced after killing trials after a few seconds?

amogkam added the needs-repro-script label Apr 11, 2022
amogkam commented Apr 11, 2022

And can you also try with the latest Ray 1.11 and see if this is still a problem for you?

thoglu commented Apr 25, 2022

I will try again with the latest version asap. It is potentially hard to make this reproducible: it only occurs after several days of running. If I kill jobs early they are always resumed correctly; only after a while (~7 days) are scripts not resumed correctly. My suspicion is that this is connected to ulimit and the logging directory, but I am not sure if that is possible.
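
A stripped-down trainable for such a repro attempt could look roughly like this (just a sketch using the function-API checkpointing, not my real training code); with checkpointing in place, a correctly resumed trial should continue from its last reported step instead of starting over at zero:

    import json
    import os
    from ray import tune

    def train_fn(config, checkpoint_dir=None):
        step = 0
        if checkpoint_dir:                      # restore from the last checkpoint after a crash/resume
            with open(os.path.join(checkpoint_dir, "state.json")) as f:
                step = json.load(f)["step"]
        while True:
            step += 1
            # ... one training step/epoch would go here ...
            with tune.checkpoint_dir(step=step) as cp_dir:
                with open(os.path.join(cp_dir, "state.json"), "w") as f:
                    json.dump({"step": step}, f)
            tune.report(step=step, val_loss=1.0 / step)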

stale bot commented Sep 9, 2022

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity within the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

stale bot added the stale label Sep 9, 2022
stale bot commented Sep 24, 2022

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

stale bot closed this as completed Sep 24, 2022