[rllib] Frequent “the actor died unexpectedly before finishing this task” errors with execution ops in Ray/RLLib 0.8.7+ #11239
Comments
cc @rkooo567 @stephanie-wang any ideas here? I can't think of why this might be. @soundway, are there logs that say why the worker is dying? Perhaps run
I am not aware of any known issues related to this... Seems pretty bad though. I also would like to see full logs (if it crashed within 4~5 minutes, full logs for that cluster would be ideal).
Btw, could this be the same issue? #9293
Can you also check if you have any meaningful logs in
And yeah, we've changed how actors are managed since version 0.8.6 (in a more centralized way). I guess that change could be related to this issue.
And do you have any info about which process uses the most memory when it starts exploding?
I see messages in
I don't know which process uses the most memory. But it seems like the plasma error happened before the drastic increase in memory, and not as a result of the increase in memory usage (the memory usage peaked 10 minutes after everything went haywire). This is assuming the timestamps between the workers and Grafana are synced correctly.
I see. So I think what happened is: one of your nodes ran out of memory for some reason, the actors on that node died, and that made your script fail (because you called ray.get on a dead actor's ref). There were drastic changes in how the actor lifecycle is managed from 0.8.6, but that doesn't seem to be related based on your logs (the basic idea is that all actors are now managed by gcs_server, and it looks like that worked correctly). @ericl I don't know the details of the object broadcasting issues, but is there any way they're related to this failure? Also, @soundway, when you ran with 0.8.5, did you see the same memory explosion issue?
also cc @stephanie-wang @barakmich
With 0.8.5, all memory usage is stable for long-running jobs with many actors/nodes.
What would cause the plasma error I showed earlier? (i.e.
I've been trying to get logs for the vanilla setup of QBert+PPO with the reproduction script I provided at the beginning of this issue (1 x p3.2xlarge + 1 x c5.18xlarge). It takes about 20-60+ hours to reproduce. The error in
During this process I was tweaking some of the settings and I might've found a potentially separate issue, given the new error I'm getting. The only things I changed are

```python
import copy

import gym
import numpy as np

import ray
import ray.rllib.agents.ppo as ppo

if __name__ == '__main__':
    ray.init(address="auto")
    config = copy.deepcopy(ppo.DEFAULT_CONFIG)
    config.update({
        "rollout_fragment_length": 32,
        "train_batch_size": 8192,
        "sgd_minibatch_size": 512,
        "num_sgd_iter": 1,
        "num_workers": 256,
        "num_gpus": 1,
        "num_cpus_per_worker": 0.25,
        "num_cpus_for_driver": 1,
        "model": {"fcnet_hiddens": [1024, 1024]},
        "framework": "torch",
        "lr": ray.tune.sample_from(lambda s: np.random.random()),
    })
    trainer_cls = ppo.PPOTrainer
    config["env"] = "QbertNoFrameskip-v4"
    ray.tune.run(trainer_cls,
                 config=config,
                 fail_fast=True,
                 reuse_actors=False,
                 queue_trials=True,
                 num_samples=10000,
                 scheduler=ray.tune.schedulers.ASHAScheduler(
                     time_attr='training_iteration',
                     metric='episode_reward_mean',
                     mode='max',
                     max_t=5,
                     grace_period=5,
                     reduction_factor=3,
                     brackets=3),
                 )
```

Here's a sampled copy of
@waldroje @soundway I was investigating this and realized there is a high likelihood the issue is the same as #11309, which is triggered by creating and destroying many actors. To validate this, could you try the following?
If this doesn't crash, then it is likely the same issue as #11309. If not, then it would be great if you could include the simplified script that triggers the crash without ASHAScheduler (ideally, simplified to just a
Oh forgot to mention, you can also try the ulimit workaround in that issue: #11309 |
I’ll run some stuff tomorrow morning and revert
@soundway @waldroje after experimenting with 1.0 vs 0.8.5 more, I think the main difference is we use 2-3x more file descriptors due to the change in the way actors are managed with the GCS service -- it's not really a leak. I'll put more details in the linked bug. I believe that increasing the file descriptor limit (ulimit -n value) will resolve the problem; can you try increasing the limit to 10000 or more? The number of fds opened seems to stabilize at just a few thousand.
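(For reference, a minimal sketch of checking and raising the per-process descriptor limit from within Python, equivalent to inspecting `ulimit -n`; the 10000 target simply mirrors the suggestion above and is illustrative, not part of this thread's eventual fix.)

```python
# Illustrative only: inspect this process's open-file limit and raise the soft
# limit toward the hard limit if it is below the suggested target.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"current fd limits: soft={soft}, hard={hard}")

TARGET = 10000  # illustrative value taken from the suggestion above
if soft < TARGET:
    # The soft limit can be raised up to the hard limit without extra privileges.
    new_soft = TARGET if hard == resource.RLIM_INFINITY else min(TARGET, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
```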
I upped my file descriptors to ~16k… but still crashing, and now getting a more specific error, not something I’ve seen before:
```
Ray worker pid: 24628
WARNING:tensorflow:From /home/svc-tai-dev/virt/algo_37/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py:1666: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
2020-10-21 00:25:36,762 ERROR worker.py:372 -- SystemExit was raised from the worker
Traceback (most recent call last):
  File "/home/svc-tai-dev/virt/algo_37/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2328, in get_attr
    pywrap_tf_session.TF_OperationGetAttrValueProto(self._c_op, name, buf)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Operation 'default_policy/Sum_4' has no attr named '_XlaCompile'.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/svc-tai-dev/virt/algo_37/lib/python3.7/site-packages/tensorflow/python/ops/gradients_util.py", line 331, in _MaybeCompile
    xla_compile = op.get_attr("_XlaCompile")
  File "/home/svc-tai-dev/virt/algo_37/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2332, in get_attr
    raise ValueError(str(e))
ValueError: Operation 'default_policy/Sum_4' has no attr named '_XlaCompile'.
```
That's the gist of it… It’s late here, and I’ll post a more complete set of logs and make sure I am not making any mistakes, but I figured I’d pass that along...
@ericl I did some tests and I have some doubts that this is the root cause of our problem, for several reasons (though there may be alternative explanations for my observations). The ulimit for file descriptors per process on our system is already at 65K according to
That said, I looked at the number of file descriptors for two other types of jobs that eventually tend to crash after a long time.
Unfortunately in our system we can't use
For setup 1, the largest observed fd count per process doesn't exceed 1700, so the 65K per-process limit should in theory be far more than enough. I got it via:
The overall number of open file descriptors according to the command below never exceeds 200K, but I do observe very large fluctuations between trial resets.
For setup 2, the max fds per process and the total number of fds obtained with the commands above don't exceed 600 and 140K respectively. In either case, the max number of fds per process is far less than the 65K threshold, and the max number of fds across the system is also far less than the 78M allowed. Maybe something eventually triggers a sudden increase in fd usage. In the past, we observed jobs that crashed fairly early during training (with relatively long minimum trial durations), and it didn't appear to happen during a trial reset. Based on these preliminary data, it appears that our problem won't be solved simply by increasing
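(As an aside, per-process and system-wide fd counts like the ones quoted above can be gathered by reading /proc directly; the helper below is a hypothetical illustration, not the command actually used in this thread, and it only works on Linux.)

```python
# Hypothetical helper: count open file descriptors per process by listing /proc/<pid>/fd.
import os

def open_fds_by_pid():
    counts = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            counts[int(pid)] = len(os.listdir(f"/proc/{pid}/fd"))
        except (PermissionError, FileNotFoundError):
            continue  # process exited, or it belongs to another user
    return counts

if __name__ == "__main__":
    counts = open_fds_by_pid()
    print("max fds in a single process:", max(counts.values()))
    print("total fds across all processes:", sum(counts.values()))
```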
Thanks for the investigations @soundway. It does indeed sound like a separate issue. I'm trying to think of how to best reproduce this. Based on what you said, it sounds like it's potentially something triggered by the characteristics of a custom environment, and potentially TensorFlow related. To help me isolate the cause and hopefully get a fast repro, could you try the following?
Probably (1) and (2) are the most critical, since if there's a standalone repro script that runs in a few minutes, I can always try the other steps myself pretty easily.
For (1) I have tried something like that (with varying obs size, model size, and environment memory occupancy), but unfortunately it still takes roughly the same amount of time to crash, and I could not get it to crash within minutes. Though I haven't done this too rigorously and I can revisit it. I haven't tried running long jobs in the cloud with just one instance yet, but I can try that. I never use TF -- everything I've done here is with Torch (1.6 specifically, but we also see the problem with 1.4). I could try TF as well. I started (4) with the current setup with Q-bert and they haven't crashed yet, but I will give you updates on this.
@ericl Thanks for providing the dependency list. I will see if I can get a similar docker image in our cluster. For the other issue I mentioned, we aren't using any Tune/RLLib -- just vanilla Ray with our own actors/tasks. Unfortunately, we can't share our code in a bug report, so it might take some time before I can get a clean standalone script that reproduces this. Even then, these things don't happen very often (about 1 in dozens of tries), so it won't be easy to reproduce.
@soundway It still seems ok after a few days:
Note that the changes I had to make to get it to run were setting
and
The training hangs initially without this. Based on this result, it seems the instability is either related to these options, dependencies, or the use of Docker, since the other factors are the same.
Thanks for the update. What's the max number of iterations each trial runs for? The script I provided sets it to 5 -- which is very aggressive at restarting trials. Here it seems to be 2000?
Ah, I missed that. Retrying with max_t=5, grace_period=5:
@soundway it succeeded (100 samples); I'm retrying with 1000 samples. Btw, is it possible for you to provide an example YAML of the p2.16xl workload to reproduce? Wondering if it's some odd dependency issue / docker image issue.
What exactly do you mean by the YAML file? Sometimes the job crashes around 1K-2K samples. The "200" num_workers could also reduce the chance of crashing since fewer trials will be launched. When using 10000 samples, while there is an initial hang, the job does eventually start.
You mentioned being able to reproduce on a p2.16xl; I'm wondering if you could share the Ray cluster launcher YAML used to create that node (if that was how it was created). I agree the "200" could reduce the chances of crashing; however, note that you shouldn't be running with >1 worker per core. I would be curious if the issues go away on your end if you set "num_cpus_per_worker" > 1 to avoid overload.
We have our own way of launching Ray jobs within our infra and don't use any of Ray's launching mechanisms. But at a high level, we just launch a bunch of instances of the specified types, each loaded with a docker image, then run a command on one of the containers. We generally don't use more than 1 worker per core, but this script was created in the context of you asking me to make the crash reproducible on a single machine, and increasing the number of workers was one of the most reliable ways to do that on a single machine. Running similar scripts with a large number of actors over a large cluster also increases the chance of crashing, but it requires a more complex setup (e.g. the first setup I described in this bug report).
@mkoh-asapp mentioned today he was seeing this issue on some other Ray issues -- Matt, can you maybe add some information on this thread?
@richardliaw Please file a separate issue; this is a symptom, not a cause.
Got it
@soundway good news, I managed to repro. Filtering the logs with:

```
cat raylet.out | grep -v "soft limit" | grep -v "Unpinning object" | grep -v "Failed to send local" | grep -v "Connected to" | grep -v "Sending local GC" | grep -v "Last heartbeat was sent" | grep -v "took " | grep -v "Failed to kill
```
dmesg:
It could be that triggering GC somehow caused a segfault in the worker. I'll look into trying to reproduce this scenario.
Btw, I think you can mitigate transient trial crashes by using the max_failures setting in Tune. @richardliaw not sure if it's compatible with hyperopt though.
It is compatible yep
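(For context, a minimal sketch of the max_failures suggestion above, written against the Ray ~1.0 Tune API; the environment and numbers here are placeholders rather than anything from this thread.)

```python
# Sketch only: let Tune retry a trial whose actors died instead of failing the whole run.
import ray
from ray import tune

ray.init()
tune.run(
    "PPO",  # RLlib's registered PPO trainable
    config={"env": "CartPole-v0", "framework": "torch", "num_workers": 2},
    stop={"training_iteration": 5},
    num_samples=4,
    max_failures=3,  # retry a crashed trial up to 3 times before marking it failed
)
```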
Quick update: I've verified the issue is not due to GC. I also don't see any memory spikes or anything like that, so it's possible I've reproduced a different crash than what you originally reported @soundway. What I do see is
I've only checked dmesg in a handful of failed instances, and did not see torch-related memory errors in those. I've reproduced crashes with TF in the past, but unfortunately I no longer have the log files for them. So it is possible that it's an unrelated issue.
I've managed to repro without a segfault (using TF). It seems to happen after ~550 trials (not sure if exactly after 550). Trying to dig up worker logs for the crash, if there are any.
I've validated that using
This means that #12894 should resolve the issue, by raising the number of unique bytes in the actor id from 4 -> 12 (now we support up to ~281474976710656 actors instead of ~100000).
Thanks for the update! If this is indeed the fix for the crashes produced by the contrived script, then I'm not entirely convinced that it shares the same root cause as the problem we have when running with our own environments (e.g. where we sometimes see errors related to file descriptors from raylet). The experiments we run with our environments that crashed don't produce nearly as many actors, and the chance of collisions even with just 32 bits should be negligible compared to the rate of failure we were seeing, according to this table. Unless the randomness of the IDs is not uniform. I guess we'll find out more with the latest wheel. I suppose 0.8.5 uses a different scheme to manage actor IDs? Because we never experienced any failures with 0.8.5.
Yep, 0.8.5 doesn't centralize actor management, so the probability of collision is much lower (you would have to have collisions within individual trials, which is extremely unlikely since each trial only had ~200 actors). However, the central GCS tracks all actors over all time, so collisions there become inevitable once you cycle through enough actors in your app. That said, it does seem likely your app has other issues. Let's see if this PR fixes them. If not, we should create a new issue thread since this one is quite overloaded by now.
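(To put rough numbers on the collision argument above: 4 bytes of randomness gives 2^32 possible IDs, and the usual birthday approximation says a collision becomes likely around 10^5 actors, which matches the figures quoted. The snippet below is a back-of-the-envelope check, not code from Ray.)

```python
# Birthday-bound estimate of P(at least one collision) among n random k-bit actor IDs.
import math

def collision_probability(n_actors: int, id_bits: int) -> float:
    # P ~= 1 - exp(-n^2 / (2 * 2^k)); expm1 keeps precision for tiny probabilities.
    return -math.expm1(-(n_actors ** 2) / (2.0 * 2 ** id_bits))

print(collision_probability(100_000, 32))  # ~0.69: a collision is likely by ~100k actors
print(collision_probability(100_000, 96))  # ~6e-20: negligible with 12 random bytes
```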
@ericl I see, that makes sense. Another question -- during RLLib training, besides the obvious trial and worker actors, are other hidden actors being created in the background constantly? One of the earlier tests I did here used a super long trial duration, but jobs still crashed without any trial reset happening.
@soundway shouldn't be. That sounds like a different cause (btw, I also realized that you should be able to see crash logs in the
The PR got reverted, pending re-merge.
This is not a contribution.
Versions:
python: 3.6.8
ray: 1.0
pytorch: 1.6
tensorflow: 1.15
OS: Ubuntu 18.04 Docker
Since upgrading to 0.8.7 and 1.0, we are experiencing multiple stability issues that result in jobs crashing with "The actor died unexpectedly before finishing this task" errors. Note that these issues are quite difficult to reproduce using the default environment provided by RLLib (often needing over 40 hours for QBert), but with our custom environment they happen much earlier during execution (sometimes as early as 4 minutes) and they also happen very consistently. We've never experienced anything like this with 0.8.5 or prior. Memory/resources shouldn't be the bottleneck: even though our custom environments use more memory, we also use nodes with much larger memory capacity for their rollouts. We closely monitor them via Grafana to ensure that all usage falls well below what's available (i.e. overall memory usage is usually far below 50%). For every node, we assign 30% of the node's memory to the object store, which should be far more than enough given the experience/model size.

Here's an example of the errors (produced by the script provided later):
Here's another variant of the error when running our own custom environment:
Here's the example script that produced the first error by training QBert with PPO. Note that it might take over 40 hours for the error to occur. The setup is a p3.2xlarge instance for the trainer, and the rollout workers are on a c5.18xlarge instance. 30% of memory on each instance is dedicated to object store.
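(Side note: the repro script itself is not reproduced here. As a hedged sketch of the "30% of memory for the object store" setup described above, assuming psutil is installed, the plasma store can be capped like this when starting Ray locally; on a cluster the same value would normally be passed per node via `ray start --object-store-memory`.)

```python
# Sketch: start Ray with roughly 30% of system memory reserved for the plasma object store.
import psutil
import ray

object_store_bytes = int(0.30 * psutil.virtual_memory().total)
ray.init(object_store_memory=object_store_bytes)
print(f"object store capped at {object_store_bytes / 1e9:.1f} GB")
```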
One of the things we tried when debugging the problem is storing references to all execution ops in memory, and somehow it helps. We discovered this mitigation almost accidentally as we were debugging our own execution plan. For instance, for the PPO execution plan, if we modify it to also return all execution ops in a list that gets held in memory, then the time it takes for the job to crash increases significantly and we no longer get the same error. Instead, the error becomes
ray.exceptions.ObjectLostError: Object XXXXX is lost due to node failure
-- which seems to be caused by some node failing its heartbeat check. It's unclear whether our attempted mitigation is just a fluke, whether it points in the right direction for fixing the underlying problem, or whether these errors share the same underlying cause. Here's a modified script (see the illustrative sketch at the end of this post). Note that the new error is no longer guaranteed to be reproducible even when running for a long time, but with our environment it's quite consistent.

In the worker logs, we would find the following message around the time we get the object lost error:
Further, sometimes (not always) the node that timed out shows a drastic, sharp increase (2-3x) in memory usage according to our Grafana within several seconds near the end, which is far more than the amount of memory it should use. We attempted to mitigate this second error by increasing the num_heartbeats_timeout setting in --system_config, but it doesn't seem to make much difference. None of these issues exist with the old optimizer scheme in 0.8.5 or earlier, and we can train with our custom environment for days without any issue.

We also encounter problems where, after a trial terminates, a new trial doesn't get started in certain cases (this can only be reproduced with our environments). It's unclear if that's related to the issue above at all, and it's been hard to debug with these other instability issues going on. We'll likely file another more detailed bug report for that later, once this is addressed.
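(Finally, a rough, hypothetical illustration of the execution-ops mitigation described above: a simplified PPO-like execution plan that also keeps strong references to its intermediate ops. This is neither the reporter's modified script nor RLlib's actual PPO plan; the imports and operator names assume RLlib around version 1.0 and are simplified.)

```python
# Simplified sketch: build a PPO-style execution plan and hold references to every
# intermediate op in a module-level list so they stay alive for the whole run.
from ray.rllib.execution.metric_ops import StandardMetricsReporting
from ray.rllib.execution.rollout_ops import ConcatBatches, ParallelRollouts
from ray.rllib.execution.train_ops import TrainOneStep

_HELD_OPS = []  # strong references kept for the lifetime of the process


def execution_plan_holding_refs(workers, config):
    rollouts = ParallelRollouts(workers, mode="bulk_sync")
    batches = rollouts.combine(
        ConcatBatches(min_batch_size=config["train_batch_size"]))
    train_op = batches.for_each(TrainOneStep(workers))
    # The mitigation described above: keep the op objects alive instead of
    # letting them be dropped once the plan is built.
    _HELD_OPS.extend([rollouts, batches, train_op])
    return StandardMetricsReporting(train_op, workers, config)
```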