[rllib] Frequent “the actor died unexpectedly before finishing this task” errors with execution ops in Ray/RLLib 0.8.7+ #11239
Comments
cc @rkooo567 @stephanie-wang any ideas here? I can't think of why this might be. @soundway, are there logs that say why the worker is dying? Perhaps run
I am not aware of any known issues related to this... Seems pretty bad though. I also would like to see full logs (if it crashed within 4~5 minutes, full logs for that cluster would be ideal).
Btw, could this be the same issue? #9293
Can you also check if you have any meaningful logs in
And yeah, we've changed how actors are managed since version 0.8.6 (in a more centralized way). I guess that change could be related to this issue.
And do you have any info about which process uses the most memory when it starts exploding?
I see messages in
I don't know which process uses the most memory. But it seems like the plasma error happened before the drastic increase in memory, and not as a result of the increase in memory usage (the memory usage peaked 10 minutes after everything went haywire). This is assuming the timestamps between the workers and Grafana are synced correctly.
I see. So I think what happened is: one of your nodes ran out of memory for some reason, the actors on that node died, and that made your script fail (because you called ray.get on a dead actor's ref). There were drastic changes in how the actor lifecycle is managed from 0.8.6, but that doesn't seem to be related based on your logs (the basic idea is that all actors are now managed by gcs_server, and it looks like that worked correctly). @ericl I don't know the details of the object broadcasting issues, but is there any way they're related to this failure? Also, @soundway, when you ran with 0.8.5, did you see the same memory explosion issue?
also cc @stephanie-wang @barakmich
With 0.8.5, all memory usage is stable for long-running jobs with many actors/nodes.
What would cause the plasma error I showed earlier? (i.e.
I've been trying to get logs for the vanilla setup of QBert+PPO with the reproduction script I provided at the beginning of this issue (1 x p3.2xlarge + 1 x c5.18xlarge). It takes about 20-60+ hours to reproduce. The error in
During this process I was tweaking some of the settings and I might've found a potentially separate issue, given the new error I'm getting. The only things I changed are

```python
import copy

import gym
import numpy as np

import ray
import ray.rllib.agents.ppo as ppo

if __name__ == '__main__':
    ray.init(address="auto")
    config = copy.deepcopy(ppo.DEFAULT_CONFIG)
    config.update({
        "rollout_fragment_length": 32,
        "train_batch_size": 8192,
        "sgd_minibatch_size": 512,
        "num_sgd_iter": 1,
        "num_workers": 256,
        "num_gpus": 1,
        "num_cpus_per_worker": 0.25,
        "num_cpus_for_driver": 1,
        "model": {"fcnet_hiddens": [1024, 1024]},
        "framework": "torch",
        "lr": ray.tune.sample_from(lambda s: np.random.random()),
    })
    trainer_cls = ppo.PPOTrainer
    config["env"] = "QbertNoFrameskip-v4"
    ray.tune.run(trainer_cls,
                 config=config,
                 fail_fast=True,
                 reuse_actors=False,
                 queue_trials=True,
                 num_samples=10000,
                 scheduler=ray.tune.schedulers.ASHAScheduler(
                     time_attr='training_iteration',
                     metric='episode_reward_mean',
                     mode='max',
                     max_t=5,
                     grace_period=5,
                     reduction_factor=3,
                     brackets=3),
                 )
```

Here's a sampled copy of
@waldroje @soundway I was investigating this and realized there is a high likelihood the issue is the same as #11309, which is triggered by creating and destroying many actors. To validate this, could you try the following?
If this doesn't crash, then it is likely the same issue as #11309. If not, then it would be great if you could include the simplified script that triggers the crash without ASHAScheduler (ideally, simplified to just a
Oh forgot to mention, you can also try the ulimit workaround in that issue: #11309 |
I’ll run some stuff tomorrow morning and revert
@soundway @waldroje after experimenting with 1.0 vs 0.8.5 more, I think the main difference is we use 2-3x more file descriptors due to the change in the way actors are managed with the GCS service -- it's not really a leak. I'll put more details in the linked bug. I believe that increasing the file descriptor limit (ulimit -n value) will resolve the problem; can you try increasing the limit to 10000 or more? The number of fds opened seems to stabilize at just a few thousand.
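(For reference, a minimal sketch of checking and raising the per-process descriptor limit from within Python, equivalent to inspecting `ulimit -n`; the 10000 target simply mirrors the suggestion above and is illustrative, not part of this thread's eventual fix.)

```python
# Illustrative only: inspect this process's open-file limit and raise the soft
# limit toward the hard limit if it is below the suggested target.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"current fd limits: soft={soft}, hard={hard}")

TARGET = 10000  # illustrative value taken from the suggestion above
if soft < TARGET:
    # The soft limit can be raised up to the hard limit without extra privileges.
    new_soft = TARGET if hard == resource.RLIM_INFINITY else min(TARGET, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
```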
I upped my file descriptors to ~16k… but still crashing, and now getting a more specific error, not something I’ve seen before:
```
Ray worker pid: 24628
WARNING:tensorflow:From /home/svc-tai-dev/virt/algo_37/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py:1666: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
2020-10-21 00:25:36,762 ERROR worker.py:372 -- SystemExit was raised from the worker
Traceback (most recent call last):
  File "/home/svc-tai-dev/virt/algo_37/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2328, in get_attr
    pywrap_tf_session.TF_OperationGetAttrValueProto(self._c_op, name, buf)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Operation 'default_policy/Sum_4' has no attr named '_XlaCompile'.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/svc-tai-dev/virt/algo_37/lib/python3.7/site-packages/tensorflow/python/ops/gradients_util.py", line 331, in _MaybeCompile
    xla_compile = op.get_attr("_XlaCompile")
  File "/home/svc-tai-dev/virt/algo_37/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2332, in get_attr
    raise ValueError(str(e))
ValueError: Operation 'default_policy/Sum_4' has no attr named '_XlaCompile'.
```
That's the gist of it… It’s late here, and I’ll post a more complete set of logs and make sure I am not making any mistakes, but I figured I’d pass that along...
@ericl I did some tests and I have some doubts that this is the root cause of our problem, for several reasons (though there may be alternative explanations for my observations). The ulimit for file descriptors per process on our system is already at 65K according to
That said, I looked at the number of file descriptors for two other types of jobs that eventually tend to crash after a long time.
Unfortunately in our system we can't use
For setup 1, the largest observed fd count per process doesn't exceed 1700, so the 65K per-process limit should in theory be far more than enough. I got it via:
The overall number of open file descriptors according to the command below never exceeds 200K, but I do observe very large fluctuations between trial resets.
For setup 2, the max fds per process and the total number of fds obtained with the commands above don't exceed 600 and 140K respectively. In either case, the max number of fds per process is far less than the 65K threshold, and the max number of fds across the system is also far less than the 78M allowed. Maybe something eventually triggers a sudden increase in fd usage. In the past, we observed jobs that crashed fairly early during training (with relatively long minimum trial durations), and it didn't appear to happen during a trial reset. Based on these preliminary data, it appears that our problem won't be solved simply by increasing
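(As an aside, per-process and system-wide fd counts like the ones quoted above can be gathered by reading /proc directly; the helper below is a hypothetical illustration, not the command actually used in this thread, and it only works on Linux.)

```python
# Hypothetical helper: count open file descriptors per process by listing /proc/<pid>/fd.
import os

def open_fds_by_pid():
    counts = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            counts[int(pid)] = len(os.listdir(f"/proc/{pid}/fd"))
        except (PermissionError, FileNotFoundError):
            continue  # process exited, or it belongs to another user
    return counts

if __name__ == "__main__":
    counts = open_fds_by_pid()
    print("max fds in a single process:", max(counts.values()))
    print("total fds across all processes:", sum(counts.values()))
```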
Thanks for the investigations @soundway. It does indeed sound like a separate issue. I'm trying to think of how to best reproduce this. Based on what you said, it sounds like it's potentially something triggered by the characteristics of a custom environment, and potentially TensorFlow related. To help me isolate the cause and hopefully get a fast repro, could you try the following?
Probably (1) and (2) are the most critical, since if there's a standalone repro script that runs in a few minutes, I can always try the other steps myself pretty easily.
For (1) I have tried something like that (with varying obs size, model size, and environment memory occupancy), but unfortunately it still takes roughly the same amount of time to crash, and I could not get it to crash within minutes. Though I haven't done this too rigorously and I can revisit it. I haven't tried running long jobs in the cloud with just one instance yet, but I can try that. I never use TF -- everything I've done here is with Torch (1.6 specifically, but we also see the problem with 1.4). I could try TF as well. I started (4) with the current setup with Q-bert and they haven't crashed yet, but I will give you updates on this.
@ericl Thanks for providing the dependency list. I will see if I can get a similar docker image in our cluster. For the other issue I mentioned, we aren't using any Tune/RLLib -- just vanilla Ray with our own actors/tasks. Unfortunately, we can't share our code in a bug report, so it might take some time before I can get a clean standalone script that reproduces this. Even then, these things don't happen very often (about 1 in dozens of tries), so it won't be easy to reproduce.
@soundway It still seems ok after a few days:
Note that the changes I had to make to get it to run were setting
and
The training hangs initially without this. Based on this result, it seems the instability is either related to these options, dependencies, or the use of Docker, since the other factors are the same.
Thanks for the update. What's the max number of iterations each trial runs for? The script I provided sets it to 5 -- which is very aggressive at restarting trials. Here it seems to be 2000?
Ah, I missed that. Retrying with max_t=5, grace_period=5:
@soundway it succeeded (100 samples); I'm retrying with 1000 samples. Btw, is it possible for you to provide an example YAML of the p2.16xl workload to reproduce? Wondering if it's some odd dependency issue / docker image issue.
What exactly do you mean by the YAML file? Sometimes the job crashes around 1K-2K samples. The "200" num_workers could also reduce the chance of crashing since fewer trials will be launched. When using 10000 samples, while there is an initial hang, the job does eventually start.
You mentioned being able to reproduce on a p2.16xl; I'm wondering if you could share the Ray cluster launcher YAML used to create that node (if that was how it was created). I agree the "200" could reduce the chances of crashing; however, note that you shouldn't be running with >1 worker per core. I would be curious if the issues go away on your end if you set "num_cpus_per_worker" > 1 to avoid overload.
We have our own way of launching Ray jobs within our infra and don't use any of Ray's launching mechanisms. But at a high level, we just launch a bunch of instances of the specified types, each loaded with a docker image, then run a command on one of the containers. We generally don't use more than 1 worker per core, but this script was created in the context of you asking me to make the crash reproducible on a single machine, and increasing the number of workers was one of the most reliable ways to do that on a single machine. Running similar scripts with a large number of actors over a large cluster also increases the chance of crashing, but it requires a more complex setup (e.g. the first setup I described in this bug report).
@mkoh-asapp mentioned today he was seeing this issue on some other Ray issues -- Matt, can you maybe add some information on this thread?
@richardliaw Please file a separate issue; this is a symptom, not a cause.
Got it
@soundway good news, I managed to repro. Filtering the logs with:

```
cat raylet.out | grep -v "soft limit" | grep -v "Unpinning object" | grep -v "Failed to send local" | grep -v "Connected to" | grep -v "Sending local GC" | grep -v "Last heartbeat was sent" | grep -v "took " | grep -v "Failed to kill
```
dmesg:
It could be that triggering GC somehow caused a segfault in the worker. I'll look into trying to reproduce this scenario.
Btw, I think you can mitigate transient trial crashes by using the max_failures setting in Tune. @richardliaw not sure if it's compatible with hyperopt though.
It is compatible yep
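(For context, a minimal sketch of the max_failures suggestion above, written against the Ray ~1.0 Tune API; the environment and numbers here are placeholders rather than anything from this thread.)

```python
# Sketch only: let Tune retry a trial whose actors died instead of failing the whole run.
import ray
from ray import tune

ray.init()
tune.run(
    "PPO",  # RLlib's registered PPO trainable
    config={"env": "CartPole-v0", "framework": "torch", "num_workers": 2},
    stop={"training_iteration": 5},
    num_samples=4,
    max_failures=3,  # retry a crashed trial up to 3 times before marking it failed
)
```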
Quick update: I've verified the issue is not due to GC. I also don't see any memory spikes or anything like that, so it's possible I've reproduced a different crash than what you originally reported @soundway. What I do see is
I've only checked dmesg in a handful of failed instances, and did not see torch-related memory errors in those. I've reproduced crashes with TF in the past, but unfortunately I no longer have the log files for them. So it is possible that it's an unrelated issue.
I've managed to repro without a segfault (using TF). It seems to happen after ~550 trials (not sure if exactly after 550). Trying to dig up worker logs for the crash, if there are any.
I've validated that using
This means that #12894 should resolve the issue, by raising the number of unique bytes in the actor id from 4 -> 12 (now we support up to ~281474976710656 actors instead of ~100000).
Thanks for the update! If this is indeed the fix for the crashes produced by the contrived script, then I'm not entirely convinced that it shares the same root cause as the problem we have when running with our own environments (e.g. where we sometimes see errors related to file descriptors from raylet). The experiments we run with our environments that crashed don't produce nearly as many actors, and the chance of collisions even with just 32 bits should be negligible compared to the rate of failure we were seeing, according to this table. Unless the randomness of the IDs is not uniform. I guess we'll find out more with the latest wheel. I suppose 0.8.5 uses a different scheme to manage actor IDs? Because we never experienced any failures with 0.8.5.
Yep, 0.8.5 doesn't centralize actor management, so the probability of collision is much lower (you would have to have collisions within individual trials, which is extremely unlikely since each trial only had ~200 actors). However, the central GCS tracks all actors over all time, so collisions there become inevitable once you cycle through enough actors in your app. That said, it does seem likely your app has other issues. Let's see if this PR fixes them. If not, we should create a new issue thread since this one is quite overloaded by now.
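(To put rough numbers on the collision argument above: 4 bytes of randomness gives 2^32 possible IDs, and the usual birthday approximation says a collision becomes likely around 10^5 actors, which matches the figures quoted. The snippet below is a back-of-the-envelope check, not code from Ray.)

```python
# Birthday-bound estimate of P(at least one collision) among n random k-bit actor IDs.
import math

def collision_probability(n_actors: int, id_bits: int) -> float:
    # P ~= 1 - exp(-n^2 / (2 * 2^k)); expm1 keeps precision for tiny probabilities.
    return -math.expm1(-(n_actors ** 2) / (2.0 * 2 ** id_bits))

print(collision_probability(100_000, 32))  # ~0.69: a collision is likely by ~100k actors
print(collision_probability(100_000, 96))  # ~6e-20: negligible with 12 random bytes
```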
@ericl I see, that makes sense. Another question -- during RLLib training, besides the obvious trial and worker actors, are other hidden actors being created in the background constantly? One of the earlier tests I did here used a super long trial duration, but jobs still crashed without any trial reset happening.
@soundway shouldn't be. That sounds like a different cause (btw, I also realized that you should be able to see crash logs in the
The PR got reverted, pending re-merge.
This is not a contribution.
Versions:
python: 3.6.8
ray: 1.0
pytorch: 1.6
tensorflow: 1.15
OS: Ubuntu 18.04 Docker
Since upgrading to 0.8.7 and 1.0, we are experiencing multiple stability issues that result in jobs crashing with "The actor died unexpectedly before finishing this task" errors. Note that these issues are quite difficult to reproduce using the default environment provided by RLLib (often needing over 40 hours for QBert), but with our custom environment they happen much earlier during execution (sometimes as early as 4 minutes) and they also happen very consistently. We've never experienced anything like this with 0.8.5 or prior. Memory/resources shouldn't be the bottleneck: even though our custom environments use more memory, we also use nodes with much larger memory capacity for their rollouts. We closely monitor them via Grafana to ensure that all usage falls well below what's available (i.e. overall memory usage is usually far below 50%). For every node, we assign 30% of the node's memory to the object store, which should be far more than enough given the experience/model size.

Here's an example of the errors (produced by the script provided later):
Here's another variant of the error when running our own custom environment:
Here's the example script that produced the first error by training QBert with PPO. Note that it might take over 40 hours for the error to occur. The setup is a p3.2xlarge instance for the trainer, and the rollout workers are on a c5.18xlarge instance. 30% of memory on each instance is dedicated to object store.
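(Side note: the repro script itself is not reproduced here. As a hedged sketch of the "30% of memory for the object store" setup described above, assuming psutil is installed, the plasma store can be capped like this when starting Ray locally; on a cluster the same value would normally be passed per node via `ray start --object-store-memory`.)

```python
# Sketch: start Ray with roughly 30% of system memory reserved for the plasma object store.
import psutil
import ray

object_store_bytes = int(0.30 * psutil.virtual_memory().total)
ray.init(object_store_memory=object_store_bytes)
print(f"object store capped at {object_store_bytes / 1e9:.1f} GB")
```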
One of the things we tried when debugging the problem is storing references to all execution ops in memory, and somehow it helps. We discovered this mitigation almost accidentally as we were debugging our own execution plan. For instance, for the PPO execution plan, if we modify it to also return all execution ops in a list that gets held in memory, then the time it takes for the job to crash increases significantly and we no longer get the same error. Instead, the error becomes
ray.exceptions.ObjectLostError: Object XXXXX is lost due to node failure
-- which seems to be caused by some node failing its heartbeat check. It's unclear whether our attempted mitigation is just a fluke, whether it points in the right direction for fixing the underlying problem, or whether these errors share the same underlying cause. Here's a modified script (see the illustrative sketch at the end of this post). Note that the new error is no longer guaranteed to be reproducible even when running for a long time, but with our environment it's quite consistent.

In the worker logs, we would find the following message around the time we get the object lost error:
Further, sometimes (not always) the node that timed out shows a drastic, sharp increase (2-3x) in memory usage according to our Grafana within several seconds near the end, which is far more than the amount of memory it should use. We attempted to mitigate this second error by increasing the num_heartbeats_timeout setting in --system_config, but it doesn't seem to make much difference. None of these issues exist with the old optimizer scheme in 0.8.5 or earlier, and we can train with our custom environment for days without any issue.

We also encounter problems where, after a trial terminates, a new trial doesn't get started in certain cases (this can only be reproduced with our environments). It's unclear if that's related to the issue above at all, and it's been hard to debug with these other instability issues going on. We'll likely file another more detailed bug report for that later, once this is addressed.
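(Finally, a rough, hypothetical illustration of the execution-ops mitigation described above: a simplified PPO-like execution plan that also keeps strong references to its intermediate ops. This is neither the reporter's modified script nor RLlib's actual PPO plan; the imports and operator names assume RLlib around version 1.0 and are simplified.)

```python
# Simplified sketch: build a PPO-style execution plan and hold references to every
# intermediate op in a module-level list so they stay alive for the whole run.
from ray.rllib.execution.metric_ops import StandardMetricsReporting
from ray.rllib.execution.rollout_ops import ConcatBatches, ParallelRollouts
from ray.rllib.execution.train_ops import TrainOneStep

_HELD_OPS = []  # strong references kept for the lifetime of the process


def execution_plan_holding_refs(workers, config):
    rollouts = ParallelRollouts(workers, mode="bulk_sync")
    batches = rollouts.combine(
        ConcatBatches(min_batch_size=config["train_batch_size"]))
    train_op = batches.for_each(TrainOneStep(workers))
    # The mitigation described above: keep the op objects alive instead of
    # letting them be dropped once the plan is built.
    _HELD_OPS.extend([rollouts, batches, train_op])
    return StandardMetricsReporting(train_op, workers, config)
```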