leoricmaster opened this issue on Jun 20, 2024 · 1 comment
Labels
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
triage: Needs triage (eg: priority, bug/not-bug, and owning component)
What happened + What you expected to happen
The code below causes the actor to die in Ray 2.24.0. The same code works after rolling back to Ray 2.23.0:
import ray
import asyncio
from datetime import datetime

@ray.remote
class AsyncActor:
    # multiple invocation of this method can be running in
    # the event loop at the same time
    async def run_concurrent(self):
        print(f"started at {datetime.now()}.")
        await asyncio.sleep(2)  # concurrent workload here
        print(f"finished at {datetime.now()}.")
        return True

# Create an async actor.
actor = AsyncActor.remote()

# Way 1: regular ray.get
print(ray.get([actor.run_concurrent.remote() for _ in range(4)]))

# Way 2: async ray.get
async def async_get():
    tasks = [actor.run_concurrent.remote() for _ in range(4)]
    completed_tasks = await asyncio.gather(*tasks)
    return completed_tasks

print(asyncio.run(async_get()))

import time
time.sleep(0.1)
Logs:
(AsyncActor pid=46603) started at 2024-06-20 18:19:47.989231.
(AsyncActor pid=46603) started at 2024-06-20 18:19:47.989474.
(AsyncActor pid=46603) started at 2024-06-20 18:19:47.989493.
(AsyncActor pid=46603) started at 2024-06-20 18:19:47.989532.
(AsyncActor pid=46603) finished at 2024-06-20 18:19:49.999099.
(AsyncActor pid=46603) finished at 2024-06-20 18:19:49.999147.
(AsyncActor pid=46603) finished at 2024-06-20 18:19:49.999159.
(AsyncActor pid=46603) finished at 2024-06-20 18:19:49.999169.
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff246e90dcf82b9d602dc76de701000000 Worker ID: 210b549e23bb6bb19cc404a7c1b29fd3bc550b630cc4f802360e8e49 Node ID: 1245d524846dd39e1efdf9cae388c228a9c1c4fb24b63935c8016968 Worker IP address: 10.219.204.96 Worker port: 37513 Worker PID: 46603 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(AsyncActor pid=46603) *** SIGSEGV received at time=1718878790 on cpu 1 ***
(AsyncActor pid=46603) PC: @ 0x787458702238 (unknown) boost::fibers::detail::spinlock_ttas::lock()
(AsyncActor pid=46603) @ 0x78745a442520 6768 (unknown)
(AsyncActor pid=46603) @ 0x787458873a70 64 boost::fibers::mutex::lock()
(AsyncActor pid=46603) @ 0x787458800580 96 std::_Function_handler<>::_M_invoke()
(AsyncActor pid=46603) @ 0x7874587f8b95 96 boost::fibers::worker_context<>::run_()
(AsyncActor pid=46603) @ 0x7874587f8910 80 boost::context::detail::fiber_entry<>()
(AsyncActor pid=46603) @ 0x787458874bcf (unknown) make_fcontext
(AsyncActor pid=46603) [2024-06-20 18:19:50,103 E 46603 46638] logging.cc:343: *** SIGSEGV received at time=1718878790 on cpu 1 ***
(AsyncActor pid=46603) [2024-06-20 18:19:50,103 E 46603 46638] logging.cc:343: PC: @ 0x787458702238 (unknown) boost::fibers::detail::spinlock_ttas::lock()
(AsyncActor pid=46603) [2024-06-20 18:19:50,103 E 46603 46638] logging.cc:343: @ 0x78745a442520 6768 (unknown)
(AsyncActor pid=46603) [2024-06-20 18:19:50,103 E 46603 46638] logging.cc:343: @ 0x787458873a70 64 boost::fibers::mutex::lock()
(AsyncActor pid=46603) [2024-06-20 18:19:50,103 E 46603 46638] logging.cc:343: @ 0x787458800580 96 std::_Function_handler<>::_M_invoke()
(AsyncActor pid=46603) [2024-06-20 18:19:50,103 E 46603 46638] logging.cc:343: @ 0x7874587f8b95 96 boost::fibers::worker_context<>::run_()
(AsyncActor pid=46603) [2024-06-20 18:19:50,103 E 46603 46638] logging.cc:343: @ 0x7874587f8910 80 boost::context::detail::fiber_entry<>()
(AsyncActor pid=46603) [2024-06-20 18:19:50,103 E 46603 46638] logging.cc:343: @ 0x787458874bcf (unknown) make_fcontext
(AsyncActor pid=46603) Fatal Python error: Segmentation fault
(AsyncActor pid=46603)
(AsyncActor pid=46603) Stack (most recent call first):
(AsyncActor pid=46603) <no Python frame>
(AsyncActor pid=46603)
(AsyncActor pid=46603) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, _brotli, charset_normalizer.md, simplejson._speedups, uvloop.loop, ray._raylet, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pyarrow.lib, pyarrow._hdfsio, pyarrow._json (total: 27)
---------------------------------------------------------------------------
ActorDiedError Traceback (most recent call last)
Cell In[75], line 19
16 actor = AsyncActor.remote()
18 # Way 1: regular ray.get
---> 19 print(ray.get([actor.run_concurrent.remote() for _ in range(4)]))
21 # Way 2: async ray.get
22 async def async_get():
File ~/miniforge3/envs/ray/lib/python3.10/site-packages/ray/_private/auto_init_hook.py:21, in wrap_auto_init.<locals>.auto_init_wrapper(*args, **kwargs)
18 @wraps(fn)
19 def auto_init_wrapper(*args, **kwargs):
20 auto_init_ray()
---> 21 return fn(*args, **kwargs)
File ~/miniforge3/envs/ray/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:103, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
101 if func.__name__ != "init" or is_client_mode_enabled_by_default:
102 return getattr(ray, func.__name__)(*args, **kwargs)
--> 103 return func(*args, **kwargs)
File ~/miniforge3/envs/ray/lib/python3.10/site-packages/ray/_private/worker.py:2613, in get(object_refs, timeout)
2607 raise ValueError(
2608 f"Invalid type of object refs, {type(object_refs)}, is given. "
2609 "'object_refs' must either be an ObjectRef or a list of ObjectRefs. "
2610 )
2612 # TODO(ujvl): Consider how to allow user to retrieve the ready objects.
-> 2613 values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
2614 for i, value in enumerate(values):
2615 if isinstance(value, RayError):
File ~/miniforge3/envs/ray/lib/python3.10/site-packages/ray/_private/worker.py:863, in Worker.get_objects(self, object_refs, timeout)
861 raise value.as_instanceof_cause()
862 else:
--> 863 raise value
864 return values, debugger_breakpoint
ActorDiedError: The actor died unexpectedly before finishing this task.
class_name: AsyncActor
actor_id: 246e90dcf82b9d602dc76de701000000
pid: 46603
namespace: 9d14f738-c93d-4029-bb20-a806e2f56aed
ip: 10.219.204.96
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
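The raylet message lists three possible root causes, and the worker's own output is what singles out cause (3) here: it reports a SIGSEGV and a faulting symbol. As a rough triage aid, the sketch below pulls those two facts out of captured log lines. The regular expressions are modeled only on the lines quoted in this issue, and the helper name `summarize_crash` is hypothetical; other Ray versions may format the crash report differently.

```python
import re
from typing import Iterable, Optional, Tuple

# Patterns modeled on the log lines quoted in this issue (assumption:
# the crash report format may differ in other Ray versions).
SIGNAL_RE = re.compile(r"\*\*\* (SIG[A-Z0-9]+) received at time=(\d+)")
PC_RE = re.compile(r"PC: @\s+0x[0-9a-fA-F]+\s+\S+\s+(.+)$")


def summarize_crash(lines: Iterable[str]) -> Optional[Tuple[str, int, Optional[str]]]:
    """Return (signal, unix_time, faulting_symbol) for the first crash report.

    Returns None when no "*** SIG... received" line is found, which would
    point toward the other causes listed by the raylet (OOM kill, forced stop).
    """
    signal = None
    when = None
    for line in lines:
        if signal is None:
            # Still looking for the line that announces the fatal signal.
            match = SIGNAL_RE.search(line)
            if match:
                signal, when = match.group(1), int(match.group(2))
            continue
        # Signal found; the next "PC:" line names the faulting symbol.
        match = PC_RE.search(line)
        if match:
            return signal, when, match.group(1).strip()
    if signal is None:
        return None
    return signal, when, None


sample = [
    "(AsyncActor pid=46603) finished at 2024-06-20 18:19:49.999169.",
    "(AsyncActor pid=46603) *** SIGSEGV received at time=1718878790 on cpu 1 ***",
    "(AsyncActor pid=46603) PC: @ 0x787458702238 (unknown) boost::fibers::detail::spinlock_ttas::lock()",
]
print(summarize_crash(sample))
```

Applied to the log above, this reports a SIGSEGV inside `boost::fibers::detail::spinlock_ttas::lock()`, which is consistent with a native crash rather than an out-of-memory kill.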
Versions / Dependencies
Ray 2.24.0
Python 3.10.12
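Since the only workaround mentioned in this report is rolling back to Ray 2.23.0, a script can check the installed version before creating async actors. The following is a minimal sketch, not an official compatibility check: the helper names `parse_version` and `classify` are hypothetical, and only the two versions named in this issue are classified. Every other version is reported as unknown.

```python
import re
from typing import Tuple

# Assumption: only the versions named in this issue are classified.
KNOWN_AFFECTED = {(2, 24, 0)}  # actor dies with SIGSEGV per this report
KNOWN_WORKING = {(2, 23, 0)}   # same script reported to work


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse the leading 'major.minor.patch' of a version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def classify(version: str) -> str:
    """Return 'affected', 'working', or 'unknown' for a Ray version string."""
    parsed = parse_version(version)
    if parsed in KNOWN_AFFECTED:
        return "affected"
    if parsed in KNOWN_WORKING:
        return "working"
    return "unknown"


# In a real script the argument would be ray.__version__.
for candidate in ("2.24.0", "2.23.0", "2.25.0"):
    print(candidate, classify(candidate))
```

Returning "unknown" instead of guessing keeps the check honest: this report says nothing about versions other than 2.23.0 and 2.24.0.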
Reproduction script
Identical to the script quoted at the top of this issue.
Issue Severity
Medium: It is a significant difficulty but I can work around it.
leoricmaster added the bug and triage labels on Jun 20, 2024.