
[<Ray component: Core>] Actor dies when using asyncio in actors #46151

Closed

leoricmaster opened this issue Jun 20, 2024 · 1 comment
Labels: bug, core, triage

Comments

leoricmaster commented Jun 20, 2024
What happened + What you expected to happen

The code below causes the actor to die on Ray 2.24.0. The same code works if you roll back to Ray 2.23.0:

import ray
import asyncio
from datetime import datetime

@ray.remote
class AsyncActor:
    # Multiple invocations of this method can be running in
    # the event loop at the same time.
    async def run_concurrent(self):
        print(f"started at {datetime.now()}.")
        await asyncio.sleep(2) # concurrent workload here
        print(f"finished at {datetime.now()}.")
        return True

# Create an async actor.
actor = AsyncActor.remote()

# Way 1: regular ray.get
print(ray.get([actor.run_concurrent.remote() for _ in range(4)]))

# Way 2: async ray.get
async def async_get():
    tasks = [actor.run_concurrent.remote() for _ in range(4)]
    completed_tasks = await asyncio.gather(*tasks)
    return completed_tasks

print(asyncio.run(async_get()))

import time
time.sleep(0.1)
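
For context, "Way 2" relies on the fact that Ray ObjectRefs are awaitable inside an asyncio event loop, which is documented Ray behavior. A minimal standalone check of that mechanism (independent of the crash), using a hypothetical task f:

import asyncio
import ray

@ray.remote
def f():
    return 1

async def main():
    # ObjectRefs can be awaited directly; gather awaits them concurrently.
    return await asyncio.gather(*[f.remote() for _ in range(2)])

print(asyncio.run(main()))  # expected output: [1, 1]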

Logs:

(AsyncActor pid=46603) started at 2024-06-20 18:19:47.989231.
(AsyncActor pid=46603) started at 2024-06-20 18:19:47.989474.
(AsyncActor pid=46603) started at 2024-06-20 18:19:47.989493.
(AsyncActor pid=46603) started at 2024-06-20 18:19:47.989532.
(AsyncActor pid=46603) finished at 2024-06-20 18:19:49.999099.
(AsyncActor pid=46603) finished at 2024-06-20 18:19:49.999147.
(AsyncActor pid=46603) finished at 2024-06-20 18:19:49.999159.
(AsyncActor pid=46603) finished at 2024-06-20 18:19:49.999169.
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff246e90dcf82b9d602dc76de701000000 Worker ID: 210b549e23bb6bb19cc404a7c1b29fd3bc550b630cc4f802360e8e49 Node ID: 1245d524846dd39e1efdf9cae388c228a9c1c4fb24b63935c8016968 Worker IP address: 10.219.204.96 Worker port: 37513 Worker PID: 46603 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(AsyncActor pid=46603) *** SIGSEGV received at time=1718878790 on cpu 1 ***
(AsyncActor pid=46603) PC: @     0x787458702238  (unknown)  boost::fibers::detail::spinlock_ttas::lock()
(AsyncActor pid=46603)     @     0x78745a442520       6768  (unknown)
(AsyncActor pid=46603)     @     0x787458873a70         64  boost::fibers::mutex::lock()
(AsyncActor pid=46603)     @     0x787458800580         96  std::_Function_handler<>::_M_invoke()
(AsyncActor pid=46603)     @     0x7874587f8b95         96  boost::fibers::worker_context<>::run_()
(AsyncActor pid=46603)     @     0x7874587f8910         80  boost::context::detail::fiber_entry<>()
(AsyncActor pid=46603)     @     0x787458874bcf  (unknown)  make_fcontext
(AsyncActor pid=46603) [2024-06-20 18:19:50,103 E 46603 46638] logging.cc:343: *** SIGSEGV received at time=1718878790 on cpu 1 ***
(AsyncActor pid=46603) [2024-06-20 18:19:50,103 E 46603 46638] logging.cc:343: PC: @     0x787458702238  (unknown)  boost::fibers::detail::spinlock_ttas::lock()
(AsyncActor pid=46603) [2024-06-20 18:19:50,103 E 46603 46638] logging.cc:343:     @     0x78745a442520       6768  (unknown)
(AsyncActor pid=46603) [2024-06-20 18:19:50,103 E 46603 46638] logging.cc:343:     @     0x787458873a70         64  boost::fibers::mutex::lock()
(AsyncActor pid=46603) [2024-06-20 18:19:50,103 E 46603 46638] logging.cc:343:     @     0x787458800580         96  std::_Function_handler<>::_M_invoke()
(AsyncActor pid=46603) [2024-06-20 18:19:50,103 E 46603 46638] logging.cc:343:     @     0x7874587f8b95         96  boost::fibers::worker_context<>::run_()
(AsyncActor pid=46603) [2024-06-20 18:19:50,103 E 46603 46638] logging.cc:343:     @     0x7874587f8910         80  boost::context::detail::fiber_entry<>()
(AsyncActor pid=46603) [2024-06-20 18:19:50,103 E 46603 46638] logging.cc:343:     @     0x787458874bcf  (unknown)  make_fcontext
(AsyncActor pid=46603) Fatal Python error: Segmentation fault
(AsyncActor pid=46603) 
(AsyncActor pid=46603) Stack (most recent call first):
(AsyncActor pid=46603)   <no Python frame>
(AsyncActor pid=46603) 
(AsyncActor pid=46603) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, _brotli, charset_normalizer.md, simplejson._speedups, uvloop.loop, ray._raylet, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pyarrow.lib, pyarrow._hdfsio, pyarrow._json (total: 27)
---------------------------------------------------------------------------
ActorDiedError                            Traceback (most recent call last)
Cell In[75], line 19
     16 actor = AsyncActor.remote()
     18 # Way 1: regular ray.get
---> 19 print(ray.get([actor.run_concurrent.remote() for _ in range(4)]))
     21 # Way 2: async ray.get
     22 async def async_get():

File ~/miniforge3/envs/ray/lib/python3.10/site-packages/ray/_private/auto_init_hook.py:21, in wrap_auto_init.<locals>.auto_init_wrapper(*args, **kwargs)
     18 @wraps(fn)
     19 def auto_init_wrapper(*args, **kwargs):
     20     auto_init_ray()
---> 21     return fn(*args, **kwargs)

File ~/miniforge3/envs/ray/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:103, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
    101     if func.__name__ != "init" or is_client_mode_enabled_by_default:
    102         return getattr(ray, func.__name__)(*args, **kwargs)
--> 103 return func(*args, **kwargs)

File ~/miniforge3/envs/ray/lib/python3.10/site-packages/ray/_private/worker.py:2613, in get(object_refs, timeout)
   2607     raise ValueError(
   2608         f"Invalid type of object refs, {type(object_refs)}, is given. "
   2609         "'object_refs' must either be an ObjectRef or a list of ObjectRefs. "
   2610     )
   2612 # TODO(ujvl): Consider how to allow user to retrieve the ready objects.
-> 2613 values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
   2614 for i, value in enumerate(values):
   2615     if isinstance(value, RayError):

File ~/miniforge3/envs/ray/lib/python3.10/site-packages/ray/_private/worker.py:863, in Worker.get_objects(self, object_refs, timeout)
    861             raise value.as_instanceof_cause()
    862         else:
--> 863             raise value
    864 return values, debugger_breakpoint

ActorDiedError: The actor died unexpectedly before finishing this task.
	class_name: AsyncActor
	actor_id: 246e90dcf82b9d602dc76de701000000
	pid: 46603
	namespace: 9d14f738-c93d-4029-bb20-a806e2f56aed
	ip: 10.219.204.96
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

Versions / Dependencies

Ray 2.24.0
Python 3.10.12

Reproduction script

(Same script as shown in the "What happened" section above.)

Issue Severity

Medium: It is a significant difficulty but I can work around it.
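
One possible workaround, since the report says Ray 2.23.0 is unaffected, is simply pinning the older release (pip install "ray==2.23.0"). Another sketch, untested against this particular crash, is to avoid asyncio in the actor and use a threaded actor instead: setting max_concurrency on a sync actor gives method-level concurrency via a thread pool rather than an event loop.

import time
import ray
from datetime import datetime

@ray.remote
class ThreadedActor:
    # Synchronous method; concurrency comes from max_concurrency threads,
    # not from an asyncio event loop.
    def run_concurrent(self):
        print(f"started at {datetime.now()}.")
        time.sleep(2)  # concurrent workload here
        print(f"finished at {datetime.now()}.")
        return True

# max_concurrency > 1 on a sync actor makes it a threaded actor.
actor = ThreadedActor.options(max_concurrency=4).remote()
print(ray.get([actor.run_concurrent.remote() for _ in range(4)]))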

leoricmaster added the bug and triage labels on Jun 20, 2024
anyscalesam added the core label on Jun 21, 2024
jjyao (Collaborator) commented Jul 1, 2024

Hi @leoricmaster, if you try the nightly build now, it works. I think it might have been fixed by #46133.
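
To check whether a given build includes the fix, one approach (assuming a nightly wheel is installed per the Ray installation docs; the exact wheel URL depends on platform and Python version) is to confirm the build's commit and then re-run the reproduction script:

import ray
# Nightly builds report a dev version plus the commit they were built from.
print(ray.__version__, ray.__commit__)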
