
[Core] SIGABRT causes Ray to hang #29576

Closed
amogkam opened this issue Oct 22, 2022 · 8 comments · Fixed by #30150

amogkam (Contributor) commented Oct 22, 2022

What happened + What you expected to happen

The script below causes Ray to hang. I would expect to see a RayActorError instead.

The output when running the script:

(base) ray@ip-172-31-80-95:~/default$ python test.py 
2022-10-21 17:56:17,484 INFO worker.py:1230 -- Using address localhost:9031 set in the environment variable RAY_ADDRESS
2022-10-21 17:56:17,742 INFO worker.py:1342 -- Connecting to existing Ray cluster at address: 172.31.80.95:9031...
2022-10-21 17:56:17,749 INFO worker.py:1519 -- Connected to Ray cluster. View the dashboard at https://console.anyscale-staging.com/api/v2/sessions/ses_tU65uDZ9uHVfEDuQzRnsX3EY/services?redirect_to=dashboard 
2022-10-21 17:56:17,751 INFO packaging.py:354 -- Pushing file package 'gcs://_ray_pkg_eff56cd0e859ca3470d515d0dc41effb.zip' (0.00MiB) to Ray cluster...
2022-10-21 17:56:17,752 INFO packaging.py:367 -- Successfully pushed file package 'gcs://_ray_pkg_eff56cd0e859ca3470d515d0dc41effb.zip'.
(Actor pid=11471) *** SIGABRT received at time=1666400179 on cpu 14 ***
(Actor pid=11471) PC: @     0x7f70d75e000b  (unknown)  raise
(Actor pid=11471)     @     0x7f70d77a3420  (unknown)  (unknown)
(Actor pid=11471)     @ ... and at least 3 more frames
(Actor pid=11471) [2022-10-21 17:56:19,050 E 11471 11471] logging.cc:361: *** SIGABRT received at time=1666400179 on cpu 14 ***
(Actor pid=11471) [2022-10-21 17:56:19,050 E 11471 11471] logging.cc:361: PC: @     0x7f70d75e000b  (unknown)  raise
(Actor pid=11471) [2022-10-21 17:56:19,050 E 11471 11471] logging.cc:361:     @     0x7f70d77a3420  (unknown)  (unknown)
(Actor pid=11471) [2022-10-21 17:56:19,050 E 11471 11471] logging.cc:361:     @ ... and at least 3 more frames
(Actor pid=11471) Fatal Python error: Aborted
(Actor pid=11471) 
(Actor pid=11471) Stack (most recent call first):
(Actor pid=11471)   File "test.py", line 32 in abort
(Actor pid=11471)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 466 in _resume_span
(Actor pid=11471)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/function_manager.py", line 674 in actor_method_executor
(Actor pid=11471)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/worker.py", line 763 in main_loop
(Actor pid=11471)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/workers/default_worker.py", line 231 in <module>
(raylet) [2022-10-21 17:57:27,057 E 761 802] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2022-10-21_17-03-26_460534_151 is over 95% full, available space: 15603879936; capacity: 312201752576. Object creation will fail if spilling is required.

Versions / Dependencies

master

Reproduction script

import os

import ray


@ray.remote
class Actor:
    def abort(self):
        # os.abort() sends SIGABRT to this worker process.
        os.abort()


a = Actor.remote()
ray.get(a.abort.remote())
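
For reference, the expected behavior is that ray.get() raises ray.exceptions.RayActorError once the worker dies, rather than blocking. A minimal sketch of how a caller would handle that, reusing the Actor defined above:

from ray.exceptions import RayActorError

a = Actor.remote()
try:
    ray.get(a.abort.remote())
except RayActorError as err:
    # Expected outcome: Ray detects the dead worker and raises,
    # instead of ray.get() hanging indefinitely.
    print(f"Actor died: {err}")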

Issue Severity

High: It blocks me from completing my task.

amogkam added the bug (Something that is supposed to be working; but isn't), triage (Needs triage, e.g. priority, bug/not-bug, and owning component), and core (Issues that should be addressed in Ray Core) labels on Oct 22, 2022
amogkam (Contributor, Author) commented Oct 22, 2022

cc @rkooo567 @cadedaniel

cadedaniel added the P1 (Issue that should be fixed within a few weeks) label and removed the triage label on Oct 22, 2022
rkooo567 added this to the Core Backlog milestone on Oct 22, 2022
rkooo567 (Contributor) commented:
We want to finish this by 2.2

cadedaniel (Member) commented:
When I try to reproduce, it looks like SIGABRT is crashing the worker process. Once Ray figures this out (~30 seconds), it raises a RayActorError. So it looks like this isn't actually an issue... let me try to reproduce with an NCCL workload.

(Actor pid=4993) *** SIGABRT received at time=1667934734 on cpu 3 ***
(Actor pid=4993) PC: @     0x7f6bd8a7c00b  (unknown)  raise
(Actor pid=4993)     @     0x7f6bd8c3f420  (unknown)  (unknown)
(Actor pid=4993)     @ ... and at least 3 more frames
(Actor pid=4993) [2022-11-08 11:12:14,808 E 4993 4993] logging.cc:361: *** SIGABRT received at time=1667934734 on cpu 3 ***
(Actor pid=4993) [2022-11-08 11:12:14,808 E 4993 4993] logging.cc:361: PC: @     0x7f6bd8a7c00b  (unknown)  raise
(Actor pid=4993) [2022-11-08 11:12:14,808 E 4993 4993] logging.cc:361:     @     0x7f6bd8c3f420  (unknown)  (unknown)
(Actor pid=4993) [2022-11-08 11:12:14,808 E 4993 4993] logging.cc:361:     @ ... and at least 3 more frames
(Actor pid=4993) Fatal Python error: Aborted
(Actor pid=4993)
(Actor pid=4993) Stack (most recent call first):
(Actor pid=4993)   File "example", line 9 in abort
(Actor pid=4993)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 466 in _resume_span
(Actor pid=4993)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/function_manager.py", line 674 in actor_method_executor
(Actor pid=4993)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/worker.py", line 763 in main_loop
(Actor pid=4993)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/workers/default_worker.py", line 232 in <module>
2022-11-08 11:12:57,501 WARNING worker.py:1849 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff95b8c940dfe0d149ef8ee1ba02000000 Worker ID: 8f958f908210706cbe8f7fa5fe90d09967a200824c6e93e6c15008be Node ID: 3eea48bbcef4da3b5f2559af2183a0a5e695fb30ed82901e618712fc Worker IP address: 172.31.106.205 Worker port: 10005 Worker PID: 4993 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Traceback (most recent call last):
  File "example", line 13, in <module>
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/worker.py", line 2305, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
        class_name: Actor
        actor_id: 95b8c940dfe0d149ef8ee1ba02000000
        pid: 4993
        namespace: f466e965-d536-4ed2-bc0c-adb22dce4a3f
        ip: 172.31.106.205
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

cadedaniel (Member) commented:
Unexpectedly, the process stays alive for some time with its status alternating between D (uninterruptible sleep) and R (running). Not sure why it needs to go into D.

ray         9348 11.7 20.9 19624540 13481692 ?   DNl  11:29   0:13 ray::Actor.abort()

cadedaniel (Member) commented:
Looks like this is actually expected behavior. I put together a doc explaining why this happens and proposing how to improve the user experience: [public] Coredumps appearing to cause hangs in Ray

TL;DR

Calling os.abort() inside a Ray actor on a large instance appears to cause a hang. It is not actually hung: the process is busy writing a very large coredump file (roughly the size of the object store). We should improve the user experience so that users aren't confused, either by improving the documentation or by filtering the object store mmaps out of coredumps.

Related issue: #12505
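
A possible user-side mitigation, separate from the madvise change proposed above, is to disable core dumps inside the worker process itself, so an aborting actor dies immediately instead of spending minutes writing a core file roughly the size of the object store. A minimal sketch, assuming losing the coredump is acceptable:

import os
import resource

import ray


@ray.remote
class Actor:
    def __init__(self):
        # rlimits are per-process, so this has to run inside the worker
        # process, not the driver. With a soft RLIMIT_CORE of 0, SIGABRT
        # kills the worker without writing a core file, so Ray notices
        # the death promptly instead of appearing to hang.
        _, hard = resource.getrlimit(resource.RLIMIT_CORE)
        resource.setrlimit(resource.RLIMIT_CORE, (0, hard))

    def abort(self):
        os.abort()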

cadedaniel (Member) commented:
@amogkam I think you can merge your PR #29562 since this is expected behavior. I'll create a PR soon that prevents the object store from being written into the coredump, which will improve the user experience when Torch detects NCCL hangs on Ray.

amogkam (Contributor, Author) commented Nov 15, 2022

Thanks for the investigation @cadedaniel, makes sense!

Trying again on my laptop instead of on a cluster, I do get a RayActorError immediately.

However, when running in a Python interactive shell, I do not get a RayActorError on ray.get.

amogkam (Contributor, Author) commented Nov 15, 2022

We should address the large coredump issue before merging #29562 though.

Otherwise, it will appear as if training is hanging on large clusters.

scv119 pushed a commit that referenced this issue on Nov 22, 2022 (#30150)

Description
If a Ray worker process crashes and generates a coredump, Linux will by default dump all pages mapped by the process. This includes the plasma (object store) pages, which on large instances form a very large mapping. This PR uses madvise to disable dumping of the plasma pages in worker processes. As a result, coredumps generated by Ray worker processes are now ~300MB instead of roughly the size of the object store.

See the doc [public] Coredumps appearing to cause hangs in Ray for more information.
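
The change is at the C++ level, but the same madvise(MADV_DONTDUMP) idea can be sketched in Python as an illustration (not the PR's actual code); on Linux with Python 3.8+, mmap exposes the call directly:

import mmap

# Map a large anonymous region, standing in for the plasma store mapping.
region = mmap.mmap(-1, 256 * 1024 * 1024)

# Advise the kernel not to include these pages in a core dump.
# mmap.MADV_DONTDUMP is only defined where the platform supports it (Linux).
if hasattr(mmap, "MADV_DONTDUMP"):
    region.madvise(mmap.MADV_DONTDUMP)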

Testing
It is difficult to test this in CI because (1) at the C++ level there isn't a Linux API to verify the madvise status of pages, and (2) coredumps aren't enabled (/proc/sys/kernel/core_pattern is set to an invalid value and ulimit -c is 0), so a Python-level test won't work. I could enable coredumps in CI, but that feels like a big change; I want to check before going down that path.
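
As a hedged aside (not part of the PR), whether core dumps are enabled in a given environment can be checked from Python before attempting such a test:

import resource

# Core dumps are effectively disabled when the soft RLIMIT_CORE is 0,
# or when core_pattern does not point anywhere useful.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
with open("/proc/sys/kernel/core_pattern") as f:
    pattern = f.read().strip()

print(f"RLIMIT_CORE (soft/hard): {soft}/{hard}")
print(f"core_pattern: {pattern}")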

In terms of manual testing, this is disabled by a macro for non-Linux builds. On Linux, the coredump size goes down significantly:

$ ls -alh /tmp/core.ray::Actor.abor.88940 # without madvise
-rw------- 1 ray users 9.6G Nov 13 10:52 /tmp/core.ray::Actor.abor.88940
$ ls -alh /tmp/core.ray::Actor.abor.97217 # with madvise
-rw------- 1 ray users 239M Nov 13 11:09 /tmp/core.ray::Actor.abor.97217
$ gdb -c /tmp/core.ray::Actor.abor.97217
(gdb) info proc mappings
Mapped address spaces:

          Start Addr           End Addr       Size     Offset objfile
      0x55a3af4a7000     0x55a3af506000    0x5f000        0x0 /home/ray/anaconda3/bin/python3.8
      0x55a3af506000     0x55a3af6fe000   0x1f8000    0x5f000 /home/ray/anaconda3/bin/python3.8
      0x55a3af6fe000     0x55a3af7e5000    0xe7000   0x257000 /home/ray/anaconda3/bin/python3.8
      0x55a3af7e6000     0x55a3af7eb000     0x5000   0x33e000 /home/ray/anaconda3/bin/python3.8
      0x55a3af7eb000     0x55a3af823000    0x38000   0x343000 /home/ray/anaconda3/bin/python3.8
      0x7fc9792c0000     0x7fcd64000000 0x3ead40000        0x0 /dev/shm/plasmax6oDcM (deleted)
(...)
Closes #29576
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this issue Dec 19, 2022