
[Core] SIGABRT causes Ray to hang #29576

Closed
amogkam opened this issue Oct 22, 2022 · 8 comments · Fixed by #30150

amogkam (Contributor) commented Oct 22, 2022

What happened + What you expected to happen

The script below causes Ray to hang. I would expect to see a RayActorError instead.

The output when running the script:

(base) ray@ip-172-31-80-95:~/default$ python test.py 
2022-10-21 17:56:17,484 INFO worker.py:1230 -- Using address localhost:9031 set in the environment variable RAY_ADDRESS
2022-10-21 17:56:17,742 INFO worker.py:1342 -- Connecting to existing Ray cluster at address: 172.31.80.95:9031...
2022-10-21 17:56:17,749 INFO worker.py:1519 -- Connected to Ray cluster. View the dashboard at https://console.anyscale-staging.com/api/v2/sessions/ses_tU65uDZ9uHVfEDuQzRnsX3EY/services?redirect_to=dashboard 
2022-10-21 17:56:17,751 INFO packaging.py:354 -- Pushing file package 'gcs://_ray_pkg_eff56cd0e859ca3470d515d0dc41effb.zip' (0.00MiB) to Ray cluster...
2022-10-21 17:56:17,752 INFO packaging.py:367 -- Successfully pushed file package 'gcs://_ray_pkg_eff56cd0e859ca3470d515d0dc41effb.zip'.
(Actor pid=11471) *** SIGABRT received at time=1666400179 on cpu 14 ***
(Actor pid=11471) PC: @     0x7f70d75e000b  (unknown)  raise
(Actor pid=11471)     @     0x7f70d77a3420  (unknown)  (unknown)
(Actor pid=11471)     @ ... and at least 3 more frames
(Actor pid=11471) [2022-10-21 17:56:19,050 E 11471 11471] logging.cc:361: *** SIGABRT received at time=1666400179 on cpu 14 ***
(Actor pid=11471) [2022-10-21 17:56:19,050 E 11471 11471] logging.cc:361: PC: @     0x7f70d75e000b  (unknown)  raise
(Actor pid=11471) [2022-10-21 17:56:19,050 E 11471 11471] logging.cc:361:     @     0x7f70d77a3420  (unknown)  (unknown)
(Actor pid=11471) [2022-10-21 17:56:19,050 E 11471 11471] logging.cc:361:     @ ... and at least 3 more frames
(Actor pid=11471) Fatal Python error: Aborted
(Actor pid=11471) 
(Actor pid=11471) Stack (most recent call first):
(Actor pid=11471)   File "test.py", line 32 in abort
(Actor pid=11471)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 466 in _resume_span
(Actor pid=11471)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/function_manager.py", line 674 in actor_method_executor
(Actor pid=11471)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/worker.py", line 763 in main_loop
(Actor pid=11471)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/workers/default_worker.py", line 231 in <module>
(raylet) [2022-10-21 17:57:27,057 E 761 802] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2022-10-21_17-03-26_460534_151 is over 95% full, available space: 15603879936; capacity: 312201752576. Object creation will fail if spilling is required.

Versions / Dependencies

master

Reproduction script

import os

import ray


@ray.remote
class Actor:
    def abort(self):
        # os.abort() sends SIGABRT to this worker process.
        os.abort()


a = Actor.remote()
ray.get(a.abort.remote())
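
For reference, the expected behavior is that ray.get() raises ray.exceptions.RayActorError once the worker dies, rather than blocking. A minimal sketch of how a caller would handle that, reusing the Actor defined above:

from ray.exceptions import RayActorError

a = Actor.remote()
try:
    ray.get(a.abort.remote())
except RayActorError as err:
    # Expected outcome: Ray detects the dead worker and raises,
    # instead of ray.get() hanging indefinitely.
    print(f"Actor died: {err}")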

Issue Severity

High: It blocks me from completing my task.

amogkam added the bug (Something that is supposed to be working; but isn't), triage (Needs triage, e.g. priority, bug/not-bug, and owning component), and core (Issues that should be addressed in Ray Core) labels on Oct 22, 2022
amogkam (Contributor, Author) commented Oct 22, 2022

cc @rkooo567 @cadedaniel

cadedaniel added the P1 (Issue that should be fixed within a few weeks) label and removed the triage label on Oct 22, 2022
rkooo567 added this to the Core Backlog milestone on Oct 22, 2022
rkooo567 (Contributor) commented:
We want to finish this by 2.2

cadedaniel (Member) commented:
When I try to reproduce, it looks like SIGABRT is crashing the worker process. Once Ray figures this out (~30 seconds), it raises a RayActorError. So it looks like this isn't actually an issue... let me try to reproduce with an NCCL workload.

(Actor pid=4993) *** SIGABRT received at time=1667934734 on cpu 3 ***
(Actor pid=4993) PC: @     0x7f6bd8a7c00b  (unknown)  raise
(Actor pid=4993)     @     0x7f6bd8c3f420  (unknown)  (unknown)
(Actor pid=4993)     @ ... and at least 3 more frames
(Actor pid=4993) [2022-11-08 11:12:14,808 E 4993 4993] logging.cc:361: *** SIGABRT received at time=1667934734 on cpu 3 ***
(Actor pid=4993) [2022-11-08 11:12:14,808 E 4993 4993] logging.cc:361: PC: @     0x7f6bd8a7c00b  (unknown)  raise
(Actor pid=4993) [2022-11-08 11:12:14,808 E 4993 4993] logging.cc:361:     @     0x7f6bd8c3f420  (unknown)  (unknown)
(Actor pid=4993) [2022-11-08 11:12:14,808 E 4993 4993] logging.cc:361:     @ ... and at least 3 more frames
(Actor pid=4993) Fatal Python error: Aborted
(Actor pid=4993)
(Actor pid=4993) Stack (most recent call first):
(Actor pid=4993)   File "example", line 9 in abort
(Actor pid=4993)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 466 in _resume_span
(Actor pid=4993)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/function_manager.py", line 674 in actor_method_executor
(Actor pid=4993)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/worker.py", line 763 in main_loop
(Actor pid=4993)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/workers/default_worker.py", line 232 in <module>
2022-11-08 11:12:57,501 WARNING worker.py:1849 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff95b8c940dfe0d149ef8ee1ba02000000 Worker ID: 8f958f908210706cbe8f7fa5fe90d09967a200824c6e93e6c15008be Node ID: 3eea48bbcef4da3b5f2559af2183a0a5e695fb30ed82901e618712fc Worker IP address: 172.31.106.205 Worker port: 10005 Worker PID: 4993 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Traceback (most recent call last):
  File "example", line 13, in <module>
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/worker.py", line 2305, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
        class_name: Actor
        actor_id: 95b8c940dfe0d149ef8ee1ba02000000
        pid: 4993
        namespace: f466e965-d536-4ed2-bc0c-adb22dce4a3f
        ip: 172.31.106.205
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

cadedaniel (Member) commented:
Unexpectedly, the process stays alive for some time with its status alternating between D (uninterruptible sleep) and R (running). Not sure why it needs to go into D.

ray         9348 11.7 20.9 19624540 13481692 ?   DNl  11:29   0:13 ray::Actor.abort()

cadedaniel (Member) commented:
Looks like this is actually expected behavior. I put together a doc explaining why this happens and proposing how to improve the user experience: [public] Coredumps appearing to cause hangs in Ray

TL;DR

Calling os.abort() inside a Ray actor on a large instance appears to cause a hang. It is not actually hung: the process is busy writing a very large coredump file (roughly the size of the object store). We should improve the user experience so that users aren't confused, either by improving the documentation or by filtering the object store mmaps out of coredumps.

Related issue: #12505
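
A possible user-side mitigation, separate from the madvise change proposed above, is to disable core dumps inside the worker process itself, so an aborting actor dies immediately instead of spending minutes writing a core file roughly the size of the object store. A minimal sketch, assuming losing the coredump is acceptable:

import os
import resource

import ray


@ray.remote
class Actor:
    def __init__(self):
        # rlimits are per-process, so this has to run inside the worker
        # process, not the driver. With a soft RLIMIT_CORE of 0, SIGABRT
        # kills the worker without writing a core file, so Ray notices
        # the death promptly instead of appearing to hang.
        _, hard = resource.getrlimit(resource.RLIMIT_CORE)
        resource.setrlimit(resource.RLIMIT_CORE, (0, hard))

    def abort(self):
        os.abort()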

cadedaniel (Member) commented:
@amogkam I think you can merge your PR #29562 since this is expected behavior. I'll create a PR soon that prevents the object store from being written into the coredump, which will improve the user experience when Torch detects NCCL hangs on Ray.

amogkam (Contributor, Author) commented Nov 15, 2022

Thanks for the investigation @cadedaniel, makes sense!

Trying again on my laptop instead of on a cluster, I do get a RayActorError immediately.

However, when running in a Python interactive shell, I do not get a RayActorError on ray.get.

amogkam (Contributor, Author) commented Nov 15, 2022

We should address the large coredump issue before merging #29562 though.

Otherwise, it will appear as if training is hanging on large clusters.

scv119 pushed a commit that referenced this issue on Nov 22, 2022 (#30150)

Description
If a Ray worker process crashes and generates a coredump, Linux will by default dump all pages mapped by the process. This includes the plasma (object store) pages, which on large instances form a very large mapping. This PR uses madvise to disable dumping of the plasma pages in worker processes. As a result, coredumps generated by Ray worker processes are now ~300MB instead of roughly the size of the object store.

See the doc [public] Coredumps appearing to cause hangs in Ray for more information.
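
The change is at the C++ level, but the same madvise(MADV_DONTDUMP) idea can be sketched in Python as an illustration (not the PR's actual code); on Linux with Python 3.8+, mmap exposes the call directly:

import mmap

# Map a large anonymous region, standing in for the plasma store mapping.
region = mmap.mmap(-1, 256 * 1024 * 1024)

# Advise the kernel not to include these pages in a core dump.
# mmap.MADV_DONTDUMP is only defined where the platform supports it (Linux).
if hasattr(mmap, "MADV_DONTDUMP"):
    region.madvise(mmap.MADV_DONTDUMP)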

Testing
It is difficult to test this in CI because (1) at the C++ level there isn't a Linux API to verify the madvise status of pages, and (2) coredumps aren't enabled (/proc/sys/kernel/core_pattern is set to an invalid value and ulimit -c is 0), so a Python-level test won't work. I could enable coredumps in CI, but that feels like a big change; I want to check before going down that path.
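
As a hedged aside (not part of the PR), whether core dumps are enabled in a given environment can be checked from Python before attempting such a test:

import resource

# Core dumps are effectively disabled when the soft RLIMIT_CORE is 0,
# or when core_pattern does not point anywhere useful.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
with open("/proc/sys/kernel/core_pattern") as f:
    pattern = f.read().strip()

print(f"RLIMIT_CORE (soft/hard): {soft}/{hard}")
print(f"core_pattern: {pattern}")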

In terms of manual testing, this is disabled by a macro for non-Linux builds. On Linux, the coredump size goes down significantly:

$ ls -alh /tmp/core.ray::Actor.abor.88940 # without madvise
-rw------- 1 ray users 9.6G Nov 13 10:52 /tmp/core.ray::Actor.abor.88940
$ ls -alh /tmp/core.ray::Actor.abor.97217 # with madvise
-rw------- 1 ray users 239M Nov 13 11:09 /tmp/core.ray::Actor.abor.97217
$ gdb -c /tmp/core.ray::Actor.abor.97217
(gdb) info proc mappings
Mapped address spaces:

          Start Addr           End Addr       Size     Offset objfile
      0x55a3af4a7000     0x55a3af506000    0x5f000        0x0 /home/ray/anaconda3/bin/python3.8
      0x55a3af506000     0x55a3af6fe000   0x1f8000    0x5f000 /home/ray/anaconda3/bin/python3.8
      0x55a3af6fe000     0x55a3af7e5000    0xe7000   0x257000 /home/ray/anaconda3/bin/python3.8
      0x55a3af7e6000     0x55a3af7eb000     0x5000   0x33e000 /home/ray/anaconda3/bin/python3.8
      0x55a3af7eb000     0x55a3af823000    0x38000   0x343000 /home/ray/anaconda3/bin/python3.8
      0x7fc9792c0000     0x7fcd64000000 0x3ead40000        0x0 /dev/shm/plasmax6oDcM (deleted)
(...)
Closes #29576
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this issue Dec 19, 2022