[Core] SIGABRT causes Ray to hang #29576
Comments
We want to finish this by 2.2.
When I try to reproduce, it looks like SIGABRT is crashing the worker process. Once Ray figures this out (~30 seconds), Ray raises a RayActorError.

Unexpectedly, the process stays alive for some time, with its status alternating between two states.
Looks like this is actually expected behavior. I put together a doc explaining why this is and a proposal on how to improve the user experience: [public] Coredumps appearing to cause hangs in Ray. Related issue: #12505
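As a quick sanity check while investigating (an illustrative sketch, not taken from the linked doc), you can confirm from Python whether the host is configured to write coredumps at all, since a large core file being flushed to disk is what makes the crash look like a hang:

```python
# Sketch: check whether this host will actually write coredumps.
# Assumes Linux; core_pattern and RLIMIT_CORE are standard kernel/libc interfaces.
import resource

with open("/proc/sys/kernel/core_pattern") as f:
    print("core_pattern:", f.read().strip())

soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
print("RLIMIT_CORE soft/hard:", soft, hard)  # 0 means coredumps are disabled
```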
Thanks for the investigation @cadedaniel, makes sense! Trying again on my laptop instead of on a cluster, I do get a RayActorError. However, when running in a Python interactive shell, I do not get one.
We should address the large coredump issue before merging #29562, though. Otherwise, it will appear as if training is hanging on large clusters.
(#30150)

Description
If a Ray worker process crashes and generates a coredump, Linux will by default dump all pages mapped by the process. This includes the plasma pages, which on large instances can be quite large. This PR uses madvise to disable dumping of the plasma pages in worker processes. The impact is that coredumps generated by Ray worker processes are now ~300MB instead of roughly the object store size. See [public] Coredumps appearing to cause hangs in Ray for more information.

Testing
It is difficult to test this in CI because 1) at the C++ level there isn't a Linux API to verify the madvise status of pages, and 2) coredumps aren't enabled (/proc/sys/kernel/core_pattern is set to an invalid value and ulimit -c is 0), so a Python-level test won't work. I can go and enable coredumps in CI, but that feels like a big change; I want to check before going down that path. In terms of manual testing, this is disabled by macro for non-Linux builds. On Linux, the coredump size goes down significantly:

$ ls -alh /tmp/core.ray::Actor.abor.88940   # without madvise
-rw------- 1 ray users 9.6G Nov 13 10:52 /tmp/core.ray::Actor.abor.88940
$ ls -alh /tmp/core.ray::Actor.abor.97217   # with madvise
-rw------- 1 ray users 239M Nov 13 11:09 /tmp/core.ray::Actor.abor.97217
$ gdb -c /tmp/core.ray::Actor.abor.97217
(gdb) info proc mappings
Mapped address spaces:
    Start Addr       End Addr        Size         Offset     objfile
    0x55a3af4a7000   0x55a3af506000  0x5f000      0x0        /home/ray/anaconda3/bin/python3.8
    0x55a3af506000   0x55a3af6fe000  0x1f8000     0x5f000    /home/ray/anaconda3/bin/python3.8
    0x55a3af6fe000   0x55a3af7e5000  0xe7000      0x257000   /home/ray/anaconda3/bin/python3.8
    0x55a3af7e6000   0x55a3af7eb000  0x5000       0x33e000   /home/ray/anaconda3/bin/python3.8
    0x55a3af7eb000   0x55a3af823000  0x38000      0x343000   /home/ray/anaconda3/bin/python3.8
    0x7fc9792c0000   0x7fcd64000000  0x3ead40000  0x0        /dev/shm/plasmax6oDcM (deleted)
    (...)

Closes #29576
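The PR itself changes Ray's C++ plasma client, but the underlying technique can be illustrated in a few lines of Python (a minimal sketch, assuming Linux and Python 3.8+, where mmap.madvise and mmap.MADV_DONTDUMP are available; this is not the code from the PR):

```python
# Sketch: exclude a large shared-memory mapping from coredumps with MADV_DONTDUMP.
import mmap

SIZE = 256 * 1024 * 1024            # stand-in for a large plasma-style mapping

buf = mmap.mmap(-1, SIZE)           # anonymous shared mapping
buf[:8] = b"deadbeef"               # touch a few pages so they are resident

# Ask the kernel not to include these pages in any coredump. The mapping stays
# fully readable and writable; only core-dump behavior changes.
if hasattr(mmap, "MADV_DONTDUMP"):  # constant only exists where the OS supports it
    buf.madvise(mmap.MADV_DONTDUMP)

# If the process crashes now (e.g. via SIGABRT), the core file should no longer
# contain these pages, mirroring the ~9.6G -> ~239M reduction shown above.
```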
What happened + What you expected to happen
The below script causes Ray to hang. I would expect to see a RayActorError instead.

The output when running the script:
Versions / Dependencies
master
Reproduction script
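The reproduction script itself is not included in this extract. Based on the description (a worker receiving SIGABRT) and the core.ray::Actor.abor... core file names in the PR above, a hypothetical reproduction might look like the following sketch (the actor and method names are guesses, not the original code):

```python
# Hypothetical repro sketch: an actor whose method kills its own worker with SIGABRT.
import os
import ray

ray.init()

@ray.remote
class Actor:
    def abort(self):
        os.abort()  # raises SIGABRT in the worker, producing a coredump if enabled

actor = Actor.remote()

# Expected: this eventually raises RayActorError once Ray notices the dead worker.
# Reported: it appears to hang (for a long time on large hosts) while the coredump is written.
ray.get(actor.abort.remote())
```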
Issue Severity
High: It blocks me from completing my task.