Commit
[Core] Disabling object store from appearing in worker coredumps (ray-project#30150)

Description

If a Ray worker process crashes and generates a coredump, Linux by default dumps all pages mapped by the process. This includes the plasma pages, which can be very large on large instances. This PR uses madvise to disable dumping of the plasma pages in worker processes.

The impact is that coredumps generated by Ray worker processes are now ~300MB instead of roughly the size of the object store.

See "[public] Coredumps appearing to cause hangs in Ray" for more information.

Testing

It is difficult to test this in CI because:

1) at the C++ level there isn't a Linux API to verify the madvise status of pages, and
2) coredumps aren't enabled (/proc/sys/kernel/core_pattern is set to an invalid value and ulimit -c is 0), so a Python-level test won't work.

I can enable coredumps in CI, but that feels like a big change, so I want to check before going down that path.

In terms of manual testing, this is disabled by macro for non-Linux builds. On Linux, the coredump size goes down significantly:

    $ ls -alh /tmp/core.ray::Actor.abor.88940 # without madvise
    -rw------- 1 ray users 9.6G Nov 13 10:52 /tmp/core.ray::Actor.abor.88940
    $ ls -alh /tmp/core.ray::Actor.abor.97217 # with madvise
    -rw------- 1 ray users 239M Nov 13 11:09 /tmp/core.ray::Actor.abor.97217
    $ gdb -c /tmp/core.ray::Actor.abor.97217
    (gdb) info proc mappings
    Mapped address spaces:

    Start Addr      End Addr        Size         Offset    objfile
    0x55a3af4a7000  0x55a3af506000  0x5f000      0x0       /home/ray/anaconda3/bin/python3.8
    0x55a3af506000  0x55a3af6fe000  0x1f8000     0x5f000   /home/ray/anaconda3/bin/python3.8
    0x55a3af6fe000  0x55a3af7e5000  0xe7000      0x257000  /home/ray/anaconda3/bin/python3.8
    0x55a3af7e6000  0x55a3af7eb000  0x5000       0x33e000  /home/ray/anaconda3/bin/python3.8
    0x55a3af7eb000  0x55a3af823000  0x38000      0x343000  /home/ray/anaconda3/bin/python3.8
    0x7fc9792c0000  0x7fcd64000000  0x3ead40000  0x0       /dev/shm/plasmax6oDcM (deleted)
    (...)

Closes ray-project#29576

Signed-off-by: Weichen Xu <[email protected]>
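For readers unfamiliar with the technique, here is a minimal sketch of excluding a mapping from coredumps with madvise. It is not the code from this PR: the helper names `ExcludeFromCoredump` and `MapRegion` are hypothetical, the anonymous shared mapping only stands in for the plasma mapping, and the use of the `MADV_DONTDUMP` advice flag is an assumption (the message above only says "madvise"). The preprocessor guard mirrors the "disabled by macro for non-Linux builds" behavior described above.

```cpp
#include <sys/mman.h>

#include <cassert>
#include <cstddef>

// Hypothetical helper (not Ray's actual function): asks the kernel to skip
// the range [addr, addr + length) when it writes a coredump for this process.
// addr must be page-aligned, which holds for any pointer returned by mmap.
// Returns true on success.
bool ExcludeFromCoredump(void *addr, std::size_t length) {
  assert(addr != nullptr && length > 0);
#if defined(__linux__) && defined(MADV_DONTDUMP)
  // Assumption: MADV_DONTDUMP is the advice value used for this purpose.
  return madvise(addr, length, MADV_DONTDUMP) == 0;
#else
  // Non-Linux builds: compile to a no-op.
  (void)addr;
  (void)length;
  return true;
#endif
}

// Hypothetical stand-in for the plasma mapping: an anonymous shared mapping
// rather than a file under /dev/shm. Returns nullptr on failure.
void *MapRegion(std::size_t length) {
  void *p = mmap(nullptr, length, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
  return p == MAP_FAILED ? nullptr : p;
}
```

A caller would map the region once and mark it right away, for example `void *p = MapRegion(len); ExcludeFromCoredump(p, len);`. The pages stay readable and writable as before; only their inclusion in a coredump changes.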