Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Debugging] Print Python stack trace in addition to C++ stack trace, when Python worker crashes #19423

Merged
merged 3 commits into from
Oct 18, 2021

Conversation

mwtian
Copy link
Member

@mwtian mwtian commented Oct 16, 2021

Why are these changes needed?

Right now the failure signal handler registered in Python worker is skipped on crashes like segfault, because C++ core worker overrides the failure signal handler here and does not call the previously registered handler. This prevents Python stack trace from being printed on crashes. The fix is to make the C++ fault signal handler to call the previous signal handler registered in Python. For example with the script below which segfaults,

import ray
ray.init()

@ray.remote
def f():
    import ctypes;
    ctypes.string_at(0)

ray.get(f.remote())

Ray currently only prints the following stack trace:

(pid=26693) *** SIGSEGV received at time=1634418743 ***
(pid=26693) PC: @     0x7fff203d9552  (unknown)  _platform_strlen
(pid=26693) [2021-10-16 14:12:23,331 E 26693 12194577] logging.cc:313: *** SIGSEGV received at time=1634418743 ***
(pid=26693) [2021-10-16 14:12:23,331 E 26693 12194577] logging.cc:313: PC: @     0x7fff203d9552  (unknown)  _platform_strlen

With this change, Python stack trace will be printed in addition to the stack trace above:

(pid=26693) Fatal Python error: Segmentation fault
(pid=26693)
(pid=26693) Stack (most recent call first):
(pid=26693)   File "/Users/mwtian/opt/anaconda3/envs/ray/lib/python3.7/ctypes/__init__.py", line 505 in string_at
(pid=26693)   File "stack.py", line 7 in f
(pid=26693)   File "/Users/mwtian/work/ray-project/ray/python/ray/worker.py", line 425 in main_loop
(pid=26693)   File "/Users/mwtian/work/ray-project/ray/python/ray/workers/default_worker.py", line 212 in <module>

This should make debugging crashes in Python worker easier, for users and Ray devs.

Also, try to initialize symbolizer in GCS, Raylet and core worker. This is a no-op on MacOS and some Linux environments (e.g. Ray on Ubuntu 20.04 already produces symbolized stack traces), but should make Ray more likely to have symbolized stack traces on other platforms.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

//
// Also, chain the crash handler installed by the language worker, e.g. Python
// worker.
RayLog::InstallFailureSignalHandler(nullptr, /*chain=*/true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this work for C++/Java too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I believe this should work.

Java worker does not enable signal handler in the core worker (install_failure_signal_handler=false).

For C++ worker, I believe it is relying on the core worker here to install signal handlers. Initializing symbolizer should not break things. And chaining signal handler should be a no-op, since there is no previous handler.

/// to locate the object file containing debug symbols for ELF format executables. If
/// this is left as nullptr, symbolization can fail in some cases. More details in:
/// https://github.com/abseil/abseil-cpp/blob/master/absl/debugging/symbolize_elf.inc
/// \parem chain Whether to call the previous signal handler. See important caveats:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe call_previous_handler

Copy link
Contributor

@scv119 scv119 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

@scv119 scv119 merged commit 9742abb into master Oct 18, 2021
@scv119 scv119 deleted the test_wheels/mwtian-symbol branch October 18, 2021 16:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants