[Debugging] Print Python stack trace in addition to C++ stack trace, when Python worker crashes #19423
src/ray/core_worker/core_worker.cc
Outdated
//
// Also, chain the crash handler installed by the language worker, e.g. Python
// worker.
RayLog::InstallFailureSignalHandler(nullptr, /*chain=*/true);
will this work for C++/Java too?
Yes I believe this should work.
The Java worker does not enable the signal handler in the core worker (install_failure_signal_handler=false).
For the C++ worker, I believe it relies on the core worker here to install signal handlers. Initializing the symbolizer should not break things, and chaining the signal handler should be a no-op, since there is no previous handler.
src/ray/util/logging.h
Outdated
/// to locate the object file containing debug symbols for ELF format executables. If
/// this is left as nullptr, symbolization can fail in some cases. More details in:
/// https://github.com/abseil/abseil-cpp/blob/master/absl/debugging/symbolize_elf.inc
/// \param chain Whether to call the previous signal handler. See important caveats:
maybe call_previous_handler
Why are these changes needed?
Right now, the failure signal handler registered in the Python worker is skipped on crashes such as segfaults, because the C++ core worker overrides the failure signal handler here and does not call the previously registered handler. This prevents the Python stack trace from being printed on crashes. The fix is to make the C++ failure signal handler call the previous signal handler registered in Python. For example with the script below which segfaults,
Ray currently only prints the following stack trace:
With this change, the Python stack trace is printed in addition to the stack trace above:
This should make debugging crashes in the Python worker easier, for both users and Ray devs.
Also, try to initialize the symbolizer in GCS, Raylet and core worker. This is a no-op on macOS and in some Linux environments (e.g. Ray on Ubuntu 20.04 already produces symbolized stack traces), but should make Ray more likely to produce symbolized stack traces on other platforms.
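For reviewers less familiar with handler chaining, here is a minimal standalone sketch of the idea using plain POSIX `sigaction`. It is not the code in this PR: every name in it (`LanguageHandler`, `CoreHandler`, `InstallCoreHandler`, and so on) is hypothetical. It uses `SIGUSR1`/`SIGUSR2` rather than `SIGSEGV` so the signal can be raised safely. The second handler saves the previous disposition when it is installed and calls it afterwards if chaining is enabled. When no previous handler exists, chaining is a no-op.

```cpp
#include <cassert>
#include <cstring>
#include <signal.h>

// Hypothetical names throughout; this is an illustration, not Ray's implementation.
volatile sig_atomic_t g_language_handler_calls = 0;
volatile sig_atomic_t g_core_handler_calls = 0;

// Disposition that was in place before the "core" handler was installed.
static struct sigaction g_previous_action;
static bool g_chain = false;

// Stands in for the handler a language worker (e.g. Python) installs first.
void LanguageHandler(int /*signo*/) {
  g_language_handler_calls = g_language_handler_calls + 1;
}

// Stands in for the handler installed later, which may chain to the earlier one.
void CoreHandler(int signo, siginfo_t *info, void *context) {
  g_core_handler_calls = g_core_handler_calls + 1;
  if (!g_chain) {
    return;
  }
  if (g_previous_action.sa_flags & SA_SIGINFO) {
    if (g_previous_action.sa_sigaction != nullptr) {
      g_previous_action.sa_sigaction(signo, info, context);
    }
  } else if (g_previous_action.sa_handler != SIG_DFL &&
             g_previous_action.sa_handler != SIG_IGN) {
    g_previous_action.sa_handler(signo);
  }
  // Otherwise there is no previous user handler, so chaining does nothing.
}

void InstallLanguageHandler(int signo) {
  struct sigaction action;
  std::memset(&action, 0, sizeof(action));
  action.sa_handler = LanguageHandler;
  sigemptyset(&action.sa_mask);
  int rc = sigaction(signo, &action, nullptr);
  assert(rc == 0);
  (void)rc;
}

void InstallCoreHandler(int signo, bool chain) {
  struct sigaction action;
  std::memset(&action, 0, sizeof(action));
  action.sa_sigaction = CoreHandler;
  action.sa_flags = SA_SIGINFO;
  sigemptyset(&action.sa_mask);
  g_chain = chain;
  // The third argument receives the disposition being replaced.
  int rc = sigaction(signo, &action, &g_previous_action);
  assert(rc == 0);
  (void)rc;
}
```

Calling `InstallLanguageHandler(SIGUSR1)`, then `InstallCoreHandler(SIGUSR1, /*chain=*/true)`, then `raise(SIGUSR1)` runs both handlers once. A handler for a real crash signal would additionally need to restrict itself to async-signal-safe operations.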
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.