Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix stack overflow reporting with multiple PALs #107264

Merged

Conversation

janvorli
Copy link
Member

@janvorli janvorli commented Sep 2, 2024

A recently reported issue in the stack overflow test running with superpmi collect has revealed that there is a problem when multiple shared libraries using coreclr PAL are loaded into the process. Each such library registers signal handlers, including the SIGSEGV one. The unexpected thing in the SIGSEGV handler in this scenario is the fact that the GetCurrentPalThread() returns NULL in the non-coreclr shared libraries in case the library didn't invoke any function that would create the thread object.
That leads to the handler just writing "Stack overflow." to the console and then calling PROCAbort(). This abort causes the process to be torn down without reporting the stack overflow with full stack trace by the coreclr.

This change fixes it by calling the previous registered handler instead aborting in this case. That gives the coreclr SIGSEGV handler a chance to do the reporting as expected.

Close #107226

A recently reported issue in the stack overflow test running with
superpmi collect has revealed that there is a problem when multiple
shared libraries using coreclr PAL are loaded into the process.
Each such library registers signal handlers, including the SIGSEGV
one. The unexpected thing in the SIGSEGV handler in this scenario
is the fact that the GetCurrentPalThread() returns NULL in the
non-coreclr shared libraries in case the library didn't invoke
any function that would create the thread object.
That leads to the handler just writing "Stack overflow." to
the console and then calling PROCAbort(). This abort causes the process
to be torn down without reporting the stack overflow with full stack
trace by the coreclr.

This change fixes it by calling the previous registered handler instead
aborting in this case. That gives the coreclr SIGSEGV handler a chance to
do the reporting as expected.
@janvorli janvorli added this to the 10.0.0 milestone Sep 2, 2024
@janvorli janvorli requested a review from jkotas September 2, 2024 20:39
@janvorli janvorli self-assigned this Sep 2, 2024
@am11
Copy link
Member

am11 commented Sep 2, 2024

Thank you! Please execute /azp run runtime-coreclr superpmi-collect-test.

@janvorli
Copy link
Member Author

janvorli commented Sep 2, 2024

/azp run runtime-coreclr superpmi-collect-test

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jakobbotsch
Copy link
Member

#107277 should fix superpmi-collect-test.

@jakobbotsch
Copy link
Member

All runtime tests passed though, so seems like this change fixes the problem.

@am11
Copy link
Member

am11 commented Sep 3, 2024

Yes, it was failing on x64 as well, which is now green. 👍

@jkotas
Copy link
Member

jkotas commented Sep 4, 2024

cc @lambdageek Aleksey has been looking into removing superpmi dependency on PAL exception handling.

@janvorli janvorli closed this Sep 5, 2024
@janvorli janvorli reopened this Sep 5, 2024
@lambdageek
Copy link
Member

lambdageek commented Sep 5, 2024

cc @lambdageek Aleksey has been looking into removing superpmi dependency on PAL exception handling.

👍 My idea is to remove the CatchHardwareExceptionHandler mechanism from the PAL and add a sigsegv signal handler to superpmi. (And as a technical simplification instead of turning a SIGSEGV into a C++ exception that is then caught by superpmi to run their "finally" logic, it will instead wake a utility thread that will run the superpmi "finally" code).

Like the changes here, by SIGSEGV handler will chain to the previous one, if it is present, rather than abort

@janvorli janvorli merged commit e166eb6 into dotnet:main Sep 5, 2024
88 of 90 checks passed
@janvorli janvorli deleted the fix-stack-overflow-reporting-multiple-pals branch September 5, 2024 23:05
jtschuster pushed a commit to jtschuster/runtime that referenced this pull request Sep 17, 2024
* Fix stack overflow reporting with multiple PALs

A recently reported issue in the stack overflow test running with
superpmi collect has revealed that there is a problem when multiple
shared libraries using coreclr PAL are loaded into the process.
Each such library registers signal handlers, including the SIGSEGV
one. The unexpected thing in the SIGSEGV handler in this scenario
is the fact that the GetCurrentPalThread() returns NULL in the
non-coreclr shared libraries in case the library didn't invoke
any function that would create the thread object.
That leads to the handler just writing "Stack overflow." to
the console and then calling PROCAbort(). This abort causes the process
to be torn down without reporting the stack overflow with full stack
trace by the coreclr.

This change fixes it by calling the previous registered handler instead
aborting in this case. That gives the coreclr SIGSEGV handler a chance to
do the reporting as expected.

* Reflect PR feedback - move comment
sirntar pushed a commit to sirntar/runtime that referenced this pull request Sep 30, 2024
* Fix stack overflow reporting with multiple PALs

A recently reported issue in the stack overflow test running with
superpmi collect has revealed that there is a problem when multiple
shared libraries using coreclr PAL are loaded into the process.
Each such library registers signal handlers, including the SIGSEGV
one. The unexpected thing in the SIGSEGV handler in this scenario
is the fact that the GetCurrentPalThread() returns NULL in the
non-coreclr shared libraries in case the library didn't invoke
any function that would create the thread object.
That leads to the handler just writing "Stack overflow." to
the console and then calling PROCAbort(). This abort causes the process
to be torn down without reporting the stack overflow with full stack
trace by the coreclr.

This change fixes it by calling the previous registered handler instead
aborting in this case. That gives the coreclr SIGSEGV handler a chance to
do the reporting as expected.

* Reflect PR feedback - move comment
@github-actions github-actions bot locked and limited conversation to collaborators Oct 6, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Secondary frame stackoverflow test failure on linux
5 participants