runtime: unexpected return pc for runtime.sigpanic called from 0x1 #43496
Maybe related: #41099
Going through the stack trace multiple times, the additional thing that stuck out was:
And maybe:
My best guess is that a bad BP in asm code can cause traceback to crash, based on seeing something similar in minio/sha256-simd#55.
/cc aclements @randall77 @mknyszek @prattmic |
Traceback doesn't use BP currently, so that isn't it. (Using BP for traceback is #16638.) That minio bug involves playing with SP, not BP, which could cause problems. (Our traceback code can't handle variable-sized frames.) goroutine 8278 does look like frames are missing, as it starts with a
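To make the SP concern concrete, here is a hypothetical Plan 9 assembly sketch in the style of the minio code; the function name, frame layout, and constants are invented for illustration and are not the actual minio source:

```asm
#include "textflag.h"

// func hashBlocks(p []byte, n int)
// Declares a zero-size frame, then moves SP by an alignment-dependent amount anyway.
TEXT ·hashBlocks(SB), NOSPLIT, $0-32
	MOVQ SP, R12     // stash the incoming SP
	SUBQ $512, SP    // carve out scratch space below the declared frame
	ANDQ $-64, SP    // align SP down to 64 bytes; the SP delta now varies per call
	// ... do the real work using the scratch area ...
	MOVQ R12, SP     // restore SP before returning
	RET
```

If a profiling signal lands while SP is adjusted like this, gentraceback sees a frame whose actual size doesn't match the function's metadata and could compute a garbage return PC, which would surface as an "unexpected return pc" crash like the one reported here.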
Since sighandler/preparePanic do some manipulation of the stack to make it look like there was a direct call to sigpanic(), is it possible that this is some extremely rare race in the sighandler or preparePanic? From the list of stack traces, we can't definitively pinpoint where the original panic happened. But during panic handling, open-coded defers do need to look at the stack trace of the panicking goroutine, and hence will cause a problem if the backtrace has an issue. For reference, here is the source code for errgroup.(*Group).Go(). It seems possible that goroutine 8278 (the one that started at line 54 below) is the one that panicked and has a corrupted stack, and hence caused the secondary panic in addOneOpenDeferFrame/gentraceback. It's not clear to me whether the original panic or problem is defer-related, or whether the defer processing is just highlighting the bad backtrace.
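For reference, errgroup.(*Group).Go in golang.org/x/sync of that era looked roughly like this (reproduced from memory, so treat it as an approximate sketch rather than the exact revision under discussion):

```go
package errgroup

import "sync"

// A Group is a collection of goroutines working on subtasks that are part of
// the same overall task.
type Group struct {
	cancel func()

	wg sync.WaitGroup

	errOnce sync.Once
	err     error
}

// Go calls the given function in a new goroutine.
// The first call to return a non-nil error cancels the group; its error will
// be returned by Wait.
func (g *Group) Go(f func() error) {
	g.wg.Add(1)

	go func() {
		// The deferred Done runs during panic handling, which scans this
		// goroutine's stack for open-coded defer frames
		// (addOneOpenDeferFrame/gentraceback).
		defer g.wg.Done()

		if err := f(); err != nil {
			g.errOnce.Do(func() {
				g.err = err
				if g.cancel != nil {
					g.cancel()
				}
			})
		}
	}()
}
```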
Side note: is there, or should there be, a linter that verifies that SP is not being used like that?
@egonelbre Yes, that would make sense. We could also be more conservative with tracebacks during profiling interrupts, as that's the only case where we would try to traceback assembly like that.
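No such vet check exists that I know of; as a rough standalone sketch of the idea (the instruction list and heuristic are invented, and it would also flag legitimate SP saves/restores), a linter could simply report explicit writes to SP in hand-written .s files:

```go
// spcheck: naive scan of Go assembly files for instructions that write to SP
// outside of the frame the assembler manages via the TEXT directive.
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

// Matches amd64 instructions whose destination operand is SP, e.g. "SUBQ $512, SP".
var spWrite = regexp.MustCompile(`^\s*(ADDQ|SUBQ|ANDQ|ORQ|MOVQ|LEAQ)\s+.*,\s*SP\s*(//.*)?$`)

func main() {
	for _, path := range os.Args[1:] {
		f, err := os.Open(path)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		sc := bufio.NewScanner(f)
		for line := 1; sc.Scan(); line++ {
			if spWrite.MatchString(sc.Text()) {
				fmt.Printf("%s:%d: writes to SP: %s\n", path, line, sc.Text())
			}
		}
		f.Close()
	}
}
```

Running it over a repository's .s files would at least surface candidates for manual review, even if a real check would need to understand the TEXT frame-size declaration to avoid false positives.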
Is this possibly the same cause as #43942?
I have what appears at the moment to be a reproducer for this, sort of, but it's not an isolated reproducer, and it's also extremely unstable. macOS amd64 host, running
Symptoms: First, all of yesterday, any attempt to run
I don't know exactly what's happening, but I do know that this appears to happen at the same point in the test run every time (the previous message is the same, but it's somewhere during a test which might take quite a few seconds, so I don't know exactly where). The specific test it happens in completes without issue, though, if I run it alone; it's only with the other tests before it that it blows up. Also, there are about 1k goroutines (this is perhaps not surprising for the tests in question). In one of the dumps, four goroutines were all in the same function,
These calls are occurring at roughly the same point in several goroutines, but there's nothing about them that makes me think they should be taking more than a microsecond or so, and yet sometimes I see several goroutines stuck in this, or on a line which looks like
I don't know what's going on, but it's definitely something.
Do you still have the reproducer? If so, does the issue still appear when building with a Go release >= 1.17.4? If not, this might have been fixed by #49729. |
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

Unable to reproduce.

What operating system and processor architecture are you using (go env)?

It's using the golang:1.15.6 docker image as the base for CI testing.

What did you do?

We have automated tests running for changes and one of them failed with a crash. The race-detector is enabled for the tests. The relevant part of the crash seems to be: