cockroachdb crashed in Go runtime after reporting EINVAL from port_getn #1130
Where's the test program?
Stacks:
I didn't look at that very closely, but enough to not see it obviously spinning in a loop or blocked on a specific operation. [narrator: maybe could have looked more closely.] It looks blocked. Is that right? Let's see what it's doing over time:
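The output isn't preserved in this copy of the comment. For reference, here's a DTrace sketch (not the command actually used) that samples the process's user stacks while it's on CPU; a mostly-blocked process produces very few samples:

```
/*
 * Sketch only: sample the target process's user stacks at ~97Hz while it's
 * on CPU.  Run as: dtrace -s oncpu.d -p <pid>
 */
profile-97
/pid == $target/
{
    @[ustack()] = count();
}
```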
Yeah, it's not on-CPU much. I figured at this point I'd save a core file. It's been running for 5 minutes:
Maybe we can infer something about what it's doing from the syscalls it's making?
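The syscall trace itself isn't preserved here. A sketch of one way to get a per-syscall count with DTrace (truss -c -p <pid> gives similar information):

```
/*
 * Sketch only: count the target process's system calls by name.
 * Run as: dtrace -s syscalls.d -p <pid>
 */
syscall:::entry
/pid == $target/
{
    @[probefunc] = count();
}
```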
It seems to be trying to connect to something. At this point I realized we should still have the log file! With luck, it might be the only one we're currently writing to, so it would be the most recently modified one:
Perfect. How about it:
We seem to be having trouble connecting to the database. (And you'd see that in the pstack output if you looked more closely than I did.) Can we connect to the database now? The URL is in the log above, near the top.
Strange. It doesn't seem to be running:
CockroachDB's temporary directory is also in the log output. What's in there?
Anything good in the output files?
No obvious sign of a crash. I took a look at the log file with
Yeah, nothing's even got the log file open, and its last entry was when it started, many minutes ago:
(more coming -- have to break up this GitHub comment)
At this point I took a look at that
What was that? I don't know any way to get information about defunct processes except through the kernel debugger. I believe these are processes that have exited but not been waited on. The kernel throws away a lot of the state, I think. The userland tools don't say much:
but the kernel debugger can help!
Aha! That was CockroachDB. What happened to it? I wonder if we can find anything in its
I did a similar search for p_wcode, found this assignment, looked for CLD_CONTINUED, and found these definitions. Our value is 1, which is CLD_EXITED (the process exited rather than being killed by a signal). At this point I went further into the database directory looking for any recently-written files.
What's in that most recently written log file?
and I think we have our culprit. I'll have to attach the file separately because it's too big to put in a GitHub comment. But the relevant line is:
What's errno 22?
Yikes. That sounds like a bad misuse of event ports but I don't know how to tell more given what's here. Had CockroachDB aborted, we'd have a core file, and we'd know what fd 4 was. We'd also potentially know what it had passed to port_getn. So what's the summary of problems here?
It's conceivable that this problem is related to the 30-second CockroachDB startup timeouts that we've occasionally seen in CI, but I don't think we've seen those on illumos systems and this problem is very likely illumos-specific (since it's an illumos-specific interface that it's invoking and getting EINVAL).
Full log: cockroach-stderr-log.gz
Full CockroachDB directory, a tarball inside a ZIP file because GitHub doesn't support
I used the command-line that we logged in the file to start CockroachDB again by hand:
then saved a "debug zip" so we can report this to CockroachDB:
That's attached here: cockroachdb-zip.zip. Just for fun, I tweaked the CockroachDB command-line so that it would listen on the listen-url that was in our test's log file and started it up. The end of the log file currently includes:
I hope that if I wait about 50 minutes, the test will complete successfully.
Maybe related? golang/go#45643 CC @jclulow
Following up on my previous-previous comment: after starting CockroachDB on the right port, that test did eventually uncork and complete successfully! It happened around 20:16:57, earlier than I'd hoped, because we were only blocked on the populate step, not the saga list. That checks out -- that's when the populate step was scheduled to be retried.
Last week, we dug in a little bit and found:
Based on the Go code, I don't see how the first invocation of port_getn could have failed this way. There are a few possibilities here:
There are a couple of angles here:
I've been running this DTrace script to try to collect more data if this happens again (edit: the previous version of this script had a bug in the execname predicate and I've fixed it below):
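The script itself isn't preserved in this copy of the comment. For reference, here is a rough sketch of the shape such a script might take (not the original): it assumes port_getn is serviced by the portfs syscall on illumos, and it's cruder in that it catches any EINVAL coming back from portfs rather than checking the PORT_GETN opcode specifically.

```
#pragma D option destructive

/*
 * Sketch only: if a cockroach process gets EINVAL back from the event-port
 * syscall (port_getn and friends are multiplexed through portfs), print the
 * user stack and stop the process so it can be inspected (pstack, gcore).
 */
syscall::portfs:return
/execname == "cockroach" && errno == EINVAL/
{
    printf("pid %d got EINVAL from portfs\n", pid);
    ustack();
    stop();
    exit(0);
}
```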
The intent is that if port_getn ever returns EINVAL again, the script will catch it, stop the process, and save enough state for us to dig in.
With the above D script running, I've been trying to reproduce this with:
which I run as:
So far, I've only caught #1144 instead.
I successfully reproduced this issue after 1328 iterations, but there was a bug in the predicate of my D script that caused it never to fire. I've fixed that and am trying again.
Okay, I managed to reproduce this with the right D enabling:
The D script exited and emitted:
This confirms that the kernel really did return EINVAL from port_getn. The D script stopped the process, and I saved the arguments it reported along with a core file. In terms of the core file, here are the registers of the stopped thread:
From the kernel implementation of port_getn, we can map these register values onto its arguments:
So everything looks right except for the timespec, and given the timespec's current value, the EINVAL is expected. Here's the stack:
We appear to have useful frame pointers, fortunately. Note that the timespec address is on the stack inside the "netpoll" frame, which makes sense. From the Go code, we'd expect the stack to include the actual, zero'd timespec; the pointer to it; and an array of 128 port_events. I'd expect:
In fact:
It looks like "wait" got put right after "ts" on the stack, which is surprising but fine. That explains what's in between. It looks from the disassembly like we initialize the timespec to zero right there (using %xmm0). It sure looks as though something has just scribbled on this thread's stack. It doesn't look like there are that many instructions between where we initialize the timespec and where we call into port_getn. Well, what are those values (0xc001c97500 and 0xc000240000)? They look like addresses, and they are mapped:
pmap says it's an anonymous region:
That region is in a weirdly different spot than the rest:
which makes me wonder if it was mapped with MAP_FIXED. A quick search in the Golang source suggests that this region is used by the Go memory allocator. So what's at those addresses? That might give a clue about who clobbered the stack.
That's not super conclusive but it looks worryingly like CockroachDB itself or Golang did that, maybe part of Pebble (part of CockroachDB)? One idea we had is to trace
I filed cockroachdb/cockroach#82958 for this issue to get input from the CockroachDB team.
Following up on my previous comment, I wrote this D script to try to blow open the window for possible corruption:
This uses the
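The script itself isn't preserved in this copy. Assuming it relies on DTrace's chill() action to stall the thread in the kernel once it has entered the event-port syscall, stretching the window between userland initializing port_getn's arguments and the kernel reading them, a sketch might look like:

```
#pragma D option destructive

/*
 * Sketch only: when a cockroach thread enters the event-port syscall, spin
 * in the kernel for about 1ms (chill() takes nanoseconds).  This widens the
 * window during which a concurrent scribbler could corrupt the timespec.
 * Note that DTrace limits how much total chill() time it will allow.
 */
syscall::portfs:entry
/execname == "cockroach"/
{
    chill(1000000);
}
```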
The discussion on the cockroach issue got me thinking I should try LD_PRELOADing libumem, which can identify various kinds of corruption as it happens and has other facilities for debugging it. It didn't work to just LD_PRELOAD it with the
and that appears to work. I tested this by using this DTrace invocation to stop a cockroach process shortly after it started:
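The invocation itself isn't preserved here; a sketch of one way to do it (not necessarily what was actually run), using the proc provider's exec-success probe and the destructive stop() action:

```
#pragma D option destructive

/*
 * Sketch only: stop any new cockroach process as soon as its exec()
 * completes, so its address space can be examined (pldd, pmap, mdb -p)
 * before resuming it with prun(1).  Run as: dtrace -s stop-cockroach.d
 */
proc:::exec-success
/execname == "cockroach"/
{
    printf("stopped pid %d\n", pid);
    stop();
    exit(0);
}
```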
Then I ran the same check against both binaries. For the original binary, I got:
For my new binary, I got:
which at least means something was allocated with libumem!
I put a bunch of the relevant files on catacomb:
I also put some D scripts and other data from the June 1-2 investigation there:
Found a few more. I regret that this isn't better organized, but I preserved the mtimes for correlation with the comments above:
See #1223 (comment). With a workaround for https://www.illumos.org/issues/15254 in place, the test suite has been running for almost three days without issue. Also, this was seen on Go 1.16; we're now on Go 1.19. I can't quite see how that illumos bug could cause this issue, but I'm not sure it's worth spending more active time on this?

Okay, I did take a swing at writing a Rust program to try to identify this problem from a dump. On a dump from this core file, it produces:
and I've confirmed that these do indeed look like runs consistent with that problem. They're all in the Go heap. I'm still not sure how we get to this root cause. The closest thing of interest is that a bunch of the stack frames are around c002702598, but that's a few megabytes away.
Since the memory that's supposed to be zero'd in this case was written using %xmm0 after that register was zero'd, it is conceivable that this problem was caused by illumos#15367. However, we'd have to get quite unlucky: it'd have to be the case that the rest of the xmm/ymm/zmm registers were all zeroes so that clearing xmm0 put the FPU into the default state; then we'd have to take a signal on that very next instruction.
Closing this for the reasons mentioned above. We can reopen if we start seeing it again. |
While testing #1129:
I don't see how this could be related to #1129. More data coming.