crash in Go runtime after port_getn returned EINVAL #82958
Hello, I am Blathers. I am here to help you get the issue triaged. It looks like you have not filled out the issue in the format of any of our templates. To best assist you, we advise you to use one of these templates. I have CC'd a few people who may be able to assist you:
If we have not gotten back to your issue within a few business days, you can try the following:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

cc @cockroachdb/replication
Thanks for this lovely report, @davepacheco. I'm going to move this from KV-repl into Storage, as this looks to be a Pebble thing - which would make more sense as there's a fair amount of manual memory management happening down there (edit: but certainly not anything jumping to mind that would explain scribbling over another thread's stack 😱).
@davepacheco - what might help narrow this down a bit on our side, if we suspect Pebble (at least from the stacks you provided and the fact that we are doing some memory management in that library), would be to run our Pebble metamorphic test suite (think of it like our fuzzer for the storage engine). If you check out Pebble, you can run it with:

```
$ go test -mod=vendor -tags invariants -v -run TestMeta$ ./internal/metamorphic -seed=123 -keep
```

In the background I'm going to try and spin up an illumos VM somewhere and do the same. Hopefully the kernel version, C compiler, and Go version you provided are enough to reproduce.
Doing a re-read of the original issue now (I should strive to do that more often), the goroutine that puked wasn't in Pebble-land - it was still up in CockroachDB, and looks like it was busy getting some more stack somewhere in here. However, I was seeing Pebble close to the memory address you mentioned. Pebble may be a red herring, though I would be interested in seeing if you can get the metamorphic tests running on your distribution and whether anything shakes out there. We've had a lot of luck internally finding obscure bugs with them.
On this front, I was able to find an OmniOS distribution I could run, but I don't think that's what I want. If it's possible to (easily) build and run the OS y'all are using (helios?), we can keep poking on our side. Otherwise, we might have to leave this up to the experts to debug on your end, with some input on our side (if it looks like it's a Cockroach thing).
👋
I did this on a physical OmniOS machine:
and saw:
It is :) Both OmniOS and Helios use illumos.
This patch gets it working:

```diff
diff --git a/vfs/disk_usage_solaris.go b/vfs/disk_usage_solaris.go
new file mode 100644
index 00000000..30da621b
--- /dev/null
+++ b/vfs/disk_usage_solaris.go
@@ -0,0 +1,25 @@
+// Copyright 2020 The LevelDB-Go and Pebble Authors. All rights reserved. Use
+// of this source code is governed by a BSD-style license that can be found in
+// the LICENSE file.
+
+// +build solaris
+
+package vfs
+
+import "golang.org/x/sys/unix"
+
+func (defaultFS) GetDiskUsage(path string) (DiskUsage, error) {
+	stat := unix.Statvfs_t{}
+	if err := unix.Statvfs(path, &stat); err != nil {
+		return DiskUsage{}, err
+	}
+
+	freeBytes := uint64(stat.Bsize) * uint64(stat.Bfree)
+	availBytes := uint64(stat.Bsize) * uint64(stat.Bavail)
+	totalBytes := uint64(stat.Bsize) * uint64(stat.Blocks)
+	return DiskUsage{
+		AvailBytes: availBytes,
+		TotalBytes: totalBytes,
+		UsedBytes:  totalBytes - freeBytes,
+	}, nil
+}
diff --git a/vfs/errors_unix.go b/vfs/errors_unix.go
index 31b4dc74..bbc4ebc2 100644
--- a/vfs/errors_unix.go
+++ b/vfs/errors_unix.go
@@ -2,7 +2,7 @@
 // of this source code is governed by a BSD-style license that can be found in
 // the LICENSE file.
 
-// +build darwin dragonfly freebsd linux openbsd
+// +build darwin dragonfly freebsd linux openbsd solaris
 
 package vfs
```
Neat. Thanks. Wasn't sure if distros would be wildly different. Will keep poking with your patch. 👍
Update: this loop has been going strong for hours; I'll keep running it for now.

```sh
while :; do
    go test -mod=vendor -tags invariants -v -run TestMeta$ ./internal/metamorphic -keep --ops 10000
    sleep 1
done
```
Q: is this running with the jemalloc custom allocator, or the base Go one?
Thanks @jmpesp! I've also had a test run going on a VM for ~12 hours without issue. I will note that I built for

I've asked around internally about this particular issue. Hoping that it will pique the interest of some more folks.
My question above refers to the fact that by default CockroachDB integrates jemalloc (see
Another thing worth looking into: have you tried building CockroachDB with the latest Go 1.18 instead? There are a couple of changes in the runtime system that this could pick up.
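For what it's worth, a low-friction way to try a newer toolchain alongside the existing one (a sketch; `go1.18.3` is just an example version from around this time):

```sh
# Install a versioned toolchain wrapper next to the existing 'go'.
go install golang.org/dl/go1.18.3@latest
go1.18.3 download
go1.18.3 version
# Then rebuild cockroach (or rerun the metamorphic tests) with go1.18.3.
```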
What you see is what I'm running; I'm not sure what is selected.
Can I do this for the metamorphic test?
My go is |
Presumably you know how you built the
it would be moot if you knew already that jemalloc is not being used. But I'd be interested to see what behavior you observe on the entire
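One way to check from the binary itself, rather than the build process (a sketch; the `je_` prefix is an assumption about how the vendored jemalloc names its symbols, and `nm` only helps if the binary isn't stripped):

```sh
# Jemalloc symbols in the binary suggest the custom allocator is linked in.
nm cockroach | grep je_ | head

# Print the module/build info the Go toolchain embeds in the binary.
go version -m cockroach
```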
Interesting. For the sake of the experiment, do you get different results when building with Go 1.17.10?
I responded to the question for what I was doing (Pebble's metamorphic test), not the original issue (the port_getn-related crash) - sorry for the confusion. The golang issue I'm looking at linked to this one, and I came here hoping for something related that would help with the debugging effort. I'll bow out :)
Thanks @nicktrav for digging in here! Does @jmpesp's data (that the metamorphic test ran without issue for 12 hours) help?
Can you tell how we would know, either from the binary or the build process? (It seems like this would be a good addition to the
The other reason I think jemalloc is being used is that I tried to LD_PRELOAD libumem.so (which implements malloc(3c) and friends and has good facilities for identifying and debugging corruption), and I ran

Does a stdmalloc build cause CockroachDB to use malloc(3c) directly? If so, I will probably try that so we can see if libumem can shed some light here.

I have another data point that could be related, but I'm not sure: I just ran into another SIGSEGV: oxidecomputer/omicron#1223. This one looks more obviously inside CockroachDB. I've saved the entire CockroachDB data directory (attached to that ticket), including the full stderr capture. Let me know if there's more you'd like from here. Unfortunately, since Go just exits on SIGSEGV, I don't have a core file.

I also wanted to mention that the process that triggered this initial report remains stopped on my system at the same point that I mentioned above. I have a core file, a list of open files, arguments, environment, etc. If there's anything else you'd like from the running process, let me know. Otherwise I may kill it soon.
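For context, a typical libumem session on illumos looks roughly like this (a sketch; it only helps if the process actually routes allocations through malloc(3c), which is exactly the open question when jemalloc is linked in):

```sh
# Preload libumem with its debugging allocator so corruption is caught early.
LD_PRELOAD=libumem.so UMEM_DEBUG=default ./cockroach start-single-node ...

# After a crash, audit the umem caches in the resulting core file with mdb.
mdb core
> ::umem_verify
```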
Yes please.
I replied on that ticket.
Would it be possible to emit a stack dump for all the threads in the process?
Also, let me encourage you to build with Go 1.17 or 1.18 instead of 1.16.
It's definitely a useful signal - thank you for your help there. I was also able to do the same, without issue. There are a couple of caveats here in that I'm not sure how similar the environments / binaries are - I tried my best to align the Go runtime, Pebble version, and C compiler, but I was running on OmniOS, so there are likely differences in the kernel.

The other thing is that it looks like the issue isn't actually in Pebble itself - it's in adjacent code that has probably called into Pebble (or is just about to), but it's panicking above Pebble - so it may not be as interesting that the metamorphic tests aren't picking anything up, as we're not exercising the exact code paths. That said, we're certainly exercising a lot of the manual memory management code paths, without issue.

As a side note, it probably makes sense for us to build and test Pebble on Solaris / illumos. I think that's tangential to this issue though, but we'll see what we can do.
We have marked this issue as stale because it has been inactive for |
We haven't seen this problem in a while, though we have no reason to believe it's fixed. |
Background: we've got a test suite that spins up single-node CockroachDB clusters many times during each run. We're tracking a few cases where CockroachDB seems to crash during startup. I'm filing this issue for oxidecomputer/omicron#1130 because it looks kind of like memory corruption and I wanted y'all's input. But I wanted to mention that we also saw oxidecomputer/omicron#1144 and oxidecomputer/omicron#1146. I don't know if they're related. We also filed golang/go#53289 because that one blows up explicitly inside Go.
For this problem, the failure mode is that CockroachDB prints this to stderr:
and then exits.
For context, `port_getn` is a libc function on Solaris and illumos systems that's analogous to the `poll`/`epoll`/`kqueue` family of APIs.

Some more data:
- `cockroach debug zip`
We're using:
on helios-1.0.21004 (an illumos distribution).
This is reproducible but not easily. It takes several hours and often hits some other bug instead (that's how we found the ones I mentioned above).
Now, so far this looks like either an OS or Go runtime issue, but we've got reason to suspect memory corruption and wanted to raise this with you all. Go is clearly not expecting to get EINVAL from `port_getn`. With DTrace, I confirmed that the kernel really is returning EINVAL. I grabbed a core file at that moment and inspected the arguments being passed to the syscall. Everything looks correct except the `struct timespec` that Go is passing into the kernel, which is:

Based on reading the Go code, I expected this struct to be zeroed. Since `tv_nsec` is outside the range `[0, 1e9)`, it makes sense that we'd get EINVAL. From the Go code, I don't see how these values could be there.

Here's the stack trace:
It looks like something has scribbled over the thread's stack where this `struct timespec` is supposed to be. Those values (0xc001c97500 and 0xc000240000) look like addresses, and they appear to be coming from the Go memory allocator. Here's the first 20 words at each of those addresses:

It's this second one that makes me worried that something inside CockroachDB scribbled over the stack. There's more detail in oxidecomputer/omicron#1130, and more detailed notes about how I came to these conclusions in this comment.
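To make the EINVAL concrete: both of the observed values are far outside the valid `tv_nsec` range. A tiny standalone check (the two constants are the values quoted above; which slot each landed in is my assumption):

```go
package main

import "fmt"

func main() {
	// The two words found in the corrupted timespec (quoted above).
	// Which one landed in tv_sec vs. tv_nsec is an assumption here;
	// either way, both far exceed the legal tv_nsec bound.
	tvSec := uint64(0xc001c97500)
	tvNsec := uint64(0xc000240000)

	// port_getn requires 0 <= tv_nsec < 1e9, else it fails with EINVAL.
	const nsecPerSec = 1_000_000_000
	fmt.Printf("tv_sec  = %d\n", tvSec)
	fmt.Printf("tv_nsec = %d, in [0, 1e9): %t\n", tvNsec, tvNsec < nsecPerSec)
}
```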
I'd be interested to know if this rings a bell for any of you or if you have thoughts on any of the data here!
Jira issue: CRDB-16755