cockroachdb hit SIGSEGV during test suite #1223
While trying to reproduce #1130, I ran into a different SIGSEGV in CockroachDB:
See also #1144, but this one's actually inside CockroachDB.
I'm attaching issue-1223.tgz.zip, which contains the whole test-run temp directory: our test suite log file as well as the CockroachDB data directory (which includes the stderr file above).
This is not in CockroachDB directly. This is in the Go runtime, when assigning to a (primitive) map type. I would recommend you attempt building with Go 1.17 or 1.18; there were improvements in that area.
@knz I'm working on getting a build of Cockroach using Go 1.18. As I think you've seen, we've hit a half dozen or so different memory-related issues (including SIGSEGV, stack corruption, and panics in the Go memory allocator). We don't know if they're related. For most of them (including this one), we've only seen them once despite many iterations. So if we don't see this with Go 1.18, that won't really tell us whether the problem was somehow fixed.
I seem to have hit the same issue. Working on #1474, I ran a full
I'm attaching the whole directory here as well.
This last stack trace is also with Go 1.16. Let's look at what 1.17 or 1.18 has for us.
(fyi there's a PR open on the CRDB repo containing the tweaks to make it buildable with the 1.18 compiler. You can perhaps use that.)
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
The next step here seems to be updating Omicron to the latest CockroachDB, which uses Go 1.17. This is being done under oxidecomputer/garbage-compactor#5.
Okay, I've finally got an up-to-date CockroachDB build with an up-to-date Go: v22.1.9 (released just this week) on Go 1.17.13, which is the latest release in the 1.17 series and what Cockroach v22.1.9 expects. I ran this loop:
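A rough sketch of the shape of that loop (the command and paths here are only placeholders, not the actual invocation):

```bash
#!/bin/bash
# Placeholder repro loop: run the same command over and over, printing the
# iteration number so a failure can be reported as "iteration N".
set -u
i=0
while :; do
    i=$((i + 1))
    echo "=== iteration $i: $(date -u '+%Y-%m-%dT%H:%M:%SZ') ==="
    # Stand-in for the real repro command.
    if ! ./target/debug/some-test-binary > "iter-$i.log" 2>&1; then
        echo "FAILED on iteration $i (see iter-$i.log)" >&2
        break
    fi
done
```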
This happened on iteration 7 -- basically right away. Trying to repro, I ran it again, and it's been running several minutes without incident. Separately, I ran the Omicron test suite twice. Each run of the test suite runs dozens of tests that each run
For some reason, I only have stderr from the last of those failures. Following the log mentioned above:
As I wrote this up, I started a second bash loop to repro with just
@knz these feel pretty similar to the problems we were seeing before, don't they? They don't seem fixed in Go 1.17.
Thanks. Can you compare the results of your CRDB build with
@knz This should be a
By comparison, here's one that was not built this way:
(edit sorry jnz for the stray mention)
Here's a different failure mode, also from the
Also noteworthy: I updated the loop to use libumem and still hit this failure mode (without libumem reporting any problem):
I also saw this when run with UMEM_DEBUG=default (using
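For context, interposing libumem on illumos looks roughly like this (a sketch; the cockroach flags and store path are illustrative):

```bash
# Run cockroach under libumem's debugging allocator. Note that the Go runtime
# gets most of its memory directly from mmap, so libumem mainly audits
# allocations made through libc malloc (e.g., from C code), not the Go heap.
export LD_PRELOAD=libumem.so.1
export UMEM_DEBUG=default        # audit, contents, and guard checking
export UMEM_LOGGING=transaction  # keeps a transaction log inspectable via mdb's ::umem_log
cockroach start-single-node --insecure --store=/tmp/crdb-umem-test
```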
Folks, I don't know what to say. I really don't want to point fingers, but the likelihood of a bug in CockroachDB is low. What I think is happening is that your HW/OS/compiler combination is creating an environment with nonstandard memory consistency properties, and that throws a wrench into Go's runtime semantics. Let's test this theory: is there a way for y'all to configure your OS to run processes on a single processor? Like forcing crdb to be pinned on one core, to serialize all memory accesses. What would that give?
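(For reference, a process can be pinned to a single CPU on illumos with pbind(1M) or a one-CPU processor set; a sketch, assuming processor 0 is online:)

```bash
# Bind all LWPs of a running cockroach process to processor 0, so every
# thread is serialized onto a single CPU (12345 is a placeholder pid).
pbind -b 0 12345

# Alternatively, create a one-CPU processor set and run the repro inside it.
psrset -c 0                  # creates a set containing processor 0 and prints its id
psrset -e 1 ./run-repro.sh   # '1' is that set id; the script name is a placeholder
```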
@davepacheco I'd also be interested in seeing what libumem finds if you bypass the Go allocator and turn off GC:
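(One way to get at that with stock Go runtime knobs, as a sketch rather than necessarily the exact flags intended above:)

```bash
# GOGC=off disables the garbage collector; GODEBUG=sbrk=1 replaces Go's
# allocator and GC with a trivial allocator that takes memory from the OS
# and never frees it. The store path is illustrative.
GOGC=off GODEBUG=sbrk=1 \
    cockroach start-single-node --insecure --store=/tmp/crdb-nogc
```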
@knz No problem. I don't think we have evidence here to indict or exonerate the hardware, operating system, compiler, or a combination of those yet. I've used all these things quite a lot without issues like this one. I have not used much Go on this system, and it's possible there's an issue there or with the combination of Go with those things (although still no evidence here points to a platform-specific issue -- it's not like similar issues haven't been seen with Golang before). I briefly played with trying to reproduce this on a single processor, but I'm not sure that's useful. The problem did not reproduce right away, but the test suite takes about 6x as long to run to begin with. Plus, if the problem is related to concurrency (even if it were a platform-independent bug in Go or CockroachDB), it may not reproduce with one CPU. On @jmpesp's suggestion above, I tried a few different GODEBUG runtime variables. In these tests, I ran with
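(As a sketch, memory-debugging GODEBUG settings of this kind look like the following; this is not necessarily the exact combination used in these runs:)

```bash
# Illustrative GODEBUG knobs for chasing memory corruption:
#   clobberfree=1  -- the GC overwrites freed objects with junk, so stale
#                     pointers into freed memory fail loudly
#   efence=1       -- each object gets its own page and addresses are never
#                     recycled, at a large cost in memory
export GODEBUG=clobberfree=1,efence=1
./run-test-suite.sh              # placeholder for the actual test invocation
```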
I also ran the Go test suite on 1.17.13 (that is, ran
After updating to Helios helios-1.0.21316, two more failure modes here: #1876, and this different error from the memory allocator:
Another run of the above test suite, another failure mode. (I probably need a better way to catalog these, but for now I just don't want to lose them.)
This one looks like golang/go#53289 (comment).
I wanted to rule out some quirk of this particular machine so I ran this repro script on atrium:
Many of the failure modes described above seem readily explainable by the issue described under #1146. See this comment in particular. I suspect the SIGSEGVs are also possible from that bug, both from first principles (supposedly-zeroed memory that's not actually zeroed could easily cause something to try to dereference it, even though it may be garbage) and because a similar underlying issue on Linux also resulted in segfaults. Without core files, I don't think we can confirm or eliminate this explanation for the failures here. I think I'll suggest we set GOTRACEBACK=crash in the environment whenever we run
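Concretely, on illumos that would look something like the following (a sketch; the core-file path is illustrative):

```bash
# GOTRACEBACK=crash makes the Go runtime abort on a fatal signal instead of
# just printing a traceback and exiting, which lets the OS write a core file.
export GOTRACEBACK=crash

# Send global core dumps somewhere durable (requires privileges) and make
# sure per-process cores aren't limited away.
coreadm -g /var/cores/core.%f.%p -e global
ulimit -c unlimited
```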
From Omicron as of 6fa27a3, I have been running the test suite in a loop using a binary of Cockroach v22.1.9 that's been patched as described in #1146. This has run successfully 557 times (just over 2d 21h). I'm running it with:
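Roughly this shape (a sketch; the install path and test command are placeholders, the patched binary just needs to come first on PATH):

```bash
# Placeholder loop: keep the patched cockroach first on PATH and count how
# many consecutive runs of the test suite pass.
export PATH=/opt/patched-crdb/bin:$PATH   # illustrative install location
n=0
while ./run-test-suite.sh; do             # placeholder for the actual test command
    n=$((n + 1))
    echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') completed $n successful runs"
done
echo "failed after $n successful runs" >&2
```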
That seems pretty good. As I mentioned above, the only other thing I can think to do is reproduce this without the bug fixed, but in all my work on these issues I've only seen 2-3 SIGSEGVs that I remember (compared with a few dozen of the other failure modes), so that doesn't seem worthwhile. Let's make sure we enable core dumps and plan to reopen this if we see it again organically.
I'm on the fence about whether it's worth leaving this open or closed given the current state. (I kind of wish we had a substate or tag for "believed-resolved" so that we could use that in various queries, like searching for open test flakes.)
Closing for the reasons mentioned above. We can reopen if we start seeing it again.