
cockroachdb crashed in Go runtime during test run: s.allocCount != s.nelems #1146

Closed
davepacheco opened this issue Jun 2, 2022 · 16 comments
Labels: Test Flake (Tests that work. Wait, no. Actually yes. Hang on. Something is broken.)


davepacheco commented Jun 2, 2022

There's a lot of detail in this report. For a summary of this problem, the root cause, and a workaround, see this comment below.

Again trying to reproduce #1130, I found a different issue that caused CockroachDB to exit early:

ATTEMPT 4517

running 1 test
test integration_tests::disks::test_disk_create_disk_that_already_exists_fails ... FAILED

failures:

---- integration_tests::disks::test_disk_create_disk_that_already_exists_fails stdout ----
log file: "/dangerzone/omicron_tmp/try_repro.18644/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.4297.0.log"
note: configured to log to "/dangerzone/omicron_tmp/try_repro.18644/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.4297.0.log"
thread 'integration_tests::disks::test_disk_create_disk_that_already_exists_fails' panicked at 'called `Result::unwrap()` on an `Err` value: Exited', /home/dap/omicron/test-utils/src/dev/mod.rs:141:42
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    integration_tests::disks::test_disk_create_disk_that_already_exists_fails

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 74 filtered out; finished in 0.52s

Test log file:

$ bunyan "/dangerzone/omicron_tmp/try_repro.18644/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.4297.0.log"
[2022-06-02T04:28:27.661727495Z]  INFO: test_disk_create_disk_that_already_exists_fails/4297 on ivanova: cockroach temporary directory: /dangerzone/omicron_tmp/try_repro.18644/.tmpomCYKb
[2022-06-02T04:28:27.662076216Z]  INFO: test_disk_create_disk_that_already_exists_fails/4297 on ivanova: cockroach: copying from seed directory (/home/dap/omicron/target/debug/build/nexus-test-utils-308521ed0d0eed98/out/crdb-base) to storage directory (/dangerzone/omicron_tmp/try_repro.18644/.tmpomCYKb/data)
[2022-06-02T04:28:27.67286125Z]  INFO: test_disk_create_disk_that_already_exists_fails/4297 on ivanova: cockroach command line: cockroach start-single-node --insecure --http-addr=:0 --store=path=/dangerzone/omicron_tmp/try_repro.18644/.tmpomCYKb/data,ballast-size=0 --listen-addr 127.0.0.1:0 --listening-url-file /dangerzone/omicron_tmp/try_repro.18644/.tmpomCYKb/listen-url

The CockroachDB output:

$ ls /dangerzone/omicron_tmp/try_repro.18644/.tmpomCYKb/
total 9
-rw-r--r--   1 dap      staff      10265 Jun  1 21:28 cockroachdb_stderr
-rw-r--r--   1 dap      staff          0 Jun  1 21:28 cockroachdb_stdout
drwxr-xr-x   4 dap      staff         22 Jun  1 21:28 data/
$ cat /dangerzone/omicron_tmp/try_repro.18644/.tmpomCYKb/cockroachdb_stderr 
runtime: s.allocCount= 30 s.nelems= 56
fatal error: s.allocCount != s.nelems && freeIndex == s.nelems

goroutine 1 [running, locked to thread]:
runtime.throw(0x5aa10c3, 0x31)
	/opt/ooce/go-1.16/src/runtime/panic.go:1117 +0x72 fp=0xc000d6d2f8 sp=0xc000d6d2c8 pc=0x10a9512
runtime.(*mcache).nextFree(0xfffffc7fef180108, 0x316, 0x340, 0xfffffc7fef180108, 0x380)
	/opt/ooce/go-1.16/src/runtime/malloc.go:880 +0x228 fp=0xc000d6d330 sp=0xc000d6d2f8 pc=0x107dd88
runtime.mallocgc(0x88, 0x57b9880, 0xfda023cce83e2c01, 0x4)
	/opt/ooce/go-1.16/src/runtime/malloc.go:1069 +0x850 fp=0xc000d6d3b8 sp=0xc000d6d330 pc=0x107e5f0
runtime.newobject(...)
	/opt/ooce/go-1.16/src/runtime/malloc.go:1177
runtime.mapassign(0x5305d00, 0xc000cb8720, 0x712bea0, 0xc000cb8720)
	/opt/ooce/go-1.16/src/runtime/map.go:667 +0x286 fp=0xc000d6d438 sp=0xc000d6d3b8 pc=0x1080286
github.com/aws/aws-sdk-go/aws/endpoints.init()
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/aws/aws-sdk-go/aws/endpoints/defaults.go:6880 +0x83005 fp=0xc000d7f7a8 sp=0xc000d6d438 pc=0x479d345
runtime.doInit(0x922b7a0)
	/opt/ooce/go-1.16/src/runtime/proc.go:6315 +0xec fp=0xc000d7f8f8 sp=0xc000d7f7a8 pc=0x10b99ec
runtime.doInit(0x9233c80)
	/opt/ooce/go-1.16/src/runtime/proc.go:6292 +0x72 fp=0xc000d7fa48 sp=0xc000d7f8f8 pc=0x10b9972
runtime.doInit(0x9242ae0)
	/opt/ooce/go-1.16/src/runtime/proc.go:6292 +0x72 fp=0xc000d7fb98 sp=0xc000d7fa48 pc=0x10b9972
runtime.doInit(0x92299a0)
	/opt/ooce/go-1.16/src/runtime/proc.go:6292 +0x72 fp=0xc000d7fce8 sp=0xc000d7fb98 pc=0x10b9972
runtime.doInit(0x9277c00)
	/opt/ooce/go-1.16/src/runtime/proc.go:6292 +0x72 fp=0xc000d7fe38 sp=0xc000d7fce8 pc=0x10b9972
runtime.doInit(0x92220c0)
	/opt/ooce/go-1.16/src/runtime/proc.go:6292 +0x72 fp=0xc000d7ff88 sp=0xc000d7fe38 pc=0x10b9972
runtime.main()
	/opt/ooce/go-1.16/src/runtime/proc.go:208 +0x205 fp=0xc000d7ffe0 sp=0xc000d7ff88 pc=0x10abd25
runtime.goexit()
	/opt/ooce/go-1.16/src/runtime/asm_amd64.s:1371 +0x1 fp=0xc000d7ffe8 sp=0xc000d7ffe0 pc=0x10e35a1

goroutine 52 [chan receive]:
github.com/cockroachdb/cockroach/pkg/util/log.flushDaemon()
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:75 +0x74
created by github.com/cockroachdb/cockroach/pkg/util/log.init.5
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:41 +0x35

goroutine 36 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043e1a0)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 37 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043e270)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 38 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043e340)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 39 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043e410)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 40 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043e4e0)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 41 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043e5b0)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 42 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043e680)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 43 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043e750)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 44 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043e820)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 45 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043e8f0)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 46 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043e9c0)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 47 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043ea90)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 48 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043eb60)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 49 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043ec30)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 50 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043ed00)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 51 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043edd0)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 53 [chan receive]:
github.com/cockroachdb/cockroach/pkg/util/log.signalFlusher()
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:98 +0x12c
created by github.com/cockroachdb/cockroach/pkg/util/log.init.5
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:42 +0x4d

goroutine 55 [syscall]:
os/signal.signal_recv(0x0)
	/opt/ooce/go-1.16/src/runtime/sigqueue.go:168 +0xa5
os/signal.loop()
	/opt/ooce/go-1.16/src/os/signal/signal_unix.go:23 +0x25
created by os/signal.Notify.func1.1
	/opt/ooce/go-1.16/src/os/signal/signal.go:151 +0x45

goroutine 33 [chan receive]:
github.com/cockroachdb/cockroach/pkg/util/goschedstats.init.0.func1()
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/pkg/util/goschedstats/runnable.go:165 +0x16b
created by github.com/cockroachdb/cockroach/pkg/util/goschedstats.init.0
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/pkg/util/goschedstats/runnable.go:157 +0x35

This appears to be dying inside the Go runtime memory allocator. It looks like golang/go#45775, which unfortunately was closed last year as not-reproducible.

@davepacheco davepacheco changed the title cockroachdb crash in golang during test run: s.allocCount != s.nelems cockroachdb crash in Go runtime during test run: s.allocCount != s.nelems Jun 2, 2022
@davepacheco davepacheco changed the title cockroachdb crash in Go runtime during test run: s.allocCount != s.nelems cockroachdb crashed in Go runtime during test run: s.allocCount != s.nelems Jun 2, 2022
@davepacheco davepacheco added the Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken. label Jul 15, 2022

davepacheco commented Dec 16, 2022

(part 1/5)

The error described above is the most common of several different failure modes that I've suspected (hoped?) were the same root cause. I think I've finally got this one nailed. The write-up is pretty long, so I've split it across five comments.

Background on this problem

Recall that Omicron's test suite has 160+ tests that spin up transient instances of CockroachDB. In some fraction of these attempts, CockroachDB exits unexpectedly with one of several errors from the Go runtime.

We've seen a number of issues that seemed like they could also be related (mentioned below). These three are by far the most common that I've seen, and in that order (i.e., the s.allocCount assertion is the most common failure of all). To my knowledge, until recently I was the only one who had reported these issues, though I did reproduce them on two different machines. (@smklein said just yesterday that he's hitting this a lot on his machine now that he's testing Helios on hardware.) The failures are fairly reliable on my test machine ivanova: I can generally reproduce one within about 3 minutes by running part of the test suite in a loop.

Sometimes these failures happen after the database has already opened up its listening port. When we're lucky, it happens during the cockroach version invocation that we run first to make sure we've got the right version.

Most of the testing has been with CockroachDB 22.1.9 built with Go 1.17.13. We've also seen these failures from the Go test suite itself, on Go versions from 1.16 up through at least 1.19.2. I've also seen them with CockroachDB v22.2.0 (recently released) built with Go 1.19.

I fumbled around for a while doing what I'd call heuristic debugging: looking for patterns and commonalities and guessing at possible causes. What really broke this open was taking a specific failure, diving into the code (the Go runtime memory subsystem), adding instrumentation, poring over data, and iterating. I chose this failure in particular since it was the most common.

Background on the Go memory subsystem

This is of course a gross simplification, but here are some highlights:

  • The Go runtime organizes memory into objects called mspans (sometimes just called spans).
    • Each mspan describes a block of contiguous memory that will be used for allocations of a fixed size. They resemble kmem caches in this way. You can think of the mspan's memory as a sequence of slots that may be either allocated or free. Much of the metadata is organized this way, in terms of each slot's index.
    • The mspan's metadata is stored separately from the memory it describes.
    • Each mspan describes one or more consecutive 8192-byte "pages".
    • The mspan also has an "elemsize" (the size of elements in it), "allocCount" (number of items allocated from it, which is not really used for much), and "nelems" (the number of items in it).
  • Two mspan fields are used to track allocations:
    • "freeIndex" is an index (in the range [0, nelems)) pointing to where to start looking for a free slot. Items with index before "freeIndex" are allocated and cannot be used for new allocation. Items after "freeIndex" may or may not be free -- see the next item.
    • "allocBits" is an 8-byte aligned bitmap with one bit for each item in the mspan. For items with index greater than "freeIndex", the corresponding bit in "allocBits" is 1 if the item is allocated and 0 otherwise.
  • To allocate: you look at the "allocBits" bitmap and find the first 0 bit after bit "freeIndex". Assuming that's less than "nelems", you've found a free object. Set "freeIndex" to the next index (after the one you found). (I'm omitting that there's an optimization here, which is that the current 8-byte hunk of allocBits is inverted and stored into allocCache, and that's what's directly used to find the next 0 bit. This is not relevant to this bug.) This is all implemented (twice, separately) in mspan.nextFreeFast() and mspan.nextFreeIndex().
  • During garbage collection (GC):
    • GC scans memory looking for references to objects that have been allocated from mspans.
    • Then it sweeps each mspan. In sweepLocked.sweep():
      • The sweep() process is given a "gcmarkBits" bitmap. This is the same structure as "allocBits" (one bit per object), but where each bit is 1 iff the corresponding object was marked by GC (i.e., GC found a reference to it).
      • The sweep() process essentially just replaces "allocBits" with the mark bits and sets freeIndex to zero. It also recalculates allocCount based on the number of 1 bits in allocBits.

The sweep behavior is a little unintuitive. Let's walk through an example.

  1. mspan initialized with (say) 56 items, each of size 144 bytes.
    • allocCount = 0, since we haven't allocated anything
    • freeIndex = 0, for the same reason
    • allocBits is a bitmap of all zeros
    • several items fixed for the life of the mspan:
      • nelems = 56
      • elemsize = 144
      • npages = 1 (8192-byte pages)
      • startAddr = start of virtual address range
      • limit = end of virtual address range: this will be startAddr + npages * 8192
  2. Allocate one object.
    • We look at allocBits, starting with bit freeIndex = 0, find a 0, and allocate that object (the first one).
    • allocCount = 1
    • freeIndex = 1
  3. Allocate another object.
    • We look at allocBits, starting with bit freeIndex = 1, find a 0 (remember, allocBits isn't updated on allocation), and allocate that object (the second one).
    • allocCount = 2
    • freeIndex = 2
  4. GC happens.
    • GC finds (say) that only the second object is referenced. gcmarkBits is 010000...
    • GC sweep sets allocBits = gcmarkBits, reflecting that object 0 is now free and object 1 is still in use
    • GC sweep sets freeIndex = 0
    • GC sweep sets allocCount = 1
  5. Allocate another object.
    • We look at allocBits, starting with bit freeIndex = 0, find a 0, and allocate that object (the first one).
    • allocCount = 2
    • freeIndex = 1
  6. Allocate another object.
    • We look at allocBits, starting with bit freeIndex = 1, find that bit 1 is set (the second object is still live) and that the next 0 is at index 2, and allocate that object (the third one).
    • allocCount = 3
    • freeIndex = 3 (i.e., we skipped over the previously-allocated object)

Importantly: "allocBits" is not updated after an allocation! It's only updated during GC, and then it's set directly to "gcmarkBits". This works because we never look at "allocBits" for items with index less than "freeIndex".
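Continuing the toySpan sketch above, sweep can be modeled as the wholesale bitmap replacement just described, and a small driver replays steps 1 through 6 of the walkthrough:

// sweep models the effect of sweepLocked.sweep() described above:
// allocBits is replaced wholesale by the GC mark bits, freeindex resets
// to zero, and allocCount is recomputed by counting 1 bits.
func (s *toySpan) sweep(gcmarkBits []byte) {
	s.allocBits = gcmarkBits
	s.freeindex = 0
	s.allocCount = 0
	for _, b := range s.allocBits {
		for ; b != 0; b &= b - 1 { // clear the lowest set bit each pass
			s.allocCount++
		}
	}
}

func main() {
	// Step 1: 56 slots, allocBits all zeroes (7 bytes = 56 bits).
	s := &toySpan{nelems: 56, allocBits: make([]byte, 7)}
	s.alloc() // step 2: slot 0; allocCount=1, freeindex=1
	s.alloc() // step 3: slot 1; allocCount=2, freeindex=2

	// Step 4: GC marked only the second object, so gcmarkBits = 010000...
	marks := make([]byte, 7)
	marks[0] = 0x02
	s.sweep(marks) // allocCount=1, freeindex=0

	a, _ := s.alloc() // step 5: reuses slot 0
	b, _ := s.alloc() // step 6: skips slot 1, takes slot 2
	println(a, b, s.freeindex, s.allocCount) // prints: 0 2 3 3
}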


davepacheco commented Dec 16, 2022

(part 2/5)

Digging into this specific failure

The message reported is:

runtime: s.allocCount= 30 s.nelems= 56
fatal error: s.allocCount != s.nelems && freeIndex == s.nelems

That's coming from the Go runtime's mcache.nextFree in runtime/malloc.go (it's the frame right below runtime.throw in the stack trace above). What's happening here is that we're trying to do an allocation from this mspan, and we found that it's full (freeIndex == nelems). Fine. But then we assert that allocCount == nelems -- and it doesn't! In other words, we've apparently run out of free slots in this mspan, but we haven't allocated everything in it.
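For reference, the check that fires here, lightly abridged from the Go 1.17 shape of runtime.(*mcache).nextFree:

// Abridged from runtime.(*mcache).nextFree (runtime/malloc.go, Go 1.17).
// freeIndex == s.nelems means the bitmap scan found no free slot, so the
// runtime asserts the span really is full before refilling the mcache.
freeIndex := s.nextFreeIndex()
if freeIndex == s.nelems {
	// The span is full.
	if uintptr(s.allocCount) != s.nelems {
		println("runtime: s.allocCount=", s.allocCount, "s.nelems=", s.nelems)
		throw("s.allocCount != s.nelems && freeIndex == s.nelems")
	}
	c.refill(spc)
	// ...
}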

So, how many things were allocated from this mspan? From the initial failures, we couldn't tell. When this problem occurs, Go dumps out the message shown above and the Goroutine stack traces, but we get no core file and no information about the history of this mspan. The Go runtime provides the GODEBUG=allocfreetrace=1 environment variable setting to trace allocations. Unfortunately I found it way too expensive to use here. cockroach version took upwards of 90s to complete with this enabled. Instead, I wrote a DTrace script to trace allocations and GC sweep events on an mspan. (I tested the script by running with allocfreetrace=1 and spot-checking that various allocations lined up, so the flag was still helpful!)

The script also traces frees, which was a little tricky. There's not normally a "free" operation per se. sweep() just replaces the allocBits bitmap with the gcmarkBits one, which clears bits of anything that wasn't found so that we'll reuse them for subsequent allocations. (Neat!) But there is a GODEBUG=clobberfree=1 flag you can set to cause Go to scribble over each freed object with a known pattern so that it's recognizable when debugging. That's something we can trace. In Go, it uses a function called clobberfree to do this, but we can't directly trace clobberfree with the pid provider because it's inlined. Instead, I used the DWARF to find its address ranges (one could as well skim the disassembly of the calling function for that recognizable pattern). Then I used a pid provider instruction probe to trace it. It was easy to see which register the argument is in.
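For context, clobberfree itself is tiny -- essentially the following, from the runtime's mgcsweep.go (add is a runtime-internal pointer-arithmetic helper) -- which is why one instruction probe with the pointer argument is enough to log every free:

// clobberfree sets the memory content at x to bad content, for debugging
// purposes. The 0xdeadbeef fill is the recognizable pattern mentioned above.
func clobberfree(x unsafe.Pointer, size uintptr) {
	// size (span.elemsize) is always a multiple of 4.
	for i := uintptr(0); i < size; i += 4 {
		*(*uint32)(add(x, i)) = 0xdeadbeef
	}
}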

The result is this D script. I ran this with GODEBUG=clobberfree=1 and GOTRACEBACK=crash in the environment (the latter causes Go to dump core on a fatal error).

After a bunch of iterating and looping to reproduce, I managed to catch this failure mode while the process was instrumented with DTrace, and I got a core file from it:

runtime: s.allocCount= 28 s.nelems= 56
fatal error: s.allocCount != s.nelems && freeIndex == s.nelems

goroutine 1 [running, locked to thread]:
runtime.throw({0x612f80e, 0x31})
	/opt/ooce/go-1.17/src/runtime/panic.go:1198 +0x74 fp=0xc000a2bdf8 sp=0xc000a2bdc8 pc=0x127d7f4
runtime.(*mcache).nextFree(0xfffffc7fef0b55b8, 0x16)
	/opt/ooce/go-1.17/src/runtime/malloc.go:884 +0x228 fp=0xc000a2be38 sp=0xc000a2bdf8 pc=0x124e828
runtime.mallocgc(0x88, 0x5e490e0, 0x1)
	/opt/ooce/go-1.17/src/runtime/malloc.go:1077 +0x530 fp=0xc000a2beb8 sp=0xc000a2be38 pc=0x124ed70
runtime.newobject(...)
	/opt/ooce/go-1.17/src/runtime/malloc.go:1234
runtime.mapassign(0x59ef940, 0xc000bef410, 0x87253c8)
	/opt/ooce/go-1.17/src/runtime/map.go:667 +0x485 fp=0xc000a2bf38 sp=0xc000a2beb8 pc=0x1250ea5
github.com/aws/aws-sdk-go/aws/endpoints.init()
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/aws/aws-sdk-go/aws/endpoints/defaults.go:3321 +0x45dfb fp=0xc000a3f7a8 sp=0xc000a2bf38 pc=0x4e283db
runtime.doInit(0xaf42440)
	/opt/ooce/go-1.17/src/runtime/proc.go:6498 +0x129 fp=0xc000a3f8f8 sp=0xc000a3f7a8 pc=0x128ea89
runtime.doInit(0xaf4b660)
	/opt/ooce/go-1.17/src/runtime/proc.go:6475 +0x7e fp=0xc000a3fa48 sp=0xc000a3f8f8 pc=0x128e9de
runtime.doInit(0xaf60ba0)
	/opt/ooce/go-1.17/src/runtime/proc.go:6475 +0x7e fp=0xc000a3fb98 sp=0xc000a3fa48 pc=0x128e9de
runtime.doInit(0xaf40340)
	/opt/ooce/go-1.17/src/runtime/proc.go:6475 +0x7e fp=0xc000a3fce8 sp=0xc000a3fb98 pc=0x128e9de
runtime.doInit(0xaf98dc0)
	/opt/ooce/go-1.17/src/runtime/proc.go:6475 +0x7e fp=0xc000a3fe38 sp=0xc000a3fce8 pc=0x128e9de
runtime.doInit(0xaf38680)
	/opt/ooce/go-1.17/src/runtime/proc.go:6475 +0x7e fp=0xc000a3ff88 sp=0xc000a3fe38 pc=0x128e9de
runtime.main()
	/opt/ooce/go-1.17/src/runtime/proc.go:238 +0x205 fp=0xc000a3ffe0 sp=0xc000a3ff88 pc=0x1280305
runtime.goexit()
	/opt/ooce/go-1.17/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc000a3ffe8 sp=0xc000a3ffe0 pc=0x12b6e81

goroutine 2 [force gc (idle)]:
runtime.gopark(0x747ba70, 0xb3751f0, 0x11, 0x14, 0x1)
	/opt/ooce/go-1.17/src/runtime/proc.go:366 +0x105 fp=0xc0000b0fb0 sp=0xc0000b0f90 pc=0x12807c5
runtime.goparkunlock(...)
	/opt/ooce/go-1.17/src/runtime/proc.go:372
runtime.forcegchelper()
	/opt/ooce/go-1.17/src/runtime/proc.go:306 +0xc5 fp=0xc0000b0fe0 sp=0xc0000b0fb0 pc=0x1280625
runtime.goexit()
	/opt/ooce/go-1.17/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0000b0fe8 sp=0xc0000b0fe0 pc=0x12b6e81
created by runtime.init.7
	/opt/ooce/go-1.17/src/runtime/proc.go:294 +0x35

goroutine 3 [runnable]:
runtime.Gosched(...)
	/opt/ooce/go-1.17/src/runtime/proc.go:322
runtime.bgsweep()
	/opt/ooce/go-1.17/src/runtime/mgcsweep.go:168 +0x13e fp=0xc0000b17e0 sp=0xc0000b17b0 pc=0x1267dbe
runtime.goexit()
	/opt/ooce/go-1.17/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0000b17e8 sp=0xc0000b17e0 pc=0x12b6e81
created by runtime.gcenable
	/opt/ooce/go-1.17/src/runtime/mgc.go:181 +0x75

goroutine 4 [GC scavenge wait]:
runtime.gopark(0x747ba70, 0xb380220, 0xd, 0x14, 0x1)
	/opt/ooce/go-1.17/src/runtime/proc.go:366 +0x105 fp=0xc0000b1f80 sp=0xc0000b1f60 pc=0x12807c5
runtime.goparkunlock(...)
	/opt/ooce/go-1.17/src/runtime/proc.go:372
runtime.bgscavenge()
	/opt/ooce/go-1.17/src/runtime/mgcscavenge.go:314 +0x2bb fp=0xc0000b1fe0 sp=0xc0000b1f80 pc=0x1265d9b
runtime.goexit()
	/opt/ooce/go-1.17/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0000b1fe8 sp=0xc0000b1fe0 pc=0x12b6e81
created by runtime.gcenable
	/opt/ooce/go-1.17/src/runtime/mgc.go:182 +0x8d

goroutine 5 [finalizer wait]:
runtime.gopark(0x747ba70, 0xb3cba98, 0x10, 0x14, 0x1)
	/opt/ooce/go-1.17/src/runtime/proc.go:366 +0x105 fp=0xc0000b0740 sp=0xc0000b0720 pc=0x12807c5
runtime.goparkunlock(...)
	/opt/ooce/go-1.17/src/runtime/proc.go:372
runtime.runfinq()
	/opt/ooce/go-1.17/src/runtime/mfinal.go:177 +0xc6 fp=0xc0000b07e0 sp=0xc0000b0740 pc=0x125c7c6
runtime.goexit()
	/opt/ooce/go-1.17/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0000b07e8 sp=0xc0000b07e0 pc=0x12b6e81
created by runtime.createfing
	/opt/ooce/go-1.17/src/runtime/mfinal.go:157 +0x57

goroutine 18 [chan receive]:
runtime.gopark(0x747b770, 0xc000282418, 0xe, 0x17, 0x2)
	/opt/ooce/go-1.17/src/runtime/proc.go:366 +0x105 fp=0xc0000ac6a8 sp=0xc0000ac688 pc=0x12807c5
runtime.chanrecv(0xc0002823c0, 0xc0000ac7b8, 0x1)
	/opt/ooce/go-1.17/src/runtime/chan.go:576 +0x5f7 fp=0xc0000ac738 sp=0xc0000ac6a8 pc=0x12474b7
runtime.chanrecv2(0xc0002823c0, 0xc0000ac7b8)
	/opt/ooce/go-1.17/src/runtime/chan.go:444 +0x2b fp=0xc0000ac768 sp=0xc0000ac738 pc=0x1246eab
github.com/cockroachdb/cockroach/pkg/util/log.flushDaemon()
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:75 +0x76 fp=0xc0000ac7e0 sp=0xc0000ac768 pc=0x1d494d6
runtime.goexit()
	/opt/ooce/go-1.17/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0000ac7e8 sp=0xc0000ac7e0 pc=0x12b6e81
created by github.com/cockroachdb/cockroach/pkg/util/log.init.5
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:41 +0x35

goroutine 19 [chan receive]:
runtime.gopark(0x747b770, 0xc0000de118, 0xe, 0x17, 0x2)
	/opt/ooce/go-1.17/src/runtime/proc.go:366 +0x105 fp=0xc0000ace88 sp=0xc0000ace68 pc=0x12807c5
runtime.chanrecv(0xc0000de0c0, 0xc0000acfb0, 0x1)
	/opt/ooce/go-1.17/src/runtime/chan.go:576 +0x5f7 fp=0xc0000acf18 sp=0xc0000ace88 pc=0x12474b7
runtime.chanrecv2(0xc0000de0c0, 0xc0000acfb0)
	/opt/ooce/go-1.17/src/runtime/chan.go:444 +0x2b fp=0xc0000acf48 sp=0xc0000acf18 pc=0x1246eab
github.com/cockroachdb/cockroach/pkg/util/log.signalFlusher()
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:98 +0x145 fp=0xc0000acfe0 sp=0xc0000acf48 pc=0x1d497c5
runtime.goexit()
	/opt/ooce/go-1.17/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0000acfe8 sp=0xc0000acfe0 pc=0x12b6e81
created by github.com/cockroachdb/cockroach/pkg/util/log.init.5
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:42 +0x4d

goroutine 6 [select, locked to thread]:
runtime.gopark(0x747bac8, 0x0, 0x9, 0x18, 0x1)
	/opt/ooce/go-1.17/src/runtime/proc.go:366 +0x105 fp=0xc0000bee20 sp=0xc0000bee00 pc=0x12807c5
runtime.selectgo(0xc0000befa8, 0xc0000b2790, 0x0, 0x0, 0x2, 0x1)
	/opt/ooce/go-1.17/src/runtime/select.go:327 +0x7b0 fp=0xc0000bef40 sp=0xc0000bee20 pc=0x1291bd0
runtime.ensureSigM.func1()
	/opt/ooce/go-1.17/src/runtime/signal_unix.go:890 +0x1f2 fp=0xc0000befe0 sp=0xc0000bef40 pc=0x12ae4f2
runtime.goexit()
	/opt/ooce/go-1.17/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0000befe8 sp=0xc0000befe0 pc=0x12b6e81
created by runtime.ensureSigM
	/opt/ooce/go-1.17/src/runtime/signal_unix.go:873 +0x105

goroutine 20 [GC worker (idle)]:
runtime.gopark(0x747b810, 0xc00011bf20, 0x18, 0x14, 0x0)
	/opt/ooce/go-1.17/src/runtime/proc.go:366 +0x105 fp=0xc0000ad760 sp=0xc0000ad740 pc=0x12807c5
runtime.gcBgMarkWorker()
	/opt/ooce/go-1.17/src/runtime/mgc.go:1200 +0x118 fp=0xc0000ad7e0 sp=0xc0000ad760 pc=0x125f6f8
runtime.goexit()
	/opt/ooce/go-1.17/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0000ad7e8 sp=0xc0000ad7e0 pc=0x12b6e81
created by runtime.gcBgMarkStartWorkers
	/opt/ooce/go-1.17/src/runtime/mgc.go:1124 +0x37

goroutine 21 [GC worker (idle)]:
runtime.gopark(0x747b810, 0xc00020a400, 0x18, 0x14, 0x0)
	/opt/ooce/go-1.17/src/runtime/proc.go:366 +0x105 fp=0xc0000adf60 sp=0xc0000adf40 pc=0x12807c5
runtime.gcBgMarkWorker()
	/opt/ooce/go-1.17/src/runtime/mgc.go:1200 +0x118 fp=0xc0000adfe0 sp=0xc0000adf60 pc=0x125f6f8
runtime.goexit()
	/opt/ooce/go-1.17/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0000adfe8 sp=0xc0000adfe0 pc=0x12b6e81
created by runtime.gcBgMarkStartWorkers
	/opt/ooce/go-1.17/src/runtime/mgc.go:1124 +0x37

goroutine 22 [syscall]:
runtime.notetsleepg(0xb3ccb60, 0xffffffffffffffff)
	/opt/ooce/go-1.17/src/runtime/lock_sema.go:295 +0x45 fp=0xc0000ae798 sp=0xc0000ae758 pc=0x124d7a5
os/signal.signal_recv()
	/opt/ooce/go-1.17/src/runtime/sigqueue.go:169 +0xab fp=0xc0000ae7c0 sp=0xc0000ae798 pc=0x12b23cb
os/signal.loop()
	/opt/ooce/go-1.17/src/os/signal/signal_unix.go:24 +0x25 fp=0xc0000ae7e0 sp=0xc0000ae7c0 pc=0x1d27ba5
runtime.goexit()
	/opt/ooce/go-1.17/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0000ae7e8 sp=0xc0000ae7e0 pc=0x12b6e81
created by os/signal.Notify.func1.1
	/opt/ooce/go-1.17/src/os/signal/signal.go:151 +0x3a

goroutine 34 [GC worker (idle)]:
runtime.gopark(0x747b810, 0xc00058a000, 0x18, 0x14, 0x0)
	/opt/ooce/go-1.17/src/runtime/proc.go:366 +0x105 fp=0xc000586760 sp=0xc000586740 pc=0x12807c5
runtime.gcBgMarkWorker()
	/opt/ooce/go-1.17/src/runtime/mgc.go:1200 +0x118 fp=0xc0005867e0 sp=0xc000586760 pc=0x125f6f8
runtime.goexit()
	/opt/ooce/go-1.17/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0005867e8 sp=0xc0005867e0 pc=0x12b6e81
created by runtime.gcBgMarkStartWorkers
	/opt/ooce/go-1.17/src/runtime/mgc.go:1124 +0x37

goroutine 35 [GC worker (idle)]:
runtime.gopark(0x747b810, 0xc00058a020, 0x18, 0x14, 0x0)
	/opt/ooce/go-1.17/src/runtime/proc.go:366 +0x105 fp=0xc000586f60 sp=0xc000586f40 pc=0x12807c5
runtime.gcBgMarkWorker()
	/opt/ooce/go-1.17/src/runtime/mgc.go:1200 +0x118 fp=0xc000586fe0 sp=0xc000586f60 pc=0x125f6f8
runtime.goexit()
	/opt/ooce/go-1.17/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc000586fe8 sp=0xc000586fe0 pc=0x12b6e81
created by runtime.gcBgMarkStartWorkers
	/opt/ooce/go-1.17/src/runtime/mgc.go:1124 +0x37

goroutine 36 [GC worker (idle)]:
runtime.gopark(0x747b810, 0xc00058a040, 0x18, 0x14, 0x0)
	/opt/ooce/go-1.17/src/runtime/proc.go:366 +0x105 fp=0xc000587760 sp=0xc000587740 pc=0x12807c5
runtime.gcBgMarkWorker()
	/opt/ooce/go-1.17/src/runtime/mgc.go:1200 +0x118 fp=0xc0005877e0 sp=0xc000587760 pc=0x125f6f8
runtime.goexit()
	/opt/ooce/go-1.17/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0005877e8 sp=0xc0005877e0 pc=0x12b6e81
created by runtime.gcBgMarkStartWorkers
	/opt/ooce/go-1.17/src/runtime/mgc.go:1124 +0x37

goroutine 37 [GC worker (idle)]:
runtime.gopark(0x747b810, 0xc00058a060, 0x18, 0x14, 0x0)
	/opt/ooce/go-1.17/src/runtime/proc.go:366 +0x105 fp=0xc000587f60 sp=0xc000587f40 pc=0x12807c5
runtime.gcBgMarkWorker()
	/opt/ooce/go-1.17/src/runtime/mgc.go:1200 +0x118 fp=0xc000587fe0 sp=0xc000587f60 pc=0x125f6f8
runtime.goexit()
	/opt/ooce/go-1.17/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc000587fe8 sp=0xc000587fe0 pc=0x12b6e81
created by runtime.gcBgMarkStartWorkers
	/opt/ooce/go-1.17/src/runtime/mgc.go:1124 +0x37

goroutine 38 [GC worker (idle)]:
runtime.gopark(0x747b810, 0xc00058a080, 0x18, 0x14, 0x0)
	/opt/ooce/go-1.17/src/runtime/proc.go:366 +0x105 fp=0xc000588760 sp=0xc000588740 pc=0x12807c5
runtime.gcBgMarkWorker()
	/opt/ooce/go-1.17/src/runtime/mgc.go:1200 +0x118 fp=0xc0005887e0 sp=0xc000588760 pc=0x125f6f8
runtime.goexit()
	/opt/ooce/go-1.17/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0005887e8 sp=0xc0005887e0 pc=0x12b6e81
created by runtime.gcBgMarkStartWorkers
	/opt/ooce/go-1.17/src/runtime/mgc.go:1124 +0x37

goroutine 39 [GC worker (idle)]:
runtime.gopark(0x747b810, 0xc00058a0a0, 0x18, 0x14, 0x0)
	/opt/ooce/go-1.17/src/runtime/proc.go:366 +0x105 fp=0xc000588f60 sp=0xc000588f40 pc=0x12807c5
runtime.gcBgMarkWorker()
	/opt/ooce/go-1.17/src/runtime/mgc.go:1200 +0x118 fp=0xc000588fe0 sp=0xc000588f60 pc=0x125f6f8
runtime.goexit()
	/opt/ooce/go-1.17/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc000588fe8 sp=0xc000588fe0 pc=0x12b6e81
created by runtime.gcBgMarkStartWorkers
	/opt/ooce/go-1.17/src/runtime/mgc.go:1124 +0x37

goroutine 40 [GC worker (idle)]:
runtime.gopark(0x747b810, 0xc00058a0c0, 0x18, 0x14, 0x0)
	/opt/ooce/go-1.17/src/runtime/proc.go:366 +0x105 fp=0xc000589760 sp=0xc000589740 pc=0x12807c5
runtime.gcBgMarkWorker()
	/opt/ooce/go-1.17/src/runtime/mgc.go:1200 +0x118 fp=0xc0005897e0 sp=0xc000589760 pc=0x125f6f8
runtime.goexit()
	/opt/ooce/go-1.17/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0005897e8 sp=0xc0005897e0 pc=0x12b6e81
created by runtime.gcBgMarkStartWorkers
	/opt/ooce/go-1.17/src/runtime/mgc.go:1124 +0x37

goroutine 41 [GC worker (idle)]:
runtime.gopark(0x747b810, 0xc00058a0e0, 0x18, 0x14, 0x0)
	/opt/ooce/go-1.17/src/runtime/proc.go:366 +0x105 fp=0xc000589f60 sp=0xc000589f40 pc=0x12807c5
runtime.gcBgMarkWorker()
	/opt/ooce/go-1.17/src/runtime/mgc.go:1200 +0x118 fp=0xc000589fe0 sp=0xc000589f60 pc=0x125f6f8
runtime.goexit()
	/opt/ooce/go-1.17/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc000589fe8 sp=0xc000589fe0 pc=0x12b6e81
created by runtime.gcBgMarkStartWorkers
	/opt/ooce/go-1.17/src/runtime/mgc.go:1124 +0x37

goroutine 42 [GC worker (idle)]:
runtime.gopark(0x747b810, 0xc00058a100, 0x18, 0x14, 0x0)
	/opt/ooce/go-1.17/src/runtime/proc.go:366 +0x105 fp=0xc000582760 sp=0xc000582740 pc=0x12807c5
runtime.gcBgMarkWorker()
	/opt/ooce/go-1.17/src/runtime/mgc.go:1200 +0x118 fp=0xc0005827e0 sp=0xc000582760 pc=0x125f6f8
runtime.goexit()
	/opt/ooce/go-1.17/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0005827e8 sp=0xc0005827e0 pc=0x12b6e81
created by runtime.gcBgMarkStartWorkers
	/opt/ooce/go-1.17/src/runtime/mgc.go:1124 +0x37

goroutine 43 [GC worker (idle)]:
runtime.gopark(0x747b810, 0xc00058a120, 0x18, 0x14, 0x0)
	/opt/ooce/go-1.17/src/runtime/proc.go:366 +0x105 fp=0xc0000c3f60 sp=0xc0000c3f40 pc=0x12807c5
runtime.gcBgMarkWorker()
	/opt/ooce/go-1.17/src/runtime/mgc.go:1200 +0x118 fp=0xc0000c3fe0 sp=0xc0000c3f60 pc=0x125f6f8
runtime.goexit()
	/opt/ooce/go-1.17/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0000c3fe8 sp=0xc0000c3fe0 pc=0x12b6e81
created by runtime.gcBgMarkStartWorkers
	/opt/ooce/go-1.17/src/runtime/mgc.go:1124 +0x37

goroutine 44 [GC worker (idle)]:
runtime.gopark(0x747b810, 0xc00058a140, 0x18, 0x14, 0x0)
	/opt/ooce/go-1.17/src/runtime/proc.go:366 +0x105 fp=0xc000583760 sp=0xc000583740 pc=0x12807c5
runtime.gcBgMarkWorker()
	/opt/ooce/go-1.17/src/runtime/mgc.go:1200 +0x118 fp=0xc0005837e0 sp=0xc000583760 pc=0x125f6f8
runtime.goexit()
	/opt/ooce/go-1.17/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0005837e8 sp=0xc0005837e0 pc=0x12b6e81
created by runtime.gcBgMarkStartWorkers
	/opt/ooce/go-1.17/src/runtime/mgc.go:1124 +0x37

goroutine 45 [GC worker (idle)]:
runtime.gopark(0x747b810, 0xc00058a160, 0x18, 0x14, 0x0)
	/opt/ooce/go-1.17/src/runtime/proc.go:366 +0x105 fp=0xc000583f60 sp=0xc000583f40 pc=0x12807c5
runtime.gcBgMarkWorker()
	/opt/ooce/go-1.17/src/runtime/mgc.go:1200 +0x118 fp=0xc000583fe0 sp=0xc000583f60 pc=0x125f6f8
runtime.goexit()
	/opt/ooce/go-1.17/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc000583fe8 sp=0xc000583fe0 pc=0x12b6e81
created by runtime.gcBgMarkStartWorkers
	/opt/ooce/go-1.17/src/runtime/mgc.go:1124 +0x37

goroutine 46 [GC worker (idle)]:
runtime.gopark(0x747b810, 0xc00058a180, 0x18, 0x14, 0x0)
	/opt/ooce/go-1.17/src/runtime/proc.go:366 +0x105 fp=0xc000584760 sp=0xc000584740 pc=0x12807c5
runtime.gcBgMarkWorker()
	/opt/ooce/go-1.17/src/runtime/mgc.go:1200 +0x118 fp=0xc0005847e0 sp=0xc000584760 pc=0x125f6f8
runtime.goexit()
	/opt/ooce/go-1.17/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0005847e8 sp=0xc0005847e0 pc=0x12b6e81
created by runtime.gcBgMarkStartWorkers
	/opt/ooce/go-1.17/src/runtime/mgc.go:1124 +0x37

goroutine 47 [GC worker (idle)]:
runtime.gopark(0x747b810, 0xc00058a1a0, 0x18, 0x14, 0x0)
	/opt/ooce/go-1.17/src/runtime/proc.go:366 +0x105 fp=0xc000584f60 sp=0xc000584f40 pc=0x12807c5
runtime.gcBgMarkWorker()
	/opt/ooce/go-1.17/src/runtime/mgc.go:1200 +0x118 fp=0xc000584fe0 sp=0xc000584f60 pc=0x125f6f8
runtime.goexit()
	/opt/ooce/go-1.17/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc000584fe8 sp=0xc000584fe0 pc=0x12b6e81
created by runtime.gcBgMarkStartWorkers
	/opt/ooce/go-1.17/src/runtime/mgc.go:1124 +0x37

goroutine 52 [chan receive]:
runtime.gopark(0x747b770, 0xc0001022f8, 0xe, 0x17, 0x2)
	/opt/ooce/go-1.17/src/runtime/proc.go:366 +0x105 fp=0xc0000b3de8 sp=0xc0000b3dc8 pc=0x12807c5
runtime.chanrecv(0xc0001022a0, 0xc0000b3f28, 0x1)
	/opt/ooce/go-1.17/src/runtime/chan.go:576 +0x5f7 fp=0xc0000b3e78 sp=0xc0000b3de8 pc=0x12474b7
runtime.chanrecv1(0xc0001022a0, 0xc0000b3f28)
	/opt/ooce/go-1.17/src/runtime/chan.go:439 +0x2b fp=0xc0000b3ea8 sp=0xc0000b3e78 pc=0x1246e6b
github.com/cockroachdb/cockroach/pkg/util/goschedstats.init.0.func1()
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/pkg/util/goschedstats/runnable.go:165 +0x1de fp=0xc0000b3fe0 sp=0xc0000b3ea8 pc=0x43b525e
runtime.goexit()
	/opt/ooce/go-1.17/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0000b3fe8 sp=0xc0000b3fe0 pc=0x12b6e81
created by github.com/cockroachdb/cockroach/pkg/util/goschedstats.init.0
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/pkg/util/goschedstats/runnable.go:157 +0x35

Okay! What does the mspan look like? First, we have to find it. The stack trace that Go printed includes this frame:

runtime.(*mcache).nextFree(0xfffffc7fef0b55b8, 0x16)
	/opt/ooce/go-1.17/src/runtime/malloc.go:884 +0x228 fp=0xc000a2be38 sp=0xc000a2bdf8 pc=0x124e828

0xfffffc7fef0b55b8 is the mcache we're operating on and 0x16 is the spanclass argument. See the source of the function for details, but all we need to know here is that the mspan we picked is at mcache.alloc[spc]. In order to print Go structures (like mcache and mspan) in mdb, I hand-wrote C definitions that looked like the Go ones, and I used mdb's ::typedef to teach mdb about them:

$ pfexec mdb /rpool/cores/core.cockroach.19343
Loading modules: [ libc.so.1 ld.so.1 ]
> ::typedef -c LP64
> ::typedef "struct mspan {     void *next;     void *prev;     void *list;      uintptr_t startAddr;     uintptr_t npages;     void *manualFreeList;     uintptr_t freeindex;     uintptr_t nelems;     uint64_t allocCache;     void *allocBits;     void *gcmarkBits;     uint32_t sweepgen;     uint32_t divMul;     uint16_t allocCount;     uint8_t spanclass;     uint8_t state;     uint8_t needzero;     uint16_t allocCountBeforeCache;     uintptr_t elemsize;     uintptr_t limit; }" mspan_t
> ::typedef "struct { uintptr_t nextSample; uintptr_t scanAlloc; uintptr_t tiny; uintptr_t tinyoffset; uintptr_t tinyAllocs; mspan_t *alloc[136]; }" mcache_t;

Now, we can print that mspan:

> fffffc7fef0b55b8::print -at mcache_t
fffffc7fef0b55b8 mcache_t {
    fffffc7fef0b55b8 uintptr_t nextSample = 0x1d578
    fffffc7fef0b55c0 uintptr_t scanAlloc = 0xe00
    fffffc7fef0b55c8 uintptr_t tiny = 0
    fffffc7fef0b55d0 uintptr_t tinyoffset = 0
    fffffc7fef0b55d8 uintptr_t tinyAllocs = 0
    fffffc7fef0b55e0 mspan_t *[310] alloc = [ cockroach`runtime.emptymspan, cockroach`runtime.emptymspan, cockroach`runtime.emptymspan, cockroach`runtime.emptymspan, 0xfffffc7fee310698, cockroach`runtime.emptymspan, cockroach`runtime.emptymspan, cockroach`runtime.emptymspan, cockroach`runtime.emptymspan, cockroach`runtime.emptymspan, 0xfffffc7fe81ebd40, cockroach`runtime.emptymspan, cockroach`runtime.emptymspan, cockroach`runtime.emptymspan, cockroach`runtime.emptymspan, cockroach`runtime.emptymspan, cockroach`runtime.emptymspan, cockroach`runtime.emptymspan, cockroach`runtime.emptymspan, cockroach`runtime.emptymspan, cockroach`runtime.emptymspan, cockroach`runtime.emptymspan, 0xfffffc7febf68ea0, cockroach`runtime.emptymspan, cockroach`runtime.emptymspan, cockroach`runtime.emptymspan, 0xfffffc7fe81e6f50, cockroach`runtime.emptymspan, cockroach`runtime.emptymspan, cockroach`runtime.emptymspan, 0xfffffc7fe81f7350, cockroach`runtime.emptymspan, ... ]
}

> fffffc7fef0b55b8::print -at mcache_t alloc[0x16]
fffffc7fef0b5690 mspan_t *alloc[0x16] = 0xfffffc7febf68ea0

> 0xfffffc7febf68ea0::print -at mspan_t
fffffc7febf68ea0 mspan_t {
    fffffc7febf68ea0 void *next = 0
    fffffc7febf68ea8 void *prev = 0
    fffffc7febf68eb0 void *list = 0
    fffffc7febf68eb8 uintptr_t startAddr = 0xc000850000
    fffffc7febf68ec0 uintptr_t npages = 0x1
    fffffc7febf68ec8 void *manualFreeList = 0
    fffffc7febf68ed0 uintptr_t freeindex = 0x38
    fffffc7febf68ed8 uintptr_t nelems = 0x38
    fffffc7febf68ee0 uint64_t allocCache = 0x9a
    fffffc7febf68ee8 void *allocBits = 0xfffffc7feeb11088
    fffffc7febf68ef0 void *gcmarkBits = 0xfffffc7feeb11080
    fffffc7febf68ef8 uint32_t sweepgen = 0xd
    fffffc7febf68efc uint32_t divMul = 0x1c71c72
    fffffc7febf68f00 uint16_t allocCount = 0x1c
    fffffc7febf68f02 uint8_t spanclass = 0x16
    fffffc7febf68f03 uint8_t state = 0x1
    fffffc7febf68f04 uint8_t needzero = 0
    fffffc7febf68f06 uint16_t allocCountBeforeCache = 0
    fffffc7febf68f08 uintptr_t elemsize = 0x90
    fffffc7febf68f10 uintptr_t limit = 0xc000851f80
}

That looks plausible -- it's got the right spanclass (from the stack trace), allocCount and nelems (from the error message). It's got the right freeIndex, too.
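Those numbers also hang together arithmetically: in the runtime, a spanClass packs the size class into the upper bits and a "noscan" flag into bit 0, and size class 11 in the runtime's size-class table is 144 bytes. A quick sketch of the decoding (constants taken from the dump above):

package main

import "fmt"

func main() {
	const spanclass = 0x16      // from the stack trace and mspan dump
	sizeclass := spanclass >> 1 // upper bits: runtime size class = 11
	noscan := spanclass&1 != 0  // bit 0: "noscan" flag = false

	// Size class 11 objects are 144 (0x90) bytes; one 8192-byte page
	// holds 8192/144 = 56 (0x38) of them, matching nelems above.
	const elemsize, pagesize = 0x90, 8192
	fmt.Println(sizeclass, noscan, pagesize/elemsize) // 11 false 56
}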

I looked through the DTrace output for this failure, looking for sweeps of this mspan:

$ grep fffffc7febf68ea0 dtrace-19336.0.out
dap: span fffffc7febf68ea0: begin sweep
dap: span fffffc7febf68ea0: begin sweep: allocCount = 1 (0x1)
dap: span fffffc7febf68ea0: begin sweep: freeindex = 1 (0x1)
dap: span fffffc7febf68ea0: begin sweep: sweepgen = 7 (0x7)
dap: span fffffc7febf68ea0: begin sweep: state = 1 (0x1)
dap: span fffffc7febf68ea0: begin sweep: allocCache = 0x7fffffffffffffff
dap: span fffffc7febf68ea0: begin sweep: range [ c000276000, c000278000 )
dap: span fffffc7febf68ea0: begin sweep: nelems = 1 (0x1)
dap: span fffffc7febf68ea0: begin sweep: elemsize = 8192 (0x2000)
dap: span fffffc7febf68ea0: begin sweep: npages = 1
dap: span fffffc7febf68ea0: allocBits:
dap: span fffffc7febf68ea0: gcmarkBits:
dap: span fffffc7febf68ea0: clobbering 0xc000276000
dap: span fffffc7febf68ea0: end sweep
dap: span fffffc7febf68ea0: end sweep: allocCount = 0 (0x0)
dap: span fffffc7febf68ea0: end sweep: freeindex = 0 (0x0)
dap: span fffffc7febf68ea0: end sweep: sweepgen = 8 (0x8)
dap: span fffffc7febf68ea0: end sweep: state = 0 (0x0)
dap: span fffffc7febf68ea0: end sweep: allocCache = 0xffffffffffffffff
dap: span fffffc7febf68ea0: end sweep: range [ c000276000, c000278000 )
dap: span fffffc7febf68ea0: end sweep: nelems = 1 (0x1)
dap: span fffffc7febf68ea0: end sweep: elemsize = 8192 (0x2000)
dap: span fffffc7febf68ea0: end sweep: npages = 1
dap: span fffffc7febf68ea0: allocBits:
dap: span fffffc7febf68ea0: begin sweep
dap: span fffffc7febf68ea0: begin sweep: allocCount = 1 (0x1)
dap: span fffffc7febf68ea0: begin sweep: freeindex = 1 (0x1)
dap: span fffffc7febf68ea0: begin sweep: sweepgen = 9 (0x9)
dap: span fffffc7febf68ea0: begin sweep: state = 1 (0x1)
dap: span fffffc7febf68ea0: begin sweep: allocCache = 0x7fffffffffffffff
dap: span fffffc7febf68ea0: begin sweep: range [ c0007ae000, c0007b5f80 )
dap: span fffffc7febf68ea0: begin sweep: nelems = 5 (0x5)
dap: span fffffc7febf68ea0: begin sweep: elemsize = 6528 (0x1980)
dap: span fffffc7febf68ea0: begin sweep: npages = 4
dap: span fffffc7febf68ea0: allocBits:
dap: span fffffc7febf68ea0: gcmarkBits:
dap: span fffffc7febf68ea0: clobbering 0xc0007ae000
dap: span fffffc7febf68ea0: end sweep
dap: span fffffc7febf68ea0: end sweep: allocCount = 0 (0x0)
dap: span fffffc7febf68ea0: end sweep: freeindex = 0 (0x0)
dap: span fffffc7febf68ea0: end sweep: sweepgen = 10 (0xa)
dap: span fffffc7febf68ea0: end sweep: state = 0 (0x0)
dap: span fffffc7febf68ea0: end sweep: allocCache = 0xffffffffffffffff
dap: span fffffc7febf68ea0: end sweep: range [ c0007ae000, c0007b5f80 )
dap: span fffffc7febf68ea0: end sweep: nelems = 5 (0x5)
dap: span fffffc7febf68ea0: end sweep: elemsize = 6528 (0x1980)
dap: span fffffc7febf68ea0: end sweep: npages = 4
dap: span fffffc7febf68ea0: allocBits:

It looks like an mspan with this address has been swept twice, but both times it was a different mspan (different range, element size, etc.). It's never been swept in its current state. Just to be sure, I checked for any errors or drops from DTrace:

> !grep 'dtrace:' dtrace-19336.0.out
>

and I believe the DTrace output file is complete (not truncated) because it contains all the stderr writes of Go dumping its stack traces (which happens at the end).

So the assertion is complaining that we've got a span with no free slots, yet allocCount says not everything in it is allocated. What, then, is actually allocated? There are two ways to look at it. First, I enumerated the addresses covered by the mspan and, for each one, checked whether there's an allocation and/or free for that address in the DTrace output. The easiest way to do this was to tell mdb about a "my_buffer" type with the same size as the elements in this mspan and have it enumerate the addresses in an array of my_buffer starting at the span's start address:

> ::typedef 'struct { uint8_t foo[0t144]; }' my_buffer
> 0xc000850000::array my_buffer 0t56 ! cat > expected-all.txt

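(For reference, the same enumeration in Go -- slot i of the span begins at startAddr + i*elemsize:)

package main

import "fmt"

func main() {
	// From the mspan dump: startAddr = 0xc000850000, elemsize = 0x90
	// (144), nelems = 0x38 (56). Slot i begins at startAddr + i*elemsize.
	const startAddr, elemsize, nelems = 0xc000850000, 0x90, 56
	for i := 0; i < nelems; i++ {
		fmt.Printf("%x\n", startAddr+i*elemsize)
	}
}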
Then I searched for each one in the trace output:

$ cat expected-all.txt | while read x; do echo "searching for $x: "; grep $x dtrace-19336.0.out; echo; done
searching for c000850000:
dap: alloc size 0x1fb0 = 0xc000850000
dap: span fffffc7fee310c70: begin sweep: range [ c000850000, c000852000 )
dap: span fffffc7fee310c70: clobbering 0xc000850000
dap: span fffffc7fee310c70: end sweep: range [ c000850000, c000852000 )
dap: span fffffc7fee310cf8: begin sweep: range [ c00084e000, c000850000 )
dap: span fffffc7fee310cf8: end sweep: range [ c00084e000, c000850000 )
dap: span fffffc7fee310cf8: begin sweep: range [ c00084e000, c000850000 )
dap: span fffffc7fee310cf8: end sweep: range [ c00084e000, c000850000 )

searching for c000850090:

searching for c000850120:

searching for c0008501b0:
dap: alloc size 0x88 = 0xc0008501b0

searching for c000850240:
dap: alloc size 0x88 = 0xc000850240

searching for c0008502d0:
dap: alloc size 0x88 = 0xc0008502d0

searching for c000850360:

searching for c0008503f0:

searching for c000850480:

searching for c000850510:

searching for c0008505a0:
dap: alloc size 0x88 = 0xc0008505a0

searching for c000850630:

searching for c0008506c0:
dap: alloc size 0x88 = 0xc0008506c0

searching for c000850750:
dap: alloc size 0x88 = 0xc000850750

searching for c0008507e0:
dap: alloc size 0x88 = 0xc0008507e0

searching for c000850870:

searching for c000850900:
dap: alloc size 0x88 = 0xc000850900

searching for c000850990:
dap: alloc size 0x88 = 0xc000850990

searching for c000850a20:
dap: alloc size 0x88 = 0xc000850a20

searching for c000850ab0:
dap: alloc size 0x88 = 0xc000850ab0

searching for c000850b40:

searching for c000850bd0:

searching for c000850c60:
dap: alloc size 0x88 = 0xc000850c60

searching for c000850cf0:

searching for c000850d80:
dap: alloc size 0x88 = 0xc000850d80

searching for c000850e10:
dap: alloc size 0x88 = 0xc000850e10

searching for c000850ea0:
dap: alloc size 0x88 = 0xc000850ea0

searching for c000850f30:
dap: alloc size 0x88 = 0xc000850f30

searching for c000850fc0:
dap: alloc size 0x88 = 0xc000850fc0

searching for c000851050:
dap: alloc size 0x88 = 0xc000851050

searching for c0008510e0:

searching for c000851170:

searching for c000851200:

searching for c000851290:

searching for c000851320:
dap: alloc size 0x88 = 0xc000851320

searching for c0008513b0:
dap: alloc size 0x88 = 0xc0008513b0

searching for c000851440:

searching for c0008514d0:

searching for c000851560:
dap: alloc size 0x88 = 0xc000851560

searching for c0008515f0:
dap: alloc size 0x88 = 0xc0008515f0

searching for c000851680:

searching for c000851710:

searching for c0008517a0:

searching for c000851830:

searching for c0008518c0:
dap: alloc size 0x88 = 0xc0008518c0

searching for c000851950:
dap: alloc size 0x88 = 0xc000851950

searching for c0008519e0:

searching for c000851a70:

searching for c000851b00:
dap: alloc size 0x88 = 0xc000851b00

searching for c000851b90:
dap: alloc size 0x88 = 0xc000851b90

searching for c000851c20:

searching for c000851cb0:

searching for c000851d40:
dap: alloc size 0x88 = 0xc000851d40

searching for c000851dd0:

searching for c000851e60:
dap: alloc size 0x88 = 0xc000851e60

searching for c000851ef0:

The very first address has some false positives. We have an 8112-byte allocation that returned c000850000 -- I infer that this is the allocation for the memory that became the mspan we're inspecting. Then we swept fffffc7fee310c70, which appears to be that single-element 8192-byte mspan. Then we swept an unrelated span that just happened to end at c000850000. I think we can ignore all of those -- this is essentially saying that c000850000 was never allocated from the mspan we're interested in.

Then notice that we didn't allocate a bunch of other addresses (e.g., c000850090, c000850120), but we did allocate some later ones. This seems weird. We never freed any of the addresses and, again, we don't seem to have ever swept this mspan. I summarized it like this:

$ cat expected-all.txt | while read x; do echo -n "$x: "; if grep "dap: alloc size 0x88 = 0x$x" dtrace-19336.0.out > /dev/null; then echo yes; else echo no; fi; done
c000850000: no
c000850090: no
c000850120: no
c0008501b0: yes
c000850240: yes
c0008502d0: yes
c000850360: no
c0008503f0: no
c000850480: no
c000850510: no
c0008505a0: yes
c000850630: no
c0008506c0: yes
c000850750: yes
c0008507e0: yes
c000850870: no
c000850900: yes
c000850990: yes
c000850a20: yes
c000850ab0: yes
c000850b40: no
c000850bd0: no
c000850c60: yes
c000850cf0: no
c000850d80: yes
c000850e10: yes
c000850ea0: yes
c000850f30: yes
c000850fc0: yes
c000851050: yes
c0008510e0: no
c000851170: no
c000851200: no
c000851290: no
c000851320: yes
c0008513b0: yes
c000851440: no
c0008514d0: no
c000851560: yes
c0008515f0: yes
c000851680: no
c000851710: no
c0008517a0: no
c000851830: no
c0008518c0: yes
c000851950: yes
c0008519e0: no
c000851a70: no
c000851b00: yes
c000851b90: yes
c000851c20: no
c000851cb0: no
c000851d40: yes
c000851dd0: no
c000851e60: yes
c000851ef0: no

I also confirmed by hand that the addresses were allocated in increasing address order:

$ cat expected-all.txt | while read x; do echo -n "$x: "; grep -n "dap: alloc size 0x88 = 0x$x" dtrace-19336.0.out || echo none; done | grep -v none
c0008501b0: 229472:dap: alloc size 0x88 = 0xc0008501b0
c000850240: 229476:dap: alloc size 0x88 = 0xc000850240
c0008502d0: 229480:dap: alloc size 0x88 = 0xc0008502d0
c0008505a0: 229483:dap: alloc size 0x88 = 0xc0008505a0
c0008506c0: 229487:dap: alloc size 0x88 = 0xc0008506c0
c000850750: 229491:dap: alloc size 0x88 = 0xc000850750
c0008507e0: 229494:dap: alloc size 0x88 = 0xc0008507e0
c000850900: 229498:dap: alloc size 0x88 = 0xc000850900
c000850990: 229502:dap: alloc size 0x88 = 0xc000850990
c000850a20: 229505:dap: alloc size 0x88 = 0xc000850a20
c000850ab0: 229512:dap: alloc size 0x88 = 0xc000850ab0
c000850c60: 229532:dap: alloc size 0x88 = 0xc000850c60
c000850d80: 229533:dap: alloc size 0x88 = 0xc000850d80
c000850e10: 229562:dap: alloc size 0x88 = 0xc000850e10
c000850ea0: 229563:dap: alloc size 0x88 = 0xc000850ea0
c000850f30: 229564:dap: alloc size 0x88 = 0xc000850f30
c000850fc0: 229565:dap: alloc size 0x88 = 0xc000850fc0
c000851050: 229568:dap: alloc size 0x88 = 0xc000851050
c000851320: 229572:dap: alloc size 0x88 = 0xc000851320
c0008513b0: 229575:dap: alloc size 0x88 = 0xc0008513b0
c000851560: 229579:dap: alloc size 0x88 = 0xc000851560
c0008515f0: 229582:dap: alloc size 0x88 = 0xc0008515f0
c0008518c0: 229586:dap: alloc size 0x88 = 0xc0008518c0
c000851950: 229589:dap: alloc size 0x88 = 0xc000851950
c000851b00: 229593:dap: alloc size 0x88 = 0xc000851b00
c000851b90: 229596:dap: alloc size 0x88 = 0xc000851b90
c000851d40: 229600:dap: alloc size 0x88 = 0xc000851d40
c000851e60: 229603:dap: alloc size 0x88 = 0xc000851e60

I decided to take a look at allocBits for this span. I'd expected these bits to be all zeroes because, again, it seems like this span has never been swept, and it looks to me like these are only ever set during sweep. But what I found is that the allocBits exactly match what the DTrace output shows about which of these are allocated -- except for one thing:

> 0xfffffc7febf68ea0::print -at mspan_t
fffffc7febf68ea0 mspan_t {
    fffffc7febf68ea0 void *next = 0
    fffffc7febf68ea8 void *prev = 0
    fffffc7febf68eb0 void *list = 0
    fffffc7febf68eb8 uintptr_t startAddr = 0xc000850000
    fffffc7febf68ec0 uintptr_t npages = 0x1
    fffffc7febf68ec8 void *manualFreeList = 0
    fffffc7febf68ed0 uintptr_t freeindex = 0x38
    fffffc7febf68ed8 uintptr_t nelems = 0x38
    fffffc7febf68ee0 uint64_t allocCache = 0x9a
    fffffc7febf68ee8 void *allocBits = 0xfffffc7feeb11088
    fffffc7febf68ef0 void *gcmarkBits = 0xfffffc7feeb11080
    fffffc7febf68ef8 uint32_t sweepgen = 0xd
    fffffc7febf68efc uint32_t divMul = 0x1c71c72
    fffffc7febf68f00 uint16_t allocCount = 0x1c
    fffffc7febf68f02 uint8_t spanclass = 0x16
    fffffc7febf68f03 uint8_t state = 0x1
    fffffc7febf68f04 uint8_t needzero = 0
    fffffc7febf68f06 uint16_t allocCountBeforeCache = 0
    fffffc7febf68f08 uintptr_t elemsize = 0x90
    fffffc7febf68f10 uintptr_t limit = 0xc000851f80
}

# print 8 bytes from allocBits:
> 0xfffffc7feeb11088,0t8/B
0xfffffc7feeb11088:             c7      8b      b0      c0      33      cf      ac      b2

# assemble into a little-endian number:
> 0xb2accf33c0b08bc7=K
                b2accf33c0b08bc7

# print the 1 bits
> b2accf33c0b08bc7=j
                1011001010101100110011110011001111000000101100001000101111000111
                | ||  | | | ||  ||  ||||  ||  ||||      | ||    |   | ||||   |||
                | ||  | | | ||  ||  ||||  ||  ||||      | ||    |   | ||||   ||+-- bit 0  mask 0x0000000000000001
                | ||  | | | ||  ||  ||||  ||  ||||      | ||    |   | ||||   |+--- bit 1  mask 0x0000000000000002
                | ||  | | | ||  ||  ||||  ||  ||||      | ||    |   | ||||   +---- bit 2  mask 0x0000000000000004
                | ||  | | | ||  ||  ||||  ||  ||||      | ||    |   | |||+-------- bit 6  mask 0x0000000000000040
                | ||  | | | ||  ||  ||||  ||  ||||      | ||    |   | ||+--------- bit 7  mask 0x0000000000000080
                | ||  | | | ||  ||  ||||  ||  ||||      | ||    |   | |+---------- bit 8  mask 0x0000000000000100
                | ||  | | | ||  ||  ||||  ||  ||||      | ||    |   | +----------- bit 9  mask 0x0000000000000200
                | ||  | | | ||  ||  ||||  ||  ||||      | ||    |   +------------- bit 11 mask 0x0000000000000800
                | ||  | | | ||  ||  ||||  ||  ||||      | ||    +----------------- bit 15 mask 0x0000000000008000
                | ||  | | | ||  ||  ||||  ||  ||||      | |+---------------------- bit 20 mask 0x0000000000100000
                | ||  | | | ||  ||  ||||  ||  ||||      | +----------------------- bit 21 mask 0x0000000000200000
                | ||  | | | ||  ||  ||||  ||  ||||      +------------------------- bit 23 mask 0x0000000000800000
                | ||  | | | ||  ||  ||||  ||  |||+-------------------------------- bit 30 mask 0x0000000040000000
                | ||  | | | ||  ||  ||||  ||  ||+--------------------------------- bit 31 mask 0x0000000080000000
                | ||  | | | ||  ||  ||||  ||  |+---------------------------------- bit 32 mask 0x0000000100000000
                | ||  | | | ||  ||  ||||  ||  +----------------------------------- bit 33 mask 0x0000000200000000
                | ||  | | | ||  ||  ||||  |+-------------------------------------- bit 36 mask 0x0000001000000000
                | ||  | | | ||  ||  ||||  +--------------------------------------- bit 37 mask 0x0000002000000000
                | ||  | | | ||  ||  |||+------------------------------------------ bit 40 mask 0x0000010000000000
                | ||  | | | ||  ||  ||+------------------------------------------- bit 41 mask 0x0000020000000000
                | ||  | | | ||  ||  |+-------------------------------------------- bit 42 mask 0x0000040000000000
                | ||  | | | ||  ||  +--------------------------------------------- bit 43 mask 0x0000080000000000
                | ||  | | | ||  |+------------------------------------------------ bit 46 mask 0x0000400000000000
                | ||  | | | ||  +------------------------------------------------- bit 47 mask 0x0000800000000000
                | ||  | | | |+---------------------------------------------------- bit 50 mask 0x0004000000000000
                | ||  | | | +----------------------------------------------------- bit 51 mask 0x0008000000000000
                | ||  | | +------------------------------------------------------- bit 53 mask 0x0020000000000000
                | ||  | +--------------------------------------------------------- bit 55 mask 0x0080000000000000
                | ||  +----------------------------------------------------------- bit 57 mask 0x0200000000000000
                | |+-------------------------------------------------------------- bit 60 mask 0x1000000000000000
                | +--------------------------------------------------------------- bit 61 mask 0x2000000000000000
                +----------------------------------------------------------------- bit 63 mask 0x8000000000000000

Now, I expected bits 56-63 to be 0, but they shouldn't matter anyway. The rest of these bits align exactly with the unallocated items. This is surprising to me on two levels: if this mspan has never been swept, I'd expect these to be all zeroes. If for some reason it has been swept and these accurately reflect what's allocated, they appear to be inverted, right?

I also checked allocCache (0x9a). This is the result of inverting the above and shifting it by 55:

> ~b2accf33c0b08bc7>>0t55=K
                9a

So that's pretty self-consistent. (Presumably it took 55 shifts rather than 56 because the last slot actually handed out was index 54 -- index 55 looks allocated in these bits -- and after allocating index i, the cache has been shifted a cumulative i + 1 times.)

There's a lot that's confusing here:

  1. Did we ever sweep this mspan or not? The trace output strongly suggests not -- not just that we don't see one, but we also don't see allocation addresses from this mspan ever going backwards (as would happen if we swept it and set freeindex = 0).
  2. But then how did allocBits get set to something that seems close to accurate?
  3. Most importantly: from my read of the code, Go should always allocate consecutive addresses from an mspan until it is swept the first time. How did we manage to skip some addresses?

I found one more compelling piece of information, which is that the addresses for allocBits and gcmarkBits are adjacent:

> 0xfffffc7febf68ea0::print -at mspan_t
fffffc7febf68ea0 mspan_t {
...
    fffffc7febf68ee8 void *allocBits = 0xfffffc7feeb11088
    fffffc7febf68ef0 void *gcmarkBits = 0xfffffc7feeb11080
...
}

That's significant because under normal operation, these two bitfields are allocated from different arenas, which come from different mappings. The only time they'd be adjacent is after an mspan is initialized but before the first time it's swept. Between the DTrace output showing no sweep, the DTrace output showing all allocations in address order, and this data point, I think we can be confident that this mspan has never been swept.

So it seems as though we allocated from this span sparsely, which is strange, given the way we said the allocator works. Why would that be? Maybe our tracing somehow missed the allocations? I wondered whether these supposedly unallocated things were referenced anywhere. Here's one way to dump the whole address space to a text file so we can grep it:

> ::mappings ! awk '/\[/{ print $1",",$3"::dump -g8 -e" }' > memory-dump-commands.txt

$ mdb core.cockroach.19343 < memory-dump-commands.txt > memory.txt

Now, indeed, some of those things that were never allocated are not referenced anywhere:

$ grep c0008510e0 memory.txt
c0008510e0:  0000000000000000 0000000000000000
$ grep c000851170 memory.txt
c000851170:  0000000000000000 0000000000000000

By contrast, this one that was allocated is referenced elsewhere:

$ grep c0008501b0 memory.txt
c0008501b0:  0000000000000000 0000000000000000
c000beb480:  000000c0008501b0 000000c0008502d0
$

This is all consistent with the idea that these just were skipped over. But why? And what's the deal with allocBits being so close to consistent with the allocated addresses, but backwards? The solution may be obvious to the reader, but it wasn't to me. I'd been assuming that the mspan was correctly initialized, that some behavior caused us to skip some elements, and that I'd misunderstood the code that manages allocBits. While discussing this with @bnaecker, we realized that if allocBits had already been set to this pattern when the span was initialized, then everything else makes sense, including both the pattern of allocations and the allocCount of 28 (0x1c), even though we're apparently out of free slots. In other words, these bits are inverted from what's actually allocated because they're what caused the allocations to look like that.

To play this out, suppose the initial allocBits for some reason were 01010101... Then we'd expect to allocate address 0, then address 2, then address 4, etc. By the end, the addresses allocated would be exactly the 0 bits.
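
Here's a toy model of that (my own sketch in C, covering just one 64-bit cache word and loosely following the allocator behavior described above, so the names are only suggestive): treat allocCache as the complement of allocBits and repeatedly grab the lowest 1 bit, the way the allocator's next-free-index scan does:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	/* poisoned initial allocBits: slots 1, 3, 5, and 7 look "allocated" */
	uint64_t allocBits = 0xaa;
	int nelems = 8;

	/* allocCache holds the complement of allocBits: 1 bits mark free slots */
	uint64_t allocCache = ~allocBits;
	int freeindex = 0;

	while (freeindex < nelems) {
		int bit = __builtin_ctzll(allocCache);

		if (freeindex + bit >= nelems)
			break;
		printf("allocate slot %d\n", freeindex + bit);
		allocCache >>= bit + 1;
		freeindex += bit + 1;
	}
	return (0);
}

This prints slots 0, 2, 4, and 6 and then stops: the span looks exhausted even though half its slots were never handed out -- the same sparse-allocation-then-exhaustion shape we see in the trace.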

So at this point, it seems likely that one of these things happened:

  • the allocBits were not all-zero when they were allocated
  • something clobbered allocBits some time after they were allocated

If they were clobbered, there might be other references to the allocBits address in the core file. Are there?

$ grep fffffc7feeb11088 memory.txt 
fffffc7febf68ee0:  000000000000009a fffffc7feeb11088

Nope -- that's the reference within the mspan itself. What about the adjacent word?

$ grep fffffc7feeb11080 memory.txt
fffffc7febf68ef0:  fffffc7feeb11080 01c71c720000000d
fffffc7feeb11080:  275e00ddafc44382 b2accf33c0b08bc7

Nope. The first hit there is the reference within the mspan, and the second is the line describing what's at the address we grep'd for. (There's no corresponding line in the previous example because of the way we dumped this: only every other word-aligned address is labeled.)

What about the bit pattern? It doesn't look like ASCII or any other obvious value. Does it exist elsewhere in the core file?

$ grep b2accf33c0b08bc7 memory.txt
fffffc7feeb11000:  275e00ddafc44382 b2accf33c0b08bc7
fffffc7feeb11020:  275e00ddafc44382 b2accf33c0b08bc7
fffffc7feeb11040:  275e00ddafc44382 b2accf33c0b08bc7
fffffc7feeb11060:  275e00ddafc44382 b2accf33c0b08bc7
fffffc7feeb11080:  275e00ddafc44382 b2accf33c0b08bc7
...
fffffc7feeb1ffc0:  275e00ddafc44382 b2accf33c0b08bc7
fffffc7feeb1ffe0:  275e00ddafc44382 b2accf33c0b08bc7
$

My jaw dropped. There's a ton of this pattern -- 1,920 instances, to be exact:

dap@ivanova repro-2-omicron-test $ grep b2accf33c0b08bc7 memory.txt | awk '{$1 = ""; print}' | sort | uniq -c
1920  275e00ddafc44382 b2accf33c0b08bc7

That's pretty surprising for a bit pattern that supposedly represents the set of objects allocated from one particular 8KiB page.

Looking closer, it's really a 32-byte pattern:

    fffffc7feeb11000:  275e00ddafc44382 b2accf33c0b08bc7
    fffffc7feeb11010:  0000000000000000 0000000000000000
    fffffc7feeb11020:  275e00ddafc44382 b2accf33c0b08bc7
    fffffc7feeb11030:  0000000000000000 0000000000000000
    fffffc7feeb11040:  275e00ddafc44382 b2accf33c0b08bc7
    fffffc7feeb11050:  0000000000000000 0000000000000000
    ...
    fffffc7feeb1ffc0:  275e00ddafc44382 b2accf33c0b08bc7
    fffffc7feeb1ffd0:  0000000000000000 0000000000000000
    fffffc7feeb1ffe0:  275e00ddafc44382 b2accf33c0b08bc7
    fffffc7feeb1fff0:  0000000000000000 0000000000000000
    fffffc7feecf0000:  0000000000001000 fffffc7fef000000
    fffffc7feecf0010:  0000000000000778 fffffc7feecf0000

So that's almost 64KiB (32 bytes times 1920 rows) -- 61,440 bytes, to be more precise. So...who's referencing the start of that range?

$ grep fffffc7feeb11000 memory.txt
fffffc7fee311540:  fffffc7feeb11000 01c71c720000000a
fffffc7feeb11000:  275e00ddafc44382 b2accf33c0b08bc7

So it's referred to by fffffc7fee311540. But I don't know what that is either:

> fffffc7fee311540$m
            BASE            LIMIT             SIZE NAME
fffffc7fee30e000 fffffc7fee34e000            40000 [ unknown ]

But clearly this bit pattern is present in a lot of places, and it seems incredibly unlikely that it would (correctly) represent the set of allocBits for any span.

Let's take another look at where this memory comes from. These bitmaps are allocated via runtime.newMarkBits(), which has a few paths, all of which boil down to either sysAlloc (which winds up mmap'ing a 64 KiB region of anonymous memory, implicitly zeroed) or a block of memory cleared with runtime.memclrNoHeapPointers().

I see that this function seems to work by copying data from %xmm15 all over the buffer it's given. This sounds suspiciously like a bug that @mknyszek had mentioned was found on Linux some time ago. If these registers were somehow non-zero, that could totally explain a bunch of stuff having the same bit pattern.

I had previously found that setting GODEBUG=asyncpreemptoff=1 appears to chase this problem away (i.e., it's not reproducible with that set). As I understand it, async preemption is how the Goroutine scheduler ensures that a goroutine that never reaches a yield point can't hog the CPU: it arranges to have a signal (SIGURG) sent pretty frequently. Based on this data point and the input from Michael, I had previously spent some time digging into the signal handler angle without finding an obvious problem.

At this point I wondered: are we somehow not preserving this register? @rmustacc and I explored this. %xmm isn't a problem, but the code also uses %ymm0, and that turns out to be a problem. Robert helped me write a C program to test this behavior. We confirmed the bug and filed illumos#15254.
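
Here's a sketch in the spirit of that test program (this is a reconstruction with hypothetical structure, not the actual program we used; build with gcc and run on an AVX-capable machine): zero %ymm0 the way memclrNoHeapPointers does, take a signal whose handler dirties the register, and check whether all 256 bits survive. Note that, as described further below, actually triggering the corruption also requires the handler to go off-CPU, so the usleep() and the large iteration count are just attempts to encourage that:

#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static void
handler(int sig)
{
	static const uint8_t junk[32] = {
		0xa5, 0xa5, 0xa5, 0xa5, 0xa5, 0xa5, 0xa5, 0xa5,
		0xa5, 0xa5, 0xa5, 0xa5, 0xa5, 0xa5, 0xa5, 0xa5,
		0xa5, 0xa5, 0xa5, 0xa5, 0xa5, 0xa5, 0xa5, 0xa5,
		0xa5, 0xa5, 0xa5, 0xa5, 0xa5, 0xa5, 0xa5, 0xa5
	};

	(void) sig;
	/* clobber all 256 bits of %ymm0 inside the handler */
	__asm__ volatile("vmovdqu %0, %%ymm0" : : "m" (junk) : "xmm0");
	/* encourage going off-CPU (not async-signal-safe; it's a test) */
	usleep(1000);
}

int
main(void)
{
	uint8_t out[32];

	(void) signal(SIGURG, handler);
	for (int iter = 0; iter < 100000; iter++) {
		/* zero %ymm0 the same way memclrNoHeapPointers does */
		__asm__ volatile("vpxor %%ymm0, %%ymm0, %%ymm0" : : : "xmm0");
		raise(SIGURG);
		/* store %ymm0 and verify that all 32 bytes are still zero */
		__asm__ volatile("vmovdqu %%ymm0, %0" : "=m" (out));
		for (int i = 0; i < 32; i++) {
			if (out[i] != 0) {
				printf("iter %d: byte %d = 0x%02x\n",
				    iter, i, out[i]);
				return (1);
			}
		}
	}
	printf("no corruption observed\n");
	return (0);
}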

So we've confirmed that this is a problem. And it can explain all the data here: if, during span allocation, while allocating allocBits, runtime.memclrNoHeapPointers is interrupted with a signal after it clears %ymm0, then the high bits of %ymm0 are likely to be clobbered with some 1s (the low bits remain zero'd even when hitting that illumos bug), and we'd expect to see exactly what we see: a repeated 32-byte pattern, the trailing half of which are all zeros; we'd then see a strange but predictable pattern of allocations from this span; and if we got to the end of the span before sweeping it, we'd panic exactly like this.

That sounds like a lot of "if"s, but then again, this problem isn't very reproducible to begin with. Still, can we convince ourselves that's what did happen here?

@davepacheco
Copy link
Collaborator Author

davepacheco commented Dec 16, 2022

(part 3/5)

Trying to prove this problem is the same problem

Robert had the idea to trace the point where the kernel delivers a signal and examine %rip. This could tell us whether it's at least possible that we're interrupting this function. I used this D script:

#!/usr/sbin/dtrace -Cs

#pragma D option quiet

BEGIN
{
	printf("dap: tracing pid %d\n", $target);
}

proc:::signal-handle
/pid == $target/
{
	this->regs = (struct regs *)curthread->t_lwp->lwp_regs;
	this->rip = this->regs->r_rip;
	@addrs[uaddr(this->rip)] = count();

	@syms[usym(this->rip)] = count();
}

$ pfexec ./gotracesigs2.d -c 'cockroach version'
dap: tracing pid 10882
Build Tag:        v22.1.9-dirty
Build Time:       2022/10/21 16:56:49
Distribution:     OSS
Platform:         illumos amd64 (x86_64-pc-solaris2.11)
Go Version:       go1.17.13
C Compiler:       gcc 10.3.0
Build Commit ID:  e438c2f89282e607e0e6ca1d38b2e0a622f94493
Build Type:       release

  cockroach`runtime.lock2+0x55                                      1
  cockroach`runtime.unlock2+0x44                                    1
  cockroach`runtime.newobject+0x25                                  1
  cockroach`runtime.mapassign_faststr+0x74                          1
  cockroach`runtime.findObject+0x9                                  1
  cockroach`runtime.findObject+0x3b                                 1
  cockroach`runtime.findObject+0x75                                 1
  cockroach`runtime.findObject+0x78                                 1
  cockroach`runtime.findObject+0x11c                                1
  cockroach`runtime.findObject+0x143                                1
  cockroach`runtime.findObject+0x17e                                1
  cockroach`runtime.bulkBarrierPreWrite+0x295                       1
  cockroach`runtime.gcBgMarkWorker+0x167                            1
  cockroach`runtime.gcDrain+0x120                                   1
  cockroach`runtime.gcDrain+0x126                                   1
  cockroach`runtime.scanblock+0x6b                                  1
  cockroach`runtime.scanblock+0x7c                                  1
  cockroach`runtime.scanblock+0x157                                 1
  cockroach`runtime.scanobject+0x135                                1
  cockroach`runtime.scanobject+0x13e                                1
  cockroach`runtime.scanobject+0x152                                1
  cockroach`runtime.scanobject+0x166                                1
  cockroach`runtime.scanobject+0x16a                                1
  cockroach`runtime.scanobject+0x1a4                                1
  cockroach`runtime.scanobject+0x1b3                                1
  cockroach`runtime.scanobject+0x215                                1
  cockroach`runtime.scanobject+0x225                                1
  cockroach`runtime.scanobject+0x227                                1
  cockroach`runtime.greyobject+0xf9                                 1
  cockroach`runtime.getempty+0x18b                                  1
  cockroach`runtime.getempty+0x202                                  1
  cockroach`runtime.trygetfull+0x44                                 1
  cockroach`runtime.semacreate+0x1b                                 1
  cockroach`runtime.startm+0x80                                     1
  cockroach`runtime.findrunnable+0x2fe                              1
  cockroach`runtime.stealWork+0xbf                                  1
  cockroach`runtime.schedule+0xa0                                   1
  cockroach`runtime.schedule+0x10b                                  1
  cockroach`runtime.checkTimers+0x9                                 1
  cockroach`runtime.checkTimers+0x2f                                1
  cockroach`runtime.mput+0x27                                       1
  cockroach`runtime.runqgrab+0x2f                                   1
  cockroach`runtime.runqgrab+0x3a                                   1
  cockroach`runtime.makeslice+0xd                                   1
  cockroach`runtime.stackpoolalloc+0x122                            1
  cockroach`runtime.stackalloc+0x10e                                1
  cockroach`runtime.pcvalue+0x1ca                                   1
  cockroach`runtime.step+0xc2                                       1
  cockroach`runtime.siftdownTimer+0x56                              1
  cockroach`runtime.gentraceback+0xfc9                              1
  cockroach`runtime.gcBgMarkWorker.func1+0x26                       1
  cockroach`runtime.memclrNoHeapPointers+0x1f1                      1
  cockroach`runtime.memclrNoHeapPointers+0x24c                      1
  cockroach`runtime.asmsysvicall6+0x7f                              1
  cockroach`compress/flate.(*decompressor).huffmanBlock+0xb6                1
  cockroach`github.com/kr/text.WrapWords+0xef                       1
  cockroach`github.com/kr/text.WrapWords+0x3e5                      1
  libc.so.1`__cp_tsc_read+0xf                                       1
  libc.so.1`queue_lock+0x6e                                         1
  libc.so.1`no_preempt+0x41                                         1
  libc.so.1`mmap+0xa                                                1
  libc.so.1`_lwp_kill+0x18                                          1
  libc.so.1`_syscall6+0x1b                                          1
  cockroach`runtime.(*lfstack).push+0x69                            2
  cockroach`runtime.lock2+0x50                                      2
  cockroach`runtime.findObject+0x2c                                 2
  cockroach`runtime.gcBgMarkWorker+0x26d                            2
  cockroach`runtime.scanblock+0x80                                  2
  cockroach`runtime.scanblock+0x83                                  2
  cockroach`runtime.scanblock+0x9b                                  2
  cockroach`runtime.greyobject+0xae                                 2
  cockroach`runtime.procyield+0x6                                   2
  cockroach`runtime.memclrNoHeapPointers+0x128                      2
  cockroach`runtime.memmove+0x42a                                   2
  libc.so.1`_portfs+0xa                                             2
  libc.so.1`yield+0xa                                               2
  cockroach`runtime.findObject+0x14                                 3
  cockroach`runtime.findObject+0x3f                                 3
  cockroach`runtime.scanobject+0x16d                                7
  libc.so.1`__lwp_unpark+0x14                                       9
  libc.so.1`__lwp_park+0x17                                        10
  libc.so.1`__sigresend+0xa                                        11
  libc.so.1`_lwp_kill+0xa                                          13
  cockroach`runtime.unlock2                                         1
  cockroach`runtime.newobject                                       1
  cockroach`runtime.mapassign_faststr                               1
  cockroach`runtime.bulkBarrierPreWrite                             1
  cockroach`runtime.trygetfull                                      1
  cockroach`runtime.semacreate                                      1
  cockroach`runtime.startm                                          1
  cockroach`runtime.findrunnable                                    1
  cockroach`runtime.stealWork                                       1
  cockroach`runtime.mput                                            1
  cockroach`runtime.makeslice                                       1
  cockroach`runtime.stackpoolalloc                                  1
  cockroach`runtime.stackalloc                                      1
  cockroach`runtime.pcvalue                                         1
  cockroach`runtime.step                                            1
  cockroach`runtime.siftdownTimer                                   1
  cockroach`runtime.gentraceback                                    1
  cockroach`runtime.gcBgMarkWorker.func1                            1
  cockroach`runtime.asmsysvicall6                                   1
  cockroach`compress/flate.(*decompressor).huffmanBlock                1
  libc.so.1`__cp_tsc_read                                           1
  libc.so.1`queue_lock                                              1
  libc.so.1`no_preempt                                              1
  libc.so.1`mmap                                                    1
  libc.so.1`_syscall6                                               1
  cockroach`runtime.(*lfstack).push                                 2
  cockroach`runtime.gcDrain                                         2
  cockroach`runtime.getempty                                        2
  cockroach`runtime.schedule                                        2
  cockroach`runtime.checkTimers                                     2
  cockroach`runtime.runqgrab                                        2
  cockroach`runtime.procyield                                       2
  cockroach`runtime.memmove                                         2
  cockroach`github.com/kr/text.WrapWords                            2
  libc.so.1`_portfs                                                 2
  libc.so.1`yield                                                   2
  cockroach`runtime.lock2                                           3
  cockroach`runtime.gcBgMarkWorker                                  3
  cockroach`runtime.greyobject                                      3
  cockroach`runtime.memclrNoHeapPointers                            4
  cockroach`runtime.scanblock                                       9
  libc.so.1`__lwp_unpark                                            9
  libc.so.1`__lwp_park                                             10
  libc.so.1`__sigresend                                            11
  libc.so.1`_lwp_kill                                              14
  cockroach`runtime.findObject                                     15
  cockroach`runtime.scanobject                                     17

There are two histograms back-to-back there: the first is a count of actual instructions interrupted, while the second is a count by function name. We interrupted runtime.memclrNoHeapPointers 4 times during this invocation. I've run this a few times and each time we interrupt that function a few times. So it's certainly possible we're taking a signal there.

I thought I'd try the experiment of using DTrace to instrument an instruction inside runtime.memclrNoHeapPointers after %ymm0 has been cleared, and then raise(SIGURG) -- which, as the name suggests, raises that signal synchronously in the current thread. (I tried SIGUSR1 first, then used SIGURG to match what would be happening with Go. The results didn't change.) From my read of the function, we're going to do the bulk of our work in this loop:

 92 loop_avx2:
 93         VMOVDQU Y0, 0(DI)
 94         VMOVDQU Y0, 32(DI)
 95         VMOVDQU Y0, 64(DI)
 96         VMOVDQU Y0, 96(DI)
 97         SUBQ    $128, BX
 98         ADDQ    $128, DI
 99         CMPQ    BX, $128
100         JAE     loop_avx2
101         VMOVDQU  Y0, -32(DI)(BX*1)
102         VMOVDQU  Y0, -64(DI)(BX*1)
103         VMOVDQU  Y0, -96(DI)(BX*1)
104         VMOVDQU  Y0, -128(DI)(BX*1)
105         VZEROUPPER
106         RET

which corresponds to these instructions, as viewed in mdb:

runtime.memclrNoHeapPointers+0x128:     vmovdqu %ymm0,(%rdi)
runtime.memclrNoHeapPointers+0x12c:     vmovdqu %ymm0,0x20(%rdi)
runtime.memclrNoHeapPointers+0x131:     vmovdqu %ymm0,0x40(%rdi)
runtime.memclrNoHeapPointers+0x136:     vmovdqu %ymm0,0x60(%rdi)
runtime.memclrNoHeapPointers+0x13b:     subq   $0x80,%rbx
runtime.memclrNoHeapPointers+0x142:     addq   $0x80,%rdi
runtime.memclrNoHeapPointers+0x149:     cmpq   $0x80,%rbx
runtime.memclrNoHeapPointers+0x150:     jae    -0x2a    <runtime.memclrNoHeapPointers+0x128>
runtime.memclrNoHeapPointers+0x152:     vmovdqu %ymm0,-0x20(%rdi,%rbx)
runtime.memclrNoHeapPointers+0x158:     vmovdqu %ymm0,-0x40(%rdi,%rbx)
runtime.memclrNoHeapPointers+0x15e:     vmovdqu %ymm0,-0x60(%rdi,%rbx)
runtime.memclrNoHeapPointers+0x164:     vmovdqu %ymm0,-0x80(%rdi,%rbx)
runtime.memclrNoHeapPointers+0x16a:     vzeroupper
runtime.memclrNoHeapPointers+0x16d:     ret

I initially tried to catch it partway through this loop after some number of iterations, then I just tried to get it on entry to the loop:

#!/usr/sbin/dtrace -Cs

/*
 * attempt to trigger "allocCount != nelems" panic in Go by raising a signal at
 * just the right time during mspan initialization.
 *
 * The idea here is that if we're in allocSpan() and call newAllocBits() and
 * wind up in the case where we grab an existing page and attempt to bzero it,
 * then we raise the signal inside memclrNoHeapPointers() some time after %ymm0
 * has been cleared.  (Ideally, we'd do this after a few iterations of the main
 * loop after that.  Then we could really see a bunch of zero'd data followed by
 * a bunch of ... not quite zero'd data.)
 */

#pragma D option destructive

pid$target::runtime.newMarkBits:entry
{
	self->in_mark_bits = 1;
}

pid$target::runtime.newMarkBits:return
{
	self->in_mark_bits = 0;
}

pid$target::runtime.memclrNoHeapPointers:11f
/self->in_mark_bits/
{
	@ = count();
	raise(SIGURG);
	self->in_mark_bits = 0;
}

and eventually I wound up instrumenting the instruction immediately after we cleared %ymm0:

loop_preheader_avx2:
	VPXOR Y0, Y0, Y0
	// For smaller sizes MOVNTDQ may be faster or slower depending on hardware.
	// For larger sizes it is always faster, even on dual Xeons with 30M cache.
	// TODO take into account actual LLC size. E. g. glibc uses LLC size/2.
	CMPQ    BX, $0x2000000
	JAE     loop_preheader_avx2_huge

which is a few instructions earlier than the above:

runtime.memclrNoHeapPointers+0x116:     jmp    -0x10e   <runtime.memclrNoHeapPointers+0xd>
runtime.memclrNoHeapPointers+0x11b:     vpxor  %ymm0,%ymm0,%ymm0
runtime.memclrNoHeapPointers+0x11f:     cmpq   $0x2000000,%rbx  <github.com/cockroachdb/cockroach/pkg/roachpb.skipErrors+0x3c0>
runtime.memclrNoHeapPointers+0x126:     jae    +0x46    <runtime.memclrNoHeapPointers+0x16e>
runtime.memclrNoHeapPointers+0x128:     vmovdqu %ymm0,(%rdi)
runtime.memclrNoHeapPointers+0x12c:     vmovdqu %ymm0,0x20(%rdi)

I was hoping this would reliably reproduce the problem...but it doesn't. I did manage to get some hits (failures with exactly this error message) -- probably 2-3 times out of a hundred runs. That's far more often than it usually happens, but it wasn't the satisfying, dead-reliable reproducibility I'd hoped for. I relaxed the script even further:

#pragma D option destructive

pid$target::runtime.memclrNoHeapPointers:11f
{
        @ = count();
        raise(SIGURG);
}

That triggered quite a few of this new failure mode:

$ pfexec ../boom1.d -c 'cockroach version'
dtrace: script '../boom1.d' matched 1 probe
dtrace: allowing destructive actions
fatal error: addtimer called with initialized timer

goroutine 18 [running]:
runtime.throw({0x60d6720, 0x26})
	/opt/ooce/go-1.17/src/runtime/panic.go:1198 +0x74 fp=0xc0000aa6d0 sp=0xc0000aa6a0 pc=0x127d6b4
runtime.addtimer(0xc000214008)
	/opt/ooce/go-1.17/src/runtime/time.go:260 +0x145 fp=0xc0000aa710 sp=0xc0000aa6d0 pc=0x129f485
time.startTimer(0xc000214008)
	/opt/ooce/go-1.17/src/runtime/time.go:212 +0x2b fp=0xc0000aa728 sp=0xc0000aa710 pc=0x12b38cb
time.NewTicker(0x3b9aca00)
	/opt/ooce/go-1.17/src/time/tick.go:39 +0x105 fp=0xc0000aa768 sp=0xc0000aa728 pc=0x12dabc5
time.Tick(...)
	/opt/ooce/go-1.17/src/time/tick.go:68
github.com/cockroachdb/cockroach/pkg/util/log.flushDaemon()
	/home/dap/garbage-compactor/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:75 +0x2e fp=0xc0000aa7e0 sp=0xc0000aa768 pc=0x1d4934e
runtime.goexit()
	/opt/ooce/go-1.17/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0000aa7e8 sp=0xc0000aa7e0 pc=0x12b6d41
created by github.com/cockroachdb/cockroach/pkg/util/log.init.5
	/home/dap/garbage-compactor/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:41 +0x35

goroutine 1 [runnable, locked to thread]:
github.com/gogo/protobuf/proto.RegisterType({0x85bfc50, 0x0}, {0x618e149, 0x42})
	/home/dap/garbage-compactor/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/gogo/protobuf/proto/properties.go:555 +0x285
github.com/cockroachdb/cockroach/pkg/storage/enginepb.init.4()
	/home/dap/garbage-compactor/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/pkg/storage/enginepb/mvcc.pb.go:330 +0xab

goroutine 19 [runnable]:
os/signal.Notify.func1(0x1)
	/opt/ooce/go-1.17/src/os/signal/signal.go:138 +0xeb
os/signal.Notify(0xc0000da0c0, {0xc0000aafc0, 0x1, 0x1})
	/opt/ooce/go-1.17/src/os/signal/signal.go:165 +0x20c
github.com/cockroachdb/cockroach/pkg/util/sysutil.RefreshSignaledChan(...)
	/home/dap/garbage-compactor/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/pkg/util/sysutil/sysutil.go:59
github.com/cockroachdb/cockroach/pkg/util/log.signalFlusher()
	/home/dap/garbage-compactor/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:97 +0x9c
created by github.com/cockroachdb/cockroach/pkg/util/log.init.5
	/home/dap/garbage-compactor/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:42 +0x4d
dtrace: pid 11676 has exited

                  295

which, based on the message alone, looks pretty expected for a problem where supposedly zeroed memory is not zeroed.

While doing this, I also saw an instance of the "marked free object in span" error. This too is expected -- see my separate comment about other failure modes below.

Between seeing that a typical run of cockroach version interrupts memclrNoHeapPointers a few times, that we're definitely not preserving %ymm0 when that happens, and that this clearly breaks things, I feel pretty good about this being the problem.

Still, it'd be nice to convince ourselves that there's no other problem here. On Robert's suggestion, I tried disabling the kernel's support for avx2 by putting set xsave_force_disable = 0 into /etc/system.d and rebooting. I confirmed that the flag was read and processed, but we discovered that it doesn't appear to work as expected: isainfo -x still reports avx2 support and Go continues to use it. I undid this change.

I thought about building a new Go runtime with a modified memclrNoHeapPointers that doesn't use %ymm0 (i.e., branches to the non-avx2 loop, which uses the %xmm registers that work properly). But I wanted to make sure it was truly exactly the same except for this behavior. So instead of building a new Go binary and then a new Cockroach binary, I patched the cockroach binary that I had to achieve the same thing. Here are the instructions where it checks whether to use avx2 based on cpuid and OS support (which was determined previously):

CMPB	internal∕cpu·X86+const_offsetX86HasAVX2(SB), $1
JE loop_preheader_avx2

which is +0x7e here:

runtime.memclrNoHeapPointers+0x64:      jbe    +0x1ae   <runtime.memclrNoHeapPointers+0x218>
runtime.memclrNoHeapPointers+0x6a:      cmpq   $0x100,%rbx
runtime.memclrNoHeapPointers+0x71:      jbe    +0x1d5   <runtime.memclrNoHeapPointers+0x24c>
runtime.memclrNoHeapPointers+0x77:      cmpb   $0x1,+0xa12d425(%rip)    <0xb3e4dc3>
runtime.memclrNoHeapPointers+0x7e:      je     +0x97    <runtime.memclrNoHeapPointers+0x11b>
runtime.memclrNoHeapPointers+0x84:      movdqu %xmm15,(%rdi)
runtime.memclrNoHeapPointers+0x89:      movdqu %xmm15,0x10(%rdi)
runtime.memclrNoHeapPointers+0x8f:      movdqu %xmm15,0x20(%rdi)
runtime.memclrNoHeapPointers+0x95:      movdqu %xmm15,0x30(%rdi)

If I just nop out the six bytes at [ +0x7e , +0x84 ), then we'll skip the branch just as though avx2 support were not available. So I copied the binary and (thanks to a tip from @wesolows) did this:

> ::set -w
> 0x12b799e,0t6?B
runtime.memclrNoHeapPointers+0x7e:              f       84      97      0       0       0       
> 0x12b799e,0t6?v 0x90
runtime.memclrNoHeapPointers+0x7e:              0xf     =       0x90
runtime.memclrNoHeapPointers+0x7f:              0x84    =       0x90
runtime.memclrNoHeapPointers+0x80:              0x97    =       0x90
runtime.memclrNoHeapPointers+0x81:              0       =       0x90
runtime.memclrNoHeapPointers+0x82:              0       =       0x90
runtime.memclrNoHeapPointers+0x83:              0       =       0x90

Now it looks like this:

runtime.memclrNoHeapPointers+0x64:      jbe    +0x1ae   <runtime.memclrNoHeapPointers+0x218>
runtime.memclrNoHeapPointers+0x6a:      cmpq   $0x100,%rbx
runtime.memclrNoHeapPointers+0x71:      jbe    +0x1d5   <runtime.memclrNoHeapPointers+0x24c>
runtime.memclrNoHeapPointers+0x77:      cmpb   $0x1,+0xa12d425(%rip)    <0xb3e4dc3>
runtime.memclrNoHeapPointers+0x7e:      nop    
runtime.memclrNoHeapPointers+0x7f:      nop    
runtime.memclrNoHeapPointers+0x80:      nop    
runtime.memclrNoHeapPointers+0x81:      nop    
runtime.memclrNoHeapPointers+0x82:      nop    
runtime.memclrNoHeapPointers+0x83:      nop    
runtime.memclrNoHeapPointers+0x84:      movdqu %xmm15,(%rdi)
runtime.memclrNoHeapPointers+0x89:      movdqu %xmm15,0x10(%rdi)
runtime.memclrNoHeapPointers+0x8f:      movdqu %xmm15,0x20(%rdi)

With this in place, if there are no other issues here, then the test suite should run indefinitely. Recall that on my test machine, running this chunk of the Omicron test suite in a loop reliably fails within 3-4 minutes. I confirmed this with the usual binary just to be sure, then switched to my modified binary. It ran for 98m before dying on #1936. I reran it overnight and it ran for 736m (12h+) and then died again on the same issue. So I'm feeling pretty good about this! (And also eager to fix #1936 and see how long the test suite can run in a loop.)

@davepacheco
Copy link
Collaborator Author

davepacheco commented Dec 16, 2022

(part 4/5)

Explaining other failure modes

Above, I dissected a particular instance of this failure mode (fatal error: s.allocCount != s.nelems && freeIndex == s.nelems) and it's clear how this problem could cause this issue. What about the other failure modes?

fatal error: found pointer to free object

This was reported under #1223. It was also mentioned above as one of the ones I saw while trying to reproduce this problem by raising a signal at the right time. The failure looks like this:

runtime: marked free object in span 0xfffffc7fee33af40, elemsize=144 freeindex=44 (bad use of unsafe.Pointer? try -d=checkptr)
0xc000f20000 alloc marked
0xc000f20090 alloc marked
...
0xc000f21830 alloc marked
0xc000f218c0 free  marked   zombie
0x000000c000f218c0:  0x0000000000000000  0x0000000000000000
0x000000c000f218d0:  0x0000000000000000  0x0000000000000000
0x000000c000f218e0:  0x0000000000000000  0x0000000000000000
0x000000c000f218f0:  0x0000000000000000  0x0000000000000000
0x000000c000f21900:  0x0000000000000000  0x0000000000000000
0x000000c000f21910:  0x0000000000000000  0x0000000000000000
0x000000c000f21920:  0x0000000000000000  0x0000000000000000
0x000000c000f21930:  0x0000000000000000  0x0000000000000000
0x000000c000f21940:  0x0000000000000000  0x0000000000000000
0xc000f21950 free  marked   zombie
0x000000c000f21950:  0x0000000000000000  0x0000000000000000
0x000000c000f21960:  0x0000000000000000  0x0000000000000000
0x000000c000f21970:  0x0000000000000000  0x0000000000000000
0x000000c000f21980:  0x0000000000000000  0x0000000000000000
0x000000c000f21990:  0x0000000000000000  0x0000000000000000
0x000000c000f219a0:  0x0000000000000000  0x0000000000000000
0x000000c000f219b0:  0x0000000000000000  0x0000000000000000
0x000000c000f219c0:  0x0000000000000000  0x0000000000000000
0x000000c000f219d0:  0x0000000000000000  0x0000000000000000
0xc000f219e0 free  unmarked
0xc000f21a70 free  unmarked
...
fatal error: found pointer to free object

runtime stack:
runtime.throw({0x6097e1e, 0x1c})
	/opt/ooce/go-1.17/src/runtime/panic.go:1198 +0x74 fp=0xfffffc7fe9fffb40 sp=0xfffffc7fe9fffb10 pc=0x127d6b4
runtime.(*mspan).reportZombies(0xfffffc7fee33af40)
	/opt/ooce/go-1.17/src/runtime/mgcsweep.go:691 +0x345 fp=0xfffffc7fe9fffbc0 sp=0xfffffc7fe9fffb40 pc=0x1269505
runtime.(*sweepLocked).sweep(0xfffffc7fe9fffcc0, 0x0)
	/opt/ooce/go-1.17/src/runtime/mgcsweep.go:519 +0x35a fp=0xfffffc7fe9fffca8 sp=0xfffffc7fe9fffbc0 pc=0x126881a
runtime.(*mcentral).uncacheSpan(0xb3dc9e8, 0xfffffc7fee33af40)
	/opt/ooce/go-1.17/src/runtime/mcentral.go:223 +0xcf fp=0xfffffc7fe9fffcd8 sp=0xfffffc7fe9fffca8 pc=0x125a18f
runtime.(*mcache).releaseAll(0xfffffc7fef1f8108)
	/opt/ooce/go-1.17/src/runtime/mcache.go:279 +0x134 fp=0xfffffc7fe9fffd20 sp=0xfffffc7fe9fffcd8 pc=0x1259694
runtime.(*mcache).prepareForSweep(0xfffffc7fef1f8108)
	/opt/ooce/go-1.17/src/runtime/mcache.go:317 +0x46 fp=0xfffffc7fe9fffd48 sp=0xfffffc7fe9fffd20 pc=0x12597a6
...

(I've snipped a lot for brevity.) I reproduced this particular instance using the same DTrace instrumentation and also have a core file. I used similar ::typedef stuff to be able to view Go structures in the core file.

What's happening here is that we're sweeping a span (see the description above of what that entails), and we've found several bits in gcmarkBits that indicate that an object was marked (found by GC) but not allocated (its index is greater than freeindex and its "allocBit" is clear). That should be impossible, of course. In this failure mode, we have an explicit span on the stack (0xfffffc7fee33af40). Its startAddr/limit, elemsize, and freeindex all match the message:

> 0xfffffc7fee33af40::print -at mspan_t
fffffc7fee33af40 mspan_t {
    fffffc7fee33af40 void *next = 0
    fffffc7fee33af48 void *prev = 0
    fffffc7fee33af50 void *list = 0
    fffffc7fee33af58 uintptr_t startAddr = 0xc000f20000
    fffffc7fee33af60 uintptr_t npages = 0x1
    fffffc7fee33af68 void *manualFreeList = 0
    fffffc7fee33af70 uintptr_t freeindex = 0x2c
    fffffc7fee33af78 uintptr_t nelems = 0x38
    fffffc7fee33af80 uint64_t allocCache = 0xfffff
    fffffc7fee33af88 void *allocBits = 0xfffffc7fe60900a8
    fffffc7fee33af90 void *gcmarkBits = 0xfffffc7fe60900a0
    fffffc7fee33af98 uint32_t sweepgen = 0x9
    fffffc7fee33af9c uint32_t divMul = 0x1c71c72
    fffffc7fee33afa0 uint16_t allocCount = 0x2c
    fffffc7fee33afa2 uint8_t spanclass = 0x16
    fffffc7fee33afa3 uint8_t state = 0x1
    fffffc7fee33afa4 uint8_t needzero = 0
    fffffc7fee33afa6 uint16_t allocCountBeforeCache = 0
    fffffc7fee33afa8 uintptr_t elemsize = 0x90
    fffffc7fee33afb0 uintptr_t limit = 0xc000f21f80
}

The DTrace output shows:

$ grep fffffc7fee33af40 dtrace.out
dap: span fffffc7fee33af40: begin sweep
dap: span fffffc7fee33af40: begin sweep: allocCount = 44 (0x2c)
dap: span fffffc7fee33af40: begin sweep: freeindex = 44 (0x2c)
dap: span fffffc7fee33af40: begin sweep: sweepgen = 9 (0x9)
dap: span fffffc7fee33af40: begin sweep: state = 1 (0x1)
dap: span fffffc7fee33af40: begin sweep: allocCache = 0xfffff
dap: span fffffc7fee33af40: begin sweep: range [ c000f20000, c000f21f80 )
dap: span fffffc7fee33af40: begin sweep: nelems = 56 (0x38)
dap: span fffffc7fee33af40: begin sweep: elemsize = 144 (0x90)
dap: span fffffc7fee33af40: begin sweep: npages = 1
dap: span fffffc7fee33af40: allocBits:
dap: span fffffc7fee33af40: gcmarkBits:
$

Okay, we appear to be in the middle of the first-ever sweep of this span. Now, we've supposedly allocated 0x2c = 44 items. So we'd expect the first 44 144-byte objects from c000f20000 to be allocated. Confirmed: this matches the Go error message and the DTrace output:

$ awk '/alloc marked/{ print $1 }' cockroach-version.out  | awk -Fx '{print $2}' | while read addr; do echo -n "check: $addr: "; grep $addr dtrace.out ; done
check: c000f20000: dap: alloc size 0x88 = 0xc000f20000
dap: span fffffc7fee33af40: begin sweep: range [ c000f20000, c000f21f80 )
check: c000f20090: dap: alloc size 0x88 = 0xc000f20090
check: c000f20120: dap: alloc size 0x88 = 0xc000f20120
check: c000f201b0: dap: alloc size 0x88 = 0xc000f201b0
check: c000f20240: dap: alloc size 0x88 = 0xc000f20240
check: c000f202d0: dap: alloc size 0x88 = 0xc000f202d0
check: c000f20360: dap: alloc size 0x88 = 0xc000f20360
check: c000f203f0: dap: alloc size 0x88 = 0xc000f203f0
check: c000f20480: dap: alloc size 0x88 = 0xc000f20480
check: c000f20510: dap: alloc size 0x88 = 0xc000f20510
check: c000f205a0: dap: alloc size 0x88 = 0xc000f205a0
check: c000f20630: dap: alloc size 0x88 = 0xc000f20630
check: c000f206c0: dap: alloc size 0x88 = 0xc000f206c0
check: c000f20750: dap: alloc size 0x88 = 0xc000f20750
check: c000f207e0: dap: alloc size 0x88 = 0xc000f207e0
check: c000f20870: dap: alloc size 0x88 = 0xc000f20870
check: c000f20900: dap: alloc size 0x88 = 0xc000f20900
check: c000f20990: dap: alloc size 0x88 = 0xc000f20990
check: c000f20a20: dap: alloc size 0x88 = 0xc000f20a20
check: c000f20ab0: dap: alloc size 0x88 = 0xc000f20ab0
check: c000f20b40: dap: alloc size 0x88 = 0xc000f20b40
check: c000f20bd0: dap: alloc size 0x88 = 0xc000f20bd0
check: c000f20c60: dap: alloc size 0x88 = 0xc000f20c60
check: c000f20cf0: dap: alloc size 0x88 = 0xc000f20cf0
check: c000f20d80: dap: alloc size 0x88 = 0xc000f20d80
check: c000f20e10: dap: alloc size 0x88 = 0xc000f20e10
check: c000f20ea0: dap: alloc size 0x88 = 0xc000f20ea0
check: c000f20f30: dap: alloc size 0x88 = 0xc000f20f30
check: c000f20fc0: dap: alloc size 0x88 = 0xc000f20fc0
check: c000f21050: dap: alloc size 0x88 = 0xc000f21050
check: c000f210e0: dap: alloc size 0x88 = 0xc000f210e0
check: c000f21170: dap: alloc size 0x88 = 0xc000f21170
check: c000f21200: dap: alloc size 0x88 = 0xc000f21200
check: c000f21290: dap: alloc size 0x88 = 0xc000f21290
check: c000f21320: dap: alloc size 0x88 = 0xc000f21320
check: c000f213b0: dap: alloc size 0x88 = 0xc000f213b0
check: c000f21440: dap: alloc size 0x88 = 0xc000f21440
check: c000f214d0: dap: alloc size 0x88 = 0xc000f214d0
check: c000f21560: dap: alloc size 0x88 = 0xc000f21560
check: c000f215f0: dap: alloc size 0x88 = 0xc000f215f0
check: c000f21680: dap: alloc size 0x88 = 0xc000f21680
check: c000f21710: dap: alloc size 0x88 = 0xc000f21710
check: c000f217a0: dap: alloc size 0x88 = 0xc000f217a0
check: c000f21830: dap: alloc size 0x88 = 0xc000f21830

There are 5 in the output that are free and unmarked. And there are 7 that are free but marked. Those 7 have never been allocated:

$ for addr in c000f218c0 c000f21950 c000f21b00 c000f21c20 c000f21d40 c000f21e60 c000f21ef0; do echo "CHECK: $addr"; grep $addr dtrace.out ; done
CHECK: c000f218c0
CHECK: c000f21950
CHECK: c000f21b00
CHECK: c000f21c20
CHECK: c000f21d40
CHECK: c000f21e60
CHECK: c000f21ef0

Again, I dumped the entire memory contents from the core file to a text file so I could grep for references:

> ::mappings ! awk '/\[/{ print $1",",$3"::dump -g8 -e" }' > memory-dump-commands.txt
$ mdb core.cockroach.1159 < memory-dump-commands.txt > memory.txt
mdb: failed to read data at 0xfffffc7fef3f0000: no mapping for address
$

Now, are any of those seven pointers referenced in memory?

$ for addr in c000f218c0 c000f21950 c000f21b00 c000f21c20 c000f21d40 c000f21e60 c000f21ef0; do echo "CHECK: $addr"; grep $addr memory.txt ; done
CHECK: c000f218c0
c000f218c0:  0000000000000000 0000000000000000
CHECK: c000f21950
c000f21950:  0000000000000000 0000000000000000
CHECK: c000f21b00
c000f21b00:  0000000000000000 0000000000000000
CHECK: c000f21c20
c000f21c20:  0000000000000000 0000000000000000
CHECK: c000f21d40
c000f21d40:  0000000000000000 0000000000000000
CHECK: c000f21e60
c000f21e60:  0000000000000000 0000000000000000
CHECK: c000f21ef0
c000f21ef0:  0000000000000000 0000000000000000
fffffc7fe9fffb60:  0000000000000037 000000c000f21ef0

Only the last one. (Those other matching lines are just reporting the value at that address.) I poked around at fffffc7fe9fffb68 and found that it looks like the stack of the Goroutine that panicked:

# attempt to dump a stack trace as though fffffc7fe9fffb00 were a frame pointer
> 0xfffffc7fe9fffb00$C
fffffc7fe9fffb30 runtime.throw+0x74()
fffffc7fe9fffbb0 runtime.(*mspan).reportZombies+0x345()
fffffc7fe9fffc98 runtime.(*sweepLocked).sweep+0x35a()
fffffc7fe9fffcc8 runtime.(*mcentral).uncacheSpan+0xcf()
fffffc7fe9fffd10 runtime.(*mcache).releaseAll+0x134()
fffffc7fe9fffd38 runtime.(*mcache).prepareForSweep+0x46()
fffffc7fe9fffd50 runtime.acquirep+0x3d()
fffffc7fe9fffd78 runtime.stopm+0xab()
fffffc7fe9fffda0 runtime.gcstopm+0xcc()
fffffc7fe9fffe98 runtime.findrunnable+0x59()
fffffc7fe9fffef8 runtime.schedule+0x297()
fffffc7fe9ffff28 runtime.park_m+0x18e()
000000c0001acdd8 runtime.mcall+0x63()
000000c0001ace68 runtime.chanrecv+0x5f7()
000000c0001ace98 runtime.chanrecv1+0x2b()
000000c0001acfd0 github.com/cockroachdb/cockroach/pkg/util/goschedstats.init.0.func1+0x1de()
0000000000000000 runtime.goexit+1()

So I think that explains that reference, and the conclusion is that these seven addresses aren't referenced anywhere else. So why did they get marked? (Note that it's possible there were references when they were marked and they've since gone away. I'm not sure how much time elapsed between marking and reaching this point. But recall that we also have no record of these objects ever having been allocated. So if references to them did exist somewhere, something would have had to make up those pointer values out of whole cloth, which seems unlikely.)

Well, the %ymm0 bug could explain this, right? The gcmarkBits are allocated the same way that the allocBits are. So if they were not zero when GC began setting bits to mark found objects, then we would erroneously report that some objects were marked, resulting in exactly this failure. It's a bit more work to find a smoking gun here because this bitmap would have been modified from its wrongly-initialized value, so we can't just look for the same bit pattern. But we know what mapping it's from:

> 0xfffffc7fe60900a0$m
            BASE            LIMIT             SIZE NAME
fffffc7fe6090000 fffffc7fe60a0000            10000 [ unknown ]

and we can look for a repeating bit pattern in it:

$ awk '/^fffffc7fe6090000:/,/^fffffc7fe609fff0:/' memory.txt
fffffc7fe6090000:  0000000000000a70 0000000000000000
fffffc7fe6090010:  000000001fffffff 0000000000000000
fffffc7fe6090020:  3fd53ec2a68468a7 0000000000000000
fffffc7fe6090030:  00000000000fffff 0000000000000000
fffffc7fe6090040:  3fffffffffffffff 0000000000000000
fffffc7fe6090050:  000000000000003f 0000000000000000
fffffc7fe6090060:  3fffffffffffffff 0000000000000000
fffffc7fe6090070:  000000000000000f 0000000000000000
fffffc7fe6090080:  3fffffffffffffff 0000000000000000
fffffc7fe6090090:  0000000000000001 0000000000000000
...
fffffc7fe609ff60:  3fd53ec2a68468a6 0000000000000000
fffffc7fe609ff70:  0000000000000000 0000000000000000
fffffc7fe609ff80:  3fd53ec2a68468a6 0000000000000000
fffffc7fe609ff90:  0000000000000000 0000000000000000
fffffc7fe609ffa0:  3fd53ec2a68468a6 0000000000000000
fffffc7fe609ffb0:  0000000000000000 0000000000000000
fffffc7fe609ffc0:  3fd53ec2a68468a6 0000000000000000
fffffc7fe609ffd0:  0000000000000000 0000000000000000
fffffc7fe609ffe0:  3fd53ec2a68468a6 0000000000000000
fffffc7fe609fff0:  0000000000000000 0000000000000000

Bingo. Our gcmarkBits (at fffffc7fe60900a0) sure look like they're in the affected range:

fffffc7fe6090060:  3fffffffffffffff 0000000000000000
fffffc7fe6090070:  000000000000000f 0000000000000000
fffffc7fe6090080:  3fffffffffffffff 0000000000000000
fffffc7fe6090090:  0000000000000001 0000000000000000
fffffc7fe60900a0:  3fd53fffffffffff 0000000000000000
fffffc7fe60900b0:  0000000000000001 0000000000000000
fffffc7fe60900c0:  3fd53ec2a68468a6 0000000000000000
fffffc7fe60900d0:  0000000000000000 0000000000000000
fffffc7fe60900e0:  3fd53ec2a68468a6 0000000000000000
fffffc7fe60900f0:  0000000000000000 0000000000000000

It's a little surprising this one has bits unset that previous blocks have set. I'm not exactly sure why it looks like a hybrid of the blocks before and after it...but it sure doesn't look like this was zero-initialized the way GC would have expected. So I think this can explain this failure mode.

fatal error: sweep increased allocation count

This one is also reported under #1223. This failure mode is similar to the previous one in that it happens while sweeping a span, when we've identified an inconsistency between the allocated and marked state. The difference is that in this case, we didn't find any specific objects marked and not allocated (if we had, we would have died on the previous failure mode). But when we recalculated the number of allocated objects, it went up -- which obviously shouldn't happen during GC. This is explainable by the same bug: if we run into it during span creation as in the first case above, allocBits and gcmarkBits would both be non-zero and have the same bits set (since they're allocated from the same place at that point). If we immediately sweep the span (say), we'd think there were as many allocated objects as there are bits set in the bitmaps, but no individual object would be marked-but-not-allocated. So we'd fail just like this.

You don't have to sweep the span right away to hit this. And allocBits and gcmarkBits don't both have to be the same. They just need one overlapping bit that doesn't represent an actual allocation (because of this bug).
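
To make these two failure modes concrete, here's a minimal sketch of the sweep-time consistency checks as I've described them (my own model with hypothetical data, not the runtime's actual code). A marked-but-unallocated object at or above freeindex is a zombie ("found pointer to free object"); failing that, if the recomputed allocation count (a popcount of gcmarkBits) exceeds the old allocCount, sweep dies with "sweep increased allocation count":

#include <stdint.h>
#include <stdio.h>

static int
bit(const uint8_t *bits, int i)
{
	return ((bits[i / 8] >> (i % 8)) & 1);
}

static int
sweep_check(const uint8_t *allocBits, const uint8_t *gcmarkBits,
    int freeindex, int nelems, int allocCount)
{
	int zombies = 0, marked = 0;

	for (int i = 0; i < nelems; i++) {
		if (bit(gcmarkBits, i))
			marked++;
		/*
		 * A marked object whose alloc bit is clear, at or above
		 * freeindex, should be impossible: a "zombie".
		 */
		if (i >= freeindex && bit(gcmarkBits, i) &&
		    !bit(allocBits, i))
			zombies++;
	}
	if (zombies > 0) {
		printf("fatal error: found pointer to free object "
		    "(%d zombies)\n", zombies);
	} else if (marked > allocCount) {
		printf("fatal error: sweep increased allocation count "
		    "(%d > %d)\n", marked, allocCount);
	}
	return (zombies > 0 || marked > allocCount);
}

int
main(void)
{
	/*
	 * A hypothetical 16-slot span: slots 0-3 were really allocated
	 * (allocCount = 4, freeindex = 4), but gcmarkBits started out
	 * non-zero because of the bug, so slots 13 and 15 look marked.
	 */
	uint8_t allocBits[2] = { 0x0f, 0x00 };
	uint8_t gcmarkBits[2] = { 0x0f, 0xa0 };

	return (sweep_check(allocBits, gcmarkBits, 4, 16, 4));
}

As written, this reports the zombie case. If allocBits carries the same poison (both bitmaps { 0x0f, 0xa0 }, everything else unchanged), there are no zombies, but the recomputed count is 6 against an allocCount of 4, so you get "sweep increased allocation count" instead -- exactly the relationship between the two failure modes described above.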

Other failure modes

There are a bunch of other failure modes I suspect are candidates for this issue. In some cases, it may be worth spending a few minutes to convince ourselves that it's a possible cause. But based on what we've found, any Go memory that was supposed to be initialized with memclrNoHeapPointers could wind up being non-zero. This is essentially arbitrary, random data corruption and it could explain almost any bug. There are a number of very strange cases we've seen that I suspect are probably worth closing as "potentially caused by this issue, not worth proving, but if we see it again, we should reopen and investigate".

Here are some candidates (some of these may well be unrelated -- I haven't gone and re-evaluated all of them yet). I plan to go through these, evaluate each one (not necessarily in depth), and update them individually with something on a spectrum from "very likely caused by this" to "be aware of this bug, which may or may not be related".

@davepacheco
Copy link
Collaborator Author

davepacheco commented Dec 16, 2022

(part 5/5)

Explaining other data points

While debugging this problem, I took a lot of wrong turns. But this resulted in some data that's worth reviewing in light of this explanation.

We had trouble reproducing this problem on Linux. Well, that checks out. (We now think this is an illumos bug.)

The problem reproduces on Go 1.19. Go 1.19 appears to still use %ymm0 in memclrNoHeapPointers.

We don't seem to see this problem in CI. Helios CI runs on AWS hosts, which tend to run on older instance types, which may well not support avx2. (thanks @jclulow)

We had trouble reproducing this problem on Intel systems. I don't have much data about this except what's reported in golang/go#53289, which does not say much about the systems. It's quite possible they don't support avx2.

Thanks!

It feels cheesy to say, but debugging this one has been quite a saga and I had a lot of help over the last several months from talking this over with so many people. Thanks everyone who helped!

@Zariel
Copy link

Zariel commented Dec 16, 2022

The signal memory corruption reminds me of golang/go#35777.

Great write up and very informative!

@davepacheco
Copy link
Collaborator Author

To help look for other instances of this, I wrote a Rust program to scan through a memory dump (created as described above) looking for runs consistent with illumos issue 15254: https://github.com/oxidecomputer/illumos-15254-look-for-runs.

On this dump, it emits:

$ cargo run < ~/issues/go-memory/repro-2-omicron-test/memory.txt  2>/dev/null 
found run of 21 at addr c000956280:
found run of 34 at addr c000956c80:
found run of 1920 at addr fffffc7feeb11000:

That 1920-count run is exactly what we'd hope it would find, and there are few false positives.
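
For the curious, the core of the detection logic is simple. Here's a rough sketch of it in C (the real tool above is written in Rust; this sketch assumes the "addr: w0 w1" row format from the mdb dump commands earlier, uses an arbitrary run-length threshold of 8, and glosses over alignment and phase details): look for long runs of rows alternating between a fixed pair of non-zero words and a pair of zero words:

#include <inttypes.h>
#include <stdio.h>

static void
flush(int count, uint64_t start)
{
	if (count >= 8)
		printf("found run of %d at addr %" PRIx64 "\n", count, start);
}

int
main(void)
{
	char line[256];
	uint64_t pat0 = 0, pat1 = 0, start = 0;
	int count = 0, want_zero = 0;

	while (fgets(line, sizeof (line), stdin) != NULL) {
		uint64_t addr, w0, w1;

		if (sscanf(line, "%" SCNx64 ": %" SCNx64 " %" SCNx64,
		    &addr, &w0, &w1) != 3) {
			flush(count, start);
			count = 0;
			continue;
		}
again:
		if (count == 0) {
			/* a candidate first half of a block: both words set */
			if (w0 != 0 && w1 != 0) {
				pat0 = w0;
				pat1 = w1;
				start = addr;
				count = 1;
				want_zero = 1;
			}
		} else if (want_zero) {
			/* the second half of each block must be all zero */
			if (w0 == 0 && w1 == 0) {
				want_zero = 0;
			} else {
				flush(count, start);
				count = 0;
				goto again;
			}
		} else if (w0 == pat0 && w1 == pat1) {
			count++;
			want_zero = 1;
		} else {
			flush(count, start);
			count = 0;
			goto again;
		}
	}
	flush(count, start);
	return (0);
}

Fed memory.txt on stdin, something along these lines prints output like that shown above.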

@davepacheco
Copy link
Collaborator Author

Robert discovered that in order to trigger the problem where %ymm0 is corrupted, there needs to be another context switch (e.g., going off-CPU in the signal handler). This may explain why it wasn't so readily reproducible by just raising a signal during the vulnerable instruction window. It may also explain why anecdotally it seemed like cockroach version reproduced this more often when there was a lot of other load on the system.

@davepacheco
Copy link
Collaborator Author

davepacheco commented Jan 6, 2023

Since my previous comments included quite a lot of detail, here's a more concise, organized summary. For more details, see above.

Symptoms

The most common failure modes for this issue are fatal errors from the Go runtime reporting:

  • fatal error: s.allocCount != s.nelems && freeIndex == s.nelems
  • fatal error: found pointer to free object
  • fatal error: sweep increased allocation count

The underlying problem can in principle cause all kinds of memory corruption, particularly in Go-allocated / Go-managed memory. As with other kinds of memory corruption, the first report of trouble may be far away from where the problem happened and a long time later. So all sorts of failure modes are possible, including SIGSEGV, other fatal runtime errors, and non-fatal errors.

Diagnosing a failure as this issue

Although this bug could be responsible for almost any unexpected program behavior, it's not clear how common the issue is, so it's not necessarily a safe bet that any baffling behavior could be attributed to this issue. If you're running a Go program, you probably want to err on the side of applying the workaround below and seeing if the problem goes away.

If you have a core dump from a specific failure and want to determine if it's this issue, you can look for large regions of memory that you'd expect to be zeros but that instead contain a repeating 32-byte pattern of which the low 16 bytes are zeroes but the high 16 bytes are not all zeros. I wrote a very simple tool that accepts a memory dump in the form output above and looks for this pattern. If you find this, particularly if a pointer into such a region is what led to the fatal failure (e.g., a SIGSEGV, or the allocBits/gcmarkBits of whatever mspan triggered the above fatal errors), that's a good sign that this problem is to blame.

There are other uses of %ymm registers in the runtime, too, and it's possible there are other failures that could be caused by this without that signature. I don't have any shortcuts for diagnosing them.

Root cause

The underlying problem relates to two issues. The first is illumos bug 15254, which causes corruption of the %ymm (AVX) CPU registers in some pretty specific conditions. You have to have:

  • regular code that uses %ymm registers
  • that code is interrupted by a signal handler
  • the signal handler also uses the same %ymm registers
  • the signal handler is interrupted by a context switch causing it to go off-CPU (i.e., not a syscall, but a case where the CPU switches to running a different userland thread)

If you have all of these things, then the regular code can find the high 128 bits of a %ymm register clobbered by bits stored in that register by the signal handler. This manifests in Go because memclr() stores 0 into the 256-bit register, then movs that over a large region of memory, expecting that to zero it. Instead, if you hit this bug, then some of the high 128 bits are set, and you wind up with the 32-byte pattern described above.

These registers are also used in the analog of memcpy (runtime.memmove) and other functions in the Go runtime. So it's possible this could cause other kinds of corruption.

A second issue (that likely affects fewer functions in the Go runtime, but does affect memclrNoHeapPointers) is illumos#15367. This is an even more obscure set of conditions that's been observed to result in similar corruption in memclrNoHeapPointers.

Workaround

An easy workaround that appears pretty safe is to set GODEBUG=asyncpreemptoff=1 when running Go programs. This disables Go's (heavy) use of signals to preempt long-running Goroutines. The consequence of disabling this is that Goroutines can run on CPU for longer than they otherwise might, which can create latency problems by starving other Goroutines.

Note that this does not guarantee that these corruption problems can't happen. Both issues can still happen if the Go program takes a signal at the wrong time. But the flag makes it much less likely by stopping Go from sending itself multiple signals per second.

Fix

The proper fix will involve an illumos update (which includes the kernel and userland libraries). This work is in progress as of this writing.

@smklein
Copy link
Collaborator

smklein commented Jan 6, 2023

Should we force GODEBUG=asyncpreemptoff=1 by committing it to the .envrc file in Omicron, and setting it in the cockroachdb manifest.xml file?

Otherwise, it seems very likely we'll keep hitting this issue, especially since the underlying illumos bug has not been resolved yet.

@davepacheco
Copy link
Collaborator Author

Should we force GODEBUG=asyncpreemptoff=1 by committing it to the .envrc file in Omicron, and setting it in the cockroachdb manifest.xml file?
Otherwise, it seems very likely we'll keep hitting this issue, especially since the underlying illumos bug has not been resolved yet.

This is a good question. When we believe a fix is forthcoming and the interim cost isn't too painful, I usually don't commit a workaround because it's one more thing we have to remember to remove before we ship. If we forget, we might wind up with different problems (like Goroutine starvation) or leave a time bomb for ourselves (if for some reason the fix is not correct but we don't notice because of the workaround). In this case, I think the fix is expected on the order of weeks from now. So for me the problem isn't painful enough to warrant committing the workaround. I don't feel strongly about it.

If we do decide to commit the workaround, I think we should put that into the SMF manifest, yes, and into test-utils/dev/db rather than .envrc. As far as I know, .envrc is only used when individual developers set up direnv for their interactive shells. I think that's just for convenience -- we can't really assume it's being used.
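For what it's worth, if we did commit the workaround to the SMF manifest, the stanza would look something like the following, inside the service's exec_method (or at the instance level). This is a sketch based on the standard SMF manifest schema, not an excerpt from the actual cockroachdb manifest.xml.

<method_context>
  <method_environment>
    <!-- assumed placement; the env var itself is the workaround above -->
    <envvar name='GODEBUG' value='asyncpreemptoff=1' />
  </method_environment>
</method_context>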

@davepacheco
Copy link
Collaborator Author

More to the problem

With a fix in hand for illumos#15254, @rmustacc and I set out to confirm that it would address this issue. To our (deep) dismay, it did not! Though we didn't do any sort of statistical comparison, the problem appeared as readily reproducible as before. The bug we'd found was real, and very near the scene of the crime, but there was something we were still missing.

After several more rounds of tracing and reproduction, Robert identified a separate bug: illumos#15367. While very much in the same area, that bug predates the work on 15254. One big plus: the behavior of the system after hitting #15367 differs in a meaningful way between the AMD and Intel CPUs that Robert tested, which explains the data point that we were only able to reproduce this on AMD systems. (The reason has to do with how AMD and Intel CPUs decide whether to save the XMM region of the FPU during xsave when the FPU is in the processor-defined default state.)

To summarize #15367 as I understand it, if any thread:

  • stores non-zero values into XMM registers
  • does anything that requires a context switch with those registers having those non-zero values (which would include taking a signal, making a syscall, getting switched off-CPU, or probably any other context switch to the kernel)
  • subsequently puts the FPU into the processor-defined default state
  • takes a signal

then after the signal handler returns, the thread will find that the %xmm registers contain the last-saved non-zero values. This complex set of conditions explains why the problem wasn't readily reproducible when I tried simply raising a signal at a moment I thought would result in corruption.

We have not yet confirmed that the fix for #15367 makes this problem go away. That's one of the next steps.

@davepacheco
Copy link
Collaborator Author

We remain hopeful that #15367 was a big part of the underlying cause here:

  • With fixed bits and a nearly-stock Go runtime and cockroach, I ran the relevant part of the omicron test suite in a loop for 25 hours (761 iterations times 93 tests, many (but not all) of which run cockroach) before hitting an unrelated problem ("cockroach panicked after receiving SIGHUP too early", #2334). With my tracing enabled, the longest this had previously run without reproducing the issue was 4 hours.
  • Robert similarly ran 3200 iterations; the most he'd previously gone without the fix in place was 11. His run, too, was terminated by an unrelated issue.

It'd be nice to explain why my previous workaround (sending memclrNoHeapPointers down the XMM path instead of the YMM one) worked. Note that the %ymm path uses VZEROUPPER, which zeroes the %ymm and %zmm ranges of all the AVX registers. If %xmm1 through %xmm15 were zero before we entered the function, and we know the function sets %ymm0 = 0 (and so %xmm0 = 0), then the VZEROUPPER clears all the %ymm and %zmm registers, which could well put this unit of the FPU into its default state. (That's good: it's consistent with #15367.) The XMM path, on the other hand, only zeroes %xmm15 and never invokes VZEROUPPER, so it wouldn't reset this unit to its default state if any of the other %xmm or %ymm registers were non-zero. All this is to say: it's definitely possible for this bug to explain the behavior we saw and for the workaround to avoid reproducing it. It's hard to say how likely that is, and it's not clear it's worth digging into further. We'll continue testing with the bugs fixed and see if the issue turns up again.

@davepacheco
Copy link
Collaborator Author

Closing this because this appears resolved by the above fixes. We can reopen if we start seeing it again.

@bytesbymike
Copy link

Great write-up! I just listened to the podcast episode that references this issue.

If I've understood correctly, shouldn't allocCount = 3 in step 6 of this comment? I.e. allocCount is incremented after an allocation, even though allocBits is not.
#1146 (comment)

@davepacheco
Copy link
Collaborator Author

Thanks! Yes I think that's right. I've updated the comment. Thanks for the correction.
