syscall: memory corruption when forking on OpenBSD, NetBSD, AIX, and Solaris #34988

Open · jrick opened this issue Oct 18, 2019 · 56 comments

Labels: compiler/runtime (Issues related to the Go compiler and/or runtime.) · NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.) · OS-AIX · OS-NetBSD · OS-OpenBSD · OS-Solaris

Comments

@jrick (Contributor) commented Oct 18, 2019

#!watchflakes
default <- `fatal error: (?:.*\n\s*)*syscall\.forkExec` && (goos == "aix" || goos == "netbsd" || goos == "openbsd" || goos == "solaris")

What version of Go are you using (go version)?

$ go version
go version go1.13.2 openbsd/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/jrick/.cache/go-build"
GOENV="/home/jrick/.config/go/env"
GOEXE=""
GOFLAGS="-tags=netgo -ldflags=-extldflags=-static"
GOHOSTARCH="amd64"
GOHOSTOS="openbsd"
GONOPROXY=""
GONOSUMDB=""
GOOS="openbsd"
GOPATH="/home/jrick/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/jrick/src/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/jrick/src/go/pkg/tool/openbsd_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0"

What did you do?

I observed these issues in one of my applications and assumed it was a race, invalid unsafe.Pointer usage, or some other fault of the application code. When the 1.13.2 release dropped yesterday I built it from source and observed a similar issue while running the regression tests. The failing regression test does not look related to the memory corruption, but I can reproduce the problem by repeatedly running the test in a loop:

$ cd test # from go repo root
$ while :; do go run run.go -- fixedbugs/issue27829.go || break; done >go.panic 2>&1

It can take several minutes to observe the issue, but here are some of the captured panics and fatal runtime errors:

https://gist.githubusercontent.com/jrick/f8b21ecbfbe516e1282b757d1bfe4165/raw/6cf0efb9ba47ba869f98817ce945971f2dff47d6/gistfile1.txt

https://gist.githubusercontent.com/jrick/9a54c085b918aa32910f4ece84e5aa21/raw/91ec29275c2eb1be49f62ad8a01a5317ad168c94/gistfile1.txt

https://gist.githubusercontent.com/jrick/8faf088593331c104cc0da0adb3f24da/raw/7c92e7e7d60d426b2156fd1bdff42e0717b708f1/gistfile1.txt

https://gist.githubusercontent.com/jrick/4645316444c12cd815fb71874f6bdfc4/raw/bffac2a448b07242a538b77a2823c9db34b6ef6f/gistfile1.txt

https://gist.githubusercontent.com/jrick/3843b180670811069319e4122d32507a/raw/0d1f897aa25d91307b04ae951f1b260f33246b61/gistfile1.txt

https://gist.githubusercontent.com/jrick/99b7171c5a49b4b069edf06884ad8e17/raw/740c7b9e8fa64d9ad149fd2669df94e89c466927/gistfile1.txt

Additionally, I observed go run hanging (no runtime failure due to deadlock) and it had to be killed with SIGABRT to get a trace: https://gist.githubusercontent.com/jrick/d4ae1e4355a7ac42f1910b7bb10a1297/raw/54e408c51a01444abda76dc32ac55c2dd217822b/gistfile1.txt

It may not matter which regression test is run, as the errors also occur in run.go itself.

@jrick (Contributor, Author) commented Oct 18, 2019

I missed that 1.13.3 was also released yesterday. Currently updating to that and will report whether this is still an issue.

@randall77 (Contributor) commented:

This looks like cmd/go crashing while building the test, not the test itself.
The errors look heap related. @mknyszek

@mknyszek (Contributor) commented:

@jrick maybe you meant this in your original post, but I just want to be clear. Does this reproduce with Go 1.12.X or older versions of Go?

Since we have a reasonable reproducer, the next step to me would be to just bisect what went into Go 1.13, if we know it isn't reproducing in Go 1.12. I genuinely have no idea what this could be. I thought at first that it could be scavenging related but that's highly unlikely for a number of reasons. I won't rule it out yet, though.

@jrick (Contributor, Author) commented Oct 18, 2019

I haven't tested 1.12.x but will follow up testing that next. Currently hammering this test with 1.13.3 and so far it has not failed, but my application built with 1.13.3 still fails with SIGBUS (could be unrelated).

@jrick (Contributor, Author) commented Oct 18, 2019

@mknyszek it still hasn't failed on 1.13.3 (running close to an hour now) but quickly failed on 1.12.12.

https://gist.githubusercontent.com/jrick/bb5a493e6ebd88e1e846f1c5c09c9e9a/raw/e82b0136b0826581f6e591915d3a634112f323a1/gistfile1.txt

@dmitshur added the NeedsInvestigation and OS-OpenBSD labels Oct 21, 2019
@jrick (Contributor, Author) commented Dec 6, 2019

This remains a problem in 1.13.5, so it's not addressed by the recent fixes to the go tool.

https://gist.githubusercontent.com/jrick/a2499b2ae10b4c63359174e26c0fd936/raw/b233f14a518ca828c4416d803f81b1e8ca34d073/gistfile1.txt

@jrick (Contributor, Author) commented May 20, 2020

This may be fork/exec related. This program exhibits similar crashes on OpenBSD 6.7 and Go 1.14.3.

package main

import (
        "os/exec"
)

func main() {
        sem := make(chan struct{}, 100)
        for {
                sem <- struct{}{}
                go func() {
                        err := exec.Command("/usr/bin/true").Run()
                        if err != nil {
                                panic(err)
                        }
                        <-sem
                }()
        }
}

crash trace: https://gist.github.com/jrick/8d6ef72796a772668b891310a18dd805

Synchronizing the os/exec call with an additional mutex appears to remove the crash.
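For concreteness, a minimal sketch of that workaround, assuming a single package-level sync.Mutex around the exec call (my reconstruction, not the code from the application):

package main

import (
	"os/exec"
	"sync"
)

// forkMu serializes all fork/exec calls. With this serialization in
// place, the crashes were no longer observed.
var forkMu sync.Mutex

func run() error {
	forkMu.Lock()
	defer forkMu.Unlock()
	return exec.Command("/usr/bin/true").Run()
}

func main() {
	sem := make(chan struct{}, 100)
	for {
		sem <- struct{}{}
		go func() {
			if err := run(); err != nil {
				panic(err)
			}
			<-sem
		}()
	}
}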

@ianlancetaylor changed the title from "Memory corruption on OpenBSD/amd64" to "syscall: memory corruption on OpenBSD/amd64 when forking" May 20, 2020
@ianlancetaylor (Contributor) commented:

Thanks for the stack trace. That looks very much like a forked child process is changing the memory seen by the parent process, which should of course be impossible. Specifically, it seems that sched.lock.key is being set to zero while the lock is held during goschedImpl.
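For reference, a runtime lock here is just a single word (paraphrased from runtime/runtime2.go of that era; the field comment is mine):

type mutex struct {
	// Zero means unlocked; nonzero holds the locked state or a list of
	// waiting Ms, depending on whether the platform uses futex-based or
	// semaphore-based locks.
	key uintptr
}

So "sched.lock.key set to zero while the lock is held" means the scheduler's lock word was wiped out from under the thread that owned it.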

@jrick (Contributor, Author) commented May 22, 2020

I'm seeing another strange thing in addition to that crash. Sometimes the program runs forever, spinning CPU, but appears to be deadlocked, because none of the PIDs of those true processes ever change. Here's the trace after sending SIGQUIT: https://gist.github.com/jrick/74aaa63624961145b7bc7b9518da75e1

@jrick (Contributor, Author) commented Sep 14, 2020

@bcmills (Contributor) commented May 14, 2021

https://build.golang.org/log/3f45171bc52a0a86435abb9f795c0e8a45c4a0b0 looks similar:

haserrors/haserrors.go:3:18: undeclared name: undefined
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x4d578d58 pc=0x804a257]

runtime stack:
runtime/internal/atomic.Xadd64(0x83de928, 0x20360)
	/tmp/workdir/go/src/runtime/internal/atomic/atomic_386.s:145 +0x27

…

goroutine 4371 [runnable]:
syscall.syscall(0x80b4b40, 0x12, 0x0, 0x0)
	/tmp/workdir/go/src/runtime/sys_openbsd3.go:22 +0x20
syscall.Close(0x12)
	/tmp/workdir/go/src/syscall/zsyscall_openbsd_386.go:513 +0x39
syscall.forkExec({0xa630a408, 0x16}, {0xa6158a80, 0xe, 0xe}, 0x6bc0cbc0)
	/tmp/workdir/go/src/syscall/exec_unix.go:227 +0x4cc
syscall.StartProcess(...)
	/tmp/workdir/go/src/syscall/exec_unix.go:264
os.startProcess({0xa630a408, 0x16}, {0xa6158a80, 0xe, 0xe}, 0x6bc0cc84)
	/tmp/workdir/go/src/os/exec_posix.go:55 +0x256
os.StartProcess({0xa630a408, 0x16}, {0xa6158a80, 0xe, 0xe}, 0x6bc0cc84)
	/tmp/workdir/go/src/os/exec.go:106 +0x57
os/exec.(*Cmd).Start(0xa3537b80)
	/tmp/workdir/go/src/os/exec/exec.go:422 +0x588
os/exec.(*Cmd).Run(0xa3537b80)
	/tmp/workdir/go/src/os/exec/exec.go:338 +0x1b
golang.org/x/tools/go/internal/cgo.Run(0xa26a8b40, {0xa2764090, 0x17}, {0xa60778e0, 0x20}, 0x0)
	/tmp/workdir/gopath/src/golang.org/x/tools/go/internal/cgo/cgo.go:172 +0xc74
golang.org/x/tools/go/internal/cgo.ProcessFiles(0xa26a8b40, 0x837eecc0, 0x0, 0x0)
	/tmp/workdir/gopath/src/golang.org/x/tools/go/internal/cgo/cgo.go:85 +0x1a1
golang.org/x/tools/go/loader.(*Config).parsePackageFiles(0x9d428420, 0xa26a8b40, 0x67)
	/tmp/workdir/gopath/src/golang.org/x/tools/go/loader/loader.go:758 +0x232
golang.org/x/tools/go/loader.(*importer).load(0x837ed770, 0xa26a8b40)
	/tmp/workdir/gopath/src/golang.org/x/tools/go/loader/loader.go:976 +0x68
golang.org/x/tools/go/loader.(*importer).startLoad.func1(0x837ed770, 0xa26a8b40, 0xa2afe0e0)
	/tmp/workdir/gopath/src/golang.org/x/tools/go/loader/loader.go:962 +0x23
created by golang.org/x/tools/go/loader.(*importer).startLoad
	/tmp/workdir/gopath/src/golang.org/x/tools/go/loader/loader.go:961 +0x174

@bcmills (Contributor) commented Dec 1, 2021

https://storage.googleapis.com/go-build-log/abee19ae/openbsd-amd64-68_0f13ec3d.log (a TryBot) looks like it could plausibly be from a fork syscall.

@jrick (Contributor, Author) commented Dec 1, 2021

I'm not sure when this changed, but since returning to this issue I haven't been able to reproduce with my minimal test case again on the same hardware with OpenBSD 7.0-current and Go 1.17.3.

I suspect it's due to some OpenBSD fix if the 6.8 builders are still hitting this.

(Also, 6.8 is no longer a supported OpenBSD version; I don't think it makes much sense to continue testing with it.)

@jrick (Contributor, Author) commented Dec 1, 2021

Spoke too soon:

package main

import "os/exec"

func main() {
	loop := func() {
		for {
			err := exec.Command("/usr/bin/true").Run()
			if err != nil {
				panic(err)
			}
		}
	}
	for i := 0; i < 100; i++ {
		go loop()
	}
	select {}
}

https://gist.githubusercontent.com/jrick/a071767cde2d2d71b210135cf8282b04/raw/6fcd814e5a93a6a1d204c2d00b0a1f4195664d61/gistfile1.txt

@jrick (Contributor, Author) commented Dec 2, 2021

It took far longer than on 1.17.3, but a very similar crash (in scanstack) still occurs with:

$ gotip version
go version devel go1.18-931d80ec1 Tue Nov 30 18:09:02 2021 +0000 openbsd/amd64

https://gist.githubusercontent.com/jrick/a13403d1a934f2cc5fedf7c2e2d50546/raw/7389fee0a5a35f40122a847206b6dbd7304b0fa0/gistfile1.txt

@prattmic (Member) commented Dec 2, 2021

I can also reproduce crashes on netbsd-386 and netbsd-amd64 with #34988 (comment) on AMD, of the form:

buildlet-netbsd-386-9-0-n2d-rncb33943# ./loop 
fatal error: runtime·unlock: lock count   
<hang>

as well as #49453

@prattmic (Member) commented Dec 3, 2021

Some observations I've made (from netbsd-amd64):

The crashes still seem to occur with GOMAXPROCS=1; however, Go still has some background threads in this case. Disabling sysmon and GC makes this program truly single-threaded:

diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index a238ea77f3..ee18169920 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -170,10 +170,10 @@ func main() {
                // For runtime_syscall_doAllThreadsSyscall, we
                // register sysmon is not ready for the world to be
                // stopped.
-               atomic.Store(&sched.sysmonStarting, 1)
-               systemstack(func() {
-                       newm(sysmon, nil, -1)
-               })
+               //atomic.Store(&sched.sysmonStarting, 1)
+               //systemstack(func() {
+               //      newm(sysmon, nil, -1)
+               //})
        }
 
        // Lock the main goroutine onto this, the main OS thread,
@@ -211,7 +211,7 @@ func main() {
                }
        }()
 
-       gcenable()
+       //gcenable()
 
        main_init_done = make(chan bool)
        if iscgo {

Once the program is truly single-threaded, the crashes disappear. Setting GOMAXPROCS=2 with this patch brings the crashes back.

Here is a slightly simplified reproducer:

package main

import (
        "os/exec"
        "runtime"
)

func main() {
        go func() {
                for {
                        err := exec.Command("/usr/bin/true").Run()
                        if err != nil {
                                panic(err)
                        }
                }
        }()

        for {
                runtime.Gosched()
        }
}

This version has only a single forker, but crashes about as quickly. The Gosched is required: neither an empty loop nor a loop checking a package-global atomic is sufficient to crash. (N.B. the original reproducer above was also occasionally doing the equivalent of Gosched to context-switch between the 100 forker goroutines.)
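For concreteness, the loop variants that did not crash looked roughly like this (my reconstruction; the exact loops are not shown in the thread, and the flag variable is hypothetical):

package main

import (
	"os/exec"
	"sync/atomic"
)

var flag uint32 // hypothetical package-global checked by the busy loop

func main() {
	go func() {
		for {
			if err := exec.Command("/usr/bin/true").Run(); err != nil {
				panic(err)
			}
		}
	}()

	// Neither this atomic-load loop nor a completely empty loop body
	// was sufficient to reproduce the crash; only runtime.Gosched()
	// in this position was.
	for {
		_ = atomic.LoadUint32(&flag)
	}
}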

(cc @aclements @mknyszek)

@prattmic (Member) commented Dec 3, 2021

More observations:

  • Crashes still occur if the child exits almost immediately after fork, here.
  • Crashes do not occur if RawSyscall(FORK) is replaced with simply returning an error.

I've simplified that repro even further:

loop.go:

package main

import (
        //"runtime"
        "syscall"
)

func fork() int32

func main() {
        go func() {
                for {
                        pid := fork()
                        syscall.Syscall6(syscall.SYS_WAIT4, uintptr(pid), 0, 0, 0, 0, 0)
                        //syscall.RawSyscall6(syscall.SYS_WAIT4, uintptr(pid), 0, 0, 0, 0, 0)
                }
        }()

        for {
                syscall.Syscall(syscall.SYS_GETPID, 0, 0, 0)
                //runtime.Gosched()
        }
}

fork_netbsd_amd64.s:

#include "textflag.h"

#define SYS_EXIT        1
#define SYS_FORK        2

// func fork() int32
TEXT ·fork(SB),NOSPLIT,$0-4
        MOVQ    $SYS_FORK, AX
        SYSCALL

        CMPQ    AX, $0
        JNE     parent

        // Child.
        MOVQ    $0, DI
        MOVQ    $SYS_EXIT, AX
        SYSCALL
        HLT

parent:
        MOVL    AX, ret+0(FP)
        RET

The key parts here:

  • We are now making a direct fork system call without any of the extra runtime behavior inside os/exec. Barring some (AFAICT undocumented) requirement on how to use fork(), there is simply no way this assembly function should be able to cause corruption in the parent. So I think this has to be an OS bug.
  • We don't need runtime.Gosched() anymore; however, switching the GETPID back to Gosched does trigger crashes faster.
  • Note that syscall.Syscall goes through runtime.entersyscall / runtime.exitsyscall, so there is some level of runtime interaction still, though I've verified we don't go through the slow path into the full scheduler like Gosched does.
  • Both goroutines must use syscall.Syscall. Switching either side to syscall.RawSyscall (which avoids runtime interaction) makes the crashes go away, as sketched just after this list.
  • My best guess at the interesting pattern here is that both threads are fiddling around with TLS variables (the g).
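As a sketch, the non-crashing RawSyscall variant of the non-forking side would be the following (my reconstruction of the change, assuming the same fork assembly file as above; not a diff from the thread):

package main

import "syscall"

func fork() int32 // provided by fork_netbsd_amd64.s above

func main() {
	go func() {
		for {
			pid := fork()
			syscall.Syscall6(syscall.SYS_WAIT4, uintptr(pid), 0, 0, 0, 0, 0)
		}
	}()

	for {
		// RawSyscall bypasses runtime.entersyscall/exitsyscall, so the
		// runtime never observes this thread entering a syscall; with
		// this change the crashes no longer reproduce.
		syscall.RawSyscall(syscall.SYS_GETPID, 0, 0, 0)
	}
}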

The crashes I get with this look like (source):

entersyscall inconsistent 0xc00003a778 [0xc00003a000,0xc00003a800]                                                                           
fatal error: entersyscall 

This is complaining that the check 0xc00003a000 < 0xc00003a778 < 0xc00003a800 failed, even though the printed values plainly satisfy it.
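The failing check lives in runtime.reentersyscall and looks roughly like this (paraphrased from Go 1.17's runtime/proc.go):

if _g_.syscallsp < _g_.stack.lo || _g_.stack.hi < _g_.syscallsp {
	systemstack(func() {
		print("entersyscall inconsistent ", hex(_g_.syscallsp),
			" [", hex(_g_.stack.lo), ",", hex(_g_.stack.hi), "]\n")
		throw("entersyscall")
	})
}

With syscallsp momentarily reading as 0, as in the GDB session below, the first comparison is what fires.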

The one case I've caught in GDB looks like (stopped just inside the failing branch):

   0x0000000000435c77 <+55>:      callq  0x435be0 <runtime.save>
   0x0000000000435c7c <+60>:      mov    0x60(%rsp),%rcx
   0x0000000000435c81 <+65>:      mov    0x20(%rsp),%rax
   0x0000000000435c86 <+70>:      mov    %rcx,0x70(%rax)
   0x0000000000435c8a <+74>:      mov    0x58(%rsp),%rdx
   0x0000000000435c8f <+79>:      mov    %rdx,0x78(%rax)
   0x0000000000435c93 <+83>:      mov    $0x2,%ebx
   0x0000000000435c98 <+88>:      mov    $0x3,%ecx
   0x0000000000435c9d <+93>:      nopl   (%rax)   
   0x0000000000435ca0 <+96>:      callq  0x4306e0 <runtime.casgstatus>                                                                       
   0x0000000000435ca5 <+101>:     mov    0x20(%rsp),%rcx
   0x0000000000435caa <+106>:     mov    (%rcx),%rdx
   0x0000000000435cad <+109>:     mov    %rdx,0x10(%rsp)
   0x0000000000435cb2 <+114>:     mov    0x8(%rcx),%rsi
   0x0000000000435cb6 <+118>:     mov    %rsi,0x18(%rsp)
   0x0000000000435cbb <+123>:     mov    0x70(%rcx),%rdi
   0x0000000000435cbf <+127>:     nop
   0x0000000000435cc0 <+128>:     cmp    %rdx,%rdi
   0x0000000000435cc3 <+131>:     jb     0x435cca <runtime.reentersyscall+138>
   0x0000000000435cc5 <+133>:     cmp    %rsi,%rdi
   0x0000000000435cc8 <+136>:     jbe    0x435d2a <runtime.reentersyscall+234>
=> 0x0000000000435cca <+138>:     mov    $0x6,%eax   
(gdb) i r
rax            0x2                 2
rbx            0x2                 2
rcx            0xc000002820        824633731104
rdx            0xc00003a000        824633958400
rsi            0xc00003a800        824633960448
rdi            0x0                 0
rbp            0xc00003a748        0xc00003a748
rsp            0xc00003a700        0xc00003a700
r8             0x1                 1
r9             0x0                 0
r10            0x0                 0
r11            0x212               530
r12            0xc000028a00        824633887232
r13            0x0                 0
r14            0xc000002820        824633731104
r15            0x7f7fd135111a      140186947490074
rip            0x435cca            0x435cca <runtime.reentersyscall+138>
eflags         0x287               [ CF PF SF IF ]
cs             0x47                71
ss             0x3f                63
ds             0x23                35
es             0x23                35
fs             0x0                 0
gs             0x0                 0
fs_base        <unavailable>
gs_base        <unavailable>

From the assembly, _g_.stack.lo and _g_.stack.hi should be rdx and rsi, which look OK. _g_.syscallsp should be rdi, which is 0. This value was recently loaded from rcx + 0x70, which looks fine:

(gdb) x/xg $rcx + 0x70
0xc000002890:     0x000000c00003a778

Of course, I can't really tell if that memory location read as zero, or if the register was cleared after the load somehow.

@prattmic changed the title from "syscall: memory corruption on OpenBSD/amd64 when forking" to "syscall: memory corruption on OpenBSD/amd64 and NetBSD/amd64,386 when forking" Dec 3, 2021
@gopherbot (Contributor) commented:

Change https://go.dev/cl/439196 mentions this issue: os/exec: parallelize more tests

gopherbot pushed a commit that referenced this issue Oct 6, 2022
This cuts the wall duration for 'go test os/exec' and
'go test -race os/exec' roughly in half on my machine,
which is an even more significant speedup with a high '-count'.

For better or for worse, it may also increase the repro rate
of #34988.

Tests that use Setenv or Chdir or check for FDs opened during the test
still cannot be parallelized, but they are only a few of those.

Change-Id: I8d284d8bff05787853f825ef144aeb7a4126847f
Reviewed-on: https://go-review.googlesource.com/c/go/+/439196
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
Run-TryBot: Bryan Mills <[email protected]>
Auto-Submit: Bryan Mills <[email protected]>
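For context, the distinction the commit message draws looks like this in test code (an illustrative sketch, not code from the CL itself):

package exec_test

import "testing"

func TestParallelizable(t *testing.T) {
	t.Parallel() // fine: touches no process-global state
	// ... run a subprocess and check its output ...
}

func TestNeedsEnv(t *testing.T) {
	// Must stay serial: t.Setenv mutates process-global state and
	// panics if the test has already called t.Parallel.
	t.Setenv("TMPDIR", t.TempDir())
	// ...
}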
@bcmills (Contributor) commented Nov 3, 2022

@davepacheco commented:

While debugging oxidecomputer/omicron#1146 I saw that this bug mentions Solaris and wondered if it might affect illumos as well, since the failure modes look the same for my issue. For the record, I don't think my issue was caused by this one. I ran the Go and C test programs for several days without issue, and I ultimately root-caused my issue to illumos#15254. I mention this in case anyone in the future is wondering if illumos is affected by this. I don't know whether Solaris (or any other system) has the same issue with preserving the %ymm registers across signal handlers, but that can clearly cause the same failure modes shown here.

@gopherbot (Contributor) commented:

Found new dashboard test flakes for:

#!watchflakes
default <- `fatal error: (?:.*\n\s*)*syscall\.forkExec`
2023-01-06 17:30 netbsd-amd64-9_3 tools@36bd3dbc go@476384ec x/tools/gopls/internal/regtest/workspace.TestReloadOnlyOnce (log)
wirep: p->m=824637390848(7) p->status=1
fatal error: wirep: invalid p state
wirep: p->m=0(0) p->status=2
fatal error: wirep: invalid p state

runtime stack:
runtime.throw({0xf1b04b?, 0x0?})
	/tmp/workdir/go/src/runtime/panic.go:992 +0x71
runtime.wirep(0xc0004524e0?)
	/tmp/workdir/go/src/runtime/proc.go:4903 +0x105
...

testing.(*T).Run(0xc00a402680, {0xeefab7?, 0xc00a351f40?}, 0xc00a3f29b0)
	/tmp/workdir/go/src/testing/testing.go:1487 +0x37a
golang.org/x/tools/gopls/internal/lsp/regtest.(*Runner).Run(0xc00044ba40, 0xc00a402680, {0xf6513a, 0x341}, 0x1062c48, {0xc000077f40, 0x2, 0x1553e18?})
	/tmp/workdir/gopath/src/golang.org/x/tools/gopls/internal/lsp/regtest/runner.go:171 +0x405
golang.org/x/tools/gopls/internal/lsp/regtest.configuredRunner.Run(...)
	/tmp/workdir/gopath/src/golang.org/x/tools/gopls/internal/lsp/regtest/regtest.go:67
golang.org/x/tools/gopls/internal/regtest/workspace.TestReloadOnlyOnce(0xc00626f8f0?)
	/tmp/workdir/gopath/src/golang.org/x/tools/gopls/internal/regtest/workspace/workspace_test.go:174 +0x115
testing.tRunner(0xc00a402680, 0x1062c50)

watchflakes

netbsd-srcmastr pushed a commit to NetBSD/src that referenced this issue Aug 14, 2023
When a thread takes a page fault which results in COW resolution,
other threads in the same process can be concurrently accessing that
same mapping on other CPUs.  When the faulting thread updates the pmap
entry at the end of COW processing, the resulting TLB invalidations to
other CPUs are not done atomically, so another thread can write to the
new writable page and then a third thread might still read from the
old read-only page, resulting in inconsistent views of the page by the
latter two threads.  Fix this by removing the pmap entry entirely for
the original page before we install the new pmap entry for the new
page, so that the new page can only be modified after the old page is
no longer accessible.

This fixes PR 56535 as well as the netbsd versions of problems
described in various bug trackers:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225584
https://reviews.freebsd.org/D14347
golang/go#34988
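To illustrate the race the commit describes, here is a heavily hedged sketch of the kind of userspace probe one might write (hypothetical; not a program from this thread): one goroutine forks continuously so the parent's pages keep going copy-on-write, the main goroutine dirties a buffer (taking the COW faults), and a checker goroutine, typically on another CPU, verifies the writes, which is where a stale TLB entry still pointing at the old read-only page would show up as a lost write:

package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Forker: each fork marks the parent's pages copy-on-write.
	go func() {
		for {
			if err := exec.Command("/usr/bin/true").Run(); err != nil {
				panic(err)
			}
		}
	}()

	buf := make([]byte, 1<<16)
	next := make(chan byte)
	done := make(chan struct{})

	// Checker: reads the buffer from another thread. Under the UVM bug,
	// a stale TLB entry can expose the pre-COW page here.
	go func() {
		for gen := range next {
			for i, b := range buf {
				if b != gen {
					panic(fmt.Sprintf("lost write: buf[%d] = %d, want %d", i, b, gen))
				}
			}
			done <- struct{}{}
		}
	}()

	// Writer: dirties the buffer, taking the COW faults itself. The
	// channel handshake keeps writer and checker from racing in the
	// Go-memory-model sense; only the kernel bug can break it.
	for gen := byte(1); ; gen++ {
		for i := range buf {
			buf[i] = gen
		}
		next <- gen
		<-done
	}
}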
The same fix (sys/uvm/uvm_fault.c revision 1.234, "uvm: prevent TLB invalidation races during COW resolution") was referenced again, with an identical commit message, by pull-ups that netbsd-srcmastr pushed to NetBSD/src on Aug 15 and Aug 20, 2023, and that rokuyama pushed to IIJ-NetBSD/netbsd-src on Oct 26 and Oct 27, 2023.
@gopherbot (Contributor) commented:

Found new dashboard test flakes for:

#!watchflakes
default <- `fatal error: (?:.*\n\s*)*syscall\.forkExec` && (goos == "aix" || goos == "netbsd" || goos == "openbsd" || goos == "solaris")
2024-03-20 14:17 netbsd-amd64-9_3 go@e39af550 cmd/go.TestScript (log)
vcs-test.golang.org rerouted to http://127.0.0.1:54676
https://vcs-test.golang.org rerouted to https://127.0.0.1:54675
go test proxy running at GOPROXY=http://127.0.0.1:54674/mod
--- FAIL: TestScript (0.15s)
    --- FAIL: TestScript/test_match_only_tests (0.68s)
        script_test.go:136: 2024-03-20T14:33:27Z
        script_test.go:138: $WORK=/tmp/workdir/tmp/cmd-go-test-3941342647/tmpdir2789106918/test_match_only_tests3709676204
        script_test.go:160: 
            # Matches only tests (0.672s)
            > go test -run Test standalone_test.go
...
            fatal error: workbuf is empty

            runtime stack:
            runtime.throw({0xbe8a17?, 0x71bed46dfe58?})
            	/tmp/workdir/go/src/runtime/panic.go:1021 +0x5c fp=0x71bed46dfe08 sp=0x71bed46dfdd8 pc=0x43f81c
            runtime.(*workbuf).checknonempty(0xc00016c480?)
            	/tmp/workdir/go/src/runtime/mgcwork.go:338 +0x2c fp=0x71bed46dfe28 sp=0x71bed46dfe08 pc=0x42ed0c
            runtime.trygetfull()
            	/tmp/workdir/go/src/runtime/mgcwork.go:430 +0x53 fp=0x71bed46dfe48 sp=0x71bed46dfe28 pc=0x42f0d3
            runtime.(*gcWork).tryGet(0xc000059758)
...
            	/tmp/workdir/go/src/cmd/go/internal/work/action.go:76 +0x2d fp=0xc0009a7dc8 sp=0xc0009a7d98 pc=0x9ac24d
            cmd/go/internal/work.(*Builder).Do.func3({0xd2b340, 0x13b5b60}, 0xc0001f98c0)
            	/tmp/workdir/go/src/cmd/go/internal/work/exec.go:152 +0x7af fp=0xc0009a7f20 sp=0xc0009a7dc8 pc=0x9bb8ef
            cmd/go/internal/work.(*Builder).Do.func4()
            	/tmp/workdir/go/src/cmd/go/internal/work/exec.go:221 +0xb9 fp=0xc0009a7fe0 sp=0xc0009a7f20 pc=0x9baf79
            runtime.goexit({})
            	/tmp/workdir/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0009a7fe8 sp=0xc0009a7fe0 pc=0x479721
            created by cmd/go/internal/work.(*Builder).Do in goroutine 1
            	/tmp/workdir/go/src/cmd/go/internal/work/exec.go:207 +0x3fe
        script_test.go:160: FAIL: testdata/script/test_match_only_tests.txt:2: go test -run Test standalone_test.go: exit status 2

watchflakes
