roachtest: ycsb/F/nodes=3/cpu=32 failed #70019

Closed
cockroach-teamcity opened this issue Sep 10, 2021 · 24 comments · Fixed by #70947
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) T-sql-queries SQL Queries Team

Comments

@cockroach-teamcity
Member

roachtest.ycsb/F/nodes=3/cpu=32 failed with artifacts on release-21.2 @ 99a4816fc272228a63df20dae3cc41d235e705f3:

		  |   113.0s        0         6523.2         4500.1      1.0     26.2    184.5    285.2 read
		  |   113.0s        0         6384.2         4499.8     10.0     22.0    109.1    352.3 readModifyWrite
		  |   114.0s        0         6008.2         4513.3      1.0     18.9    209.7    302.0 read
		  |   114.0s        0         6136.2         4514.2      9.4     19.9    125.8    419.4 readModifyWrite
		  |   115.0s        0         6031.3         4526.6      1.0     22.0    192.9    268.4 read
		  |   115.0s        0         6008.3         4527.2      9.4     22.0    125.8    671.1 readModifyWrite
		  |   116.0s        0         6414.7         4542.8      1.0     21.0    176.2    285.2 read
		  |   116.0s        0         6269.7         4542.2      9.4     21.0    121.6    771.8 readModifyWrite
		  |   117.0s        0         6044.7         4555.7      1.0     18.9    201.3    302.0 read
		  |   117.0s        0         6015.7         4554.8      9.4     21.0    142.6    838.9 readModifyWrite
		  |   118.0s        0         6215.2         4569.7      1.0     24.1    184.5    260.0 read
		  |   118.0s        0         6105.2         4567.9      9.4     22.0    151.0    838.9 readModifyWrite
		  |   119.0s        0         6350.0         4584.7      1.0     23.1    201.3    302.0 read
		  |   119.0s        0         6230.1         4581.9      9.4     21.0    142.6    671.1 readModifyWrite
		  |   120.0s        0         6071.1         4597.1      1.0     14.7    234.9    369.1 read
		  |   120.0s        0         6283.9         4596.1      9.4     21.0    176.2    704.6 readModifyWrite
		Wraps: (4) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError

	monitor.go:128,ycsb.go:73,ycsb.go:90,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
		  | main.(*monitorImpl).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerYCSB.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/ycsb.go:73
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerYCSB.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/ycsb.go:90
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
		  | runtime.doInit
		  | 	/usr/local/go/src/runtime/proc.go:6309
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:208
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

See: roachtest README

Same failure on other branches

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

@cockroach-teamcity cockroach-teamcity added branch-release-21.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Sep 10, 2021
@tbg
Member

tbg commented Sep 13, 2021

	  | Error: driver: bad connection

@tbg
Member

tbg commented Sep 13, 2021

Moving to SQL-Queries for the same reasons as the master sibling, see #69545 (comment)

@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Sep 13, 2021
@rytaft
Collaborator

rytaft commented Sep 13, 2021

This does not seem like a blocker, but I'm also not sure this is a SQL Queries issue. @rafiss you seem to have the most familiarity with this issue -- can you comment on this?

@tbg tbg added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Sep 13, 2021
@rafiss
Collaborator

rafiss commented Sep 14, 2021

Sure - we have seen the same type of failure in #68574. That was (supposed to be) fixed by #68665, which works around a lib/pq issue. I don't yet know how it could still be happening, but I'll look more closely again.

@rytaft
Collaborator

rytaft commented Sep 20, 2021

@rafiss did you get a chance to take a look at this at all?

@rafiss
Collaborator

rafiss commented Sep 20, 2021

I did look for a few hours and couldn't repro or find anything helpful yet.

@tbg
Member

tbg commented Sep 20, 2021

Have you tried reproducing via something like GCE_PROJECT=andrei-jepsen COUNT=50 SHORT=y LOCAL=n pkg/cmd/roachtest/roachstress.sh while on 99a4816? Anecdotally this seems to happen every ~10-20 days so 50 iterations should give you something (or suggest that somehow, the problem has disappeared).

@rafiss
Collaborator

rafiss commented Sep 20, 2021

I ran with roachstress locally and with a smaller count; I didn't know I could use a GCE project. I'll try that.

@cucaroach
Contributor

I spent a couple hours poring over the artifacts but couldn't find anything. The only idea I had was to come up with some way for the client to signal all the nodes in the cluster to panic all at once when an unexpected error like this occurs, with the understanding that panic will dump a core when GOTRACEBACK=crash is specified.
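For reference, a minimal sketch of that mechanism (not CockroachDB code; assumes GOTRACEBACK=crash is set in the server's environment and core dumps are enabled via ulimit -c unlimited):

// Under GOTRACEBACK=crash, an unrecovered panic prints all goroutine
// stacks and then aborts the process with SIGABRT, which the kernel
// turns into a core dump if ulimits allow it.
package main

func main() {
	// Run as: GOTRACEBACK=crash ./prog
	panic("unexpected error: crash to capture a core for postmortem debugging")
}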

@tbg
Member

tbg commented Sep 20, 2021

FYI using this test as a guinea pig for stressing a roachtest in CI: #70435 (comment)

Will let you know if that turns up a repro.

@cucaroach when the roachtest times out, the test harness sends SIGSEGV to ~everything to get stack traces, and we already collect the cores (though not as artifacts; with --debug you'll find them in /mnt/data1/cores or thereabouts).
Seems difficult to trigger this at the right time, though -- won't the connection already be gone by the time you realize something's amiss? Maybe running workload with pretty intense debugging can help? Anyway, first order of the day is to see if this even repros anymore, I guess.

@cucaroach
Contributor

@tbg, thanks, I didn't know that. I have some changes I'm playing with to re-enable core gathering and enable gzip compression of cores in order to streamline OOM investigations (#69317). Maybe --debug could drive core gathering as well if we don't want to do it all the time? They usually compress very well. I'm assuming that in cases like this, even though we got a "bad connection", the node is still running, since there doesn't appear to be anything amiss in the logs. Any hints about why the connection died could be long gone by the time the core is generated, but who knows, we might get lucky.

@rafiss
Collaborator

rafiss commented Sep 20, 2021

With GCE_PROJECT=andrei-jepsen COUNT=50 SHORT=y LOCAL=n pkg/cmd/roachtest/roachstress.sh I finally got a repro on the second-to-last test run (out of 50 total), so @tbg's estimate of the flake rate is pretty spot on. I'll take another look at the artifacts to see if I can figure anything out.

@cockroach-teamcity
Member Author

roachtest.ycsb/F/nodes=3 failed with artifacts on release-21.2 @ c46f5a5a098577b936e56f03d20c97300b4cce61:

		  |   113.0s        0         3490.9         2734.8      1.8     39.8    113.2    218.1 read
		  |   113.0s        0         3525.9         2737.1     15.2     41.9     75.5    285.2 readModifyWrite
		  |   114.0s        0         3548.0         2741.9      1.7     31.5    121.6    192.9 read
		  |   114.0s        0         3646.0         2745.1     14.2     41.9    100.7    243.3 readModifyWrite
		  |   115.0s        0         3466.5         2748.2      2.0     31.5    121.6    184.5 read
		  |   115.0s        0         3471.5         2751.4     16.8     41.9     79.7    243.3 readModifyWrite
		  |   116.0s        0         3219.9         2752.3      2.2     39.8     96.5    151.0 read
		  |   116.0s        0         3255.9         2755.7     18.9     48.2     71.3    243.3 readModifyWrite
		  |   117.0s        0         3455.5         2758.3      1.8     29.4    134.2    251.7 read
		  |   117.0s        0         3517.5         2762.2     15.7     41.9     88.1    352.3 readModifyWrite
		  |   118.0s        0         3765.9         2766.8      1.9     25.2    100.7    218.1 read
		  |   118.0s        0         3700.9         2770.2     16.3     41.9     75.5    251.7 readModifyWrite
		  |   119.0s        0         3563.3         2773.5      1.8     39.8    125.8    226.5 read
		  |   119.0s        0         3655.3         2777.6     15.2     41.9     83.9    218.1 readModifyWrite
		  |   120.0s        0         3635.6         2780.7      2.0     35.7     92.3    151.0 read
		  |   120.0s        0         3704.6         2785.4     16.8     44.0     65.0    184.5 readModifyWrite
		Wraps: (4) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError

	monitor.go:128,ycsb.go:73,ycsb.go:90,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
		  | main.(*monitorImpl).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerYCSB.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/ycsb.go:73
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerYCSB.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/ycsb.go:90
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
		  | runtime.doInit
		  | 	/usr/local/go/src/runtime/proc.go:6309
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:208
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

See: roachtest README

Same failure on other branches

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

@tbg
Member

tbg commented Sep 24, 2021

Something I noticed the other day while using the debugger on some local unit test was that if you pause the process for a while, you can also get a bad connection from pgx (I think). So I wonder if there's some timeout in pgx that, once it passes, abandons the connection.

@rafiss
Collaborator

rafiss commented Sep 24, 2021

Worth looking into -- though to clarify, our ycsb workload does not use pgx. Only database/sql+lib/pq.

@cockroach-teamcity
Member Author

roachtest.ycsb/F/nodes=3 failed with artifacts on release-21.2 @ 24021ba163e4ac438b169d575cf1527a4aae394d:

		  |   113.0s        0         4498.9         3766.9      2.0     15.2     35.7     60.8 read
		  |   113.0s        0         4608.9         3757.8     14.7     35.7     46.1     96.5 readModifyWrite
		  |   114.0s        0         4275.2         3771.3      2.1     14.7     30.4     79.7 read
		  |   114.0s        0         4286.2         3762.5     16.3     39.8     52.4    121.6 readModifyWrite
		  |   115.0s        0         4659.6         3779.0      2.2     18.9     44.0     83.9 read
		  |   115.0s        0         4606.6         3769.8     14.2     33.6     48.2     79.7 readModifyWrite
		  |   116.0s        0         4673.0         3786.7      2.2     17.8     39.8     71.3 read
		  |   116.0s        0         4769.0         3778.4     14.2     32.5     44.0     75.5 readModifyWrite
		  |   117.0s        0         4760.2         3795.1      2.1     17.8     39.8     71.3 read
		  |   117.0s        0         4729.2         3786.6     14.2     33.6     46.1     71.3 readModifyWrite
		  |   118.0s        0         4603.0         3801.9      2.2     16.8     32.5     65.0 read
		  |   118.0s        0         4775.9         3794.9     14.7     33.6     46.1     67.1 readModifyWrite
		  |   119.0s        0         4253.1         3805.7      2.4     17.8     39.8     83.9 read
		  |   119.0s        0         4334.1         3799.5     15.7     39.8     56.6     79.7 readModifyWrite
		  |   120.0s        0         4339.4         3810.2      2.4     16.3     37.7     79.7 read
		  |   120.0s        0         4280.5         3803.5     16.8     39.8     50.3     75.5 readModifyWrite
		Wraps: (4) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError

	monitor.go:128,ycsb.go:73,ycsb.go:90,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
		  | main.(*monitorImpl).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerYCSB.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/ycsb.go:73
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerYCSB.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/ycsb.go:90
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
		  | runtime.doInit
		  | 	/usr/local/go/src/runtime/proc.go:6309
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:208
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

See: roachtest README

Same failure on other branches

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

@rytaft
Collaborator

rytaft commented Sep 27, 2021

Any updates to report for the triage meeting?

@rafiss
Collaborator

rafiss commented Sep 27, 2021

I'm wondering if it still could be lib/pq#1000

I previously made #68665 to handle this, but maybe the

if ctx.Err() != nil && (errors.Is(err, ctx.Err()) || errors.Is(err, driver.ErrBadConn))

check is too strict. Perhaps it should be:

if (ctx.Err() != nil && errors.Is(err, ctx.Err())) || errors.Is(err, driver.ErrBadConn)
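i.e. the change is just operator grouping, but it decides whether ErrBadConn gets swallowed only while this worker's own context is canceled, or unconditionally (a sketch, assuming err and ctx from the worker loop):

// Current: ErrBadConn is ignored only while this worker's own ctx is done.
func ignoreStrict(ctx context.Context, err error) bool {
	return ctx.Err() != nil && (errors.Is(err, ctx.Err()) || errors.Is(err, driver.ErrBadConn))
}

// Proposed: ignore ErrBadConn unconditionally, since the canceled context
// that poisoned the pooled connection may belong to a different worker
// whose ctx this worker never sees.
func ignoreRelaxed(ctx context.Context, err error) bool {
	return (ctx.Err() != nil && errors.Is(err, ctx.Err())) || errors.Is(err, driver.ErrBadConn)
}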

@rafiss
Collaborator

rafiss commented Sep 27, 2021

The reason it could be that is that the error always happens at 120.0s after the test starts -- which is exactly how long the ramp-up period for this test is. After the ramp period, the ramp context is canceled and the workers are restarted with a new context; see:

var rampDone chan struct{}
if *ramp > 0 {
	// Create a channel to signal when the ramp period finishes. Will
	// be reset to nil when consumed by the process loop below.
	rampDone = make(chan struct{})
}
workersCtx, cancelWorkers := context.WithCancel(ctx)
defer cancelWorkers()
var wg sync.WaitGroup
wg.Add(len(ops.WorkerFns))
go func() {
	// If a ramp period was specified, start all of the workers gradually
	// with a new context.
	var rampCtx context.Context
	if rampDone != nil {
		var cancel func()
		rampCtx, cancel = context.WithTimeout(workersCtx, *ramp)
		defer cancel()
	}
	for i, workFn := range ops.WorkerFns {
		go func(i int, workFn func(context.Context) error) {
			// If a ramp period was specified, start all of the workers
			// gradually with a new context.
			if rampCtx != nil {
				rampPerWorker := *ramp / time.Duration(len(ops.WorkerFns))
				time.Sleep(time.Duration(i) * rampPerWorker)
				workerRun(rampCtx, errCh, nil /* wg */, limiter, workFn)
			}
			// Start worker again, this time with the main context.
			workerRun(workersCtx, errCh, &wg, limiter, workFn)
		}(i, workFn)
	}

@tbg
Member

tbg commented Sep 28, 2021

So you're suggesting we ignore ErrBadConn "always"? That seems wrong, won't this be the error you get if the node is down etc? If I understand the issue correctly it also seems like the best we can do, right? When the soon-to-be-canceled connection gets returned to the connection pool, whoever picks it up will not have a canceled context, so we're essentially checking the wrong context for ctx.Err(). Silly question, why can't we pick up the PR you linked in lib/pq and be done with it?

@rafiss
Collaborator

rafiss commented Sep 28, 2021

So you're suggesting we ignore ErrBadConn "always"?

Yeah, my thinking was that there might be a weird case where the client hasn't seen the context cancellation but the connection has already been closed. I wasn't able to repro that in any smaller examples though.

That seems wrong, won't this be the error you get if the node is down etc?

ErrBadConn is the error you get if the connection is unexpectedly broken, which maybe could include the node being down. But I'm pretty sure you'd see a timeout first.

Silly question, why can't we pick up the PR you linked in lib/pq and be done with it?

We already have picked up the linked PR. lib/pq#1000 first appears in v1.9.0 and we are now using v1.10.2. To be clear, lib/pq#1000 is the PR that caused the behavior of ErrBadConn getting returned when the context has been canceled and you try to use a connection (see lib/pq#1000 (comment))

@rafiss
Collaborator

rafiss commented Sep 28, 2021

I added the following diff so I could get a stack trace of which worker was actually getting ErrBadConn:

diff --git a/pkg/workload/cli/run.go b/pkg/workload/cli/run.go
index c1d034acb1..6c842f19ce 100644
--- a/pkg/workload/cli/run.go
+++ b/pkg/workload/cli/run.go
@@ -277,6 +277,8 @@ func workerRun(
                                // that has been canceled. See https://github.com/lib/pq/pull/1000
                                return
                        }
+                       log.Infof(ctx, "err: %+v\nctxErr: %+v", err, ctx.Err())
+                       panic(err)
                        errCh <- err
                        continue
                }

After running roachstress.sh for a while:

I210928 19:30:48.157155 1821 workload/cli/run.go:280  [-] 6  err: driver: bad connection
I210928 19:30:48.157155 1821 workload/cli/run.go:280  [-] 6 +(1) driver: bad connection
I210928 19:30:48.157155 1821 workload/cli/run.go:280  [-] 6 +Error types: (1) *errors.errorString
I210928 19:30:48.157155 1821 workload/cli/run.go:280  [-] 6 +ctxErr: <nil>
panic: driver: bad connection

goroutine 1821 [running]:
panic(0x20d24e0, 0xc0000ca270)
        /usr/local/go/src/runtime/panic.go:1065 +0x565 fp=0xc000363eb0 sp=0xc000363de8 pc=0x462d45
github.com/cockroachdb/cockroach/pkg/workload/cli.workerRun(0x4a6f8b0, 0xc001110740, 0xc0004e6240, 0xc00073fb00, 0x0, 0xc000e36620)
        /go/src/github.com/cockroachdb/cockroach/pkg/workload/cli/run.go:281 +0x3f0 fp=0xc000363f58 sp=0xc000363eb0 pc=0x18683d0
github.com/cockroachdb/cockroach/pkg/workload/cli.runRun.func3.1(0xc00012a010, 0xc000996040, 0xc0004e6240, 0x0, 0x4a6f8b0, 0xc001110740, 0xc00073fb00, 0x74, 0xc000e36620)
        /go/src/github.com/cockroachdb/cockroach/pkg/workload/cli/run.go:477 +0xec fp=0xc000363f98 sp=0xc000363f58 pc=0x186c9ec
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1371 +0x1 fp=0xc000363fa0 sp=0xc000363f98 pc=0x49d281
created by github.com/cockroachdb/cockroach/pkg/workload/cli.runRun.func3
        /go/src/github.com/cockroachdb/cockroach/pkg/workload/cli/run.go:467 +0x111

That corresponds to this line:

workerRun(workersCtx, errCh, &wg, limiter, workFn)

The notable thing here is that it's coming from the non-cancelled workersCtx, so it makes sense that ctx.Err() would be nil. It seems like a bad connection remained in the pool, and for some reason the worker that picked it up didn't retry on a different connection right away. (As per the sql/driver docs, the sql package is supposed to retry on a new connection when it sees ErrBadConn.)
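For context, the retry loop in database/sql looks roughly like the following -- a simplified paraphrase, not the verbatim stdlib code -- so an ErrBadConn from the driver should normally never reach the caller:

// Bad connections are retried on another cached connection, then once
// more on a guaranteed-new connection, before the error is surfaced.
func (db *DB) ExecContext(ctx context.Context, query string, args ...interface{}) (Result, error) {
	var res Result
	var err error
	for i := 0; i < maxBadConnRetries; i++ { // maxBadConnRetries == 2
		res, err = db.exec(ctx, query, args, cachedOrNewConn)
		if !errors.Is(err, driver.ErrBadConn) {
			return res, err
		}
	}
	return db.exec(ctx, query, args, alwaysNewConn) // final attempt on a fresh conn
}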

I'm going to try to see where exactly ErrBadConn is coming from...

@rafiss
Collaborator

rafiss commented Sep 29, 2021

I wrote this quick test based on what workload is doing: https://gist.github.com/rafiss/698d0bdbeacfd225437db631d1a95e85

It has non-deterministic behavior. Sometimes the test passes, but sometimes it gets a transaction rolled back error where it's expecting ErrBadConn. (It changes when I enable debugging and add breakpoints.) Also, if I run it with make stress, there's a different failure entirely (it gets a "read timed out" error).

@rafiss
Collaborator

rafiss commented Sep 30, 2021

I have a non-YCSB repro and was able to reproduce the same behavior on 21.1. So this is almost certainly a workload-side problem -- and even if it isn't, it's definitely not a new CRDB v21.2 bug.

It seems like a race condition in lib/pq. The test below fails somewhere around 50% of the time. It fails much less frequently when I run against a local CRDB single-node cluster. But it fails more reliably if I run against a 3-node roachprod cluster. If I add a time.Sleep after the workerRun(rampCtx, errCh, nil /* wg */, nil, workFn, debugFn) line, then I'm not able to get it to fail at all.

So I'll remove the GA-blocker tag since it shouldn't block any release. But I'll keep working on this to prevent the test from flaking, and will backport whatever workload fix I make to release-21.2.

func workerRun(
	ctx context.Context,
	errCh chan<- error,
	wg *sync.WaitGroup,
	limiter *rate.Limiter,
	workFn func(context.Context) error,
	debugFn func(),
) {
	if wg != nil {
		defer wg.Done()
	}

	for {
		if ctx.Err() != nil {
			return
		}

		// Limit how quickly the load generator sends requests based on --max-rate.
		if limiter != nil {
			if err := limiter.Wait(ctx); err != nil {
				return
			}
		}

		if err := workFn(ctx); err != nil {
			if ctx.Err() != nil && (errors.Is(err, ctx.Err()) || errors.Is(err, driver.ErrBadConn)) {
				// lib/pq may return either the `context canceled` error or a
				// `bad connection` error when performing an operation with a context
				// that has been canceled. See https://github.com/lib/pq/pull/1000
				return
			}
			// Debug instrumentation: dump pool stats and crash on any
			// unexpected error so the failure leaves a stack trace.
			debugFn()
			panic(err)
			errCh <- err
			continue
		}
	}
}


// textFormatter produces output meant for quick parsing by humans. The
// data is printed as fixed-width columns. Summary rows
// are printed at the end.
type textFormatter struct {
	i      int
	numErr int
}

func (f *textFormatter) rampDone() {
	f.i = 0
}

func (f *textFormatter) outputError(_ error) {
	f.numErr++
}

func (f *textFormatter) outputTick(startElapsed time.Duration, t histogram.Tick) {
	if f.i%20 == 0 {
		fmt.Println("_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)")
	}
	f.i++
	fmt.Printf("%7.1fs %8d %14.1f %14.1f %8.1f %8.1f %8.1f %8.1f %s\n",
		startElapsed.Seconds(),
		f.numErr,
		float64(t.Hist.TotalCount())/t.Elapsed.Seconds(),
		float64(t.Cumulative.TotalCount())/startElapsed.Seconds(),
		time.Duration(t.Hist.ValueAtQuantile(50)).Seconds()*1000,
		time.Duration(t.Hist.ValueAtQuantile(95)).Seconds()*1000,
		time.Duration(t.Hist.ValueAtQuantile(99)).Seconds()*1000,
		time.Duration(t.Hist.ValueAtQuantile(100)).Seconds()*1000,
		t.Name,
	)
}

const totalHeader = "\n_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)"

func (f *textFormatter) outputTotal(startElapsed time.Duration, t histogram.Tick) {
	f.outputFinal(startElapsed, t, "__total")
}

func (f *textFormatter) outputResult(startElapsed time.Duration, t histogram.Tick) {
	f.outputFinal(startElapsed, t, "__result")
}

func (f *textFormatter) outputFinal(
	startElapsed time.Duration, t histogram.Tick, titleSuffix string,
) {
	fmt.Println(totalHeader + titleSuffix)
	if t.Cumulative == nil {
		return
	}
	if t.Cumulative.TotalCount() == 0 {
		return
	}
	fmt.Printf("%7.1fs %8d %14d %14.1f %8.1f %8.1f %8.1f %8.1f %8.1f  %s\n",
		startElapsed.Seconds(),
		f.numErr,
		t.Cumulative.TotalCount(),
		float64(t.Cumulative.TotalCount())/startElapsed.Seconds(),
		time.Duration(t.Cumulative.Mean()).Seconds()*1000,
		time.Duration(t.Cumulative.ValueAtQuantile(50)).Seconds()*1000,
		time.Duration(t.Cumulative.ValueAtQuantile(95)).Seconds()*1000,
		time.Duration(t.Cumulative.ValueAtQuantile(99)).Seconds()*1000,
		time.Duration(t.Cumulative.ValueAtQuantile(100)).Seconds()*1000,
		t.Name,
	)
}


func TestLoop(t *testing.T) {
	defer leaktest.AfterTest(t)()
	defer log.Scope(t).Close(t)

	concurrency := 50

	var ramp *time.Duration
	i := 4 * time.Second
	ramp = &i
	totalTime := 2 * time.Second

	urls := []string{
		"postgres://root@<roachprod ip>:26257?sslmode=disable",
		"postgres://root@<roachprod ip>:26257?sslmode=disable",
		"postgres://root@<roachprod ip>:26257?sslmode=disable",
	}
	//db, err := gosql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	db, err := gosql.Open("postgres", urls[0])//strings.Join(urls, ` `))
	require.NoError(t, err)
	defer db.Close()
	db.SetMaxOpenConns(concurrency + 1)
	db.SetMaxIdleConns(concurrency + 1)
	db.Exec("create table if not exists t (a int primary key, b int)")
	db.Exec("insert into t(a,b) select g,g from generate_series(0,99) as g")

	readStmt, _ := db.Prepare("select b from t where a = $1")
	updateStmt, _ := db.Prepare("update t set b = $1 where a = $2")

	reg := histogram.NewRegistry(
		2 * time.Second,
		"myname",
	)
	handle := reg.GetHandle()


	workFn := func(ctx context.Context) error {
		start := timeutil.Now()
		defer func() {
			elapsed := timeutil.Since(start)
			handle.Get("the_op").Record(elapsed)
		}()
		return crdb.ExecuteTx(ctx, db, nil, func(tx *gosql.Tx) error {
			a := rand.Intn(100)
			var b int
			if err := tx.StmtContext(ctx, readStmt).QueryRowContext(ctx, a).Scan(&b); err != nil {
				return err
			}
			_, err := tx.StmtContext(ctx, updateStmt).ExecContext(ctx, rand.Intn(100), a)
			return err
		})
	}
	debugFn := func() {
		stats := db.Stats()
		fmt.Printf("stats %+v\n", stats)
	}

	formatter := &textFormatter{}
	start := timeutil.Now()
	errCh := make(chan error)

	workersCtx, cancelWorkers := context.WithCancel(context.Background())
	defer cancelWorkers()
	var wg sync.WaitGroup
	wg.Add(concurrency)

	var rampDone chan struct{}
	if *ramp > 0 {
		// Create a channel to signal when the ramp period finishes. Will
		// be reset to nil when consumed by the process loop below.
		rampDone = make(chan struct{})
	}

	// If a ramp period was specified, start all of the workers gradually
	// with a new context.
	var rampCtx context.Context
	if rampDone != nil {
		var cancel func()
		rampCtx, cancel = context.WithTimeout(workersCtx, *ramp)
		defer cancel()
	}
	go func() {
		for i := 0; i < concurrency; i++ {
			go func(i int, workFn func(context.Context) error) {
				// If a ramp period was specified, start all of the workers
				// gradually with a new context.
				if rampCtx != nil {
					rampPerWorker := *ramp / time.Duration(concurrency)
					time.Sleep(time.Duration(i) * rampPerWorker)
					workerRun(rampCtx, errCh, nil /* wg */, nil, workFn, debugFn)
					//time.Sleep(rampPerWorker)
				}

				fmt.Println("start real worker", i)
				// Start worker again, this time with the main context.
				workerRun(workersCtx, errCh, &wg, nil, workFn, debugFn)
			}(i, workFn)
		}

		if rampCtx != nil {
			// Wait for the ramp period to finish, then notify the process loop
			// below to reset timers and histograms.
			<-rampCtx.Done()
			close(rampDone)
		}
	}()

	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()
	done := make(chan os.Signal, 3)
	signal.Notify(done, []os.Signal{unix.SIGINT, unix.SIGTERM}...)

	go func() {
		wg.Wait()
		done <- os.Interrupt
	}()

	go func() {
		time.Sleep(totalTime + *ramp)
		done <- os.Interrupt
	}()

	for {
		select {
		case err := <-errCh:
			formatter.outputError(err)
			panic(err)

		case <-ticker.C:
			startElapsed := timeutil.Since(start)
			reg.Tick(func(t histogram.Tick) {
				formatter.outputTick(startElapsed, t)
			})

		// Once the load generator is fully ramped up, we reset the histogram
		// and the start time to throw away the stats for the ramp up period.
		case <-rampDone:
			rampDone = nil
			start = timeutil.Now()
			formatter.rampDone()
			reg.Tick(func(t histogram.Tick) {
				t.Cumulative.Reset()
				t.Hist.Reset()
			})

		case <-done:
			cancelWorkers()

			startElapsed := timeutil.Since(start)
			resultTick := histogram.Tick{Name: "the_op"}
			reg.Tick(func(t histogram.Tick) {
				formatter.outputTotal(startElapsed, t)
				if resultTick.Cumulative == nil {
					resultTick.Now = t.Now
					resultTick.Cumulative = t.Cumulative
				} else {
					resultTick.Cumulative.Merge(t.Cumulative)
				}
			})
			formatter.outputResult(startElapsed, resultTick)
			return
		}
	}
}

@rafiss rafiss removed the GA-blocker label Sep 30, 2021
@rafiss rafiss assigned rafiss and unassigned cucaroach Sep 30, 2021
@blathers-crl blathers-crl bot added the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Sep 30, 2021
@craig craig bot closed this as completed in cf4fe62 Oct 5, 2021