
roachtest: clearrange/zfs/checks=true failed #68303

Closed
cockroach-teamcity opened this issue Jul 31, 2021 · 148 comments
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-storage Storage Team

Comments

@cockroach-teamcity
Member

roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 701b177d8f4b81d8654dfb4090a2cd3cf82e63a7:

The test failed on branch=master, cloud=gce:
test timed out (see artifacts for details)
Reproduce

See: roachtest README

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jul 31, 2021
@blathers-crl blathers-crl bot added the T-storage Storage Team label Jul 31, 2021
@bananabrick bananabrick self-assigned this Aug 2, 2021
@cockroach-teamcity
Member Author

roachtest.clearrange/zfs/checks=true failed with artifacts on master @ eef03a46f2e43ff70485dadf7d9ad445db05cab4:

The test failed on branch=master, cloud=gce:
test timed out (see artifacts for details)
Reproduce

See: roachtest README

See: CI job to stress roachtests

For the CI stress job, click the ellipsis (...) next to the Run button and fill in:
* Changes / Build branch: master
* Parameters / `env.TESTS`: `^clearrange/zfs/checks=true$`
* Parameters / `env.COUNT`: <number of runs>

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

@bananabrick bananabrick removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Aug 4, 2021
@cockroach-teamcity
Member Author

roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 6b8d59327add74cf1342345fb3eaffc3a3e765d2:

The test failed on branch=master, cloud=gce:
test timed out (see artifacts for details)
Reproduce

See: roachtest README

See: CI job to stress roachtests

For the CI stress job, click the ellipsis (...) next to the Run button and fill in:
* Changes / Build branch: master
* Parameters / `env.TESTS`: `^clearrange/zfs/checks=true$`
* Parameters / `env.COUNT`: <number of runs>

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 50ef2fc205baa65c5a740c2d614fe1de279367e9:

The test failed on branch=master, cloud=gce:
test timed out (see artifacts for details)
Reproduce

See: roachtest README

See: CI job to stress roachtests

For the CI stress job, click the ellipsis (...) next to the Run button and fill in:
* Changes / Build branch: master
* Parameters / `env.TESTS`: `^clearrange/zfs/checks=true$`
* Parameters / `env.COUNT`: <number of runs>

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.clearrange/zfs/checks=true failed with artifacts on master @ cab185ff71f0924953d987fe6ffd14efdd32a3a0:

The test failed on branch=master, cloud=gce:
test timed out (see artifacts for details)
Reproduce

See: roachtest README

See: CI job to stress roachtests

For the CI stress job, click the ellipsis (...) next to the Run button and fill in:
* Changes / Build branch: master
* Parameters / `env.TESTS`: `^clearrange/zfs/checks=true$`
* Parameters / `env.COUNT`: <number of runs>

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 847514dab6354d4cc4ccf7b2857487b32119fb37:

The test failed on branch=master, cloud=gce:
test timed out (see artifacts for details)
Reproduce

See: roachtest README

See: CI job to stress roachtests

For the CI stress job, click the ellipsis (...) next to the Run button and fill in:
* Changes / Build branch: master
* Parameters / `env.TESTS`: `^clearrange/zfs/checks=true$`
* Parameters / `env.COUNT`: <number of runs>

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

@bananabrick
Contributor

These are failing sporadically during the "import" workload. Looking into it.

@cockroach-teamcity
Member Author

roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 90809c048d05f923a67ce9b89597b2779fc73e32:

The test failed on branch=master, cloud=gce:
test timed out (see artifacts for details)
Reproduce

See: roachtest README

See: CI job to stress roachtests

For the CI stress job, click the ellipsis (...) next to the Run button and fill in:
* Changes / Build branch: master
* Parameters / `env.TESTS`: `^clearrange/zfs/checks=true$`
* Parameters / `env.COUNT`: <number of runs>

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 0880e83e30ee5eb9aab7bb2297324e098d028225:

The test failed on branch=master, cloud=gce:
test timed out (see artifacts for details)
Reproduce

See: roachtest README

See: CI job to stress roachtests

For the CI stress job, click the ellipsis (...) next to the Run button and fill in:
* Changes / Build branch: master
* Parameters / `env.TESTS`: `^clearrange/zfs/checks=true$`
* Parameters / `env.COUNT`: <number of runs>

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 7897f24246bef3cb94f9f4bfaed474ecaa9fdee6:

		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*crdbInstallHelper).startNode
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:412
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:166
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).ParallelE.func1.1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1709
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		  | Wraps: (2) ~ ./cockroach.sh
		  |   | Job for cockroach.service failed because the control process exited with error code.
		  |   | See "systemctl status cockroach.service" and "journalctl -xe" for details.
		  | Wraps: (3) exit status 1
		  | Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *exec.ExitError: 
		  | 8: ~ ./cockroach.sh: exit status 1
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*crdbInstallHelper).startNode
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:412
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:166
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).ParallelE.func1.1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1709
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		  | Wraps: (2) ~ ./cockroach.sh
		  |   | Job for cockroach.service failed because the control process exited with error code.
		  |   | See "systemctl status cockroach.service" and "journalctl -xe" for details.
		  | Wraps: (3) exit status 1
		  | Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *exec.ExitError: 
		  | 9: ~ ./cockroach.sh: exit status 1
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*crdbInstallHelper).startNode
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:412
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:166
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).ParallelE.func1.1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1709
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		  | Wraps: (2) ~ ./cockroach.sh
		  |   | Job for cockroach.service failed because the control process exited with error code.
		  |   | See "systemctl status cockroach.service" and "journalctl -xe" for details.
		  | Wraps: (3) exit status 1
		  | Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *exec.ExitError: 
		  | I210820 08:22:08.352991 1 (gostd) cluster_synced.go:1677  [-] 1  command failed
		Wraps: (2) exit status 1
		Error types: (1) *cluster.WithCommandDetails (2) *exec.ExitError
Reproduce

See: roachtest README

See: CI job to stress roachtests

For the CI stress job, click the ellipsis (...) next to the Run button and fill in:
* Changes / Build branch: master
* Parameters / `env.TESTS`: `^clearrange/zfs/checks=true$`
* Parameters / `env.COUNT`: <number of runs>

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 11e0a4da82124e70e772a009011ca7a4007bff85:

		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*crdbInstallHelper).startNode
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:412
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:166
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).ParallelE.func1.1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1709
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		  | Wraps: (2) ~ ./cockroach.sh
		  |   | Job for cockroach.service failed because the control process exited with error code.
		  |   | See "systemctl status cockroach.service" and "journalctl -xe" for details.
		  | Wraps: (3) exit status 1
		  | Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *exec.ExitError: 
		  | 8: ~ ./cockroach.sh: exit status 1
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*crdbInstallHelper).startNode
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:412
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:166
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).ParallelE.func1.1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1709
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		  | Wraps: (2) ~ ./cockroach.sh
		  |   | Job for cockroach.service failed because the control process exited with error code.
		  |   | See "systemctl status cockroach.service" and "journalctl -xe" for details.
		  | Wraps: (3) exit status 1
		  | Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *exec.ExitError: 
		  | 9: ~ ./cockroach.sh: exit status 1
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*crdbInstallHelper).startNode
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:412
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:166
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).ParallelE.func1.1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1709
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		  | Wraps: (2) ~ ./cockroach.sh
		  |   | Job for cockroach.service failed because the control process exited with error code.
		  |   | See "systemctl status cockroach.service" and "journalctl -xe" for details.
		  | Wraps: (3) exit status 1
		  | Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *exec.ExitError: 
		  | I210821 08:06:06.679612 1 (gostd) cluster_synced.go:1677  [-] 1  command failed
		Wraps: (2) exit status 1
		Error types: (1) *cluster.WithCommandDetails (2) *exec.ExitError
Reproduce

See: roachtest README

See: CI job to stress roachtests

For the CI stress job, click the ellipsis (...) next to the Run button and fill in:
* Changes / Build branch: master
* Parameters / `env.TESTS`: `^clearrange/zfs/checks=true$`
* Parameters / `env.COUNT`: <number of runs>

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.clearrange/zfs/checks=true failed with artifacts on master @ d18da6c092bf1522e7a6478fe3973817e318c247:

		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*crdbInstallHelper).startNode
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:412
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:166
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).ParallelE.func1.1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1709
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		  | Wraps: (2) ~ ./cockroach.sh
		  |   | Job for cockroach.service failed because the control process exited with error code.
		  |   | See "systemctl status cockroach.service" and "journalctl -xe" for details.
		  | Wraps: (3) exit status 1
		  | Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *exec.ExitError: 
		  | 8: ~ ./cockroach.sh: exit status 1
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*crdbInstallHelper).startNode
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:412
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:166
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).ParallelE.func1.1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1709
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		  | Wraps: (2) ~ ./cockroach.sh
		  |   | Job for cockroach.service failed because the control process exited with error code.
		  |   | See "systemctl status cockroach.service" and "journalctl -xe" for details.
		  | Wraps: (3) exit status 1
		  | Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *exec.ExitError: 
		  | 9: ~ ./cockroach.sh: exit status 1
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*crdbInstallHelper).startNode
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:412
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:166
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).ParallelE.func1.1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1709
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		  | Wraps: (2) ~ ./cockroach.sh
		  |   | Job for cockroach.service failed because the control process exited with error code.
		  |   | See "systemctl status cockroach.service" and "journalctl -xe" for details.
		  | Wraps: (3) exit status 1
		  | Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *exec.ExitError: 
		  | I210822 08:36:00.123979 1 (gostd) cluster_synced.go:1677  [-] 1  command failed
		Wraps: (2) exit status 1
		Error types: (1) *cluster.WithCommandDetails (2) *exec.ExitError
Reproduce

See: roachtest README

See: CI job to stress roachtests

For the CI stress job, click the ellipsis (...) next to the Run button and fill in:
* Changes / Build branch: master
* Parameters / `env.TESTS`: `^clearrange/zfs/checks=true$`
* Parameters / `env.COUNT`: <number of runs>

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 61bd543ba7288c8f0eed6cddded7b219c9d1fcd4:

		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*crdbInstallHelper).startNode
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:412
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:166
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).ParallelE.func1.1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1709
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		  | Wraps: (2) ~ ./cockroach.sh
		  |   | Job for cockroach.service failed because the control process exited with error code.
		  |   | See "systemctl status cockroach.service" and "journalctl -xe" for details.
		  | Wraps: (3) exit status 1
		  | Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *exec.ExitError: 
		  | 8: ~ ./cockroach.sh: exit status 1
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*crdbInstallHelper).startNode
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:412
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:166
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).ParallelE.func1.1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1709
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		  | Wraps: (2) ~ ./cockroach.sh
		  |   | Job for cockroach.service failed because the control process exited with error code.
		  |   | See "systemctl status cockroach.service" and "journalctl -xe" for details.
		  | Wraps: (3) exit status 1
		  | Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *exec.ExitError: 
		  | 9: ~ ./cockroach.sh: exit status 1
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*crdbInstallHelper).startNode
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:412
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:166
		  |   | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).ParallelE.func1.1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1709
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		  | Wraps: (2) ~ ./cockroach.sh
		  |   | Job for cockroach.service failed because the control process exited with error code.
		  |   | See "systemctl status cockroach.service" and "journalctl -xe" for details.
		  | Wraps: (3) exit status 1
		  | Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *exec.ExitError: 
		  | I210823 08:17:14.885081 1 (gostd) cluster_synced.go:1677  [-] 1  command failed
		Wraps: (2) exit status 1
		Error types: (1) *cluster.WithCommandDetails (2) *exec.ExitError
Reproduce

See: roachtest README

See: CI job to stress roachtests

For the CI stress job, click the ellipsis (...) next to the Run button and fill in:
* Changes / Build branch: master
* Parameters / `env.TESTS`: `^clearrange/zfs/checks=true$`
* Parameters / `env.COUNT`: <number of runs>

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.clearrange/checks=true failed with artifacts on master @ 8cae60f603ccc4d83137167b3b31cab09be9d41a:

		  |  1358.0s        0         3557.8         5106.1      5.0     32.5     62.9    159.4 write
		  |  1359.0s        0         3963.4         5105.3      5.0     24.1     79.7    121.6 write
		  |  1360.0s        0         4129.4         5104.6      5.0     24.1     54.5     92.3 write
		  | _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		  |  1361.0s        0          953.5         5101.5      5.0     26.2     50.3     83.9 write
		  |  1362.0s        0            0.0         5097.8      0.0      0.0      0.0      0.0 write
		  |  1363.0s        0            0.0         5094.0      0.0      0.0      0.0      0.0 write
		  |  1364.0s        0            0.0         5090.3      0.0      0.0      0.0      0.0 write
		  |  1365.0s        0            0.0         5086.6      0.0      0.0      0.0      0.0 write
		  |  1366.0s        0            0.0         5082.8      0.0      0.0      0.0      0.0 write
		  |  1367.0s        0            0.0         5079.1      0.0      0.0      0.0      0.0 write
		  |  1368.0s        0            0.0         5075.4      0.0      0.0      0.0      0.0 write
		  |  1369.0s        0            0.0         5071.7      0.0      0.0      0.0      0.0 write
		  |  1370.0s        0            0.0         5068.0      0.0      0.0      0.0      0.0 write
		  | Error: unexpected EOF
		  | COMMAND_PROBLEM: exit status 1
		Wraps: (4) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError

	monitor.go:128,clearrange.go:207,clearrange.go:38,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
		  | main.(*monitorImpl).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runClearRange
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:207
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerClearRange.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:38
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
		  | runtime.doInit
		  | 	/usr/local/go/src/runtime/proc.go:6309
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:208
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

See: roachtest README

Same failure on other branches

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 8cae60f603ccc4d83137167b3b31cab09be9d41a:

		Wraps: (2) output in run_090816.531265009_n1_cockroach_workload_fixtures_import_bank
		Wraps: (3) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-3366874-1630046181-41-n10cpu16:1 -- ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank returned
		  | stderr:
		  | I210827 09:08:17.597020 1 ccl/workloadccl/fixture.go:345  [-] 1  starting import of 1 tables
		  | Error: importing fixture: importing table bank: dial tcp 127.0.0.1:26257: connect: connection refused
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 1. Command with error:
		  |   | ``````
		  |   | ./cockroach workload fixtures import bank --payload-bytes=10240 --ranges=10 --rows=65104166 --seed=4 --db=bigbank
		  |   | ``````
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (4) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError

	cluster.go:1249,context.go:89,cluster.go:1237,test_runner.go:866: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-3366874-1630046181-41-n10cpu16 --oneshot --ignore-empty-nodes: exit status 1 1: dead (exit status 137)
		10: 13246
		4: 13917
		5: 14053
		2: 13877
		7: 13704
		8: 13512
		9: 13895
		3: 13794
		6: 13959
		Error: UNCLASSIFIED_PROBLEM: 1: dead (exit status 137)
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1173
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:281
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2107
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:225
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (3) 1: dead (exit status 137)
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
Reproduce

See: roachtest README

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 44ea1fa0eba8fc78544700ef4afded62ab98a021:

		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
		  | runtime.doInit
		  | 	/usr/local/go/src/runtime/proc.go:6309
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:208
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

	cluster.go:1249,context.go:89,cluster.go:1237,test_runner.go:866: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-3373578-1630131756-44-n10cpu16 --oneshot --ignore-empty-nodes: exit status 1 5: dead (exit status 134)
		6: 1381369
		10: 1433487
		1: 1027658
		8: 942369
		7: 1337138
		2: 1209474
		3: 1737446
		4: 1008459
		9: 1325433
		Error: UNCLASSIFIED_PROBLEM: 5: dead (exit status 134)
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1173
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:281
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2107
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:225
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (3) 5: dead (exit status 134)
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
Reproduce

See: roachtest README

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 0b57dc40deda1206d9a1c215ffdb219bbf182a39:

		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
		  | runtime.doInit
		  | 	/usr/local/go/src/runtime/proc.go:6309
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:208
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

	cluster.go:1249,context.go:89,cluster.go:1237,test_runner.go:866: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-3380119-1630304219-45-n10cpu16 --oneshot --ignore-empty-nodes: exit status 1 1: 1458935
		5: 1129317
		3: 1155285
		7: 1713730
		8: 1288440
		6: 1139808
		2: dead (exit status 134)
		10: 1327500
		9: 1181349
		4: 958629
		Error: UNCLASSIFIED_PROBLEM: 2: dead (exit status 134)
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1173
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:281
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2107
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:225
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (3) 2: dead (exit status 134)
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
Reproduce

See: roachtest README

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.clearrange/checks=true failed with artifacts on master @ c1ef81f5f435b3cc5bdf8b218532e0779f03a6bf:

		  |  1595.0s        0         2227.4         3986.7      6.8     17.8    159.4    209.7 write
		  |  1596.0s        0         3167.1         3986.2      6.8     19.9    117.4    570.4 write
		  |  1597.0s        0         3393.2         3985.8      6.8     26.2     56.6    142.6 write
		  |  1598.0s        0         1478.8         3984.2      6.0     39.8    159.4    201.3 write
		  |  1599.0s        0            0.0         3981.8      0.0      0.0      0.0      0.0 write
		  |  1600.0s        0            0.0         3979.3      0.0      0.0      0.0      0.0 write
		  | _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		  |  1601.0s        0            0.0         3976.8      0.0      0.0      0.0      0.0 write
		  |  1602.0s        0            0.0         3974.3      0.0      0.0      0.0      0.0 write
		  |  1603.0s        0            0.0         3971.8      0.0      0.0      0.0      0.0 write
		  |  1604.0s        0            0.0         3969.3      0.0      0.0      0.0      0.0 write
		  |  1605.0s        0            0.0         3966.9      0.0      0.0      0.0      0.0 write
		  |  1606.0s        0            0.0         3964.4      0.0      0.0      0.0      0.0 write
		  |  1607.0s        0            0.0         3961.9      0.0      0.0      0.0      0.0 write
		  | Error: ERROR: result is ambiguous (error=rpc error: code = Unavailable desc = error reading from server: read tcp 10.142.0.151:49272->10.142.0.148:26257: read: connection reset by peer [propagate]) (SQLSTATE 40003)
		  | COMMAND_PROBLEM: exit status 1
		Wraps: (4) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError

	monitor.go:128,clearrange.go:207,clearrange.go:38,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
		  | main.(*monitorImpl).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runClearRange
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:207
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerClearRange.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:38
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
		  | runtime.doInit
		  | 	/usr/local/go/src/runtime/proc.go:6309
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:208
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Reproduce

See: roachtest README

Same failure on other branches

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 15b773c71f92d643795e34c922717fde0447f9cd:

The test failed on branch=master, cloud=gce:
test timed out (see artifacts for details)
Reproduce

See: roachtest README

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 42e5f9492d0d8d93638241303bca984fe78baae3:

The test failed on branch=master, cloud=gce:
test timed out (see artifacts for details)
Reproduce

See: roachtest README

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

@nvanbenschoten
Member

Nice job extracting all of that Nick! That goroutine is interesting both because it is not stuck ([select]) and because there are over 60 of these computeChecksumPostApply functions firing at once on a single node. That indicates that these calls to computeChecksumPostApply are not stuck, but are very slow.

Could this slowness be explained by unexpectedly high consistency checker concurrency? All consistency checks on a single node will share the same consistencyLimiter rate limiter, which defaults to a rate of 8MB/s. Split across 60 ranges, that's 140KB/s per range. So the time to scan a single range will be 512MB / 140KB/s = 3840s = 64m.
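
A rough, self-contained sketch of the arithmetic above, using golang.org/x/time/rate as a stand-in for the internal quotapool-based consistencyLimiter (the constants are the ones quoted in this comment; illustrative only, not CockroachDB code):

package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	const (
		sharedRate = 8 << 20   // 8 MB/s: default consistency-check rate per node
		rangeSize  = 512 << 20 // 512 MB of replica data to scan
		numChecks  = 60        // concurrent consistency checks observed on one node
	)

	// With one shared limiter, each check effectively gets rate/numChecks.
	perCheck := float64(sharedRate) / numChecks
	fmt.Printf("per-check throughput: ~%.0f KB/s\n", perCheck/1024)
	fmt.Printf("time to scan one range: ~%s\n",
		time.Duration(float64(rangeSize)/perCheck*float64(time.Second)).Round(time.Minute))

	// The same effect with a real shared limiter: every WaitN call competes
	// with all other in-flight checks for the same 8 MB/s budget.
	lim := rate.NewLimiter(rate.Limit(sharedRate), 1<<20)
	_ = lim.WaitN(context.Background(), 64<<10) // e.g. one 64 KB batch of keys
}

Running it prints roughly 137 KB/s per check and about 64 minutes per 512 MB range, which lines up with the estimate above.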

@tbg would any of the replication circuit breaker work have led to the consistency checker queue detaching its context from an ongoing consistency check and moving on without the consistency check being canceled? If so, could this explain why we have more consistency checks running concurrently than the individual consistencyQueues should allow? And then the shared rate limiter would explain why these checks are getting slower and slower as more consistency checks leak.

@nicktrav
Collaborator

Should have also mentioned that the stacks from above were taken from latest master (branched from fa93c68).

Here's a look at what we're calling the "bad" sha (i.e. 6664d0c). Same problem, just much more pronounced:

LSM state (on worst node):

Store 6:
__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1   6.5 M       -   220 M       -       -       -       -   221 M       -       -       -     1.0
      0         0     0 B    0.00   215 M   685 M     137     0 B       0    52 M      72     0 B       0     0.2
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         7    25 M    0.57    39 M     0 B       0     0 B       0    96 M      28    97 M       1     2.4
      3        21   103 M    0.33   252 M   1.2 G      96    31 M      12   309 M     113   329 M       1     1.2
      4       355   4.3 G    1.00   1.2 G    21 G   1.7 K   638 M     100   1.6 G     318   1.6 G       1     1.4
      5      2171    29 G    1.00   7.2 G   108 G   7.1 K    11 G   1.0 K    11 G   1.0 K    11 G       1     1.5
      6      2803    98 G       -    97 G   1.6 G     125    36 M       8   169 G   5.9 K   169 G       1     1.7
  total      5357   131 G       -   132 G   132 G   9.1 K    12 G   1.1 K   314 G   7.4 K   182 G       5     2.4
  flush        36
compact      4326    66 G   4.6 M       1          (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
  ctype      3159       5      29    1129       4  (default, delete, elision, move, read)
 memtbl         1    64 M
zmemtbl        13   280 M
   ztbl      6294   105 G
 bcache     220 K   3.2 G   20.3%  (score == hit-rate)
 tcache      10 K   6.5 M   97.8%  (score == hit-rate)
 titers      2643
 filter         -       -   98.2%  (score == utility)

530 goroutines computing the consistency checks:

0 | quotapool | quotapool.go:281 | (*AbstractPool).Acquire(#1309, {#4, *}, {#134, *})
1 | quotapool | int_rate.go:59 | (*RateLimiter).WaitN(#307, {#4, *}, *)
2 | kvserver | replica_consistency.go:581 | (*Replica).sha512.func1({{*, *, *}, {*, 0, 0}}, {*, *, *})
3 | storage | mvcc.go:3902 | ComputeStatsForRange({#147, *}, {*, *, *}, {*, *, *}, 0, {*, ...})
4 | kvserver | replica_consistency.go:636 | (*Replica).sha512(*, {#4, *}, {*, {*, *, 8}, {*, *, 8}, ...}, …)
5 | kvserver | replica_proposal.go:247 | (*Replica).computeChecksumPostApply.func1.1({#156, *}, {{*, *, *, *, *, *, *, *, ...}, ...}, …)
6 | kvserver | replica_proposal.go:253 | (*Replica).computeChecksumPostApply.func1({#4, *})
7 | stop | stopper.go:488 | (*Stopper).RunAsyncTaskEx.func2()

@nicktrav
Collaborator

nicktrav commented Jan 27, 2022

Going to keep digging on master, as I'm fairly certain that 71f0b34 alleviates much of the problem we were seeing. Though, as PR #75448 mentions, there's likely an alternative failure mode.

nicktrav pushed a commit that referenced this issue Jan 28, 2022
On the leaseholder, `ctx` passed to `computeChecksumPostApply` is that
of the proposal. As of #71806, this context is canceled right after the
corresponding proposal is signaled (and the client goroutine returns
from `sendWithRangeID`). This effectively prevents most consistency
checks from succeeding (they previously were not affected by
higher-level cancellation because the consistency check is triggered
from a local queue that talks directly to the replica, i.e. had
something like a minutes-long timeout).

This caused disastrous behavior in the `clearrange` suite of roachtests.
That test imports a large table. After the import, most ranges have
estimates (due to the ctx cancellation preventing the consistency
checks, which as a byproduct trigger stats adjustments) and their stats
claim that they are very small. Before recent PR #74674, `ClearRange` on
such ranges would use individual point deletions instead of the much
more efficient pebble range deletions, effectively writing a lot of data
and running the nodes out of disk.

Failures of `clearrange` with #74674 were also observed, but they did
not involve out-of-disk situations, so are possibly an alternative
failure mode (that may still be related to the newly introduced presence
of context cancellation).

Touches #68303.

Release note: None
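
As a minimal sketch of the pattern this commit message describes (detaching the async checksum computation from the proposal's cancellation), assuming Go 1.21's context.WithoutCancel; the names and timings below are illustrative, not the actual CockroachDB change:

package main

import (
	"context"
	"fmt"
	"time"
)

// computeChecksum stands in for the long-running consistency-check scan that
// command application kicks off asynchronously.
func computeChecksum(ctx context.Context) error {
	select {
	case <-time.After(2 * time.Second): // pretend this is the MVCC scan
		return nil
	case <-ctx.Done():
		return ctx.Err() // this is what the canceled proposal ctx triggered
	}
}

func main() {
	// The proposal's ctx is canceled as soon as the proposal is signaled.
	proposalCtx, cancel := context.WithCancel(context.Background())

	// Detach the async task from that cancellation while keeping the
	// context's values. Requires Go 1.21+.
	taskCtx := context.WithoutCancel(proposalCtx)

	done := make(chan error, 1)
	go func() { done <- computeChecksum(taskCtx) }()

	cancel()            // the client gives up immediately...
	fmt.Println(<-done) // ...but the computation still finishes: prints <nil>
}

In the real code the detached context would also need to carry the proposal's tracing span; the point here is only that cancel() no longer aborts the computation.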
@nicktrav
Collaborator

Last update for the evening. Spent the remainder of today looking less at the ztbls leak (from what I'm seeing, after 71f0b34 we tend to see brief periods of elevation, but never anything near as bad as before that commit), and more on the replica imbalance problem that @tbg mentioned, which is preventing the test from even getting to the actual "clearrange" step.

On master, even taking ctx cancellation completely out of the picture, we're running out of disk as well, but it looks to be due to a replica imbalance.

Sampling some commits, I'm noticing the following "good" vs. "bad" behavior:

Good (fair allocation of replicas across all nodes):

[screenshot: Screen Shot 2022-01-27 at 8 17 07 PM]

Bad (some nodes run out of disk and stall the import):

[screenshot: Screen Shot 2022-01-27 at 7 59 43 PM]

I started a bisect, but it was taking some time. I'll pick this up again tomorrow.

@tbg
Member

tbg commented Jan 28, 2022

@tbg would any of the replication circuit breaker work have led to the consistency checker queue detaching its context from an ongoing consistency check and moving on without the consistency check being canceled? If so, could this explain why we have more consistency checks running concurrently than the individual consistencyQueues should allow? And then the shared rate limiter would explain why these checks are getting slower and slower as more consistency checks leak.

I think I see what the problem is. I had actually thought about it before, but erroneously convinced myself that it wasn't an issue. Here's what it looks like on the leaseholder on the bad sha (i.e. ctx cancels):

  • leaseholder proposes ComputeChecksum
  • applies cmd, spawns async computation
  • ctx cancels
  • async computation stops immediately
  • leaseholder's long-poll to wait for the result errors out
  • leaseholder handles next range. goto 1

so no concurrency. But step 2 also happens on each follower, and there it will not have a cancelable context associated with it. So there:

  • applies cmd, spawns async
  • runs for a long time
  • but in the meantime the leaseholder is already doing ten more ranges that also all went through steps 1+2
  • have 60+ consistency checks running, oops

So basically the problem is that if a consistency check fails fast on the leader, this doesn't cancel the in-flight computation on the follower. Since each node is a follower for lots of ranges, we had tons of consistency checks running on each node.
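
A toy model of that flow, with no real Raft or consistency-check code, just to show how the follower-side computations pile up while the leaseholder-side ones exit immediately:

package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

var running atomic.Int64 // checksum computations currently in flight

// check stands in for the per-replica checksum computation.
func check(ctx context.Context) {
	running.Add(1)
	defer running.Add(-1)
	select {
	case <-time.After(10 * time.Second): // pretend: slow scan behind the shared limiter
	case <-ctx.Done(): // only the leaseholder's copy ever gets here
	}
}

func main() {
	for r := 0; r < 60; r++ {
		// Leaseholder: the proposal ctx is canceled right away, so its
		// computation exits fast and the queue moves on to the next range.
		ctx, cancel := context.WithCancel(context.Background())
		go check(ctx)
		cancel()

		// Follower: no cancelable ctx, so its computation keeps running.
		go check(context.Background())
	}
	time.Sleep(time.Second)
	fmt.Println("checks still running:", running.Load()) // ~60, all follower-side
}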

What's curious is that when I ran this experiment I should've seen lots of snapshots open, but I didn't; maybe my instrumentation was wrong, or the test never got to the point where it exhibited this problem (the graceful shutdowns I introduced after the import hung, I think).

With the cancellation fix, we're close to the previous behavior. The only difference is that previously, the computation on the leaseholder was canceled when the consistency checker queue gave up. But like before this wouldn't affect the followers if they still had the computation ongoing.

I think this might put a pin in the high ztbl count, right? Thanks for all of the work getting us here @nicktrav!

@tbg
Member

tbg commented Jan 28, 2022

How are you bisecting, btw? Are you going through all 319 commits cd1093d...8eaf8d2? It sounds as though the strategy for each bisection attempt would be to cherry-pick 71f0b34 on top, but are we even sure this is "good" for any of the commits in that range?

@nvanbenschoten
Member

So basically the problem is that if a consistency check fails fast on the leader, this doesn't cancel the in-flight computation on the follower. Since each node is a follower for lots of ranges, we had tons of consistency checks running on each node.

This makes a lot of sense. I'll still suggest that we should think carefully about whether the client ctx cancellation is the root of the problem, or whether it's actually d064059. The ability for a client ctx cancellation to propagate to Raft log application on the proposer replica seems like a serious problem to me. It breaks determinism, the cornerstone of the whole "replicated state machine" idea. I'm actually surprised this hasn't caused worse issues, like a short-circuited split on the proposer. We must just not currently check for context cancellation in many places below Raft.

@erikgrinaker
Contributor

I fully agree with Nathan here. That commit was motivated by propagating tracing information through command application, but it should not propagate cancellation signals.

@cockroach-teamcity
Member Author

roachtest.clearrange/checks=true failed with artifacts on master @ 71becf337d9d2731298dc092f3ce9cf0f0eedb2c:

		  | I220128 10:37:27.336682 337 workload/pgx_helpers.go:79  [-] 23  pgx logger [error]: Exec logParams=map[args:[-6450913955317917568 f1] err:unexpected EOF sql:kv-2]
		  | I220128 10:37:27.340792 326 workload/pgx_helpers.go:79  [-] 24  pgx logger [error]: Exec logParams=map[args:[8829663467242086327 62] err:unexpected EOF sql:kv-2]
		  | I220128 10:37:27.336700 343 workload/pgx_helpers.go:79  [-] 25  pgx logger [error]: Exec logParams=map[args:[-3644533257171351169 0b] err:unexpected EOF sql:kv-2]
		  | I220128 10:37:27.336710 323 workload/pgx_helpers.go:79  [-] 26  pgx logger [error]: Exec logParams=map[args:[3192999095032280912 da] err:unexpected EOF sql:kv-2]
		  | I220128 10:37:27.336696 346 workload/pgx_helpers.go:79  [-] 27  pgx logger [error]: Exec logParams=map[args:[6493141783003117667 97] err:unexpected EOF sql:kv-2]
		  | I220128 10:37:27.340831 327 workload/pgx_helpers.go:79  [-] 28  pgx logger [error]: Exec logParams=map[args:[1555708056282946553 c5] err:unexpected EOF sql:kv-2]
		  | I220128 10:37:27.340861 65 workload/pgx_helpers.go:79  [-] 29  pgx logger [error]: Exec logParams=map[args:[1826535142466176772 5b] err:unexpected EOF sql:kv-2]
		  | I220128 10:37:27.340876 64 workload/pgx_helpers.go:79  [-] 30  pgx logger [error]: Exec logParams=map[args:[1318876305802062279 3a] err:unexpected EOF sql:kv-2]
		  | I220128 10:37:27.340893 344 workload/pgx_helpers.go:79  [-] 31  pgx logger [error]: Exec logParams=map[args:[1728666596595591428 ca] err:unexpected EOF sql:kv-2]
		  | I220128 10:37:27.340911 59 workload/pgx_helpers.go:79  [-] 32  pgx logger [error]: Exec logParams=map[args:[-6613865651839355368 ca] err:unexpected EOF sql:kv-2]
		  | I220128 10:37:27.340927 61 workload/pgx_helpers.go:79  [-] 33  pgx logger [error]: Exec logParams=map[args:[-3523718973629480045 3a] err:unexpected EOF sql:kv-2]
		  | I220128 10:37:27.340941 62 workload/pgx_helpers.go:79  [-] 34  pgx logger [error]: Exec logParams=map[args:[-8232659246879096639 05] err:unexpected EOF sql:kv-2]
		  | Error: unexpected EOF
		  | COMMAND_PROBLEM: exit status 1
		  |   10: 
		  | UNCLASSIFIED_PROBLEM: context canceled
		Wraps: (4) secondary error attachment
		  | COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 9. Command with error:
		  |   | ``````
		  |   | ./cockroach workload run kv --concurrency=32 --duration=1h
		  |   | ``````
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		Wraps: (5) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

	monitor.go:127,clearrange.go:207,clearrange.go:39,test_runner.go:780: monitor failure: monitor command failure: unexpected node event: 9: dead (exit status 10)
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runClearRange
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:207
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerClearRange.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:39
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:202
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1581
		Wraps: (4) monitor command failure
		Wraps: (5) unexpected node event: 9: dead (exit status 10)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@andreimatei
Contributor

We must just not currently check for context cancellation in many places below Raft.

Is there a good reason why we check for cancellation in any places below Raft?

@tbg
Member

tbg commented Jan 28, 2022

Is there a good reason why we check for cancellation in any places below Raft?

There may not be, but it seems brittle to pass a cancelable context into a subsystem that must not check cancellation. It's both more robust and also, in my view, more appropriate to execute state machine transitions under a context that does not inherit the wholly unrelated client cancellation.

I think we should massage

// createTracingSpans creates and assigns a new tracing span for each decoded
// command. If a command was proposed locally, it will be given a tracing span
// that follows from its proposal's span.
func (d *replicaDecoder) createTracingSpans(ctx context.Context) {
	const opName = "raft application"
	var it replicatedCmdBufSlice
	for it.init(&d.cmdBuf); it.Valid(); it.Next() {
		cmd := it.cur()
		if cmd.IsLocal() {
			cmd.ctx, cmd.sp = tracing.ChildSpan(cmd.proposal.ctx, opName)
		} else if cmd.raftCmd.TraceData != nil {
			// The proposal isn't local, and trace data is available. Extract
			// the remote span and start a server-side span that follows from it.
			spanMeta, err := d.r.AmbientContext.Tracer.ExtractMetaFrom(tracing.MapCarrier{
				Map: cmd.raftCmd.TraceData,
			})
			if err != nil {
				log.Errorf(ctx, "unable to extract trace data from raft command: %s", err)
			} else {
				cmd.ctx, cmd.sp = d.r.AmbientContext.Tracer.StartSpanCtx(
					ctx,
					opName,
					// NB: Nobody is collecting the recording of this span; we have no
					// mechanism for it.
					tracing.WithRemoteParent(spanMeta),
					tracing.WithFollowsFrom(),
				)
			}
		} else {
			cmd.ctx, cmd.sp = tracing.ChildSpan(ctx, opName)
		}
	}
}

such that it avoids deriving from cmd.proposal.ctx (i.e. it can create a derived span, but not become a child of the proposer ctx).
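
For concreteness, a hypothetical sketch of what that could look like for the cmd.IsLocal() branch; tracing.WithParent is an assumed option name here, so treat this as the shape of the change rather than an actual diff:

		if cmd.IsLocal() {
			// Hypothetical: root the command's context in the applier's ctx (no
			// client cancellation attached) and only link the new span to the
			// proposal's span, rather than deriving from cmd.proposal.ctx.
			cmd.ctx, cmd.sp = d.r.AmbientContext.Tracer.StartSpanCtx(
				ctx,
				opName,
				tracing.WithParent(cmd.proposal.sp), // assumed option; illustration only
				tracing.WithFollowsFrom(),
			)
		}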

@tbg
Member

tbg commented Jan 28, 2022

Filed #75656

@andreimatei
Contributor

There may not be, but it seems brittle to pass a cancelable context into a subsystem that must not check cancellation

I kinda see it the other way around. The subsystem should be robust against any context passed into it. Depending on where you draw the boundary of the subsystem in question, you can say that raft application can be made robust by switching to a non-cancelable context itself. But still, if there's code that only ever runs below Raft, I think we should take out all the cancellation checks (at the very least, for clarity).

@nicktrav
Collaborator

How are you bisecting, btw? ... It sounds as though the strategy for each bisection attempt would be to cherry-pick 71f0b34 on top

Yeah, taking this approach. There are only 8-ish steps in a full bisect, but it's a little bit of extra work to cherry-pick, etc., so it's a little slower going.

are we even sure this is "good" for any of the commits in that range?

I don't think we are, based on what came in while I was offline. That said, if I treat replica balance as the "good" signal for this bisect, I seem to be zeroing in.

@nvanbenschoten
Member

Depending on where you draw the boundary of the subsystem in question, you can say that raft application can be made robust by switching to a non-cancelable context itself.

Right, I think this is what we're saying, and what is proposed in #75656.

But still, if there's code that only ever runs below Raft, I think we should take out all the cancellation checks (at the very least, for clarity).

Trying to make this guarantee is the part that seems brittle. Even if we carefully audit and ensure that we don't perform context cancellation checks directly in Raft code, it's hard to guarantee that no lower-level logic or library that Raft code calls into will perform such checks. For instance, I broke this guarantee in #73279 while touching distant code, which Erik fixed in #73484. There are also proposals like golang/go#20280 to add context cancellation awareness to the filesystem operations provided by the standard library. If we don't want a subsystem to respect context cancellation, it's best not to give it a cancellable context.

@nvanbenschoten
Member

I think we should massage .. such that it avoids deriving from cmd.proposal.ctx (i.e. it can create a derived span, but not become a child of the proposer ctx).

Maybe we could even remove cmd.proposal.ctx entirely. We already extract the tracing span (cmd.proposal.sp) from the context.

@tbg
Member

tbg commented Jan 28, 2022

Let's continue discussing this on #75656; this comment thread is already pretty unwieldy.

@nicktrav
Collaborator

I had some luck with the bisect on the replica imbalance issue.

I've narrowed it down to e12c9e6. On master with this commit included I see the following behavior on import:

[screenshot: Screen Shot 2022-01-28 at 9 37 24 AM]

I then ran a branch with just this commit excluded; the replicas are far more balanced, and the import is able to succeed:

[screenshot: Screen Shot 2022-01-28 at 10 47 39 AM]

I don't have enough context to say whether that commit would cause issues outside of the clearrange/* tests. It could just be a matter of giving these test workers more headroom to allow them to complete the import, and then potentially rebalance to a more even state? cc: @dt - happy to spin up a new issue for this to avoid piling onto this ticket.

Once the import succeeds, we're into the (well documented) realm of #75656 - ztbls isn't terrible, but we have a lot of goroutines (~100 on each node) running the consistency checks, which is slowing down the test overall and preventing the disk space from being reclaimed (not an issue for the remainder of the test, as we're just deleting).

In terms of debugging this specific test failure, I think we've found the two separate issues we theorized.

@tbg
Member

tbg commented Jan 28, 2022

Great work, and much appreciate the consistent extra miles you're going.

I think we should close this issue, and file a separate issue about what you have found, and then link it to the new clear range failure issue once the next nightly creates it.

@dt
Member

dt commented Jan 28, 2022

Well, that's pretty interesting / mildly surprising, since this is an ordered ingestion import, where we didn't expect this bulk-sent split size to do much other than aggravate the merge queue an hour later.

@nicktrav
Collaborator

I think we should close this issue, and file a separate issue about what you have found, and then link it to the new clear range failure issue once the next nightly creates it.

Ack. Will do. @dt - I'll move discussion over there 👍

If I close this, I think it will just re-open when it fails again (at least until the replica imbalance issue is addressed). I'll just make it known to the rest of the Storage folks that we can probably safely leave this one alone.

Thanks all.
