New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

roachtest: increase the concurrency for gracefuldraining #107955

Merged

craig merged 3 commits into cockroachdb:master from andrewbaptist:2023-07-31-cleanup-graceful-drain

Aug 2, 2023

Collaborator

andrewbaptist commented Aug 1, 2023

This set of commits changes gracefuldraining to use the measureQPS method instead of looking at the stats from the stats database and additionally increases the concurrency so it no longer fails.

Member

cockroach-teamcity commented Aug 1, 2023

This change is

andrewbaptist force-pushed the 2023-07-31-cleanup-graceful-drain branch from 7e0e59f to b49dbe5 Compare

August 1, 2023 16:31

Collaborator Author

andrewbaptist commented Aug 1, 2023

Passes 50 roachtest runs with this change.

andrewbaptist marked this pull request as ready for review

August 1, 2023 17:39

andrewbaptist requested a review from a team as a code owner

August 1, 2023 17:39

andrewbaptist requested review from herkolategan and renatolabs and removed request for a team

August 1, 2023 17:39


          roachtest: make measureQPS more accurate

955e7e0

Previously measureQPS would not account for the time the query to
retrieve the counts was running. This change allows querying multiple
nodes in parallel and also subtracts out time while the query is
running.

Epic: none

Release note: None

andrewbaptist force-pushed the 2023-07-31-cleanup-graceful-drain branch from b49dbe5 to 1801a26 Compare

August 1, 2023 18:02

andrewbaptist requested a review from kvoli

August 1, 2023 18:24

kvoli reviewed

View reviewed changes

Collaborator

kvoli left a comment

Nice! Left some minor comments/questions.

Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist, @herkolategan, and @renatolabs)

-- commits line 17 at r2:
nit: Could you be more specific, "statistics during the test" could mean a lot of things. Perhaps, "based on the metric timeseries web API"?

-- commits line 26 at r2:
rm this line

-- commits line 31 at r3:
nit: could you mention that this is for the workload concurrency.

pkg/cmd/roachtest/tests/kv.go line 540 at r2 (raw file):

			// definitely a large number of leases on the node that we shut down
			// before it starts draining.
			c.Run(ctx, c.Node(1), "./cockroach workload init kv --splits 100")

Was --max-ops only supported on the deprecated workload runner?

pkg/cmd/roachtest/tests/kv.go line 561 at r2 (raw file):

			m.Go(func(ctx context.Context) error {
				cmd := fmt.Sprintf(
					"./cockroach workload run kv --duration=%s --read-percent=0 --max-rate=%d {pgurl:1-%d}",

How come you removed the --tolerate-errors flag here; It seems like we may want this, to protect against noise here, wdyt?

andrewbaptist added 2 commits

August 1, 2023 17:06


          roachtest: modernize gracefuldraining test

cf30717

Previously the test would attempt to determine the throughput based on
the metrics timeseries web API during the test. This caused a high
variance even though the test runner was showing lower variance. This
change updates to query against the `crdb_internal.node_metrics` table
during the test and produces more reliable results.

Epic: none
Informs: cockroachdb#106490

Release note: None


          roachtest: increase the concurrency for gracefuldraining

260ed7a

Previously the workload concurrency of 8 could cause spurious test
failures as the test would incorrectly count a single slow range as
causing too large of a dip. This isn't ideal behavior as we should have
more stable performance over time, but this is orthogonal to the goal of
this test. With a higher concurrency this test doesn't flake anymore.

Epic: none
Fixes: cockroachdb#106490

Release note: None

andrewbaptist commented

View reviewed changes

Collaborator Author

andrewbaptist left a comment

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @kvoli, and @renatolabs)

-- commits line 17 at r2:

Previously, kvoli (Austen) wrote…

nit: Could you be more specific, "statistics during the test" could mean a lot of things. Perhaps, "based on the metric timeseries web API"?

Done

-- commits line 26 at r2:

Previously, kvoli (Austen) wrote…

rm this line

Done

-- commits line 31 at r3:

Previously, kvoli (Austen) wrote…

nit: could you mention that this is for the workload concurrency.

Done

pkg/cmd/roachtest/tests/kv.go line 540 at r2 (raw file):

Previously, kvoli (Austen) wrote…

Was --max-ops only supported on the deprecated workload runner?

This wasn't really necessary as it was trying to run workload init but doing it in a weird way. I'm not sure why it did it this way as it really only wanted to cause the splits to happen, so I think it was just a workaround. This was an old test, so probably predated many of the newer patterns.

pkg/cmd/roachtest/tests/kv.go line 561 at r2 (raw file):

Previously, kvoli (Austen) wrote…

How come you removed the --tolerate-errors flag here; It seems like we may want this, to protect against noise here, wdyt?

I was on the fence with this one. I didn't really want there to be any errors, and in my stress testing I didn't have any. Ideally with a drain we shouldn't have errors in addition to not having t throughput drop, so while there is a risk this makes it more flakey, it would be much better if this flag wasn't needed as it would catch any issues we introduce that cause normal drain/restart to fail.

kvoli approved these changes

View reviewed changes

Collaborator

kvoli left a comment

I don't see the latest commits, did you push? With those nits addressed !

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andrewbaptist, @herkolategan, and @renatolabs)

pkg/cmd/roachtest/tests/kv.go line 540 at r2 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

This wasn't really necessary as it was trying to run workload init but doing it in a weird way. I'm not sure why it did it this way as it really only wanted to cause the splits to happen, so I think it was just a workaround. This was an old test, so probably predated many of the newer patterns.

Yeah, I can't see that option when running ./cockroach workload init kv --help, seems fine.

pkg/cmd/roachtest/tests/kv.go line 561 at r2 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

I was on the fence with this one. I didn't really want there to be any errors, and in my stress testing I didn't have any. Ideally with a drain we shouldn't have errors in addition to not having t throughput drop, so while there is a risk this makes it more flakey, it would be much better if this flag wasn't needed as it would catch any issues we introduce that cause normal drain/restart to fail.

If it passed 50 stresses, then sounds good to me. I suppose we can re-assess when/if this causes a failure.

andrewbaptist force-pushed the 2023-07-31-cleanup-graceful-drain branch from 1801a26 to 260ed7a Compare

August 1, 2023 23:28

Collaborator Author

andrewbaptist commented Aug 1, 2023

bors r=kvoli

Contributor

craig bot commented Aug 2, 2023

Build succeeded:

Bazel Essential CI (Cockroach)

craig bot merged commit f153ff2 into cockroachdb:master

2 checks passed

andrewbaptist deleted the 2023-07-31-cleanup-graceful-drain branch

December 15, 2023 21:39

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet