Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: increase the concurrency for gracefuldraining #107955

Merged

Conversation

andrewbaptist
Copy link
Collaborator

This set of commits changes gracefuldraining to use the measureQPS method instead of looking at the stats from the stats database and additionally increases the concurrency so it no longer fails.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@andrewbaptist andrewbaptist force-pushed the 2023-07-31-cleanup-graceful-drain branch from 7e0e59f to b49dbe5 Compare August 1, 2023 16:31
@andrewbaptist
Copy link
Collaborator Author

Passes 50 roachtest runs with this change.

@andrewbaptist andrewbaptist marked this pull request as ready for review August 1, 2023 17:39
@andrewbaptist andrewbaptist requested a review from a team as a code owner August 1, 2023 17:39
@andrewbaptist andrewbaptist requested review from herkolategan and renatolabs and removed request for a team August 1, 2023 17:39
Previously measureQPS would not account for the time the query to
retrieve the counts was running. This change allows querying multiple
nodes in parallel and also subtracts out time while the query is
running.

Epic: none

Release note: None
@andrewbaptist andrewbaptist force-pushed the 2023-07-31-cleanup-graceful-drain branch from b49dbe5 to 1801a26 Compare August 1, 2023 18:02
Copy link
Collaborator

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Left some minor comments/questions.

Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist, @herkolategan, and @renatolabs)


-- commits line 17 at r2:
nit: Could you be more specific, "statistics during the test" could mean a lot of things. Perhaps, "based on the metric timeseries web API"?


-- commits line 26 at r2:
rm this line


-- commits line 31 at r3:
nit: could you mention that this is for the workload concurrency.


pkg/cmd/roachtest/tests/kv.go line 540 at r2 (raw file):

			// definitely a large number of leases on the node that we shut down
			// before it starts draining.
			c.Run(ctx, c.Node(1), "./cockroach workload init kv --splits 100")

Was --max-ops only supported on the deprecated workload runner?


pkg/cmd/roachtest/tests/kv.go line 561 at r2 (raw file):

			m.Go(func(ctx context.Context) error {
				cmd := fmt.Sprintf(
					"./cockroach workload run kv --duration=%s --read-percent=0 --max-rate=%d {pgurl:1-%d}",

How come you removed the --tolerate-errors flag here; It seems like we may want this, to protect against noise here, wdyt?

Previously the test would attempt to determine the throughput based on
the metrics timeseries web API during the test. This caused a high
variance even though the test runner was showing lower variance. This
change updates to query against the `crdb_internal.node_metrics` table
during the test and produces more reliable results.

Epic: none
Informs: cockroachdb#106490

Release note: None
Previously the workload concurrency of 8 could cause spurious test
failures as the test would incorrectly count a single slow range as
causing too large of a dip. This isn't ideal behavior as we should have
more stable performance over time, but this is orthogonal to the goal of
this test. With a higher concurrency this test doesn't flake anymore.

Epic: none
Fixes: cockroachdb#106490

Release note: None
Copy link
Collaborator Author

@andrewbaptist andrewbaptist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @kvoli, and @renatolabs)


-- commits line 17 at r2:

Previously, kvoli (Austen) wrote…

nit: Could you be more specific, "statistics during the test" could mean a lot of things. Perhaps, "based on the metric timeseries web API"?

Done


-- commits line 26 at r2:

Previously, kvoli (Austen) wrote…

rm this line

Done


-- commits line 31 at r3:

Previously, kvoli (Austen) wrote…

nit: could you mention that this is for the workload concurrency.

Done


pkg/cmd/roachtest/tests/kv.go line 540 at r2 (raw file):

Previously, kvoli (Austen) wrote…

Was --max-ops only supported on the deprecated workload runner?

This wasn't really necessary as it was trying to run workload init but doing it in a weird way. I'm not sure why it did it this way as it really only wanted to cause the splits to happen, so I think it was just a workaround. This was an old test, so probably predated many of the newer patterns.


pkg/cmd/roachtest/tests/kv.go line 561 at r2 (raw file):

Previously, kvoli (Austen) wrote…

How come you removed the --tolerate-errors flag here; It seems like we may want this, to protect against noise here, wdyt?

I was on the fence with this one. I didn't really want there to be any errors, and in my stress testing I didn't have any. Ideally with a drain we shouldn't have errors in addition to not having t throughput drop, so while there is a risk this makes it more flakey, it would be much better if this flag wasn't needed as it would catch any issues we introduce that cause normal drain/restart to fail.

Copy link
Collaborator

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the latest commits, did you push? With those nits addressed :lgtm:!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andrewbaptist, @herkolategan, and @renatolabs)


pkg/cmd/roachtest/tests/kv.go line 540 at r2 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

This wasn't really necessary as it was trying to run workload init but doing it in a weird way. I'm not sure why it did it this way as it really only wanted to cause the splits to happen, so I think it was just a workaround. This was an old test, so probably predated many of the newer patterns.

Yeah, I can't see that option when running ./cockroach workload init kv --help, seems fine.


pkg/cmd/roachtest/tests/kv.go line 561 at r2 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

I was on the fence with this one. I didn't really want there to be any errors, and in my stress testing I didn't have any. Ideally with a drain we shouldn't have errors in addition to not having t throughput drop, so while there is a risk this makes it more flakey, it would be much better if this flag wasn't needed as it would catch any issues we introduce that cause normal drain/restart to fail.

If it passed 50 stresses, then sounds good to me. I suppose we can re-assess when/if this causes a failure.

@andrewbaptist andrewbaptist force-pushed the 2023-07-31-cleanup-graceful-drain branch from 1801a26 to 260ed7a Compare August 1, 2023 23:28
@andrewbaptist
Copy link
Collaborator Author

bors r=kvoli

@craig
Copy link
Contributor

craig bot commented Aug 2, 2023

Build succeeded:

@craig craig bot merged commit f153ff2 into cockroachdb:master Aug 2, 2023
2 checks passed
@andrewbaptist andrewbaptist deleted the 2023-07-31-cleanup-graceful-drain branch December 15, 2023 21:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants