-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: increase the concurrency for gracefuldraining #107955
roachtest: increase the concurrency for gracefuldraining #107955
Conversation
7e0e59f
to
b49dbe5
Compare
Passes 50 roachtest runs with this change. |
Previously measureQPS would not account for the time the query to retrieve the counts was running. This change allows querying multiple nodes in parallel and also subtracts out time while the query is running. Epic: none Release note: None
b49dbe5
to
1801a26
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Left some minor comments/questions.
Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist, @herkolategan, and @renatolabs)
-- commits
line 17 at r2:
nit: Could you be more specific, "statistics during the test" could mean a lot of things. Perhaps, "based on the metric timeseries web API"?
-- commits
line 26 at r2:
rm this line
-- commits
line 31 at r3:
nit: could you mention that this is for the workload concurrency.
pkg/cmd/roachtest/tests/kv.go
line 540 at r2 (raw file):
// definitely a large number of leases on the node that we shut down // before it starts draining. c.Run(ctx, c.Node(1), "./cockroach workload init kv --splits 100")
Was --max-ops
only supported on the deprecated workload runner?
pkg/cmd/roachtest/tests/kv.go
line 561 at r2 (raw file):
m.Go(func(ctx context.Context) error { cmd := fmt.Sprintf( "./cockroach workload run kv --duration=%s --read-percent=0 --max-rate=%d {pgurl:1-%d}",
How come you removed the --tolerate-errors
flag here; It seems like we may want this, to protect against noise here, wdyt?
Previously the test would attempt to determine the throughput based on the metrics timeseries web API during the test. This caused a high variance even though the test runner was showing lower variance. This change updates to query against the `crdb_internal.node_metrics` table during the test and produces more reliable results. Epic: none Informs: cockroachdb#106490 Release note: None
Previously the workload concurrency of 8 could cause spurious test failures as the test would incorrectly count a single slow range as causing too large of a dip. This isn't ideal behavior as we should have more stable performance over time, but this is orthogonal to the goal of this test. With a higher concurrency this test doesn't flake anymore. Epic: none Fixes: cockroachdb#106490 Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @kvoli, and @renatolabs)
Previously, kvoli (Austen) wrote…
nit: Could you be more specific, "statistics during the test" could mean a lot of things. Perhaps, "based on the metric timeseries web API"?
Done
Previously, kvoli (Austen) wrote…
rm this line
Done
Previously, kvoli (Austen) wrote…
nit: could you mention that this is for the workload concurrency.
Done
pkg/cmd/roachtest/tests/kv.go
line 540 at r2 (raw file):
Previously, kvoli (Austen) wrote…
Was
--max-ops
only supported on the deprecated workload runner?
This wasn't really necessary as it was trying to run workload init
but doing it in a weird way. I'm not sure why it did it this way as it really only wanted to cause the splits to happen, so I think it was just a workaround. This was an old test, so probably predated many of the newer patterns.
pkg/cmd/roachtest/tests/kv.go
line 561 at r2 (raw file):
Previously, kvoli (Austen) wrote…
How come you removed the
--tolerate-errors
flag here; It seems like we may want this, to protect against noise here, wdyt?
I was on the fence with this one. I didn't really want there to be any errors, and in my stress testing I didn't have any. Ideally with a drain we shouldn't have errors in addition to not having t throughput drop, so while there is a risk this makes it more flakey, it would be much better if this flag wasn't needed as it would catch any issues we introduce that cause normal drain/restart to fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the latest commits, did you push? With those nits addressed !
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andrewbaptist, @herkolategan, and @renatolabs)
pkg/cmd/roachtest/tests/kv.go
line 540 at r2 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
This wasn't really necessary as it was trying to run
workload init
but doing it in a weird way. I'm not sure why it did it this way as it really only wanted to cause the splits to happen, so I think it was just a workaround. This was an old test, so probably predated many of the newer patterns.
Yeah, I can't see that option when running ./cockroach workload init kv --help
, seems fine.
pkg/cmd/roachtest/tests/kv.go
line 561 at r2 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
I was on the fence with this one. I didn't really want there to be any errors, and in my stress testing I didn't have any. Ideally with a drain we shouldn't have errors in addition to not having t throughput drop, so while there is a risk this makes it more flakey, it would be much better if this flag wasn't needed as it would catch any issues we introduce that cause normal drain/restart to fail.
If it passed 50 stresses, then sounds good to me. I suppose we can re-assess when/if this causes a failure.
1801a26
to
260ed7a
Compare
bors r=kvoli |
Build succeeded: |
This set of commits changes gracefuldraining to use the
measureQPS
method instead of looking at the stats from the stats database and additionally increases the concurrency so it no longer fails.