-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
distsql,jobs: propagate 'job' log tag and pprof label to remote nodes #91630
distsql,jobs: propagate 'job' log tag and pprof label to remote nodes #91630
Conversation
3cd7006
to
a83a4fd
Compare
a83a4fd
to
92bf199
Compare
cc @tbg FYI. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we split up the first commit into a separate PR to be backported to 22.2? Or do we plan to backport the whole change there?
Reviewed 4 of 4 files at r1, 22 of 23 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, @HonoreDB, and @lidorcarmel)
-- commits
line 18 at r2:
nit: s/tagand/tag and/
.
pkg/sql/distsql_running.go
line 410 at r2 (raw file):
return "" } setupReq := execinfrapb.SetupFlowRequest{
nit: could you please add the TODO for me here:
// TODO(yuzefovich): avoid populating some fields of the SetupFlowRequest
// for local plans.
pkg/sql/execinfra/server_config.go
line 304 at r2 (raw file):
// VerifyContext, when non-nil, is called by the execinfrapb.DistSQLServer // when responding to SetupFlow RPCs. VerifyContext func(ctx context.Context)
Is it possible to expand SetupFlowCb
to do what VerifyContext
does (or in the opposite direction)? I think the current single usage of SetupFlowCb
would be ok if it was called right before RunFlow
.
92bf199
to
389ca86
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for asking! we just discussed it this morning, we probably want to backport everything to 22.2.x. And the pprofutil code touched by the first commit is all new code, so no need to backport to earlier releases.
even if we don't backport, I guess I'm not too worried about backporting the small pprofutil commit because the other 2 usages of SetProfilerLabelsFromCtxTags
always add a tag first and should succeed.
but with the current pr, in mixed version the new<->old servers will not propagate log tags, so I'll change this PR to still send the job ID but add it as a tag only if it's not there already. cc @dt
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, @HonoreDB, and @yuzefovich)
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit:
s/tagand/tag and/
.
done
pkg/sql/distsql_running.go
line 410 at r2 (raw file):
// TODO(yuzefovich): avoid populating some fields of the SetupFlowRequest
// for local plans.
of course..
pkg/sql/execinfra/server_config.go
line 304 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Is it possible to expand
SetupFlowCb
to do whatVerifyContext
does (or in the opposite direction)? I think the current single usage ofSetupFlowCb
would be ok if it was called right beforeRunFlow
.
unfortunately if we call that callback after setting up the flow the test will fail, because we see an inflight flow here:
cockroach/pkg/sql/distsql_running_test.go
Line 667 in 0c1095e
if n := remoteNode.NumRemoteRunningFlows(); n != 0 { |
if there's an easy way around it I can try.
otherwise I can s/VerifyContext/PostSetupFlowCb/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, @HonoreDB, and @lidorcarmel)
pkg/sql/execinfra/server_config.go
line 304 at r2 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
unfortunately if we call that callback after setting up the flow the test will fail, because we see an inflight flow here:
cockroach/pkg/sql/distsql_running_test.go
Line 667 in 0c1095e
if n := remoteNode.NumRemoteRunningFlows(); n != 0 { if there's an easy way around it I can try.
otherwise I can s/VerifyContext/PostSetupFlowCb/
Hm, I'd like to not introduce another testing knob if we can. I just pushed another commit to your branch that seems to work (I stressed it for a couple of minutes locally). Let me know what you think of that change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, @HonoreDB, and @yuzefovich)
pkg/sql/execinfra/server_config.go
line 304 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Hm, I'd like to not introduce another testing knob if we can. I just pushed another commit to your branch that seems to work (I stressed it for a couple of minutes locally). Let me know what you think of that change.
ah, I see why what I did didn't work, I didn't do the cleanup. your commit lgtm, thanks!
b62a0e3
to
41c872a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it might good to get a sign off from someone else too.
Reviewed 4 of 5 files at r4, 16 of 16 files at r5, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @adityamaru, @dt, and @HonoreDB)
@@ -37,8 +37,9 @@ import "cloud/cloudpb/external_storage.proto"; | |||
// descriptor in the database, and doesn't emit any rows nor support | |||
// any post-processing. | |||
message BackfillerSpec { | |||
// TODO(lidor): this is not needed in 23.2, the new way to send the job tag is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor nit: rather than "in 23.2", I'd say "when interoperability with 22.2 is dropped" or something, which may or may not be 23.2 but we'd be trying to predict the future if we specify which version will drop backwards compat with 22.2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks!
41c872a
to
5fe124c
Compare
If we don't have log tags in the context we return invalid context and undo func and then the caller might crash. Return an empty func and the same context. Add a couple of basic tests. Epic: None Release note: None
5fe124c
to
676afd0
Compare
Previously, the recently added job log tag and pprof label were only used locally and not sent to remote nodes. Which means that, for example, pprof did not show the job ids of jobs that consume cpu but were started on another node. This PR propagates the job tag to the remote nodes, and there it adds that tag to be used when logging. It also adds the same tag as a pprof label on the remote node. With this change we no longer need to set the tags that were added in cockroachdb#77397, but we cannot remove those yet because we want the old job tags to still be sent and applied during upgrades. Once 23.1 is out, we can remove the old implementation from 23.2. Note that this PR changes the job log tag from `job=<job id>` to `job=<job type> id=<job id>`, to be the same as the pprof label. Epic: None Release note: None
676afd0
to
1a7ec75
Compare
@yuzefovich PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, makes sense, LGTM
Reviewed 16 of 16 files at r6, 11 of 11 files at r7, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @adityamaru and @HonoreDB)
Thanks! |
Build failed (retrying...): |
Build succeeded: |
A recently added test (PR cockroachdb#91630) fails in 22.2 in stress or race. The test is running a backup on a 3 node cluster, and then verifies a remote flow was running with the right pprof labels and log tags. The problem is that if all flows are local we fail, which can happen with a 3 node cluster. Instead, this fix is changing the test cluster to have 4 nodes which will force at lease one flow to run remotely. Release note: None Epic: None
A recently added test (PR cockroachdb#91630) fails in 22.2 in stress or race. The test is running a backup on a 3 node cluster, and then verifies a remote flow was running with the right pprof labels and log tags. The problem is that if all flows are local we fail, which can happen with a 3 node cluster. Instead, this fix is changing the test cluster to have 4 nodes which will force at lease one flow to run remotely. Release note: None Epic: None
92696: backupccl: fix a recently added flaky test r=lidorcarmel a=lidorcarmel A recently added test (PR #91630) fails in 22.2 in stress or race. The test is running a backup on a 3 node cluster, and then verifies a remote flow was running with the right pprof labels and log tags. The problem is that if all flows are local we fail, which can happen with a 3 node cluster. Instead, this fix is changing the test cluster to have 4 nodes which will force at lease one flow to run remotely. Release note: None Epic: None Co-authored-by: Lidor Carmel <[email protected]>
A recently added test (PR cockroachdb#91630) fails in 22.2 in stress or race. The test is running a backup on a 3 node cluster, and then verifies a remote flow was running with the right pprof labels and log tags. The problem is that if all flows are local we fail, which can happen with a 3 node cluster. Instead, this fix is changing the test cluster to have 4 nodes which will force at lease one flow to run remotely. Release note: None Epic: None
Previously, the job log tags and the recently added job pprof labels were only used locally and not sent to remote nodes. Which means that, for example, pprof did not show the job ids of jobs that consume cpu but were started on another node.
This PR propagates the job tag to the remote nodes, and there it adds that tag to be used when logging. It also adds the same tag as a pprof label on the remote node. With this change we no longer need to set the tags that were added in #77397, therefore those are reverted here.
Note that this PR changes the job log tag from
job=<job id>
tojob=<job type> id=<job id>
, to be the same as the pprof label.See this gist for a before/after example https://gist.github.com/lidorcarmel/2c147b76d814e29682cccbdcadb5d7bb.
And here is pprof graph example with those tags:
There is another commit in this PR which is fixing a small bug in the recently added pprofutil package.
Epic: None
Release note: None