Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bulk, backupccl: introduce a Structured event Aggregator #84043

Merged
merged 1 commit into from
Aug 17, 2022

Conversation

adityamaru
Copy link
Contributor

@adityamaru adityamaru commented Jul 7, 2022

This change introduces an Aggregator object that is capable
of listening for Structured events emitted in a recording,
aggregating them and rendering them as LazyTags.
We also introduce an AggregatorEvent interface that can be
implemented by a Structured event thereby making it eligible
for aggregation in the Aggregator.

The first user of the Aggregator will be every backup data
processor that is setup on the nodes in the cluster during
a backup. The Aggregator lives as long as the processor, and
listens for Aggregator events emitted by any span that is a child
of the processors' span. This includes both local children as
well as remote children whose recordings have been imported into
a local span. The Aggregator stores running aggregates of each
AggregatorEvent it is notified about, bucketed by the events'
tag name. This aggregate will be rendered on the tracing span as a LazyTag.

This change teaches every ExportRequest to emit an AggregatorEvent.
Going forward we expect many more operations in bulk jobs to
define and emit such events providing visibility into otherwise
opaque operations.

We cleanup some of the StructuredEvents that were previously
added but have not proved useful, and also change some of the
tracing span operation names to be more intuitive.

To view these aggregated events once can navigate to the
/tracez endpoint of the debug console to take a snapshot and search for either
BACKUP or the job ID to filter for tracing spans on that node.
The span associated with the backup processor will be decorated with
tags that correspond to the fields in the introduced ExportStats
proto message. Note these stats are only aggregated on a per node
basis.

Fixes: #80388

Release note: None

Release justification: low risk change for improved observability into backups.

@adityamaru adityamaru requested review from dt, andreimatei, stevendanna and a team July 7, 2022 22:23
@adityamaru adityamaru requested review from a team as code owners July 7, 2022 22:23
@adityamaru adityamaru requested a review from a team July 7, 2022 22:23
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@adityamaru adityamaru removed request for a team July 7, 2022 22:23
@adityamaru
Copy link
Contributor Author

adityamaru commented Jul 7, 2022

Screen Shot 2022-07-07 at 6 37 36 PM

An example of how the LazyTag for an AggregatorEvent is rendered.

@adityamaru adityamaru requested a review from rhu713 July 11, 2022 18:11
Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, @rhu713, and @stevendanna)


pkg/util/bulk/aggregator.go line 64 at r1 (raw file):

		b.mu.sp.SetLazyTag(eventTag, b.mu.aggregatedEvents[eventTag])
	} else {
		b.mu.aggregatedEvents[eventTag].Combine(bulkEvent)

does this change an event that's part of the structured recording, or does it change a copy that's only used for the tag?


pkg/util/bulk/aggregator.go line 83 at r1 (raw file):

// The Aggregator instance is responsible for finishing the returned span, and
// so the user must call Close().
func MakeAggregatorWithSpan(

I don't need to care about this code, so feel free to ignore, but I for one would not intertwine the aggregator and the span. I'd have callers create a span explicitly, then attach an aggregator to the span. The aggregator would not be in charge of closing the span.


pkg/util/tracing/crdbspan.go line 512 at r1 (raw file):

	for _, span := range childRecording {
		for _, record := range span.StructuredRecords {
			s.notifyEventListeners(record.Payload)

remind me why this had to change pls

Copy link
Contributor Author

@adityamaru adityamaru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @dt, @rhu713, and @stevendanna)


pkg/util/bulk/aggregator.go line 64 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

does this change an event that's part of the structured recording, or does it change a copy that's only used for the tag?

good catch, added a Clone method to the interface that returns a copy the first time we assign it a slot in the map.


pkg/util/bulk/aggregator.go line 83 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I don't need to care about this code, so feel free to ignore, but I for one would not intertwine the aggregator and the span. I'd have callers create a span explicitly, then attach an aggregator to the span. The aggregator would not be in charge of closing the span.

I did try this before the current implementation, but it wasn't very pretty. Since event listeners have to be registered at span creation time, it involved plumbing the event listener through a few methods in processorbase starting from bp.StartInternal(ctx, backupProcessorName).


pkg/util/tracing/crdbspan.go line 512 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

remind me why this had to change pls

https://github.com/cockroachdb/cockroach/pull/84043/files#diff-d5042f9b360d17b5e8009b0cedd5b8a7c490b64b4b41c6958b4e5fd99074ad57R47 in the aggregator we type check that the structured event is an AggregatorEvent. When importing remote spans we notify the event listeners on the span being imported into of these events. Before this diff we would pass along record.Payload which is of type *types.Any. Casting record.Payload to the AggregatorEvent interface type will always fail, since it is the underlying protoutil.Message that implements AggregatorEvent. So we first unmarshal the types.Any into a types.DynamicAny which would give us access to the concrete proto type.

@@ -930,6 +935,9 @@ func (s *crdbSpan) getRecordingNoChildrenLocked(
childKey := string(tag.Key)
childValue := tag.Value.Emit()

if rs.Tags == nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, didn't you fix this already?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, I rebased on master a couple days ago. When we chatted it was me working on this diff its just been sitting in cold storage since. TFTR! I'll wait on Steven to give it a pass since I bugged him to before you stamped it, and then merge away.

@benbardin
Copy link
Collaborator

Approved for backupccl

Copy link
Collaborator

@stevendanna stevendanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, I left some comments, but I leave them to your discretion, I don't think they are blocking.

It might be nice to include in the already good commit message a line or two about how I might see these aggregated events -- just so anyone who sees this new feature knows how to start experimenting with it.

pkg/ccl/backupccl/backuppb/backup.go Outdated Show resolved Hide resolved
pkg/ccl/backupccl/backuppb/backup.go Show resolved Hide resolved
pkg/ccl/backupccl/backup_processor.go Show resolved Hide resolved
Comment on lines +514 to +516
if err := types.UnmarshalAny(record.Payload, &d); err != nil {
continue
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it doesn't matter too much, but I kinda wonder if we can hold off on Unmarshaling this until we know there are event listeners in place for this span.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'd either have to change the notifyEventListeners signature to take a *types.Any or inside notifyEventListeners we will have to MarshalAny the passed initem Structured before we UnmarshalAny it into a DynamicAny. I think I'm going to leave it as is for now since this code is only called when importing remote spans into our span, not when every structured event is emitted.

pkg/util/bulk/aggregator_test.go Outdated Show resolved Hide resolved
Copy link
Contributor

@rhu713 rhu713 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code LGTM. I just have a slight design input that may improve things down the line.

pkg/util/bulk/aggregator.go Outdated Show resolved Hide resolved
@adityamaru adityamaru force-pushed the backup-tracing-cleanup branch 2 times, most recently from a507887 to d1bb4b8 Compare August 16, 2022 20:49
@adityamaru adityamaru requested a review from rhu713 August 16, 2022 20:49
Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @adityamaru, @andreimatei, @benbardin, @dt, @rhu713, and @stevendanna)


pkg/util/bulk/aggregator.go line 23 at r4 (raw file):

// Aggregator. An AggregatorEvent also implements the tracing.LazyTag interface
// to render its information on the associated tracing span.
type AggregatorEvent interface {

given that there is only one type of event implementing this, is the interface worth it?


pkg/util/bulk/aggregator.go line 37 at r4 (raw file):

// An Aggregator can be used to aggregate and render AggregatorEvents that are
// emitted as part of its tracing spans' recording.
type Aggregator struct {

consider naming this TracingEventAggregator, or something more specific than simply Aggregator.


pkg/util/bulk/aggregator.go line 61 at r4 (raw file):

	// If this is the first AggregatorEvent with this tag, set it as a LazyTag on
	// the associated tracing span. This way the AggregatorEvent will be

nit: scratch the second sentence. Every user of a tag is not expected to explain what the point of tags are.


pkg/util/bulk/aggregator.go line 96 at r4 (raw file):

	sp := tracing.SpanFromContext(ctx)

	aggCtx, aggSpan := sp.Tracer().StartSpanCtx(ctx, aggregatorName,

use EnsureChildSpan, and then you don't need to deal with the parent.


pkg/util/bulk/aggregator.go line 97 at r4 (raw file):

	aggCtx, aggSpan := sp.Tracer().StartSpanCtx(ctx, aggregatorName,
		tracing.WithEventListeners([]tracing.EventListener{agg}), tracing.WithParent(sp))

let's make WithEventListener a variadic function (take ...EventListener)

This change introduces an Aggregator object that is capable
of listening for Structured events emitted in a recording,
aggregating them and rendering them as LazyTags.
We also introduce an AggregatorEvent interface that can be
implemented by a Structured event thereby making it eligible
for aggregation in the Aggregator.

The first user of the Aggregator will be every backup data
processor that is setup on the nodes in the cluster during
a backup. The Aggregator lives as long as the processor, and
listens for Aggregator events emitted by any span that is a child
of the processors' span. This includes both local children as
well as remote children whose recordings have been imported into
a local span. The Aggregator stores running aggregates of each
AggregatorEvent it is notified about, bucketed by the events'
tag name. This aggregate will be rendered on the tracing span as a LazyTag.

This change teaches every ExportRequest to emit an AggregatorEvent.
Going forward we expect many more operations in bulk jobs to
define and emit such events providing visibility into otherwise
opaque operations.

We cleanup some of the StructuredEvents that were previously
added but have not proved useful, and also change some of the
tracing span operation names to be more intuitive.

To view these aggregated events once can navigate to the
`/tracez` endpoint of the debug console to take a snapshot and search for either
`BACKUP` or the job ID to filter for tracing spans on that node.
The span associated with the backup processor will be decorated with
tags that correspond to the fields in the introduced `ExportStats`
proto message. Note these stats are only aggregated on a per node
basis.

Fixes: cockroachdb#80388

Release note: None
Copy link
Contributor Author

@adityamaru adityamaru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @benbardin, @dt, @rhu713, and @stevendanna)


pkg/ccl/backupccl/backuppb/backup.go line 176 at r2 (raw file):

Previously, stevendanna (Steven Danna) wrote…

We could log the type of other here.

Done.


pkg/util/bulk/aggregator.go line 23 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

given that there is only one type of event implementing this, is the interface worth it?

I'm bullish that there'll be more very soon. Especially in import/restore to surface AddSSTable, BulkAdder stats etc.


pkg/util/bulk/aggregator.go line 37 at r4 (raw file):

TracingEventAggregator
done.


pkg/util/bulk/aggregator.go line 96 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

use EnsureChildSpan, and then you don't need to deal with the parent.

Done.


pkg/util/bulk/aggregator.go line 97 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

let's make WithEventListener a variadic function (take ...EventListener)

Done.


pkg/util/bulk/aggregator_test.go line 60 at r2 (raw file):

Previously, stevendanna (Steven Danna) wrote…
	// have not imported the remote child's Recording.

or, I suppose,

	// have not imported the remote children's Recordings.

Done.

@adityamaru
Copy link
Contributor Author

TFTRs!

bors r+

@craig
Copy link
Contributor

craig bot commented Aug 17, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Aug 17, 2022

Build succeeded:

@craig craig bot merged commit 395cf17 into cockroachdb:master Aug 17, 2022
@adityamaru adityamaru deleted the backup-tracing-cleanup branch August 18, 2022 13:21
knz added a commit to knz/cockroach that referenced this pull request Sep 30, 2022
The "sending ExportRequest for span" log message was the 5th most
voluminous log event source in CC.

This commit makes it less verbose, by importing a one-line change from
another merged PR cockroachdb#84043 from v22.2.

Release note: None
knz added a commit to knz/cockroach that referenced this pull request Sep 30, 2022
The "sending ExportRequest for span" log message was the 5th most
voluminous log event source in CC.

This commit makes it less verbose, by importing a one-line change from
another merged PR cockroachdb#84043 from v22.2.

Release note: None
knz added a commit to knz/cockroach that referenced this pull request Sep 30, 2022
The "sending ExportRequest for span" log message was the 5th most
voluminous log event source in CC.

This commit makes it less verbose, by importing a one-line change from
another merged PR cockroachdb#84043 from v22.2.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bulk: implement a bulk aggregator that uses tracing to surface information about suboperations
6 participants