
filter out stale spans from metrics generator #1612

Merged: 11 commits merged into grafana:main on Sep 12, 2022

Conversation

@ie-pham (Contributor) commented on Aug 1, 2022:

What this PR does: This PR adds a configurable option, "metrics_ingestion_time_range_slack", under the metrics-generator config to filter out any span older than this duration before metrics are aggregated. The current default is 30s.

Which issue(s) this PR fixes:
Fixes #1537
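
For illustration, a minimal sketch of the filtering predicate described above, assuming the slack is a duration compared against each span's end timestamp in Unix nanoseconds; the identifiers are placeholders, not the exact names used in the PR:

    package sketch

    import "time"

    // spanWithinSlack reports whether a span's end time falls inside the accepted
    // ingestion window [now - slack, now]. Spans older than the slack are dropped
    // before metrics are aggregated.
    func spanWithinSlack(endTimeUnixNano uint64, now time.Time, slack time.Duration) bool {
        cutoff := now.Add(-slack).UnixNano()
        return endTimeUnixNano >= uint64(cutoff)
    }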

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@ie-pham marked this pull request as ready for review on August 1, 2022 19:24
@kvrhdn (Member) left a comment:

This looks good! I've left a bunch of comments; no major stuff, just some suggestions/nitpicks.

modules/generator/config.go (outdated; resolved)
@@ -261,14 +266,27 @@ func (i *instance) pushSpans(ctx context.Context, req *tempopb.PushSpansRequest)
func (i *instance) updatePushMetrics(req *tempopb.PushSpansRequest) {
Member commented:

I suggest renaming this method since its scope has changed. It used to only read the spans, but now it will also modify the received request. Maybe change it to something like preprocessSpans? (Better suggestions are welcome.)

Member commented:

Agree that the function needs to be renamed if it's going to mutate the contents of the push request.

Side question: currently the processor interface takes a complete PushSpansRequest. Can we change that to take individual spans? Then we could avoid the potentially costly slice reallocations being done here.

Member commented:

> Side question: currently the processor interface takes a complete PushSpansRequest. Can we change that to take individual spans? Then we could avoid the potentially costly slice reallocations being done here.

Yeah, that should be possible. Both the span metrics and the service graphs processors loop through the batches anyway and process spans one by one.
It will be a bit tricky to deal with the resource attributes, though. We currently extract some data from the resource attributes before looping through the instrumentation library spans individually.

Member commented:

Oh, that is kind of gross. We could nil out the span pointers and make it an expectation of the processors that some spans may be nil.

I think ideally we leave the processors alone and do some clever in-place manipulation of the slices to remove the "bad" spans. This logic could get rather gross, though.
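
For illustration, a minimal sketch of the in-place approach suggested here: filter the slice while reusing its backing array so no reallocation happens. This is a generic sketch (Go 1.18+), not the code the PR ended up with:

    package sketch

    // filterInPlace keeps only the elements for which keep returns true, reusing
    // the slice's backing array so no new allocation is made. The tail is nil'd
    // out so dropped elements (e.g. stale span pointers) can be garbage collected.
    func filterInPlace[T any](s []*T, keep func(*T) bool) []*T {
        kept := s[:0]
        for _, e := range s {
            if keep(e) {
                kept = append(kept, e)
            }
        }
        for i := len(kept); i < len(s); i++ {
            s[i] = nil
        }
        return kept
    }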

@ie-pham (Contributor, Author) commented:

I could filter out the outdated spans inside the aggregateMetricsForSpan function and the consume function for the service graph, but then we wouldn't be able to keep track of the number of spans dropped in this situation.

Member commented:

But that would require every processor to implement the same logic, which will lead to duplicated work and code. I'd be interested to see how many spans we drop in practice; if only a small number of batches have to be reallocated, the impact will be fine. If we are constantly dropping spans that are too old, we might have to re-evaluate.

If we need to get rid of these reallocations we could change the interface of Processor so it passes the resource attributes of the batch next to each span.
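
Purely as an illustration of the idea floated in the two comments above, and not Tempo's actual Processor interface, a per-span variant could hand each span to the processor together with its batch's resource attributes; all type names below are placeholders:

    package sketch

    import "context"

    // Resource and Span stand in for the OTLP resource/span protos.
    type Resource struct{}
    type Span struct{}

    // SpanProcessor is a hypothetical per-span processor interface: each span is
    // delivered together with the resource attributes of the batch it came from,
    // so callers never need to rebuild span slices and per-batch data stays
    // available to the processor.
    type SpanProcessor interface {
        PushSpan(ctx context.Context, res *Resource, span *Span)
    }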

modules/generator/instance.go (outdated; resolved)
integration/e2e/metrics_generator_test.go (outdated; resolved)
@@ -44,6 +45,11 @@ var (
Name: "metrics_generator_bytes_received_total",
Help: "The total number of proto bytes received per tenant",
}, []string{"tenant"})
metricSpansDiscarded = promauto.NewCounterVec(prometheus.CounterOpts{
Namespace: "tempo",
Name: "metrics_generator_spans_discarded_total",
Member commented:

I'd consider renaming it to metrics_generator_discarded_spans_total just to make it similar to this other metric: https://github.com/grafana/tempo/blob/main/modules/overrides/discarded_spans.go#L12. Not a big deal though; both should show up in Grafana 🤷🏻

I think it's good to have separate metrics since a span discarded in the metrics-generator is very different from a span discarded in the ingester/compactor.

@ie-pham (Contributor, Author) commented:

Hmm, should we make it similar to the other discarded-spans metric name, or should we keep it consistent with the other metrics in the same space?
https://github.com/grafana/tempo/blob/main/modules/generator/instance.go#L39

Member commented:

Oh I see 🙃 Err, either is fine I guess? Maybe a slight preference for keeping it consistent with the other tempo_metrics_generator_ metrics then.

Naming is hard 😅

Comment on lines 35 to 36
// setting default for max span age before discarding to 30 sec
cfg.MaxSpanAge = 30
Member commented:

I'm curious how this default behaves in practice. I honestly have no clue what a typical latency is between span creation and ingestion by Tempo.

@ie-pham (Contributor, Author) commented:

[Screenshots: span ingestion latency on the ops cluster, captured Aug 24, 2022]
These are the ingestion latency numbers on ops for a few days. Do we think setting it to 30s is right, or is it too aggressive? @kvrhdn @joe-elliott

Member commented:

30s as a default looks good to me. It seems this should include 99% of the data while excluding the 1% that is lagging behind.
It's also configurable, so other people can switch it up.

modules/generator/config.go (outdated; resolved)
modules/generator/instance.go (outdated; resolved)
var newSpansArr []*v1.Span
timeNow := time.Now().UnixNano()
for _, span := range ils.Spans {
    if span.EndTimeUnixNano >= uint64(timeNow-i.cfg.MaxSpanAge*1000000000) {
Member commented:

Both sides of this time range need to be checked. If the user sends a span that's 5 days in the future, it should not impact metrics.

We have similar code in the WAL:

func (a *AppendBlock) adjustTimeRangeForSlack(start uint32, end uint32, additionalStartSlack time.Duration) (uint32, uint32) {

@ie-pham (Contributor, Author) commented:

Is there a reason why we pick 5 days?

Member commented:

I think 5 days was just an example. I think we can start with a symmetrical time range, i.e. use the same duration before and after time.Now(). If this doesn't work well, we could still break it out into two config options.
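
A minimal sketch of the symmetric check being proposed, applying the same slack on both sides of time.Now(); the identifiers are illustrative, not the PR's final code:

    package sketch

    import "time"

    // withinIngestionSlack reports whether a span's end time lies within
    // [now - slack, now + slack], rejecting both stale spans and spans whose
    // timestamps are too far in the future.
    func withinIngestionSlack(endTimeUnixNano uint64, now time.Time, slack time.Duration) bool {
        lower := now.Add(-slack).UnixNano()
        upper := now.Add(slack).UnixNano()
        return endTimeUnixNano >= uint64(lower) && endTimeUnixNano <= uint64(upper)
    }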

modules/generator/config.go (outdated; resolved)

@knylander-grafana (Contributor) left a comment:

Doc updates look good.

@ie-pham requested review from kvrhdn and joe-elliott and removed the review request for KMiller-Grafana on September 9, 2022 17:49
@kvrhdn (Member) left a comment:

Nice work!

@kvrhdn merged commit 9a135a9 into grafana:main on Sep 12, 2022
@ie-pham deleted the jennie/1537 branch on March 17, 2023 17:31

Successfully merging this pull request may close these issues.

Metrics Generator: Add configurable ingestion slack time
4 participants