
filter out stale spans from metrics generator #1612

Merged: 11 commits merged into grafana:main on Sep 12, 2022

Conversation

@ie-pham (Contributor) commented on Aug 1, 2022:

What this PR does: This PR adds a configurable option, "metrics_ingestion_time_range_slack", under the metrics-generator config to filter out any span older than this duration before metrics are aggregated. The current default is 30s.

Which issue(s) this PR fixes:
Fixes #1537
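
For illustration, a minimal sketch of the filtering predicate described above, assuming the slack is a duration compared against each span's end timestamp in Unix nanoseconds; the identifiers are placeholders, not the exact names used in the PR:

    package sketch

    import "time"

    // spanWithinSlack reports whether a span's end time falls inside the accepted
    // ingestion window [now - slack, now]. Spans older than the slack are dropped
    // before metrics are aggregated.
    func spanWithinSlack(endTimeUnixNano uint64, now time.Time, slack time.Duration) bool {
        cutoff := now.Add(-slack).UnixNano()
        return endTimeUnixNano >= uint64(cutoff)
    }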

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@ie-pham marked this pull request as ready for review on August 1, 2022 19:24
@kvrhdn (Member) left a comment:

This looks good! I've left a bunch of comments; no major stuff, just some suggestions/nitpicks.

modules/generator/config.go (outdated; resolved)
@@ -261,14 +266,27 @@ func (i *instance) pushSpans(ctx context.Context, req *tempopb.PushSpansRequest)
func (i *instance) updatePushMetrics(req *tempopb.PushSpansRequest) {
Member commented:

I suggest renaming this method since its scope has changed. It used to only read the spans, but now it will also modify the received request. Maybe change it to something like preprocessSpans? (Better suggestions are welcome.)

Member commented:

Agree that the function needs to be renamed if it's going to mutate the contents of the push request.

Side question: currently the processor interface takes a complete PushSpansRequest. Can we change that to take individual spans? Then we could avoid the potentially costly slice reallocations being done here.

Member commented:

> Side question: currently the processor interface takes a complete PushSpansRequest. Can we change that to take individual spans? Then we could avoid the potentially costly slice reallocations being done here.

Yeah, that should be possible. Both the span metrics and the service graphs processors loop through the batches anyway and process spans one by one.
It will be a bit tricky to deal with the resource attributes, though. We currently extract some data from the resource attributes before looping through the instrumentation library spans individually.

Member commented:

Oh, that is kind of gross. We could nil out the span pointers and make it an expectation of the processors that some spans may be nil.

I think ideally we leave the processors alone and do some clever in-place manipulation of the slices to remove the "bad" spans. This logic could get rather gross, though.
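
For illustration, a minimal sketch of the in-place approach suggested here: filter the slice while reusing its backing array so no reallocation happens. This is a generic sketch (Go 1.18+), not the code the PR ended up with:

    package sketch

    // filterInPlace keeps only the elements for which keep returns true, reusing
    // the slice's backing array so no new allocation is made. The tail is nil'd
    // out so dropped elements (e.g. stale span pointers) can be garbage collected.
    func filterInPlace[T any](s []*T, keep func(*T) bool) []*T {
        kept := s[:0]
        for _, e := range s {
            if keep(e) {
                kept = append(kept, e)
            }
        }
        for i := len(kept); i < len(s); i++ {
            s[i] = nil
        }
        return kept
    }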

@ie-pham (Contributor, Author) commented:

I could filter out the outdated spans inside the aggregateMetricsForSpan function and the consume function for the service graph, but then we wouldn't be able to keep track of the number of spans dropped in this situation.

Member commented:

But that would require every processor to implement the same logic, which will lead to duplicated work and code. I'd be interested to see how many spans we drop in practice; if only a small number of batches have to be reallocated, the impact will be fine. If we are constantly dropping spans that are too old, we might have to re-evaluate.

If we need to get rid of these reallocations we could change the interface of Processor so it passes the resource attributes of the batch next to each span.
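
Purely as an illustration of the idea floated in the two comments above, and not Tempo's actual Processor interface, a per-span variant could hand each span to the processor together with its batch's resource attributes; all type names below are placeholders:

    package sketch

    import "context"

    // Resource and Span stand in for the OTLP resource/span protos.
    type Resource struct{}
    type Span struct{}

    // SpanProcessor is a hypothetical per-span processor interface: each span is
    // delivered together with the resource attributes of the batch it came from,
    // so callers never need to rebuild span slices and per-batch data stays
    // available to the processor.
    type SpanProcessor interface {
        PushSpan(ctx context.Context, res *Resource, span *Span)
    }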

modules/generator/instance.go (outdated; resolved)
integration/e2e/metrics_generator_test.go (outdated; resolved)
@@ -44,6 +45,11 @@ var (
Name: "metrics_generator_bytes_received_total",
Help: "The total number of proto bytes received per tenant",
}, []string{"tenant"})
metricSpansDiscarded = promauto.NewCounterVec(prometheus.CounterOpts{
Namespace: "tempo",
Name: "metrics_generator_spans_discarded_total",
Member commented:

I'd consider renaming it to metrics_generator_discarded_spans_total just to make it similar to this other metric: https://github.com/grafana/tempo/blob/main/modules/overrides/discarded_spans.go#L12. Not a big deal though; both should show up in Grafana 🤷🏻

I think it's good to have separate metrics since a span discarded in the metrics-generator is very different from a span discarded in the ingester/compactor.

@ie-pham (Contributor, Author) commented:

Hmm, should we make it similar to the other discarded-spans metric name, or should we keep it consistent with the other metrics in the same space?
https://github.com/grafana/tempo/blob/main/modules/generator/instance.go#L39

Member commented:

Oh I see 🙃 Err, either is fine I guess? Maybe a slight preference for keeping it consistent with the other tempo_metrics_generator_ metrics then.

Naming is hard 😅

Comment on lines 35 to 36
// setting default for max span age before discarding to 30 sec
cfg.MaxSpanAge = 30
Member commented:

I'm curious how this default behaves in practice. I honestly have no clue what a typical latency is between span creation and ingestion by Tempo.

@ie-pham (Contributor, Author) commented:

[Screenshots: span ingestion latency on the ops cluster, captured Aug 24, 2022]
These are the ingestion latency numbers on ops for a few days. Do we think setting it to 30s is right, or is it too aggressive? @kvrhdn @joe-elliott

Member commented:

30s as a default looks good to me. It seems this should include 99% of the data while excluding the 1% that is lagging behind.
It's also configurable, so other people can switch it up.

modules/generator/config.go (outdated; resolved)
modules/generator/instance.go (outdated; resolved)
var newSpansArr []*v1.Span
timeNow := time.Now().UnixNano()
for _, span := range ils.Spans {
    if span.EndTimeUnixNano >= uint64(timeNow-i.cfg.MaxSpanAge*1000000000) {
Member commented:

Both sides of this time range need to be checked. If the user sends a span that's 5 days in the future, it should not impact metrics.

We have similar code in the WAL:

func (a *AppendBlock) adjustTimeRangeForSlack(start uint32, end uint32, additionalStartSlack time.Duration) (uint32, uint32) {

@ie-pham (Contributor, Author) commented:

Is there a reason why we pick 5 days?

Member commented:

I think 5 days was just an example. I think we can start with a symmetrical time range, i.e. use the same duration before and after time.Now(). If this doesn't work well, we could still break it out into two config options.
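
A minimal sketch of the symmetric check being proposed, applying the same slack on both sides of time.Now(); the identifiers are illustrative, not the PR's final code:

    package sketch

    import "time"

    // withinIngestionSlack reports whether a span's end time lies within
    // [now - slack, now + slack], rejecting both stale spans and spans whose
    // timestamps are too far in the future.
    func withinIngestionSlack(endTimeUnixNano uint64, now time.Time, slack time.Duration) bool {
        lower := now.Add(-slack).UnixNano()
        upper := now.Add(slack).UnixNano()
        return endTimeUnixNano >= uint64(lower) && endTimeUnixNano <= uint64(upper)
    }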

modules/generator/config.go (outdated; resolved)

@knylander-grafana (Contributor) left a comment:

Doc updates look good.

@ie-pham requested review from kvrhdn and joe-elliott and removed the review request for KMiller-Grafana on September 9, 2022 17:49
@kvrhdn (Member) left a comment:

Nice work!

@kvrhdn merged commit 9a135a9 into grafana:main on Sep 12, 2022
@ie-pham deleted the jennie/1537 branch on March 17, 2023 17:31

Successfully merging this pull request may close these issues.

Metrics Generator: Add configurable ingestion slack time
4 participants