Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve poller missing index error condition #3223

Closed
wants to merge 6 commits into from

Conversation

zalegrala
Copy link
Contributor

@zalegrala zalegrala commented Dec 11, 2023

What this PR does:

Here we handle an error condition for a missing tenant index on non-index-builder pollers. This avoids unnecessary work when a tenant has been deleted, but remains in the tenant list due to the behavior described in #2754.

For non-index-builders:

  • a missing index will skip the tenant
  • a corrupted index will poll the tenant if fallback is enabled
  • a corrupted index will not poll the tenant if fallback is disabled

This change resulted in the compactor being required as part of some tests. Previously, if the index was not available, all pollers would attempt to build and write the index.

Which issue(s) this PR fixes:
Related #2754

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@zalegrala zalegrala force-pushed the pollerIndexErrNotFound branch 6 times, most recently from 83add0c to 764f8fc Compare December 15, 2023 14:43
@zalegrala zalegrala marked this pull request as ready for review December 15, 2023 19:25
tempoDistributor := util.NewTempoDistributor()
tempoQueryFrontend := util.NewTempoQueryFrontend()
tempoQuerier := util.NewTempoQuerier()
require.NoError(t, s.StartAndWaitReady(tempoIngester1, tempoIngester2, tempoIngester3, tempoDistributor, tempoQueryFrontend, tempoQuerier))
require.NoError(t, s.StartAndWaitReady(tempoIngester1, tempoIngester2, tempoIngester3, tempoDistributor, tempoQueryFrontend, tempoQuerier, tempoCompactor))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this will just test that the compactor starts? did you intend to add this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compactor needs to start in the tests now, since without it, the index won't get written.

@@ -223,12 +235,13 @@ func (p *Poller) pollTenantAndCreateIndex(

// polling fallback is true, log the error and continue in this method to completely poll the backend
level.Error(p.logger).Log("msg", "failed to pull bucket index for tenant. falling back to polling", "tenant", tenantID, "err", err)
} else {
metricTenantIndexBuilder.WithLabelValues(tenantID).Set(1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any particular reason to move this? is this to distinguish between a "true" tenant index builder and a compactor that's just falling through?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to avoid setting the builder to true when the poller doesn't own the writing. I think this is what we intend. Or are you thinking that even if not configured to be the writer, if we fall though and end up writing, that we should metric as such?

if !builder {
metricTenantIndexBuilder.WithLabelValues(tenantID).Set(0)

i, err := p.reader.TenantIndex(derivedCtx, tenantID)
err = p.tenantIndexPollError(i, err)

if errors.Is(err, backend.ErrDoesNotExist) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the negative here is if we manually delete a tenant index to force recreation then only the assigned tenant index builders will attempt to recreate. which means that certain queriers will be unaware of any blocks in the backend for a tenant until it gets recreated (multiple minutes).

instead we could

  1. log info/debug "no tenant index found. falling through to polling"
  2. falling through to the blocklist poll below
  3. if the full blocklist list call returns an empty list log a warning. "no blocks found for tenant that should exist"
  4. raise the metric tempod_blocklist_poll_errors_total to build alerts off of/warn the operator to clean up partial blocks

wdyt?

Copy link
Contributor Author

@zalegrala zalegrala Dec 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to avoid polling in the case of a missing index for non-builders. The only signal that we have currently that we wish to delete the tenant, is to delete the index. If you wanted to force a recreation of the index, you could corrupt the index, which would cause the poller to poll and recreate it.

@zalegrala
Copy link
Contributor Author

Closing for non-response.

@zalegrala zalegrala closed this Mar 14, 2024
zalegrala added a commit to zalegrala/tempo that referenced this pull request Apr 10, 2024
The tenant index deletion was originally put in as TCO win, but did not
have the desired effect and surfaced other issues in the system.

Related grafana#2678
Related grafana#2754
Related grafana#2781
Related grafana#2878
Related grafana#3115
Related grafana#3223

Due to the number of issues here, and causing considerable noise on the
pager, perhaps the right thing to do is back out the tenant deletion.

Raising here for discussion.
@zalegrala zalegrala mentioned this pull request Apr 10, 2024
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants