Handle Partial Success in ConsumeTraces #2571

Merged
merged 73 commits into grafana:main on Jan 4, 2024

Conversation

ie-pham
Collaborator

@ie-pham ie-pham commented Jun 19, 2023

What this PR does:

  1. Right now, if a single trace in the batch triggers an error, Tempo drops the entire batch. This PR gives every trace in the batch a chance to be pushed before a response is returned to the client.

  2. Following internal discussion, we are going to match what is required here for partial success and return a 200 instead of an error.

The tricky part was tallying up the discarded span count.

If the replication factor is 3, then the minimum number of successes is 2 (replication factor / 2 + 1) and the maximum allowable number of errors is 1. So initially I thought simple logic would account for this: tally up all the responses, and if a trace is marked as "error" twice, consider that trace discarded. However, this approach only works when the number of responses equals the replication factor, and that is not the case. We believe the ring.DoBatch function returns as soon as 2 errors or 2 successes are recorded, without waiting for the third response (a partial success is still considered a success). In a perfect world where those two responses are exactly the same, we would not have a problem. However, if the lists of discarded traces differ between the two responses, it is hard to determine accurately whether a trace was discarded without the third response.

Decision:
Since the discarded span count currently does not play an important role in Tempo's functionality, I am going with the "over-counting" solution.

Since the number of responses is always at least the minimum number of successes required:

	// max number of error allowed is {number of responses} - {minimum success required}

So if the replication factor is 3 and we get only 2 responses (with a minimum success requirement of 2), the maximum number of errors allowed is 0: if a trace is marked "error" even once, it is considered discarded. However, for a replication factor of 5 with a minimum success of 3, if we only get 4 responses then, by the same rule, the maximum number of errors allowed is 1, not 0.

This solution has drawbacks because we may overcount the number of discarded spans. For example, if we have one partial success and one total success, then without the third response there is a chance that whatever trace failed in the partial success was actually processed successfully in the third response, but we will still mark it as discarded.
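
For illustration, here is a minimal Go sketch of the over-counting tally described above. The function and variable names (discardedTraces, errCountByTrace, maxAllowedErrors) are hypothetical, not Tempo's actual code:

    // Sketch of the over-counting rule: a trace counts as discarded if it
    // errored in more responses than we can tolerate, given how many
    // responses actually came back.
    package main

    import "fmt"

    func discardedTraces(errCountByTrace map[string]int, numResponses, minSuccess int) []string {
        // max number of errors allowed is {number of responses} - {minimum success required}
        maxAllowedErrors := numResponses - minSuccess

        var discarded []string
        for traceID, errCount := range errCountByTrace {
            if errCount > maxAllowedErrors {
                discarded = append(discarded, traceID)
            }
        }
        return discarded
    }

    func main() {
        // Replication factor 3 => minimum success = 3/2 + 1 = 2.
        // Only 2 responses came back, so no errors are tolerated.
        errCounts := map[string]int{"trace-a": 1, "trace-b": 0}
        fmt.Println(discardedTraces(errCounts, 2, 2)) // [trace-a] (possibly over-counted)
    }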

One other note: I am adding a mutex lock to collect the responses from the ingesters. This will need to be tested to make sure it does not impact performance too much.
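
A rough sketch, with hypothetical type and field names, of what that mutex-protected collection could look like; since the pushes to different ingesters happen concurrently, the shared slice needs the lock:

    package distributor

    import "sync"

    // pushResult is a hypothetical per-ingester result: which traces in the
    // batch errored and why.
    type pushResult struct {
        traceErrors map[int]error // keyed by the trace's index within the batch
    }

    // resultCollector gathers responses from concurrent pushes under a mutex.
    type resultCollector struct {
        mtx     sync.Mutex
        results []pushResult
    }

    func (c *resultCollector) add(r pushResult) {
        c.mtx.Lock()
        defer c.mtx.Unlock()
        c.results = append(c.results, r)
    }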

Which issue(s) this PR fixes:
Fixes #1957

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@ie-pham
Collaborator Author

ie-pham commented Jun 30, 2023

So the logic works, but I'm just not sure whether it's too convoluted a way to record the correct discarded span count by reason.

@ie-pham ie-pham marked this pull request as ready for review June 30, 2023 14:30
Member

@joe-elliott joe-elliott left a comment


I question the efforts to perfectly record dropped spans for each reason. I'm wondering if a better approach would be to change PushResponse to return trace indexes and rejection reasons back to the distributor. The distributor could then update the appropriate metrics based on the response.
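
For context, a hypothetical Go rendering of that idea (the field names and placement are assumptions, not the actual tempopb definitions):

    package tempopb

    // RejectedTrace reports one rejected trace back to the distributor:
    // its index within the pushed batch and the reason it was refused.
    type RejectedTrace struct {
        BatchIndex int32
        Reason     string // e.g. a rate-limit or trace-too-large reason
    }

    // PushResponse would then carry the rejections so the distributor can
    // increment the matching discarded-span metrics per reason.
    type PushResponse struct {
        Rejected []RejectedTrace
    }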

Resolved review threads on: integration/util.go, integration/e2e/limits_test.go, modules/distributor/distributor.go, pkg/tempopb/tempo.proto, modules/ingester/instance.go
@ie-pham ie-pham requested a review from joe-elliott July 6, 2023 13:59
@ie-pham ie-pham changed the title from Fix1957 to Handle Partial Success in ConsumeTraces on Jul 24, 2023
Member

@joe-elliott joe-elliott left a comment


I have mainly reviewed the distributor part, and I think we need to make some changes there, so I'm not going to dig deep into the ingester part yet.

Overall, this is really heading in the right direction. There are a lot of comments and questions because this is a very performance-sensitive part of the code and a very nuanced change.

Also, I think this invalidates #1958, since we are just going to start returning 200 on dropped spans. Can you confirm?

Resolved review threads on: CHANGELOG.md, integration/e2e/limits_test.go, modules/distributor/distributor.go, modules/ingester/ingester.go
@ie-pham ie-pham requested a review from joe-elliott July 27, 2023 21:29
Resolved review threads on: CHANGELOG.md, modules/distributor/distributor.go, modules/ingester/ingester.go, modules/ingester/instance.go
Member

@joe-elliott joe-elliott left a comment


Nice addition!

@ie-pham ie-pham merged commit 306fdd7 into grafana:main Jan 4, 2024
14 checks passed
@ie-pham ie-pham mentioned this pull request Jan 15, 2024
@adugar-conga

adugar-conga commented Feb 7, 2024

@ie-pham @joe-elliott In which release version of Grafana Tempo will this be addressed?

@joe-elliott
Member

This feature is merged into main and will be in the next release of Tempo. We aim for a 3-month cadence on OSS releases but make no guarantees.

Releases are listed here: https://github.com/grafana/tempo/releases/

Development

Successfully merging this pull request may close these issues.

Complete writing a batch even if some spans refused
5 participants