Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix ketama quorum #5910
Fix ketama quorum #5910
Changes from 1 commit
572bddf
b002700
d83a624
a08ba98
f4e8ff0
fb91d51
4c870b4
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this variable named
er
receive a better name? I have no clue what aner
is and it's easy to mistake it forerr
and evenendpointReplica
(often variables of this type have the nameer
, which is something else I think we have to slowly move away from).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, I renamed this variable to
writeTarget
for clarity.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could
ec
receive a better name? It's used many times in the next hundreds of lines and the name isn't clear. Suggestions:errorChannel
, if that's even what is actually is. 😅There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to
responses
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment here about the
er
variable name, which also applies to other occurrences: it gives no clue of what it is in this context and can be easily confused witherr
. Could we rename it? Some suggestions:replicationKey
,replicaKey
,replicationID
,endpointReplica
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'll also have to change how we determine the HTTP error we return to client, when this
cause
error bubbles up back tohandleRequest
. Right now, we'll return error that occurs the most or the original multi error, since we use threshold1
. But this might be incorrect, as if acause
error for any individual series replications will be server error, we have to retry the whole request. I think solution would be:cause
errors is an unknown error / unavailable / not ready (cases when we have to retry). Tricky but less important part here might be exactly which error to return if we have a mixed bag of server errors - the behavior of client should be same though regardless of the error message we decide to return.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thing we should be mindful of here is when
cause
will return the original multi-error (and same actually above for theif
branch), we are putting a multi-error inside of theerrs
multi-error, which can lead to erroneous5xx
as described in #5407 (comment) .There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you are correct. However I wonder if we already have this issue in main because we calculate the top-level cause the same way using
threshold=1
. So if we have 2 batches with conflict and 1 batch with server error, we will return conflict to the user and not retry the request.In any case, I would also prefer to solve this problem now since it can lead to data loss.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I am not sure about is what should the error code be when we try to replicate a series and we get one success, one server error and one client error. Right now I believe we return client-error, but if we change the rules, we would return a server error. It also means that in case of 2 conflicts (samples already exist in TSDB) and 1 server error, we would still return a server error even though that might not be necessary.
Maybe for replicating an individual series we can treat client-errors as success and only return 5xx when two replicas fail. For the overall response, we can return 5xx if any series has a 5xx.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I believe we basically have to treat conflict as if we have 'success'. It's just important to return the correct status to the upstream so if we have any conflicts in the replication, we'll want to return this to the client. Otherwise 5xx and OK should be clear (5xx if any series fails quorum; OK if no failed quorums or conflicts).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I think the
MultiError
anddetermineWriteErrorCause
are not good abstractions for this. ThedetermineWriteErrorCause
function is overloaded and tries to determine the error for both cases.Because of this, I added two error types
writeErrors
, andreplicationErrors
with their own ownCause()
methods. ThewriteErrors
cause prioritizes server errors, while the one fromreplicationErrors
is mostly identical todetermineWriteErrorCause
and is used for determining the error of replicating a single series.This way we always use the
Cause
method and depending on the error type we will bubble the appropriate error.