
Retry storage access on internal errors #180

Closed · wants to merge 12 commits

Conversation

@fische (Contributor) commented Aug 16, 2022

This revision adds a retryBucket struct that wraps the bucket interface to implement retry logic on top of it. We go through this logic only on internal errors.

We are implementing this following an increase in the number of internal errors coming from reads on GCP buckets. The only suggested solution is to retry when those errors occur: googleapis/google-cloud-go#784
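
For illustration, a minimal sketch of the kind of wrapper described above, assuming a simplified bucket interface and gocloud.dev error codes; the names and method set here are hypothetical, not the PR's actual code:

```go
package rpc

import (
	"context"
	"io"
	"time"

	"gocloud.dev/gcerrors"
)

// bucket is a hypothetical stand-in for the interface being wrapped.
type bucket interface {
	NewReader(ctx context.Context, key string) (io.ReadCloser, error)
}

// retryBucket retries calls to the underlying bucket, but only when the
// error is classified as internal.
type retryBucket struct {
	bucket
	maxAttempts int
	backoff     time.Duration
}

func (r *retryBucket) NewReader(ctx context.Context, key string) (io.ReadCloser, error) {
	var rc io.ReadCloser
	err := r.retry(ctx, func() error {
		var err error
		rc, err = r.bucket.NewReader(ctx, key)
		return err
	})
	return rc, err
}

// retry runs f, retrying with a fixed backoff on internal errors only.
func (r *retryBucket) retry(ctx context.Context, f func() error) error {
	var err error
	for i := 0; i < r.maxAttempts; i++ {
		if err = f(); err == nil || gcerrors.Code(err) != gcerrors.Internal {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(r.backoff):
		}
	}
	return err
}
```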

@fische self-assigned this Aug 16, 2022
@fische marked this pull request as ready for review August 16, 2022 14:20
Review comments on elan/rpc/retry.go (outdated, resolved)
@peterebden (Member) left a comment

Looks good, but this doesn't handle the case where mettle workers go directly to storage (that probably wants a retryable wrapper around the 'clients' in this package too).

It also won't retry if you get an initial reader or writer successfully but it fails later. Do we know if that happens or if it's generally upfront that it fails?

@peterebden (Member)

You might also want to rebase on master - #181 eventually got a green build out of golangci.

@fische (Contributor, Author) commented Aug 17, 2022

> Looks good, but this doesn't handle the case where mettle workers go directly to storage (that probably wants a retryable wrapper around the 'clients' in this package too).

Do you mean when we use the elanClient instead of the remoteClient implementation? If that's what you mean, it's already using this wrapper as New calls createServer.

> It also won't retry if you get an initial reader or writer successfully but it fails later. Do we know if that happens or if it's generally upfront that it fails?

That's a very good point. There's no indication that this is only happening when initialising the reader/writers, so I'll add retries for that as well.
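
A hedged sketch of one way such mid-stream read failures could be retried by resuming from the last offset, using gocloud.dev/blob directly; the type and field names are illustrative, not what this PR ended up doing:

```go
package rpc

import (
	"context"
	"io"

	"gocloud.dev/blob"
	"gocloud.dev/gcerrors"
)

// resumingReader re-opens the object at the current offset when a read fails
// with an internal error, up to maxAttempts consecutive failures.
type resumingReader struct {
	ctx         context.Context
	bucket      *blob.Bucket
	key         string
	offset      int64
	maxAttempts int
	r           io.ReadCloser
}

func (rr *resumingReader) Read(p []byte) (int, error) {
	var lastErr error
	for attempt := 0; attempt < rr.maxAttempts; attempt++ {
		if rr.r == nil {
			// Re-open from where we left off; -1 means read to the end.
			r, err := rr.bucket.NewRangeReader(rr.ctx, rr.key, rr.offset, -1, nil)
			if err != nil {
				if gcerrors.Code(err) == gcerrors.Internal {
					lastErr = err
					continue
				}
				return 0, err
			}
			rr.r = r
		}
		n, err := rr.r.Read(p)
		rr.offset += int64(n)
		if err == nil || err == io.EOF || gcerrors.Code(err) != gcerrors.Internal {
			return n, err
		}
		// Internal error mid-read: drop the reader and resume from the new offset.
		lastErr = err
		rr.r.Close()
		rr.r = nil
		if n > 0 {
			return n, nil // hand back what we got; the next Read resumes
		}
	}
	return 0, lastErr
}

func (rr *resumingReader) Close() error {
	if rr.r != nil {
		return rr.r.Close()
	}
	return nil
}
```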

@fische (Contributor, Author) commented Aug 17, 2022

Abandoning this PR, as we can just use the retry logic from the GCS client; we just need to update the storage package for that.
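
For reference, a rough sketch of how the GCS Go client's built-in retries can be configured via cloud.google.com/go/storage's Retryer options; the bucket name and backoff values are illustrative, and how this gets wired through the storage package used here is not shown:

```go
package main

import (
	"context"
	"log"
	"time"

	"cloud.google.com/go/storage"
	"github.com/googleapis/gax-go/v2"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Retry all operations (including non-idempotent ones) on transient
	// errors, with exponential backoff. The bucket name is a placeholder.
	bkt := client.Bucket("example-bucket").Retryer(
		storage.WithBackoff(gax.Backoff{
			Initial:    200 * time.Millisecond,
			Max:        5 * time.Second,
			Multiplier: 2,
		}),
		storage.WithPolicy(storage.RetryAlways),
	)
	_ = bkt // readers and writers created from bkt inherit this retry config
}
```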

@fische closed this Aug 17, 2022
@fische deleted the retry-elan-bucket branch August 17, 2022 13:40