Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default retry settings result in too-frequent retries for Resource Exhausted errors #12328

Closed
amfisher-404 opened this issue Aug 7, 2023 · 3 comments · Fixed by #12306
Closed
Assignees
Labels
priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@amfisher-404
Copy link

With the given initial retry of 0.1 sec, multiplier of 1.3, it will take over 18 attempts before the retry interval is > 10s.

The backoff should be more aggressive for RPCs that could retry ResourceExhausted. In the C++ library, that is all RPCs: https://github.com/googleapis/google-cloud-cpp/blob/main/google/cloud/pubsub/retry_policy.h#L29-L37

@amfisher-404 amfisher-404 added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Aug 7, 2023
@coryan
Copy link
Contributor

coryan commented Aug 7, 2023

With the given initial retry of 0.1 sec, multiplier of 1.3, it will take over 18 attempts before the retry interval is > 10s.

I am sure an initial backoff of 0.1s and a 1.3 multiplier is wrong in some cases. But I am not sure why 0.1s and 4.0 is a better set of default parameters. That is likely to be wrong in other cases too. Without some criteria to evaluate which values are "better", they all seem equally arbitrary.

I am also not sure why 10s is an important threshold. Why not 5s? Or 7.654321s?


Without a principled approach to decide what these parameters should be, I am inclined to use the same defaults we use in all the other libraries (1s and 2.0). If nothing else for consistency within the C++ SDK.

@dbolduc
Copy link
Member

dbolduc commented Aug 7, 2023

(Just discussing)

If the server knows better about how long a client should wait (e.g. in the resource exhausted case), it can attach a google.rpc.RetryInfo: https://github.com/googleapis/googleapis/blob/353f09793b7a00acbe9e1b4fb8997fe533038ca4/google/rpc/error_details.proto#L91-L94

Of course, the C++ clients do not respect RetryInfo at the moment.... but adding that support could be an alternative to changing the default retry behavior for all transient errors for all idempotent pubsub RPCs.

@amfisher-404
Copy link
Author

That's a good point. The issue I'm really trying to get at is excessive retries of RESOURCE_EXHAUSTED, which can harm the service. RESOURCE_EXHAUSTED is considered a retryable code by this library, even though Google API standards categorize it as "generally nonretryable": https://google.aip.dev/194#generally-non-retryable-codes. Changing the categorization of this error in the client library would be a breaking change, since clients would have to start handling RESOURCE_EXHAUSTED errors directly.

If OnePlatform APIs included RetryInfo in their quota-exceeded rejections, that could certainly address the issue in a non-breaking way. In the meantime, I'm hoping to make backoff modifications, to address a longstanding issue that has affected the Pub/Sub service.

Note also that changing the multiplier will only affect a small quantity of RPCs that fail more than twice in a row, since the first retry interval is unchanged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants