Default retry settings result in too-frequent retries for Resource Exhausted errors #12328

amfisher-404 · 2023-08-07T16:22:54Z

With the given initial retry of 0.1 sec, multiplier of 1.3, it will take over 18 attempts before the retry interval is > 10s.

The backoff should be more aggressive for RPCs that could retry ResourceExhausted. In the C++ library, that is all RPCs: https://github.com/googleapis/google-cloud-cpp/blob/main/google/cloud/pubsub/retry_policy.h#L29-L37

coryan · 2023-08-07T17:20:06Z

With the given initial retry of 0.1 sec, multiplier of 1.3, it will take over 18 attempts before the retry interval is > 10s.

I am sure an initial backoff of 0.1s and a 1.3 multiplier is wrong in some cases. But I am not sure why 0.1s and 4.0 is a better set of default parameters. That is likely to be wrong in other cases too. Without some criteria to evaluate which values are "better", they all seem equally arbitrary.

I am also not sure why 10s is an important threshold. Why not 5s? Or 7.654321s?

Without a principled approach to decide what these parameters should be, I am inclined to use the same defaults we use in all the other libraries (1s and 2.0). If nothing else for consistency within the C++ SDK.

dbolduc · 2023-08-07T21:28:09Z

(Just discussing)

If the server knows better about how long a client should wait (e.g. in the resource exhausted case), it can attach a google.rpc.RetryInfo: https://github.com/googleapis/googleapis/blob/353f09793b7a00acbe9e1b4fb8997fe533038ca4/google/rpc/error_details.proto#L91-L94

Of course, the C++ clients do not respect RetryInfo at the moment.... but adding that support could be an alternative to changing the default retry behavior for all transient errors for all idempotent pubsub RPCs.

amfisher-404 · 2023-08-08T21:51:48Z

That's a good point. The issue I'm really trying to get at is excessive retries of RESOURCE_EXHAUSTED, which can harm the service. RESOURCE_EXHAUSTED is considered a retryable code by this library, even though Google API standards categorize it as "generally nonretryable": https://google.aip.dev/194#generally-non-retryable-codes. Changing the categorization of this error in the client library would be a breaking change, since clients would have to start handling RESOURCE_EXHAUSTED errors directly.

If OnePlatform APIs included RetryInfo in their quota-exceeded rejections, that could certainly address the issue in a non-breaking way. In the meantime, I'm hoping to make backoff modifications, to address a longstanding issue that has affected the Pub/Sub service.

Note also that changing the multiplier will only affect a small quantity of RPCs that fail more than twice in a row, since the first retry interval is unchanged.

amfisher-404 added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Aug 7, 2023

alevenberg mentioned this issue Aug 7, 2023

fix(pubsub): increase backoff from 1.3 to 4.0 #12306

Merged

devbww assigned alevenberg Aug 7, 2023

alevenberg closed this as completed in #12306 Aug 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Default retry settings result in too-frequent retries for Resource Exhausted errors #12328

Default retry settings result in too-frequent retries for Resource Exhausted errors #12328

amfisher-404 commented Aug 7, 2023

coryan commented Aug 7, 2023

dbolduc commented Aug 7, 2023

amfisher-404 commented Aug 8, 2023

Default retry settings result in too-frequent retries for Resource Exhausted errors #12328

Default retry settings result in too-frequent retries for Resource Exhausted errors #12328

Comments

amfisher-404 commented Aug 7, 2023

coryan commented Aug 7, 2023

dbolduc commented Aug 7, 2023

amfisher-404 commented Aug 8, 2023