research AWS retries, reduce fetch hanging behaviour #1391

shimonp21 · 2022-05-05T13:28:48Z

Describe the bug

too high max_retries and max_backoff can cause "hanging" behaviour, while all goroutines "sleep" to wait for retries.

Expected Behavior

It's better to fail than to hang.

Steps to Reproduce

the hanging is notoriously difficult to reproduce.

Possible Solution

No response

Provider and CloudQuery version

core: 0.22.10, aws: 0.11.2

Additional Context

No response

shimonp21 · 2022-05-05T13:36:06Z

"throttling warnings" don't actually always mean that all retries (from maxRetries) has been exhausted. It is also possible to get throttling warnings even if the highest number of retries was 4.

failed to resolve table "aws_ec2_vpcs": EC2: DescribeVpcs - failed to get rate limit token, retry quota exceeded, 0 available, 10 requested

Working theory is that "retryTokens" are shared between different API calls, so even if every API call doesn't retry 20 times, the retry-tokens still get exhausted.

shimonp21 · 2022-05-05T13:38:37Z

Definitely looks like there are deeper questions here.

However, I would also support not investigating this super-deep and just setting:

maxRetries = 10
maxBackoff = 30s

shimonp21 · 2022-05-06T10:26:00Z

I am reminded of a conversation with @bbernays where he said that part of the throttling we are seeing happens in the client-side (i.e. in the https://github.com/aws/aws-sdk-go).
This makes a lot of sense considering I am seeing throttling warnings when after too many failed DNS queries. i.e. the aws-sdk doesn't distinguish between a DNS issue (i.e. request never actually reached Amazon's server's) and an actual failure in the request to AWS, so he throttles. This is of course an undesirable behaviour, since if DNS failed we want to keep asking the DNS server rather than getting "fake-throttled"...

shimonp21 · 2022-05-06T10:28:17Z

Also mentioning @roneli @spangenberg @disq for completeness of discussion

shimonp21 · 2022-05-06T10:34:45Z

Attaching summary of retries (extracted from logs with python-script).
https://docs.google.com/spreadsheets/d/1QJJdrXC9mlV1poysGf8rvsWqx79ut5F4rTSsaG-diV8/edit?usp=sharing

disq · 2022-05-06T10:38:20Z

aws sdk has provisions for setting any error type as "retryable" and it does a good job of specifying which ones should be retried - source. I don't think there's any way to specify some errors to have different retry quotas or make it exempt from client side throttling depending on the type of error (but still have it retry).

Context at https://github.com/cloudquery/cq-provider-aws/issues/853

shimonp21 · 2022-05-09T08:33:47Z

TODO: remove suggestion to user to increase their max_retries. It is unhelpful and might cause what users perceive as "hangs".

…868) See context at https://github.com/cloudquery/cq-provider-aws/issues/853. Increasing max_retries will not have a positive effect in most cases, and will likely only exacerbate any "hanging" issues.

shimonp21 · 2022-05-16T08:28:32Z

Leaving this bug as "research AWS retries, reduce fetch hanging behaviour".

shimonp21 · 2022-05-16T08:42:44Z

Summary:

Too many retries can cause "hanging" behaviour because of very long sleep time between retries and large amount of retries".
"throttling" warnings can appear even if the specified amount of retries hasn't been exhausted. The source of these throttling warnings is likely the aws-sdk-for-go (i.e. client-side, not server-side) .
Even DNS issues can cause "throttling" warnings - i.e. the AWS-sdk-for-go throttles us from making too many DNS queries, before any requests made it to the AWS servers. Can other network issues cause similar behaviour?

Further research is required.

I think we can leave this as low priority until we get customer-reported hangs.

* chore: Synced file(s) with cloudquery/.github * Add missing workflow file Co-authored-by: cq-bot <null> Co-authored-by: Herman Schaaf <[email protected]>

yevgenypats · 2022-09-23T19:50:20Z

Per my latest research AWS failed to get rate limit token which we should just change the default bucket cost and expose it to the user in any case it is not a good default for them. for CloudQuery the client side throttling doesn't make a lot of sense.

shimonp21 added the bug label May 5, 2022

shimonp21 mentioned this issue May 6, 2022

fix: Reduce retry params cloudquery/cq-provider-aws#855

Merged

6 tasks

shimonp21 referenced this issue in cloudquery/cq-provider-aws May 6, 2022

fix: Reduce retry params (#855)

23e2fe1

Context at https://github.com/cloudquery/cq-provider-aws/issues/853

shimonp21 mentioned this issue May 9, 2022

fix: Remove "increase max_retries" from the throttling error message cloudquery/cq-provider-aws#868

Merged

6 tasks

shimonp21 self-assigned this May 9, 2022

shimonp21 changed the title ~~Reduce default max_retries and max_backoff~~ research AWS retries, reduce fetch hanging behaviour May 16, 2022

shimonp21 removed their assignment May 17, 2022

erezrokah added the aws label Aug 16, 2022

erezrokah transferred this issue from cloudquery/cq-provider-aws Aug 16, 2022

yevgenypats closed this as completed Sep 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

research AWS retries, reduce fetch hanging behaviour #1391

research AWS retries, reduce fetch hanging behaviour #1391

shimonp21 commented May 5, 2022

shimonp21 commented May 5, 2022

shimonp21 commented May 5, 2022

shimonp21 commented May 6, 2022

shimonp21 commented May 6, 2022 •

edited

Loading

shimonp21 commented May 6, 2022

disq commented May 6, 2022 •

edited

Loading

shimonp21 commented May 9, 2022 •

edited

Loading

shimonp21 commented May 16, 2022

shimonp21 commented May 16, 2022 •

edited

Loading

yevgenypats commented Sep 23, 2022

research AWS retries, reduce fetch hanging behaviour #1391

research AWS retries, reduce fetch hanging behaviour #1391

Comments

shimonp21 commented May 5, 2022

Describe the bug

Expected Behavior

Steps to Reproduce

Possible Solution

Provider and CloudQuery version

Additional Context

shimonp21 commented May 5, 2022

shimonp21 commented May 5, 2022

shimonp21 commented May 6, 2022

shimonp21 commented May 6, 2022 • edited Loading

shimonp21 commented May 6, 2022

disq commented May 6, 2022 • edited Loading

shimonp21 commented May 9, 2022 • edited Loading

shimonp21 commented May 16, 2022

shimonp21 commented May 16, 2022 • edited Loading

yevgenypats commented Sep 23, 2022

shimonp21 commented May 6, 2022 •

edited

Loading

disq commented May 6, 2022 •

edited

Loading

shimonp21 commented May 9, 2022 •

edited

Loading

shimonp21 commented May 16, 2022 •

edited

Loading