Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

research AWS retries, reduce fetch hanging behaviour #1391

Closed
shimonp21 opened this issue May 5, 2022 · 10 comments
Closed

research AWS retries, reduce fetch hanging behaviour #1391

shimonp21 opened this issue May 5, 2022 · 10 comments

Comments

@shimonp21
Copy link
Contributor

Describe the bug

too high max_retries and max_backoff can cause "hanging" behaviour, while all goroutines "sleep" to wait for retries.

Expected Behavior

It's better to fail than to hang.

Steps to Reproduce

the hanging is notoriously difficult to reproduce.

Possible Solution

No response

Provider and CloudQuery version

core: 0.22.10, aws: 0.11.2

Additional Context

No response

@shimonp21 shimonp21 added the bug label May 5, 2022
@shimonp21
Copy link
Contributor Author

"throttling warnings" don't actually always mean that all retries (from maxRetries) has been exhausted. It is also possible to get throttling warnings even if the highest number of retries was 4.

failed to resolve table "aws_ec2_vpcs": EC2: DescribeVpcs - failed to get rate limit token, retry quota exceeded, 0 available, 10 requested

Working theory is that "retryTokens" are shared between different API calls, so even if every API call doesn't retry 20 times, the retry-tokens still get exhausted.

@shimonp21
Copy link
Contributor Author

Definitely looks like there are deeper questions here.

However, I would also support not investigating this super-deep and just setting:

maxRetries = 10
maxBackoff = 30s

@shimonp21
Copy link
Contributor Author

I am reminded of a conversation with @bbernays where he said that part of the throttling we are seeing happens in the client-side (i.e. in the https://github.com/aws/aws-sdk-go).
This makes a lot of sense considering I am seeing throttling warnings when after too many failed DNS queries. i.e. the aws-sdk doesn't distinguish between a DNS issue (i.e. request never actually reached Amazon's server's) and an actual failure in the request to AWS, so he throttles. This is of course an undesirable behaviour, since if DNS failed we want to keep asking the DNS server rather than getting "fake-throttled"...

@shimonp21
Copy link
Contributor Author

shimonp21 commented May 6, 2022

Also mentioning @roneli @spangenberg @disq for completeness of discussion

@shimonp21
Copy link
Contributor Author

Attaching summary of retries (extracted from logs with python-script).
https://docs.google.com/spreadsheets/d/1QJJdrXC9mlV1poysGf8rvsWqx79ut5F4rTSsaG-diV8/edit?usp=sharing

@disq
Copy link
Member

disq commented May 6, 2022

aws sdk has provisions for setting any error type as "retryable" and it does a good job of specifying which ones should be retried - source. I don't think there's any way to specify some errors to have different retry quotas or make it exempt from client side throttling depending on the type of error (but still have it retry).

@shimonp21
Copy link
Contributor Author

shimonp21 commented May 9, 2022

TODO: remove suggestion to user to increase their max_retries. It is unhelpful and might cause what users perceive as "hangs".

@shimonp21 shimonp21 self-assigned this May 9, 2022
shimonp21 referenced this issue in cloudquery/cq-provider-aws May 16, 2022
…868)

See context at https://github.com/cloudquery/cq-provider-aws/issues/853.

Increasing max_retries will not have a positive effect in most cases, and will likely
only exacerbate any "hanging" issues.
@shimonp21
Copy link
Contributor Author

Leaving this bug as "research AWS retries, reduce fetch hanging behaviour".

@shimonp21
Copy link
Contributor Author

shimonp21 commented May 16, 2022

Summary:

  • Too many retries can cause "hanging" behaviour because of very long sleep time between retries and large amount of retries".
  • "throttling" warnings can appear even if the specified amount of retries hasn't been exhausted. The source of these throttling warnings is likely the aws-sdk-for-go (i.e. client-side, not server-side) .
  • Even DNS issues can cause "throttling" warnings - i.e. the AWS-sdk-for-go throttles us from making too many DNS queries, before any requests made it to the AWS servers. Can other network issues cause similar behaviour?

Further research is required.

I think we can leave this as low priority until we get customer-reported hangs.

@shimonp21 shimonp21 changed the title Reduce default max_retries and max_backoff research AWS retries, reduce fetch hanging behaviour May 16, 2022
@shimonp21 shimonp21 removed their assignment May 17, 2022
@erezrokah erezrokah added the aws label Aug 16, 2022
@erezrokah erezrokah transferred this issue from cloudquery/cq-provider-aws Aug 16, 2022
amanenk pushed a commit to amanenk/cloudquery that referenced this issue Sep 12, 2022
* chore: Synced file(s) with cloudquery/.github

* Add missing workflow file

Co-authored-by: cq-bot <null>
Co-authored-by: Herman Schaaf <[email protected]>
@yevgenypats
Copy link
Member

Per my latest research AWS failed to get rate limit token which we should just change the default bucket cost and expose it to the user in any case it is not a good default for them. for CloudQuery the client side throttling doesn't make a lot of sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants