add AWS API call retries on AMI prevalidation #8034
Conversation
Nice one, there are some little things to fix but overall this LGTM 🙂
After some thinking: wasn't setting the existing poller env vars enough? I think adding an env var setting only to retry AWS validation is too much.
Those settings didn't work; in the logs we'd frequently see it fail after the prevalidation step, and you can also see the call isn't retried. Adding the retry in the SDK here resolved the issue. Agreed that separate env vars for a single call is annoying, but the poller's env vars don't factor easily into the SDK's retry mechanism: the names are hard-coded to the poller, and the poller has two modes of retry while the SDK has just one. One option might be to simplify the poller's env var settings so they're compatible with the SDK's, and make the names more generic so they can be reused. Though that would be a potentially breaking change. Thoughts?
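For readers following along, here is a minimal sketch of the kind of per-call SDK retry being discussed. The function name, the `"name"` filter, and the retry count of 5 are illustrative assumptions, not the PR's exact diff:

```go
package awsretry

import (
	"context"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/client"
	"github.com/aws/aws-sdk-go/aws/request"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// describeImagesWithRetry runs the AMI-name prevalidation query with the
// SDK's built-in retryer enabled for this single call. Names and the retry
// count here are assumptions for illustration.
func describeImagesWithRetry(ctx context.Context, ec2conn *ec2.EC2, amiName string) (*ec2.DescribeImagesOutput, error) {
	withRetryer := func(r *request.Request) {
		// DefaultRetryer backs off (with jitter) and retries throttling and
		// other transient errors up to NumMaxRetries times.
		r.Retryer = client.DefaultRetryer{NumMaxRetries: 5}
	}
	return ec2conn.DescribeImagesWithContext(ctx, &ec2.DescribeImagesInput{
		Filters: []*ec2.Filter{{
			Name:   aws.String("name"),
			Values: []*string{aws.String(amiName)},
		}},
	}, withRetryer)
}
```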
Ah, too bad, I had thought using the existing poller settings would cover it.
Hi, cool PR. I had seen reports of this a while ago, but since I wasn't able to reproduce it easily, I hadn't had a chance to investigate. Thanks for opening!
In this situation, I think it makes more sense to just have a set number of retries (five or ten?) and not make it configurable. That's how we handle a fair number of our other retry mechanisms that are built to work around network issues (example: step_key_pair), and being throttled feels similar to that. If a small handful of retries isn't enough to get around the throttling mechanism, it's unlikely that other parts of the build are going to succeed either; it'll just push the flake out further. At a certain point, Packer can't fix your org's throttling limits.
For posterity, this is related to issue #6330.
From a customer POV, what would work best for us is a single set of settings (e.g. env vars) that specify the retry behavior globally for the AWS builder, with reasonable defaults so they only need to be adjusted in extreme cases. I believe we happen to be in the extreme case, in that we have a large dev environment with dynamic EC2 Jenkins agents, and our apps need to make AWS API calls to manage resources. So not being able to adjust the retries if they still time out would mean we'd need to fork Packer to get past busy request times. Thinking long term, what about replacing the current poller-specific env vars with a single set of globally applied retry settings?
Ah, you're right -- those vars are used inside Packer to configure the poller. I don't think replacing these variables with a newly-named, global one is a good match: the max polling wait you want for a throttled synchronous API request is not the same as the max polling wait you want when waiting for a 30Gb image to be copied between six regions. In testing this in your setup, have you ever been throttled on other API calls, like the one I linked in step_key_pair? If the answer is yes, then we do need a configurable retry for synchronous API calls, but it needs to happen in multiple places in the code, not just in the describe images call. Rather than AWS_DESCRIBE_IMAGES_RETRIES, why not just AWS_API_RETRIES, which we can then apply to all the other API calls we have inside a retry loop? If the answer is no, then we probably don't need a configurable retry for any of them.
Yes, we get throttled on at least the describe images call here; it's possible we've had failures due to other non-retried calls too, however.
Then the poller could just default to a reasonable interval and be less complex to configure too. What do you think? One other comment: it might make sense to prefix the vars with PACKER_* just to avoid confusing them with SDK vars.
Prefixing with AWS makes it clear that this applies only to the AWS builder. I'd be fine with both, though, e.g. PACKER_AWS_API_RETRIES.
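For illustration only (the variable was ultimately not adopted, per the rest of the thread), a sketch of how a PACKER_AWS_API_RETRIES setting with a sane default could be parsed; the default of 5 is an assumption:

```go
package awsretry

import (
	"os"
	"strconv"
)

// defaultAPIRetries is an assumed default; the discussion above suggests a
// small handful (five or ten) rather than something large.
const defaultAPIRetries = 5

// apiRetries reads the retry count from the env var name floated in this
// thread. The variable was never actually added to Packer; this only sketches
// how such a setting might be parsed with a fallback default.
func apiRetries() int {
	if v := os.Getenv("PACKER_AWS_API_RETRIES"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n >= 0 {
			return n
		}
	}
	return defaultAPIRetries
}
```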
I'd rather be certain that this is an issue you see before we add a new variable to fix it.
The prevalidation check is confirmed, yes. What I was saying is that, in addition to those build failures, it's possible other builds have failed due to other API calls not being retried, as you mentioned.
Yes, I understand that you're seeing build failures in prevalidation without this PR. That's not my question. My question is whether you've seen build failures due to API throttling at other points which are currently already wrapped in retries (search the code for the existing retry wrapper to find them).

Speaking of which -- you're using the RetryCount field on the request, but elsewhere in the code we use a custom exponential backoff wrapper for the query, specifically to reduce the number of tries we need to attempt during high-traffic times, and so that we don't block on a long-running remote request in case a cancellation command comes through. For consistency's sake, and to allow the backoff, I think I'd like to see this retry follow that same format (regardless of where we land on the config option).

I'm sorry if it seems like I'm being arbitrarily hard on the idea of adding this environment variable. The reason I'm not wild about making this a lever you can tweak is that it encourages bad practices. Sure, I can tell Packer to retry every API call 200 times if I'm being throttled. But that's a band-aid for the actual problem, which is that my organization is regularly running for long periods of time above its API query threshold. Having Packer retry its calls a small handful of times to make sure that individual accidental spikes in API consumption don't mess up the build is one thing. Having Packer consistently bang over and over against Amazon's API thresholds because we haven't tuned our CI system is another. Which is why I feel that unless you're also seeing build failures as a result of the many other API calls we make -- inside retries, but with hardcoded upper limits on those retries -- it isn't worth adding a configuration variable, and it could potentially be harmful by covering up a pipeline issue.

I promise I'm not trying to be difficult for no reason here. I just want to make sure that this is really necessary before we add it, because it feels like an antipattern.
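As a rough illustration of the exponential-backoff-plus-cancellation pattern being described (this is a generic sketch, not Packer's actual retry wrapper; names and limits are assumptions):

```go
package awsretry

import (
	"context"
	"time"
)

// retryWithBackoff retries fn with exponential backoff between attempts and
// stops early if the context is cancelled (e.g. when a build is interrupted).
func retryWithBackoff(ctx context.Context, maxTries int, fn func() error) error {
	delay := 200 * time.Millisecond
	var err error
	for try := 0; try < maxTries; try++ {
		if err = fn(); err == nil {
			return nil
		}
		select {
		case <-time.After(delay):
			delay *= 2 // back off exponentially before the next attempt
		case <-ctx.Done():
			return ctx.Err() // don't keep retrying once cancelled
		}
	}
	return err // give up after maxTries attempts and surface the last error
}
```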
Yes, the existing wrapper makes more sense; I didn't know about it. Omitting the env var works, since it sounds like it's been a pain point for you. Ideally nothing needs to be set and it just works ™️. I haven't seen failures from the other calls that are already retried, only the poller.
I appreciate it! And I am happy to revisit it in the future if it turns out a simple retry is insufficient :)
@SwampDragons I took a look at the retry wrapper, and it's actually retrying on any error rather than only on throttling events like the SDK does, which means that auth errors will be retried, for example. Additionally, the SDK adds randomization to the retry delay, so it'll break up synchronized requests during parallel builds better. Let me know if that's OK, or if you'd still prefer to use the wrapper.
I think setting the retry count like this should be enough, and this looks good to me. (I would like to let @SwampDragons have the final call here though 🙂) FYI: the retry.Config has a hook for deciding which errors should be retried; for reference, the SDK's own retry decision logic lives in packer/vendor/github.com/aws/aws-sdk-go/aws/request/request.go, lines 603 to 642 (at ecd118b).
It would be way better if that shouldRetry func was public though. |
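For context, here is a sketch of a throttle-only predicate built from the SDK's exported helpers; this is an assumption about how one might approximate the unexported logic, not Packer's actual code:

```go
package awsretry

import (
	"github.com/aws/aws-sdk-go/aws/request"
)

// isThrottle reports whether err looks like AWS API throttling. It leans on
// the SDK's exported IsErrorThrottle helper rather than its unexported
// shouldRetry logic; a retry wrapper could use this as its retry predicate.
func isThrottle(err error) bool {
	return err != nil && request.IsErrorThrottle(err)
}
```

The SDK also exports request.IsErrorRetryable for other transient failures, which a wrapper could OR in if broader retries were wanted.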
Confirmed internally, thanks @cove 🙂
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
Packer does a pre-validation check on the AMI name before proceeding with a build; however, this check is not retried, and when throttled by AWS it causes the build to fail. This PR switches this call to the AWS SDK's built-in retry mechanism with sane defaults, and adds documentation on how to increase the retry settings.
This patch has been tested for about 1 week in a busy AWS account that requires robust retrying.