
add AWS API call retries on AMI prevalidation #8034

Merged
1 commit merged into hashicorp:master on Aug 29, 2019

Conversation

cove
Contributor

@cove cove commented Aug 24, 2019

Packer does a pre-validation check on the AMI name before proceeding with a build; however, this check is not retried, so when AWS throttles the call the build fails. This PR switches this call to the AWS SDK's built-in retry mechanism with sane defaults, and provides additional documentation on how to increase the retry settings.

This patch has been tested for about 1 week in a busy AWS account that requires robust retrying.
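
For context, the change is roughly the following shape (a sketch only, not the actual diff; the client construction, filter values, and retry count are illustrative assumptions):

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess, err := session.NewSession()
	if err != nil {
		log.Fatal(err)
	}

	// Let the SDK's built-in retryer (exponential backoff with jitter, aware of
	// throttling error codes) handle retries for the pre-validation call.
	// MaxRetries(10) is an illustrative default, not the PR's exact value.
	ec2conn := ec2.New(sess, aws.NewConfig().WithMaxRetries(10))

	// Pre-validation: check whether an AMI with this name already exists.
	out, err := ec2conn.DescribeImages(&ec2.DescribeImagesInput{
		Filters: []*ec2.Filter{{
			Name:   aws.String("name"),
			Values: []*string{aws.String("my-ami-name")}, // hypothetical AMI name
		}},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("found %d images with that name\n", len(out.Images))
}
```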

@cove cove requested a review from a team as a code owner August 24, 2019 00:00
@hashicorp-cla

hashicorp-cla commented Aug 24, 2019

CLA assistant check
All committers have signed the CLA.

Contributor

@azr azr left a comment

Nice one, there are some little things to fix but overall this LGTM 🙂

Two review threads on builder/amazon/common/step_pre_validate.go (outdated, resolved)
@azr
Contributor

azr commented Aug 26, 2019

After some thinking: wasn't setting AWS_POLL_DELAY_SECONDS & AWS_MAX_ATTEMPTS enough? Did you try setting them, and did that solve your problem?

I think adding an env var setting just to retry AWS validation is too much.

@cove
Contributor Author

cove commented Aug 26, 2019

Those settings didn't work; in the logs we'd frequently see it fail at the prevalidation step. You can also see the call isn't retried. Adding the SDK retry here resolved the issue.

Agreed, separate env vars for a single call are annoying, but the poller's env vars don't map cleanly onto the SDK's retry mechanism: the names are hard-coded to the poller, the poller has two modes of retry, and the SDK has just one.

One option might be to simplify the poller's env var settings so they're compatible with the SDK's, and make the names more generic so they can be reused, though that would potentially be a breaking change. Thoughts?

@azr
Contributor

azr commented Aug 27, 2019

Ah, too bad. I think using the AWS_POLL_DELAY_SECONDS and AWS_MAX_ATTEMPTS settings (without changing anything else) could be the right call here. 🤔 What do you think, @SwampDragons?

@SwampDragons
Contributor

Hi, cool PR. I had seen reports of this a while ago but since I wasn't able to reproduce easily, hadn't had a chance to investigate. Thanks for opening!

AWS_POLL_DELAY_SECONDS and AWS_MAX_ATTEMPTS are defined by the AWS SDK, not by Packer, and I'm not a fan of overloading them to have packer-specific meanings as well. In general, we've been trying to move Packer away from using custom env vars where possible.

In this situation, I think it makes more sense to just have a set number of retries (five or ten?) and not have it configurable. That's how we do a fair number of our other retry mechanisms that are built to work around network issues (Example: step_key_pair), and being throttled feels similar to that. I think that if a small handful of retries isn't enough to get around the throttling mechanism, it'll be unlikely that other parts of the build are going to be successful; it'll just push the flake out further. At a certain point, Packer can't fix your org's throttling limits.

@SwampDragons
Contributor

For posterity, this is related to issue: #6330

@cove
Contributor Author

cove commented Aug 27, 2019

From a customer POV, what would work best for us is a single set of settings (e.g. env vars) that specifies the retry behavior globally for the AWS builder, with reasonable defaults so they only need to be adjusted in extreme cases.

I believe we happen to be in the extreme case: we have a large dev environment with dynamic EC2 Jenkins agents, and our apps need to make AWS API calls to manage resources. So if the calls still time out and we can't adjust the settings, we'd need to fork Packer to get past busy request times.

Thinking long term, what about this?

Replace the current AWS_* vars (which look Packer-specific, since I couldn't find them in the SDK):

  • AWS_POLL_DELAY_SECONDS
  • AWS_TIMEOUT_SECONDS
  • AWS_MAX_ATTEMPTS

With:

  • PACKER_RETRIES_MAX default 10
  • PACKER_RETRIES_DELAY default 10s
For the legacy env vars, print a warning for one release that the names will change, then warn that they're no longer honored going forward (a rough sketch follows).
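
A rough sketch of what reading the proposed variables could look like, including the legacy-variable warning; this is purely hypothetical, none of it exists in Packer, and the names and defaults are just the proposal above:

```go
package main

import (
	"log"
	"os"
	"strconv"
	"time"
)

// Hypothetical defaults taken from the proposal above.
const (
	defaultRetriesMax   = 10
	defaultRetriesDelay = 10 * time.Second
)

func retrySettings() (int, time.Duration) {
	max := defaultRetriesMax
	if v := os.Getenv("PACKER_RETRIES_MAX"); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			max = n
		}
	}

	delay := defaultRetriesDelay
	if v := os.Getenv("PACKER_RETRIES_DELAY"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			delay = d
		}
	}

	// Transitional behavior for the legacy vars: warn for one release, then
	// stop honoring them.
	if os.Getenv("AWS_MAX_ATTEMPTS") != "" || os.Getenv("AWS_POLL_DELAY_SECONDS") != "" {
		log.Println("warning: AWS_MAX_ATTEMPTS / AWS_POLL_DELAY_SECONDS are deprecated; " +
			"use PACKER_RETRIES_MAX and PACKER_RETRIES_DELAY instead")
	}
	return max, delay
}

func main() {
	max, delay := retrySettings()
	log.Printf("retries=%d delay=%s", max, delay)
}
```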

@SwampDragons
Contributor

SwampDragons commented Aug 27, 2019

Ah, you're right -- those vars are used inside Packer to set the request.WithWaiterDelay and request.WithWaiterMaxAttempts values in the SDK. AWS_TIMEOUT_SECONDS is already a legacy env var that we're supporting for backwards compatibility.
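
For reference, the mapping those vars feed is roughly this shape (a sketch, not the actual Packer code; the fallback values and AMI ID are placeholders):

```go
package main

import (
	"log"
	"os"
	"strconv"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/request"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess, err := session.NewSession()
	if err != nil {
		log.Fatal(err)
	}
	ec2conn := ec2.New(sess)

	// AWS_POLL_DELAY_SECONDS and AWS_MAX_ATTEMPTS feed the SDK's waiter options,
	// not the per-request retryer. The fallback values here are placeholders.
	delay := 30
	if v := os.Getenv("AWS_POLL_DELAY_SECONDS"); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			delay = n
		}
	}
	attempts := 40
	if v := os.Getenv("AWS_MAX_ATTEMPTS"); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			attempts = n
		}
	}

	err = ec2conn.WaitUntilImageAvailableWithContext(
		aws.BackgroundContext(),
		&ec2.DescribeImagesInput{ImageIds: []*string{aws.String("ami-12345678")}}, // placeholder AMI ID
		request.WithWaiterDelay(request.ConstantWaiterDelay(time.Duration(delay)*time.Second)),
		request.WithWaiterMaxAttempts(attempts),
	)
	if err != nil {
		log.Fatal(err)
	}
}
```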

I don't think replacing these variables with a newly-named global one is a good match. The max polling wait you want for a throttled synchronous API request is not the same as the max polling wait you want while a 30 GB image is copied across six regions.

In testing this in your setup, have you ever been throttled on other API calls like the one I linked in step_key_pair?

If the answer is yes, then we do need a configurable retry for synchronous API calls, but it needs to happen in multiple places in the code, not just in the DescribeImages call. Rather than AWS_DESCRIBE_IMAGES_RETRIES, why not just AWS_API_RETRIES, which we could then apply to all the other API calls we have inside a retry loop?

If the answer is no, then we probably don't need a configurable retry for any of them.

@cove
Contributor Author

cove commented Aug 27, 2019

Yes, we get throttled on at least the DescribeImages call here; it's possible we've had failures due to other non-retried calls as well.

AWS_API_RETRIES with a default of, say, 10, exponential backoff capped at 30s or 60s, and some randomization to help with parallelized builds would be great.

Then the poller could just default to a reasonable interval and be less complex to configure too.

What do you think?

One other comment: it might make sense to prefix the vars with PACKER_* just to avoid confusing them with SDK vars.

@SwampDragons
Contributor

Prefixing with AWS makes it clear that this applies only to the AWS builder. I'd be fine with both, though, e.g. PACKER_AWS_API_RETRIES.

it's possible we've had failures due to other non-retried calls too however.

I'd rather be certain that this is an issue you see before we add a new variable to fix it.

@cove
Contributor Author

cove commented Aug 27, 2019

PACKER_AWS_API_RETRIES makes sense to me.

The prevalidation check is confirmed, yes. What I was saying is that, in addition to those build failures, it's possible other builds have failed because other API calls aren't retried, as you mentioned.

@SwampDragons
Contributor

Yes, I understand that you're seeing build failures in prevalidation without this PR. That's not my question.

My question is whether you've seen build failures due to API throttling at other points which are currently already wrapped in retries. (search the code for retry.Config -- we use hardcoded retries all the time for AWS API calls) If you haven't seen failures at any of these points, then an extra configuration variable is unnecessary.

Speaking of which -- you're using the RetryCount field on the request, but elsewhere in the code we use a custom exponential backoff wrapper for the query specifically to reduce the number of tries we need to attempt during high traffic times, and so that we don't block on a long-running remote request in case a cancellation command comes through. For consistency's sake, and to allow the backoff, I think I'd like to see this retry follow that same format (regardless of where we land on the config option).
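
For context, that wrapper is roughly this shape (a standalone sketch; the actual helper in Packer's codebase may differ in naming and detail):

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// RetryConfig is a sketch of the retry wrapper discussed above; the real
// helper lives in Packer's code and its fields may differ.
type RetryConfig struct {
	Tries       int
	ShouldRetry func(error) bool
	RetryDelay  func() time.Duration
}

// Run retries fn until it succeeds, ShouldRetry rejects the error, the try
// budget is exhausted, or ctx is cancelled (so a cancellation is not stuck
// waiting behind a long backoff sleep).
func (cfg RetryConfig) Run(ctx context.Context, fn func(context.Context) error) error {
	var err error
	for i := 0; i < cfg.Tries; i++ {
		if err = fn(ctx); err == nil {
			return nil
		}
		if cfg.ShouldRetry != nil && !cfg.ShouldRetry(err) {
			return err
		}
		select {
		case <-time.After(cfg.RetryDelay()):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}

// expBackoff returns an exponential delay function capped at max.
func expBackoff(initial, max time.Duration) func() time.Duration {
	next := initial
	return func() time.Duration {
		d := next
		next *= 2
		if next > max {
			next = max
		}
		return d
	}
}

func main() {
	err := RetryConfig{
		Tries:       10,
		ShouldRetry: func(error) bool { return true }, // retry on any error
		RetryDelay:  expBackoff(200*time.Millisecond, 30*time.Second),
	}.Run(context.Background(), func(ctx context.Context) error {
		return errors.New("RequestLimitExceeded") // stand-in for the throttled API call
	})
	log.Println(err)
}
```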

I'm sorry if it seems like I'm being arbitrarily hard on the idea of adding this environment variable. The reason I'm not wild about making this a lever you can tweak is that it encourages bad practices. Sure, I can tell Packer to retry every API call 200 times if I'm being throttled. But that's a band-aid for the actual problem, which is that my organization is regularly running for long periods of time above its API query threshold. Having Packer retry its calls a small handful of times to make sure that individual accidental spikes in API consumption don't mess up the build is one thing. Having Packer consistently bang over and over against Amazon's API thresholds because we haven't tuned our CI system is another.

Which is why I feel that unless you're also seeing build failures as a result of the many other API calls we make, inside retries, but with hardcoded upper limits on the retries, then it isn't worth adding a configuration variable, and could potentially be harmful by covering up a pipeline issue.

I promise I'm not trying to be difficult for no reason here. I just want to make sure that this is really necessary before we add it, because it feels like an antipattern.

@cove
Contributor Author

cove commented Aug 27, 2019

Yes, the existing wrapper makes more sense, didn't know about it.

Omitting the env var works for us, since it sounds like it's been a pain point for you. Ideally nothing needs to be set and it just works ™️. I haven't seen failures in the other calls that are already retried, other than the poller.

@SwampDragons
Contributor

I appreciate it! And I am happy to revisit it in the future if it turns out a simple retry is insufficient :)

@cove
Contributor Author

cove commented Aug 28, 2019

@SwampDragons I took a look at the retry wrapper, and it actually retries on any error rather than only on throttling events like the SDK does, which means auth errors will be retried, for example. Additionally, the SDK adds randomization to the retry delay, so it breaks up synchronized requests during parallel builds better. Lmk if that's ok, or if you'd still prefer to use the wrapper.

@azr
Contributor

azr commented Aug 28, 2019

I think setting the retry count like this should be enough, and this looks good to me.

( I would like to let @SwampDragons have the final call here though 🙂 )


FYI: retry.Config has a ShouldRetry func(error) bool field which could probably be set to something like what the AWS SDK already has:

func shouldRetryError(origErr error) bool {
	switch err := origErr.(type) {
	case awserr.Error:
		if err.Code() == CanceledErrorCode {
			return false
		}
		return shouldRetryError(err.OrigErr())
	case *url.Error:
		if strings.Contains(err.Error(), "connection refused") {
			// Refused connections should be retried as the service may not yet
			// be running on the port. Go TCP dial considers refused
			// connections as not temporary.
			return true
		}
		// *url.Error only implements Temporary after golang 1.6 but since
		// url.Error only wraps the error:
		return shouldRetryError(err.Err)
	case temporary:
		if netErr, ok := err.(*net.OpError); ok && netErr.Op == "dial" {
			return true
		}
		// If the error is temporary, we want to allow continuation of the
		// retry process
		return err.Temporary() || isErrConnectionReset(origErr)
	case nil:
		// `awserr.Error.OrigErr()` can be nil, meaning there was an error but
		// because we don't know the cause, it is marked as retryable. See
		// TestRequest4xxUnretryable for an example.
		return true
	default:
		switch err.Error() {
		case "net/http: request canceled",
			"net/http: request canceled while waiting for connection":
			// known 1.5 error case when an http request is cancelled
			return false
		}
		// here we don't know the error; so we allow a retry.
		return true
	}
}

It would be way better if that shouldRetry func was public though.
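
Since that SDK function is unexported, a ShouldRetry callback could lean on the exported helpers in the request package instead; a minimal sketch:

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/request"
)

// shouldRetry is a sketch of a ShouldRetry callback built on the SDK's
// exported error-classification helpers rather than the unexported
// shouldRetryError above.
func shouldRetry(err error) bool {
	if aerr, ok := err.(awserr.Error); ok && aerr.Code() == request.CanceledErrorCode {
		return false // never retry a cancelled request
	}
	return request.IsErrorThrottle(err) || request.IsErrorRetryable(err)
}

func main() {
	err := awserr.New("RequestLimitExceeded", "Request limit exceeded.", nil)
	fmt.Println(shouldRetry(err)) // true: throttling errors should be retried
}
```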

@azr azr merged commit 86cee5c into hashicorp:master Aug 29, 2019
@azr
Contributor

azr commented Aug 29, 2019

Confirmed internally, thanks @cove 🙂

@ghost

ghost commented Mar 29, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Mar 29, 2020