
add AWS API call retries on AMI prevalidation #8034

Merged
1 commit merged into hashicorp:master on Aug 29, 2019

Conversation

cove
Contributor

@cove cove commented Aug 24, 2019

Packer does a pre-validation check on the AMI name before proceeding with a build; however, this check is not retried, so when AWS throttles the call the build fails. This PR switches this call to the AWS SDK's built-in retry mechanism with sane defaults, and provides additional documentation on how to increase the retry settings.

This patch has been tested for about 1 week in a busy AWS account that requires robust retrying.
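
For context, the change is roughly the following shape (a sketch only, not the actual diff; the client construction, filter values, and retry count are illustrative assumptions):

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess, err := session.NewSession()
	if err != nil {
		log.Fatal(err)
	}

	// Let the SDK's built-in retryer (exponential backoff with jitter, aware of
	// throttling error codes) handle retries for the pre-validation call.
	// MaxRetries(10) is an illustrative default, not the PR's exact value.
	ec2conn := ec2.New(sess, aws.NewConfig().WithMaxRetries(10))

	// Pre-validation: check whether an AMI with this name already exists.
	out, err := ec2conn.DescribeImages(&ec2.DescribeImagesInput{
		Filters: []*ec2.Filter{{
			Name:   aws.String("name"),
			Values: []*string{aws.String("my-ami-name")}, // hypothetical AMI name
		}},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("found %d images with that name\n", len(out.Images))
}
```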

@cove cove requested a review from a team as a code owner August 24, 2019 00:00
@hashicorp-cla

hashicorp-cla commented Aug 24, 2019

CLA assistant check
All committers have signed the CLA.

Contributor

@azr azr left a comment

Nice one, there are some little things to fix but overall this LGTM 🙂

Two review threads on builder/amazon/common/step_pre_validate.go (outdated, resolved)
@azr
Contributor

azr commented Aug 26, 2019

After some thinking: wasn't setting AWS_POLL_DELAY_SECONDS & AWS_MAX_ATTEMPTS enough? Did you try setting them, and did that solve your problem?

I think adding an env var setting just to retry AWS validation is too much.

@cove
Contributor Author

cove commented Aug 26, 2019

Those settings didn't work; in the logs we'd frequently see it fail at the prevalidation step. You can also see the call isn't retried. Adding the SDK retry here resolved the issue.

Agreed, separate env vars for a single call are annoying, but the poller's env vars don't map cleanly onto the SDK's retry mechanism: the names are hard-coded to the poller, the poller has two modes of retry, and the SDK has just one.

One option might be to simplify the poller's env var settings so they're compatible with the SDK's, and make the names more generic so they can be reused, though that would potentially be a breaking change. Thoughts?

@azr
Contributor

azr commented Aug 27, 2019

Ah, too bad. I think using the AWS_POLL_DELAY_SECONDS and AWS_MAX_ATTEMPTS settings (without changing anything else) could be the right call here. 🤔 What do you think, @SwampDragons?

@SwampDragons
Contributor

Hi, cool PR. I had seen reports of this a while ago but since I wasn't able to reproduce easily, hadn't had a chance to investigate. Thanks for opening!

AWS_POLL_DELAY_SECONDS and AWS_MAX_ATTEMPTS are defined by the AWS SDK, not by Packer, and I'm not a fan of overloading them to have packer-specific meanings as well. In general, we've been trying to move Packer away from using custom env vars where possible.

In this situation, I think it makes more sense to just have a set number of retries (five or ten?) and not have it configurable. That's how we do a fair number of our other retry mechanisms that are built to work around network issues (Example: step_key_pair), and being throttled feels similar to that. I think that if a small handful of retries isn't enough to get around the throttling mechanism, it'll be unlikely that other parts of the build are going to be successful; it'll just push the flake out further. At a certain point, Packer can't fix your org's throttling limits.

@SwampDragons
Contributor

For posterity, this is related to issue: #6330

@cove
Contributor Author

cove commented Aug 27, 2019

From a customer POV, what would work best for us is a single set of settings (e.g. env vars) that specifies the retry behavior globally for the AWS builder, with reasonable defaults so they only need to be adjusted in extreme cases.

I believe we happen to be in the extreme case: we have a large dev environment with dynamic EC2 Jenkins agents, and our apps need to make AWS API calls to manage resources. So if the calls still time out and we can't adjust the settings, we'd need to fork Packer to get past busy request times.

Thinking long term, what about this?

Replace the current AWS_* vars (which look Packer-specific, since I couldn't find them in the SDK):

  • AWS_POLL_DELAY_SECONDS
  • AWS_TIMEOUT_SECONDS
  • AWS_MAX_ATTEMPTS

With:

  • PACKER_RETRIES_MAX default 10
  • PACKER_RETRIES_DELAY default 10s
For the legacy env vars, print a warning for one release that the names will change, then warn that they're no longer honored going forward (a rough sketch follows).
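
A rough sketch of what reading the proposed variables could look like, including the legacy-variable warning; this is purely hypothetical, none of it exists in Packer, and the names and defaults are just the proposal above:

```go
package main

import (
	"log"
	"os"
	"strconv"
	"time"
)

// Hypothetical defaults taken from the proposal above.
const (
	defaultRetriesMax   = 10
	defaultRetriesDelay = 10 * time.Second
)

func retrySettings() (int, time.Duration) {
	max := defaultRetriesMax
	if v := os.Getenv("PACKER_RETRIES_MAX"); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			max = n
		}
	}

	delay := defaultRetriesDelay
	if v := os.Getenv("PACKER_RETRIES_DELAY"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			delay = d
		}
	}

	// Transitional behavior for the legacy vars: warn for one release, then
	// stop honoring them.
	if os.Getenv("AWS_MAX_ATTEMPTS") != "" || os.Getenv("AWS_POLL_DELAY_SECONDS") != "" {
		log.Println("warning: AWS_MAX_ATTEMPTS / AWS_POLL_DELAY_SECONDS are deprecated; " +
			"use PACKER_RETRIES_MAX and PACKER_RETRIES_DELAY instead")
	}
	return max, delay
}

func main() {
	max, delay := retrySettings()
	log.Printf("retries=%d delay=%s", max, delay)
}
```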

@SwampDragons
Contributor

SwampDragons commented Aug 27, 2019

Ah, you're right -- those vars are used inside Packer to set the request.WithWaiterDelay and request.WithWaiterMaxAttempts values in the SDK. AWS_TIMEOUT_SECONDS is already a legacy env var that we're supporting for backwards compatibility.
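
For reference, the mapping those vars feed is roughly this shape (a sketch, not the actual Packer code; the fallback values and AMI ID are placeholders):

```go
package main

import (
	"log"
	"os"
	"strconv"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/request"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess, err := session.NewSession()
	if err != nil {
		log.Fatal(err)
	}
	ec2conn := ec2.New(sess)

	// AWS_POLL_DELAY_SECONDS and AWS_MAX_ATTEMPTS feed the SDK's waiter options,
	// not the per-request retryer. The fallback values here are placeholders.
	delay := 30
	if v := os.Getenv("AWS_POLL_DELAY_SECONDS"); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			delay = n
		}
	}
	attempts := 40
	if v := os.Getenv("AWS_MAX_ATTEMPTS"); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			attempts = n
		}
	}

	err = ec2conn.WaitUntilImageAvailableWithContext(
		aws.BackgroundContext(),
		&ec2.DescribeImagesInput{ImageIds: []*string{aws.String("ami-12345678")}}, // placeholder AMI ID
		request.WithWaiterDelay(request.ConstantWaiterDelay(time.Duration(delay)*time.Second)),
		request.WithWaiterMaxAttempts(attempts),
	)
	if err != nil {
		log.Fatal(err)
	}
}
```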

I don't think replacing these variables with a newly-named global one is a good match. The max polling wait you want for a throttled synchronous API request is not the same as the max polling wait you want while a 30 GB image is copied across six regions.

In testing this in your setup, have you ever been throttled on other API calls like the one I linked in step_key_pair?

If the answer is yes, then we do need a configurable retry for synchronous API calls, but it needs to happen in multiple places in the code, not just in the DescribeImages call. Rather than AWS_DESCRIBE_IMAGES_RETRIES, why not just AWS_API_RETRIES, which we could then apply to all the other API calls we have inside a retry loop?

If the answer is no, then we probably don't need a configurable retry for any of them.

@cove
Contributor Author

cove commented Aug 27, 2019

Yes, we get throttled on at least the DescribeImages call here; it's possible we've had failures due to other non-retried calls as well.

AWS_API_RETRIES with a default of, say, 10, exponential backoff capped at 30s or 60s, and some randomization to help with parallelized builds would be great.

Then the poller could just default to a reasonable interval and be less complex to configure too.

What do you think?

One other comment: it might make sense to prefix the vars with PACKER_* just to avoid confusing them with SDK vars.

@SwampDragons
Contributor

Prefixing with AWS makes it clear that this applies only to the AWS builder. I'd be fine with both, though, e.g. PACKER_AWS_API_RETRIES.

it's possible we've had failures due to other non-retried calls too however.

I'd rather be certain that this is an issue you see before we add a new variable to fix it.

@cove
Contributor Author

cove commented Aug 27, 2019

PACKER_AWS_API_RETRIES makes sense to me.

The prevalidation check is confirmed, yes. What I was saying is that, in addition to those build failures, it's possible other builds have failed because other API calls aren't retried, as you mentioned.

@SwampDragons
Contributor

Yes, I understand that you're seeing build failures in prevalidation without this PR. That's not my question.

My question is whether you've seen build failures due to API throttling at other points which are currently already wrapped in retries. (search the code for retry.Config -- we use hardcoded retries all the time for AWS API calls) If you haven't seen failures at any of these points, then an extra configuration variable is unnecessary.

Speaking of which -- you're using the RetryCount field on the request, but elsewhere in the code we use a custom exponential backoff wrapper for the query specifically to reduce the number of tries we need to attempt during high traffic times, and so that we don't block on a long-running remote request in case a cancellation command comes through. For consistency's sake, and to allow the backoff, I think I'd like to see this retry follow that same format (regardless of where we land on the config option).
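
For context, that wrapper is roughly this shape (a standalone sketch; the actual helper in Packer's codebase may differ in naming and detail):

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// RetryConfig is a sketch of the retry wrapper discussed above; the real
// helper lives in Packer's code and its fields may differ.
type RetryConfig struct {
	Tries       int
	ShouldRetry func(error) bool
	RetryDelay  func() time.Duration
}

// Run retries fn until it succeeds, ShouldRetry rejects the error, the try
// budget is exhausted, or ctx is cancelled (so a cancellation is not stuck
// waiting behind a long backoff sleep).
func (cfg RetryConfig) Run(ctx context.Context, fn func(context.Context) error) error {
	var err error
	for i := 0; i < cfg.Tries; i++ {
		if err = fn(ctx); err == nil {
			return nil
		}
		if cfg.ShouldRetry != nil && !cfg.ShouldRetry(err) {
			return err
		}
		select {
		case <-time.After(cfg.RetryDelay()):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}

// expBackoff returns an exponential delay function capped at max.
func expBackoff(initial, max time.Duration) func() time.Duration {
	next := initial
	return func() time.Duration {
		d := next
		next *= 2
		if next > max {
			next = max
		}
		return d
	}
}

func main() {
	err := RetryConfig{
		Tries:       10,
		ShouldRetry: func(error) bool { return true }, // retry on any error
		RetryDelay:  expBackoff(200*time.Millisecond, 30*time.Second),
	}.Run(context.Background(), func(ctx context.Context) error {
		return errors.New("RequestLimitExceeded") // stand-in for the throttled API call
	})
	log.Println(err)
}
```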

I'm sorry if it seems like I'm being arbitrarily hard on the idea of adding this environment variable. The reason I'm not wild about making this a lever you can tweak is that it encourages bad practices. Sure, I can tell Packer to retry every API call 200 times if I'm being throttled. But that's a band-aid for the actual problem, which is that my organization is regularly running for long periods of time above its API query threshold. Having Packer retry its calls a small handful of times to make sure that individual accidental spikes in API consumption don't mess up the build is one thing. Having Packer consistently bang over and over against Amazon's API thresholds because we haven't tuned our CI system is another.

Which is why I feel that unless you're also seeing build failures as a result of the many other API calls we make, inside retries, but with hardcoded upper limits on the retries, then it isn't worth adding a configuration variable, and could potentially be harmful by covering up a pipeline issue.

I promise I'm not trying to be difficult for no reason here. I just want to make sure that this is really necessary before we add it, because it feels like an antipattern.

@cove
Contributor Author

cove commented Aug 27, 2019

Yes, the existing wrapper makes more sense, didn't know about it.

Omitting the env var works for us, since it sounds like it's been a pain point for you. Ideally nothing needs to be set and it just works ™️. I haven't seen failures in the other calls that are already retried, other than the poller.

@SwampDragons
Contributor

I appreciate it! And I am happy to revisit it in the future if it turns out a simple retry is insufficient :)

@cove
Contributor Author

cove commented Aug 28, 2019

@SwampDragons I took a look at the retry wrapper, and it actually retries on any error rather than only on throttling events like the SDK does, which means auth errors will be retried, for example. Additionally, the SDK adds randomization to the retry delay, so it breaks up synchronized requests during parallel builds better. Lmk if that's ok, or if you'd still prefer to use the wrapper.

@azr
Contributor

azr commented Aug 28, 2019

I think setting the retry count like this should be enough, and this looks good to me.

( I would like to let @SwampDragons have the final call here though 🙂 )


FYI: retry.Config has a ShouldRetry func(error) bool field which could probably be set to something like what the AWS SDK already has:

func shouldRetryError(origErr error) bool {
	switch err := origErr.(type) {
	case awserr.Error:
		if err.Code() == CanceledErrorCode {
			return false
		}
		return shouldRetryError(err.OrigErr())
	case *url.Error:
		if strings.Contains(err.Error(), "connection refused") {
			// Refused connections should be retried as the service may not yet
			// be running on the port. Go TCP dial considers refused
			// connections as not temporary.
			return true
		}
		// *url.Error only implements Temporary after golang 1.6 but since
		// url.Error only wraps the error:
		return shouldRetryError(err.Err)
	case temporary:
		if netErr, ok := err.(*net.OpError); ok && netErr.Op == "dial" {
			return true
		}
		// If the error is temporary, we want to allow continuation of the
		// retry process
		return err.Temporary() || isErrConnectionReset(origErr)
	case nil:
		// `awserr.Error.OrigErr()` can be nil, meaning there was an error but
		// because we don't know the cause, it is marked as retryable. See
		// TestRequest4xxUnretryable for an example.
		return true
	default:
		switch err.Error() {
		case "net/http: request canceled",
			"net/http: request canceled while waiting for connection":
			// known 1.5 error case when an http request is cancelled
			return false
		}
		// here we don't know the error; so we allow a retry.
		return true
	}
}

It would be way better if that shouldRetry func was public though.
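
Since that SDK function is unexported, a ShouldRetry callback could lean on the exported helpers in the request package instead; a minimal sketch:

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/request"
)

// shouldRetry is a sketch of a ShouldRetry callback built on the SDK's
// exported error-classification helpers rather than the unexported
// shouldRetryError above.
func shouldRetry(err error) bool {
	if aerr, ok := err.(awserr.Error); ok && aerr.Code() == request.CanceledErrorCode {
		return false // never retry a cancelled request
	}
	return request.IsErrorThrottle(err) || request.IsErrorRetryable(err)
}

func main() {
	err := awserr.New("RequestLimitExceeded", "Request limit exceeded.", nil)
	fmt.Println(shouldRetry(err)) // true: throttling errors should be retried
}
```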

@azr azr merged commit 86cee5c into hashicorp:master Aug 29, 2019
@azr
Contributor

azr commented Aug 29, 2019

Confirmed internally, thanks @cove 🙂

@ghost

ghost commented Mar 29, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Mar 29, 2020