
Amazon Builder not retrying after RequestLimitExceeded - throttling issue #6330

Closed
CarlInglisBJSS opened this issue Jun 1, 2018 · 15 comments


@CarlInglisBJSS

CarlInglisBJSS commented Jun 1, 2018

Packer version: 1.2.3
Platform: Linux
Builder: amazon-ebs

08:28:26 2018/06/01 07:28:24 packer: 2018/06/01 07:28:24 Waiting for state to become: available  
08:28:26 2018/06/01 07:28:24 packer: 2018/06/01 07:28:24 Using 30s as polling delay (change with AWS_POLL_DELAY_SECONDS)  
08:28:26 2018/06/01 07:28:24 packer: 2018/06/01 07:28:24 Allowing 300s to complete (change with AWS_TIMEOUT_SECONDS)  
08:29:33 2018/06/01 07:29:30 packer: 2018/06/01 07:29:30 Error on AMIStateRefresh: RequestLimitExceeded: Request limit exceeded.  
08:29:33 2018/06/01 07:29:30 packer: 	status code: 503, request id: 2c9c6471-16fc-45c4-9361-d92b6c5a78a1  

From my examination of the code (I'm not a Go programmer) it appears that the function WaitForState relies on isTransientNetworkError to decide whether to retry, and a 503 isn't classed as a retryable error.

I can't tell whether the Go SDK is retrying; however, I suspect not, given how quickly the error is returned.
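For illustration only (this is a sketch of the idea, not the actual Packer or SDK code), a retry classifier would need to treat RequestLimitExceeded as retryable regardless of the HTTP status code, along these lines:

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws/awserr"
)

// isThrottleError reports whether err is an AWS throttling error such as
// RequestLimitExceeded. A state waiter could consult a check like this in
// addition to its transient-network-error check before giving up.
func isThrottleError(err error) bool {
	if awsErr, ok := err.(awserr.Error); ok {
		switch awsErr.Code() {
		case "RequestLimitExceeded", "Throttling", "ThrottlingException":
			return true
		}
	}
	return false
}

func main() {
	// Simulate the error from the log above: code "RequestLimitExceeded",
	// returned with HTTP status 503.
	err := awserr.New("RequestLimitExceeded", "Request limit exceeded.", nil)
	fmt.Println(isThrottleError(err)) // true
}
```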

I have opened a support case with Amazon to get more information on the specific request and I will post anything significant that I get back from them to this issue.

We do have multiple Packer builds running at the same time; however, the issue I'm concerned about is not that, it's that the builder isn't retrying after being throttled.

It is of course possible that I've misunderstood something in the Go code. If that's the case I'd appreciate being pointed at my misunderstanding. Thanks.

@rickard-von-essen
Collaborator

Throttling shouldn't have status code 503. See https://docs.aws.amazon.com/health/latest/APIReference/CommonErrors.html

@CarlInglisBJSS
Author

I appreciate your point; however, the status code is clearly 503 and the error is RequestLimitExceeded, which a lot of Amazon's documentation refers to as a throttling error. So, pragmatically, one would expect it to be handled like any other throttling error.

@rickard-von-essen
Collaborator

Could you add some more lines of the log so we know which API call this is for?

@CarlInglisBJSS
Author

"Error on AMIStateRefresh" narrows it down precisely - that message only appears once in the codebase.

I have asked Amazon for more info on the specific request ID, and if I get anything I'll post it here.

@rickard-von-essen
Collaborator

No it doesn't. There are multiple places where we are waiting for AMIs to become available.

@CarlInglisBJSS
Author

Information from Amazon:


Regarding the specific request provided in the case:
request id: 2c9c6471-16fc-45c4-9361-d92b6c5a78a1

this was a DescribeImages API call
by userArn [snipped]
from sourceIP [snipped]

I see also that for the same time-frame you got other API calls throttled, most of them Describe actions (e.g. for DescribeInstances, DescribeInstanceAttribute, DescribeSecurityGroups, etc.)

That said, since you are using a programmatic approach to API calls, I would recommend implementing exponential back-off to retry calls and make sure they succeed.

Reference documentation on this can be found here:
https://docs.aws.amazon.com/general/latest/gr/api-retries.html
https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
https://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html
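For reference, exponential back-off with "full jitter" as described in those links looks roughly like this in Go (a sketch only, not Packer's actual waiter code; the parameters and the simulated operation in main are made up for illustration):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// retryWithBackoff retries op up to maxAttempts times, sleeping a random
// duration between 0 and min(maxDelay, base*2^attempt) after each failure
// ("full jitter").
func retryWithBackoff(op func() error, maxAttempts int, base, maxDelay time.Duration) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		ceiling := base << uint(attempt)
		if ceiling <= 0 || ceiling > maxDelay {
			ceiling = maxDelay // also guards against shift overflow
		}
		time.Sleep(time.Duration(rand.Int63n(int64(ceiling))))
	}
	return fmt.Errorf("giving up after %d attempts: %v", maxAttempts, err)
}

func main() {
	calls := 0
	err := retryWithBackoff(func() error {
		calls++
		if calls < 3 {
			return fmt.Errorf("RequestLimitExceeded (simulated)")
		}
		return nil // e.g. the state refresh finally succeeding
	}, 5, 500*time.Millisecond, 30*time.Second)
	fmt.Println("result:", err)
}
```

In this case the operation being retried would be the AMI state refresh, i.e. the DescribeImages call identified above.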


I've checked the code base again and that message is only used in that one procedure call, but I think I may have misunderstood what you were asking - here it is in the context of the output:

08:27:54 2018/06/01 07:27:53 ui:     amazon-ebs: Complete!
08:27:54 2018/06/01 07:27:53 packer: 2018/06/01 07:27:53 [INFO] RPC endpoint: Communicator ended with: 0
08:27:54 2018/06/01 07:27:53 [INFO] 19043 bytes written for 'stdout'
08:27:54 2018/06/01 07:27:53 [INFO] 0 bytes written for 'stderr'
08:27:54 2018/06/01 07:27:53 [INFO] RPC client: Communicator ended with: 0
08:27:54 2018/06/01 07:27:53 [INFO] RPC endpoint: Communicator ended with: 0
08:27:54 2018/06/01 07:27:53 packer: 2018/06/01 07:27:53 [INFO] 19043 bytes written for 'stdout'
08:27:54 2018/06/01 07:27:53 packer: 2018/06/01 07:27:53 [INFO] 0 bytes written for 'stderr'
08:27:54 2018/06/01 07:27:53 packer: 2018/06/01 07:27:53 [INFO] RPC client: Communicator ended with: 0
08:27:54 2018/06/01 07:27:53 packer: 2018/06/01 07:27:53 [DEBUG] Opening new ssh session
08:27:54 2018/06/01 07:27:53 packer: 2018/06/01 07:27:53 [DEBUG] starting remote command: rm -f /tmp/script_4177.sh
08:27:54 2018/06/01 07:27:53 packer: 2018/06/01 07:27:53 [INFO] RPC endpoint: Communicator ended with: 0
08:27:54 2018/06/01 07:27:53 [INFO] RPC client: Communicator ended with: 0
08:27:54 2018/06/01 07:27:53 [INFO] RPC endpoint: Communicator ended with: 0
08:27:54 2018/06/01 07:27:53 packer: 2018/06/01 07:27:53 [INFO] RPC client: Communicator ended with: 0
08:27:54 2018/06/01 07:27:53 [INFO] (telemetry) ending shell
08:27:54 2018/06/01 07:27:53 ui: ==> amazon-ebs: Stopping the source instance...
08:27:54 2018/06/01 07:27:53 ui:     amazon-ebs: Stopping instance, attempt 1
08:27:54 2018/06/01 07:27:53 ui: ==> amazon-ebs: Waiting for the instance to stop...
08:28:26 2018/06/01 07:28:23 ui: ==> amazon-ebs: Creating the AMI: mot2-api_java_mot_test-ops-OPSTEAM-1750-update-packer/40-79081
08:28:26 2018/06/01 07:28:24 ui:     amazon-ebs: AMI: ami-7db7b204
08:28:26 2018/06/01 07:28:24 ui: ==> amazon-ebs: Waiting for AMI to become ready...
08:28:26 2018/06/01 07:28:24 packer: 2018/06/01 07:28:24 Waiting for state to become: available
08:28:26 2018/06/01 07:28:24 packer: 2018/06/01 07:28:24 Using 30s as polling delay (change with AWS_POLL_DELAY_SECONDS)
08:28:26 2018/06/01 07:28:24 packer: 2018/06/01 07:28:24 Allowing 300s to complete (change with AWS_TIMEOUT_SECONDS)
08:29:33 2018/06/01 07:29:30 packer: 2018/06/01 07:29:30 Error on AMIStateRefresh: RequestLimitExceeded: Request limit exceeded.
08:29:33 2018/06/01 07:29:30 packer: 	status code: 503, request id: 2c9c6471-16fc-45c4-9361-d92b6c5a78a1
08:29:33 2018/06/01 07:29:30 ui error: ==> amazon-ebs: Error waiting for AMI: RequestLimitExceeded: Request limit exceeded.
08:29:33 2018/06/01 07:29:30 ui: ==> amazon-ebs: Terminating the source AWS instance...
08:29:33 2018/06/01 07:29:30 packer: 2018/06/01 07:29:30 Waiting for state to become: terminated
08:29:33 2018/06/01 07:29:30 packer: 2018/06/01 07:29:30 Using 30s as polling delay (change with AWS_POLL_DELAY_SECONDS)
08:29:33 2018/06/01 07:29:30 packer: 2018/06/01 07:29:30 Allowing 300s to complete (change with AWS_TIMEOUT_SECONDS)
08:30:06 2018/06/01 07:30:01 ui: ==> amazon-ebs: Cleaning up any extra volumes...
08:30:06 2018/06/01 07:30:01 ui: ==> amazon-ebs: No volumes to clean up, skipping
08:30:06 2018/06/01 07:30:01 ui: ==> amazon-ebs: Deleting temporary security group...

@rickard-von-essen
Collaborator

rickard-von-essen commented Jun 1, 2018

Oh, I thought #6177 was released, but apparently it is awaiting 1.3.0, which is the next release. It should handle this correctly.

@CarlInglisBJSS
Author

That's great, thanks! Is there an estimated release date? (So I can shove the ticket on our backlog with a "revisit after" date)

@rickard-von-essen
Collaborator

Unfortunately no, maybe @SwampDragons or @mwhooker have one.

@SwampDragons
Contributor

We just put out 1.2.4, and we normally wait ~6 weeks between releases, so I would guess mid-July at the earliest. That said, I'm working on #6177 now, so if you want to follow that ticket I can build you a binary when it's done and you can test it out in the next week or so.

@mwhooker
Contributor

mwhooker commented Jun 2, 2018

@CarlInglisBJSS thanks for the report. Hopefully #6177 will solve this, and it would be really useful if you could help us test it when it's ready. In the meantime, I would suggest increasing the AWS_POLL_DELAY_SECONDS env var as a workaround.

@CarlInglisBJSS
Author

I'm watching that issue and will happily help test it when it's ready.

I'll try the AWS_POLL_DELAY_SECONDS change - thanks.

@CarlInglisBJSS
Author

CarlInglisBJSS commented Jun 4, 2018

Our existing AWS_POLL_DELAY_SECONDS was 30 seconds. I have added a random 0-9 seconds to that for each invocation to stagger the simultaneous calls. This, coupled with a random 1-10 second delay at the start of each invocation, appears (based on one set of 15 simultaneous starts) to have given us a workaround.

EDIT: Now up to 3 different sets of 15 simultaneous starts of the script - looks like this is working! Thanks.
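For anyone wanting to reproduce that staggering, the logic looks something like this (a hypothetical Go sketch of the wrapper; the actual change was made in our invocation script, and the packer command line and template name here are placeholders):

```go
package main

import (
	"math/rand"
	"os"
	"os/exec"
	"strconv"
	"time"
)

func main() {
	rand.Seed(time.Now().UnixNano())

	// Stagger simultaneous starts by a random 1-10 seconds.
	time.Sleep(time.Duration(1+rand.Intn(10)) * time.Second)

	// Base 30-second poll delay plus a random 0-9 seconds of jitter.
	os.Setenv("AWS_POLL_DELAY_SECONDS", strconv.Itoa(30+rand.Intn(10)))

	// "template.json" is a placeholder for the real Packer template.
	cmd := exec.Command("packer", "build", "template.json")
	cmd.Env = os.Environ()
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		os.Exit(1)
	}
}
```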

@SwampDragons
Contributor

This was fixed by #6332.

@ghost

ghost commented Mar 31, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Mar 31, 2020