Use amazon SDK waiters with robust defaults #6177

Closed
2 of 4 tasks
mwhooker opened this issue Apr 23, 2018 · 28 comments · Fixed by #6332

Comments

@mwhooker
Contributor

mwhooker commented Apr 23, 2018

Before the aws SDK had a viable wait method (i.e. to wait for resources to become ready), packer implemented its own version of this. See builder/amazon/common/state.go.

We routinely get issues reported that we're not waiting long enough. I've slowly been replacing the homegrown waiters with the SDK waiters, which generally seems to solve the problem. I think we should remove the homegrown waiters for good and replace them with the SDK waiters.

There is a backwards-compatibility aspect to consider:

The common/state.go waiters use the env vars AWS_POLL_DELAY_SECONDS and AWS_TIMEOUT_SECONDS to control how long to wait.

The SDK waiters implement per-resource defaults for the above values. For example, the wait for instances to be running looks like this:

func (c *EC2) WaitUntilInstanceRunningWithContext(ctx aws.Context, input *DescribeInstancesInput, opts ...request.WaiterOption) error {
        w := request.Waiter{
                Name:        "WaitUntilInstanceRunning",
                MaxAttempts: 40,
                Delay:       request.ConstantWaiterDelay(15 * time.Second),
                Acceptors: []request.WaiterAcceptor{
                        {
                                State:   request.SuccessWaiterState,
                                Matcher: request.PathAllWaiterMatch, Argument: "Reservations[].Instances[].State.Name",
                                Expected: "running",
                        },
                        {
                                State:   request.FailureWaiterState,
                                Matcher: request.PathAnyWaiterMatch, Argument: "Reservations[].Instances[].State.Name",
                                Expected: "shutting-down",
                        },
                        {
                                State:   request.FailureWaiterState,
                                Matcher: request.PathAnyWaiterMatch, Argument: "Reservations[].Instances[].State.Name",
                                Expected: "terminated",
                        },
                        {
                                State:   request.FailureWaiterState,
                                Matcher: request.PathAnyWaiterMatch, Argument: "Reservations[].Instances[].State.Name",
                                Expected: "stopping",
                        },
                        {
                                State:    request.RetryWaiterState,
                                Matcher:  request.ErrorWaiterMatch,
                                Expected: "InvalidInstanceID.NotFound",
                        },
                },
                Logger: c.Config.Logger,
                NewRequest: func(opts []request.Option) (*request.Request, error) {
                        var inCpy *DescribeInstancesInput
                        if input != nil {
                                tmp := *input
                                inCpy = &tmp
                        }
                        req, _ := c.DescribeInstancesRequest(inCpy)
                        req.SetContext(ctx)
                        req.ApplyOptions(opts...)
                        return req, nil
                },
        }
        w.ApplyOptions(opts...)

        return w.WaitWithContext(ctx)
}

As you can see, it has the delay hardcoded to 15s and the maximum attempts set to 40. Our homegrown waiters use a 5 minute absolute timeout and a 2 second delay.

I believe packer has a different use case than the SDK defaults. We don't care so much about liveness; instead we care deeply about the request eventually succeeding. I believe we should use the SDK waiters with the default number of retries set to 10x the SDK default.

We must also allow users to override these values in case the defaults are unacceptable. Ideally we could reuse the existing env vars in a way that makes sense, but we may have to come up with an alternate configuration syntax.
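
For illustration, here is a minimal sketch (not Packer's actual implementation) of how the existing env vars could be mapped onto aws-sdk-go waiter options; the package, helper, and variable names are made up for this example:

package example

import (
        "context"
        "os"
        "strconv"
        "time"

        "github.com/aws/aws-sdk-go/aws/request"
        "github.com/aws/aws-sdk-go/service/ec2"
)

// waiterOptionsFromEnv builds waiter options from the environment, falling
// back to the SDK's per-waiter defaults when the variables are unset.
func waiterOptionsFromEnv() []request.WaiterOption {
        var opts []request.WaiterOption
        if v := os.Getenv("AWS_POLL_DELAY_SECONDS"); v != "" {
                if secs, err := strconv.Atoi(v); err == nil {
                        opts = append(opts, request.WithWaiterDelay(
                                request.ConstantWaiterDelay(time.Duration(secs)*time.Second)))
                }
        }
        if v := os.Getenv("AWS_MAX_ATTEMPTS"); v != "" {
                if n, err := strconv.Atoi(v); err == nil {
                        opts = append(opts, request.WithWaiterMaxAttempts(n))
                }
        }
        return opts
}

// waitForInstance shows the options being applied to one of the SDK waiters.
func waitForInstance(ctx context.Context, conn *ec2.EC2, instanceID string) error {
        return conn.WaitUntilInstanceRunningWithContext(ctx,
                &ec2.DescribeInstancesInput{InstanceIds: []*string{&instanceID}},
                waiterOptionsFromEnv()...)
}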

Todo

  • get all the waiters replaced with the SDK waiters
  • increase the default retries 10x
  • make the defaults overridable by env vars
  • Document the env var behavior on the website
@mwhooker
Contributor Author

mwhooker commented Jun 2, 2018

Looking at #6330, I find it odd that the default waiters use a constant delay. I wonder if we should add an option to use exponential backoff...
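
For what it's worth, a custom delay is straightforward with the SDK's waiter plumbing: request.WaiterDelay is just a func(attempt int) time.Duration, so a rough sketch like the one below (the base/ceiling values are arbitrary examples, not a proposal for specific defaults) could be passed via request.WithWaiterDelay:

// exponentialWaiterDelay is a rough sketch of an exponential backoff delayer
// for the aws-sdk-go waiters; base and ceiling are illustrative parameters.
func exponentialWaiterDelay(base, ceiling time.Duration) request.WaiterDelay {
        return func(attempt int) time.Duration {
                d := base
                for i := 0; i < attempt && d < ceiling; i++ {
                        d *= 2
                }
                if d > ceiling {
                        d = ceiling
                }
                return d
        }
}

// Usage, assuming the imports from the earlier sketch:
// err := conn.WaitUntilInstanceRunningWithContext(ctx, input,
//         request.WithWaiterDelay(exponentialWaiterDelay(2*time.Second, time.Minute)))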

@stevemac007

I like the approach of using the SDK waiters for this solution - any ETA on having 1.3.0 ready?

@SwampDragons
Contributor

It'll probably be at least a month -- I'm working on implementing the waiters right now in #6332 and will hopefully be done with testing today. I can build you a binary if you'd like to try it out.

@SwampDragons
Contributor

@mwhooker I'm not sure points 2 and 4 are necessary in the way I've implemented it; I have reused Packer's current env vars rather than overriding them with anything, so the defaults are already controlled by those env vars (point 2), and the behavior should already be documented because nothing has changed (point 4).

@nstreet

nstreet commented Jun 26, 2018

Could I please point out that the issue reported in #6305, which ended up being tracked here, is about the timeout happening when we have specifically requested that no timeout should apply via disable_stop_instance.
The log output does suggest that a timeout is being applied:

2018/06/26 10:59:45 ui: ==> amazon-ebs: Automatic instance stop disabled. Please stop instance manually.
2018/06/26 10:59:45 ui: ==> amazon-ebs: Waiting for the instance to stop...
2018/06/26 10:59:45 packer: 2018/06/26 10:59:45 Waiting for state to become: stopped
2018/06/26 10:59:45 packer: 2018/06/26 10:59:45 Using 2s as polling delay (change with AWS_POLL_DELAY_SECONDS)
2018/06/26 10:59:45 packer: 2018/06/26 10:59:45 Allowing 300s to complete (change with AWS_TIMEOUT_SECONDS)

Using packer 1.0.2, packer waits indefinitely, as we would want it to.
I have tried packer versions 1.2.2 and 1.2.4 and both time out, which is not the behaviour we require if we specify disable_stop_instance.

@SwampDragons
Contributor

@nstreet disable_stop_instance just prevents the instance from being stopped when the build has finished running; it doesn't affect timeouts.

@SwampDragons
Contributor

@nstreet Looking closer at this, you've found a bug unrelated to the timeouts -- the waiter sits outside an if statement that it should be inside. I can fix that for the next release.

@SwampDragons
Contributor

Okay, final comment @nstreet -- I looked into this and it turns out it's a flag for a very very specific use case; for that use case it works fine, but it does not do what you want it to do. I've made a PR updating the docs to make them clearer.

@PCSshell

@SwampDragons any chance I could get a build with this fix in it to try out?

@SwampDragons
Contributor

@PCSshell sure -- what's your host OS?

@PCSshell

@SwampDragons Ubuntu 16.04

@SwampDragons
Contributor

packer.zip

@PCSshell

@SwampDragons Apparently I am not doing something correctly. Below are the commands I am running to set the timeout settings. The log is telling me that the override values are not set.

$ export AWS_POLL_DELAY_SECONDS=10
$ export AWS_MAX_ATTEMPTS=300
$ export AWS_TIMEOUT_SECONDS=3000
$ export PACKER_LOG=1
$ packer build 2012R2/Server2012R2.json

==> amazon-ebs: Error waiting for instance to stop: ResourceNotReady: exceeded wait attempts
2018/07/11 09:12:43 ui: ==> amazon-ebs: Terminating the source AWS instance...
==> amazon-ebs: Terminating the source AWS instance...
2018/07/11 09:12:44 packer: 2018/07/11 09:12:44 No AWS timeout and polling overrides have been set. Packer will defalt to waiter-specific delays and timeouts. If you would like to customize the length of time between retries and max number of retries you may do so by setting the environment variables AWS_POLL_DELAY_SECONDS and AWS_MAX_ATTEMPTS to your desired values.

$ echo $AWS_MAX_ATTEMPTS
300
$ echo $AWS_POLL_DELAY_SECONDS
10
$ echo $AWS_TIMEOUT_SECONDS
3000

@SwampDragons
Contributor

Oh weird. And of course I merged before I saw your comment. I'll take a look at this.

@SwampDragons
Contributor

ugh, found it. I made some last-minute changes to make testing easier, and... broke the functionality. 😅
packer.zip

@PCSshell

I am still seeing the same behavior. The --version number I am seeing now is [INFO] Packer version: 1.3.0-dev

@SwampDragons
Contributor

The same behavior as the previous build I gave you? I just tested again locally, and I only get the No AWS timeout and polling overrides have been set warning if I have not set the environment variables. If I set the environment variables as you have above, I get a different (correct) warning -- WARNING: AWS_MAX_ATTEMPTS and AWS_TIMEOUT_SECONDS are both set.... etc

I was bouncing back and forth between several issues yesterday so I guess it's possible I built from the wrong commit (I can't test the binary right now since I'm on a darwin machine)... just in case, here's a new build of the master branch.
packer.zip

@PCSshell

Looks like it is recognizing the environment variables now, but does not seem to respect what I am entering for them.

export AWS_POLL_DELAY_SECONDS=30
export AWS_MAX_ATTEMPTS=3000
export AWS_TIMEOUT_SECONDS=3000

2018/07/12 13:47:41 ui: ==> amazon-ebs: Automatic instance stop disabled. Please stop instance manually.
==> amazon-ebs: Automatic instance stop disabled. Please stop instance manually.
2018/07/12 13:47:41 ui: ==> amazon-ebs: Waiting for the instance to stop...
==> amazon-ebs: Waiting for the instance to stop...
==> amazon-ebs: Error waiting for instance to stop: ResourceNotReady: exceeded wait attempts
2018/07/12 13:57:38 ui error: ==> amazon-ebs: Error waiting for instance to stop: ResourceNotReady: exceeded wait attempts
==> amazon-ebs: Terminating the source AWS instance...
2018/07/12 13:57:38 ui: ==> amazon-ebs: Terminating the source AWS instance...
2018/07/12 13:57:38 packer: 2018/07/12 13:57:38 WARNING: AWS_MAX_ATTEMPTS and AWS_TIMEOUT_SECONDS are both set. Packer will be using AWS_MAX_ATTEMPTS and discarding AWS_TIMEOUT_SECONDS. If you have not set AWS_POLL_DELAY_SECONDS, Packer will default to a 2 second poll delay.
2018/07/12 13:58:39 ui: ==> amazon-ebs: Cleaning up any extra volumes...
==> amazon-ebs: Cleaning up any extra volumes...
2018/07/12 13:58:39 ui: ==> amazon-ebs: No volumes to clean up, skipping
==> amazon-ebs: No volumes to clean up, skipping
2018/07/12 13:58:39 ui: ==> amazon-ebs: Deleting temporary security group...
==> amazon-ebs: Deleting temporary security group...
2018/07/12 13:58:40 ui: ==> amazon-ebs: Deleting temporary keypair...
==> amazon-ebs: Deleting temporary keypair...
2018/07/12 13:58:40 [INFO] (telemetry) ending amazon-ebs
2018/07/12 13:58:40 ui error: Build 'amazon-ebs' errored: Error waiting for instance to stop: ResourceNotReady: exceeded wait attempts
Build 'amazon-ebs' errored: Error waiting for instance to stop: ResourceNotReady: exceeded wait attempts

@SwampDragons
Contributor

Okay, I dug around some more -- it looks like you've found a bug where this code was not governed by our environment variables even before I swapped out the waiters in the attached PR; it just used the waiter's default. I missed it when swapping out our custom code for the AWS waiters, because this particular wait had already been swapped out.

https://github.com/hashicorp/packer/blob/master/builder/amazon/common/step_stop_ebs_instance.go#L81
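
For context, the kind of change described here could look roughly like the hedged sketch below: routing the stop-instance wait through the same env-var-derived options instead of the waiter's hardcoded defaults. waiterOptionsFromEnv is the illustrative helper from the earlier sketch, not Packer's actual code.

// waitForInstanceStop is a sketch only; it assumes the imports and the
// waiterOptionsFromEnv helper shown earlier in this thread.
func waitForInstanceStop(ctx context.Context, conn *ec2.EC2, instanceID string) error {
        return conn.WaitUntilInstanceStoppedWithContext(ctx,
                &ec2.DescribeInstancesInput{InstanceIds: []*string{&instanceID}},
                waiterOptionsFromEnv()...)
}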

Third time's the charm?
packer.zip

@PCSshell

@SwampDragons - Working as expected now. Thanks for all the assistance!

@SwampDragons
Contributor

No problem -- thanks for your patience as I looked into this.

@WeekendsBull

WeekendsBull commented Aug 28, 2018

@SwampDragons I have tried both packer v1.3 and v1.2.3, but I wasn't able to get rid of the error saying unexpected state 'failed', wanted target 'available'.

I am currently building 4 different AMIs from jenkins-slaves for 4 different aws accounts, and three of them work fine, meaning the AMIs build on those three accounts without this error. Only one account gives me the error, even though it runs the same packer version (v1.2.3) and the same Ansible (v2.4.1) packer code.
I wonder if this might be related to IAM role permissions. I will update once I have compared the IAM permissions and found out more.

==> aws: Copying AMI: us-west-2(ami-0a30229bcbd197499)
==> aws: Encrypting with KMS Key ID: arn:aws:kms:us-west-2:8xxxxxxxxxxxx:key/be91ec52-a7ed-4b6b-xxxx-d6594f71axxxxx
==> aws: Waiting for AMI copy to become ready...
==> aws: Error waiting for AMI Copy: unexpected state 'failed', wanted target 'available'
==> aws: Deregistering the AMI because cancellation or error...
==> aws: Terminating the source AWS instance...
==> aws: Cleaning up any extra volumes...
==> aws: No volumes to clean up, skipping
==> aws: Deleting temporary security group...
==> aws: Deleting temporary keypair...
Build 'aws' errored: Error waiting for AMI Copy: unexpected state 'failed', wanted target 'available'

==> Some builds didn't complete successfully and had errors:
--> aws: Error waiting for AMI Copy: unexpected state 'failed', wanted target 'available'

==> Builds finished but no artifacts were created.

On the jenkins slave machine that doesn't work, I ran pip list to make sure all pip-installed components (awscli, pip, ansible, requests, etc.) are the same as on the others; the version discrepancies are minimal and the machines are almost identical.
I don't understand why the other three accounts build fine without the issue, while one of the 4 accounts shows the error on packer 1.2.3, 1.2.4, or 1.2.5 (above) and on packer 1.3.0 (below).

2018/08/29 01:23:06 packer: 2018/08/29 01:23:06 Detected home directory from env var: /home/jenkins-slave
2018/08/29 01:23:06 packer: 2018/08/29 01:23:06 Plugin address: unix /tmp/packer-plugin482621616
2018/08/29 01:23:06 packer: 2018/08/29 01:23:06 Waiting for connection...
2018/08/29 01:23:06 packer: 2018/08/29 01:23:06 Serving a plugin connection...
2018/08/29 01:23:06 Loading provisioner: ansible-local
2018/08/29 01:23:06 Plugin could not be found. Checking same directory as executable.
2018/08/29 01:23:06 Current exe path: /usr/bin/packer
2018/08/29 01:23:06 Creating plugin client for path: /usr/bin/packer
2018/08/29 01:23:06 Starting plugin: /usr/bin/packer []string{"/usr/bin/packer", "plugin", "packer-provisioner-ansible-local"}
2018/08/29 01:23:06 Waiting for RPC address for: /usr/bin/packer
2018/08/29 01:23:06 packer: 2018/08/29 01:23:06 [INFO] Packer version: 1.3.0-dev
2018/08/29 01:23:06 packer: 2018/08/29 01:23:06 Packer Target OS/Arch: linux amd64
==> aws: Stopping the source instance...
2018/08/29 01:35:16 ui: ==> aws: Stopping the source instance...
aws: Stopping instance, attempt 1
2018/08/29 01:35:16 ui: aws: Stopping instance, attempt 1
==> aws: Waiting for the instance to stop...
2018/08/29 01:35:16 ui: ==> aws: Waiting for the instance to stop...
2018/08/29 01:35:16 packer: 2018/08/29 01:35:16 No AWS timeout and polling overrides have been set. Packer will defalt to waiter-specific delays and timeouts. If you would like to customize the length of time between retries and max number of retries you may do so by setting the environment variables AWS_POLL_DELAY_SECONDS and AWS_MAX_ATTEMPTS to your desired values.
2018/08/29 01:35:47 ui: ==> aws: Creating the AMI: sss-centos7-test
==> aws: Creating the AMI: sss-centos7-test
2018/08/29 01:35:47 ui: aws: AMI: ami-06059b63f16b53d61
==> aws: Waiting for AMI to become ready...
2018/08/29 01:35:47 ui: ==>aws: Waiting for AMI to become ready..
2018/08/29 01:35:47 packer: 2018/08/29 01:35:47 No AWS timeout and polling overrides have been set. Packer will defalt to waiter-specific delays and timeouts. If you would like to customize the length of time between retries and max number of retries you may do so by setting the environment variables AWS_POLL_DELAY_SECONDS and AWS_MAX_ATTEMPTS to your desired values.
2018/08/29 01:38:33 ui: ==> aws: Creating Encrypted AMI Copy
==> aws: Creating Encrypted AMI Copy
2018/08/29 01:38:33 ui: ==> aws: Copying AMI: us-west-2(ami-06059b63f16b53d61)
==> aws: Copying AMI: us-west-2(ami-06059b63f16b53d61)
2018/08/29 01:38:33 ui: ==> aws: Encrypting with KMS Key ID: arn:aws:kms:us-west-2:xxxxxxxxx:key/be91ec52-a7ed-4b6b-b8cd-d6594f71a044
==> aws: Encrypting with KMS Key ID: arn:aws:kms:us-west-2:xxxxxxxxxxx:key/be91ec52-a7ed-4b6b-b8cd-d6594f71a044
2018/08/29 01:38:33 ui: ==> aws: Waiting for AMI copy to become ready...
==>aws: Waiting for AMI copy to become ready...
2018/08/29 01:38:33 packer: 2018/08/29 01:38:33 No AWS timeout and polling overrides have been set. Packer will defalt to waiter-specific delays and timeouts. If you would like to customize the length of time between retries and max number of retries you may do so by setting the environment variables AWS_POLL_DELAY_SECONDS and AWS_MAX_ATTEMPTS to your desired values.
2018/08/29 01:48:23 ui error: ==> aws: Error waiting for AMI Copy: ResourceNotReady: exceeded wait attempts
2018/08/29 01:48:23 ui: ==> aws: Deregistering the AMI because cancellation or error...
==> aws: Error waiting for AMI Copy: ResourceNotReady: exceeded wait attempts
==> aws: Deregistering the AMI because cancellation or error...
2018/08/29 01:48:23 ui: ==> aws: Terminating the source AWS instance...
==> aws: Terminating the source AWS instance...
2018/08/29 01:48:38 ui: ==> aws: Deleting temporary security group...
2018/08/29 01:48:39 ui: ==> aws: Deleting temporary keypair...
==> aws: Deleting temporary keypair...
2018/08/29 01:48:39 [INFO] (telemetry) ending amazon-ebs
2018/08/29 01:48:39 ui error: Build 'aws' errored: Error waiting for AMI Copy: ResourceNotReady: exceeded wait attempts
2018/08/29 01:48:39 Builds completed. Waiting on interrupt barrier...
2018/08/29 01:48:39 machine readable: error-count []string{"1"}
2018/08/29 01:48:39 ui error:
==> Some builds didn't complete successfully and had errors:
2018/08/29 01:48:39 machine readable: aws,error []string{"Error waiting for AMI Copy: ResourceNotReady: exceeded wait attempts"}
2018/08/29 01:48:39 ui error: --> aws-ext-il2: Error waiting for AMI Copy: ResourceNotReady: exceeded wait attempts
2018/08/29 01:48:39 ui:
==> Builds finished but no artifacts were created.
Build 'aws' errored: Error waiting for AMI Copy: ResourceNotReady: exceeded wait attempts
==> Some builds didn't complete successfully and had errors:
--> aws: Error waiting for AMI Copy: ResourceNotReady: exceeded wait attempts
==> Builds finished but no artifacts were created.
2018/08/29 01:48:39 [INFO] (telemetry) Finalizing.
2018/08/29 01:48:39 waiting for all plugin processes to complete...
2018/08/29 01:48:39 /usr/bin/packer: plugin process exited
2018/08/29 01:48:39 /usr/bin/packer: plugin process exited
2018/08/29 01:48:39 /usr/bin/packer: plugin process exited
2018/08/29 01:48:39 /usr/bin/packer: plugin process exited
2018/08/29 01:48:39 /usr/bin/packer: plugin process exited
script returned exit code 1

Any suggestions?

@SwampDragons
Contributor

I'll try to look at this today.

@WeekendsBull

WeekendsBull commented Aug 30, 2018

@SwampDragons Thanks for your attention. I added some logs at https://github.com/WeekendsBull/packerbuild-error/tree/master/packer_debug

I put up both centos and windows 2012r2 build logs, covering builds that work and builds that fail. I have only provided logs from packer v1.2.3; would you like me to provide v1.3.0-dev logs as well?

I also put up an excel file showing what is installed on each jenkins slave instance; the installed versions are almost identical.

Here is a copy of the centos template I use: https://github.com/WeekendsBull/packerbuild-error/blob/master/centos7build.json (each build name is associated with one account).

@SwampDragons
Contributor

@WeekendsBull I figured this out and responded to you on the mailing list, but I'll leave a note here in case future users run into the same problem. You need to increase the AWS timeout by setting the environment variables AWS_POLL_DELAY_SECONDS and AWS_MAX_ATTEMPTS.

For example:

export AWS_POLL_DELAY_SECONDS=5
export AWS_MAX_ATTEMPTS=300
packer build example_template.json

or simply:

AWS_POLL_DELAY_SECONDS=5 AWS_MAX_ATTEMPTS=300 packer build example_template.json

@samof76

samof76 commented Jul 31, 2019

Using Packer version 1.4.2, I still hit this issue in spite of setting AWS_POLL_DELAY_SECONDS=5 AWS_MAX_ATTEMPTS=300 while doing a packer build. Here is my base.json...

{
    "variables": {
        "region": "us-east-1",
        "ssh_username": "ec2-user",
        "instance_type": "t2.micro",
        "vpc_id": "vpc-xxxxx",
        "subnet_id": "subnet-YYYY"
    },
    "builders": [
        {
            "type": "amazon-ebs",
            "ami_name": "eks-ami-sample-{{timestamp}}",
            "region": "{{user `region` }}",
            "vpc_id": "{{user `vpc_id`}}",
            "subnet_id": "{{user `subnet_id`}}",
            "instance_type": "{{user `instance_type` }}",
            "ssh_username": "{{user `ssh_username`}}",
            "associate_public_ip_address": false,
            "source_ami_filter": {
                "filters": {
                    "virtualization-type": "hvm",
                    "source": "amazon/amazon-eks-node-1.13-*",
                    "root-device-type": "ebs",
                    "architecture": "x86_64"
                },
                "owners": [
                    "602401143452"
                ],
                "most_recent": true
            }
        }
    ],
    "provisioners": [
        {
            "type": "shell",
            "script": "./scripts/touch_it.sh"
        }
    ]
}

@SwampDragons
Contributor

This issue is a year old -- if you're still experiencing problems then please open a new ticket with full steps to reproduce and full debug logs.

@ghost

ghost commented Mar 28, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Mar 28, 2020