Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry timeout for IAM instance profile eventual consistency not high enough #13199

Closed
lvisterin opened this issue May 7, 2020 · 3 comments · Fixed by #17811
Closed

Retry timeout for IAM instance profile eventual consistency not high enough #13199

lvisterin opened this issue May 7, 2020 · 3 comments · Fixed by #17811
Assignees
Labels
bug Addresses a defect in current functionality. service/ec2 Issues and PRs that pertain to the ec2 service. service/iam Issues and PRs that pertain to the iam service.
Milestone

Comments

@lvisterin
Copy link

lvisterin commented May 7, 2020

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform Version

Terraform v0.12.24
+ provider.aws v2.60.0

Affected Resource(s)

  • aws_instance
  • aws_iam_instance_profile

Terraform Configuration Files

provider "aws" {
  version = "~> 2.0"
  region  = "eu-west-1"
}

resource "aws_iam_instance_profile" "test" {
  name = "tf-test-profile"
  path = "/"
  role = aws_iam_role.test.name
}

resource "aws_iam_role" "test" {
  name                 = "aimrole-tf-test"
  path                 = "/"
  assume_role_policy   = <<EOF
{
  "Statement": [
    {
      "Principal": {
        "Service": [
          "ec2.amazonaws.com"
        ]
      },
      "Action": [
        "sts:AssumeRole"
      ],
      "Effect": "Allow"
    }
  ],
  "Version": "2012-10-17"
}
EOF

}

resource "aws_instance" "test" {
  count = 75

  ami           = "ami-06ce3edf0cff21f07"
  instance_type = "t3a.nano"
  key_name      = "test_lander"

  iam_instance_profile = aws_iam_instance_profile.test.id

  tags = {
    Name = "terraform-test-0-${count.index}"
  }
}

Expected Behavior

Instances are all created and are listed in the state.

Actual Behavior

Terraform encounters IAM instance profile errors and gives up after retries.

The instance is created on AWS with the correct IAM, but it is not tracked in the state.

Retrying the apply will result in a duplicate instance. This also breaks destroy if the instance has security groups because the untracked instance will still be attached.

Error: Error launching source instance: InvalidParameterValue: Value (tf-test-profile) for parameter iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name
	status code: 400, request id: 589e361f-5b4d-46b7-8b62-c83aa964a8c3

  on main.tf line 36, in resource "aws_instance" "test":
  36: resource "aws_instance" "test" {

Steps to Reproduce

terraform apply -parallelism 100

The reason we are running into this a lot is that we are launching ~100 aws_instance at the same time. Because we run Terraform with -parallelism 100 the chance of this happening becomes higher.

It should be noted that this is pretty hard to reproduce since for most of the time the IAM roles propagate everywhere within AWS within the 30 seconds retry period.

Proposed fix

I have seen this has been worked around with a retry on the Invalid IAM Instance Profile error in many places. However this retry is not consistent across the code:

  • CreateAutoScalingGroup: 1 minute
  • AddRoleToInstanceProfile: 30 seconds
  • RunInstances: 30 seconds <- what we are having issues with
  • ReplaceIamInstanceProfileAssociation: 1 minute
  • RequestSpotInstances: 1 minute
  • CreateLaunchConfiguration: 90 seconds
  • AssociateIamInstanceProfile 1 minute

I suggest changing this delay to 1 minute for consistency, or 2 minutes, which is how long CloudFormation waits for this: https://forums.aws.amazon.com/thread.jspa?messageID=593651

Either way, the 30 second retry is not enough. I would like to make a PR to make this consistent across the provider code but I am not sure what retry period you think is best. What are your thoughts on this?

edit: I have changed the retry to 1 minute and still ran into the issue. Then I ran the example configuration with 2 minute retry and I haven't seen it come back yet, even with 100 instances.

It could also be that this retry timeout is unrelated to the issue and that something else went wrong that causes the instances go untracked on high parallelism, perhaps the API rate limit which we do seem to hit a lot according to the logs.

References

A few of the past issues that have been closed:

@ghost ghost added service/ec2 Issues and PRs that pertain to the ec2 service. service/iam Issues and PRs that pertain to the iam service. labels May 7, 2020
@github-actions github-actions bot added the needs-triage Waiting for first response or review from a maintainer. label May 7, 2020
@svyotov
Copy link

svyotov commented May 7, 2020

I have observed this too. For me the real issue is that the instance is created on AWS, but it is not tracked in the state. If terraform timeouts now and then (probally due to API rate limiting), a re-run will fix it. But having EC2 instances being created and not being tracked - increased the cost and messes up destroys.

@bflad bflad added bug Addresses a defect in current functionality. and removed needs-triage Waiting for first response or review from a maintainer. labels Jul 22, 2020
@bflad bflad self-assigned this Feb 25, 2021
bflad added a commit that referenced this issue Feb 25, 2021
…enabling go-mnd linter

Reference: #13199
Reference: #16752
Reference: #16753

IAM eventual consistency handling has long been the source of needing retries in resource logic. Due to the lack of a consistent implementation (e.g. static constant) for how long to retry for these types of errors, there have been varying retry durations. The `iamwaiter.PropagationTimeout` constant was introduced for this purpose.

This change begins by introducing the `go-mnd` linter to enforce the usage of constants in function arguments. Example reports below. The rest of the changes are the minimum required to ensure `iamwaiter.PropagationTimeout` with its 2 minute duration is applied. You will note that this is fixing the duration in some cases to slightly increase it to the standard value. Any higher durations are ignored to reduce changes for now. As such, this can be reviewed by validating that a lower duration was not introduced and skipping acceptance testing since no logic changes should be introduced.

One caveat to `go-mnd` is that it currently ignores `1` as a magic number, which is possible in usage such as `1*time.Minute`, and that ignored number cannot be overriden. An upstream issue will be created to ask the `ignore-number` configuration to overwrite instead of append.

Example previous report:

```
aws/resource_aws_api_gateway_account.go:99:23: mnd: Magic number: 2, in <argument> detected (gomnd)
	err = resource.Retry(2*time.Minute, func() *resource.RetryError {
	                     ^
```
bflad added a commit that referenced this issue Mar 22, 2021
…enabling go-mnd linter

Reference: #13199
Reference: #16752
Reference: #16753

IAM eventual consistency handling has long been the source of needing retries in resource logic. Due to the lack of a consistent implementation (e.g. static constant) for how long to retry for these types of errors, there have been varying retry durations. The `iamwaiter.PropagationTimeout` constant was introduced for this purpose.

This change begins by introducing the `go-mnd` linter to enforce the usage of constants in function arguments. Example reports below. The rest of the changes are the minimum required to ensure `iamwaiter.PropagationTimeout` with its 2 minute duration is applied. You will note that this is fixing the duration in some cases to slightly increase it to the standard value. Any higher durations are ignored to reduce changes for now. As such, this can be reviewed by validating that a lower duration was not introduced and skipping acceptance testing since no logic changes should be introduced.

One caveat to `go-mnd` is that it currently ignores `1` as a magic number, which is possible in usage such as `1*time.Minute`, and that ignored number cannot be overriden. An upstream issue will be created to ask the `ignore-number` configuration to overwrite instead of append.

Example previous report:

```
aws/resource_aws_api_gateway_account.go:99:23: mnd: Magic number: 2, in <argument> detected (gomnd)
	err = resource.Retry(2*time.Minute, func() *resource.RetryError {
	                     ^
```
bflad added a commit that referenced this issue Mar 26, 2021
…enabling go-mnd linter (#17811)

Reference: #13199
Reference: #16752
Reference: #16753

IAM eventual consistency handling has long been the source of needing retries in resource logic. Due to the lack of a consistent implementation (e.g. static constant) for how long to retry for these types of errors, there have been varying retry durations. The `iamwaiter.PropagationTimeout` constant was introduced for this purpose.

This change begins by introducing the `go-mnd` linter to enforce the usage of constants in function arguments. Example reports below. The rest of the changes are the minimum required to ensure `iamwaiter.PropagationTimeout` with its 2 minute duration is applied. You will note that this is fixing the duration in some cases to slightly increase it to the standard value. Any higher durations are ignored to reduce changes for now. As such, this can be reviewed by validating that a lower duration was not introduced and skipping acceptance testing since no logic changes should be introduced.

One caveat to `go-mnd` is that it currently ignores `1` as a magic number, which is possible in usage such as `1*time.Minute`, and that ignored number cannot be overriden. An upstream issue will be created to ask the `ignore-number` configuration to overwrite instead of append.

Example previous report:

```
aws/resource_aws_api_gateway_account.go:99:23: mnd: Magic number: 2, in <argument> detected (gomnd)
	err = resource.Retry(2*time.Minute, func() *resource.RetryError {
	                     ^
```
@github-actions github-actions bot added this to the v3.35.0 milestone Mar 26, 2021
@ghost
Copy link

ghost commented Apr 1, 2021

This has been released in version 3.35.0 of the Terraform AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template for triage. Thanks!

@ghost
Copy link

ghost commented Apr 25, 2021

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@ghost ghost locked as resolved and limited conversation to collaborators Apr 25, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Addresses a defect in current functionality. service/ec2 Issues and PRs that pertain to the ec2 service. service/iam Issues and PRs that pertain to the iam service.
Projects
None yet
3 participants