Retry timeout for IAM instance profile eventual consistency not high enough #13199

lvisterin · 2020-05-07T09:44:08Z

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform Version

Terraform v0.12.24
+ provider.aws v2.60.0

Affected Resource(s)

aws_instance
aws_iam_instance_profile

Terraform Configuration Files

provider "aws" {
  version = "~> 2.0"
  region  = "eu-west-1"
}

resource "aws_iam_instance_profile" "test" {
  name = "tf-test-profile"
  path = "/"
  role = aws_iam_role.test.name
}

resource "aws_iam_role" "test" {
  name               = "aimrole-tf-test"
  path               = "/"
  assume_role_policy = <<EOF
{
  "Statement": [
    {
      "Principal": {
        "Service": [
          "ec2.amazonaws.com"
        ]
      },
      "Action": [
        "sts:AssumeRole"
      ],
      "Effect": "Allow"
    }
  ],
  "Version": "2012-10-17"
}
EOF

}

resource "aws_instance" "test" {
  count = 75

  ami           = "ami-06ce3edf0cff21f07"
  instance_type = "t3a.nano"
  key_name      = "test_lander"

  iam_instance_profile = aws_iam_instance_profile.test.id

  tags = {
    Name = "terraform-test-0-${count.index}"
  }
}

Expected Behavior

Instances are all created and are listed in the state.

Actual Behavior

Terraform encounters IAM instance profile errors and gives up after retries.

The instance is created on AWS with the correct IAM instance profile, but it is not tracked in the state.

Retrying the apply will result in a duplicate instance. This also breaks destroy if the instance has security groups, because the untracked instance is still attached to them.

Error: Error launching source instance: InvalidParameterValue: Value (tf-test-profile) for parameter iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name
	status code: 400, request id: 589e361f-5b4d-46b7-8b62-c83aa964a8c3

  on main.tf line 36, in resource "aws_instance" "test":
  36: resource "aws_instance" "test" {

Steps to Reproduce

terraform apply -parallelism 100

The reason we are running into this a lot is that we are launching ~100 aws_instance resources at the same time. Because we run Terraform with -parallelism 100, the chance of this happening becomes higher.

It should be noted that this is pretty hard to reproduce, since most of the time the IAM role propagates everywhere within AWS within the 30 second retry period.

Proposed fix

I have seen this worked around in many places with a retry on the Invalid IAM Instance Profile error. However, the retry duration is not consistent across the code:

CreateAutoScalingGroup: 1 minute
AddRoleToInstanceProfile: 30 seconds
RunInstances: 30 seconds <- what we are having issues with
ReplaceIamInstanceProfileAssociation: 1 minute
RequestSpotInstances: 1 minute
CreateLaunchConfiguration: 90 seconds
AssociateIamInstanceProfile: 1 minute

I suggest changing this delay to 1 minute for consistency, or 2 minutes, which is how long CloudFormation waits for this: https://forums.aws.amazon.com/thread.jspa?messageID=593651

Either way, the 30 second retry is not enough. I would like to make a PR to make this consistent across the provider code but I am not sure what retry period you think is best. What are your thoughts on this?

edit: I have changed the retry to 1 minute and still ran into the issue. Then I ran the example configuration with a 2 minute retry and I haven't seen it come back yet, even with 100 instances.

It could also be that this retry timeout is unrelated to the issue and that something else went wrong that causes the instances to go untracked at high parallelism, perhaps the API rate limit, which we do seem to hit a lot according to the logs.
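To make the proposal concrete, here is a minimal Go sketch of a single shared timeout used by a retry loop that only retries on the "Invalid IAM Instance Profile" error. It uses the standard library only and is not the provider's actual code: `propagationTimeout`, `demoInterval`, `retryOnProfileError`, `isProfilePropagationError`, and `flakyLaunch` are hypothetical names, and the launch call is simulated rather than a real RunInstances request.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// propagationTimeout is a hypothetical shared constant. The 2 minute value is
// the longer of the two durations suggested in this issue.
const propagationTimeout = 2 * time.Minute

// demoInterval is an assumed pause between attempts, kept short for the demo.
const demoInterval = 10 * time.Millisecond

// errInvalidProfile mimics the error message quoted in this issue.
var errInvalidProfile = errors.New(
	"InvalidParameterValue: Invalid IAM Instance Profile name")

// isProfilePropagationError reports whether err looks like the
// eventual-consistency error that is worth retrying.
func isProfilePropagationError(err error) bool {
	return err != nil && strings.Contains(err.Error(), "Invalid IAM Instance Profile")
}

// retryOnProfileError calls fn until it succeeds, fails with a different
// error, or the next attempt would start after the timeout has elapsed.
// It returns the number of attempts made and the final error.
func retryOnProfileError(timeout, interval time.Duration, fn func() error) (int, error) {
	deadline := time.Now().Add(timeout)
	attempts := 0
	for {
		attempts++
		err := fn()
		if !isProfilePropagationError(err) {
			return attempts, err // success, or an error we should not retry
		}
		if time.Now().Add(interval).After(deadline) {
			return attempts, fmt.Errorf("gave up after %d attempts: %w", attempts, err)
		}
		time.Sleep(interval)
	}
}

// flakyLaunch simulates a launch call that fails with the profile error
// for the first `failures` calls and then succeeds.
func flakyLaunch(failures int) func() error {
	remaining := failures
	return func() error {
		if remaining > 0 {
			remaining--
			return errInvalidProfile
		}
		return nil
	}
}

func main() {
	attempts, err := retryOnProfileError(propagationTimeout, demoInterval, flakyLaunch(3))
	fmt.Println(attempts, err) // prints "4 <nil>"
}
```

With one constant, every call site that waits on IAM propagation gets the same upper bound, so changing the value later is a one-line edit. Note that this sketch does not address the untracked-instance problem described above; it only illustrates the timeout side.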

References

A few of the past issues that have been closed:

IAM instance profile not created fast enough to modify EC2 instance #838 (AssociateIamInstanceProfile)
Invalid IAM Instance Profile name terraform#15341
IAM Role creation race stil present in spot instance requests #3554 (RequestSpotInstances)


svyotov · 2020-05-07T12:45:50Z

I have observed this too. For me the real issue is that the instance is created on AWS, but it is not tracked in the state. If Terraform times out now and then (probably due to API rate limiting), a re-run will fix it. But having EC2 instances created and not tracked increases the cost and messes up destroys.

…enabling go-mnd linter

Reference: #13199
Reference: #16752
Reference: #16753

IAM eventual consistency handling has long been the source of needing retries in resource logic. Due to the lack of a consistent implementation (e.g. static constant) for how long to retry for these types of errors, there have been varying retry durations. The `iamwaiter.PropagationTimeout` constant was introduced for this purpose.

This change begins by introducing the `go-mnd` linter to enforce the usage of constants in function arguments. Example reports below. The rest of the changes are the minimum required to ensure `iamwaiter.PropagationTimeout` with its 2 minute duration is applied. You will note that this is fixing the duration in some cases to slightly increase it to the standard value. Any higher durations are ignored to reduce changes for now. As such, this can be reviewed by validating that a lower duration was not introduced and skipping acceptance testing since no logic changes should be introduced.

One caveat to `go-mnd` is that it currently ignores `1` as a magic number, which is possible in usage such as `1*time.Minute`, and that ignored number cannot be overridden. An upstream issue will be created to ask the `ignore-number` configuration to overwrite instead of append.

Example previous report:

```
aws/resource_aws_api_gateway_account.go:99:23: mnd: Magic number: 2, in <argument> detected (gomnd)
	err = resource.Retry(2*time.Minute, func() *resource.RetryError {
	                     ^
```

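The review rule in the commit message (raise shorter durations to the shared value, leave longer ones alone) can be illustrated with a small Go sketch. This is an illustration only, not code from the pull request: `PropagationTimeout` is redeclared locally with the 2 minute value stated above, `standardize` and `operation` are hypothetical names, and the durations are the ones listed earlier in this issue.

```go
package main

import (
	"fmt"
	"time"
)

// PropagationTimeout mirrors the 2 minute value the commit message gives for
// iamwaiter.PropagationTimeout. It is redeclared here so the sketch stands alone.
const PropagationTimeout = 2 * time.Minute

// operation pairs an API call with the retry duration reported in the issue.
type operation struct {
	name string
	old  time.Duration
}

// reported holds the durations listed in the issue description.
var reported = []operation{
	{"CreateAutoScalingGroup", 1 * time.Minute},
	{"AddRoleToInstanceProfile", 30 * time.Second},
	{"RunInstances", 30 * time.Second},
	{"ReplaceIamInstanceProfileAssociation", 1 * time.Minute},
	{"RequestSpotInstances", 1 * time.Minute},
	{"CreateLaunchConfiguration", 90 * time.Second},
	{"AssociateIamInstanceProfile", 1 * time.Minute},
}

// standardize applies the rule described in the commit message: durations
// shorter than the shared constant are raised to it, longer ones are kept.
func standardize(old time.Duration) time.Duration {
	if old < PropagationTimeout {
		return PropagationTimeout
	}
	return old
}

func main() {
	for _, op := range reported {
		fmt.Printf("%-38s %v -> %v\n", op.name, op.old, standardize(op.old))
	}
}
```

Every duration listed in the issue is below 2 minutes, so under this rule each of them would end up at the shared value, while any call site that already waited longer would be left unchanged.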

ghost · 2021-04-01T22:57:05Z

This has been released in version 3.35.0 of the Terraform AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template for triage. Thanks!

ghost · 2021-04-25T17:10:53Z

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

ghost added the service/ec2 and service/iam labels May 7, 2020

github-actions bot added the needs-triage label May 7, 2020

bflad added the bug label and removed the needs-triage label Jul 22, 2020

bflad mentioned this issue Dec 29, 2020

Ensure Cross-Service Eventual Consistency Retries for IAM Use iamwaiter.PropagationTimeout Constant #16752

Closed

bflad self-assigned this Feb 25, 2021

bflad mentioned this issue Feb 25, 2021

provider: Migrate to iamwaiter.PropagationTimeout constant and begin enabling go-mnd linter #17811

Merged

bflad closed this as completed in #17811 Mar 26, 2021

github-actions bot added this to the v3.35.0 milestone Mar 26, 2021

ghost locked as resolved and limited conversation to collaborators Apr 25, 2021
