-
Notifications
You must be signed in to change notification settings - Fork 9.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retry timeout for IAM instance profile eventual consistency not high enough #13199
Comments
I have observed this too. For me the real issue is that the instance is created on AWS, but it is not tracked in the state. If terraform timeouts now and then (probally due to API rate limiting), a re-run will fix it. But having EC2 instances being created and not being tracked - increased the cost and messes up destroys. |
…enabling go-mnd linter Reference: #13199 Reference: #16752 Reference: #16753 IAM eventual consistency handling has long been the source of needing retries in resource logic. Due to the lack of a consistent implementation (e.g. static constant) for how long to retry for these types of errors, there have been varying retry durations. The `iamwaiter.PropagationTimeout` constant was introduced for this purpose. This change begins by introducing the `go-mnd` linter to enforce the usage of constants in function arguments. Example reports below. The rest of the changes are the minimum required to ensure `iamwaiter.PropagationTimeout` with its 2 minute duration is applied. You will note that this is fixing the duration in some cases to slightly increase it to the standard value. Any higher durations are ignored to reduce changes for now. As such, this can be reviewed by validating that a lower duration was not introduced and skipping acceptance testing since no logic changes should be introduced. One caveat to `go-mnd` is that it currently ignores `1` as a magic number, which is possible in usage such as `1*time.Minute`, and that ignored number cannot be overriden. An upstream issue will be created to ask the `ignore-number` configuration to overwrite instead of append. Example previous report: ``` aws/resource_aws_api_gateway_account.go:99:23: mnd: Magic number: 2, in <argument> detected (gomnd) err = resource.Retry(2*time.Minute, func() *resource.RetryError { ^ ```
…enabling go-mnd linter Reference: #13199 Reference: #16752 Reference: #16753 IAM eventual consistency handling has long been the source of needing retries in resource logic. Due to the lack of a consistent implementation (e.g. static constant) for how long to retry for these types of errors, there have been varying retry durations. The `iamwaiter.PropagationTimeout` constant was introduced for this purpose. This change begins by introducing the `go-mnd` linter to enforce the usage of constants in function arguments. Example reports below. The rest of the changes are the minimum required to ensure `iamwaiter.PropagationTimeout` with its 2 minute duration is applied. You will note that this is fixing the duration in some cases to slightly increase it to the standard value. Any higher durations are ignored to reduce changes for now. As such, this can be reviewed by validating that a lower duration was not introduced and skipping acceptance testing since no logic changes should be introduced. One caveat to `go-mnd` is that it currently ignores `1` as a magic number, which is possible in usage such as `1*time.Minute`, and that ignored number cannot be overriden. An upstream issue will be created to ask the `ignore-number` configuration to overwrite instead of append. Example previous report: ``` aws/resource_aws_api_gateway_account.go:99:23: mnd: Magic number: 2, in <argument> detected (gomnd) err = resource.Retry(2*time.Minute, func() *resource.RetryError { ^ ```
…enabling go-mnd linter (#17811) Reference: #13199 Reference: #16752 Reference: #16753 IAM eventual consistency handling has long been the source of needing retries in resource logic. Due to the lack of a consistent implementation (e.g. static constant) for how long to retry for these types of errors, there have been varying retry durations. The `iamwaiter.PropagationTimeout` constant was introduced for this purpose. This change begins by introducing the `go-mnd` linter to enforce the usage of constants in function arguments. Example reports below. The rest of the changes are the minimum required to ensure `iamwaiter.PropagationTimeout` with its 2 minute duration is applied. You will note that this is fixing the duration in some cases to slightly increase it to the standard value. Any higher durations are ignored to reduce changes for now. As such, this can be reviewed by validating that a lower duration was not introduced and skipping acceptance testing since no logic changes should be introduced. One caveat to `go-mnd` is that it currently ignores `1` as a magic number, which is possible in usage such as `1*time.Minute`, and that ignored number cannot be overriden. An upstream issue will be created to ask the `ignore-number` configuration to overwrite instead of append. Example previous report: ``` aws/resource_aws_api_gateway_account.go:99:23: mnd: Magic number: 2, in <argument> detected (gomnd) err = resource.Retry(2*time.Minute, func() *resource.RetryError { ^ ```
This has been released in version 3.35.0 of the Terraform AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading. For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template for triage. Thanks! |
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks! |
Community Note
Terraform Version
Affected Resource(s)
aws_instance
aws_iam_instance_profile
Terraform Configuration Files
Expected Behavior
Instances are all created and are listed in the state.
Actual Behavior
Terraform encounters IAM instance profile errors and gives up after retries.
The instance is created on AWS with the correct IAM, but it is not tracked in the state.
Retrying the apply will result in a duplicate instance. This also breaks destroy if the instance has security groups because the untracked instance will still be attached.
Steps to Reproduce
The reason we are running into this a lot is that we are launching ~100
aws_instance
at the same time. Because we run Terraform with-parallelism 100
the chance of this happening becomes higher.It should be noted that this is pretty hard to reproduce since for most of the time the IAM roles propagate everywhere within AWS within the 30 seconds retry period.
Proposed fix
I have seen this has been worked around with a retry on the
Invalid IAM Instance Profile
error in many places. However this retry is not consistent across the code:CreateAutoScalingGroup
: 1 minuteAddRoleToInstanceProfile
: 30 secondsRunInstances
: 30 seconds <- what we are having issues withReplaceIamInstanceProfileAssociation
: 1 minuteRequestSpotInstances
: 1 minuteCreateLaunchConfiguration
: 90 secondsAssociateIamInstanceProfile
1 minuteI suggest changing this delay to 1 minute for consistency, or 2 minutes, which is how long CloudFormation waits for this: https://forums.aws.amazon.com/thread.jspa?messageID=593651
Either way, the 30 second retry is not enough. I would like to make a PR to make this consistent across the provider code but I am not sure what retry period you think is best. What are your thoughts on this?
edit: I have changed the retry to 1 minute and still ran into the issue. Then I ran the example configuration with 2 minute retry and I haven't seen it come back yet, even with 100 instances.
It could also be that this retry timeout is unrelated to the issue and that something else went wrong that causes the instances go untracked on high parallelism, perhaps the API rate limit which we do seem to hit a lot according to the logs.
References
A few of the past issues that have been closed:
AssociateIamInstanceProfile
)RequestSpotInstances
)The text was updated successfully, but these errors were encountered: