Eventual consistency issue with aws_iam_service_linked_role #7646

Closed
ineffyble opened this issue Feb 22, 2019 · 10 comments · Fixed by #12863
Labels
  • bug: Addresses a defect in current functionality.
  • service/autoscaling: Issues and PRs that pertain to the autoscaling service.
  • service/iam: Issues and PRs that pertain to the iam service.
  • service/kms: Issues and PRs that pertain to the kms service.
Milestone

Comments

@ineffyble

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform Version

Terraform 0.11.11
AWS provider 1.59.0

Affected Resource(s)

  • aws_iam_service_linked_role
  • aws_kms_key
  • aws_autoscaling_group

Expected Behavior

Terraform should create the defined Service Linked Role, and then create the AWS KMS key which has a policy referencing that role, and the AutoScaling Group that uses that role.

Actual Behavior

Terraform creates the role, but often fails with an error at the later steps.

For example:

aws_kms_key.ami_key: MalformedPolicyDocumentException: Policy contains a statement with one or more invalid principals.

or an equivalent error while trying to create the AutoScaling Group, stating the role does not exist.

Running Terraform again works correctly, as the role has already been created.

This appears to be an issue with eventual consistency, similar to ones with IAM Roles which have already been solved. Suspect the solution is adding retries to the kms_key and autoscaling_group if they receive the error.

@bflad bflad added service/iam Issues and PRs that pertain to the iam service. service/autoscaling Issues and PRs that pertain to the autoscaling service. service/kms Issues and PRs that pertain to the kms service. labels Feb 22, 2019
@bflad
Contributor

bflad commented Feb 22, 2019

Can you please provide example configuration(s) so we can write covering acceptance tests? Thanks.

@bflad bflad added the waiting-response Maintainers are waiting on response from community or contributor. label Feb 22, 2019
@ineffyble
Author

Attached is a file that should replicate the problem. Note that it's not occurring 100% of the time, but most of the time, running terraform apply for the first time produces one or both of the following errors:

* aws_autoscaling_group.asg: 1 error(s) occurred:

* aws_autoscaling_group.asg: Error creating AutoScaling Group: ValidationError: ARN specified for Service-Linked Role does not exist.
	status code: 400, request id: a130e5ba-3723-11e9-89d5-d53908f3cb41
* aws_kms_key.kms: 1 error(s) occurred:

* aws_kms_key.kms: MalformedPolicyDocumentException: Policy contains a statement with one or more invalid principals.
	status code: 400, request id: edb01b4f-1fc0-43fe-9b4d-d57c2fbdab48

tf.txt

@ghost ghost removed the waiting-response Maintainers are waiting on response from community or contributor. label Feb 23, 2019
@bflad
Contributor

bflad commented Feb 23, 2019

Hi @ineffyble 👋 Thank you so much for the minimal reproduction configuration. That should work great for reproducing this against both resources and ensuring the fixes cover them.

The maintainers will be heads down with version 2.0.0 development and testing work for the next week or so, but hopefully we can address this shortly thereafter (unless someone from the community picks this up 😄).

@bflad bflad added the bug Addresses a defect in current functionality. label Feb 23, 2019
@awilkins

awilkins commented Mar 9, 2020

Came here looking for the migrated copy of this issue

This ticket seems closest.

This is a royal PITA.

Potential workarounds:

  • Manually insert a null_resource that triggers on changes to the role unique_id attribute and executes a provisioner that does sleep 10 or similar. Depend on this resource in resources that need the role ARN to exist.
    • Obviously this is platform dependent and sucky
  • ??

bflad added a commit that referenced this issue Apr 16, 2020
…creation to that timeout and add KMS Key deletion to internal waiter package

Reference: #7646
Reference: #12840

The Terraform AWS Provider codebase contains many varied timeouts for handling IAM propagation retries. Here we introduce a shared constant for this amount of time. The choice of 2 minutes is based on that amount of time being:

- Most widely used across resources
- Most successful, judging by the lack of historical bug reports across the resources that implement that amount of time
- Ensuring a reasonable user experience (not waiting too long) should there be an actual misconfiguration

As an initial implementation of this IAM propagation timeout and further showing the potential waiter package refactoring, this fixes shorter IAM timeout implementations in the `aws_kms_key` and `aws_kms_external_key` resources, while also refactoring the pending deletion logic. This second change is designed as an inflection point for how we want to handle imports across multiple waiter packages, with the preference of this initial implementation to name the Go import of the outside service, `iamwaiter`, or generically SERVICEwaiter. If agreed, this will be added to the proposal and the refactoring documentation.

NOTE: There is other `StateChangeConf` / `StateRefreshFunc` logic in these KMS resources, but this change is solely focused on highlighting the multiple import situation, and those will be handled later.

Output from acceptance testing:

```
--- PASS: TestAccAWSKmsExternalKey_basic (19.53s)
--- PASS: TestAccAWSKmsExternalKey_DeletionWindowInDays (31.61s)
--- PASS: TestAccAWSKmsExternalKey_Description (32.11s)
--- PASS: TestAccAWSKmsExternalKey_disappears (13.84s)
--- PASS: TestAccAWSKmsExternalKey_Enabled (312.55s)
--- PASS: TestAccAWSKmsExternalKey_KeyMaterialBase64 (104.29s)
--- PASS: TestAccAWSKmsExternalKey_Policy (33.78s)
--- PASS: TestAccAWSKmsExternalKey_Tags (43.70s)
--- PASS: TestAccAWSKmsExternalKey_ValidTo (165.77s)

--- PASS: TestAccAWSKmsKey_asymmetricKey (18.20s)
--- PASS: TestAccAWSKmsKey_basic (21.13s)
--- PASS: TestAccAWSKmsKey_disappears (13.92s)
--- PASS: TestAccAWSKmsKey_isEnabled (236.91s)
--- PASS: TestAccAWSKmsKey_policy (35.34s)
--- PASS: TestAccAWSKmsKey_Policy_IamRole (34.14s)
--- PASS: TestAccAWSKmsKey_Policy_IamServiceLinkedRole (44.80s)
--- PASS: TestAccAWSKmsKey_tags (34.65s)
```
@bflad bflad added this to the v2.60.0 milestone Apr 29, 2020
bflad added a commit that referenced this issue Apr 29, 2020
…creation to that timeout and add KMS Key deletion to internal waiter package (#12863)
@bflad
Contributor

bflad commented Apr 29, 2020

The fix for this resource to wait up to 2 minutes for IAM change propagation (fairly standard across the provider) has been merged and will release with version 2.60.0 of the Terraform AWS Provider, later this week. 👍

@ghost

ghost commented May 1, 2020

This has been released in version 2.60.0 of the Terraform AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template for triage. Thanks!

@dinvlad

dinvlad commented May 5, 2020

Hi folks, sorry for a tangential question, but we're having this issue with regular IAM roles. @ineffyble mentioned this was already solved for those. Could you provide details on how? Thanks!

@awilkins

awilkins commented May 6, 2020

Looks like version 2.60.0 of the provider now unifies the timeout period for IAM waits. The implication is that dependent resources each have code to wait on this ... what might be quite elegant is if the waiting is in the resource that is the source of the attribute itself, so dependent resources can just wait for it to deliver; that would work across the framework for all resources that have a wait involved.

@dinvlad

dinvlad commented May 6, 2020

@awilkins thanks - I see in the release notes it mentions a few dependent resources that support waiting. This is nice, however in our case I tried to use depends_on from unrelated resources (i.e. any non-IAM resources whose management depends on the role being created first). I understand this is probably an edge case, but would be nice to get some resolution to it.

It sounds like a great idea to move the wait into the aws_iam_role_policy etc. resources themselves like you mentioned (at least as an option, like wait_for_propagation = true). This way, any resources that depend on them (either by reference or via depends_on) can trust that the policy has been propagated.
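A rough sketch of that producer-side wait, assuming the creating resource has some way to check visibility: `createRole`, `waitUntilVisible`, the `visible` callback, and the `waitForPropagation` flag are all hypothetical and do not exist in the provider. Whether a single visibility check would be sufficient for every downstream AWS service is not established in this thread.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitUntilVisible (hypothetical helper) polls check until it reports true,
// returns an error, or timeout elapses.
func waitUntilVisible(timeout, interval time.Duration, check func() (bool, error)) error {
	deadline := time.Now().Add(timeout)
	for {
		ok, err := check()
		if err != nil {
			return err
		}
		if ok {
			return nil
		}
		if time.Now().Add(interval).After(deadline) {
			return errors.New("timed out waiting for propagation")
		}
		time.Sleep(interval)
	}
}

// createRole creates the role and, when waitForPropagation is set, blocks
// until visible reports the role as usable, so dependents need no retries.
func createRole(create func() (string, error), visible func(arn string) (bool, error), waitForPropagation bool) (string, error) {
	arn, err := create()
	if err != nil || !waitForPropagation {
		return arn, err
	}
	err = waitUntilVisible(2*time.Second, 10*time.Millisecond, func() (bool, error) {
		return visible(arn)
	})
	return arn, err
}

func main() {
	polls := 0
	arn, err := createRole(
		// Stand-in for the real create call; the ARN is a placeholder.
		func() (string, error) { return "arn:aws:iam::123456789012:role/example", nil },
		// Stand-in for a propagation check that succeeds on the third poll.
		func(string) (bool, error) { polls++; return polls >= 3, nil },
		true,
	)
	fmt.Println(arn, polls, err) // prints: arn:aws:iam::123456789012:role/example 3 <nil>
}
```

With the wait placed here, a resource that depends on the role only through depends_on would start after the role has been reported visible, which is the edge case described above.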

adamdecaf pushed a commit to adamdecaf/terraform-provider-aws that referenced this issue May 28, 2020
…creation to that timeout and add KMS Key deletion to internal waiter package (hashicorp#12863)
@ghost

ghost commented May 30, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@ghost ghost locked and limited conversation to collaborators May 30, 2020
4 participants