error waiting for EC2 Internet Gateway (igw-xxx) to detach from VPC (vpc-xxx): unexpected state 'available', wanted target ''. last error: %!s(<nil>) #21792

Closed
mgusiew-guide opened this issue Nov 16, 2021 · 5 comments · Fixed by #21794
Labels
bug Addresses a defect in current functionality. regression Pertains to a degraded workflow resulting from an upstream patch or internal enhancement. service/ec2 Issues and PRs that pertain to the ec2 service.
Milestone

Comments

@mgusiew-guide
Contributor

mgusiew-guide commented Nov 16, 2021

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform CLI and Terraform AWS Provider Version

terraform version - 0.13.7
aws provider version - 3.65.0

Affected Resource(s)

  • aws_internet_gateway

Terraform Configuration Files

Please include all Terraform configurations required to reproduce the bug. Bug reports without a functional reproduction may be closed without investigation.

Unfortunately I can only include the relevant fragment of the bigger config.

module vpc {
  source  = "terraform-aws-modules/vpc/aws"
  version = "2.78.0"

  name = var.name

  cidr = var.cidr

  azs             = var.azs
  private_subnets = var.private_subnets
  public_subnets  = var.public_subnets

  enable_dns_hostnames = true
  enable_dns_support   = true

  enable_nat_gateway = var.enable_nat_gateway
  single_nat_gateway = true

  enable_ssm_endpoint              = local.enable_aws_endpoints
  ssm_endpoint_private_dns_enabled = true
  ssm_endpoint_security_group_ids  = aws_security_group.ssm_agent.*.id

  enable_ssmmessages_endpoint              = local.enable_aws_endpoints
  ssmmessages_endpoint_private_dns_enabled = true
  ssmmessages_endpoint_security_group_ids  = aws_security_group.ssm_agent.*.id

  enable_ec2messages_endpoint              = local.enable_aws_endpoints
  ec2messages_endpoint_private_dns_enabled = true
  ec2messages_endpoint_security_group_ids  = aws_security_group.ssm_agent.*.id

  enable_s3_endpoint = local.enable_aws_endpoints

  # rules cleared to deny all
  manage_default_security_group  = true
  default_security_group_ingress = []
  default_security_group_egress  = []

  enable_flow_log                      = var.enable_flow_log
  create_flow_log_cloudwatch_log_group = true
  create_flow_log_cloudwatch_iam_role  = true
  flow_log_max_aggregation_interval    = var.flow_log_max_aggregation_interval_seconds

  tags = var.tags

  vpc_flow_log_tags = {
    "Name" : "${var.name}-flow-log"
  }
}

Debug Output

--- FAIL: TestProduceConsume/TestCases/single-vpc (1599.41s)
            destroy.go:11: 
                	Error Trace:	destroy.go:11
                	            				produce_consume_test.go:79
                	            				produce_consume_test.go:26
                	Error:      	Received unexpected error:
                	            	FatalError{Underlying: error while running command: exit status 1; 
                	            	Error: error waiting for EC2 Internet Gateway (igw-xxx) to detach from VPC (vpc-xxx): unexpected state 'available', wanted target ''. last error: %!s(<nil>)
                	            	
                	            	}
                	Test:       	TestProduceConsume/TestCases/single-vpc

Panic Output

NA

Expected Behavior

The IGW should be successfully deleted.

Actual Behavior

After upgrading the AWS provider from 3.62.0 to 3.65.0, we started to notice occasional errors in IGW removal during terraform destroy. The error stops the destroy process and leaves some orphaned resources, e.g. the VPC. This now occurs once or twice per run of our test suite, which consists of approximately 100 test cases.

I briefly analysed the source code and build logs and noticed there were some changes in the IGW resource code. There are 2 things that I found interesting:

  1. It looks like Terraform stops retrying after approximately 40s; based on the provider code, I would expect the provider to wait a few minutes. This is the last log line I found:
Still destroying... [id=igw-xxx, 40s elapsed]
  2. After the test I ran the AWS CLI to check the IGW status: aws ec2 describe-internet-gateways --internet-gateway-ids "igw-xxx" and received the following output:
{
    "InternetGateways": [
        {
            "Attachments": [],
            "InternetGatewayId": "igw-xxx",
            "OwnerId": "xxx",
            "Tags": [
                {
                    "Key": "Name",
                    "Value": "xxx"
                }
            ]
        }
    ]
}
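
The manual check above could also be scripted after a failed destroy. The following is only a minimal Go sketch, assuming the JSON printed by aws ec2 describe-internet-gateways has been captured into a byte slice; the names describeOutput and isDetached are hypothetical and not part of the provider or the AWS SDK.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// describeOutput models only the fields of the CLI output that matter here.
type describeOutput struct {
	InternetGateways []struct {
		InternetGatewayId string
		Attachments       []struct {
			State string
			VpcId string
		}
	}
}

// isDetached (hypothetical helper) reports whether the gateway with the
// given ID appears in the output with an empty Attachments list.
func isDetached(raw []byte, id string) (bool, error) {
	var out describeOutput
	if err := json.Unmarshal(raw, &out); err != nil {
		return false, err
	}
	for _, gw := range out.InternetGateways {
		if gw.InternetGatewayId == id {
			return len(gw.Attachments) == 0, nil
		}
	}
	return false, fmt.Errorf("internet gateway %s not in output", id)
}

func main() {
	// Same shape as the output shown above, trimmed to the relevant fields.
	raw := []byte(`{"InternetGateways":[{"Attachments":[],"InternetGatewayId":"igw-xxx","OwnerId":"xxx"}]}`)
	detached, err := isDetached(raw, "igw-xxx")
	fmt.Println(detached, err)
}
```

An empty Attachments list here means the gateway was in fact detached by the time the check ran, even though the destroy had already failed.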

In the resource code I see:

if len(internetGateway.Attachments) == 0 || internetGateway.Attachments[0] == nil {
	return nil, tfresource.NewEmptyResultError(internetGatewayID)
}

but there is no handling for this error:

return func() (interface{}, string, error) {
	output, err := FindInternetGatewayAttachment(conn, internetGatewayID, vpcID)

	if tfresource.NotFound(err) {
		return nil, "", nil
	}

	if err != nil {
		return nil, "", err
	}

	return output, aws.StringValue(output.State), nil
}

As I understand it, the expectation is that the IGW is either not found or has an attachment with an empty State:

func WaitInternetGatewayDetached(conn *ec2.EC2, internetGatewayID, vpcID string, timeout time.Duration) (*ec2.InternetGatewayAttachment, error) {
	stateConf := &resource.StateChangeConf{
		Pending: []string{ec2.AttachmentStatusDetaching},
		Target:  []string{},
		Timeout: timeout,
		Refresh: StatusInternetGatewayAttachmentState(conn, internetGatewayID, vpcID),
	}

	outputRaw, err := stateConf.WaitForState()

	if output, ok := outputRaw.(*ec2.InternetGatewayAttachment); ok {
		return output, err
	}

	return nil, err
}

Note that I ran the CLI while the provider uses the Go client, so the outputs may differ.

Steps to Reproduce

Unfortunately the issue is non-deterministic and happens only from time to time. Since the upgrade I get it once or twice per test suite run; before that it occurred only sporadically (I saw it only a few times over the past few months).

Important Factoids

I would expect a longer timeout. It also seems that AWS may return different status values across multiple runs of the same scenario.

@github-actions github-actions bot added needs-triage Waiting for first response or review from a maintainer. service/ec2 Issues and PRs that pertain to the ec2 service. labels Nov 16, 2021
@ewbankkit ewbankkit added bug Addresses a defect in current functionality. and removed needs-triage Waiting for first response or review from a maintainer. labels Nov 16, 2021
@ewbankkit
Contributor

@mgusiew-guide Thanks for raising this issue 👏.
When the logic for waiting for gateway detachment was changed, the available state was not included as a Pending value (hence the reported error). This must only happen under certain circumstances which weren't present during our testing.
I have submitted a fix to address this.
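
To illustrate why a state missing from Pending produces the reported error, here is a self-contained Go sketch. It is not the provider code or the Terraform SDK: waitForState is a simplified, hypothetical stand-in for StateChangeConf.WaitForState (no sleeping, no real timeout), and scripted is a hypothetical helper that replays a fixed sequence of attachment states. It also assumes that "considering available as a Pending value" means adding it to the Pending list.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// refreshFunc has the same shape as the refresh callback quoted in the issue:
// observed object, its state, and an error.
type refreshFunc func() (interface{}, string, error)

func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// waitForState is a simplified stand-in for a state waiter. It polls up to
// maxPolls times instead of using a timeout.
func waitForState(pending, target []string, refresh refreshFunc, maxPolls int) (interface{}, error) {
	for i := 0; i < maxPolls; i++ {
		obj, state, err := refresh()
		if err != nil {
			return nil, err
		}
		// A nil object with an empty target list means "gone": success.
		if obj == nil && len(target) == 0 {
			return nil, nil
		}
		if contains(target, state) {
			return obj, nil
		}
		// Any state that is neither target nor pending aborts the wait.
		if !contains(pending, state) {
			return nil, fmt.Errorf("unexpected state '%s', wanted target '%s'", state, strings.Join(target, ", "))
		}
	}
	return nil, errors.New("timeout while waiting for state")
}

// scripted replays the given states in order; "" means the attachment is no
// longer found, so the refresh returns a nil object.
func scripted(states ...string) refreshFunc {
	i := 0
	return func() (interface{}, string, error) {
		s := states[len(states)-1]
		if i < len(states) {
			s = states[i]
			i++
		}
		if s == "" {
			return nil, "", nil
		}
		return s, s, nil
	}
}

func main() {
	seq := []string{"available", "detaching", ""}

	// Only "detaching" is pending: the first observed "available" is fatal.
	_, err := waitForState([]string{"detaching"}, nil, scripted(seq...), 10)
	fmt.Println("pending=[detaching]:", err)

	// With "available" also pending, the waiter keeps polling until detached.
	_, err = waitForState([]string{"available", "detaching"}, nil, scripted(seq...), 10)
	fmt.Println("pending=[available detaching]:", err)
}
```

In this sketch the first call fails as soon as it observes available, which matches the shape of the reported message, while the second call polls through available and detaching until the attachment is gone.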

The

if len(internetGateway.Attachments) == 0 || internetGateway.Attachments[0] == nil {
	return nil, tfresource.NewEmptyResultError(internetGatewayID)
}

logic is expected, because the returned error is a NotFoundError, and so

if tfresource.NotFound(err) {
	return nil, "", nil
}

is true, and the empty target (Target: []string{}) is reached.
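
The chain described here (empty attachment list, then a not-found error, then an empty state) can be sketched in isolation. This is a minimal Go example with hypothetical stand-ins: notFoundError and notFound play the roles of the provider's empty-result error and tfresource.NotFound, and gateway and attachment are simplified versions of the EC2 types. None of these names come from the provider itself.

```go
package main

import (
	"errors"
	"fmt"
)

// notFoundError is a hypothetical stand-in for the provider's not-found error.
type notFoundError struct{ id string }

func (e *notFoundError) Error() string { return "empty result for " + e.id }

// notFound reports whether err is (or wraps) a notFoundError.
func notFound(err error) bool {
	var nf *notFoundError
	return errors.As(err, &nf)
}

// Simplified stand-ins for the EC2 gateway and attachment types.
type attachment struct {
	VpcID string
	State string
}

type gateway struct {
	ID          string
	Attachments []*attachment
}

// findAttachment returns a not-found error when there is no attachment,
// mirroring the length check quoted above.
func findAttachment(gw *gateway) (*attachment, error) {
	if len(gw.Attachments) == 0 || gw.Attachments[0] == nil {
		return nil, &notFoundError{id: gw.ID}
	}
	return gw.Attachments[0], nil
}

// attachmentState translates not-found into (nil, "", nil), which a waiter
// with an empty target list treats as success.
func attachmentState(gw *gateway) (interface{}, string, error) {
	output, err := findAttachment(gw)
	if notFound(err) {
		return nil, "", nil
	}
	if err != nil {
		return nil, "", err
	}
	return output, output.State, nil
}

func main() {
	detached := &gateway{ID: "igw-xxx"}
	obj, state, err := attachmentState(detached)
	fmt.Printf("detached: obj=%v state=%q err=%v\n", obj, state, err)

	attached := &gateway{ID: "igw-xxx", Attachments: []*attachment{{VpcID: "vpc-xxx", State: "available"}}}
	_, state, err = attachmentState(attached)
	fmt.Printf("attached: state=%q err=%v\n", state, err)
}
```

The detached case never surfaces an error to the waiter, which is why the empty Attachments list seen in the CLI output is handled rather than ignored.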

@ewbankkit ewbankkit added the regression Pertains to a degraded workflow resulting from an upstream patch or internal enhancement. label Nov 16, 2021
@mgusiew-guide
Contributor Author

Thanks for taking a look @ewbankkit ! I will retest once the fix is released.

@github-actions github-actions bot added this to the v3.66.0 milestone Nov 16, 2021
@github-actions

This functionality has been released in v3.66.0 of the Terraform AWS Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!

@mgusiew-guide
Contributor Author

FTR I tested it in 3.66.0 and confirm that the problem went away

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 27, 2022