google_service_networking_connection destroy calls appear to always fail in 5.x despite guidance #16275

Closed
gygitlab opened this issue Oct 17, 2023 · 24 comments · Fixed by GoogleCloudPlatform/magic-modules#9765, hashicorp/terraform-provider-google-beta#6830 or #16944

Comments

@gygitlab

gygitlab commented Oct 17, 2023

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

Terraform 1.5.5

Affected Resource(s)

  • google_service_networking_connection

Terraform Configuration Files

locals {
  create_test_network = true

  test_vpc_name = local.create_test_network ? google_compute_network.test_vpc[0].name : "default"
}

resource "google_compute_network" "test_vpc" {
  count = local.create_test_network ? 1 : 0

  name                    = "test-vpc"
  auto_create_subnetworks = false
}

resource "google_compute_global_address" "test_private_service_ip_range" {
  count = local.create_test_network ? 1 : 0

  name          = "test-private-service-ip-range"
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  prefix_length = 16
  network       = google_compute_network.test_vpc[0].id 
}

resource "google_service_networking_connection" "test_private_service_access" {
  count = local.create_test_network ? 1 : 0

  network                 = google_compute_network.test_vpc[0].id 
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.test_private_service_ip_range[0].name]
}

resource "google_sql_database_instance" "test_instance" {
  count = local.create_test_network ? 1 : 0

  name             = "test-db"
  database_version = "POSTGRES_13"

  depends_on = [google_service_networking_connection.test_private_service_access]

  settings {
    tier = "db-f1-micro"
    ip_configuration {
      ipv4_enabled                                  = false
      private_network                               = google_compute_network.test_vpc[0].id
      enable_private_path_for_google_cloud_services = true
    }
  }

  deletion_protection = false
}

Expected Behavior

Terraform should be able to destroy its resources gracefully and complete successfully with the 5.x provider.

Actual Behavior

With the stated change from the removePeering method to the deleteConnection method in the 5.x provider, a regression has arguably occurred.

The 4.x version of the provider was able to cleanly destroy the service networking connection and in turn the VPC it was attached to (if created) and ended successfully.

The switch in 5.x now prevents this with the following error:

│ Error: Unable to remove Service Networking Connection, err: Error waiting for Delete Service Networking Connection: Error code 9, message: Failed to delete connection; Producer services (e.g. CloudSQL, Cloud Memstore, etc.) are still using this connection.

While it's stated that this may be expected because background deletion of dependent resources such as Cloud SQL is still proceeding, in practice that doesn't seem to be the case. I created a connection a week ago and still could not delete it today (all other resources had been deleted), but as soon as I switched over to 4.x I was able to remove all resources correctly. Testing further with 5.x, I was also able to remove the connection manually via the UI while Terraform was still reporting the same error. This suggests a bug is actually present and Terraform can no longer delete the connection at all.

This hard error causes Terraform to fail and leaves it with a partial state, which also makes for a worse UX. Terraform was not designed with resources that cannot be destroyed in mind.
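
Because the destroy completed once the provider was switched back to 4.x, one stopgap is to pin the google provider to the 4.x line until the 5.x behaviour is resolved. This is only a sketch: the exact constraint is an assumption (4.84.0 is simply the 4.x version that appears in the debug logs discussed later in this thread).

```hcl
terraform {
  required_providers {
    google = {
      source = "hashicorp/google"
      # Assumed constraint: stay on 4.x, at or above the 4.84.0 seen in the logs.
      version = ">= 4.84.0, < 5.0.0"
    }
  }
}
```

After changing the constraint, `terraform init -upgrade` is needed so that the dependency lock file picks up the newly selected version.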

Steps to Reproduce

Attempt to destroy a google_service_networking_connection resource with the 5.x provider and note that it is never successful.

Head to the UI instead and delete the connection under the VPC and notice it deletes successfully despite the Terraform error.

b/308248337

@gygitlab gygitlab added the bug label Oct 17, 2023
@gygitlab gygitlab changed the title google_service_networking_connection destroy calls appear to always in 5.x despite guidance google_service_networking_connection destroy calls appear to always fail in 5.x despite guidance Oct 17, 2023
@edwardmedia edwardmedia self-assigned this Oct 17, 2023
@edwardmedia edwardmedia added this to the Post-5.0.0 milestone Oct 17, 2023
@edwardmedia
Contributor

edwardmedia commented Oct 17, 2023

@gygitlab can you share your debug log?

@gygitlab
Author

> @gygitlab did you follow the v5.0.0 Upgrade Guide to upgrade the provider before you ran destroy?

I did, but I also reproduced the behaviour cleanly in an environment built with 5.x.

@edwardmedia
Contributor

@gygitlab it looks like the issue involves several resources. Can you share a minimal config that I can use to repro?

@gygitlab
Author

gygitlab commented Oct 17, 2023

> @gygitlab it looks like the issue involves several resources. Can you share a minimal config that I can use to repro?

Yeah so our code is available here actually if that helps.

We simply create the services connection to be available for a Cloud SQL instance for private connections only.

@edwardmedia
Contributor

@gygitlab can you share your debug logs for apply & destroy in 4.x and 5.x?

@gygitlab
Author

I've not had time to get everything but I have got a test config for you:

locals {
  create_test_network = true

  test_vpc_name = local.create_test_network ? google_compute_network.test_vpc[0].name : "default"
}

resource "google_compute_network" "test_vpc" {
  count = local.create_test_network ? 1 : 0

  name                    = "test-vpc"
  auto_create_subnetworks = false
}

resource "google_compute_global_address" "test_private_service_ip_range" {
  count = local.create_test_network ? 1 : 0

  name          = "test-private-service-ip-range"
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  prefix_length = 16
  network       = google_compute_network.test_vpc[0].id 
}

resource "google_service_networking_connection" "test_private_service_access" {
  count = local.create_test_network ? 1 : 0

  network                 = google_compute_network.test_vpc[0].id 
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.test_private_service_ip_range[0].name]
}

resource "google_sql_database_instance" "test_instance" {
  count = local.create_test_network ? 1 : 0

  name             = "test-db"
  database_version = "POSTGRES_13"

  depends_on = [google_service_networking_connection.test_private_service_access]

  settings {
    tier = "db-f1-micro"
    ip_configuration {
      ipv4_enabled                                  = false
      private_network                               = google_compute_network.test_vpc[0].id
      enable_private_path_for_google_cloud_services = true
    }
  }

  deletion_protection = false
}

The key here is that the updated call in 5.x triggers what appears to be a failsafe when Cloud SQL is present, but the condition doesn't appear to ever clear on GCP's end (I kept trying for over 4 days). I suspect it never actually clears for some reason, as the destroy always fails with this error in Terraform with 5.x, yet I can go to the UI and delete the service connection peering without any problem.

I would consider going back to the 4.x method in the provider for now (or offering it as an option) while figuring out what's going on at GCP's end, as having Terraform fail like this is not great.

Debug output for both versions - https://gist.github.com/gygitlab/420a4ab2f66307cfde793b879dde0484

@edwardmedia
Contributor

edwardmedia commented Oct 18, 2023

@gygitlab thanks for the info. But I noticed that in the v4.84.0 log there are 4 resources to be deleted, while in the v5.2.0 log there are only 3, and the google_sql_database_instance is not in the destroy plan. I guess this is what you observed? If you test repeatedly, you may have to rename the SQL instance each time, as the name is reserved for a week. Does that impact your testing?

The resource below is not in the plan. To fix the problem, I guess you may want to delete the connection via other means for now.

Failed to delete connection; Producer services (e.g. CloudSQL, Cloud Memstore, etc.) are still using this connection
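
If the Cloud SQL instance name is indeed reserved for about a week after deletion, repeated apply/destroy test cycles need a fresh name each time. One common way to get that is a random suffix from the hashicorp/random provider. This is only a sketch: the `random_id` resource, its name `db_name_suffix`, and the 4-byte length are illustrative choices, not part of the reporter's configuration.

```hcl
# Hypothetical helper: produces a new suffix whenever it is recreated.
resource "random_id" "db_name_suffix" {
  byte_length = 4
}

resource "google_sql_database_instance" "test_instance" {
  # The 8-character hex suffix avoids reusing a recently deleted instance name.
  name             = "test-db-${random_id.db_name_suffix.hex}"
  database_version = "POSTGRES_13"

  depends_on = [google_service_networking_connection.test_private_service_access]

  settings {
    tier = "db-f1-micro"
    ip_configuration {
      ipv4_enabled    = false
      private_network = google_compute_network.test_vpc[0].id
    }
  }

  deletion_protection = false
}
```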

@gygitlab
Author

> @gygitlab thanks for the info. But I noticed that in the v4.84.0 log there are 4 resources to be deleted, while in the v5.2.0 log there are only 3, and the google_sql_database_instance is not in the destroy plan. I guess this is what you observed? If you test repeatedly, you may have to rename the SQL instance each time, as the name is reserved for a week. Does that impact your testing?
>
> The resource below is not in the plan. To fix the problem, I guess you may want to delete the connection via other means for now.
>
> Failed to delete connection; Producer services (e.g. CloudSQL, Cloud Memstore, etc.) are still using this connection

The output was large so I tried to cut them down, let me double check.

> I guess this is what you observed? If you test repeatedly, you may have to rename the SQL instance each time, as the name is reserved for a week. Does that impact your testing?

This would be a notable impact for us and a regression from the behaviour in 4.x, yeah.

@gygitlab
Author

Ok the gist has been updated now with full destroy output for both versions thanks.

@edwardmedia edwardmedia assigned roaks3 and unassigned edwardmedia Oct 19, 2023
@edwardmedia
Contributor

I can repro the issue where it happens only on destroy with v5.x.

Comparing the processes between v4.x and v5.x, removePeering is no longer called in v5.x, and that change seems to be where the destroy fails.

Below is the note about changes in google_service_networking_connection for v5 upgrade
https://registry.terraform.io/providers/hashicorp/google/latest/docs/guides/version_5_upgrade#terraform-destroy-now-fully-deletes-the-resource-instead-of-abandoning

@roaks3 what do you think?

@melinath melinath added enhancement feature-request and removed forward/review In review; remove label to forward labels Oct 27, 2023
@melinath melinath removed their assignment Oct 27, 2023
@alnoki

alnoki commented Nov 1, 2023

Noting that I've experienced similar issues, and the blocker seems to be deleting "VPC peering" from the GCP console

See the first tip in this section for the procedure we've had to follow as a workaround here: https://econia.dev/off-chain/dss/terraform#take-down-infrastructure

@MarijnMB

MarijnMB commented Nov 13, 2023

I am also experiencing this issue. I cannot destroy the service networking connection; after removing it manually in the GCP console, the rest destroys just fine.

My current workaround is an extra CI job that runs before the destroy step to remove the VPC peering using gcloud cli.

@alnoki

alnoki commented Nov 13, 2023

> I am also experiencing this issue. I cannot destroy the service networking connection; after removing it manually in the GCP console, the rest destroys just fine.
>
> My current workaround is an extra CI job that runs before the destroy step to remove the VPC peering using gcloud cli.

@MarijnMB something like this?

resource "google_service_networking_connection" "sql_network_connection" {
  network                 = google_compute_network.sql_network.id
  provider                = google-beta
  reserved_peering_ranges = [google_compute_global_address.private_ip_address.name]
  service                 = "servicenetworking.googleapis.com"
  provisioner "local-exec" {
    when = destroy
    # Manually destroy VPC peering.
    # This is because the dependency solver doesn't properly destroy.
    # https://github.com/hashicorp/terraform-provider-google/issues/16275
    command = join(" ", [
      "gcloud compute networks peerings delete",
      "servicenetworking-googleapis-com",
      "--network sql-network",
      "--quiet"
    ])
  }
}

I'm trying this out and can delete via terraform destroy, but I have to run the command twice: it errors out the first time, then succeeds when I run it a second time.
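
A variation on the snippet above that avoids hard-coding the peering and network names. Destroy-time provisioners may only reference `self`, so this sketch reads the resource's exported `peering` attribute and derives the network name from `self.network` with `basename`. It assumes that `network` was set to an ID or self link ending in the network name and that gcloud is already authenticated against the right project. It does not address the need to run destroy twice.

```hcl
resource "google_service_networking_connection" "sql_network_connection" {
  network                 = google_compute_network.sql_network.id
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.private_ip_address.name]

  provisioner "local-exec" {
    when = destroy
    # Remove the VPC peering before the provider attempts its own delete.
    # Only `self` may be referenced in a destroy-time provisioner.
    command = join(" ", [
      "gcloud compute networks peerings delete",
      self.peering,
      "--network", basename(self.network),
      "--quiet",
    ])
  }
}
```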

@MarijnMB

@alnoki more like (.gitlab-ci.yml):

destroy-first:
  extends: .google-base
  stage: cleanup
  when: manual
  script:
    - gcloud compute networks peerings delete servicenetworking-googleapis-com --network $CI_COMMIT_REF_SLUG-default
  rules:
    - if: $CI_COMMIT_BRANCH =~ "/^mr-meeseeks\/.*/"

terraform:destroy:
  extends: .terraform-base
  stage: cleanup
  needs:
    - destroy-first
  script:
    - gitlab-terraform destroy
  rules:
    - if: $CI_COMMIT_BRANCH =~ "/^mr-meeseeks\/.*/"

@gygitlab
Author

gygitlab commented Nov 15, 2023

> In that sense - this isn't really a bug as far as I can tell. The resource can be deleted - we do so successfully in our nightly tests. But it sounds like there's a use case for abandoning in some cases.

I'm not sure I'm following this logic sorry. In our case Terraform simply fails to destroy every time and leaves itself with a problematic partial state, which didn't happen in 4.x. Others look to be affected as well. How is this not considered a regressive issue? We're blocked on upgrading to 5.x until this is fixed.
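
For a state that is already stuck like this, one option on newer Terraform releases is to drop the connection from state without calling the provider's delete at all. The sketch below assumes Terraform 1.7 or later (newer than the 1.5.5 reported in this issue), which introduced `removed` blocks, and it assumes the matching `resource` block has been deleted from the configuration. The peering itself is left in place and still has to be cleaned up by other means.

```hcl
# Replaces the deleted resource "google_service_networking_connection" block.
removed {
  from = google_service_networking_connection.test_private_service_access

  lifecycle {
    # Forget the object in state instead of asking the provider to delete it.
    destroy = false
  }
}
```

On older releases such as 1.5.5, `terraform state rm` on the resource address has a similar effect.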

@mike-callahan

> documented in the API

I was able to reproduce this issue. removePeering is not documented because the last revision switched the underlying implementation from compute api library to the service networking library. You can see remove peering is still available from the compute api:
https://cloud.google.com/compute/docs/reference/rest/v1/networks/removePeering

I guess for some reason it is not implemented in the service networking library. So we can either abandon terraform state or add back removePeering functionality that uses the compute api.
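
The "abandon" option could look like the sketch below. It assumes a provider release in which `google_service_networking_connection` accepts a `deletion_policy` argument. Nothing in this thread confirms which release carries it, so check the resource documentation for the version in use.

```hcl
resource "google_service_networking_connection" "test_private_service_access" {
  network                 = google_compute_network.test_vpc[0].id
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.test_private_service_ip_range[0].name]

  # Assumption: this argument is supported by the installed provider version.
  # With ABANDON, destroy drops the resource from state without trying to
  # delete the connection, so the peering may be left behind in the project.
  deletion_policy = "ABANDON"
}
```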

@poj89

poj89 commented Nov 21, 2023

Is there an update to this? We have teams using the latest version but are unable to destroy cleanly as in version 4.x.

@Agotfrid

Agotfrid commented Nov 22, 2023

I am also having this issue. Here is my setup

resource "google_compute_global_address" "cloudsql_staging_private_ip_range" {
  name          = "cloudsql-staging-private-ip-range"
  project       = var.project_id
  purpose       = var.address_purpose
  address_type  = var.address_type
  prefix_length = var.prefix_length
  network       = google_compute_network.network.id
}

resource "google_compute_global_address" "redis_staging_private_ip_range_1" {
  name          = "redis-staging-private-ip-range-1"
  project       = var.project_id
  purpose       = var.address_purpose
  address_type  = var.address_type
  prefix_length = 29
  network       = google_compute_network.network.id
}

resource "google_compute_global_address" "redis_staging_private_ip_range_2" {
  name          = "redis-staging-private-ip-range-2"
  project       = var.project_id
  purpose       = var.address_purpose
  address_type  = var.address_type
  prefix_length = 29
  network       = google_compute_network.network.id
}

resource "google_service_networking_connection" "vpc_connection" {
  network = google_compute_network.network.id
  service = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [
    google_compute_global_address.cloudsql_staging_private_ip_range.name,
    google_compute_global_address.redis_staging_private_ip_range_1.name,
    google_compute_global_address.redis_staging_private_ip_range_2.name,
  ]
}

within the network.
I am using terraform remote state to attach my cloudsql with private connection by using allocated_ip_range and private_network and redis with authorized_network.
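
A rough sketch of that kind of wiring, for readers unfamiliar with it. The GCS backend, the bucket and prefix, the instance name, and the output names (`network_id`, `cloudsql_range_name`) are all hypothetical. The state that owns the network would need to export those values as outputs.

```hcl
data "terraform_remote_state" "network" {
  backend = "gcs"
  config = {
    bucket = "example-tf-state" # hypothetical bucket
    prefix = "staging/network"  # hypothetical prefix
  }
}

resource "google_sql_database_instance" "staging" {
  name             = "cloudsql-staging" # hypothetical name
  database_version = "POSTGRES_13"

  settings {
    tier = "db-f1-micro"
    ip_configuration {
      ipv4_enabled = false
      # Both values come from outputs of the separate network state.
      private_network    = data.terraform_remote_state.network.outputs.network_id
      allocated_ip_range = data.terraform_remote_state.network.outputs.cloudsql_range_name
    }
  }
}
```

Because the connection lives in a different state from the services that use it, Terraform cannot order their destruction automatically. The consuming states have to be destroyed first.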

The issue happens when I terraform destroy my cloudsql and redis, which live in separate folders with separate terraform state: the deletion of google_service_networking_connection fails with

Unable to remove Service Networking Connection, err: Error waiting for Delete Service Networking Connection: Error code 9, message: Failed to delete connection; Producer services (e.g. CloudSQL, Cloud Memstore, etc.) are still using this connection.

as it still thinks there are services using it.

I manually deleted the VPC Network Peering from the network's VPC NETWORK PEERING tab, and that allowed the destruction to complete. Investigating more, it seemed like some NEGs attached to one of the cluster's nodes were still around even though the cluster and its node pools had been deleted with terraform...

@q-leobrack

Also hitting this since upgrading to version 5 of the provider. Any news on a fix?

@q-markglozier

A workaround is to keep the google-beta provider on version 4 and set provider = google-beta on the google_service_networking_connection resource, as below:

terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~>5"
    }
    google-beta = {
      source  = "hashicorp/google-beta"
      version = "~>4"
    }
  }
}

provider "google" {
  project = <PROJECT_ID>
  region  = <REGION>
}

provider "google-beta" {
  project = <PROJECT_ID>
  region  = <REGION>
}

resource "google_service_networking_connection" "google_managed_services_peering" {
  network                 = <VPC_ID>
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = <RESERVED_PEERING_RANGES>
  provider                = google-beta
}

@c2thorn
Collaborator

c2thorn commented Dec 14, 2023

The Google-internal ticket has an owner who is investigating, assigning modular-magician to mark that.


github-actions bot commented Feb 9, 2024

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 9, 2024