
PreconditionFailedEtagMismatch - Manually created outbound rule deleted from LB when AKS is changed #13886

Closed
fraozy opened this issue Oct 25, 2021 · 4 comments

fraozy commented Oct 25, 2021

I have a manually created "backend pool" (aks01_app_nodepool) and "outbound rule" (avmOutboundRule) on the kubernetes load balancer, but when I make any update (tags, parameters) to the AKS cluster or the node pool, the manually created outbound rule (avmOutboundRule) is deleted (the manually created backend pool (aks01_app_nodepool) remains).
The error on the kubernetes load balancer is:

"statusMessage": "{\"error\":{\"code\":\"PreconditionFailed\",\"message\":\"Precondition failed.\",\"details\":[
{\"code\":\"PreconditionFailedEtagMismatch\",\"message\":\"Etag provided in if-match header W/\\\"b5a7bea6-a795-4765-b97b-7f454769157a\\\" does not match etag W/\\\"301f711d-ae8e-4035-bd28-49a167ae12fd\\\" of resource /subscriptions/5a00037b-0c8e-4b31-aece-578d252d6ef3/resourceGroups/MC_riot-rg-runtime-01-we-dev_riot-aks-01-we-dev_westeurope/providers/Microsoft.Network/loadBalancers/kubernetes in NRP data store.\"}

Terraform (and AzureRM Provider) Version

Terraform 1.0.8
AzureRM Provider 2.79.1

Affected Resource(s)

azurerm_kubernetes_cluster
azurerm_kubernetes_cluster_node_pool

Terraform Configuration Files

module "kubernetes_cluster" {
  source                         = "../../modules/kubernetes_cluster"
  name                           = "${var.project_name}-aks-01-${var.region}-${var.stage}"
  subnet_name                    = "${var.project_name}-snet-aks-01-${var.region}-${var.stage}"
  sp_display_name                = "${var.project_name}-aks-01-${var.region}-${var.stage}-agentpool"
  resource_group_name            = module.resource_group_pf_runtime.name
  location                       = module.resource_group_pf_runtime.location
  kubernetes_version             = var.aks_01_kubernetes_version
  orchestrator_version           = var.aks_01_orchestrator_version
  vm_size                        = var.aks_01_vm_size
  max_pods                       = var.aks_01_max_pods
  vnet_name                      = module.virtual_network.name
  vnet_address_space             = [var.aks_01_subnet_address_prefix]
  outbound_ports_allocated       = var.aks_01_outbound_ports_allocated
  managed_outbound_ip_count      = var.aks_01_managed_outbound_ip_count
  serviceprincipal_appid         = var.pf_aks_service_principal_appid
  serviceprincipal_secret        = var.pf_aks_service_principal_secret
  public_ssh_certificate         = var.aks_public_ssh_certificate
  load_balancer_sku              = var.aks_01_load_balancer_sku
  min_count                      = var.aks_01_auto_scaling_min_count
  max_count                      = var.aks_01_auto_scaling_max_count
  os_disk_size_gb                = var.aks_01_os_disk_size_gb
  os_disk_type                   = var.aks_01_os_disk_type
  eventhub_name                  = module.event_hub_logs01.name
  eventhub_authorization_rule_id = module.eventhub_namespace_authorization_rule_ehns_logging01.id

  service_endpoints = [
    "Microsoft.Storage",
    "Microsoft.AzureCosmosDB",
    "Microsoft.KeyVault",
    "Microsoft.ServiceBus",
    "Microsoft.Sql"
  ]

  tags = {
    creator    = local.tag_creator
    tf-version = local.tag_tf_version
    env        = var.stage
  }
}

module "aks01_app_nodepool" {
  source                = "../../modules/kubernetes_cluster_node_pool"
  name                  = var.aks_01_agentpool
  kubernetes_cluster_id = module.kubernetes_cluster.id
  mode                  = "System"
  vm_size               = var.aks_01_avm_vm_size
  min_count             = var.aks_01_avm_auto_scaling_min_count
  max_count             = var.aks_01_avm_auto_scaling_max_count
  max_pods              = var.aks_01_avm_max_pods
  os_disk_size_gb       = var.aks_01_os_disk_size_gb
  os_disk_type          = var.aks_01_os_disk_type
  vnet_subnet_id        = module.kubernetes_cluster.subnet_id
  orchestrator_version  = var.aks_01_agent_orchestrator_version
  node_labels = {
    "app" = "bundle-app"
  }
  node_taints = [
    "agentpool=avmpool:NoSchedule"
  ]

  tags = {
    creator    = local.tag_creator
    tf-version = local.tag_tf_version
    env        = var.stage
  }
}

Manually created rule on the kubernetes load balancer:

Name: avmOutboundRule
Frontend IP address: aks01_app_nodepool
Protocol: TCP
Idle timeout (minutes): 30
TCP Reset: Enabled
Backend pool: avmPool
Choose by: Ports per instance
Port per instance: 6400
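
For illustration, this is roughly how that manual rule would map onto Terraform resources. A minimal sketch, assuming the node resource group from the error message and the names listed above; everything else is illustrative, and keeping the rule in Terraform would most likely not stop the AKS reconciler from removing it:

data "azurerm_lb" "kubernetes" {
  name                = "kubernetes"
  resource_group_name = "MC_riot-rg-runtime-01-we-dev_riot-aks-01-we-dev_westeurope" # node resource group from the error message
}

data "azurerm_lb_backend_address_pool" "avm" {
  name            = "avmPool"
  loadbalancer_id = data.azurerm_lb.kubernetes.id
}

resource "azurerm_lb_outbound_rule" "avm" {
  name                     = "avmOutboundRule"
  resource_group_name      = data.azurerm_lb.kubernetes.resource_group_name
  loadbalancer_id          = data.azurerm_lb.kubernetes.id
  protocol                 = "Tcp"
  enable_tcp_reset         = true
  idle_timeout_in_minutes  = 30
  allocated_outbound_ports = 6400
  backend_address_pool_id  = data.azurerm_lb_backend_address_pool.avm.id

  frontend_ip_configuration {
    name = "aks01_app_nodepool" # frontend IP configuration name as listed above
  }
}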

Debug Output

Panic Output

Expected Behaviour

The manually created outbound rule (avmOutboundRule) should remain when AKS or the node pool is updated using Terraform.

Actual Behaviour

The outbound rule (avmOutboundRule) is deleted on any update.

Steps to Reproduce

Make any update to the kubernetes_cluster module (outbound_ports_allocated, tags, or any other parameter) and run terraform apply.

Important Factoids

I opened a case with Azure support and they indicated that there is no issue on their side (I performed the same steps directly from the Azure Portal and the problem does not happen).

aristosvo (Collaborator) commented

Hi @fraozy! Interesting issue. To manage expectations: I might not have a solution for you, but I do have a few questions :)

First of all, to get the scenario straight, I'd really like to know why an outbound rule was specified in the first place, as it is normally not advised to modify the AKS-managed resources themselves unless it's really necessary. If possible, I would advise steering egress traffic with a UDR instead of manually creating an outbound rule (I'm not completely sure I understand your use case here). If there is a use case that should be supported in Terraform, I'd like to know.
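
As a reference for the UDR suggestion, a minimal sketch assuming a hypothetical firewall/NVA as the next hop; the resource names and next-hop address are illustrative, and only the subnet reference comes from the configuration above:

resource "azurerm_route_table" "aks_egress" {
  name                = "rt-aks-egress" # hypothetical name
  location            = module.resource_group_pf_runtime.location
  resource_group_name = module.resource_group_pf_runtime.name
}

resource "azurerm_route" "default_via_nva" {
  name                   = "default-via-nva"
  resource_group_name    = module.resource_group_pf_runtime.name
  route_table_name       = azurerm_route_table.aks_egress.name
  address_prefix         = "0.0.0.0/0"
  next_hop_type          = "VirtualAppliance"
  next_hop_in_ip_address = "10.0.0.4" # hypothetical firewall/NVA address
}

resource "azurerm_subnet_route_table_association" "aks" {
  subnet_id      = module.kubernetes_cluster.subnet_id
  route_table_id = azurerm_route_table.aks_egress.id
}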

Secondly, it would be interesting to know exactly what the error is and what triggered it. I'm curious whether this is a reconciliation error that could bite you at a later stage (for instance when upgrading AKS to a higher version) even if you made the change in the Portal, or whether it's something specific to Terraform (which I find unlikely).

Let's start with that, and work further from there 👍🏽

fraozy (Author) commented Oct 26, 2021

Hello @aristosvo. About your points:

  • I have an application running in AKS pods that requires a dedicated frontend IP to work.
    This application runs on dedicated nodes (using a node selector and tolerations in the application's deployment manifest, and a taint on aks01_app_nodepool).

    I have an AKS Service of type LoadBalancer pointing at the kubernetes load balancer (using the annotation "service.beta.kubernetes.io/azure-load-balancer-resource-group: " and the parameter "loadBalancerIP: "); a rough sketch of such a Service is shown after this list.
    To finish the implementation, I manually created the backend pool and the outbound rule, so that only the nodes of aks01_app_nodepool use the frontend IP.

    I created the backend pool and outbound rule manually because I cannot bind the frontend IP directly on the LB (that has to be done by the AKS LoadBalancer Service), and without the frontend IP I cannot create the outbound rule. Additionally, I do not use the "kubernetes provider" yet, so Terraform only creates the infrastructure and an Azure DevOps pipeline applies the AKS manifests.

  • I do not get any error at terraform plan or apply (both complete successfully; the change I am making is the number of allocated ports, the "outbound_ports_allocated" parameter of the AKS module). The error happens while my changes are being applied. I am sharing some of the logs I collected from the Azure Load Balancer.
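
For illustration only, a minimal sketch of such a LoadBalancer Service, written with the Terraform kubernetes provider (in the setup above the manifests are applied by an Azure DevOps pipeline instead); the Service name, selector, ports, and IP are assumptions, and the resource group is the node resource group from the error message:

resource "kubernetes_service" "bundle_app" {
  metadata {
    name = "bundle-app" # hypothetical Service name
    annotations = {
      # Tell the Azure cloud provider which resource group holds the load balancer / frontend IP
      "service.beta.kubernetes.io/azure-load-balancer-resource-group" = "MC_riot-rg-runtime-01-we-dev_riot-aks-01-we-dev_westeurope"
    }
  }

  spec {
    type             = "LoadBalancer"
    load_balancer_ip = "10.0.0.100" # hypothetical dedicated frontend IP

    selector = {
      app = "bundle-app" # hypothetical pod label, mirroring the node label in the node pool config
    }

    port {
      port        = 443  # hypothetical
      target_port = 8443 # hypothetical
    }
  }
}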

Outbound_rule_issue.zip

Thanks a lot.

aristosvo (Collaborator) commented Oct 26, 2021

Well, that's something, @fraozy!

Can't this be solved with public IPs for a specific node pool, or with a different subnet for your node pool? Doing it from outside the cluster sounds like mission impossible; using K8s LoadBalancer services you may be able to do a bit more.
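
For completeness, a minimal sketch of the node-pool public IP option, written directly against azurerm_kubernetes_cluster_node_pool rather than the module used above; whether node-level public IPs satisfy the dedicated frontend IP requirement depends on the application, and the name, size, and count are illustrative:

resource "azurerm_kubernetes_cluster_node_pool" "avm" {
  name                  = "avmpool" # illustrative name
  kubernetes_cluster_id = module.kubernetes_cluster.id
  mode                  = "System"
  vm_size               = "Standard_D4s_v3" # illustrative size
  node_count            = 1                 # illustrative count
  enable_node_public_ip = true              # each node gets its own public IP
  node_taints           = ["agentpool=avmpool:NoSchedule"]
}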

I can't blame azurerm for this problem or relate it to azurerm in any way: the behaviour is triggered by the AKS reconciler, and although the reconciler is kicked off by your Terraform actions, there is no way to avoid AKS reconciling its load balancer resources. When you make changes like this from the Portal, you act directly on the resources instead of calling the AKS APIs, so the AKS reconcilers are never triggered. When those reconcilers are triggered in any other way (an upgrade, for instance), you'll probably see the same behaviour.

I'm sorry, I think that is all we can do for you from here.

fraozy closed this as completed Nov 3, 2021

github-actions bot commented Dec 4, 2021

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Dec 4, 2021