default_node_pool not upgraded with automatic_channel_upgrade = 'patch' #301

Closed
aescrob opened this issue Feb 14, 2023 · 9 comments
Labels
bug Something isn't working

Comments

@aescrob
Contributor

aescrob commented Feb 14, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Greenfield/Brownfield provisioning

brownfield

Terraform Version

1.3.2

Module Version

6.6.0

AzureRM Provider Version

3.43.0

Affected Resource(s)/Data Source(s)

azurerm_kubernetes_cluster

Terraform Configuration Files

module "aks" {
  source  = "Azure/aks/azurerm"
  version = "6.6.0"

  # General
  cluster_name              = var.cluster_name
  admin_username            = "azureuser"
  prefix                    = var.prefix
  resource_group_name       = azurerm_resource_group.aks.name
  kubernetes_version        = var.cluster_version
  automatic_channel_upgrade = "patch"
  node_resource_group       = "${azurerm_resource_group.aks.name}-${var.infrastructure_resource_group_suffix}"
  sku_tier                  = "Paid"
  disk_encryption_set_id    = azurerm_disk_encryption_set.des_os_disk.id

 
  # Default node pool for system components like CoreDNS
  agents_pool_name     = "system"
  enable_auto_scaling  = true
  agents_size          = var.system_agents_size
  orchestrator_version = var.cluster_version
  # A minimum of two nodes with 4 vCPUs is recommended (for example, Standard_DS4_v2), especially for large clusters (multiple CoreDNS Pod replicas, 3-4+ add-ons, etc.)
  agents_min_count          = 2
  agents_max_count          = var.agents_max_count
  agents_availability_zones = ["1", "2", "3"]

  # KMS encryption
  kms_enabled                  = true
  kms_key_vault_key_id         = azurerm_key_vault_key.kms.id
  kms_key_vault_network_access = "Private"

}

#----------------------------
# additional node pools for workloads
#----------------------------
resource "azurerm_kubernetes_cluster_node_pool" "nodepool1" {
  kubernetes_cluster_id = module.aks.aks_id
  name                  = "nodepool1"
  vm_size               = var.agents_size
  vnet_subnet_id        = var.subnet_id
  zones                 = ["1"]
  enable_auto_scaling   = true
  min_count             = var.agents_min_count
  max_count             = var.agents_max_count
  orchestrator_version  = var.cluster_version

  upgrade_settings {
    max_surge = var.upgrade_settings_max_surge
  }

  lifecycle {
    create_before_destroy = true
  }
}

tfvars variables values

cluster_version = 1.24

Debug Output/Panic Output

Error:      	Received unexpected error:
        	            	FatalError{Underlying: error while running command: exit status 1; ╷
        	            	│ Error: Resource precondition failed
        	            	│ 
        	            	│   on .terraform/modules/aks_post.aks/main.tf line 287, in resource "azurerm_kubernetes_cluster" "main":
        	            	│  287:       condition     = local.automatic_channel_upgrade_check
        	            	│     ├────────────────
        	            	│     │ local.automatic_channel_upgrade_check is false
        	            	│ 
        	            	│ Either disable automatic upgrades, or only specify up to the minor version
        	            	│ when using `automatic_channel_upgrade=patch` or don't specify
        	            	│ `kubernetes_version` at all when using
        	            	│ `automatic_channel_upgrade=stable|rapid|node-image`. With automatic
        	            	│ upgrades `orchestrator_version` must be set to `null`.
        	            	╵}

Expected Behaviour

'default_node_pool' is upgraded based on 'orchestrator_version'

Actual Behaviour

Upgrading cluster_version from 1.23.x to 1.24.x does not upgrade the orchestrator version of the 'default_node_pool', but it does upgrade all 'azurerm_kubernetes_cluster_node_pool' resources, because we set 'orchestrator_version = cluster_version' on them.

Setting 'orchestrator_version = cluster_version' on the 'default_node_pool' is prevented by the precondition.

Steps to Reproduce

No response

Important Factoids

No response

References

No response

@aescrob aescrob added the bug Something isn't working label Feb 14, 2023
@lonegunmanb
Member

lonegunmanb commented Feb 15, 2023

Thanks @aescrob for opening this issue. The failed check was:

  automatic_channel_upgrade_check = var.automatic_channel_upgrade == null ? true : var.orchestrator_version == null && (
    (contains(["patch"], var.automatic_channel_upgrade) && can(regex("^[0-9]{1,}\\.[0-9]{1,}$", var.kubernetes_version)))
    || (contains(["rapid", "stable", "node-image"], var.automatic_channel_upgrade) && var.kubernetes_version == null
  ))

If you assign patch to automatic_channel_upgrade, then you cannot set var.orchestrator_version at the same time. This restriction was introduced by #281 and #283.
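
For illustration, here is a minimal sketch (not taken from the issue; required module arguments other than the ones shown are omitted) of inputs that satisfy the current check:

# Passes with automatic_channel_upgrade = "patch": kubernetes_version specifies
# only major.minor and orchestrator_version stays unset (null).
module "aks" {
  source  = "Azure/aks/azurerm"
  version = "6.6.0"

  automatic_channel_upgrade = "patch"
  kubernetes_version        = "1.24"
  # orchestrator_version is intentionally not set
  # ... remaining required arguments as in the configuration above
}

# Passes with "rapid", "stable" or "node-image": both kubernetes_version and
# orchestrator_version must be left null.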

We'd like to hear from you if you have any further questions or thoughts on this.

@the-technat
Contributor

the-technat commented Feb 15, 2023

To give some additional context: as @aescrob described, we deployed multiple clusters with kubernetes_version set to 1.24, omitting orchestrator_version, and deployed additional node pools where orchestrator_version is set to 1.24 as well. This is the result:

NAME                                STATUS   ROLES   AGE     VERSION
aks-nodepool1-75779344-vmss000000   Ready    agent   22h     v1.24.9
aks-nodepool2-45509703-vmss00004x   Ready    agent   18h     v1.24.9
aks-nodepool3-28306104-vmss000051   Ready    agent   5h42m   v1.24.9
aks-system-26220322-vmss000000      Ready    agent   6d17h   v1.23.15
aks-system-26220322-vmss000001      Ready    agent   6d17h   v1.23.15

So we assume that the system node pool (the one with orchestrator_version=null) doesn't upgrade itself automatically, even though it should according to the docs.

Upgrade Behavior Docs: https://learn.microsoft.com/en-us/azure/aks/auto-upgrade-cluster#auto-upgrade-limitations

@aescrob is currently testing what happens when we set orchestrator_version=1.24 on the system node pool, but as the issue initially said, there is a precondition preventing this.

@aescrob
Contributor Author

aescrob commented Feb 15, 2023

In this case (patch) we need to be able to set orchestrator_version. I've just successfully tested an upgrade of the default_node_pool as well, with the following change to the precondition:

diff --git a/locals.tf b/locals.tf
index b022516..7fe24b8 100644
--- a/locals.tf
+++ b/locals.tf
@@ -3,12 +3,12 @@ locals {
   auto_scaler_profile_scale_down_delay_after_delete = var.auto_scaler_profile_scale_down_delay_after_delete == null ? var.auto_scaler_profile_scan_interval : var.auto_scaler_profile_scale_down_delay_after_delete
   # automatic upgrades are either:
   # - null
-  # - patch, but then the kubernetes_version must not specify a patch number and orchestrator_version must be null
+  # - patch, but then neither the kubernetes_version nor orchestrator_version must specify a patch number
   # - rapid/stable/node-image, but then the kubernetes_version and the orchestrator_version must be null
-  automatic_channel_upgrade_check = var.automatic_channel_upgrade == null ? true : var.orchestrator_version == null && (
-    (contains(["patch"], var.automatic_channel_upgrade) && can(regex("^[0-9]{1,}\\.[0-9]{1,}$", var.kubernetes_version)))
-    || (contains(["rapid", "stable", "node-image"], var.automatic_channel_upgrade) && var.kubernetes_version == null
-  ))
+  automatic_channel_upgrade_check = var.automatic_channel_upgrade == null ? true : (
+    (contains(["patch"], var.automatic_channel_upgrade) && can(regex("^[0-9]{1,}\\.[0-9]{1,}$", var.kubernetes_version)) && can(regex("^[0-9]{1,}\\.[0-9]{1,}$", var.orchestrator_version))) ||
+    (contains(["rapid", "stable", "node-image"], var.automatic_channel_upgrade) && var.kubernetes_version == null && var.orchestrator_version == null)
+  )

   # Abstract the decision whether to create an Analytics Workspace or not.
   create_analytics_solution  = var.log_analytics_workspace_enabled && var.log_analytics_solution_id == null
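
With that relaxed check, a module call along these lines (a sketch based on the configuration above; values are illustrative) should be accepted, so the default node pool can follow the minor version bump while AKS still picks the patch release automatically:

module "aks" {
  source  = "Azure/aks/azurerm"
  version = "6.6.0"

  automatic_channel_upgrade = "patch"
  kubernetes_version        = "1.24" # major.minor only; patch release chosen by AKS
  orchestrator_version      = "1.24" # now allowed, also without a patch number
  # ... remaining arguments as in the original configuration
}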

@the-technat
Contributor

@aescrob can we close this now that the PR is merged?

@aescrob aescrob closed this as completed Feb 20, 2023
@zioproto
Collaborator

Hello folks, I know the issue is closed, but I want to understand whether there is any leftover work for the AzureRM provider or for AKS.

@the-technat can you please confirm this sequence of events?

Step 1) Terraform deploy with:

automatic_channel_upgrade = patch
kubernetes_version =  1.23.15
orchestrator_version = null

Step 2) Terraform deploy with:

automatic_channel_upgrade = patch
kubernetes_version =  1.24.9
orchestrator_version = null

At this point you noticed that the control plane was upgraded to 1.24.9, but the default node pool was stuck at 1.23.15.

Is this correct?

I checked the docs about auto-upgrade.
The confusion is that the upgrade from 1.23.15 to 1.24.9 was not an auto-upgrade: it was triggered manually by Terraform when you changed the value of the kubernetes_version variable.

My question is: when orchestrator_version = null, should the AzureRM provider also have upgraded the default node pool to 1.24.9?

@zioproto
Copy link
Collaborator

From the docs it is not clear to me what the provider behaviour should be when orchestrator_version = null and the value of kubernetes_version changes after deployment.

[Screenshot of the AKS auto-upgrade documentation]

@the-technat
Contributor

Yes, that's correct. We had the clusters running on v1.23.15 and upgraded them that way, and the default node pool was stuck. According to @aescrob the default node pool is upgraded if you do the upgrade via the Azure CLI, so it might actually be a problem with the provider.

@zioproto
Collaborator

I would expect the provider to call the API equivalent of az aks upgrade --control-plane-only, because that call does not upgrade other node pools, which may be managed in separate resources of type azurerm_kubernetes_cluster_node_pool.

It seems to me that everything works as expected. I am just wondering whether we can improve the documentation or the module to make the user experience better.

@the-technat Do you agree that everything works as expected?

@the-technat
Contributor

Ah, I see.
Yes, I would agree that the provider should only upgrade the control plane. But then the node pools should also be completely separate from the control plane, which currently does not seem to be the case with AKS, as the API expects a list of node pools in the cluster object.
