default_node_pool not upgraded with automatic_channel_upgrade = 'patch' #301

Closed
aescrob opened this issue Feb 14, 2023 · 9 comments
Labels
bug Something isn't working

Comments

@aescrob
Contributor

aescrob commented Feb 14, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Greenfield/Brownfield provisioning

brownfield

Terraform Version

1.3.2

Module Version

6.6.0

AzureRM Provider Version

3.43.0

Affected Resource(s)/Data Source(s)

azurerm_kubernetes_cluster

Terraform Configuration Files

module "aks" {
  source  = "Azure/aks/azurerm"
  version = "6.6.0"

  # General
  cluster_name              = var.cluster_name
  admin_username            = "azureuser"
  prefix                    = var.prefix
  resource_group_name       = azurerm_resource_group.aks.name
  kubernetes_version        = var.cluster_version
  automatic_channel_upgrade = "patch"
  node_resource_group       = "${azurerm_resource_group.aks.name}-${var.infrastructure_resource_group_suffix}"
  sku_tier                  = "Paid"
  disk_encryption_set_id    = azurerm_disk_encryption_set.des_os_disk.id

 
  # Default node pool for system components like CoreDNS
  agents_pool_name     = "system"
  enable_auto_scaling  = true
  agents_size          = var.system_agents_size
  orchestrator_version = var.cluster_version
  # A minimum of two nodes with 4 vCPUs is recommended (for example, Standard_DS4_v2), especially for large clusters (multiple CoreDNS Pod replicas, 3-4+ add-ons, etc.)
  agents_min_count          = 2
  agents_max_count          = var.agents_max_count
  agents_availability_zones = ["1", "2", "3"]

  # KMS encryption
  kms_enabled                  = true
  kms_key_vault_key_id         = azurerm_key_vault_key.kms.id
  kms_key_vault_network_access = "Private"

}

#----------------------------
# additional node pools for workloads
#----------------------------
resource "azurerm_kubernetes_cluster_node_pool" "nodepool1" {
  kubernetes_cluster_id = module.aks.aks_id
  name                  = "nodepool1"
  vm_size               = var.agents_size
  vnet_subnet_id        = var.subnet_id
  zones                 = ["1"]
  enable_auto_scaling   = true
  min_count             = var.agents_min_count
  max_count             = var.agents_max_count
  orchestrator_version  = var.cluster_version

  upgrade_settings {
    max_surge = var.upgrade_settings_max_surge
  }

  lifecycle {
    create_before_destroy = true
  }
}

tfvars variables values

cluster_version = 1.24

Debug Output/Panic Output

Error:      	Received unexpected error:
        	            	FatalError{Underlying: error while running command: exit status 1; ╷
        	            	│ Error: Resource precondition failed
        	            	│ 
        	            	│   on .terraform/modules/aks_post.aks/main.tf line 287, in resource "azurerm_kubernetes_cluster" "main":
        	            	│  287:       condition     = local.automatic_channel_upgrade_check
        	            	│     ├────────────────
        	            	│     │ local.automatic_channel_upgrade_check is false
        	            	│ 
        	            	│ Either disable automatic upgrades, or only specify up to the minor version
        	            	│ when using `automatic_channel_upgrade=patch` or don't specify
        	            	│ `kubernetes_version` at all when using
        	            	│ `automatic_channel_upgrade=stable|rapid|node-image`. With automatic
        	            	│ upgrades `orchestrator_version` must be set to `null`.
        	            	╵}

Expected Behaviour

'default_node_pool' is upgraded based on 'orchestrator_version'

Actual Behaviour

Upgrading cluster_version from 1.23.x to 1.24.x does not upgrade the orchestrator version of the 'default_node_pool', but it does upgrade all 'azurerm_kubernetes_cluster_node_pool' resources, because we set 'orchestrator_version = cluster_version' on them.

Setting 'orchestrator_version = cluster_version' on the 'default_node_pool' is prevented by the precondition.

Steps to Reproduce

No response

Important Factoids

No response

References

No response

@aescrob aescrob added the bug Something isn't working label Feb 14, 2023
@lonegunmanb
Member

lonegunmanb commented Feb 15, 2023

Thanks @aescrob for opening this issue. The failed check was:

  automatic_channel_upgrade_check = var.automatic_channel_upgrade == null ? true : var.orchestrator_version == null && (
    (contains(["patch"], var.automatic_channel_upgrade) && can(regex("^[0-9]{1,}\\.[0-9]{1,}$", var.kubernetes_version)))
    || (contains(["rapid", "stable", "node-image"], var.automatic_channel_upgrade) && var.kubernetes_version == null
  ))

If you assign patch to automatic_channel_upgrade, then you cannot set var.orchestrator_version at the same time. This restriction was introduced by #281 and #283.
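
For illustration, here is a minimal sketch (not taken from the issue; required module arguments other than the ones shown are omitted) of inputs that satisfy the current check:

# Passes with automatic_channel_upgrade = "patch": kubernetes_version specifies
# only major.minor and orchestrator_version stays unset (null).
module "aks" {
  source  = "Azure/aks/azurerm"
  version = "6.6.0"

  automatic_channel_upgrade = "patch"
  kubernetes_version        = "1.24"
  # orchestrator_version is intentionally not set
  # ... remaining required arguments as in the configuration above
}

# Passes with "rapid", "stable" or "node-image": both kubernetes_version and
# orchestrator_version must be left null.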

We'd like to hear from you if you have any further questions or thoughts on this.

@the-technat
Contributor

the-technat commented Feb 15, 2023

To give some additional context: as @aescrob described, we deployed multiple clusters with kubernetes_version set to 1.24, omitting orchestrator_version, and deployed additional node pools where orchestrator_version is set to 1.24 as well. This is the result:

NAME                                STATUS   ROLES   AGE     VERSION
aks-nodepool1-75779344-vmss000000   Ready    agent   22h     v1.24.9
aks-nodepool2-45509703-vmss00004x   Ready    agent   18h     v1.24.9
aks-nodepool3-28306104-vmss000051   Ready    agent   5h42m   v1.24.9
aks-system-26220322-vmss000000      Ready    agent   6d17h   v1.23.15
aks-system-26220322-vmss000001      Ready    agent   6d17h   v1.23.15

So we assume that the system node pool (the one with orchestrator_version=null) doesn't upgrade itself automatically, even though it should according to the docs.

Upgrade Behavior Docs: https://learn.microsoft.com/en-us/azure/aks/auto-upgrade-cluster#auto-upgrade-limitations

@aescrob is currently testing what happens when we set orchestrator_version=1.24 on the system node pool, but as the issue initially said, there is a precondition preventing this.

@aescrob
Contributor Author

aescrob commented Feb 15, 2023

In this case (patch) we need to be able to set orchestrator_version. I've just successfully tested an upgrade of the default_node_pool as well, with the following change to the precondition:

diff --git a/locals.tf b/locals.tf
index b022516..7fe24b8 100644
--- a/locals.tf
+++ b/locals.tf
@@ -3,12 +3,12 @@ locals {
   auto_scaler_profile_scale_down_delay_after_delete = var.auto_scaler_profile_scale_down_delay_after_delete == null ? var.auto_scaler_profile_scan_interval : var.auto_scaler_profile_scale_down_delay_after_delete
   # automatic upgrades are either:
   # - null
-  # - patch, but then the kubernetes_version must not specify a patch number and orchestrator_version must be null
+  # - patch, but then neither the kubernetes_version nor orchestrator_version must specify a patch number
   # - rapid/stable/node-image, but then the kubernetes_version and the orchestrator_version must be null
-  automatic_channel_upgrade_check = var.automatic_channel_upgrade == null ? true : var.orchestrator_version == null && (
-    (contains(["patch"], var.automatic_channel_upgrade) && can(regex("^[0-9]{1,}\\.[0-9]{1,}$", var.kubernetes_version)))
-    || (contains(["rapid", "stable", "node-image"], var.automatic_channel_upgrade) && var.kubernetes_version == null
-  ))
+  automatic_channel_upgrade_check = var.automatic_channel_upgrade == null ? true : (
+    (contains(["patch"], var.automatic_channel_upgrade) && can(regex("^[0-9]{1,}\\.[0-9]{1,}$", var.kubernetes_version)) && can(regex("^[0-9]{1,}\\.[0-9]{1,}$", var.orchestrator_version))) ||
+    (contains(["rapid", "stable", "node-image"], var.automatic_channel_upgrade) && var.kubernetes_version == null && var.orchestrator_version == null)
+  )

   # Abstract the decision whether to create an Analytics Workspace or not.
   create_analytics_solution  = var.log_analytics_workspace_enabled && var.log_analytics_solution_id == null
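
With that relaxed check, a module call along these lines (a sketch based on the configuration above; values are illustrative) should be accepted, so the default node pool can follow the minor version bump while AKS still picks the patch release automatically:

module "aks" {
  source  = "Azure/aks/azurerm"
  version = "6.6.0"

  automatic_channel_upgrade = "patch"
  kubernetes_version        = "1.24" # major.minor only; patch release chosen by AKS
  orchestrator_version      = "1.24" # now allowed, also without a patch number
  # ... remaining arguments as in the original configuration
}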

@the-technat
Contributor

@aescrob can we close this now that the PR is merged?

@aescrob aescrob closed this as completed Feb 20, 2023
@zioproto
Collaborator

Hello folks, I know the issue is closed, but I want to understand whether there is any leftover work for the AzureRM provider or for AKS.

@the-technat can you please confirm this sequence of events?

Step 1) Terraform deploy with:

automatic_channel_upgrade = patch
kubernetes_version =  1.23.15
orchestrator_version = null

Step 2) Terraform deploy with:

automatic_channel_upgrade = patch
kubernetes_version =  1.24.9
orchestrator_version = null

At this point you noticed that the control plane was upgraded to 1.24.9, but the default node pool was stuck at 1.23.15.

Is this correct?

I checked the docs about auto-upgrade.
The confusion is that the upgrade from 1.23.15 to 1.24.9 was not an auto-upgrade: it was triggered manually by Terraform when you changed the value of the kubernetes_version variable.

My question is: when orchestrator_version = null, should the AzureRM provider also have upgraded the default node pool to 1.24.9?

@zioproto
Copy link
Collaborator

From the docs it is not clear to me what the provider behaviour should be when orchestrator_version = null and the value of kubernetes_version changes after deployment.

[Screenshot of the AKS auto-upgrade documentation]

@the-technat
Contributor

Yes, that's correct. We had the clusters running on v1.23.15 and upgraded them that way, and the default node pool was stuck. According to @aescrob the default node pool is upgraded if you do the upgrade via the Azure CLI, so it might actually be a problem with the provider.

@zioproto
Collaborator

I would expect the provider to call the API equivalent of az aks upgrade --control-plane-only, because that call does not upgrade other node pools, which may be managed in separate resources of type azurerm_kubernetes_cluster_node_pool.

It seems to me that everything works as expected. I am just wondering whether we can improve the documentation or the module to make the user experience better.

@the-technat Do you agree that everything works as expected?

@the-technat
Contributor

Ah, I see.
Yes, I would agree that the provider should only upgrade the control plane. But then the node pools should also be completely separate from the control plane, which currently does not seem to be the case with AKS, as the API expects a list of node pools in the cluster object.
