feat!: add create_before_destroy=true to node pools #357

Merged (1 commit, May 4, 2023)
259 changes: 130 additions & 129 deletions README.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion examples/multiple_node_pools/main.tf
@@ -34,7 +34,7 @@ resource "azurerm_subnet" "test" {
 locals {
   nodes = {
     for i in range(3) : "worker${i}" => {
-      name           = substr("worker${i}${random_id.prefix.hex}", 0, 12)
+      name           = substr("worker${i}${random_id.prefix.hex}", 0, 8)
       vm_size        = "Standard_D2s_v3"
       node_count     = 1
       vnet_subnet_id = azurerm_subnet.test.id

2 changes: 1 addition & 1 deletion examples/multiple_node_pools/variables.tf
@@ -11,4 +11,4 @@ variable "location" {
 variable "resource_group_name" {
   type    = string
   default = null
-}
+}

23 changes: 22 additions & 1 deletion main.tf
@@ -470,7 +470,7 @@ resource "azurerm_kubernetes_cluster_node_pool" "node_pool" {
   for_each = var.node_pools
 
   kubernetes_cluster_id         = azurerm_kubernetes_cluster.main.id
-  name                          = each.value.name
+  name                          = "${each.value.name}${substr(md5(jsonencode(each.value)), 0, 4)}"
   vm_size                       = each.value.vm_size
   capacity_reservation_group_id = each.value.capacity_reservation_group_id
   custom_ca_trust_enabled       = each.value.custom_ca_trust_enabled
@@ -592,10 +592,31 @@ resource "azurerm_kubernetes_cluster_node_pool" "node_pool" {
   depends_on = [azapi_update_resource.aks_cluster_post_create]
 
   lifecycle {
+    create_before_destroy = true
Contributor commented:

At least README notes are required if we "hardcode" this. Node pool names need to be unique, and create_before_destroy = true will, by default, create another node pool with the same name.
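
A minimal sketch of the clash being described here, with an illustrative pool name and size (this is not code from the PR):

resource "azurerm_kubernetes_cluster_node_pool" "example" {
  kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
  name                  = "worker" # fixed name, reused by the replacement pool
  vm_size               = "Standard_D2s_v3"
  node_count            = 1

  lifecycle {
    # On a force-new change, Terraform creates the replacement before destroying
    # the old pool, so AKS briefly sees two pools requesting the name "worker"
    # and rejects the create.
    create_before_destroy = true
  }
}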

@the-technat (Contributor, Author) commented on Apr 25, 2023:

Yep, definitely. Since the change is hard coded (and can only be hard coded), I think we should also hard code a random suffix into the node pool name.
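
A rough sketch of that idea, assuming a per-pool random_string keyed to the pool definition (illustrative only, not the exact code this PR ends up with):

resource "random_string" "pool_suffix" {
  for_each = var.node_pools

  length  = 4
  special = false
  upper   = false

  # A new suffix (and therefore a new pool name) is generated whenever
  # the pool definition changes.
  keepers = {
    pool = jsonencode(each.value)
  }
}

resource "azurerm_kubernetes_cluster_node_pool" "node_pool" {
  for_each = var.node_pools

  kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
  name                  = "${each.value.name}${random_string.pool_suffix[each.key].result}"
  vm_size               = each.value.vm_size
  # ...

  lifecycle {
    create_before_destroy = true
  }
}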

Member commented:

Yeah, I like this random suffix idea.

By the way, it is possible to make create_before_destroy = true dynamic: we could declare two nearly identical resources behind a toggle, one with create_before_destroy = true and one without. Like:

resource "azurerm_kubernetes_cluster_node_pool" node_pool {
  count = var.create_before_destroy ? 0 : 1

 ...
}

resource "azurerm_kubernetes_cluster_node_pool" node_pool_create_before_destroy {
  count = var.create_before_destroy ? 1 : 0

 ...
  create_before_destroy = true
}

But in this case we won't need that trick; I agree with you, the node pool should have create_before_destroy.

I have another question, @the-technat: even with create_before_destroy = true, how can we achieve seamless node pool updates? The old pool would be destroyed as soon as the new pool is ready; would that cause downtime?

Another option is to maintain the node pools in a blue-green pattern: an upgrade would create a blue pool, then drain the green pool, then destroy the green pool. I would consider that plan B if re-creating a node pool with create_before_destroy still causes downtime.

@the-technat (Contributor, Author) commented on Apr 25, 2023:

So the random suffix is implemented now, WDYT?

Regarding doing it dynamically: I think it's not DRY if we have a switch for the node pools and every feature has to be implemented on both resources. It also adds many more edge cases that I'm not sure we want to test, which is why I'm in favour of hard coding it.

Regarding your other question, that depends entirely on how the AzureRM API implements deletion of the old node pool. I assume it starts by cordoning all the nodes and then draining them, which moves the workload onto the new nodes. That should give you a seamless update. To ensure 100% availability of the apps, they have to use PDBs (which they should use anyway) to prevent Kubernetes from draining nodes too fast (draining respects PDBs and waits until the pods are started on the new nodes).
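
For illustration, a minimal PodDisruptionBudget written with the kubernetes provider (the label and threshold are made up); node drains respect it and wait until enough replacement pods are ready elsewhere:

resource "kubernetes_pod_disruption_budget_v1" "app" {
  metadata {
    name      = "app"
    namespace = "default"
  }

  spec {
    # Keep at least 80% of the matching pods available while nodes are drained.
    min_available = "80%"

    selector {
      match_labels = {
        app = "my-app"
      }
    }
  }
}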

@lonegunmanb (Member) commented on Apr 25, 2023:

I'm wondering: would the current implementation, which uses random_string's result, actually be refreshed when the node pool is recreated?

Member commented:

Yeah, we do need sufficient tests for this Terraform circus show 😃

@lonegunmanb (Member) commented on Apr 26, 2023:

Hi @the-technat, I've done some tests on my side and the results look good. I tried updating the name, force-new attributes, and update-in-place attributes, and all scenarios work as expected. I'd encourage you to run the same tests on your side so we're on the same page.

@the-technat (Contributor, Author) commented on Apr 27, 2023:

@lonegunmanb I did some tests with your suggestion and it worked fine. Now I also see why my approach wouldn't work (it recreates the node pool for every non-breaking attribute change too...).

I updated the PR with your approach and fixed one test (node pool names can now only be up to 8 characters, since the module itself adds 4). I'm still not very happy with the null_resource, but I guess that's our only option. Maybe it's worth filing a feature request for the AKS API to add a new field use_name_prefix=true on the node pool, which would automatically use the given name as a prefix and inject a random suffix when the node pool gets recreated.

Sorry it took me so long to understand why your approach was needed. I definitely underestimated the effect / effort of implementing this "feature".

@lonegunmanb (Member) commented on Apr 28, 2023:

@the-technat Terraform 1.4 added a new built-in resource, terraform_data, to replace null_resource, so I'm wondering whether we should use it. The new resource would free us from the null provider, but it also requires Terraform Core >= 1.4.

This null_resource + md5 + replace_triggered_by + create_before_destroy pattern can be used in multiple scenarios, and I think Pulumi has faced the very same problem, which is why they introduced auto-naming.
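
A sketch of what the keeper could look like with terraform_data, assuming it is a drop-in replacement for the null_resource in this PR (Terraform >= 1.4):

resource "terraform_data" "pool_name_keeper" {
  for_each = var.node_pools

  # Changing the pool name updates this resource, which triggers
  # replacement of the node pool via replace_triggered_by.
  input = each.value.name
}

# In azurerm_kubernetes_cluster_node_pool.node_pool:
#   lifecycle {
#     create_before_destroy = true
#     ignore_changes        = [name]
#     replace_triggered_by  = [terraform_data.pool_name_keeper[each.key]]
#   }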

@the-technat (Contributor, Author) commented on Apr 28, 2023:

Interesting, terraform_data definitely looks like the better approach.

Because this is a breaking change anyway, I guess we could do it. Otherwise I'd be wary of bumping the required Terraform version to >= 1.4, as that affects all parent / sibling modules.

@mkilchhofer WDYT about using terraform_data?

+    ignore_changes = [
+      name
+    ]
+    replace_triggered_by = [
+      null_resource.pool_name_keeper[each.key],
+    ]
 
     precondition {
       condition     = var.agents_type == "VirtualMachineScaleSets"
       error_message = "Multiple Node Pools are only supported when the Kubernetes Cluster is using Virtual Machine Scale Sets."
     }
 
+    precondition {
+      condition     = can(regex("[a-z0-9]{1,8}", each.value.name))
+      error_message = "A Node Pool's name must consist of alphanumeric characters and have a maximum length of 8 characters (4 random chars are added as a suffix)."
+    }
   }
 }
 
+resource "null_resource" "pool_name_keeper" {
+  for_each = var.node_pools
+
+  triggers = {
+    pool_name = each.value.name
+  }
+}

2 changes: 1 addition & 1 deletion variables.tf
@@ -779,7 +779,7 @@ variable "node_pools" {
   description = <<-EOT
   A map of node pools that about to be created and attached on the Kubernetes cluster. The key of the map can be the name of the node pool, and the key must be static string. The value of the map is a `node_pool` block as defined below:
   map(object({
-    name = (Required) The name of the Node Pool which should be created within the Kubernetes Cluster. Changing this forces a new resource to be created. A Windows Node Pool cannot have a `name` longer than 6 characters.
+    name = (Required) The name of the Node Pool which should be created within the Kubernetes Cluster. Changing this forces a new resource to be created. A Windows Node Pool cannot have a `name` longer than 6 characters. A random suffix of 4 characters is always added to the name to avoid clashes during recreates.
     node_count = (Optional) The initial number of nodes which should exist within this Node Pool. Valid values are between `0` and `1000` (inclusive) for user pools and between `1` and `1000` (inclusive) for system pools and must be a value in the range `min_count` - `max_count`.
     tags = (Optional) A mapping of tags to assign to the resource. At this time there's a bug in the AKS API where Tags for a Node Pool are not stored in the correct case - you [may wish to use Terraform's `ignore_changes` functionality to ignore changes to the casing](https://www.terraform.io/language/meta-arguments/lifecycle#ignore_changess) until this is fixed in the AKS API.
     vm_size = (Required) The SKU which should be used for the Virtual Machines used in this Node Pool. Changing this forces a new resource to be created.