Windows node not joining the eks cluster #195

Closed
guru1602 opened this issue Aug 29, 2024 · 2 comments · Fixed by #200
Labels
bug 🐛 An issue with the system

Comments

guru1602 commented Aug 29, 2024

Describe the Bug

I am using the config below to create a Windows node group with the latest version of the module. The node gets created but fails to join the cluster.

module "worker_label_green" {
  source = "cloudposse/label/null"

  namespace  = var.namespace
  name       = var.name
  stage      = var.stage
  delimiter  = var.delimiter
  attributes = var.attributes
  tags       = merge(var.tags, {
    "kubernetes.io/cluster/${var.cluster_name}" = "owned"
  })
}

module "eks_web_node_group_green" {
  source  = "cloudposse/eks-node-group/aws"
  version = "3.1.0"

  enabled = var.green_enabled
  context = module.worker_label_green.context

  instance_types     = var.instance_types
  subnet_ids         = local.worker_subnet_ids
  min_size           = var.min_size
  max_size           = var.max_size
  desired_size       = var.desired_size
  cluster_name       = data.terraform_remote_state.eks_cluster.outputs.eks_cluster_id
  kubernetes_version = var.kubernetes_version == null || var.kubernetes_version == "" ? [data.terraform_remote_state.eks_cluster.outputs.eks_cluster_version] : [var.kubernetes_version]
  kubernetes_labels  = var.labels

  ami_type = var.ami_type

  before_cluster_joining_userdata = [
    data.template_file.pre_eks_worker_nt.rendered
  ]
  after_cluster_joining_userdata = [
    data.template_file.post_eks_worker_nt.rendered
  ]
  kubernetes_taints = [{
    key    = "OS"
    value  = "Windows"
    effect = "NO_SCHEDULE"
  }]

  update_config = [{ max_unavailable = var.desired_size }]

  capacity_type = var.capacity_type

  detailed_monitoring_enabled = true

  node_role_arn                = [data.aws_iam_role.worker_role.arn]
  node_role_cni_policy_enabled = false # We use the Service Account as per best practice

  associated_security_group_ids = [
    data.terraform_remote_state.network.outputs.rancher_sg,
    data.terraform_remote_state.network.outputs.ops_ssh,
    data.terraform_remote_state.eks_cluster.outputs.security_group_id
  ]

  # Enable the Kubernetes cluster auto-scaler to find the auto-scaling group
  cluster_autoscaler_enabled = var.cluster_autoscaler_enabled

  create_before_destroy = true

  node_role_policy_arns = ["arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"]

  block_device_mappings = [
    {
      "delete_on_termination" : true,
      "device_name" : "/dev/xvda",
      "encrypted" : true,
      "volume_size" : 90,
      "volume_type" : "gp3"
    }
  ]

  node_group_terraform_timeouts = [{
    create = "40m"
    update = "40m"
    delete = "20m"
  }]

  # Valid types are "instance", "volume", "elastic-gpu", "spot-instances-request", "network-interface".
  resources_to_tag = var.capacity_type == "SPOT" ? ["instance", "spot-instances-request", "volume", "network-interface"] : ["instance", "volume", "network-interface"]
}

Expected Behavior

The node should join the cluster.

Steps to Reproduce

If you have an existing cluster, try creating a Windows node group in it.

Screenshots

No response

Environment

No response

Additional Context

No response

guru1602 added the bug 🐛 label Aug 29, 2024
ChrisMcKee (Contributor) commented

It's failing because the user data script contains the bootstrapper in the middle, but the script stored in the launch template contains the bootstrapper again at the end.

ChrisMcKee (Contributor) commented

The change in how Windows node AMIs are assigned has caused this.
If `ami_type` is defined and AWS supplies the AMI, the console shows it as an AMI release version:
[image: console screenshot]

The v2 module fetched the Windows AMI itself, so the AMI type was set to 'custom' and the console showed the AMI instead:
[image: console screenshot]

The first approach has the advantage that AMI updates show up in the console, but AWS automatically augments your user data by appending the bootstrapper to the end of your user script in the launch template. This doesn't show in the state when you run a plan.

It's not a huge issue to work around, but it does break the current user script; I assume it does the same for Linux too.

If you have `before_cluster_joining_userdata` and `after_cluster_joining_userdata` set and the `ami_type` is not CUSTOM, AWS will inject the EKSBootstrapScript execution at the end of the user data.
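
One way to stay clear of the injected bootstrapper, sketched here as a config fragment rather than a confirmed fix: with an AWS-supplied AMI, keep all custom steps in `before_cluster_joining_userdata` and leave `after_cluster_joining_userdata` unset, so nothing in your script is expected to run after a join that AWS only performs at the very end. The module name `eks_windows_node_group`, the `var.*` inputs, and the PowerShell line are hypothetical placeholders, and this assumes a module release that includes the fix from #200.

```hcl
# Sketch only: names and values below are illustrative assumptions, not taken from the issue.
module "eks_windows_node_group" {
  source = "cloudposse/eks-node-group/aws"
  # Assumption: pin `version` to a release that contains the fix from #200.

  cluster_name   = var.cluster_name   # hypothetical input
  subnet_ids     = var.subnet_ids     # hypothetical input
  instance_types = var.instance_types # hypothetical input
  min_size       = 1
  max_size       = 2
  desired_size   = 1

  # AWS-supplied Windows AMI (not CUSTOM): AWS appends its own bootstrap call
  # to the end of the user data stored in the launch template.
  ami_type = "WINDOWS_CORE_2022_x86_64"

  # Everything here runs before the node registers with the cluster.
  before_cluster_joining_userdata = [
    <<-EOT
    Write-Output "pre-join setup goes here"
    EOT
  ]

  # Intentionally no after_cluster_joining_userdata: with a non-custom AMI it
  # would execute before the injected bootstrapper, i.e. before registration.
}
```

If steps really must run after the node joins, the commit message below suggests the module expects a custom AMI (`ami_image_id` set) in that case, so the module rather than AWS controls where the bootstrap call sits.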

ChrisMcKee added a commit to ChrisMcKee/terraform-aws-eks-node-group that referenced this issue Sep 18, 2024
…s in unexpected behaviour

Nodes that use custom `userdata` but don't use a custom-ami are creating a launch-template with the userdata in place
but AWS is then injecting their bootstrapper at the end of the userscript.
This means that `after_cluster_joining_userdata` will execute before cluster registration.

* Split the bootstrap out of the userdata templates into separate files, add ${bootstrap_script} into files in its place
* `launch_template.tf` Add precondition check; If `after_cluster_joining_userdata` is set but `ami_image_id` isn't and the OS is AL2/WINDOWS, show error
* `userdata.tf` Add `bootstrap_script` to local.userdata_vars; load in the userdata_bootstrap* file for the OS if the OS is AL2/Windows, otherwise, use empty string.
* `variables.tf` Add further detail to `after_cluster_joining_userdata`
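
The precondition described in the `launch_template.tf` bullet could look roughly like the following. This is a minimal standalone sketch, not the code from the commit: the variables stand in for the module's inputs, the local names and the `terraform_data.userdata_guard` resource are hypothetical, and the prefix test for AL2/Windows AMI types is an assumption. It relies on `startswith` (Terraform 1.3+) and `terraform_data` (Terraform 1.4+).

```hcl
# Hypothetical stand-ins for the module's inputs; defaults are illustrative only.
variable "ami_type" {
  type    = string
  default = "WINDOWS_CORE_2022_x86_64"
}

variable "ami_image_id" {
  type    = list(string)
  default = []
}

variable "after_cluster_joining_userdata" {
  type    = list(string)
  default = []
}

locals {
  # Assumption: AL2 and Windows AMI types are the ones the check applies to.
  os_is_al2_or_windows = startswith(var.ami_type, "AL2_") || startswith(var.ami_type, "WINDOWS_")
  uses_custom_ami      = length(var.ami_image_id) > 0
  has_after_userdata   = length(var.after_cluster_joining_userdata) > 0
}

# A placeholder resource that exists only to carry the check in this sketch;
# in the module the check is described as living on the launch template.
resource "terraform_data" "userdata_guard" {
  lifecycle {
    precondition {
      # Fail when "after" user data is set but no custom AMI is supplied.
      condition     = !(local.os_is_al2_or_windows && local.has_after_userdata && !local.uses_custom_ami)
      error_message = "after_cluster_joining_userdata requires ami_image_id to be set for AL2 and Windows nodes, because AWS appends its bootstrapper to the end of the user data."
    }
  }
}
```

With the defaults above the plan passes; setting `after_cluster_joining_userdata` to a non-empty list while `ami_image_id` stays empty should make `terraform plan` stop with the error message instead of silently producing a node whose "after" script runs before registration.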
ChrisMcKee added a commit to ChrisMcKee/terraform-aws-eks-node-group that referenced this issue Sep 19, 2024
Nuru closed this as completed in #200 Sep 30, 2024