
desired capacity update does not work for node groups #835

Closed · 3 tasks
amitsehgal opened this issue Apr 16, 2020 · 46 comments

@amitsehgal

desired capacity update does not work for node groups

I'm submitting an issue where I have tried to update the min, max, and desired variables for node groups. Terraform does show min and max being changed; however, desired is not updated.

  • [x] bug report
  • feature request
  • support request - read the FAQ first!
  • kudos, thank you, warm fuzzy

What is the current behavior?

Terraform code:

node_groups = {
  eks_nodegroup = {
    desired_capacity = 2
    max_capacity     = 4
    min_capacity     = 2

    instance_type = var.instance_type
    k8s_labels = {
      Environment = "sbx"
    }
    additional_tags = {
      "k8s.io/cluster-autoscaler/${local.cluster_name}" = "owned"
      "k8s.io/cluster-autoscaler/enabled"                = "true"
    }
  }
}

Output from plan and apply:

  ~ resource "aws_eks_node_group" "workers" {
        # ...

      ~ scaling_config {
            desired_size = 1
          ~ max_size     = 3 -> 4
          ~ min_size     = 1 -> 2
        }
    }

Error: error updating EKS Node Group (ce-eks-sbx:ce-eks-sbx-eks_nodegroup-lenient-blowfish) config: InvalidParameterException: Minimum capacity 2 can't be greater than desired size 1
{
ClusterName: "test-eks-sbx",
Message_: "Minimum capacity 2 can't be greater than desired size 1",
NodegroupName: "ce-eks-sbx-eks_nodegroup-lenient-blowfish"
}
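
(To make the failure concrete: from the plan output above, the update keeps desired_size at 1 while raising min_size to 2, and the EKS API requires min_size <= desired_size <= max_size, so the call is rejected. A rough sketch of the scaling_config effectively submitted, reconstructed from the plan rather than taken from the module:)

scaling_config {
  desired_size = 1 # unchanged in the plan
  max_size     = 4 # raised from 3
  min_size     = 2 # raised from 1; rejected because min_size > desired_size
}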

I have also tried updating desired capacity through node_groups_defaults.

If this is a bug, how to reproduce? Please include a code sample if relevant.

Change the min, max, and desired capacity.

What's the expected behavior?

The new scaling configuration should take effect.

Are you able to fix this problem and submit a PR? Link here if you have already.

No

Environment details

  • Affected module version:
  • OS:
  • Terraform version:

Any other relevant info

@karlderkaefer

It was caused by #691. It works in 8.0.0.

@meysammeisam

It was caused by #691. It works in 8.0.0.

@karlderkaefer
What do you mean by 8.0.0?


@mandeburka

desired_capacity doesn't work for me either. The installed terraform-aws-eks version is v11.1.0.

@kuritonasu

kuritonasu commented May 5, 2020

I'm encountering this issue in v11.1.0 as well. I see the significance of #691; however, the present issue prevents node group resizing, as the author points out.

@max-rocket-internet
Contributor

#681
#678
#510 (comment)

🙂

@max-rocket-internet
Contributor

TL;DR: you should be using the cluster-autoscaler. If not, you need to make the change manually.

@kuritonasu

Consider the case where autoscaling is not desired but we still want to resize the node group, and we do not wish to resize manually through the console, for the usual reasons. I don't believe this scenario is uncommon.

Also, to be clear, I don't believe desired should be modified by this module by default, as this could cause confusion and undesirable consequences. I am not arguing against #691; however, there should be a way to override this behaviour.

@max-rocket-internet
Contributor

Sure, but the problem is that there is no way to have an optional lifecycle on resources, therefore we chose to support the most common option.

@dmanchikalapudi

dmanchikalapudi commented Jul 27, 2020

TL;DR: you should be using the cluster-autoscaler. If not, you need to make the change manually.

@max-rocket-internet - Did you mean configure cluster-autoscaler within this module? If so, how? I don't see it in the examples.

Also, I provisioned my node group with the following values:

      desired_capacity = 4
      max_capacity     = 10
      min_capacity     = 4

At what point does the capacity go beyond 4? I stood up a t2.micro instance and tried to scale pods beyond the t2.micro's capacity, but my cluster does not scale up to add more nodes.

PS: I am using v12.2.0 of this module.

@dmanchikalapudi

Did anyone here get the nodes to scale properly? I cannot get it to work no matter what I do.

@kuritonasu

kuritonasu commented Jul 29, 2020

@max-rocket-internet - Did you mean configure cluster-autoscaler within this module? If so, how? I don't see it in the examples.

@dmanchikalapudi cluster-autoscaler is not connected to, nor can it be configured through, this module.

At what point does the capacity go beyond 4? I stood up a t2.micro instance and tried to scale pods beyond the t2.micro's capacity, but my cluster does not scale up to add more nodes.

The desired_capacity value is ignored by the module. You have to modify it by hand through the console.

@dmanchikalapudi

Thanks for the response @kuritonasu. Doing it by hand pretty much negates the idea behind "managed" node groups. There is no point in defining the min/max node counts either. It is just an illusion of autoscaling.

My need is simple. When my ReplicaSets scale out to more pods than the existing nodes can run, I need the nodes to scale up to accommodate them (assuming there is hardware capacity underneath and it is within the max node count). How do I go about making that happen via Terraform?

@max-rocket-internet
Contributor

cluster-autoscaler is not connected to, nor can it be configured through, this module.

Correct ✅

Doing it by hand pretty much negates the idea behind "managed" node groups.

Perhaps this doc might help you to see what is "managed" and what is not, specifically this image:

[image: managed node group ("mng") diagram]

There is no point in defining the min/max node counts either. It is just an illusion of autoscaling.

I wouldn't say it's an illusion; it's just not a "turn-key" thing. ASGs have been around for years and work very well when configured correctly 🙂

My need is simple. When my ReplicaSets scale out to more pods than the existing nodes can run, I need the nodes to scale up to accommodate them (assuming there is hardware capacity underneath and it is within the max node count). How do I go about making that happen via Terraform?

This is how typical autoscaling works in k8s, but this module is only for the AWS resources. The cluster-autoscaler runs in your cluster and is not supported by us or this module in any way; it's a completely separate thing. But there is some documentation here that might help you: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/docs/autoscaling.md
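
(Not something this module sets up for you, but for anyone following the autoscaling doc linked above, here is a minimal sketch of the IAM side. It assumes the module instance is named module.eks and exposes a worker_iam_role_name output; check your module version. The actions are the usual cluster-autoscaler set.)

# Sketch only: grant the worker role the permissions cluster-autoscaler needs.
resource "aws_iam_policy" "cluster_autoscaler" {
  name_prefix = "cluster-autoscaler"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeLaunchConfigurations",
        "autoscaling:DescribeTags",
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup"
      ]
      Resource = "*"
    }]
  })
}

resource "aws_iam_role_policy_attachment" "cluster_autoscaler" {
  # Assumes the module exposes the worker IAM role name as an output.
  role       = module.eks.worker_iam_role_name
  policy_arn = aws_iam_policy.cluster_autoscaler.arn
}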

@dploeger

dploeger commented Sep 1, 2020

@max-rocket-internet I'd say that "managed" in the broader sense, together with Terraform, also means scaling the worker nodes by setting the desired size, since I'm also managing the VPC configuration with Terraform (and everything in between, actually :) )

So IMHO, if setting the desired size is possible through the API, it SHOULD be supported by this resource.

@elebertus

elebertus commented Sep 23, 2020

The practical use case that I have for this is that if I set a managed node group to desired 3, max 6, min 3, cluster-autoscaler will respect this. There isn't a technical reason why we can't change the min_size, nor should it be dismissed as "not a feature".

Some concrete examples, since this has been a bit of a noisy thread.

Here's an example initial definition of a scaling config as passed through node_groups in the eks cluster module:

compute_1 = {
  desired_capacity = 1
  max_capacity     = 6
  min_capacity     = 1
}

Then update this nodegroup's minimum to:

compute_1 = {
  desired_capacity = 1
  max_capacity     = 6
  min_capacity     = 3
}

You'll get an error like:

Error: error updating EKS Node Group (eks_cluster_1:compute_1) config: InvalidParameterException: Minimum capacity 3 can't be greater than desired size 1
{
  RespMetadata: {
    StatusCode: 400,
    RequestID: "<requestID>"
  },
  ClusterName: "eks_cluster_1",
  Message_: "Minimum capacity 3 can't be greater than desired size 1",
  NodegroupName: "compute_1"
}

Then running a state show shows the obvious:

terraform state show module.eks_cluster_1.module.node_groups.aws_eks_node_group.workers[\"compute_1\"] | grep -A4 scaling_config
    scaling_config {
        desired_size = 1
        max_size     = 6
        min_size     = 1
    }

So this means that, from a Terraform perspective, desired_capacity (which translates into scaling_config.desired_size) is immutable. Which also means that desired_capacity can never be greater than the initial desired_capacity, and min_capacity is effectively limited by this, while you can still happily raise the max_capacity.

There are ways to work around this, such as getting the ASG id from the module and modifying it in Terraform as part of the workflow, but that's a hack at best.
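
(For completeness, a rough sketch of the ASG workaround mentioned above. The variable and output names are assumptions for illustration; in particular it assumes the module exposes the managed node group objects via a node_groups output and that the AWS CLI is available where Terraform runs.)

variable "compute_1_desired_capacity" {
  type    = number
  default = 3
}

locals {
  # The node group resource exposes its backing ASG name under resources[].autoscaling_groups[].name.
  compute_1_asg_name = module.eks_cluster_1.node_groups["compute_1"].resources[0].autoscaling_groups[0].name
}

resource "null_resource" "compute_1_desired_capacity" {
  # Re-run whenever the desired value changes.
  triggers = {
    desired = tostring(var.compute_1_desired_capacity)
  }

  # Push the desired size straight to the underlying ASG, since the
  # aws_eks_node_group resource ignores changes to desired_size.
  provisioner "local-exec" {
    command = "aws autoscaling set-desired-capacity --auto-scaling-group-name ${local.compute_1_asg_name} --desired-capacity ${var.compute_1_desired_capacity}"
  }
}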

@rsmets

rsmets commented Nov 30, 2020

One hacky workaround that I have found works: you can specify a different instance size, which forces a totally new node group to be created, which will then respect your (new, "initial") desired_capacity setting. I'm sure any other change that forces a new node group to be created would work as well.

I agree with many of the other comments in this thread; it really feels odd that desired_capacity is not actually mutable by Terraform. That said, I do not have a clear picture of what the AWS interface is like, so I'm sure it's easier said than done!

@tolajuwon

I hacked around it for now by using the value for the desired capacity in place of the minimum capacity. If that's not a problem for your design, it works.

worker_groups = [
  {
    name                 = "worker-group-1"
    key_name             = var.worker_ssh_key_name
    instance_type        = var.worker_instance_type
    asg_desired_capacity = var.worker_asg_desired_capacity
    asg_max_size         = var.worker_asg_max_size
    asg_min_size         = var.worker_asg_desired_capacity
  },
]

@stale

stale bot commented Mar 11, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 11, 2021
@shoekstra

shoekstra commented Mar 22, 2021

This issue is still relevant and needs fixing/review. We shouldn't need to create a new node group to change the desired size.

@stale

stale bot commented Apr 22, 2021

This issue has been automatically closed because it has not had recent activity since being marked as stale.

@stale stale bot closed this as completed Apr 22, 2021
@ericlake

ericlake commented May 4, 2021

This is still an issue for our team. Terraform should be able to handle this.

@bit-herder

Please reopen this. This is a major issue.

@barryib barryib reopened this May 6, 2021
@stale stale bot removed the stale label May 6, 2021
@barryib
Member

barryib commented May 6, 2021

Re-opening this to let us track this issue. But so far we don't have an ideal fix.

Maybe hashicorp/terraform#24188 is worth something.

@khatritariq

Looking forward to a resolution.

@loliveira-itp

Same, this issue affects my use case as well.

@amazingguni

It is also a problem for me.

@ayk33

ayk33 commented May 27, 2021

Ditto. We need some kind of solution for this.

@letusfly85

I fixed it by adding the following field... but is it right?

Before

  node_groups = {
    main = {
      desired_capacity = 1
      max_capacity     = 1
      min_capacity     = 1
      instance_type    = "t2.small"
      subnets          = module.vpc.private_subnets
    }
  }

After

  node_groups = {
    main = {
      // desired_capacity = 1
      desired_size     = 1
      max_capacity     = 1
      min_capacity     = 1
      instance_type    = "t2.small"
      subnets          = module.vpc.private_subnets
    }
  }

@ayk33

ayk33 commented May 28, 2021

I fixed it by adding the following field... but is it right?

Before

  node_groups = {
    main = {
      desired_capacity = 1
      max_capacity     = 1
      min_capacity     = 1
      instance_type    = "t2.small"
      subnets          = module.vpc.private_subnets
    }
  }

After

  node_groups = {
    main = {
      // desired_capacity = 1
      desired_size     = 1
      max_capacity     = 1
      min_capacity     = 1
      instance_type    = "t2.small"
      subnets          = module.vpc.private_subnets
    }
  }

I'm not sure how that would work. The lifecycle for the node_group ignores changes to desired_size:
https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/modules/node_groups/node_groups.tf#L75
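
(For readers skimming: the linked line is, roughly, a lifecycle rule like the sketch below. This is a paraphrase, not an exact excerpt; check the version of the module you have installed.)

resource "aws_eks_node_group" "workers" {
  # ... other arguments omitted

  lifecycle {
    # Changes to desired_size are never applied after creation, so values set
    # by cluster-autoscaler are not reverted by a later terraform apply.
    ignore_changes = [scaling_config[0].desired_size]
  }
}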

@letusfly85

letusfly85 commented May 29, 2021

@ayk33

Thank you for replying. Hmm, you're right; this was my misunderstanding. Thank you.
@ayk33

ayk33 commented May 29, 2021

Re-opening this to let us track this issue. But so far we don't have an ideal fix.

Maybe hashicorp/terraform#24188 is worth something.

Would it be possible to only have the desired capacity in the lifecycle rule if autoscaling is disabled?

@daern91

daern91 commented Aug 5, 2021

We're also running into problems with this one.

@fitchtech

The problem with desired_size = each.value["desired_capacity"] is that if your node group's autoscaling has scaled out, then on a subsequent run of terraform apply it will set the desired size back to whatever is in your code. The other problem is that desired_capacity is required on creation of the node group. You can then comment out desired_capacity or change the value to what's currently in the group, but that's a pain, since if you need to recreate the node group you have to put desired_capacity back in.

Additionally, most updates to existing node groups with the EKS module fail in general. They do not trigger an in-place update as they should; instead they trigger a replacement, which then fails because it says the node group name already exists. The only way to fix that is to create a new node group with a different name.

@taragurung

For me it was

lifecycle {
  ignore_changes = [scaling_config[0].desired_size]
}

which ignores changes to desired_size and causes the issue. I commented it out and it worked; uncomment it if you want the original behaviour back.

https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/eks_node_group

@stale

stale bot commented Sep 16, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Sep 16, 2021
@stale

stale bot commented Sep 23, 2021

This issue has been automatically closed because it has not had recent activity since being marked as stale.

@stale stale bot closed this as completed Sep 23, 2021
@vladimir259

Still an issue.

@vladimir259

Please reopen.

@daroga0002
Contributor

This is duplicated in #1568.

@tech4242

This is really tiresome, and while the docs have slightly improved with the FAQ change, I still feel like this module is useless for manual changes to EKS. I ended up doing it with the console:

[image: AWS console screenshot]

Because going from:

node_groups = {
    first = {
      desired_capacity = 1
      max_capacity     = 5
      min_capacity     = 1

      instance_type = "m5.large"
    }
  }

to

node_groups = {
    first = {
      desired_capacity = 2
      max_capacity     = 5
      min_capacity     = 2

      instance_type = "m5.large"
    }
  }

produces Error: error updating EKS Node Group (xxx) config: InvalidParameterException: Minimum capacity 2 can't be greater than desired size 1.

If you make the change manually and run terraform plan, you will see that the change has been made. Funnily enough, if you then look at the actual tfstate, only the min_size has changed:

"scaling_config": [
  {
    "desired_size": 1,
    "max_size": 5,
    "min_size": 2
  }
],

Proposal: can't we add something like autoscaling = false to let the module know that autoscaling is off, so desired_size is not as toothless?

In the meantime I did scale to 2 nodes and my tfstate is more or less correct (I have to swallow the wrong desired_size), but this could not be any less fun to work with.

@psyhomb

psyhomb commented Jan 30, 2022

Still an issue, anything new on this?

@daroga0002
Contributor

Still an issue, anything new on this?

This is not treated as an issue, as it is working as expected. It is a compromise we implemented in this module, and there are no plans to change it (as changing it would impact, for example, autoscaling).

@psyhomb

psyhomb commented Mar 25, 2022

We could maybe create two separate resources, resource "aws_eks_node_group" "this" {} and resource "aws_eks_node_group" "this_autoscaling" {}, and introduce a new input variable of type bool, e.g. use_autoscaling. We would then set the count for the first resource to something like count = var.create && !var.use_autoscaling ? 1 : 0 and to count = var.create && var.use_autoscaling ? 1 : 0 for the second resource. The first resource would have a simple lifecycle meta-argument without ignore_changes, and the other one (this_autoscaling) would keep ignore_changes.

Example:

variable "use_autoscaling" {
  description = "Determines whether autoscaling will be used or not"
  type        = bool
  default     = true
}
resource "aws_eks_node_group" "this" {
  count = var.create && !var.use_autoscaling ? 1 : 0

  ...

  lifecycle {
    create_before_destroy = true
  }

  ...
}
resource "aws_eks_node_group" "this_autoscaling" {
  count = var.create && var.use_autoscaling ? 1 : 0

  ...

  lifecycle {
    create_before_destroy = true
    ignore_changes = [
      scaling_config[0].desired_size,
    ]
  }

  ...
}

We should also update the outputs.tf file accordingly.

I agree it is not the most elegant solution (redundant code and everything), but it's the only one I can think of, given that conditional expressions are still not supported within the lifecycle meta-argument block.
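
(A usage sketch of how the proposal could look from the caller's side. The use_autoscaling flag does not exist in the module today, so treat this as illustrative only.)

module "eks" {
  source = "terraform-aws-modules/eks/aws"

  # ... other required arguments (cluster_name, subnets, vpc_id, etc.) omitted

  # Hypothetical flag from the proposal above: opt out of ignore_changes so
  # Terraform manages desired_size directly.
  use_autoscaling = false

  node_groups = {
    compute_1 = {
      desired_capacity = 3
      max_capacity     = 6
      min_capacity     = 3
    }
  }
}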

@Speculor

This is still a major issue over two years after it was raised

@bryantbiggs
Member

This is still a major issue over two years after it was raised

It is not a major issue; it is a design decision the module has taken. The majority of Kubernetes/EKS users utilize some form of autoscaling, and without variable support for ignore_changes in Terraform core, that's what we currently have.

@terraform-aws-modules terraform-aws-modules locked as resolved and limited conversation to collaborators Apr 28, 2022