Add support for tagging EC2 instances/nodes (and other AWS resources) which are a part of the cluster #306

Closed
saurav-agarwalla opened this issue Feb 17, 2022 · 26 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@saurav-agarwalla
Contributor

What would you like to be added:

Support for tagging EC2 instances and other AWS resources that are created or used as part of the cluster. We can make this optional by providing a flag that lets AWS Cloud Provider customers enable or disable the feature.

Why is this needed:

A lot of EKS customers today (and, I am sure, other customers running Kubernetes on AWS) want to tag all the resources associated with a cluster for better management as well as cost allocation. There's no easy way to do that today across all use cases because customers create these resources through different means (CloudFormation, Terraform, controllers, etc.) and therefore have to come up with their own solutions. We want to add support for this in the AWS Cloud Provider itself so that all customers get it by default without having to tag these resources separately.

We can start off by adding this support for EC2 instances and later extend it to other resources like load balancers, EBS volumes, etc.
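
For illustration, a minimal sketch of the kind of EC2 tagging call such a controller could make, assuming aws-sdk-go; the instance ID and the cost-allocation tag key below are made up:

```go
// Minimal sketch only: apply a configurable set of tags to cluster instances
// via the EC2 CreateTags API. Instance IDs and tag keys are illustrative.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// tagInstances applies the given key/value pairs to the listed instances.
func tagInstances(svc *ec2.EC2, instanceIDs []string, tags map[string]string) error {
	ec2Tags := make([]*ec2.Tag, 0, len(tags))
	for k, v := range tags {
		ec2Tags = append(ec2Tags, &ec2.Tag{Key: aws.String(k), Value: aws.String(v)})
	}
	resources := make([]*string, 0, len(instanceIDs))
	for _, id := range instanceIDs {
		resources = append(resources, aws.String(id))
	}
	_, err := svc.CreateTags(&ec2.CreateTagsInput{Resources: resources, Tags: ec2Tags})
	return err
}

func main() {
	svc := ec2.New(session.Must(session.NewSession()))
	// Hypothetical cost-allocation tag; in the proposal this set would come
	// from a flag or config rather than being hard-coded.
	err := tagInstances(svc, []string{"i-0123456789abcdef0"}, map[string]string{
		"example.com/cluster-cost-center": "my-cluster",
	})
	if err != nil {
		fmt.Println("tagging failed:", err)
	}
}
```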

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 17, 2022
@justinsb
Member

I think this is a good idea, particularly if it is optional. We do have a cluster-tag for resources that KCM/CCM creates (volumes, load balancers).

One thing we should consider is how nodes join the cluster; we want to avoid a circular dependency where a node can't join without a tag but the tag is only added by this mechanism. We also don't want to subvert the existing security mechanisms, but per discussion in the provider-aws meeting we think checking the IAM role is the better practice anyway.

@nckturner
Contributor

I assume the concern is about the case where we use this mechanism to apply the kubernetes.io/cluster/clustername=owned tag, which the CCM uses to filter EC2 instances. How would the CCM tag the nodes, or even know which nodes to tag, if they are missing the tag it normally uses to find EC2 instances in the first place? If we were to replace that tag, or intend for it to be set by this component, we would need to think through that workflow for both the EKS and non-EKS use cases. But if we leave that tag alone (it is currently set by some out-of-band process, e.g. defined in CloudFormation; the kubelet just expects it to exist), then we don't have to consider that workflow?
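
For reference, a small sketch (assuming aws-sdk-go) of the tag-based instance filtering described above, which is why an untagged node would be invisible to a controller relying on the same filter; the cluster name is illustrative:

```go
// Sketch of discovering a cluster's instances by the kubernetes.io/cluster/<name>
// tag key, the filtering behavior described in the comment above.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	svc := ec2.New(session.Must(session.NewSession()))
	out, err := svc.DescribeInstances(&ec2.DescribeInstancesInput{
		Filters: []*ec2.Filter{{
			Name:   aws.String("tag-key"),
			Values: []*string{aws.String("kubernetes.io/cluster/my-cluster")},
		}},
	})
	if err != nil {
		panic(err)
	}
	for _, r := range out.Reservations {
		for _, i := range r.Instances {
			fmt.Println(aws.StringValue(i.InstanceId))
		}
	}
}
```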

@olemarkus
Member

My $0.02 would be:

  • CCM doesn't "own" instances or their tags. Installers expect to, and one can easily get into conflicts here. At least for kOps this would be problematic. CCM doesn't create or otherwise change instances (or launch templates), so I think the boundaries are clear. I am fairly sure Karpenter users, at least, will run into conflicts here as well.
  • For resources like volumes and load balancers it makes a lot of sense for CCM to add common tags.
  • EBS volume driver and AWS Load Balancer Controller both have CLI flags where one can specify common tags.
  • The volume controller in CCM should perhaps go away soon, but we are still some way from the LBC fully replacing the service controller.

So my suggestion would be to add a config option to cloud config for specifying these tags, and then apply them to ELBs and Volumes.

@nckturner
Contributor

nckturner commented Feb 19, 2022

CCM doesn't "own" instances or their tags. Installers expect to, and one can easily get into conflicts here. At least for kOps this would be problematic. CCM doesn't create or otherwise change instances (or launch templates), so I think the boundaries are clear. I am fairly sure Karpenter users, at least, will run into conflicts here as well.

Yeah, I think the primary use case we are trying to solve is users wanting to tag all Kubernetes-adjacent AWS resources for cost allocation. We can default to some kind of namespace prefix for the tag names to avoid conflicts. I don't think we touch the existing cluster tag; this would only be an additional tag or set of tags that users define, and it would live in a separate controller that could be entirely disabled if a user wanted.

@nckturner
Contributor

cc @ellistarn thoughts on the relationship with Karpenter?

@olemarkus
Member

How far do you see "all adjacent" going? Would it tag everything that already has the kubernetes.io/cluster/clustername=owned tag? kOps, for example, also creates the node/control plane security groups, the control plane load balancer, etc. There are quite a lot of other resources CCM does not provision. What about resources that CSI drivers may provision?

Just trying to get a feeling for what the scope would be, as "resources CCM provisions + EC2 instances" seems unexpected to me.

Also, adding custom tags to the resources CCM does provision should be fairly easy without any new controller. In fact, it probably shouldn't be a new controller, as users may want these resources tagged upon creation.

@olemarkus
Member

If we enforce a given prefix, it resolves all concerns, I think. But it may also make the feature less useful. I think many have organisation-wide cost tags and cannot necessarily work with an enforced prefix.

@nckturner
Contributor

How far do you see "all adjacent" going?

Very good question. My statement may have been hyperbolic; I think the initial proposal is to only tag EC2 instances, with a solution extensible to at least load balancers and volumes. Not sure what else users might be interested in tagging. When it comes to resources the CCM does not provision (I'm primarily thinking of CSI and the AWS LBC here), I think it makes sense to allow the CCM to tag those resources.

Also, adding custom tags to the resources CCM does provision should be fairly easy without any new controller. In fact, it probably shouldn't be a new controller, as users may want these resources tagged upon creation.

The other option, as you point out, is to build tag-on-create into all of these components and ideally have them all draw from the same configuration. However, I'm not sure how important tag-on-create is for cost allocation; it seems rather unimportant as long as the controller doesn't take too long to tag. It may matter more for other use cases.

@olemarkus
Member

I am fairly sure LBC reconciles tags. So there will definitely be conflicts there unless they also implement support for ignoring the CCM prefix. Perhaps the controller should take in a list of AWS resources to tag.

I know some kops users have organisation-wide tag policies that enforce tag on create, so it was those I had in mind. These may or may not be cost tags.

Come to think of it, the EBS driver does not support custom tags, and we (my employer) use https://github.com/mtougeron/k8s-aws-ebs-tagger to make volumes work with AWS Backup. See kubernetes-sigs/aws-ebs-csi-driver#180

@olemarkus
Member

olemarkus commented Feb 19, 2022

Ah, never mind the last paragraph. The challenge there was with per-volume tags, not global tags. But the reconciliation issue is even more complex with per-resource tags in the mix as well.

EBS CSI driver only ensures volumes have the given global tags, but does not remove any tags, I think.

@ellistarn

ellistarn commented Feb 19, 2022

I assume the concern is about the case where we use this mechanism to apply the kubernetes.io/cluster/clustername=owned tag, which the CCM uses to filter EC2 instances.

In the limit, this tag should be optional. It's pretty restrictive that nodes joining the cluster must be in the same account, and be tagged correctly.

cc @ellistarn thoughts on the relationship with Karpenter?

Karpenter currently has a tagging mechanism: https://github.com/aws/karpenter/blob/main/pkg/cloudprovider/aws/apis/v1alpha1/provider.go#L63. We don't reconcile tags, and do tag-on-create.

We do as much as we can to avoid knowing ClusterName since it creates so much pain for us. We never describe instances via tags -- instead, we know instance IDs by looking at node.spec.providerID.

On a separate note, if CCM is doing this tagging, it will need to know cluster name (and potentially things like cluster endpoint). Is it possible that this could be exposed via some API (configmap?) that controllers like karpenter could depend on?
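
A minimal sketch of deriving an instance ID from node.spec.providerID as described above; the providerID shown uses the usual aws:///<zone>/<instance-id> form, and the helper name is made up:

```go
// Sketch of mapping a node's providerID to its EC2 instance ID, the lookup
// approach described in the comment above. Error handling is simplified.
package main

import (
	"fmt"
	"strings"
)

func instanceIDFromProviderID(providerID string) (string, error) {
	// Strip the "aws://" scheme prefix, then take the last path segment.
	trimmed := strings.TrimPrefix(providerID, "aws://")
	parts := strings.Split(strings.Trim(trimmed, "/"), "/")
	last := parts[len(parts)-1]
	if !strings.HasPrefix(last, "i-") {
		return "", fmt.Errorf("unexpected providerID format: %q", providerID)
	}
	return last, nil
}

func main() {
	id, err := instanceIDFromProviderID("aws:///us-west-2a/i-0123456789abcdef0")
	fmt.Println(id, err) // i-0123456789abcdef0 <nil>
}
```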

@olemarkus
Member

On a separate note, if CCM is doing this tagging, it will need to know cluster name (and potentially things like cluster endpoint). Is it possible that this could be exposed via some API (configmap?) that controllers like karpenter could depend on?

This functionality is also mentioned here:
kubernetes-sigs/aws-load-balancer-controller#2430 (comment)

But I think this is a separate issue.

@saurav-agarwalla
Contributor Author

Thank you all for the feedback. I see that @nckturner has responded to most of it, but I just want to point out that we want to keep this separate from the current tags that CCM applies. The idea is to run this as a separate controller with a separate set of tags (likely configurable, possibly passed as a CLI flag) that applies to all AWS resources associated with a cluster, although we're only going to start with EC2 instances and add support for others later.

In the future, we may want to extend this controller to also handle the tags that CCM adds today, in which case we'll have to address some of the concerns above. For now we're trying to develop this incrementally and absorb feedback like the above as we go (especially where it relates to community use cases, since we don't have an exhaustive list of them at the moment).

Based on some of the comments above, there may be resource types we don't want to tag at all, so it may make sense to make the set of tagged resource types configurable as well, similar to the tags themselves.

On a separate note, if CCM is doing this tagging, it will need to know cluster name (and potentially things like cluster endpoint). Is it possible that this could be exposed via some API (configmap?) that controllers like karpenter could depend on?

Cluster name is definitely something that we want to use at least when it comes to these tags for EKS since a lot of our customers want to know the cost allocation per cluster. We can evaluate how we can make it available for use by other components.

@olemarkus
Member

What is the reason for focusing on EC2 first? Can't this more easily be handled by launch templates?

Also, do you see this controller only trying to add tags, and then only try once?

Just trying to understand how this can work alongside other controllers that own tags (and may try to remove tags they don't know about).

We (as in my employer) are also after custom cost tags, but EC2 instances can already be tagged. It's the resources that CCM itself creates where we cannot define global tags.

@olemarkus
Member

With my kOps hat on, I think we can easily support this as we'll send the global tags to both CCM and all addons. So any other controller that may try to reconcile would try to reconcile the same set of tags.

@saurav-agarwalla
Contributor Author

What is the reason for focusing on EC2 first? Can't this more easily be handled by launch templates?

It can, but tagging EC2 instances uniformly is one of the biggest pain points for our customers today: not all of them use launch templates; some use managed nodes, some use self-managed nodes, and some use their own AMIs. We're trying to make sure that all of them can benefit from this.

Also, do you see this controller only trying to add tags, and then only try once?

We may want to reconcile this on every node event, but nothing more frequent at this time (we'd expect customers not to remove these tags anyway). If this becomes a major problem for our customers, we can add support for periodic reconciliation later.

Just trying to understand how this can work alongside other controllers that own tags (and may try to remove tags they don't know about).

We (as in my employer) are also after custom cost tags, but EC2 instances can already be tagged. It's the resources that CCM itself creates where we cannot define global tags.

Can you please help me understand the scenarios where controllers would remove tags they don't know about? So far we've been assuming that other controllers would play nice with tags they don't own (and I believe that is the right thing to do as well).
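
A rough sketch (assuming client-go) of the node-event-driven reconciliation described above; the handler body is a placeholder for whatever would actually apply the EC2 tags:

```go
// Sketch only: watch Node adds and hand each node's providerID to the
// tagging logic (e.g. an EC2 CreateTags call, as in the earlier sketch).
package main

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			node := obj.(*v1.Node)
			// In a real controller, the instance ID would be derived from
			// node.Spec.ProviderID and passed to an EC2 CreateTags call.
			fmt.Printf("node %s joined, providerID=%s\n", node.Name, node.Spec.ProviderID)
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	<-stop
}
```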

@ellistarn

ellistarn commented Feb 24, 2022

Beyond cost allocation above, do we know where the node tagging requirement comes from? Why is it necessary for an instance to be tagged to a cluster for it to join? Shouldn't the fact that it's configured to talk to the cluster be enough? When CCM checks existence, can't it do so using the providerID?

@olemarkus
Member

@saurav-agarwalla AWS load balancer controller would be one example:
https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/df6d2da9af7af342d9968165119d84d4352fc51d/pkg/deploy/ec2/tagging_manager.go

As I understand it, it will remove from ELBs all tags that it doesn't know about and hasn't explicitly been told to ignore.

I see the challenge here a little bit differently. A controller shouldn't tag AWS resources it doesn't own (unless told to).

If you don't implement reconciliation, things are simpler. But I can easily imagine you quickly get "hey, I changed the cost tag, but the old tag still exists on the resources".

As for the various mechanisms for managing instance groups you mentioned above: don't they all offer ways to manage instance tags?
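
Not the AWS Load Balancer Controller's implementation, just a sketch of the reconciliation behavior being discussed here: tags that are neither desired nor covered by an ignore prefix get removed. All tag keys and prefixes below are illustrative:

```go
// Sketch of the conflict: a tag-reconciling controller deletes keys that are
// not in its desired set and not covered by an ignore prefix.
package main

import (
	"fmt"
	"strings"
)

// tagsToRemove returns the keys present on the resource that the controller
// would delete: not desired and not matching any ignored prefix.
func tagsToRemove(actual, desired map[string]string, ignorePrefixes []string) []string {
	var remove []string
	for k := range actual {
		if _, ok := desired[k]; ok {
			continue
		}
		ignored := false
		for _, p := range ignorePrefixes {
			if strings.HasPrefix(k, p) {
				ignored = true
				break
			}
		}
		if !ignored {
			remove = append(remove, k)
		}
	}
	return remove
}

func main() {
	actual := map[string]string{
		"elbv2.k8s.aws/cluster":   "my-cluster",
		"example.com/cost-center": "platform", // added out-of-band, e.g. by a CCM tagging controller
	}
	desired := map[string]string{"elbv2.k8s.aws/cluster": "my-cluster"}
	// Without an ignore prefix, the externally added tag would be stripped.
	fmt.Println(tagsToRemove(actual, desired, nil))
	// With one, it survives reconciliation.
	fmt.Println(tagsToRemove(actual, desired, []string{"example.com/"}))
}
```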

@saurav-agarwalla
Contributor Author

Beyond cost allocation above, do we know where the node tagging requirement comes from? Why is it necessary for an instance to be tagged to a cluster for it to join? Shouldn't the fact that it's configured to talk to the cluster be enough? When CCM checks existence, can't it do so using the providerID?

Cost allocation is the biggest driver for this change, but I am not 100% sure about the tagging requirement for a node to join the cluster, or whether there was a particular reason for it to be that way. @nckturner might know more about this.

@saurav-agarwalla
Contributor Author

@saurav-agarwalla AWS load balancer controller would be one example: https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/df6d2da9af7af342d9968165119d84d4352fc51d/pkg/deploy/ec2/tagging_manager.go

As I understand it, it will remove from ELBs all tags that it doesn't know about and hasn't explicitly been told to ignore.

This is an interesting case. I didn't really expect a controller to remove tags it doesn't know about but we'll probably need to solve this somehow when we get to supporting tags on LBs.

I see the challenge here a little bit differently. A controller shouldn't tag AWS resources it doesn't own (unless told to).

If you don't implement reconciliation, things are simpler. But I can easily imagine you quickly get "hey, I changed the cost tag, but the old tag still exists on the resources".

In the EKS case at least, these tags aren't meant to be configurable (at least for now). We want to tag these resources with a fixed prefix and cluster name so that all customers can rely on that tag for cost allocation.

As for the various mechanisms for managing instance groups you mentioned above: don't they all offer ways to manage instance tags?

They do, but customers today have to do this themselves and there's no way to enforce that all of them actually do it. For organizations with multiple teams managing their own clusters, that can make it very difficult to gain visibility into cost, because not all teams may use the same tag (if any).
This feature standardizes it across all EKS clusters without the customer needing to do anything. I can foresee similar value for customers using this controller for their own clusters on AWS.

@olemarkus
Member

Thanks, that clarifies what you want to achieve here. It may be a bit EKS-specific and perhaps not an entirely natural fit as a CCM controller, but I can appreciate the usefulness of having it here.

@olemarkus
Member

Would anyone object to me adding a PR for setting global tags for the service controller (I consider the volume controller deprecated)?
I would assume this would be configured in the cloud config, so we just need to avoid a conflict on these properties. Maybe something like TagControllerTags and ServiceControllerTags.
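
A rough sketch of how such cloud-config properties could be modeled; the struct, the field semantics, and the key=value,key=value encoding are all hypothetical, simply mirroring the names suggested above:

```go
// Hypothetical shape only: illustrates how the suggested properties and a
// simple tag-string parser could look; not an existing cloud-config schema.
package config

import "strings"

// TaggingOptions could sit alongside the existing cloud-config options.
type TaggingOptions struct {
	// Tags the proposed tag controller would apply (e.g. to EC2 instances).
	TagControllerTags string
	// Tags the service controller would apply to the ELBs it creates.
	ServiceControllerTags string
}

// parseTags turns "team=platform,env=prod" into a map of tag keys to values.
func parseTags(s string) map[string]string {
	tags := map[string]string{}
	for _, pair := range strings.Split(s, ",") {
		if kv := strings.SplitN(pair, "=", 2); len(kv) == 2 {
			tags[strings.TrimSpace(kv[0])] = strings.TrimSpace(kv[1])
		}
	}
	return tags
}
```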

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 8, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 8, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
