Can't use aws-node without VPC Resource Controller? #2584

Closed
mattburgess opened this issue Sep 26, 2023 · 12 comments

@mattburgess

What happened:

I tried upgrading our v1.12.6 aws-node daemonset to v1.15.0 but the pods fail to start up. They log the following:

{"level":"info","ts":"2023-09-26T14:17:02.870Z","caller":"ipamd/ipamd.go:550","msg":"Get Node Info for: ip-10-50-97-227.eu-west-1.compute.internal"}
{"level":"error","ts":"2023-09-26T14:17:02.974Z","caller":"ipamd/ipamd.go:423","msg":"Failed to add feature custom networking into CNINode%!(EXTRA *fmt.wrapError=failed to get API group resources: unable to retrieve the complete list of server APIs: vpcresources.k8s.aws/v1alpha1: the server could not find the requested resource)"}
{"level":"error","ts":"2023-09-26T14:17:02.974Z","caller":"aws-k8s-agent/main.go:32","msg":"Initialization failure: failed to get API group resources: unable to retrieve the complete list of server APIs: vpcresources.k8s.aws/v1alpha1: the server could not find the requested resource"}

That seems to be due to #2503. I looked through the various env vars but couldn't see anything there, or in the code, that makes this feature optional. A hard dependency on a controller that appears to be designed specifically for EKS would mean we can't upgrade to this release on a cluster that is hosted in AWS but not managed by EKS. Have I understood things correctly, or have I missed something in the docs?

Environment:

  • Kubernetes version (use kubectl version): 1.24.17
  • CNI Version: 1.15.0
  • OS (e.g: cat /etc/os-release): Ubuntu 20.04
  • Kernel (e.g. uname -a): 5.19.0-1029-aws
@jdn5126
Contributor

jdn5126 commented Sep 26, 2023

@mattburgess how did you upgrade from v1.12.6 to v1.15.0? Did you install the full manifest or helm chart?

The VPC CNI has always had a dependency on the VPC Resource Controller for certain features, and a new CRD was introduced in v1.15.0. It sounds like that CRD is not installed in your cluster.

@mattburgess
Author

As per the release note at https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.15.0, I applied the full manifest from https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.15.0/config/master/aws-k8s-cni.yaml. Well, full disclosure: I applied all of that except for the aws-network-policy-agent container. We have enable-network-policy-controller: "false" set in the ConfigMap, so I didn't think it was necessary to run it. I can see the policyendpoints.networking.k8s.aws CRD in our cluster but can't see anything related to vpcresources.k8s.aws.

@mattburgess
Author

Just for clarity, these are the 2 cni-related CRDs I have installed:

$ kubectl sandbox get crd | grep aws
eniconfigs.crd.k8s.amazonaws.com                     2023-09-25T01:24:32Z
policyendpoints.networking.k8s.aws                   2023-09-26T13:55:33Z

@jdn5126
Contributor

jdn5126 commented Sep 26, 2023

Got it. It looks like the VPC CNI does have a hard dependency on the CNINode CRD that the VPC Resource Controller installs (https://github.com/aws/amazon-vpc-resource-controller-k8s/blob/master/config/crd/bases/vpcresources.k8s.aws_cninodes.yaml), because its Kubernetes client needs to load the schema: https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/k8sapi/k8sutils.go#L115

In EKS, the VPC Resource Controller installs this CRD, which is consistent with what you are seeing on a non-EKS cluster. You could argue that the VPC CNI should also try to install this CRD itself; otherwise there is a hard dependency on the controller being present.

@jdn5126
Contributor

jdn5126 commented Sep 26, 2023

In the meantime, you can manually install the CRD to avoid this issue, as you do not depend on any VPC Resource Controller features.
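
A minimal sketch of that manual install, assuming kubectl access with permission to create CustomResourceDefinitions. The raw-file URL is an assumption derived from the GitHub link to the CNINode CRD quoted earlier in this thread, and the helper name is hypothetical:

```shell
# Assumption: raw-file form of the CRD link quoted earlier in this thread.
# Pin a tag or commit instead of master for a repeatable install.
CNINODE_CRD_URL="https://raw.githubusercontent.com/aws/amazon-vpc-resource-controller-k8s/master/config/crd/bases/vpcresources.k8s.aws_cninodes.yaml"

# Hypothetical helper: apply the CRD, then confirm the API server knows it.
install_cninode_crd() {
  kubectl apply -f "$CNINODE_CRD_URL" &&
    kubectl get crd cninodes.vpcresources.k8s.aws
}
```

Running `install_cninode_crd` only registers the CNINode type; it does not create any CNINode objects.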

@jdn5126
Contributor

jdn5126 commented Sep 26, 2023

@mattburgess I am discussing with the team internally how we should handle this, as we definitely need to support VPC CNI running in Kubernetes without a dependency on the EKS control plane.

@mattburgess
Author

Thanks for the super quick turnaround on this @jdn5126. Does the following suggest I still might need the controller in place though? This is after I've installed the CNINode CRD as you previously suggested:

{"level":"info","ts":"2023-09-27T08:39:32.365Z","caller":"ipamd/ipamd.go:550","msg":"Get Node Info for: ip-10-50-96-39.eu-west-1.compute.internal"}
E0927 08:39:32.471903      10 reflector.go:148] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: Failed to watch *v1alpha1.CNINode: unknown (get cninodes.vpcresources.k8s.aws)
{"level":"error","ts":"2023-09-27T08:39:32.570Z","caller":"ipamd/ipamd.go:423","msg":"Failed to add feature custom networking into CNINode%!(EXTRA *errors.StatusError=CNINode.vpcresources.k8s.aws \"ip-10-50-96-39.eu-west-1.compute.internal\" not found)"}
{"level":"error","ts":"2023-09-27T08:39:32.570Z","caller":"aws-k8s-agent/main.go:32","msg":"Initialization failure: CNINode.vpcresources.k8s.aws \"ip-10-50-96-39.eu-west-1.compute.internal\" not found"}

@jdn5126
Contributor

jdn5126 commented Sep 27, 2023

Ah sorry @mattburgess, I should have looked more closely at the error. You have custom networking configured, so you are failing at https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/ipamd/ipamd.go#L564. IPAMD is trying to patch the CNINode resource to let the controller know that custom networking is enabled, but the resource does not exist because the controller never created it.
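
To confirm this from the cluster side, here is a minimal sketch that checks whether a CNINode object exists for a given node. It assumes kubectl access and that CNINode is a cluster-scoped resource; the helper name is hypothetical:

```shell
# Hypothetical helper: succeeds only if a CNINode object exists for the node.
# Assumption: CNINode is cluster-scoped, so no namespace flag is passed.
cninode_exists() {
  node_name="$1"
  kubectl get cninodes.vpcresources.k8s.aws "$node_name" >/dev/null 2>&1
}
```

For example, `cninode_exists ip-10-50-96-39.eu-west-1.compute.internal || echo "no CNINode for this node"` would report the missing object on a cluster where nothing has created it.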

The intent here is for VPC CNI to be able to run without the controller, but for advanced features to only be possible with the controller. So the issue you are seeing is a bug that we need a code change for. We only need to let the controller know that custom networking is enabled when Security Groups for Pods (a controller-only feature) is enabled. I can get this fix in v1.15.1, which is targeting mid-October.
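
The intended condition can be sketched as a small predicate. This is an illustration only, not the actual IPAMD code, and the function name is hypothetical. As far as I know, the two inputs correspond to the aws-node settings AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG and ENABLE_POD_ENI, but treat that mapping as an assumption:

```shell
# Hypothetical predicate, not the actual IPAMD code.
# $1: custom networking enabled ("true" or "false")
# $2: Security Groups for Pods enabled ("true" or "false")
# Succeeds only when both are enabled, i.e. when the controller
# needs to be told about custom networking via the CNINode resource.
needs_cninode_patch() {
  [ "$1" = "true" ] && [ "$2" = "true" ]
}
```

With custom networking on and Security Groups for Pods off, `needs_cninode_patch true false` fails, so the patch would be skipped and a missing CNINode resource would no longer matter.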

@jdn5126 jdn5126 self-assigned this Sep 27, 2023
@mattburgess
Author

> Ah sorry @mattburgess, I should have looked more closely at the error.

Although they look similar, it's definitely a different error without the CRD than with it in place.

> We only need to let the controller know that custom networking is enabled when Security Groups for Pods (a controller-only feature) is enabled. I can get this fix in v1.15.1, which is targeting mid-October.

That's awesome! Thanks again.

@jdn5126
Contributor

jdn5126 commented Sep 28, 2023

Yep, the error is different, but it resolves to the same root cause: running a Kubernetes operation (GET, PATCH, WATCH) against a resource that does not exist or whose CRD is not installed.

#2591 resolves this by making sure that we never patch the CNINode resource unless a controller feature is enabled. #2570 makes sure that we never issue a WATCH for CNINode at all.
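
The two failures in this thread can be told apart from the log message alone. A small sketch, with a hypothetical function name and labels, that matches only the message texts quoted earlier in this issue:

```shell
# Hypothetical helper: map an aws-node initialization error message to the
# missing piece. Patterns are taken from the log lines quoted in this issue.
classify_cni_error() {
  case "$1" in
    *"the server could not find the requested resource"*)
      echo "crd-missing" ;;            # CNINode CRD is not installed
    *"CNINode.vpcresources.k8s.aws"*"not found"*)
      echo "cninode-object-missing" ;; # CRD installed, no object for the node
    *)
      echo "unknown" ;;
  esac
}
```

Both labels point at the same root cause described above; they only indicate which piece is absent.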

@jdn5126
Contributor

jdn5126 commented Oct 13, 2023

Closing now that v1.15.1 is released on GitHub.

@jdn5126 jdn5126 closed this as completed Oct 13, 2023