CrashLoopBackOff due to EC2MetadataError: failed to make EC2Metadata request, status code: 401 #455
Comments
Hello. What is the controller version you are using? Could you provide a more detailed error log message?
@SarumObjects What kind of controller version and AWS integration do you use?
@szuecs v0.12 (I downloaded :latest) and created the cluster with Kops (1.22.22). I've built several similar clusters in the last 24 months (we're running one as prod) and I have burned and built a QA cluster (same script) some 4 times. The cluster validates successfully, but when I install kube-ingress-aws-controller/skipper (same manifest as our prod cluster, different name) I get this error: "EC2MetadataError: failed to make EC2Metadata request"
@SarumObjects I think just pasting the logs here up until the crash would be great! Does latest mean v0.12.12? It would also be interesting if you could paste the relevant output. We don't really have much knowledge about kops. Is the version you are referring to the same as the Kubernetes version?
@szuecs the 'latest' still restarts. kops is version 1.22.2.
@SarumObjects Could you get the ingress controller logs as well?
@AlexanderYastrebov this is the command and the complete log:
Could you try to run with the debug option enabled?
there's no --debug at the command line. |
See kube-ingress-aws-controller/controller.go line 191 at commit 5c66137.
kubectl -n kube-system logs -f pod/kube-ingress-aws-controller-65775b947-dx9tl --ignore-errors=false
time="2021-12-03T14:50:47Z" level=info msg="starting /kube-ingress-aws-controller v0.12.14"
2021/12/03 14:52:50 DEBUG: Response ec2metadata/GetMetadata Details:
2021/12/03 14:52:50 DEBUG: Validate Response ec2metadata/GetMetadata failed, attempt 0/3, error EC2MetadataError: failed to make EC2Metadata request
time="2021-12-03T14:52:50Z" level=fatal msg="EC2MetadataError: failed to make EC2Metadata request\n\n\tstatus code: 401, request id: "
This log shows the problem: 169.254.169.254 is the AWS instance metadata service. Instead of returning the data required to access the AWS APIs, it rejected the request (status code 401). What Kubernetes IAM integration do you use?
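For context, a quick way to check whether the metadata service is reachable with IMDSv2 is to request a session token and then read a metadata path by hand. This is only a debugging sketch, not something the controller does; run it from a shell on the node or from a pod scheduled on the affected node:

```sh
# Debugging sketch: verify IMDSv2 reachability from the node or from a pod.
# An empty token, or a 401 on the second call, typically means IMDSv2 is
# enforced and the response hop limit is too low for containers to reach IMDS.
TOKEN=$(curl -sS -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -sS -H "X-aws-ec2-metadata-token: $TOKEN" \
  "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
```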
That's helpful. I'll look into the IAM permissions.
@SarumObjects let us know what the error was so we can share it with other folks who might find this issue. After that we can close it.
Still investigating: https://kops.sigs.k8s.io/releases/1.22-notes/
In the end, I simply had to change the Nodes.instanceMetadata from httpPutResponseHopLimit: 1 to httpPutResponseHopLimit: 3, and then the metadata query can run. But I'm blocked again (failed to get ingress list).
I'm having this exact same issue, out of nowhere, on ONE out of 80 clusters... it makes no sense. Where exactly did you change that setting @SarumObjects? Did you get it to work?
@jbilliau-rcd I had to apply the change (httpPutResponseHopLimit: 3) with "kops edit cluster" rather than update it with a script (I have only 4 clusters of 3 nodes each). They continue to work, but if I upgrade the clusters I now have to terminate the nodes, which I do with a script that gives the replacement nodes time to start.
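For anyone who wants to apply the same change, a minimal sketch is below. It assumes an instance group named nodes; in recent kops releases the instanceMetadata block lives in the InstanceGroup spec, so the exact edit command and field path may differ from the kops edit cluster approach described above:

```sh
# Sketch of the hop-limit change (verify field names against your kops version).
kops edit ig nodes
#   spec:
#     instanceMetadata:
#       httpPutResponseHopLimit: 3   # the default of 1 blocks pods from reaching IMDS
# Apply and roll the nodes so the new launch template settings take effect:
kops update cluster --yes
kops rolling-update cluster --yes
```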
@SarumObjects @jbilliau-rcd can you create a docs PR for the kops guide to highlight that the Kubernetes version update can trigger this? Our current cluster setup is Kubernetes 1.21 and not kops, so I cannot test on our side whether it's kops related or not. We have been migrating from CRD v1beta1 and Ingress v1beta1 for more than half a year, and we will update to 1.22 soon.
@szuecs apologies, I don't quite understand what you are asking. You want me to put in a PR to update the docs for what exactly? That this can happen if you go to 1.22? Do we know that for sure? I have plenty of clusters running EKS 1.22 just fine with 0.14.0 of this controller, with the following argument set in the pod spec. So we are already on 1.22, already using the new v1 Ingress API, and it works on all clusters except one. Mind you, that one isn't even on 1.22! It's on 1.21, so I don't think this has anything to do with 1.22; it looks more OIDC/IAM related.
@jbilliau-rcd oh interesting, so we need to investigate more. Right now we have to rely on you, the contributors.
So I ended up running this command:
With the instance-id being the EC2 node that the Zalando pod was running on, and that fixed it! How this (so far) has only happened on one node is still puzzling to me, but that is the issue. It seems like the fix would be for the pod to never contact (or at least have a configuration option to never contact) the EC2 instance metadata service, and instead only ever use OIDC to assume its own IAM role rather than the role of the worker node. We give our Zalando ingress its own role, so the fact that it broke because it couldn't call the worker node's metadata URL (presumably to use its own role if it needed to) kinda sucked :(
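The exact command isn't quoted above; an AWS CLI call along these lines would match the fix described (the instance ID and hop-limit value here are placeholders, not taken from the thread):

```sh
# Hypothetical reconstruction of the per-node fix: raise the IMDS hop limit so
# containers on that instance can reach 169.254.169.254 again.
aws ec2 modify-instance-metadata-options \
  --instance-id i-0123456789abcdef0 \
  --http-put-response-hop-limit 3 \
  --http-endpoint enabled
```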
@jbilliau-rcd With this it should be possible to run the controller without needing to contact the EC2 instance metadata service: #376
Ah interesting... looks like that was merged two years ago!? Has this hidden argument always been available? I don't see it in any documentation anywhere.
Yeah, we should get this documented so it's clearer.
I've followed the guidelines here: https://github.com/zalando-incubator/kube-ingress-aws-controller/blob/master/deploy/kops.md but kube-ingress-aws-controller restarts every 2-10 minutes.
When I follow the log of the pod I get this error "EC2MetadataError: failed to make EC2Metadata request"
I have deleted and rebuilt the cluster several times and cannot create the load balancer or target groups, although I have in the past. One of our clusters is still running, so I have compared it in detail and found no differences except the name.
We are blocked. This is our development environment. The instances have public & private IPs and the VPCs & SGs have been generated correctly.
Where should I look now please?
John