
CrashLoopBackOff due to EC2MetadataError: failed to make EC2Metadata request, status code: 401 #455

Open
SarumObjects opened this issue Nov 30, 2021 · 25 comments

Comments

@SarumObjects

I've followed the guidelines here: https://github.com/zalando-incubator/kube-ingress-aws-controller/blob/master/deploy/kops.md but kube-ingress-aws-controller restarts every 2-10 minutes.
When I follow the log of the pod I get this error: "EC2MetadataError: failed to make EC2Metadata request".
I have deleted and rebuilt the cluster several times and cannot create the load balancer or target groups - although I have in the past. One of our clusters is still running, so I have compared the two in detail and found no differences except the name.

We are blocked. This is our development environment. The instances have public & private IPs and the VPCs & SGs have been generated correctly.

Where should I look now please?
John

@AlexanderYastrebov
Member

Hello. What is the controller version you are using? Could you provide a more detailed error log message?

@szuecs
Member

szuecs commented Dec 2, 2021

@SarumObjects What kind of controller version and AWS integration do you use?
kube2iam and all the others had issues like jtblin/kube2iam#130.
The error message looks like aws/aws-sdk-go#870, which is quite old and should be fixed by recent Kubernetes AWS IAM integrations.

@SarumObjects
Author

@szuecs v0.12 (I downloaded :latest) and created the cluster with Kops (1.22.22). I've built several similar clusters in the last 24 months (we're running one as prod) and I have burned and rebuilt a QA cluster (same script) some 4 times. The cluster validates successfully, but when I install kube-ingress-aws-controller/skipper (same manifest as our prod cluster - different name) I get this error: "EC2MetadataError: failed to make EC2Metadata request"
@AlexanderYastrebov: This is the total log! I don't know how to debug this controller. I've searched the documentation for 'debug' and 'verbose' - and I have been stuck for over a week.

@szuecs
Member

szuecs commented Dec 2, 2021

@SarumObjects I think just pasting the logs here up until the crash would be great!

Latest version meaning v0.12.12?
We just merged updates to aws-sdk; maybe you want to try v0.12.14 when it's released in a few minutes (automated process).

It would also be interesting if you could paste the output of kubectl describe pods kube-ingress-aws-controller-....

We don't really have much knowledge about Kops. Is the version you are referring to the same as the Kubernetes version?

@SarumObjects
Author

@szuecs the 'latest' still restarts.
here's the output from kubectl describe pods kube-ingress-aws-controller-..
kiac-describe.txt

kops is version 1.22.2
kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:41:42Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:42:41Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}

@AlexanderYastrebov
Member

@SarumObjects Could you get ingress controller logs as well (kubectl logs kube-ingress-aws-controller-...)?

@SarumObjects
Author

@AlexanderYastrebov this is the command and the complete log:
kubectl -n kube-system logs -f kube-ingress-aws-controller-5fbcd9fff8-vqrvg
time="2021-12-03T11:52:45Z" level=info msg="starting /kube-ingress-aws-controller v0.12.14"
time="2021-12-03T11:54:48Z" level=fatal msg="EC2MetadataError: failed to make EC2Metadata request\n\n\tstatus code: 401, request id:

@AlexanderYastrebov
Member

AlexanderYastrebov commented Dec 3, 2021

Could you try to run with the --debug option (it would print more details in the logs)?
401 suggests some kind of problem with AWS credentials.

@SarumObjects
Author

there's no --debug at the command line.

@AlexanderYastrebov
Member

there's no --debug at the command line.

~$ docker run -it --rm registry.opensource.zalan.do/teapot/kube-ingress-aws-controller:latest --help
INFO[0000] starting /kube-ingress-aws-controller v0.12.14 
usage: kube-ingress-aws-controller [<flags>]

Flags:
  --help                         Show context-sensitive help (also try --help-long and --help-man).
  --version                      Print version and exit
  --debug                        Enables debug logging level
...

kingpin.Flag("debug", "Enables debug logging level").Default("false").BoolVar(&debugFlag)

@AlexanderYastrebov changed the title from "CrashLoopBackOff" to "CrashLoopBackOff due to EC2MetadataError: failed to make EC2Metadata request, status code: 401" on Dec 3, 2021
@SarumObjects
Author

kubectl -n kube-system logs -f pod/kube-ingress-aws-controller-65775b947-dx9tl --ignore-errors=false
time="2021-12-03T14:50:47Z" level=debug msg=aws.NewAdapter
time="2021-12-03T14:50:47Z" level=debug msg=aws.ec2metadata.GetMetadata
2021/12/03 14:50:47 DEBUG: Request ec2metadata/GetToken Details:
---[ REQUEST POST-SIGN ]-----------------------------
PUT /latest/api/token HTTP/1.1
Host: 169.254.169.254
User-Agent: aws-sdk-go/1.42.16 (go1.17.1; linux; amd64)
Content-Length: 0
X-Aws-Ec2-Metadata-Token-Ttl-Seconds: 21600
Accept-Encoding: gzip


time="2021-12-03T14:50:47Z" level=info msg="starting /kube-ingress-aws-controller v0.12.14"
2021/12/03 14:52:50 DEBUG: Send Request ec2metadata/GetToken failed, attempt 0/3, error RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": read tcp 100.96.4.21:34662->169.254.169.254:80: read: connection reset by peer
2021/12/03 14:52:50 DEBUG: Request ec2metadata/GetMetadata Details:
---[ REQUEST POST-SIGN ]-----------------------------
GET /latest/meta-data/instance-id HTTP/1.1
Host: 169.254.169.254
User-Agent: aws-sdk-go/1.42.16 (go1.17.1; linux; amd64)
Accept-Encoding: gzip


2021/12/03 14:52:50 DEBUG: Response ec2metadata/GetMetadata Details:
---[ RESPONSE ]--------------------------------------
HTTP/1.1 401 Unauthorized
Connection: close
Content-Type: text/plain
Date: Fri, 03 Dec 2021 14:52:50 GMT
Server: EC2ws
Content-Length: 0


2021/12/03 14:52:50 DEBUG: Validate Response ec2metadata/GetMetadata failed, attempt 0/3, error EC2MetadataError: failed to make EC2Metadata request

status code: 401, request id: 

time="2021-12-03T14:52:50Z" level=fatal msg="EC2MetadataError: failed to make EC2Metadata request\n\n\tstatus code: 401, request id: "

@szuecs
Member

szuecs commented Dec 3, 2021

This log here:

 caused by: Put "http://169.254.169.254/latest/api/token": read tcp 100.96.4.21:34662->169.254.169.254:80: read: connection reset by peer

169.254.169.254 is the AWS metadata service. It sent a TCP RST packet instead of sending us the data required to access AWS APIs.

What Kubernetes IAM integration do you use?
To me this looks like it is not an issue with the controller, but rather with AWS or with the IAM integration that is supposed to handle the IAM part.
Maybe your EC2 nodes also don't have the right permissions to access the metadata service, which is needed to call AWS APIs via sts:AssumeRole - something all Kubernetes AWS IAM integrations require.
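A quick way to check what the node currently allows (a sketch; the instance id is a placeholder):

    # Inspect the IMDS settings of the node the pod runs on; look at
    # HttpTokens and HttpPutResponseHopLimit in the output.
    aws ec2 describe-instances \
        --instance-ids i-1234567898abcdef0 \
        --query 'Reservations[].Instances[].MetadataOptions'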

@SarumObjects
Author

That's helpful. I'll look into the IAM permissions.

@szuecs
Member

szuecs commented Dec 7, 2021

@SarumObjects let us know what the error was, so we can share it with other folks who might find this issue. After that we can close it.

@SarumObjects
Author

still investigating: https://kops.sigs.k8s.io/releases/1.22-notes/

@SarumObjects
Author

In the end, I simply had to change the nodes' instanceMetadata from httpPutResponseHopLimit: 1 to httpPutResponseHopLimit: 3, and then the metadata query can run - but I'm blocked again (failed to get ingress list).
Closing this one with thanks.
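For reference, a sketch of where that setting lives, assuming a kops 1.22 InstanceGroup spec (the instance group name "nodes" is a placeholder):

    # The kops 1.22 release notes linked above describe the new IMDSv2 defaults
    # (httpTokens: required, hop limit 1); raising the hop limit lets pods on the
    # pod network reach the metadata service.
    kops edit ig nodes
    #   spec:
    #     instanceMetadata:
    #       httpPutResponseHopLimit: 3
    #       httpTokens: required
    kops update cluster --yes
    kops rolling-update cluster --yes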

@jbilliau-rcd
Contributor

I'm having this exact same issue, out of nowhere, on ONE out of 80 clusters... makes no sense. Where exactly did you change that setting, @SarumObjects? Did you get it to work?

@SarumObjects
Author

@jbilliau-rcd I had to make the change (httpPutResponseHopLimit: 3) with "kops edit cluster" rather than update it with a script (I have only 4 clusters of 3 nodes each). They continue to work, but if I upgrade the clusters I now have to terminate the nodes - which I do with a script, giving the replacement nodes time to start.
It's very odd behaviour - but I haven't got enough time to explore it. (If it ain't broke, don't fix it.)

@szuecs
Member

szuecs commented Jan 3, 2023

@SarumObjects @jbilliau-rcd can you create a docs PR for the kops update instructions to highlight that a Kubernetes version update can trigger this?

Our current cluster setup is Kubernetes 1.21 and not kops, so I cannot test on our side whether it's kops related or not. We have been migrating from CRD v1beta1 and Ingress v1beta1 for more than half a year, and soon we will update to 1.22.

@szuecs szuecs reopened this Jan 3, 2023
@jbilliau-rcd
Contributor

@szuecs apologies, I don't quite understand what you are asking. You want me to put in a PR to update the docs for what exactly? That this can happen if you go to 1.22? Do we know that for sure? I have plenty of clusters running EKS 1.22 just fine with 0.14.0 of this controller, with the following argument set in the pod spec: --ingress-api-version=networking.k8s.io/v1.

So we are already on 1.22, already using the new v1 Ingress API, and it works on all clusters except one. Mind you... that one isn't even on 1.22! It's on 1.21, so I don't think this has anything to do with 1.22; it looks more OIDC/IAM related.

@szuecs
Member

szuecs commented Jan 3, 2023

@jbilliau-rcd oh interesting, so we need to investigate more. Right now we have to rely on you, the contributors.

@jbilliau-rcd
Contributor

So I ended up running this command:

aws ec2 modify-instance-metadata-options \
    --instance-id i-1234567898abcdef0 \
    --http-put-response-hop-limit 3 \
    --http-endpoint enabled

With the instance-id being the EC2 node that the Zalando pod was running on, and that fixed it! How this (so far) has only happened on one node is still puzzling to me, but that is the issue. It seems like the fix would need to be that the pod should never contact (or at least have the configuration option to never contact) the EC2 instance metadata service, and instead only ever use OIDC with its own IAM role rather than the role of the worker node. We give our Zalando ingress its own role, so the fact that it broke due to not being able to call the worker node's metadata URL itself (presumably to use its own if it needed to) kinda sucked :(
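If it helps anyone hitting this, a sketch of reproducing the symptom from inside a pod on the affected node (assumes curl is available in the image):

    # IMDSv2: fetch a token first, then query metadata with it.
    # With a hop limit of 1 the PUT below fails from the pod network, and plain
    # GETs return 401 while httpTokens is required - the same 401 the controller logs.
    TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
    curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
        http://169.254.169.254/latest/meta-data/instance-id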

@mikkeloscar
Collaborator

@jbilliau-rcd With this it should be possible to run the controller without needing to contact the ec2 instance metadata service: #376

@jbilliau-rcd
Contributor

Ah interesting... looks like that was merged 2 years ago!? Has this hidden argument always been available? I don't see it in any documentation anywhere.

@mikkeloscar
Collaborator

Yeah, we should get this documented so it's more clear.
