EKS nodes do not get IPv6 address from wicked-dhcp6 #3143
Thanks for the issue @sdomme! This is an interesting failure case, and I appreciate the debug work you've done. Do you have an estimate of how often it happens in your clusters? Something like "x failures in y instance launches". On an instance that fails to get a DHCP6 lease, I'd also be curious to see some additional output with the debug mode turned up.
Hi @zmrow. Unfortunately (see #2254) there is not much in the logs. As for the estimate, it is a bit hard to say since it totally depends on the cluster. Currently there is one cluster with 16 Karpenter-provisioned nodes, the oldest with an age of 32h, and 1 node is in a broken state. My estimate would be that for roughly every 12-16 nodes per 24h, one ends up with this issue.
Thanks @sdomme. I've launched a bunch of instances and seem to have reproduced the same behavior you're seeing. I'm digging into it and looking for a fix. I'll provide additional details here when I have them.
Update: I turned on some additional debug logging. I had an initial theory about the timing involved, and I'm digging through the additional logs of this instance; I will follow up tomorrow, as I'm launching more instances as we speak.
Update: Continuing the search! I found that in all cases where the instance failed to get an IPv6 address, the computed DHCPv6 solicitation (transmit) delay was 0.

In the code, if the transmit delay is set to 0, the function cancels the timer and doesn't continue. A DHCPv6 Solicit message is never sent and everything grinds to a halt. @markusboehme wrote a few exploratory patches today to adjust the solicitation delay. In my testing of these patches, I was still able to produce 1 node in 2000 launches that failed to get a DHCPv6 address. The interface never seems to transition to "managed" mode, as it does on successful instances.
We'll continue to dig into this.
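To make the failure mode described above concrete, here is a minimal sketch in C, assuming a simplified event-loop timer; the names (struct timer, arm_solicit_timer_buggy, and so on) are hypothetical and this is not the actual wicked code:

```c
/*
 * Minimal illustrative sketch (NOT the actual wicked source): how treating
 * a transmit delay of 0 as "cancel" keeps a DHCPv6 Solicit from ever being
 * sent. All names here are hypothetical.
 */
#include <stdbool.h>
#include <stdio.h>

struct timer {
    bool armed;
};

static void cancel_timer(struct timer *t)
{
    t->armed = false;
    printf("timer cancelled, no Solicit will be sent\n");
}

static void send_solicit(void)
{
    printf("DHCPv6 Solicit sent\n");
}

/* Buggy pattern: a zero delay takes the cancel path and returns early. */
static void arm_solicit_timer_buggy(struct timer *t, unsigned int delay_msec)
{
    if (delay_msec == 0) {
        cancel_timer(t);
        return;                 /* Solicit is never transmitted */
    }
    t->armed = true;
    printf("Solicit scheduled in %u ms\n", delay_msec);
    send_solicit();             /* stand-in for the timer callback firing later */
}

int main(void)
{
    struct timer t = { .armed = false };

    arm_solicit_timer_buggy(&t, 0);    /* randomized delay happened to be 0 */
    arm_solicit_timer_buggy(&t, 900);  /* any non-zero delay works as expected */
    return 0;
}
```

With a randomized initial delay that happens to land on 0, the cancel path is taken and no Solicit ever goes out, which matches the stalled DHCPv6 state observed on affected nodes.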
@sdomme good news! I think we found the culprit(s) and have a good bead on a fix. I have a PR out now for one half of the fix, and @markusboehme is working on a PR for some patches to wicked.

We went pretty far down the rabbit hole on this one. As previously mentioned, we came across the issue where a transmit delay of 0 causes wicked to cancel its timer and never send a DHCPv6 Solicit. After digging a bit deeper into the wicked/kernel code, we were also able to clear up my initial suspicion about the timing. My PR and the aforementioned patches to wicked together make up the fix.
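For contrast, here is a sketch of one plausible way to sidestep the zero-delay trap, again with hypothetical names and not taken from the actual PR or wicked patches: treat a delay of 0 as "send immediately" rather than cancelling.

```c
/*
 * Hypothetical sketch of one way to avoid the zero-delay trap (not the
 * actual fix): a delay of 0 means "send the Solicit right away" rather
 * than "cancel the timer".
 */
#include <stdbool.h>
#include <stdio.h>

struct timer {
    bool armed;
};

static void send_solicit(void)
{
    printf("DHCPv6 Solicit sent\n");
}

static void arm_solicit_timer_fixed(struct timer *t, unsigned int delay_msec)
{
    if (delay_msec == 0) {
        send_solicit();         /* transmit immediately instead of bailing out */
        return;
    }
    t->armed = true;            /* schedule send_solicit() after delay_msec */
    printf("Solicit scheduled in %u ms\n", delay_msec);
}

int main(void)
{
    struct timer t = { .armed = false };

    arm_solicit_timer_fixed(&t, 0);    /* zero delay still produces a Solicit */
    arm_solicit_timer_fixed(&t, 900);
    return 0;
}
```

Clamping the randomized delay to a small non-zero minimum would be another option with a similar effect.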
Hello @zmrow. It is indeed very good news. Thank you and @markusboehme for this awesome debugging.
Yet again, thanks for your in-depth debugging and for resolving this quite fast 🏆. Could you also let us know when this will be released? @zmrow and @markusboehme
We don't have an exact release date yet, but we want to include this in the next release. We'll follow up here when we have more details.
Fixes merged.
Image I'm using: Bottlerocket OS 1.14.0 (aws-k8s-1.24)
What I expected to happen:
In an EKS IPv6 cluster I expect the nodes to get an IPv6 address on the eth0 interface (the primary ENI), which is also what happens most of the time.

What actually happened:
On multiple EKS clusters (nodes are provisioned dynamically by Karpenter), nodes sometimes don't get an IPv6 address on eth0 even though the ENI has this IP address configured (as shown in the AWS console). Tools that use IPv6 to talk to the Kubernetes API (in our case Cilium) run into timeouts and break the whole network stack.

What we could find so far:
The IPv6 request is stuck, whereas on a working node it says granted. Trying to get an IP manually leads to a timeout.
Gives up after 10 seconds:
More verbose:
Interestingly, restarting wicked entirely helps to get an IP.
How to reproduce the problem:
We couldn't find a pattern for why this happens. In clusters where we do dynamic scaling and replace nodes many times per day, it happens more often.
We were advised in aws/amazon-vpc-cni-k8s#2391 to open this issue here.