# 1.17 alpha versions causing regression for kiam? #8562
**1. What `kops` version are you running? The command `kops version` will display this information.**

Any of the 1.17 alphas so far.

**2. What Kubernetes version are you running? `kubectl version` will print the version if a cluster is running or provide the Kubernetes version specified as a `kops` flag.**

Seen in 1.17.0-rc.2 through 1.17.3. Works without issue on clusters built with kops/k8s 1.15 and 1.16; the ONLY change is the bump to 1.17.x.

**3. What cloud provider are you using?**

AWS

**4. What commands did you run? What is the simplest way to reproduce this issue?**

Try to install kiam via its included helm chart onto a cluster built with kops 1.17.x.

**5. What happened after the commands executed?**

The kiam-agent daemonsets crashloop.

**6. What did you expect to happen?**

No crashloop.

**7. Please provide your cluster manifest.**

Will follow up with this if asked for. The main thing applicable here is that we are using CoreDNS.

**8. Please run the commands with the most verbose logging by adding the `-v 10` flag. Paste the logs into this report, or in a gist and provide the gist link here.**

From the agent logs w/ gRPC debugging enabled:

**9. Anything else do we need to know?**

Bug also posted with the kiam folks here: uswitch/kiam#378

---
That Kubernetes issue just linked is likely at the root of all this.

Update: This seems to be specific to using the flannel/canal CNI with the vxlan backend by some accounts, and further testing seems to support that.
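For reference, the combination being described looks roughly like this in a kops cluster spec (a minimal sketch; the cluster name and versions are placeholders, and `canal` in place of `flannel` is reported to behave the same way):

```yaml
# Sketch of the affected configuration in a kops cluster spec:
# the flannel CNI with the vxlan backend on a 1.17 cluster.
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: example.cluster.k8s.local  # placeholder name
spec:
  kubernetesVersion: 1.17.0
  networking:
    flannel:
      backend: vxlan  # the backend implicated in this issue
```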
So the problem clearly isn't with kops itself. However, it might be worthwhile to warn users in the documentation, or even to treat the flannel/canal CNI with the vxlan backend as an invalid configuration on 1.17 versions, as it's going to result in more odd reports like this one. :)
What, specifically, are the invalid configurations?

See the flannel issue for more info here: flannel-io/flannel#1243

It seems there's not enough information to identify a particular bad configuration. It looks like the issue is still being triaged and is likely a bug in Flannel and/or Canal. There's time before kops 1.17 is released for the bug(s) to be fixed. If it later turns out to be a more permanent situation, we could add an API validation check then.

See the comment above for what constitutes a non-working configuration, which I've detailed as requested. The bug is in Flannel (which Canal uses), and I've linked the issue involved. Yes, it's possible that a fix will be made available, but I'm not holding my breath, as the project seems to be trending towards dormancy.

So you're proposing kops should disallow a CNI of Canal or Flannel with a backend of vxlan for Kubernetes versions equal to or greater than 1.17?

Thanks for reporting @jhohertz. The current theory is that it's related to the kernel version: some kernels have bugs in the computation of the checksums, which can be worked around by turning off offload of that computation. Which image (AMI) are you using (or are you using the default kops image)?

We're currently using the latest Flatcar stable release. I am currently looking at trying to patch in the ethtool thing for testing.
I may have found hints as to "what's different between 1.16 and 1.17". A dependency on a netlink library was bumped, and within that bump there are specific changes to vxlan and the handling of checksums. It looks like it should really only have added IPv6 UDP checksum support, but... after searching around for what's different between 1.16 and 1.17, this kind of stands out. Comment on the flannel issue: flannel-io/flannel#1243 (comment) Perhaps this will help folks find out what's going on? (Or it may possibly prove to be a red herring...) That update also includes new ethtool-related code.
Is someone able to write up a release note for kops 1.17? I would prefer we not hold up 1.17 indefinitely for a new version of Flannel.

Can this be closed now that #9074 has been merged and cherry-picked to 1.17?

Probably? Any way you could cut another beta with this in place, for wider testing?

/close
@johngmyers: Closing this issue.
@jhohertz I think the next release will be more of an RC or final. Not sure anything else can be done to improve things with Flannel until a new release comes.

Is there a kops 1.17.0 build available with this fix included? We have encountered kiam issues when testing kops 1.17.0-beta.2 with flannel networking, which we need for our Windows worker nodes to join.

No release yet. It will go into the next one.

Just a note to warn: this nightmare may also have just landed in 1.16, as of k8s 1.16.10. Still investigating, but it's behaving the exact same way.
We run flannel on a non-standard port, so for us the suggested fix won't help. But it's easy to address this flannel issue already today, using a custom hook in the cluster manifest (see the sketch below). Replace 4096 with 1 if you run with a standard flannel setup.
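The hook itself was lost in formatting; below is a sketch of what such a hook can look like, reconstructed from context (the unit name is hypothetical, and a systemd-based node image is assumed). It uses ethtool to disable TX checksum offload on the flannel vxlan device, which is the workaround discussed above; `flannel.4096` matches the non-standard VNI mentioned, so use `flannel.1` on a default setup:

```yaml
# Sketch of a kops cluster-spec hook (reconstructed; the original
# snippet was stripped from the comment above). It waits for the
# flannel vxlan device to exist, then turns off TX checksum offload.
spec:
  hooks:
  - name: flannel-tx-checksum-off.service  # hypothetical unit name
    roles:
    - Node
    - Master
    manifest: |
      Type=oneshot
      RemainAfterExit=true
      ExecStart=/bin/sh -c 'until ethtool -K flannel.4096 tx-checksum-ip-generic off; do sleep 5; done'
```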
I guess that was a bit dramatic of me. 😄 It just bothered me that I couldn't explain why, though looking at the .10 patch, an iptables version bump (which also showed up between 1.16.0 and 1.17.0) may be the only networking-related change in it. I'm aware of that workaround, but thank you for mentioning it anyway.