When using Ubuntu 20 with kernel 5.8 (after apt upgrade), we see the issue rancher/rke2#1541 again. The workaround is disabling checksum offload on the vxlan.calico interface.
The issue appears when a node tries to access a service that is backed by a pod on another node, e.g. the coredns service. In that case the traffic must traverse the vxlan tunnel, and the problem is visible in tcpdump on the receiving node.
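For reference, a capture along these lines can be used to observe it on the receiving node (a sketch; vxlan.calico is Calico's default VXLAN device and the filter assumes DNS traffic):

# -vv makes tcpdump verify and print UDP checksums for the decapsulated packets
sudo tcpdump -ni vxlan.calico -vv udp port 53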
Expected Behavior

When deploying Calico as the CNI plugin on Ubuntu 20 with kernel 5.8 (after apt upgrade), I expect to be able to access all Kubernetes services (e.g. DNS) successfully from any node, and tcpdump on the receiving node to show the traffic being delivered normally.
Current Behavior

When deploying Calico as the CNI plugin on Ubuntu 20 with kernel 5.8 (after apt upgrade), I can only access a service (e.g. DNS) from a node if the pod implementing the service runs on that same node, i.e. as soon as the traffic has to take the vxlan tunnel, it stops working. tcpdump on the receiving node shows the failure.
Possible Solution

sudo ethtool -K vxlan.calico tx-checksum-ip-generic off

or

featureDetectOverride: "ChecksumOffloadBroken=true"

But this has a performance impact.
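Note that the ethtool command only affects the node it is run on and does not survive a reboot or recreation of the interface. A sketch of how the featureDetectOverride workaround could be applied cluster-wide, assuming the default FelixConfiguration resource exists and kubectl can reach the Calico CRDs (calicoctl offers an equivalent patch subcommand):

# Tell Felix to treat checksum offload as broken, so calico-node disables
# it on vxlan.calico itself on every node.
kubectl patch felixconfiguration default --type merge \
  -p '{"spec":{"featureDetectOverride":"ChecksumOffloadBroken=true"}}'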
Steps to Reproduce (for bugs)

1. Deploy Kubernetes on Ubuntu 20 with kernel 5.8 on 2 or more nodes.
2. Run dig @10.43.0.10 www.google.com on all nodes. It will only work on one of them (assuming 10.43.0.10 is the clusterIP of the DNS service); see the check sketched after this list.
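To confirm that the one working node is the one hosting the DNS pod, something along these lines can be used (a sketch; it assumes CoreDNS carries the usual k8s-app=kube-dns label and that 10.43.0.10 is the DNS Service clusterIP):

# Find the node(s) where the CoreDNS pods are scheduled
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide

# From each node, query the cluster DNS Service directly; only the node(s)
# running a CoreDNS pod get an answer, the others time out.
dig @10.43.0.10 www.google.com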
Context
Your Environment
Calico version 3.19
Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes
Operating System and version: Ubuntu 20 kernel 5.8
@fasaxc recently put in a fix to automatically disable checksum offload based on the kernel version, but it sounds like perhaps there are some kernels for which that fix isn't working properly?
I think the best solution I'm aware of at the moment is to explicitly disable the offload as you suggested in your post.
I'm not sure there is much else we can do on our side here - users either need to upgrade to a kernel that has the checksum fix included, or use one of the options above to turn off checksum offloading, or turn off --random-fully masquerade IIRC.
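For anyone triaging this on their own nodes, a quick sketch of the checks implied above (the interface name assumes a default Calico VXLAN setup):

# Kernel in use (the checksum regression affects certain 5.x kernels)
uname -r

# Whether TX checksum offload is currently enabled on the Calico VXLAN device
sudo ethtool -k vxlan.calico | grep tx-checksum-ip-generic

# Whether the MASQUERADE rules use --random-fully, which the comment above
# mentions in connection with this issue
sudo iptables-save -t nat | grep random-fully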