When using Ubuntu 20 with kernel 5.8 (after apt upgrade), we see the issue rancher/rke2#1541 again. The workaround is disabling checksum offload on the vxlan.calico interface.
The issue appears when a node tries to access a service that is backed by a pod on another node, e.g. the coredns service. In that case the traffic must traverse the vxlan tunnel, and the problem is visible in tcpdump on the receiving node.
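For reference, a capture along these lines can be used to observe it on the receiving node (a sketch; vxlan.calico is Calico's default VXLAN device and the filter assumes DNS traffic):

# -vv makes tcpdump verify and print UDP checksums for the decapsulated packets
sudo tcpdump -ni vxlan.calico -vv udp port 53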
Expected Behavior

When deploying Calico as the CNI plugin on Ubuntu 20 with kernel 5.8 (after apt upgrade), I expect to be able to access all Kubernetes services (e.g. DNS) successfully from any node, and tcpdump on the receiving node to show the traffic being delivered normally.
Current Behavior

When deploying Calico as the CNI plugin on Ubuntu 20 with kernel 5.8 (after apt upgrade), I can only access a service (e.g. DNS) from a node if the pod implementing the service runs on that same node, i.e. as soon as the traffic has to take the vxlan tunnel, it stops working. tcpdump on the receiving node shows the failure.
Possible Solution

sudo ethtool -K vxlan.calico tx-checksum-ip-generic off

or

featureDetectOverride: "ChecksumOffloadBroken=true"

But this has a performance impact.
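Note that the ethtool command only affects the node it is run on and does not survive a reboot or recreation of the interface. A sketch of how the featureDetectOverride workaround could be applied cluster-wide, assuming the default FelixConfiguration resource exists and kubectl can reach the Calico CRDs (calicoctl offers an equivalent patch subcommand):

# Tell Felix to treat checksum offload as broken, so calico-node disables
# it on vxlan.calico itself on every node.
kubectl patch felixconfiguration default --type merge \
  -p '{"spec":{"featureDetectOverride":"ChecksumOffloadBroken=true"}}'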
Steps to Reproduce (for bugs)

1. Deploy Kubernetes on Ubuntu 20 with kernel 5.8 on 2 or more nodes.
2. Run dig @10.43.0.10 www.google.com on all nodes. It will only work on one of them (assuming 10.43.0.10 is the clusterIP of the DNS service); see the check sketched after this list.
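To confirm that the one working node is the one hosting the DNS pod, something along these lines can be used (a sketch; it assumes CoreDNS carries the usual k8s-app=kube-dns label and that 10.43.0.10 is the DNS Service clusterIP):

# Find the node(s) where the CoreDNS pods are scheduled
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide

# From each node, query the cluster DNS Service directly; only the node(s)
# running a CoreDNS pod get an answer, the others time out.
dig @10.43.0.10 www.google.com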
Context
Your Environment
Calico version 3.19
Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes
Operating System and version: Ubuntu 20 kernel 5.8
@fasaxc recently put in a fix to automatically disable checksum offload based on the kernel version, but it sounds like perhaps there are some kernels for which that fix isn't working properly?
I think the best solution I'm aware of at the moment is to explicitly disable the offload as you suggested in your post.
I'm not sure there is much else we can do on our side here - users either need to upgrade to a kernel that has the checksum fix included, or use one of the options above to turn off checksum offloading, or turn off --random-fully masquerade IIRC.
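For anyone triaging this on their own nodes, a quick sketch of the checks implied above (the interface name assumes a default Calico VXLAN setup):

# Kernel in use (the checksum regression affects certain 5.x kernels)
uname -r

# Whether TX checksum offload is currently enabled on the Calico VXLAN device
sudo ethtool -k vxlan.calico | grep tx-checksum-ip-generic

# Whether the MASQUERADE rules use --random-fully, which the comment above
# mentions in connection with this issue
sudo iptables-save -t nat | grep random-fully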