UDP: bad checksum on VXLAN interface #1279

Closed
dmitry-irtegov opened this issue Apr 7, 2020 · 6 comments

dmitry-irtegov commented Apr 7, 2020

On a k8s 1.17 cluster with RHEL 7 nodes, service IPs for pods on other nodes are not accessible.
Pod IPs seem to work fine. Most noticeably, CoreDNS does not work.
The target node's dmesg is filled with messages like:

[ 1423.035722] UDP: bad checksum. From 172.16.9.99:27503 to 172.16.80.252:8472 ulen 75
[ 1426.036873] UDP: bad checksum. From 172.16.9.99:4894 to 172.16.80.252:8472 ulen 75
[ 1427.537392] UDP: bad checksum. From 172.16.9.99:32693 to 172.16.80.252:8472 ulen 75
[ 1429.037910] UDP: bad checksum. From 172.16.9.99:46133 to 172.16.80.252:8472 ulen 75
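
To gauge how many senders are affected, the kernel messages above can be summarized per source address. This is only a minimal sketch: it assumes the exact message format shown above, IPv4 addresses, and flannel's default VXLAN port 8472. The function name `count_bad_vxlan_checksums` is hypothetical.

```shell
#!/bin/sh
# Summarize "UDP: bad checksum" kernel messages aimed at the VXLAN port.
# Reads kernel log text on stdin; prints "<count> <source-ip>" per sender.
# Assumes the message format shown above and IPv4 addresses.
count_bad_vxlan_checksums() {
    port="${1:-8472}"   # flannel's default VXLAN port
    awk -v port="$port" '
        /UDP: bad checksum\. From / {
            # Locate the "From <src>:<sport> to <dst>:<dport>" fields.
            for (i = 1; i <= NF; i++) {
                if ($i == "From") { src = $(i + 1); dst = $(i + 3) }
            }
            n = split(dst, d, ":")
            if (d[n] != port) next      # ignore other destination ports
            split(src, s, ":")
            count[s[1]]++
        }
        END { for (ip in count) print count[ip], ip }
    ' | sort -rn
}
```

It would typically be fed from the kernel log, for example `dmesg | count_bad_vxlan_checksums`, which needs permission to read that log.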

Turning off TX checksum offloading on the flannel.1 interface fixes the issue:

[root@ip-172-16-102-241 ~]# nslookup www.google.com 100.64.0.10
;; connection timed out; no servers could be reached
[root@ip-172-16-102-241 ~]# ethtool -K flannel.1 tx-checksum-ip-generic off
Actual changes:
tx-checksumming: off
	tx-checksum-ip-generic: off
tcp-segmentation-offload: off
	tx-tcp-segmentation: off [requested on]
	tx-tcp-ecn-segmentation: off [requested on]
	tx-tcp6-segmentation: off [requested on]
	tx-tcp-mangleid-segmentation: off [requested on]
udp-fragmentation-offload: off [requested on]
[root@ip-172-16-102-241 ~]# nslookup www.google.com 100.64.0.10
Server:		100.64.0.10
Address:	100.64.0.10#53
Non-authoritative answer:
Name:	www.google.com
Address: 172.217.13.228
Name:	www.google.com
Address: 2607:f8b0:4004:80a::2004
[root@ip-172-16-102-241 ~]# 
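
Because `ethtool` settings are not expected to survive a reboot or a re-creation of the interface, the workaround may need to be re-applied. The sketch below checks the current state with `ethtool -k` and only calls `ethtool -K` when the offload is still on. The function names and the `ETHTOOL` override variable are hypothetical, introduced here for illustration and testing.

```shell
#!/bin/sh
# Report whether generic TX checksum offload is enabled, given the text
# printed by `ethtool -k <iface>` on stdin. Prints "on", "off", or "unknown".
tx_checksum_state() {
    awk '
        $1 == "tx-checksum-ip-generic:" { print $2; found = 1; exit }
        END { if (!found) print "unknown" }
    '
}

# Apply the workaround only when the offload is still enabled.
# The interface defaults to flannel.1; ETHTOOL may point at a stub for testing.
disable_tx_checksum_if_on() {
    iface="${1:-flannel.1}"
    ethtool_cmd="${ETHTOOL:-ethtool}"
    state=$("$ethtool_cmd" -k "$iface" | tx_checksum_state)
    if [ "$state" = "on" ]; then
        "$ethtool_cmd" -K "$iface" tx-checksum-ip-generic off
    else
        echo "tx-checksum-ip-generic on $iface is '$state'; nothing to do"
    fi
}
```

Running `disable_tx_checksum_if_on flannel.1` as root after flannel has created the interface would then be equivalent to the manual command above.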

Other people also hit this: https://t.du9l.com/2020/03/kubernetes-flannel-udp-packets-dropped-for-wrong-checksum-workaround/

This happens both with cni-canal and pure cni-flannel, so we decided to report the issue here.

Expected Behavior

I do not have to adjust interface settings to get flannel to work.

Your Environment

  • Flannel version: 0.11.0
  • Backend used (e.g. vxlan or udp): vxlan
  • Etcd version: 3.4.3
  • Kubernetes version (if used): 1.17.4
  • Operating System and version: RHEL 7.8
  • Link to your project (optional): https://www.kublr.com

CMajeri commented May 14, 2020

We hit this too. It was an absolute pain to figure out.

It seems to affect only service IPs (I'm guessing because of masquerade?), and specifically UDP (nslookup doesn't work, but nslookup in TCP mode does).

If anyone knows where this bug comes from (besides checksum offloading) I'd be very interested.
I'll keep looking for a bit, but I'm not great with networking.

Our environment:

  • flannel: 0.9.0 (vxlan mode)
  • kubernetes: 1.16.9 (kube-proxy in iptables mode)
  • OS: CentOS 8
  • etcd: 3.4.7

The weird thing is that we're running almost the same versions of everything (etcd is different, but I really doubt that is the cause) on a Fedora 30 server, and things work fine there. The settings are the same, and while the routing tables differ, the basic idea is the same.
Could it be something kernel or virtio related?

holooloo commented

Check the iptables version on CentOS 7 and 8; it must be higher than 1.6.2.

ksancheti commented May 26, 2020

We are facing the same issue as mentioned by @dmitry-irtegov and @CMajeri.

Environment:

  • Flannel version: 0.11.0
  • Backend used (e.g. vxlan or udp): vxlan
  • Etcd version: 3.4.3
  • Kubernetes version: 1.18.2 (installed using kubeadm)
  • Operating System and version: Ubuntu 18.04.4
  • iptables version: 1.6.1

This workaround worked for us:

ethtool -K flannel.1 tx-checksum-ip-generic off

brucedlg commented Jun 22, 2020

It's definitely related to kubernetes/kubernetes#88986. The fix, kubernetes/kubernetes#92035, has a good description of the issue: a change to the iptables rules exposed an existing kernel bug, especially on RHEL 7.

Here is another workaround for the issue that does not require turning off checksum offload:

sudo iptables -A OUTPUT -p udp -m udp --dport 8472 -j MARK --set-xmark 0x0

UDP port 8472 is flannel's default port for the encapsulating packets. The rule clears the mark so that the encapsulating packet is not SNATed, which avoids double SNAT.
This assumes that you use iptables; ipvs should have similar commands.
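
Since `iptables -A` appends a new copy of the rule on every run, a script that re-applies this workaround may want to check for the rule first. The following is a minimal sketch using `iptables -C` (rule check) with the same rule specification as above. The function name and the `IPTABLES` override variable are hypothetical, and the rule is still not persisted across reboots.

```shell
#!/bin/sh
# Add the mark-clearing rule only if it is not already present, so the
# script can be re-run safely. IPTABLES may point at a stub for testing.
ensure_vxlan_unmark_rule() {
    port="${1:-8472}"   # flannel's default VXLAN port
    ipt="${IPTABLES:-iptables}"
    # -C exits non-zero when no identical rule exists in the chain.
    if "$ipt" -C OUTPUT -p udp -m udp --dport "$port" -j MARK --set-xmark 0x0 2>/dev/null; then
        echo "rule already present"
    else
        "$ipt" -A OUTPUT -p udp -m udp --dport "$port" -j MARK --set-xmark 0x0
    fi
}
```

Calling `ensure_vxlan_unmark_rule` as root on each node would then have the same effect as the single command above, without accumulating duplicate rules.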

oilbeater added a commit to kubeovn/kube-ovn that referenced this issue Aug 17, 2021
Similar to flannel-io/flannel#1279, unmark output to bypaas kernel bug and enable checksum for better performance.
oilbeater added a commit to kubeovn/kube-ovn that referenced this issue Aug 18, 2021
Similar to flannel-io/flannel#1279, unmark output to bypaas kernel bug and enable checksum for better performance.

(cherry picked from commit dcda11d)

stale bot commented Jan 26, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

simondeting commented

"Check the iptables version on CentOS 7 and 8; it must be higher than 1.6.2."

Can you tell me why, or give a link, please?
