TCP offloading on vxlan.calico adaptor causing 63 second delays in VXLAN communications node->nodeport or node->clusterip:port. #3145
Comments
Hello, we have the exact same issue on CentOS 7 (3.10.0-1062.9.1.el7.x86_64) running Calico/Canal and flannel. After spending the better part of a week trying to figure out why two thirds of our cluster was unable to talk reliably to the other third, I stumbled upon this issue. I can report that disabling offloading completely works around this issue. In our case, the command is:
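(The command itself wasn't captured above. For a flannel/Canal VXLAN device it would typically be the same ethtool invocation used elsewhere in this issue, applied to flannel's interface; the interface name below is an assumption and depends on your CNI.)

```
# flannel/Canal normally names its VXLAN device flannel.1; Calico uses vxlan.calico
ethtool --offload flannel.1 rx off tx off
```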
Same issue here; we have a 63 second delay while connecting to ClusterIP services in a CentOS 7 / k8s 1.7.2 / Calico 3.12.0 cluster running on Hetzner Cloud. Disabling ethernet offloading resolves the connection issues mentioned.
Thanks, it works. flannel + k8s 1.17.2, CentOS 7.
@jelaryma I am having a similar problem: kubernetes/kubernetes#88986. I came to think it was a flannel issue, but you are also seeing this with Calico. Is it your thought that this is a k8s bug? Also, would I run the following command on every node of the cluster?
There's a thread in SIG network about this: https://groups.google.com/forum/#!topic/kubernetes-sig-network/JxkTLd4M8WM Summary so far seems to be that this is a kernel bug related to VXLAN offload where the checksum calculation is not properly offloaded. |
Same issue with k8s 1.18 + Ubuntu 16.04 with Calico 3.11; it only causes 3 second delays.
Flannel now has an open PR addressing this. flannel-io/flannel#1282 |
see this flannel-io/flannel#1282 (comment) |
I have been hit by this bug as well. Does anyone know where to add the ethtool command to make it persistent after a reboot on CentOS 7? I tried adding it to rc.local, but it looks like the device is being created after the script runs, because I am getting a |
The 63 seconds maybe refers to five retransmissions, but I'm still confused about the cause of this issue. Do you know of any articles or blog posts about the 'no cksum' flag?
https://zhangguanzhang.github.io/2020/05/23/k8s-vxlan-63-timeout/ |
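For reference on the 63 second number: with the Linux default of net.ipv4.tcp_syn_retries = 6, SYN retransmissions back off exponentially (1 s, 2 s, 4 s, 8 s, 16 s, 32 s), so the final retry goes out 1 + 2 + 4 + 8 + 16 + 32 = 63 seconds after the first SYN, which matches the point at which the original report sees a SYN finally sent with 'no cksum'.

```
# Default on most distros; the cumulative backoff before the final SYN retry
# is 1 + 2 + 4 + 8 + 16 + 32 = 63 seconds.
sysctl net.ipv4.tcp_syn_retries
```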
Did you find a solution for a persistent fix after reboot?
@balleon Nope, thankfully the servers don't get rebooted very often... |
any better solution? |
On my Kubernetes 1.18.5 and CentOS 7 cluster I use a custom kube-proxy image.
THIS WORKS!!! |
The latest Kubernetes release should fix this issue.
I tried it in the following environment:
Problem still there, I have to |
@danwinship PTAL |
I have the same issue here, tested on Kubernetes v1.18.12 with Calico v3.17. I tried the new option With
Without
I was able to solve it by manually adding a MASQUERADE rule without
But that's not reboot proof :( We should have an option in kube-proxy to disable the |
If anyone is interested in a reboot-proof solution to apply the workaround (disable offloading on vxlan.calico), here are the two files:
/etc/systemd/system/disable-offloading-on-vxlan.service
/usr/local/bin/disable-offloading-on-vxlan
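The file contents weren't included above; a minimal sketch of what such a pair could look like follows (the vxlan.calico interface name and the wait loop are assumptions):

```
# /etc/systemd/system/disable-offloading-on-vxlan.service (illustrative sketch)
[Unit]
Description=Disable checksum offloading on vxlan.calico
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/usr/local/bin/disable-offloading-on-vxlan

[Install]
WantedBy=multi-user.target
```

```
#!/bin/sh
# /usr/local/bin/disable-offloading-on-vxlan (illustrative sketch)
# The vxlan.calico device is created by calico-node after boot, so wait for it
# before running ethtool.
IFACE=vxlan.calico
for _ in $(seq 1 60); do
    if ip link show "$IFACE" >/dev/null 2>&1; then
        exec ethtool --offload "$IFACE" rx off tx off
    fi
    sleep 5
done
echo "$IFACE never appeared; offload not disabled" >&2
exit 1
```

Mark the script executable, then enable the unit with systemctl daemon-reload && systemctl enable --now disable-offloading-on-vxlan.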
FWIW, the upstream kernel patch that fixed the issues we had root-caused for OpenShift and some Kubernetes use-cases is torvalds/linux@ea64d8d and is present in the 5.7 and later kernels, and the RHEL 8.2's kernel-4.18.0-193.13.2.el8_2 and later as of 2020-Jul-21. I presume CentOS 8.2 has this fix already. Other distros (Ubuntu 20.04.1) may not yet have it, if they haven't updated their kernel or backported the patch. |
FTR, Calico does SNAT traffic from hosts to service VIPs that gets load balanced to remote pods when in VXLAN (or IPIP) mode. We do that because the source IP selection algorithm chooses the source IP based on the route that appears to apply to the service VIP. Then, after the source IP selection, the dest IP is changed by kube-proxy's DNAT. The source IP is then "wrong"; it'll be the eth0 IP, when it needs to be the Calico "tunnel IP" to reach a remote pod. Hence, we SNAT to the tunnel IP in that case. |
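To make the source IP selection part concrete, it can be observed with ip route get; the addresses below are purely hypothetical:

```
# Hypothetical addresses: 10.96.0.10 is a service VIP, 192.168.1.10 is this
# node's eth0 address, and a 10.244.x.x address would be its vxlan.calico tunnel IP.
$ ip route get 10.96.0.10
10.96.0.10 via 192.168.1.1 dev eth0 src 192.168.1.10 uid 0
# The kernel picks the eth0 address because the VIP only matches the default
# route; kube-proxy then DNATs the VIP to a remote pod IP that is reachable
# only via vxlan.calico, so Calico SNATs the packet to the tunnel IP.
```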
I am running a Kubernetes 1.20 cluster with Calico. I just configured a 10 node cluster from bare metal about 2 weeks ago. I have a simple application tier with a Jellyfin streaming service and a RESTful application service (Express + React). I see flaky communications between my React application and the Jellyfin HLS service. These applications work and interact just fine outside of this Kubernetes + Calico environment. Though it's been a year since I had to worry about such things (since prior workarounds served me), the flaky behaviour reminds me of what I saw that led to the recognition of this 63 second delay. These are the interfaces I see on one of my worker nodes:

I've been tracking this issue for a year now, since I was the author of a related k8s issue created when I spent over a week isolating such a network unreliability problem. I am using CentOS 7 with the latest Calico and the latest Kubernetes. I've seen some discussion (both in the k8s issue and in the Calico issue) about disabling tx offloading. I've seen the k8s issue closed (despite the fact that I very much see this as a k8s problem), and I've seen 3145 stay open (glad it's still open, if the issue really still lurks and bites people).
These are the kinds of painful issues that drive people away from powerful infrastructure elements like Kubernetes and Calico. I really, really hope to see some well-articulated guidance.
We can continue discussing, but I'm going to close this for now since the kernel has been patched, and Calico now provides a way to turn off the feature that triggers it. |
I'm just updating this thread based on a recent Slack conversation regarding a very similar problem with a valid workaround. We were seeing traffic fail after converting the overlay to VXLAN, specifically for UDP traffic going from kube node -> Kubernetes service ClusterIP; TCP seemed fine. @fasaxc recommended that we try Calico 3.20.0 and use a new feature detection override that's available, which I can confirm makes this traffic flow work now. Details of the setup: Kubernetes 1.20.8. Example configuration used to apply the override:

kubectl get felixconfiguration default -o yaml
apiVersion: crd.projectcalico.org/v1
kind: FelixConfiguration
metadata:
name: default
spec:
bpfLogLevel: ""
featureDetectOverride: ChecksumOffloadBroken=true
logSeverityScreen: Info
  reportingInterval: 0s
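In case it's useful, the same override can be applied in one step with kubectl (assuming, as in the YAML above, that the crd.projectcalico.org FelixConfiguration named default is in use):

```
kubectl patch felixconfiguration default --type=merge \
  -p '{"spec":{"featureDetectOverride":"ChecksumOffloadBroken=true"}}'
```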
I am experiencing a 63 second delay in VXLAN communications node->node:nodeport or node->clusterip:port. After inspecting pcaps on both sending and receiving nodes it appears related to TCP offloading on the vxlan.calico interface. Disabling this through ethtool appears to 'resolve' the issue, but I'm entirely unsure whether this is a good idea or not, or if there's a better fix?
Expected Behavior
From a node, the following should work in all cases:
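(The original list was not preserved here; based on the rest of the report it covered access patterns along these lines, with the port numbers being illustrative.)

```
# From any node, each of these should connect without delay:
curl http://localhost:32081          # NodePort on the node itself
curl http://<other-node-ip>:32081    # NodePort on another node
curl http://<cluster-ip>:80          # the service's ClusterIP
```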
Current Behavior
Consider the following:
A: node running pod [web], a simple web service (containous/whoami)
B: node running pod [alpine], a base container; exec sh
C: node not doing anything in particular.
With service defined:
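(The manifest itself is not shown above; a NodePort service along the following lines, using the 32081 port referenced in the curls below, would match the description. The selector label is an assumption.)

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: NodePort
  selector:
    app: web            # assumed label on the containous/whoami pods
  ports:
    - port: 80
      targetPort: 80    # containous/whoami listens on 80
      nodePort: 32081
```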
I get the following results when trying to access the web service:
Further, if I change the replicas for the pods so that pods are on A & C (round-robin load balancing between the 2 nodes), the 63 second delay will occur half of the time from those hosting nodes:
The problem seems to stem from traffic sourced from one node being routed to another node. I did a trace and tcpdump from C -> A (via localhost:32081 on C... see below). On both nodes, the tcpdump shows repeated SYN packets attempting to establish the connection. They all show "bad udp cksum 0xffff -> 0x76dc!" in the results. After 63 seconds, a SYN packet is sent with 'no cksum' and the connection is established.
I had to disable TCP Offloading... after issuing this command, curl localhost:32081 worked consistently on all nodes.
So... I'm entirely unsure whether this is a good idea or not. Or whether there's a way to fix this through iptables? Or whether this needs to be fixed in the OS/hosting (vSphere)?
TRACE from Node C (client)
TCPDUMP from Node C (client)
TCPDUMP from Node A (hosting service pod)
Possible Solution
# ethtool --offload vxlan.calico rx off tx off
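To verify the setting took effect, the offload state can be checked with ethtool's --show-offload (-k) option; after the command above, the checksum features should report as off:

```
# Checksum offload features should now show "off"
ethtool --show-offload vxlan.calico | grep -i checksum
```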
Steps to Reproduce (for bugs)
Context
There are times when we need to be able to access a service from a node (e.g., log shipping from the node to a hosted service, hosted app API access, a k8s-hosted registry), and this defect will interfere with normal communications.
Your Environment