
TCP establishment timeouts #2017

Closed
johansmitsnl opened this issue Jul 13, 2020 · 13 comments

@johansmitsnl

Environmental Info:
K3s Version: v1.18.4+k3s1 (97b7a0e)

Node(s) CPU architecture, OS, and Version: Debian Buster 64bits Linux k00 4.19.0-8-amd64 #1 SMP Debian 4.19.98-1 (2020-01-26) x86_64 GNU/Linux

Cluster Configuration: 1 master, 3 workers

Describe the bug:

When running GitLab runners on the K3s cluster, I experience timeouts. See a full report here: https://gitlab.com/gitlab-org/gitlab-runner/-/issues/25380

Steps To Reproduce:

  • Installed K3s: here

Expected behavior:

The docker image build process completes successfully.

Actual behavior:

The docker image build process fails due to TCP connection establishment failures or unknown DNS host errors.

Additional context / logs:

Step 7/14 : RUN bundle install --retry=4
 ---> Running in f118d3bcb3df
Fetching gem metadata from https://rubygems.org/...........
Fetching public_suffix 4.0.5
Installing public_suffix 4.0.5
Fetching addressable 2.7.0
Retrying download gem from https://rubygems.org/ due to error (2/5): Gem::RemoteFetcher::UnknownHostError timed out (https://rubygems.org/gems/addressable-2.7.0.gem)
@brandond
Member

brandond commented Jul 16, 2020

Upstream Kubernetes just released a fix for a long-standing issue that triggered a kernel bug when using vxlan. I suspect this is the root cause of your problem. The fix has been backported and is available in the most recent k3s releases of 1.16, 1.17, and 1.18. Can you try updating and see if things work better?

See: #2013
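
If it helps, a rough sketch of re-running the install script to pick up a newer release (the pinned version below is just an example; use the latest patch release for your minor version):

# Re-run the k3s install script, pinning the release you want on the server node
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.18.6+k3s1" sh -

# Agent nodes additionally need the usual K3S_URL and K3S_TOKEN environment variables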

@johansmitsnl
Author

@brandond I just installed the newest version, v1.18.6+k3s1, and the issue is not solved yet. After installing the new version, is a reboot required, or should it work with just restarting the k3s service?
The kernel version I have is: Linux k00 4.19.0-8-amd64 #1 SMP Debian 4.19.98-1 (2020-01-26) x86_64 GNU/Linux

@brandond
Member

Hmm, from the message it's hard to tell whether it's a timeout, a DNS failure, or both. What is your node using for its DNS server?

@johansmitsnl
Author

$ cat /etc/resolv.conf 
search infrastructure.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.43.0.10
options ndots:5

@brandond
Member

Not in the container, on the node.

@johansmitsnl
Author

Sorry,

The node uses the name servers of our colocation provider directly:

$ cat /etc/resolv.conf 
nameserver 195.135.195.135
nameserver 195.8.195.8

@johansmitsnl
Author

My job failed at another point; I hope this helps:

Get:1 http://security.debian.org/debian-security buster/updates InRelease [65.4 kB]
Get:2 http://security.debian.org/debian-security buster/updates/main amd64 Packages [212 kB]
Err:3 http://deb.debian.org/debian buster InRelease
  Connection failed [IP: 151.101.36.204 80]
Get:4 http://deb.debian.org/debian buster-updates InRelease [51.9 kB]
Get:5 http://deb.debian.org/debian buster-updates/main amd64 Packages [7868 B]
Fetched 337 kB in 60s (5575 B/s)
Reading package lists...
W: Failed to fetch http://deb.debian.org/debian/dists/buster/InRelease  Connection failed [IP: 151.101.36.204 80]
W: Some index files failed to download. They have been ignored, or old ones used instead.
Reading package lists...
Building dependency tree...

@johansmitsnl
Author

I noticed other services having difficulty communicating:

E0720 17:50:13.072142       1 controller.go:131] cert-manager/controller/orders "msg"="re-queuing item  due to error processing" "error"="error creating new order: Get https://acme-v02.api.letsencrypt.org/directory: dial tcp: i/o timeout" "key"="blog-32-production/production-auto-deploy-tls-1681138177"

@johansmitsnl
Author

@brandond can I provide you with more information?

@johansmitsnl
Author

johansmitsnl commented Aug 18, 2020

It seems that it is packet related within the Docker-in-Docker environment: https://gitlab.com/gitlab-org/gitlab-runner/-/issues/25380#note_393553417

There's an MTU issue here with the default flannel network for k3s. The default CNI for k3s is flannel with a vxlan backend, which sets the MTU of the flannel network to the maximum MTU of the primary interface minus 50 bytes for the vxlan header.
The problem is that with the Docker-in-Docker container used as a service for Auto DevOps, the docker0 interface inside that build container doesn't inherit the MTU from flannel; it just gets the Docker default of 1500 bytes. What then happens, particularly with rubygems.org, is that the initial TCP exchange sets up a maximum segment size of 1460 for the TCP packets (docker0's 1500 bytes minus 40 bytes for the IP and TCP headers). rubygems.org backs off that a bit more, but not always enough to fit within the flannel network MTU, so inevitably you get a bunch of fragmented packets, out-of-order packets, retransmits, TCP resets, et al. It's quite the mess when you view a packet capture.
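
To make the arithmetic above concrete, here's a rough sketch of comparing the MTUs involved (interface names are the typical defaults and may differ on your nodes):

# On the k3s node:
ip link show eth0       # physical NIC, usually mtu 1500
ip link show flannel.1  # vxlan interface, mtu 1450 (1500 minus 50 bytes of vxlan header)
ip link show cni0       # pod bridge, also mtu 1450

# Inside the docker:dind service container, docker0 keeps the Docker default:
#   ip link show docker0   -> mtu 1500

# MSS arithmetic from the explanation above:
#   1500 (docker0) - 40 (IP + TCP headers) = 1460 advertised MSS
#   1450 (flannel) - 40                    = 1410 is the largest segment that fits unfragmented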

They also make a suggestion:

Or you can use MSS clamping in the host's iptables configuration (or at whatever network routing level makes sense for your infrastructure).

To me, the iptables rule sounds like a good option that K3s might inject itself? Something like the sketch below.
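
As a sketch of the generic MSS-clamping rule on each node (not something k3s applies on its own today):

# Clamp the MSS of forwarded TCP SYN packets to the path MTU
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu

# or clamp to an explicit value that fits inside the flannel vxlan MTU (1450 - 40 = 1410)
# iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1410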

@MrSaints

MrSaints commented Aug 25, 2020

I'm experiencing this problem too. Another user reported that switching to Cilium resolved their issue.

EDIT: setting the MTU fixed the problem
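
For anyone else hitting this, a rough sketch of what "setting the MTU" can look like for the inner Docker daemon (1450 assumes the default vxlan overhead on a 1500-byte NIC; adjust to your flannel MTU):

# Start the inner dockerd with an MTU that fits inside the flannel network
dockerd --mtu=1450

# or persistently via /etc/docker/daemon.json:
#   { "mtu": 1450 }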

@brandond
Member

brandond commented Dec 4, 2020

I believe this should now be resolved upstream.

@brandond brandond closed this as completed Dec 4, 2020
@johansmitsnl
Author

FYI, specifically for GitLab runners you can solve it like this: https://gitlab.com/gitlab-org/gitlab-runner/-/issues/25380#note_456368774
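
In case the link moves, the workaround generally takes the shape of passing the MTU to the docker:dind service in the job definition; a sketch (the linked note may use slightly different values or syntax):

# .gitlab-ci.yml (sketch): run the dind service with an MTU that fits the flannel network
services:
  - name: docker:dind
    command: ["--mtu=1450"]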
