UDP access to a service from another node is broken with hostNetworking #6664

Closed
damonmaria opened this issue Dec 20, 2022 · 7 comments

Comments

@damonmaria

Environmental Info:
K3s Version:

k3s version v1.23.14+k3s1 (c62b03fb)
go version go1.17.13

Node(s) CPU architecture, OS, and Version: Linux ****** 5.4.0-135-generic #152-Ubuntu SMP Wed Nov 23 20:19:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

  • external etcd
  • 2 servers
  • 1 agent

Describe the bug:
We run multiple sites/clusters on k3s. We had no issues with v1.23.8+k3s1. But after upgrading to v1.23.14+k3s1 we regularly have issues looking up service names in coredns. It does not happen all the time (rebooting sometimes fixes it) but I have narrowed it down to a combination of all the following:

  • coredns is running on a different node to the pod
  • pod is using hostNetworking
  • accessing coredns over UDP (TCP works)
  • accessing coredns using its service IP (querying the coredns pod by its cluster-network IP works)

Steps To Reproduce:

  • Installed K3s: download from https://get.k3s.io/ and then: INSTALL_K3S_VERSION=v1.23.14+k3s1 K3S_TOKEN=**** install.sh --datastore-endpoint=*** --datastore-cafile=*** --datastore-certfile=*** --datastore-keyfile=*** --disable=metrics-server,traefik --node-name=****

Expected behavior:
In this example coredns is running on proc2. I expect the following to work but it fails to connect:

# kubectl run -it --rm exclude --image=tutum/dnsutils --restart=Never --overrides='{"apiVersion": "v1", "spec": {"hostNetwork": true,"dnsPolicy": "ClusterFirstWithHostNet","nodeName": "proc1"}}' -- dig google.com
If you don't see a command prompt, try pressing enter.

; <<>> DiG 9.9.5-3ubuntu0.2-Ubuntu <<>> google.com
;; global options: +cmd
;; connection timed out; no servers could be reached

All of the following examples do work, though.

From the same node that the coredns service is on (proc2):

# kubectl run -it --rm exclude --image=tutum/dnsutils --restart=Never --overrides='{"apiVersion": "v1", "spec": {"hostNetwork": true,"dnsPolicy": "ClusterFirstWithHostNet","nodeName": "proc2"}}' -- dig google.com

Using TCP instead of UDP:

# kubectl run -it --rm exclude --image=tutum/dnsutils --restart=Never --overrides='{"apiVersion": "v1", "spec": {"hostNetwork": true,"dnsPolicy": "ClusterFirstWithHostNet","nodeName": "proc1"}}' -- dig +tcp google.com

Not using hostNetwork:

# kubectl run -it --rm exclude --image=tutum/dnsutils --restart=Never --overrides='{"apiVersion": "v1", "spec": {"dnsPolicy": "ClusterFirst","nodeName": "proc1"}}' -- dig google.com

Using the IP of the coredns pod on the cluster network directly (bypassing the service IP):

# kubectl run -it --rm exclude --image=tutum/dnsutils --restart=Never --overrides='{"apiVersion": "v1", "spec": {"hostNetwork": true,"dnsPolicy": "ClusterFirstWithHostNet","nodeName": "proc1"}}' -- dig google.com @10.42.1.173
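
The commands above differ only in the node, the hostNetwork setting and the arguments passed to dig, so the test matrix can be sketched with two small helpers. This is an illustration only: `dig_overrides` and `run_dig` are hypothetical names, and the JSON is copied from the commands in this report.

```shell
# Build the --overrides JSON used in the commands above.
# $1 = node name, $2 = "true" for hostNetwork, anything else for pod networking.
dig_overrides() {
  if [ "$2" = "true" ]; then
    printf '{"apiVersion": "v1", "spec": {"hostNetwork": true,"dnsPolicy": "ClusterFirstWithHostNet","nodeName": "%s"}}' "$1"
  else
    printf '{"apiVersion": "v1", "spec": {"dnsPolicy": "ClusterFirst","nodeName": "%s"}}' "$1"
  fi
}

# Run one throwaway dig pod.
# $1 = node, $2 = hostNetwork flag, remaining arguments are passed to dig.
run_dig() {
  node="$1"; hostnet="$2"; shift 2
  kubectl run -it --rm exclude --image=tutum/dnsutils --restart=Never \
    --overrides="$(dig_overrides "$node" "$hostnet")" -- dig "$@"
}
```

With these, `run_dig proc1 true google.com` corresponds to the failing case, while `run_dig proc2 true google.com`, `run_dig proc1 true +tcp google.com` and `run_dig proc1 false google.com` correspond to the working ones. Because of `-it`, the helper is meant to be run from an interactive terminal.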

Additional context / logs:
There is nothing in the coredns or k3s logs that seems relevant.

@brandond
Member

brandond commented Dec 20, 2022

This sounds kind of like: flannel-io/flannel#1279

Do you see any different behavior if you run the ethtool command from that issue to disable tx checksum offloading?

I see that you're running Ubuntu - what version? What infrastructure is this running on? Bare metal, vsphere, ec2, etc?

@brandond
Member

also cc @thomasferrandiz @manuelbuil

@damonmaria
Author

@brandond Thanks for the fast response. This is running on Ubuntu 20.04 on bare metal. We use Puppet to set up our OS and k3s, so it should be consistent across the 4 clusters we have.

I'll follow through that flannel issue and report back.

@damonmaria
Author

I have since rebooted that machine again and the problem no longer happens, so I can't test whether disabling checksum offloading with ethtool helps. It seems to be intermittent, with some reboots fixing it.

I'll close this for now. If it happens again I'll re-open the issue and provide more info.

@damonmaria
Author

@brandond Ended up in the same situation again.

Can confirm that running ethtool -K flannel.1 tx-checksum-ip-generic off on the target node host (the node where the coredns pod is) solves the issue.

What is the appropriate action from here?

damonmaria reopened this Dec 21, 2022
@brandond
Member

Your node has a kernel with broken tx checksum offload. You should disable it using that ethtool command during node startup.
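
One way to apply this at startup is sketched below. It is an assumption, not something stated in this thread, that the flannel.1 interface only appears some time after k3s starts, so the sketch polls for it before calling ethtool. The function names and the `IFACE` and `SYSFS_NET` variables are hypothetical, and how the script is hooked into boot (a systemd unit, Puppet, and so on) is left open.

```shell
# Assumption: flannel.1 is created some time after k3s starts, so poll for it.
IFACE="${IFACE:-flannel.1}"
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"   # overridable, e.g. for testing

# Wait up to $1 seconds for the interface to appear.
# Returns non-zero if it is still missing when the time runs out.
wait_for_iface() {
  remaining="$1"
  while [ ! -e "$SYSFS_NET/$IFACE" ]; do
    [ "$remaining" -gt 0 ] || return 1
    sleep 1
    remaining=$((remaining - 1))
  done
}

# Disable tx checksum offload once the interface exists (needs root).
# $1 = optional timeout in seconds, default 120.
disable_tx_offload() {
  wait_for_iface "${1:-120}" && ethtool -K "$IFACE" tx-checksum-ip-generic off
}
```

Running `disable_tx_offload` as root on each node after boot issues the same ethtool command confirmed earlier in this thread.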

@damonmaria
Author

Thanks @brandond.

I've applied that across all our machines.
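
To confirm the setting took effect on a node, the feature list printed by `ethtool -k` can be checked. A minimal sketch, assuming the usual `name: on|off [fixed]` output layout of `ethtool -k`; `tx_offload_state` is a hypothetical helper name.

```shell
# Read `ethtool -k <iface>` output on stdin and print the state ("on" or "off")
# of the tx-checksum-ip-generic feature, ignoring any trailing "[fixed]" marker.
tx_offload_state() {
  awk -F': *' '$1 ~ /^[ \t]*tx-checksum-ip-generic$/ { split($2, a, " "); print a[1] }'
}
```

For example, `ethtool -k flannel.1 | tx_offload_state` should print `off` on a node where the workaround is active.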
