UDP access to a service from another node is broken with hostNetworking #6664

damonmaria · 2022-12-20T10:34:58Z

Environmental Info:
K3s Version:

k3s version v1.23.14+k3s1 (c62b03fb)
go version go1.17.13

Node(s) CPU architecture, OS, and Version: Linux ****** 5.4.0-135-generic #152-Ubuntu SMP Wed Nov 23 20:19:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

external etcd
2 servers
1 agent

Describe the bug:
We run multiple sites/clusters on k3s. We had no issues with v1.23.8+k3s1. But after upgrading to v1.23.14+k3s1 we regularly have issues looking up service names in coredns. It does not happen all the time (rebooting sometimes fixes it) but I have narrowed it down to a combination of all the following:

coredns is running on a different node to the pod
pod is using hostNetworking
accessing coredns over UDP (TCP works)
accessing coredns using its service IP (cluster IP works)

Steps To Reproduce:

Installed K3s: download from https://get.k3s.io/ and then:

INSTALL_K3S_VERSION=v1.23.14+k3s1 K3S_TOKEN=**** install.sh --datastore-endpoint=*** --datastore-cafile=*** --datastore-certfile=*** --datastore-keyfile=*** --disable=metrics-server,traefik --node-name=****

Expected behavior:
In this example coredns is running on proc2. I expect the following to work but it fails to connect:

# kubectl run -it --rm exclude --image=tutum/dnsutils --restart=Never --overrides='{"apiVersion": "v1", "spec": {"hostNetwork": true,"dnsPolicy": "ClusterFirstWithHostNet","nodeName": "proc1"}}' -- dig google.com
If you don't see a command prompt, try pressing enter.

; <<>> DiG 9.9.5-3ubuntu0.2-Ubuntu <<>> google.com
;; global options: +cmd
;; connection timed out; no servers could be reached

All of the following examples do work, though.

From the same node that the coredns service is on (proc2):

# kubectl run -it --rm exclude --image=tutum/dnsutils --restart=Never --overrides='{"apiVersion": "v1", "spec": {"hostNetwork": true,"dnsPolicy": "ClusterFirstWithHostNet","nodeName": "proc2"}}' -- dig google.com

Using TCP instead of UDP:

# kubectl run -it --rm exclude --image=tutum/dnsutils --restart=Never --overrides='{"apiVersion": "v1", "spec": {"hostNetwork": true,"dnsPolicy": "ClusterFirstWithHostNet","nodeName": "proc1"}}' -- dig +tcp google.com

Not using hostNetwork:

# kubectl run -it --rm exclude --image=tutum/dnsutils --restart=Never --overrides='{"apiVersion": "v1", "spec": {"dnsPolicy": "ClusterFirst","nodeName": "proc1"}}' -- dig google.com

Using the cluster IP of coredns:

# kubectl run -it --rm exclude --image=tutum/dnsutils --restart=Never --overrides='{"apiVersion": "v1", "spec": {"hostNetwork": true,"dnsPolicy": "ClusterFirstWithHostNet","nodeName": "proc1"}}' -- dig google.com @10.42.1.173
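The variants above differ only in the node name, whether hostNetwork is set, and the arguments passed to dig. As a rough sketch (not part of the original report), the repeated --overrides JSON could be generated by a small helper. The function names overrides_json and run_dig are hypothetical; the kubectl flags and JSON are copied from the commands above.

```shell
#!/bin/sh
# Hypothetical helper: build the --overrides JSON used in the commands above.
# $1 = node name, $2 = "true" for hostNetwork, anything else for pod networking.
overrides_json() {
    if [ "$2" = "true" ]; then
        printf '{"apiVersion": "v1", "spec": {"hostNetwork": true,"dnsPolicy": "ClusterFirstWithHostNet","nodeName": "%s"}}\n' "$1"
    else
        printf '{"apiVersion": "v1", "spec": {"dnsPolicy": "ClusterFirst","nodeName": "%s"}}\n' "$1"
    fi
}

# Hypothetical wrapper: run one dig variant on a given node.
# $1 = node name, $2 = hostNetwork flag, remaining arguments go to dig unchanged.
run_dig() {
    target_node="$1"
    use_hostnet="$2"
    shift 2
    kubectl run -it --rm exclude --image=tutum/dnsutils --restart=Never \
        --overrides="$(overrides_json "$target_node" "$use_hostnet")" -- dig "$@"
}

# Example calls (left commented out because they need a live cluster):
#   run_dig proc1 true google.com          # the failing case
#   run_dig proc1 true +tcp google.com     # TCP instead of UDP
#   run_dig proc1 false google.com         # no hostNetwork
```

Running the same wrapper with only one argument changed at a time makes it easier to see which single condition flips the result.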

Additional context / logs:
There is nothing in the coredns or k3s logs that seems relevant.


brandond · 2022-12-20T17:38:16Z

This sounds kind of like: flannel-io/flannel#1279

Do you see any different behavior if you run the ethtool command from that issue to disable tx checksum offloading?

I see that you're running Ubuntu - what version? What infrastructure is this running on? Bare metal, vsphere, ec2, etc?

brandond · 2022-12-20T17:40:08Z

also cc @thomasferrandiz @manuelbuil

damonmaria · 2022-12-20T18:58:49Z

@brandond Thanks for the fast response. This is running on Ubuntu 20.04 bare metal. We use Puppet to set up our OS and k3s, so it should be consistent across the 4 clusters we have.

I'll follow through that flannel issue and report back.

damonmaria · 2022-12-20T22:16:28Z

I have since rebooted that machine again and the problem no longer happens, so I can't test disabling checksum offloading with ethtool. It seems to be intermittent, with some reboots fixing it.

I'll close this for now. If it happens again I'll re-open the issue and provide more info.

damonmaria · 2022-12-21T05:33:35Z

@brandond Ended up in the same situation again.

Can confirm that running ethtool -K flannel.1 tx-checksum-ip-generic off on the target node host (the node where the coredns pod is) solves the issue.
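To see whether the workaround is currently in effect on a node, the feature list printed by ethtool -k (lowercase k) can be inspected. The following is a minimal sketch, assuming the usual "name: state" layout of that output; the function name feature_state is hypothetical.

```shell
#!/bin/sh
# Print the state ("on" or "off") of one offload feature.
# Reads `ethtool -k <iface>` output on stdin; $1 is the feature name.
feature_state() {
    awk -v feat="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)          # sub-features are indented
            n = index(line, ":")
            if (n > 0 && substr(line, 1, n - 1) == feat) {
                split(substr(line, n + 1), parts, " ")
                print parts[1]                 # drops a trailing "[fixed]" marker
                exit
            }
        }'
}

# Example call on a node (left commented out because it needs ethtool and flannel.1):
#   ethtool -k flannel.1 | feature_state tx-checksum-ip-generic
```

If the feature is not listed at all, the function prints nothing, which a caller can treat as "unknown".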

What is the appropriate action from here?

brandond · 2022-12-21T06:14:55Z

Your node has a kernel with broken tx checksum offload. You should disable it using that ethtool command during node startup.
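One way this could be done at startup is sketched below. It assumes that flannel.1 may not exist yet when the script starts (the interface is created once flannel is running), so it waits for the interface before calling the ethtool command from this thread. The function names, the 60-attempt default, and the SYSFS_NET variable are illustrative assumptions, and how the script is hooked into boot (a systemd unit, Puppet, etc.) is left open.

```shell
#!/bin/sh
# Directory where the kernel lists network interfaces.
# SYSFS_NET is a hypothetical override so the wait logic can be tried without root.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Return 0 once interface $1 exists, or 1 after $2 attempts spaced one second apart.
wait_for_iface() {
    iface="$1"
    tries="$2"
    while [ "$tries" -gt 0 ]; do
        [ -e "$SYSFS_NET/$iface" ] && return 0
        tries=$((tries - 1))
        [ "$tries" -gt 0 ] && sleep 1
    done
    return 1
}

# Wait for the interface, then disable tx checksum offload on it.
# $1 = interface name, $2 = optional number of attempts (default 60, an arbitrary choice).
disable_tx_offload() {
    wait_for_iface "$1" "${2:-60}" || { echo "$1 did not appear" >&2; return 1; }
    ethtool -K "$1" tx-checksum-ip-generic off
}

# Example call (left commented out because it needs root and ethtool):
#   disable_tx_offload flannel.1
```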

damonmaria · 2022-12-23T04:16:50Z

Thanks @brandond.

I've applied that across all our machines.

damonmaria closed this as completed Dec 20, 2022

damonmaria reopened this Dec 21, 2022

damonmaria closed this as completed Dec 23, 2022
