Pod to Pod Communication severely degraded in 4.11 on vSphere #1550

Closed
MattPOlson opened this issue Mar 31, 2023 · 51 comments
@MattPOlson

Describe the bug

We run okd in a vSphere environment with the below configuration:

vSphere:
ESXi version: 7.0 U3e
Separate vDS (on version 6.5) for Front End and iSCSI

Hardware:
UCS B200-M4 Blade
	BIOS - B200M4.4.1.2a.0.0202211902
	Xeon(R) CPU E5-2667
	2 x 20Gb Cisco UCS VIC 1340 network adapter for front end connectivity (Firmware 4.5(1a))
	2 x 20Gb Cisco UCS VIC 1340 network adapter for iSCSI connectivity (Firmware 4.5(1a))
	
Storage:
Compellent SC4020 over iSCSI
	2 controller array with dual iSCSI IP connectivity (2 paths per LUN)
All cluster nodes on same Datastore

After upgrading the cluster from a 4.10.x version to anything above 4.11.x, pod-to-pod communication is severely degraded when the pods run on nodes hosted on different ESXi hosts. We ran a benchmark test on the cluster before the upgrade with the results below:

Benchmark Results

Name : knb-2672
Date : 2023-03-29 15:26:01 UTC
Generator : knb
Version : 1.5.0
Server : k8s-se-internal-01-582st-worker-n2wtp
Client : k8s-se-internal-01-582st-worker-cv7cd
UDP Socket size : auto

Discovered CPU : Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
Discovered Kernel : 5.18.5-100.fc35.x86_64
Discovered k8s version : v1.23.5-rc.0.2076+8cfebb1ce4a59f-dirty
Discovered MTU : 1400
Idle :
bandwidth = 0 Mbit/s
client cpu = total 12.31% (user 9.41%, nice 0.00%, system 2.83%, iowait 0.07%, steal 0.00%)
server cpu = total 9.04% (user 6.28%, nice 0.00%, system 2.74%, iowait 0.02%, steal 0.00%)
client ram = 4440 MB
server ram = 3828 MB
Pod to pod :
TCP :
bandwidth = 6306 Mbit/s
client cpu = total 26.15% (user 5.19%, nice 0.00%, system 20.96%, iowait 0.00%, steal 0.00%)
server cpu = total 29.39% (user 8.13%, nice 0.00%, system 21.26%, iowait 0.00%, steal 0.00%)
client ram = 4460 MB
server ram = 3820 MB
UDP :
bandwidth = 1424 Mbit/s
client cpu = total 26.08% (user 7.21%, nice 0.00%, system 18.82%, iowait 0.05%, steal 0.00%)
server cpu = total 24.82% (user 6.72%, nice 0.00%, system 18.05%, iowait 0.05%, steal 0.00%)
client ram = 4444 MB
server ram = 3824 MB
Pod to Service :
TCP :
bandwidth = 6227 Mbit/s
client cpu = total 27.90% (user 5.12%, nice 0.00%, system 22.73%, iowait 0.05%, steal 0.00%)
server cpu = total 29.85% (user 5.86%, nice 0.00%, system 23.99%, iowait 0.00%, steal 0.00%)
client ram = 4439 MB
server ram = 3811 MB
UDP :
bandwidth = 1576 Mbit/s
client cpu = total 32.31% (user 6.41%, nice 0.00%, system 25.90%, iowait 0.00%, steal 0.00%)
server cpu = total 26.12% (user 5.68%, nice 0.00%, system 20.39%, iowait 0.05%, steal 0.00%)
client ram = 4449 MB
server ram = 3818 MB

After upgrading to version 4.11.0-0.okd-2023-01-14-152430, the latency between the pods is so high that the benchmark test, qperf test, and iperf test all time out and fail to run. This is the result of curling the network check pod across nodes; it takes close to 30 seconds.

sh-4.4# time curl http://10.129.2.44:8080
Hello, 10.128.2.2. You have reached 10.129.2.44 on k8s-se-internal-01-582st-worker-cv7cd
real    0m26.496s

We have been able to reproduce this issue consistently on multiple different clusters.

Version
4.11.0-0.okd-2023-01-14-152430
IPI on vSphere

How reproducible
Upgrade or install a 4.11.x or higher version of OKD and observe the latency.

@rvanderp3
Contributor

What is the VMware hardware version of the VMs?

@MattPOlson
Author

They are:
ESXi 6.7 U2 and later (VM version 15)

@vrutkovs
Member

Is it reproducible in 4.12?

@MattPOlson
Author

Yes, we upgraded a cluster to 4.12 and were able to reproduce it.

@vrutkovs
Member

Right, so it's possibly a kernel module or OVN regression. Could you check whether node-to-node performance has degraded too? If so, it's probably a Fedora / kernel regression.

@MattPOlson
Author

Node-to-node performance is good; I tested on the nodes themselves using the toolbox:

[  1] local 10.33.154.189 port 32934 connected with 10.33.154.187 port 5001 (icwnd/mss/irtt=14/1448/241)
[ ID] Interval       Transfer     Bandwidth
[  1] 0.00-10.01 sec  8.01 GBytes  6.87 Gbits/sec
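
(For reference, the node-to-node result above is consistent with plain iperf2 defaults on port 5001, i.e. roughly the following, run from toolbox on each node; the exact flags used aren't shown in this thread:)

iperf -s                    # on 10.33.154.187 (server)
iperf -c 10.33.154.187      # on 10.33.154.189 (client)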

@vrutkovs
Member

In that case it's probably OVN. I wonder if we could confirm it's OKD-specific?

@MattPOlson
Author

We have a few OCP clusters at 4.11 and I haven't been able to reproduce the problem in them.

@ptudor
Contributor

ptudor commented Apr 4, 2023

When you jump from 4.11 (4.11.0-0.okd-2022-08-20-022919 or so) to 4.12.0-0.okd-2023-02-18-033438 or newer you should see better network performance.

@MattPOlson
Author

When you jump from 4.11 (4.11.0-0.okd-2022-08-20-022919 or so) to 4.12.0-0.okd-2023-02-18-033438 or newer you should see better network performance.

That is not the case for us; I have upgraded to version 4.12.0-0.okd-2023-04-01-051724 and am still seeing the same issues. I still can't run any tests across pods without them timing out.

@MattPOlson
Author

Do we need to open an issue in the ovn-kubernetes repository?
How do I figure out which release of ovn-kubernetes is in a specific version of OKD?

@Reamer
Contributor

Reamer commented Apr 6, 2023

I think you should open an issue in the repository https://github.com/openshift/ovn-kubernetes/.
Depending on the OKD version you are using, select the corresponding branch in https://github.com/openshift/ovn-kubernetes/ to see the current source code of ovn-kubernetes.

@MattPOlson
Author

We re-deployed the cluster with version 4.10.0-0.okd-2022-07-09-073606 on the same hardware and the issue went away. There is clearly an issue with 4.11 and above. Benchmark results are below:

=========================================================
 Benchmark Results
=========================================================
 Name            : knb-17886
 Date            : 2023-04-10 19:46:01 UTC
 Generator       : knb
 Version         : 1.5.0
 Server          : k8s-se-platform-01-t4fb6-worker-vw2d9
 Client          : k8s-se-platform-01-t4fb6-worker-jk2wm
 UDP Socket size : auto
=========================================================
  Discovered CPU         : Intel(R) Xeon(R) Gold 6334 CPU @ 3.60GHz
  Discovered Kernel      : 5.18.5-100.fc35.x86_64
  Discovered k8s version : v1.23.5-rc.0.2076+8cfebb1ce4a59f-dirty
  Discovered MTU         : 1400
  Idle :
    bandwidth = 0 Mbit/s
    client cpu = total 4.06% (user 2.17%, nice 0.00%, system 1.82%, iowait 0.07%, steal 0.00%)
    server cpu = total 2.96% (user 1.48%, nice 0.00%, system 1.48%, iowait 0.00%, steal 0.00%)
    client ram = 925 MB
    server ram = 1198 MB
  Pod to pod :
    TCP :
      bandwidth = 8348 Mbit/s
      client cpu = total 26.07% (user 1.78%, nice 0.00%, system 24.27%, iowait 0.02%, steal 0.00%)
      server cpu = total 26.59% (user 1.94%, nice 0.00%, system 24.63%, iowait 0.02%, steal 0.00%)
      client ram = 930 MB
      server ram = 1196 MB
    UDP :
      bandwidth = 1666 Mbit/s
      client cpu = total 19.21% (user 2.14%, nice 0.00%, system 17.02%, iowait 0.05%, steal 0.00%)
      server cpu = total 22.51% (user 2.91%, nice 0.00%, system 19.55%, iowait 0.05%, steal 0.00%)
      client ram = 924 MB
      server ram = 1201 MB
  Pod to Service :
    TCP :
      bandwidth = 8274 Mbit/s
      client cpu = total 26.55% (user 1.78%, nice 0.00%, system 24.77%, iowait 0.00%, steal 0.00%)
      server cpu = total 26.37% (user 2.67%, nice 0.00%, system 23.68%, iowait 0.02%, steal 0.00%)
      client ram = 922 MB
      server ram = 1191 MB
    UDP :
      bandwidth = 1635 Mbit/s
      client cpu = total 20.19% (user 1.60%, nice 0.00%, system 18.54%, iowait 0.05%, steal 0.00%)
      server cpu = total 21.80% (user 2.82%, nice 0.00%, system 18.98%, iowait 0.00%, steal 0.00%)
      client ram = 913 MB
      server ram = 1179 MB
=========================================================

=========================================================
qperf
======================================================

/ # qperf 10.130.2.15 tcp_bw tcp_lat
tcp_bw:
    bw  =  907 MB/sec
tcp_lat:
    latency  =  70.6 us
/ # qperf 10.130.2.15 tcp_bw tcp_lat
tcp_bw:
    bw  =  1 GB/sec
tcp_lat:
    latency  =  68.2 us

===

@MattPOlson
Author

I tested this on a cluster using OpenShiftSDN: deployed version 4.10, upgraded to 4.11, and was able to replicate the issue. So, it's not specific to OVN.

@imdmahajankanika

imdmahajankanika commented Apr 14, 2023

I tested this on a cluster using OpenShiftSDN: deployed version 4.10, upgraded to 4.11, and was able to replicate the issue. So, it's not specific to OVN.

So, the issue #1563 for 4.11.0-0.okd-2022-12-02-145640 is reproducible in that case.

@jcpowermac

  • Version and build numbers of ESXi please
  • FCOS kernel version
  • Have you tried a test where all the OKD nodes are on the same physical ESXi host?

@imdmahajankanika

  • Version and build numbers of ESXi please
  • FCOS kernel version
  • Have you tried a test where all the OKD nodes are on the same physical ESXi host?

Yes, for #1563, I checked that all the nodes (master, storage, and worker, except the remote worker nodes) are already on the same ESXi host (ESXi 6.7 and later (VM version 14)).

@jcpowermac

I will try to reproduce here, but it would be good to know if I am replicating what was already provisioned. Again, can I get the ESXi version and build numbers and the FCOS kernel version - please be specific.

Remember vSphere 6.x is EOL, and some older versions have issues with VXLAN with ESXi and kernel drivers.

@MattPOlson
Author

I will try to reproduce here, but it would be good to know if I am replicating what was already provisioned. Again, can I get the ESXi version and build numbers and the FCOS kernel version - please be specific.

Remember vSphere 6.x is EOL, and some older versions have issues with VXLAN with ESXi and kernel drivers.

In our case the ESXi version info is in the initial post:

Linux version 6.0.18-200.fc36.x86_64 ([email protected]) (gcc (GCC) 12.2.1 20221121 (Red Hat 12.2.1-4), GNU ld version 2.37-37.fc36) #1 SMP PREEMPT_DYNAMIC Sat Jan 7 17:08:48 UTC 2023

vSphere:
ESXi version: 7.0 U3e
Separate vDS (on version 6.5) for Front End and iSCSI

@jcpowermac

jcpowermac commented Apr 14, 2023

OKD version: 4.12.0-0.okd-2023-04-01-051724
FCOS kernel version: 6.1.14-200.fc37.x86_64

ESXi: VMware ESXi, 8.0.0, 20513097

client

sh-5.2$ iperf3 -i 5 -t 60 -c  10.129.2.8
Connecting to host 10.129.2.8, port 5201
[  5] local 10.128.2.18 port 48400 connected to 10.129.2.8 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-5.00   sec  3.88 GBytes  6.67 Gbits/sec  301   1.19 MBytes       
[  5]   5.00-10.00  sec  3.77 GBytes  6.49 Gbits/sec  630   1.12 MBytes       
[  5]  10.00-15.00  sec  3.93 GBytes  6.75 Gbits/sec   92   1.67 MBytes       
[  5]  15.00-20.00  sec  3.83 GBytes  6.58 Gbits/sec  400   1.06 MBytes       
[  5]  20.00-25.00  sec  3.22 GBytes  5.54 Gbits/sec  5329   1.02 MBytes       
[  5]  25.00-30.00  sec  3.41 GBytes  5.85 Gbits/sec  184   1.45 MBytes       
^C[  5]  30.00-34.66  sec  3.21 GBytes  5.92 Gbits/sec  874   1.20 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-34.66  sec  25.3 GBytes  6.26 Gbits/sec  7810             sender
[  5]   0.00-34.66  sec  0.00 Bytes  0.00 bits/sec                  receiver
iperf3: interrupt - the client has terminated
sh-5.2$ 

server

Accepted connection from 10.128.2.18, port 48394
[  5] local 10.129.2.8 port 5201 connected to 10.128.2.18 port 48400
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-5.00   sec  3.88 GBytes  6.67 Gbits/sec                  
[  5]   5.00-10.00  sec  3.78 GBytes  6.49 Gbits/sec                  
[  5]  10.00-15.00  sec  3.93 GBytes  6.74 Gbits/sec                  
[  5]  15.00-20.00  sec  3.83 GBytes  6.58 Gbits/sec                  
[  5]  20.00-25.00  sec  3.22 GBytes  5.54 Gbits/sec                  
[  5]  25.00-30.00  sec  3.41 GBytes  5.85 Gbits/sec                  
[  5]  25.00-30.00  sec  3.41 GBytes  5.85 Gbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-30.00  sec  25.3 GBytes  7.23 Gbits/sec                  receiver
iperf3: the client has terminated
-----------------------------------------------------------
Server listening on 5201 (test #3)
-----------------------------------------------------------

@jcpowermac

Not really seeing a problem here.
The pods in the above test were on different FCOS nodes, residing on different physical ESXi hosts.

@MattPOlson
Author

Not really seeing a problem here. The pods in the above test were on different FCOS nodes, residing on different physical ESXi hosts.

I can't even get iperf tests to run when the pods are on nodes on different ESXi hosts; they just time out.

@jcpowermac

(image attachment)

@MattPOlson
Author

In my case I can't even get the console up anymore. I've reproduced it over and over, and now it looks like someone else has as well. Not sure what to do other than stay at 4.10.

@jcpowermac

jcpowermac commented Apr 14, 2023

@MattPOlson based on your previous comments this looks to me like MTU or something with VXLAN. Have you checked the MTU on all the virtual switches and physical devices?

And am I correct in stating that when all the guests reside together there is no performance issue? Is it a specific ESXi host that is OK?

@MattPOlson
Author

@MattPOlson based on your previous comments this looks to me like MTU or something with VXLAN. Have you checked the MTU on all the virtual switches and physical devices?

And am I correct in stating that when all the guests reside together there is no performance issue? Is it a specific ESXi host that is OK?

We've tried MTU settings and set them to match the host. But why would that affect 4.11 and not 4.10? I can spin up a cluster on 4.10 and it works perfectly; upgrade it to 4.11 and change nothing else and it breaks.

And yeah, if all the nodes reside together there is no issue.

@jcpowermac

And yeah, if all the nodes reside together there is no issue.

Don't you find that odd? If the problem occurs when packets are leaving the ESXi host, then I would suspect something physical. I can't comment on why the version would make a difference, but I can't reproduce it.

@MattPOlson
Author

And yeah, if all the nodes reside together there is no issue.

Don't you find that odd? If the problem occurs when packets are leaving the ESXi host, then I would suspect something physical. I can't comment on why the version would make a difference, but I can't reproduce it.

Agreed, but I also find it odd that upgrading to 4.11 breaks it and someone else was able to reproduce it. To me that feels like it's not something specific to our environment.

@jcpowermac

And yeah, if all the nodes reside together there is no issue.

Don't you find that odd? If the problem occurs when packets are leaving the ESXi host, then I would suspect something physical. I can't comment on why the version would make a difference, but I can't reproduce it.

Agreed, but I also find it odd that upgrading to 4.11 breaks it and someone else was able to reproduce it. To me that feels like it's not something specific to our environment.

We do test OCP and OKD in multiple different vSphere environments and haven't seen this issue. Maybe you and @imdmahajankanika stumbled into the same problem?

The question is what is the commonality.

@MattPOlson
Author

And yeah, if all the nodes reside together there is no issue.

Don't you find that odd? If the problem occurs when packets are leaving the ESXi host, then I would suspect something physical. I can't comment on why the version would make a difference, but I can't reproduce it.

Agreed, but I also find it odd that upgrading to 4.11 breaks it and someone else was able to reproduce it. To me that feels like it's not something specific to our environment.

We do test OCP and OKD in multiple different vSphere environments and haven't seen this issue. Maybe you and @imdmahajankanika stumbled into the same problem?

The question is what is the commonality.

Right, that is the question.

Interestingly, we have a few OCP clusters running at 4.11 on the exact same hardware and don't see the issue there.

@bo0ts

bo0ts commented Apr 15, 2023

After upgrading our clusters from 4.10.0-0.okd-2022-07-09-073606 to 4.11.0-0.okd-2023-01-14-152430, connectivity between kube-apiserver and all other apiservers was lost. Our master nodes all run on vSphere. We could fix the issue by running:

ethtool -K ens192 tx-udp_tnl-segmentation off
ethtool -K ens192 tx-udp_tnl-csum-segmentation off

on the master nodes. I have a feeling that we are looking at the very old bug: openshift/machine-config-operator#2482

Could you check the state of tunnel offloading on your nodes with ethtool -k <your-primary-interface> | grep tx-udp?

@jcpowermac

That shouldn't be an issue, as we haven't removed the workaround:

https://github.com/openshift/machine-config-operator/blob/master/templates/common/_base/files/vsphere-disable-vmxnet3v4-features.yaml

And it is present in specific older ESXi versions; if you are hitting the VXLAN offloading bug, you need to upgrade your hosts.
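
For context, that workaround is a NetworkManager dispatcher script that switches the vmxnet3 UDP tunnel offloads off when an interface comes up. Paraphrased (a sketch, not the exact file contents), it amounts to:

#!/bin/bash
# /etc/NetworkManager/dispatcher.d/99-vsphere-disable-tx-udp-tnl (sketch)
# NetworkManager passes the interface name as $1 and the event as $2
if [ "$2" = "up" ]; then
    ethtool -K "$1" tx-udp_tnl-segmentation off
    ethtool -K "$1" tx-udp_tnl-csum-segmentation off
fi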

@MattPOlson
Author

That shouldn't be an issue, as we haven't removed the workaround:

https://github.com/openshift/machine-config-operator/blob/master/templates/common/_base/files/vsphere-disable-vmxnet3v4-features.yaml

And it is present in specific older ESXi versions; if you are hitting the VXLAN offloading bug, you need to upgrade your hosts.

I think I figured it out: the workaround isn't working anymore. It looks like there is a permission issue, and NetworkManager-dispatcher.service is failing to run the scripts in /etc/NetworkManager/dispatcher.d/, including 99-vsphere-disable-tx-udp-tnl:

Apr 03 14:57:51 k8s-se-internal-01-582st-master-0 nm-dispatcher[1466]: req:12 'connectivity-change': find-scripts: Failed to open dispatcher directory '/etc/NetworkManager/dispatcher.d': Error opening directory “/etc/NetworkManager/dispatcher.d”: Permission denied
Apr 03 14:57:51 k8s-se-internal-01-582st-master-0 nm-dispatcher[1466]: req:13 'up' [ens192]: find-scripts: Failed to open dispatcher directory '/etc/NetworkManager/dispatcher.d': Error opening directory “/etc/NetworkManager/dispatcher.d”: Permission denied
Apr 03 14:57:51 k8s-se-internal-01-582st-master-0 nm-dispatcher[1466]: req:14 'up' [br-ex]: find-scripts: Failed to open dispatcher directory '/etc/NetworkManager/dispatcher.d': Error opening directory “/etc/NetworkManager/dispatcher.d”: Permission denied
Apr 03 14:57:51 k8s-se-internal-01-582st-master-0 nm-dispatcher[1466]: req:15 'pre-up' [ens192]: find-scripts: Failed to open dispatcher directory '/etc/NetworkManager/dispatcher.d/pre-up.d': Error opening directory “/etc/NetworkManager/dispatcher.d/pre-up.d”: Permission denied
Apr 03 14:57:51 k8s-se-internal-01-582st-master-0 nm-dispatcher[1466]: req:16 'up' [ens192]: find-scripts: Failed to open dispatcher directory '/etc/NetworkManager/dispatcher.d': Error opening directory “/etc/NetworkManager/dispatcher.d”: Permission denied
Apr 03 14:57:51 k8s-se-internal-01-582st-master-0 nm-dispatcher[1466]: req:17 'dhcp4-change' [br-ex]: find-scripts: Failed to open dispatcher directory '/etc/NetworkManager/dispatcher.d': Error opening directory “/etc/NetworkManager/dispatcher.d”: Permission denied
Apr 03 14:57:51 k8s-se-internal-01-582st-master-0 nm-dispatcher[1466]: req:18 'pre-up' [br-ex]: find-scripts: Failed to open dispatcher directory '/etc/NetworkManager/dispatcher.d/pre-up.d': Error opening directory “/etc/NetworkManager/dispatcher.d/pre-up.d”: Permission denied
Apr 03 14:57:51 k8s-se-internal-01-582st-master-0 nm-dispatcher[1466]: req:19 'up' [br-ex]: find-scripts: Failed to open dispatcher directory '/etc/NetworkManager/dispatcher.d': Error opening directory “/etc/NetworkManager/dispatcher.d”: Permission denied
Apr 03 14:57:51 k8s-se-internal-01-582st-master-0 nm-dispatcher[1466]: req:20 'connectivity-change': find-scripts: Failed to open dispatcher directory '/etc/NetworkManager/dispatcher.d': Error opening directory “/etc/NetworkManager/dispatcher.d”: Permission denied
Apr 03 14:58:01 k8s-se-internal-01-582st-master-0 systemd[1]: NetworkManager-dispatcher.service: Deactivated successfully.

Disabling tunnel offloading seems to fix the problem. I'm looking into why that script is now getting permission denied errors.

@jcpowermac

VMware only updated the release notes for 6.7 to say that this issue is resolved; I am unsure which 7.x build fixes it.
We are running close to the latest versions of 7 and 8.

We certainly need to figure out the permission issue, which is strange. I would figure other dispatcher scripts would be breaking too.

@MattPOlson
Author

It looks like this bug was already reported and fixed, but I'm definitely still seeing the issue in our environment:

#1317

@vrutkovs
Member

Perhaps it's also #1475?

@MattPOlson
Author

I think I figured something out: the script that rhcos-selinux-policy-upgrade.service executes to reload SELinux never actually runs the reload, because it looks for RHEL_VERSION in /usr/lib/os-release.

That variable exists in Red Hat Enterprise Linux CoreOS but not in Fedora, so the script never reaches the line that calls semodule -B:

#!/bin/bash
# Executed by rhcos-selinux-policy-upgrade.service
set -euo pipefail

RHEL_VERSION=$(. /usr/lib/os-release && echo ${RHEL_VERSION:-})
echo -n "RHEL_VERSION=${RHEL_VERSION:-}"
case "${RHEL_VERSION:-}" in
  8.[0-6]) echo "Checking for policy recompilation";;
  *) echo "Assuming we have new enough ostree"; exit 0;;
esac

ls -al /{usr/,}etc/selinux/targeted/policy/policy.31
if ! cmp --quiet /{usr/,}etc/selinux/targeted/policy/policy.31; then
    echo "Recompiling policy due to local modifications as workaround for https://bugzilla.redhat.com/2057497"
    semodule -B
fi
cat /usr/lib/os-release
NAME="Fedora Linux"
VERSION="37.20230303.3.0 (CoreOS)"
ID=fedora
VERSION_ID=37
VERSION_CODENAME=""
PLATFORM_ID="platform:f37"
PRETTY_NAME="Fedora CoreOS 37.20230303.3.0"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:37"
HOME_URL="https://getfedora.org/coreos/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora-coreos/"
SUPPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
BUG_REPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=37
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=37
SUPPORT_END=2023-11-14
VARIANT="CoreOS"
VARIANT_ID=coreos
OSTREE_VERSION='37.20230303.3.0'
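
To illustrate: the os-release dump above has no RHEL_VERSION at all, so in the script the variable expands to an empty string, the case statement falls through to the "*" branch, and the script exits before ever reaching semodule -B. A quick way to see that (a sketch, nothing OKD-specific):

# on an FCOS node: RHEL_VERSION is unset, so this prints an empty value
. /usr/lib/os-release && echo "RHEL_VERSION='${RHEL_VERSION:-}'"
# -> RHEL_VERSION=''  which matches the "*" pattern and exits 0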

@bo0ts

bo0ts commented Apr 17, 2023

VMware only updated the release notes for 6.7 to say that this issue is resolved; I am unsure which 7.x build fixes it. We are running close to the latest versions of 7 and 8.

@jcpowermac We are running on ESXi 7u3, where this issue should already be fixed. Maybe this has come up again in newer versions of vmxnet3?

@jcpowermac

VMware only updated the release notes for 6.7 to say that this issue is resolved; I am unsure which 7.x build fixes it. We are running close to the latest versions of 7 and 8.

@jcpowermac We are running on ESXi 7u3, where this issue should already be fixed. Maybe this has come up again in newer versions of vmxnet3?

@bo0ts I would suggest opening a support request with VMware. They own both aspects of this: the Linux kernel driver [0] and ESXi.

[0] - https://github.com/torvalds/linux/commits/master/drivers/net/vmxnet3

@kai-uwe-rommel

Over in the Slack thread there is also discussion about why and where this occurs. We don't see that problem on our clusters. But perhaps we simply don't do enough intra-cluster communication? Can you give us an easy test so I can verify whether we indeed do not have the problem or just don't see it?

@MattPOlson
Author

The easiest way is to deploy an iperf client on one node and an iperf server on another node, then run tests between them to check performance.
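
A minimal sketch of that setup (the node names, pod names, and image here are placeholders, not taken from this issue):

# server pod pinned to a worker on one ESXi host
oc run iperf-server --image=quay.io/fedora/fedora:38 \
  --overrides='{"spec":{"nodeName":"worker-a"}}' \
  --command -- sleep infinity
# client pod pinned to a worker on a different ESXi host
oc run iperf-client --image=quay.io/fedora/fedora:38 \
  --overrides='{"spec":{"nodeName":"worker-b"}}' \
  --command -- sleep infinity

# install and run iperf3 in each pod
oc rsh iperf-server   # then: dnf install -y iperf3 && iperf3 -s
oc rsh iperf-client   # then: dnf install -y iperf3 && iperf3 -c <server pod IP>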

@kai-uwe-rommel

Ok, I guess something like that: https://github.com/InfuseAI/k8s-iperf

@MattPOlson
Author

I've had good luck with this one:

https://github.com/InfraBuilder/k8s-bench-suite
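
If memory serves, running it looks roughly like this (node names are placeholders; check the repo README for the exact flags):

./knb --verbose --client-node worker-a --server-node worker-b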

@jcpowermac

FROM quay.io/fedora/fedora:38
RUN dnf install -y iperf3 ttcp qperf
ENTRYPOINT trap : TERM INT; sleep infinity & wait # Listen for kill signals and exit quickly.

cat Dockerfile | oc new-build --name perf -D -

then created a deployment for both client and server
just gotta watch which node it lands on and destroy if necessary

then just oc rsh to run the commands

found this on a blog post somewhere ;-)
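
Roughly, those follow-up steps look like this (a sketch under the same assumptions; resource names are illustrative):

# create client/server deployments from the "perf" imagestream built above
oc new-app --name=perf-server --image-stream=perf
oc new-app --name=perf-client --image-stream=perf
oc get pods -o wide              # check node placement; delete/recreate until they land on different nodes
oc rsh deployment/perf-server    # run: iperf3 -s
oc rsh deployment/perf-client    # run: iperf3 -c <perf-server pod IP>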

@imdmahajankanika

imdmahajankanika commented Jun 2, 2023

After upgrading our clusters from 4.10.0-0.okd-2022-07-09-073606 to 4.11.0-0.okd-2023-01-14-152430, connectivity between kube-apiserver and all other apiservers was lost. Our master nodes all run on vSphere. We could fix the issue by running:

ethtool -K ens192 tx-udp_tnl-segmentation off
ethtool -K ens192 tx-udp_tnl-csum-segmentation off

on the master nodes. I have a feeling that we are looking at the very old bug: openshift/machine-config-operator#2482

Could you check the state of tunnel offloading on your nodes with ethtool -k <your-primary-interface> | grep tx-udp?

Hello! In my case, just by executing systemctl restart NetworkManager, tx-udp_tnl-segmentation and tx-udp_tnl-csum-segmentation got turned off and the issue was resolved.

(image attachment)

@jcpowermac

Hello! In my case, just by executing systemctl restart NetworkManager, tx-udp_tnl-segmentation and tx-udp_tnl-csum-segmentation got turned off and the issue was resolved.

The disable is done via a NetworkManager dispatcher script, so that kinda makes sense. I wonder why it doesn't work the first time.

https://github.com/openshift/machine-config-operator/blob/master/templates/common/_base/files/vsphere-disable-vmxnet3v4-features.yaml

@imdmahajankanika

imdmahajankanika commented Jun 2, 2023

Hello! In my case, just by executing systemctl restart NetworkManager, tx-udp_tnl-segmentation and tx-udp_tnl-csum-segmentation got turned off and the issue was resolved.

The disable is done via a NetworkManager dispatcher script, so that kinda makes sense. I wonder why it doesn't work the first time.

https://github.com/openshift/machine-config-operator/blob/master/templates/common/_base/files/vsphere-disable-vmxnet3v4-features.yaml

When I checked initially via "systemctl status NetworkManager-dispatcher.service", I found two types of errors:

  • Permission denied on the /etc/NetworkManager/dispatcher.d folder
  • Error: Device '' not found. (The device in this case, I think, is the network interface, i.e. the variable "DEVICE_IFACE".)

@Reamer
Contributor

Reamer commented Jun 6, 2023

I also see the failed access in my environment. In my opinion, it is due to SELinux.

May 25 09:25:11 localhost.localdomain NetworkManager[1088]: <info>  [1685006711.9105] manager: (patch-br-ex_worker1-cl1-dc3.s-ocp.cloud.mycompany.com-to-br-int): new Open vSwitch Port device (/org/freedesktop/NetworkManager/Devices/23)
May 25 09:25:11 localhost.localdomain audit[1099]: AVC avc:  denied  { read } for  pid=1099 comm="nm-dispatcher" name="dispatcher.d" dev="sda4" ino=109431351 scontext=system_u:system_r:NetworkManager_dispatcher_t:s0 tcontext=system_u:object_r:NetworkManager_initrc_exec_t:s0 tclass=dir permissive=0
May 25 09:25:11 localhost.localdomain audit[1099]: SYSCALL arch=c000003e syscall=257 success=no exit=-13 a0=ffffff9c a1=56017b2615f0 a2=90800 a3=0 items=0 ppid=1 pid=1099 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="nm-dispatcher" exe="/usr/libexec/nm-dispatcher" subj=system_u:system_r:NetworkManager_dispatcher_t:s0 key=(null)
May 25 09:25:11 localhost.localdomain audit: PROCTITLE proctitle="/usr/libexec/nm-dispatcher"
May 25 09:25:11 localhost.localdomain nm-dispatcher[1099]: req:3 'hostname': find-scripts: Failed to open dispatcher directory '/etc/NetworkManager/dispatcher.d': Error opening directory “/etc/NetworkManager/dispatcher.d”: Permission denied

@imdmahajankanika

imdmahajankanika commented Jun 6, 2023

I also see the failed access in my environment. In my opinion, it is due to SELinux.

May 25 09:25:11 localhost.localdomain NetworkManager[1088]: <info>  [1685006711.9105] manager: (patch-br-ex_worker1-cl1-dc3.s-ocp.cloud.mycompany.com-to-br-int): new Open vSwitch Port device (/org/freedesktop/NetworkManager/Devices/23)
May 25 09:25:11 localhost.localdomain audit[1099]: AVC avc:  denied  { read } for  pid=1099 comm="nm-dispatcher" name="dispatcher.d" dev="sda4" ino=109431351 scontext=system_u:system_r:NetworkManager_dispatcher_t:s0 tcontext=system_u:object_r:NetworkManager_initrc_exec_t:s0 tclass=dir permissive=0
May 25 09:25:11 localhost.localdomain audit[1099]: SYSCALL arch=c000003e syscall=257 success=no exit=-13 a0=ffffff9c a1=56017b2615f0 a2=90800 a3=0 items=0 ppid=1 pid=1099 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="nm-dispatcher" exe="/usr/libexec/nm-dispatcher" subj=system_u:system_r:NetworkManager_dispatcher_t:s0 key=(null)
May 25 09:25:11 localhost.localdomain audit: PROCTITLE proctitle="/usr/libexec/nm-dispatcher"
May 25 09:25:11 localhost.localdomain nm-dispatcher[1099]: req:3 'hostname': find-scripts: Failed to open dispatcher directory '/etc/NetworkManager/dispatcher.d': Error opening directory “/etc/NetworkManager/dispatcher.d”: Permission denied

Hello, did you try sudo systemctl restart NetworkManager?

or

restorecon -vR /etc/NetworkManager/dispatcher.d/;semodule -B;systemctl restart NetworkManager;systemctl restart kubelet

@Reamer
Contributor

Reamer commented Jun 6, 2023

I had tried systemctl restart NetworkManager on one node after your message, without thinking further. This breaks the SSH connection and kills the command, probably because of the missing parent process. I had to reset the node manually. I have not found a way to open any kind of tmux or screen session in Fedora CoreOS.

I can confirm that the offload parameters are also set in my environment.

[root@worker1-cl1-dc3 ~]# ethtool -k ens192 | grep tx-udp
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-udp-segmentation: off [fixed]

I ran network performance tests using iperf before and after changing the offload parameters. I used ethtool for changing the offload parameter.

ethtool -K ens192 tx-udp_tnl-segmentation off
ethtool -K ens192 tx-udp_tnl-csum-segmentation off

The difference between the tests is tiny. The difference in network speed between two pods on different nodes and between two VMs is very large (between two VMs the speed is around 7x faster), but according to my current knowledge this is due to OVN.
I did not notice any network disconnections.

@MattPOlson
Author

I had tried systemctl restart NetworkManager on one node after your message, without thinking further. This breaks the SSH connection and kills the command, probably because of the missing parent process. I had to reset the node manually. I have not found a way to open any kind of tmux or screen session in Fedora CoreOS.

I can confirm that the offload parameters are also set in my environment.

[root@worker1-cl1-dc3 ~]# ethtool -k ens192 | grep tx-udp
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-udp-segmentation: off [fixed]

I ran network performance tests using iperf before and after changing the offload parameters. I used ethtool for changing the offload parameter.

ethtool -K ens192 tx-udp_tnl-segmentation off
ethtool -K ens192 tx-udp_tnl-csum-segmentation off

The difference between the tests is tiny. The difference in network speed between two pods on different nodes and between two VMs is very large (between two VMs the speed is around 7x faster), but according to my current knowledge this is due to OVN. I did not notice any network disconnections.

Running this command fixes the issue for us:

restorecon -vR /etc/NetworkManager/dispatcher.d/;semodule -B;systemctl restart NetworkManager;systemctl restart kubelet
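
Broken out (the same commands, just annotated):

restorecon -vR /etc/NetworkManager/dispatcher.d/   # restore the SELinux labels on the dispatcher scripts
semodule -B                                        # rebuild and reload the SELinux policy
systemctl restart NetworkManager                   # re-trigger the dispatcher scripts (applies the ethtool workaround)
systemctl restart kubelet                          # restart kubelet after the network change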

With offload on, communication between pods on different nodes is really bad. I have also found that if we upgrade the vSphere Distributed Switch to version 7.0.3 the problem goes away; speeds are normal with offload on.

@vrutkovs closed this as completed Jul 9, 2023