Very slow network performance with FastDP+Encryption on linux kernel 4.12 #3075
Comments
@nesc58 Thanks for the report. I'm quite surprised to see such low throughput with fastdp+encryption. How do you ensure that the iperf client and server pods are not scheduled on the same machine?
Hi @brb, the DaemonSet just runs a plain Ubuntu pod on each node; that's it. Here is the .yml file:
Exec into each pod and install iperf, run one as a server and connect the others to it using the pod IPs displayed by kubectl. I cannot test it in the next few days because I am testing the etcd snapshot and restore function and so on, so I can only test with the current setup, which is working. For now I don't have access to another infrastructure to test it again. I hope the error is reproducible.
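A minimal sketch of that test loop, assuming plain iperf (not iperf3); the pod names and the server pod IP are placeholders taken from the DaemonSet's pods:
# List the test pods, their IPs and the nodes they were scheduled on
kubectl get pods -o wide
# In one pod: install iperf and start it as a server
kubectl exec -it <server-pod> -- sh -c "apt-get update && apt-get install -y iperf && iperf -s"
# In a pod on a different node: run the client against the server pod's IP
kubectl exec -it <client-pod> -- sh -c "apt-get update && apt-get install -y iperf && iperf -c <server-pod-ip> -t 30"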
The problem still exists with kernel 4.13.3-coreos-r1.
I can reproduce this, but in a different context: not using Weave or Docker, but IPsec mGRE tunnels. With Ubuntu, any kernel beyond 4.12 performs very, very badly; when downgraded to 4.10 the regression is gone and performance is good. Tested using AWS and GCP instances (ixgbevf NIC driver); I didn't test on bare metal. You may want to add more data to the bug I filed a couple of days ago.
Hi, I am glad to report that there is a patch available for kernels >= 4.12 that could fix this problem.
@zenvdeluca Thanks for investigating this. Well done for identifying what looks like the root cause! Let's hope that patch lands soon and gets ported to all the affected kernel versions.
We've just been hit by this and have gone through the whole investigation loop. This appears to work around the problem, at the cost of increased CPU usage:
@zenvdeluca any news on getting that kernel patch merged?
Should be in kernel v4.14-rc8 (Container Linux >= 1590.0.0): torvalds/linux@73b9fc4
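A quick way to check whether a given node already carries the fix (a sketch, assuming Container Linux, where the release version is recorded in /etc/os-release):
# Kernel version: 4.14-rc8 or newer includes torvalds/linux@73b9fc4
uname -r
# Container Linux release: should be >= 1590.0.0
grep '^VERSION=' /etc/os-release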
We've not had further reports of this, so it's a fair guess that the kernel fix has propagated sufficiently far and is doing its job. Closing.
Hi,
I have a huge issue with the following setup:
What you expected to happen?
I expect network performance with a bandwidth of more than 60% of the raw bandwidth. Our servers are connected at 1 Gbit, so with encryption something around 600 Mbit would be great.
I tested the whole setup with an older CoreOS version with kernel 4.11 and the results are great:
Hardware machines without virtualization: ~900 Mbit
Virtualized with Xen: ~850 Mbit
Virtualized with Xen with AES-NI disabled: ~250 Mbit (really good for a setup without AES-NI, I think)
What changed for this comparison? The Linux kernel went from 4.12 down to 4.11, and Docker from 17.05-ce to 1.12.6.
All of this is faster than the setup with Linux kernel 4.12 (see the results under "What happened?" below). Using kernel 4.12 + WEAVE_NO_FASTDP=true + encryption is okay: the iperf bandwidth results are also about 800 to 900 Mbit, BUT the CPU load of the weave process is about 100 to 200 percent.
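For reference, a sketch of how WEAVE_NO_FASTDP can be passed to the Kubernetes installation of Weave Net; this assumes the cloud.weave.works manifest generator, whose env. query parameters become environment variables on the weave container:
# Apply the Weave Net manifest with fastdp disabled (falls back to sleeve mode)
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')&env.WEAVE_NO_FASTDP=true"
With fastdp disabled, encryption is handled in user space by the weave process, which would explain the 100 to 200 percent CPU usage mentioned above.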
What happened?
I get a bandwidth (tested with iperf) of 3 to 55 Mbit using CoreOS with Linux kernel 4.12.2.
Results of different setups:
55 Mbit: non-virtualized machines (notebook)
25 Mbit: virtualized machines (VirtualBox) on a notebook
6 Mbit: virtualized machines (XenServer 7.2/7.0)
Disabling TSO (ethtool -K ...) increases the performance from 6 Mbit to 100 Mbit, but the CPU load of the ksoftirqd process increases too (from 5 to 100 percent).
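The full command was abbreviated above; a sketch of what disabling TSO looks like, with eth0 standing in for the actual uplink interface:
# Show the current offload settings of the interface
ethtool -k eth0
# Turn TCP segmentation offload off (the workaround described above)
ethtool -K eth0 tso off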
How to reproduce it?
Use CoreOS with Linux kernel 4.12.2 (today: the beta and alpha releases; alpha ships Docker 17.05-ce)
Install Kubernetes
Install weave-net with
./kubectl create -f ....
Install kube-dns (the Kubernetes DNS addon)
Run Ubuntu pods (I started a DaemonSet to run this pod on each machine)
Exec into the pod's Ubuntu container, then install and run iperf (see the sketch below)
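A condensed sketch of those steps, assuming an existing Kubernetes cluster; the manifest file names are placeholders for the actual files used:
# Install Weave Net (with encryption configured as in the original setup)
kubectl create -f <weave-net-manifest.yaml>
# Start the Ubuntu test pods on every node via the DaemonSet
kubectl create -f <ubuntu-daemonset.yaml>
# Then exec into two pods on different nodes and run iperf between them as described above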
Anything else we need to know?
I used the same configuration files for the different setups.
I found a commit in kernel 4.12 which changed/added something in the xfrm hardware offloading: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d77e38e612a017480157fe6d2c1422f42cb5b7e3
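If that commit is related, it may be worth checking whether the NIC exposes the new ESP offload feature flags at all (a guess on my part; the flag names below assume a kernel new enough to contain the xfrm offload code, and eth0 is a placeholder):
# List the ESP/IPsec offload feature flags of the interface
ethtool -k eth0 | grep -i esp
# On hardware without IPsec offload this typically shows something like:
#   esp-hw-offload: off [fixed]
#   esp-tx-csum-hw-offload: off [fixed]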
It would be great if anybody could reproduce this issue.
Is any information missing? Please let me know. Note that I reinstalled the cluster to test with the older CoreOS version, so the log files are deleted; that is why it would be helpful if somebody else could reproduce it.