reproducible kernel panic w. 4.19.0 & parallel iperf threads P>8, weave/2.5.* #3684
Comments
btw, the codepaths vary somewhat from stacktrace to stacktrace, but the following calls seem to be the common ones (presumably skb-related):
thanks @fgeorgatos for reporting this issue
It could be either the specific combination (openstack/qemu) or the parallel network streams that is causing this issue. From the stack trace, the panic is potentially due to the OVS datapath that Weave's fastdp uses.
@murali-reddy thanks for the feedback. fyi, the cause/fix must be hidden somewhere along this linux kernel git diff (<1000 lines up to the known bugfix point). However, I have run out of ideas about how to corner it rigorously; @brb any suggestions?
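One way to corner it might be to bisect the stable tree between the panicking tag and the first known-fixed point, rerunning the iperf3 sweep at each step; a minimal sketch, assuming the stable tree and using v4.19.x as a placeholder for the actual bugfix tag:

```sh
# sketch: bisect for the commit that *fixed* the panic; v4.19.x is a
# placeholder for the first known-fixed tag, not a real reference
git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
cd linux
git bisect start --term-old=broken --term-new=fixed
git bisect broken v4.19     # panics under the iperf3 sweep
git bisect fixed v4.19.x    # placeholder: first tag where the panic is gone
# at each step: build and boot the candidate kernel in the test VM, rerun
#   echo 1 2 4 8 16 32 64 128 | xargs -n1 iperf3 -c <server_cni_ip> -P
# then mark the outcome:
#   git bisect broken   (still panics)   or   git bisect fixed
```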
What you expected to happen?
No kernel crash for parallel net streams:
iperf3 -c <server_cni_ip> -P1,2,4,8,16,32,64,128
i.e. the receiving end should be able to tolerate multiple parallel network streams, for P >= ~8.
My Request For Comments: is this reproducible on any other installations? The kernel 4.19.x series is very popular across a number of distributions (f.i. centos7+elrepo) and I have seen it in many other k8s deployments; testing it is cheap, since it is a one-liner run across 2 pods.
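For anyone who wants to try, a minimal sketch for spinning up the 2 pods with kubectl; the image name is an assumption (any image shipping iperf3 will do), and the pods should ideally land on different nodes so traffic crosses the weave datapath:

```sh
# server pod running iperf3 in server mode
kubectl run iperf3-server --image=networkstatic/iperf3 --command -- iperf3 -s
# grab its CNI IP once it is Running
SERVER_IP=$(kubectl get pod iperf3-server -o jsonpath='{.status.podIP}')
# client pod driving a single sweep step against the server's CNI IP
kubectl run iperf3-client --rm -it --image=networkstatic/iperf3 \
  --command -- iperf3 -c "$SERVER_IP" -P 16
```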
What happened?
Kernel panic, reproducible with iperf for P=~16 or greater, sometimes also for P=8.
IMPORTANT: the bug is NOT reproducible without involving a CNI (precisely: over the openvswitch layer).
How to reproduce it?
You need to launch iperf3 -s inside a test pod with kernel/4.19.0 or a "bug-compatible" one, then pick a client pod and simply try:
iperf3 -c <server_cni_ip> -P1,2,4,8,16,32,64,128
On a problematic kernel, the kernel panic will occur about midway through the above sequence.
A convenient one-liner:
echo 1 2 4 8 16 32 64 128|xargs -n1 iperf3 -c <server_cni_ip> -P
N.B. the crashing system is always the traffic-receiving server listening on that CNI IP.
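An equivalent but more verbose form of the sweep, pausing between steps so the crashing -P value is easier to pin down (the -t 10 duration is an arbitrary choice here):

```sh
# same sweep as the one-liner above, labelled and paced; watch the server's
# console (or `dmesg -w` on its node) for the panic
for P in 1 2 4 8 16 32 64 128; do
  echo "=== P=$P ==="
  iperf3 -c <server_cni_ip> -P "$P" -t 10 || break
  sleep 2
done
```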
Anything else we need to know?
The configuration tried here involves an openstack qemu back-end, deployed via rancher. Mentioning it because it could conceivably be a factor or even the cause of the bug, although my bigger question is whether having fastdp enabled could be a factor: I have noticed that when it gets disabled, traffic throughput drops and the kernel panic ceases (i.e. there is a correlation, but not necessarily a causal relationship).
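For reference, a sketch of how fastdp can be toggled off to test that correlation, assuming the standard weave-net DaemonSet (WEAVE_NO_FASTDP is Weave's documented switch for disabling the fast datapath):

```sh
# force the sleeve (userspace) path instead of fastdp; the weave pods
# restart with the new setting
kubectl set env daemonset/weave-net -n kube-system -c weave WEAVE_NO_FASTDP=1
# re-run the iperf3 sweep: throughput drops, but the panic no longer occurs
```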
Versions:
Logs:
the kernel ultimately dies with:
kernel panic - not syncing: fatal exception in interrupt
stacktrace:
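Since the full trace only shows up on the console, a possible way to capture it off-host is netconsole; a sketch with placeholder IPs, interface and MAC:

```sh
# on the (soon to panic) server: stream kernel messages over UDP
# format: netconsole=src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac
modprobe netconsole netconsole=6666@10.0.0.5/eth0,6666@10.0.0.9/00:11:22:33:44:55
# on the collector host (BSD netcat syntax):
nc -l -u 6666
```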
Conclusion
If the above bug report is considered historic (since the kernel in question is old),
please consider this feature request instead:
iperf3 -s & echo 1 2 4 8 16 32 64 128|xargs -n1 iperf3 -c my_iperf_daemonset -P
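A rough sketch of what that feature could look like: an iperf3 DaemonSet plus a Service, then the usual sweep. Names and image are assumptions, and the Service uses dashes since underscores are not valid in DNS names:

```sh
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: iperf3-server
spec:
  selector:
    matchLabels:
      app: iperf3-server
  template:
    metadata:
      labels:
        app: iperf3-server
    spec:
      containers:
      - name: iperf3
        image: networkstatic/iperf3   # assumption: any image shipping iperf3
        command: ["iperf3", "-s"]
---
apiVersion: v1
kind: Service
metadata:
  name: my-iperf-daemonset   # dashes: underscores are invalid in DNS names
spec:
  selector:
    app: iperf3-server
  ports:
  - port: 5201
EOF
# run the sweep from inside a client pod, where the Service name resolves:
echo 1 2 4 8 16 32 64 128 | xargs -n1 iperf3 -c my-iperf-daemonset -P
```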