Weave errors AWS K8s TCP socket connection issues #2731
Is this related to #2674? |
@chrislovecnm this is the mitigation we added to prevent the load-inducing packet-looping issue we jointly debugged a while back. We conjectured that it was caused by something external enabling hairpin mode on the port which connects the OVS datapath to the local bridge (TBD, although there has been at least one Kubernetes bug that did exactly this; see the reasoning in #2650). In addition to blocking looping flows, we also added some diagnostics that continually monitor for hairpin mode being enabled - are you also seeing an ERROR level message to that effect
in the logs? If so, then we need to determine what it is in your environment that is turning this on; if not, we will need to give some more thought to how this can happen. |
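For anyone wanting to check this by hand, the hairpin state of a bridge port can be read from sysfs on the host; the names below assume weave's usual naming (a Linux bridge called weave with the port vethwe-bridge attached to it) and may differ in other setups:
# 1 means hairpin mode is enabled on that bridge port
cat /sys/class/net/weave/brif/vethwe-bridge/hairpin_mode
# iproute2 reports the same thing when asked for details
bridge -d link show dev vethwe-bridge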
@awh will take a look! |
@awh / @bboreham so yes we have hairpin_mode on. Question is how and why.
While on another node:
I am checking through the code to determine what is doing this. Any ideas on your side? |
So what does kubelet's hairpin mode default to? And what should we have hairpin set to? |
It's ok to have hairpin mode on for a device that connects a container to the bridge; it's bad to have it on for the veth that connects weave's bridge to the datapath (vethwe-bridge). Kubelet has a --hairpin-mode flag that controls this. |
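For context, kubelet's hairpin behaviour is driven by that command-line flag; a minimal illustration (the set of accepted values is as documented for kubelet, and which one is the default has varied between releases):
# accepted values: promiscuous-bridge, hairpin-veth, none
kubelet --hairpin-mode=hairpin-veth ...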
If I understand correctly, we have hit a kubelet bug? If so, we are on 1.4.8. What released version is the fix in? Has it been cherry-picked? What is the PR? |
Also what commands can I execute to remove the hairpin and test? |
Not necessarily - we're just using that as an example to illustrate that there have been instances in the past where things outside of our control have erroneously enabled hairpin on the veth that connects weave's bridge and datapath. This is the bug we're talking about: kubernetes/kubernetes#19766 |
We have it on for vethwe-bridge, and I'm not sure about the weave interface. What does 'bad' mean ;) |
Would bad cause intermittent packet loss? |
Btw, totally not on weave on this one; kubelet decided to do this ... Is there a Debian command where I can disable the hairpin and retest without removing the interface? |
|
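For a quick test, hairpin can be turned off without removing the interface, either via sysfs or the iproute2 bridge tool; this assumes the same weave/vethwe-bridge names as above:
echo 0 > /sys/class/net/weave/brif/vethwe-bridge/hairpin_mode
# equivalently
bridge link set dev vethwe-bridge hairpin off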
If you want to ping me later today on slack or zoom, that would be grand |
@bboreham I am escalating your k8s PR as a cherry-pick, btw. |
FYI
Was only on for |
kubelet is turning it right back on ... :) So manual intervention is not helping. Let me figure out my kubelet service :P |
Check your kubelet logs - it may say why it is turning it on (e.g. an error like the one it emits when ethtool is missing) |
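If kubelet runs as a systemd unit named kubelet (typical on Debian-based hosts, but an assumption here), something like this should surface any hairpin-related messages:
journalctl -u kubelet --no-pager | grep -i hairpin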
nada |
nada == "nothing in the kubelet logs" - to be clear, hairpin keeps turning itself back on, because we are hitting an error from CNI? |
I am attempting to get curl on the pod, but apt-get stinks when you have intermittent packet loss ... :( |
That is the batch of logs associated with the process that I have.
root@foo-393834086-6y2gi:/# !ping
ping www.google.com
ping: unknown host |
What are the next steps that you recommend? We are going to get a tcpdump and crack that open; we have one of UDP traffic, but need one with TCP traffic. |
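For the capture itself, something along these lines should do, assuming the weave bridge interface on the host is called weave; the filter can be narrowed to specific ports as needed:
# write TCP traffic crossing the weave bridge to a file for offline analysis
tcpdump -i weave -w weave-tcp.pcap tcp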
Well, interestingly enough:
I am going to figure out the damn code lines in kubelet and get the logging bumped up on it. Thanks for pointing me in that direction again @bboreham ... |
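Raising kubelet's verbosity is just a matter of adding a higher --v level to its invocation and restarting the service; a rough sketch, assuming a systemd-managed kubelet:
# e.g. add --v=4 to the kubelet arguments in the unit/args file, then:
systemctl daemon-reload && systemctl restart kubelet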
Closing |
Related: kubernetes/kubernetes#36990 |
@chrislovecnm @awh A slight update. We've requested a cherry-pick of this fix into the 1.5 release and also into the 1.4 release. I'm beginning to wonder if this issue is an aws-weave thing, since it doesn't seem to be affecting a larger group of people (if it were, the noise about this issue would be louder). That said, you might find this interesting (see image below 👇). When we let the cluster idle, we have no hairpin errors. When we use the cluster for anything, we get a sudden rise in the number of hairpin errors, and that results in random communication failures (as we'd expect). |
We are running weave 1.8.2, as a DaemonSet, on K8s.
We have seen errors like the following from the weave pods:
Besides the weave pods, other pods were having basic TCP socket connection issues: connections were intermittent, and/or pods were unable to make a connection at all.
From what it appears, this was limited to a single AZ in AWS. If and when this occurs again, what diagnostic information do you require?
cc @bboreham
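As a starting point next time it happens, the weave pod logs and weave status output from an affected node are usually the first things to grab; a sketch, assuming the DaemonSet runs in kube-system with a container named weave and the usual weave-kube image layout (the pod name placeholder is hypothetical):
kubectl -n kube-system get pods -o wide | grep weave
kubectl -n kube-system logs <weave-pod-on-affected-node> -c weave
kubectl -n kube-system exec <weave-pod-on-affected-node> -c weave -- /home/weave/weave --local status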