
Weave errors AWS K8s TCP socket connection issues #2731

Closed
chrislovecnm opened this issue Jan 13, 2017 · 27 comments

Comments

@chrislovecnm

We are running Weave 1.8.2 as a DaemonSet on K8s.

We have seen errors like the following from the weave pods:

ERRO: 2017/01/13 16:33:38.413173 Vetoed installation of hairpin flow FlowSpec{keys: [EthernetFlowKey{src: b6:89:e4:76:d9:92, dst: 72:27:ef:47:9b:1b} InPortFlowKey{vport: 1}], actions: [OutputAction{vport: 1}]}
ERRO: 2017/01/13 16:33:38.413223 Vetoed installation of hairpin flow FlowSpec{keys: [EthernetFlowKey{src: b6:89:e4:76:d9:92, dst: 72:27:ef:47:9b:1b} InPortFlowKey{vport: 1}], actions: [OutputAction{vport: 1}]}
ERRO: 2017/01/13 16:33:39.311857 Vetoed installation of hairpin flow FlowSpec{keys: [EthernetFlowKey{src: b6:89:e4:76:d9:92, dst: 8e:08:0b:78:e2:c6} InPortFlowKey{vport: 1}], actions: [OutputAction{vport: 1}]}

Besides the weave pods, other pods were having basic TCP socket connection issues: connections were intermittent, and/or pods were unable to make a connection at all.

From what it appears, this was limited to a single AZ in AWS. If and when this occurs again, what diagnostic information do you require?

cc @bboreham

@chrislovecnm changed the title from "Weave errors AWS K8s" to "Weave errors AWS K8s TCP socket connection issues" on Jan 13, 2017
@chrislovecnm
Author

chrislovecnm commented Jan 13, 2017

Is this related to #2674?

@awh
Contributor

awh commented Jan 16, 2017

@chrislovecnm this is the mitigation we added to prevent the load-inducing packet looping issue we jointly debugged a while back. We conjectured that this was caused by something external enabling hairpin mode (TBD, although there has been at least one Kubernetes bug that did exactly this) on the port which connects the OVS datapath to the local bridge (see reasoning here #2650). In addition to blocking looping flows, we also added some diagnostics that continually monitor for hairpin mode being enabled - are you also seeing an ERROR level message like

Hairpin mode enabled on <port name>

in the logs? If so, then we need to determine what it is in your environment that is turning this on; if not, we will need to give some more thought to how this can happen.
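
If it helps, something like this would scan the weave pods' logs for that message (a sketch; the kube-system namespace, the name=weave-net label and the weave container name are assumptions about a typical DaemonSet install):

# grep each weave pod's log for the hairpin diagnostic (names assumed)
for p in $(kubectl -n kube-system get pods -l name=weave-net -o name); do
  kubectl -n kube-system logs "$p" -c weave | grep -i 'Hairpin mode enabled'
done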

@chrislovecnm
Author

@awh will take a look!

@chrislovecnm
Author

chrislovecnm commented Jan 26, 2017

@awh / @bboreham so yes, we have hairpin_mode on. The question is how and why.

admin@ip-172-20-93-173:/sys/devices/virtual/net$ find . | grep hair
./vethweplb6010c0/brport/hairpin_mode
./vethweplefdc7db/brport/hairpin_mode
./vethwepld35371c/brport/hairpin_mode
./vethwepl0a31135/brport/hairpin_mode
./vethwepl7c7808b/brport/hairpin_mode
./vethwepl0d7a63f/brport/hairpin_mode
./vethwepl9f683e0/brport/hairpin_mode
./vethwe-bridge/brport/hairpin_mode
./vethwepl659d42b/brport/hairpin_mode
admin@ip-172-20-93-173:/sys/devices/virtual/net$ cat ./vethweplb6010c0/brport/hairpin_mode
1

While on another node:

root@ip-172-20-64-216:/sys/devices/virtual/net# find . | grep hair
./vethwepl7679caa/brport/hairpin_mode
./vethwepl35d8ad8/brport/hairpin_mode
./vethwe-bridge/brport/hairpin_mode

I am checking through the code to determine what is doing this. Any ideas on your side?
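
For the record, a small sketch that prints every bridge port on a node together with its hairpin_mode value (same sysfs layout as above):

# list each bridge port on this node and whether hairpin_mode is 1 or 0
for f in /sys/devices/virtual/net/*/brport/hairpin_mode; do
  dev=$(basename "$(dirname "$(dirname "$f")")")
  printf '%s %s\n' "$dev" "$(cat "$f")"
done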

@chrislovecnm
Author

chrislovecnm commented Jan 26, 2017

So wth does kubelet hairpin mode default to?

https://github.com/kubernetes/kubernetes/blob/d40710988f5d79c38493579e7c1bc978d7eecce6/cmd/kubelet/app/options/options.go#L209

And what should we have hairpin set to??

@bboreham
Contributor

It's OK to have hairpin mode on for a device like vethwepl0a31135; these are the individual interfaces for containers.

It's bad to have it on for weave or vethwe-bridge.

Kubelet has a --hairpin-mode flag which defaults to veth. Previously it had a misfeature where it would apply the setting to every device on the host if it encountered an error while trying to set just one device.
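
If it helps, a quick sketch to see which hairpin mode your running kubelet was actually started with (this assumes kubelet runs as a host process and that the flag, if set, appears on its command line; no output means it is using the built-in default):

# print the kubelet's --hairpin-mode flag, if it was set explicitly
ps -ef | grep '[k]ubelet' | tr ' ' '\n' | grep -- '--hairpin-mode'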

@chrislovecnm
Author

chrislovecnm commented Jan 26, 2017

If I understand correctly, we have hit a kubelet bug? If so, we are on 1.4.8. What released version is the fix in? Has it been cherry-picked? What is the PR?

@chrislovecnm
Author

Also what commands can I execute to remove the hairpin and test?

@awh
Contributor

awh commented Jan 26, 2017

@chrislovecnm

If I understand, we have hit a kubelet bug?

Not necessarily - we're just using that as an example to illustrate that there have been instances in the past where things outside of our control have erroneously enabled hairpin on the veth that connects weave's bridge and datapath. This is the bug we're talking about: kubernetes/kubernetes#19766

@chrislovecnm
Author

It's bad to have it on for weave or vethwe-bridge.

We have it on for vethwe-bridge, and I'm not sure about the weave interface. What does "bad" mean? ;)

@chrislovecnm
Author

Would bad cause intermittent packet loss?

@chrislovecnm
Author

Btw, this is totally not on Weave; kubelet decided to do this ...

Is there a Debian command with which I can disable the hairpin and retest without removing the interface?

@awh
Contributor

awh commented Jan 26, 2017

bridge link set dev <device> hairpin off
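
After running that on a given port, reading the same sysfs file as before should show 0 (sketch using your vethwe-bridge port as an example):

# verify it took effect via sysfs afterwards (expect 0)
cat /sys/devices/virtual/net/vethwe-bridge/brport/hairpin_mode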

@chrislovecnm
Author

If you want to ping me later today on Slack or Zoom, that would be grand.

@chrislovecnm
Author

@bboreham I am escalating your k8s PR as a cherry-pick, btw.

@chrislovecnm
Author

FYI

root@ip-172-20-64-216:~# bridge link set dev weave hairpin off
RTNETLINK answers: Operation not supported
root@ip-172-20-64-216:~# bridge link set dev vethwe-bridge hairpin off

It was only on for vethwe-bridge.
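
Presumably that "Operation not supported" is because hairpin is a per-bridge-port attribute and weave is the bridge itself rather than a port, so it has no brport/ entry (a quick sysfs check, as a sketch):

# hairpin_mode lives under brport/, which only bridge ports have
ls -d /sys/class/net/weave/brport 2>/dev/null || echo "weave: not a bridge port"
ls -d /sys/class/net/vethwe-bridge/brport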

@chrislovecnm
Author

kubelet is turning it right back on ... :) So manual intervention is not helping. Let me figure out my kubelet service :P
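
One way I could catch the exact moment it flips back on is a simple polling loop on the sysfs value (sketch):

# log a timestamp whenever hairpin_mode on vethwe-bridge goes back to 1
while sleep 1; do
  v=$(cat /sys/devices/virtual/net/vethwe-bridge/brport/hairpin_mode)
  [ "$v" = "1" ] && echo "$(date -Is) hairpin re-enabled on vethwe-bridge"
done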

@bboreham
Contributor

Check your kubelet logs - they may say why it is turning it on (e.g. an error like the one seen when ethtool was missing).
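
Assuming kubelet runs under systemd on your Debian nodes, something like this would pull the relevant lines (adjust the unit name and time window as needed):

# look for hairpin / ethtool related messages from kubelet
journalctl -u kubelet --since "1 hour ago" | grep -iE 'hairpin|ethtool'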

@chrislovecnm
Author

nada

@chrislovecnm
Author

nada == "nothing in the kubelet logs". To be clear: hairpin keeps turning itself back on. Is this because we are hitting an error from CNI?

@chrislovecnm
Author

I am attempting to get curl on the pod, but apt-get stinks when you have intermittent packet loss ... :(

@chrislovecnm
Author

INFO: 2017/01/26 12:10:04.971468 Expired MAC ee:cc:4b:05:87:7d at a2:cf:d5:b3:d1:c8(ip-172-20-95-78)
ERRO: 2017/01/26 12:13:27.151094 Vetoed installation of hairpin flow FlowSpec{keys: [EthernetFlowKey{src: 4e:e8:a5:bc:e3:c4, dst: a2:cf:d5:b3:d1:c8} InPortFlowKey{vport: 1}], actions: [OutputAction{vport: 1}]}
ERRO: 2017/01/26 12:13:28.148193 Vetoed installation of hairpin flow FlowSpec{keys: [EthernetFlowKey{src: 4e:e8:a5:bc:e3:c4, dst: a2:cf:d5:b3:d1:c8} InPortFlowKey{vport: 1}], actions: [OutputAction{vport: 1}]}
ERRO: 2017/01/26 12:13:29.148222 Vetoed installation of hairpin flow FlowSpec{keys: [EthernetFlowKey{src: 4e:e8:a5:bc:e3:c4, dst: a2:cf:d5:b3:d1:c8} InPortFlowKey{vport: 1}], actions: [OutputAction{vport: 1}]}
ERRO: 2017/01/26 12:13:30.148264 Vetoed installation of hairpin flow FlowSpec{keys: [EthernetFlowKey{src: 4e:e8:a5:bc:e3:c4, dst: a2:cf:d5:b3:d1:c8} InPortFlowKey{vport: 1}], actions: [OutputAction{vport: 1}]}
ERRO: 2017/01/26 12:13:31.148186 Vetoed installation of hairpin flow FlowSpec{keys: [EthernetFlowKey{src: 4e:e8:a5:bc:e3:c4, dst: a2:cf:d5:b3:d1:c8} InPortFlowKey{vport: 1}], actions: [OutputAction{vport: 1}]}
ERRO: 2017/01/26 12:13:32.148193 Vetoed installation of hairpin flow FlowSpec{keys: [EthernetFlowKey{src: 4e:e8:a5:bc:e3:c4, dst: a2:cf:d5:b3:d1:c8} InPortFlowKey{vport: 1}], actions: [OutputAction{vport: 1}]}
ERRO: 2017/01/26 12:14:51.476761 Vetoed installation of hairpin flow FlowSpec{keys: [EthernetFlowKey{src: 4e:e8:a5:bc:e3:c4, dst: 32:8a:9c:2e:58:29} InPortFlowKey{vport: 1}], actions: [OutputAction{vport: 1}]}
ERRO: 2017/01/26 12:14:52.473912 Vetoed installation of hairpin flow FlowSpec{keys: [EthernetFlowKey{src: 4e:e8:a5:bc:e3:c4, dst: 32:8a:9c:2e:58:29} InPortFlowKey{vport: 1}], actions: [OutputAction{vport: 1}]}
ERRO: 2017/01/26 12:14:53.473867 Vetoed installation of hairpin flow FlowSpec{keys: [EthernetFlowKey{src: 4e:e8:a5:bc:e3:c4, dst: 32:8a:9c:2e:58:29} InPortFlowKey{vport: 1}], actions: [OutputAction{vport: 1}]}
ERRO: 2017/01/26 12:14:54.474024 Vetoed installation of hairpin flow FlowSpec{keys: [EthernetFlowKey{src: 4e:e8:a5:bc:e3:c4, dst: 32:8a:9c:2e:58:29} InPortFlowKey{vport: 1}], actions: [OutputAction{vport: 1}]}
ERRO: 2017/01/26 12:14:55.473937 Vetoed installation of hairpin flow FlowSpec{keys: [EthernetFlowKey{src: 4e:e8:a5:bc:e3:c4, dst: 32:8a:9c:2e:58:29} InPortFlowKey{vport: 1}], actions: [OutputAction{vport: 1}]}
ERRO: 2017/01/26 12:14:56.473830 Vetoed installation of hairpin flow FlowSpec{keys: [EthernetFlowKey{src: 4e:e8:a5:bc:e3:c4, dst: 32:8a:9c:2e:58:29} InPortFlowKey{vport: 1}], actions: [OutputAction{vport: 1}]}
INFO: 2017/01/26 12:16:04.972545 Expired MAC de:cd:9a:26:4b:34 at fa:14:0e:93:9e:79(ip-172-20-64-216)

That is the batch of logs that I have associated with the process:

  1. turn off hairpin
  2. kubectl exec into pod that is having network issues
  3. from the pod: ping www.google.com
root@foo-393834086-6y2gi:/# !ping
ping www.google.com
ping: unknown host
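
Since "unknown host" could just be DNS failing over the broken path rather than raw connectivity, a direct-IP test would separate the two (sketch; 8.8.8.8 is only an example target):

# direct-IP ping bypasses DNS; also dump the pod's resolver config
kubectl exec -it foo-393834086-6y2gi -- ping -c 3 8.8.8.8
kubectl exec -it foo-393834086-6y2gi -- cat /etc/resolv.conf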

@chrislovecnm
Author

chrislovecnm commented Jan 26, 2017

What are the next steps that you recommend? We are going to get a tcpdump and crack that open; we have one of UDP traffic, but need one with TCP traffic.

@chrislovecnm
Author

Well, interestingly enough:

admin@ip-172-20-93-173:~$ sudo which ethtool
admin@ip-172-20-93-173:~$

I am going to figure out the damn code lines in kubelet and get its logging bumped up. Thanks for pointing me in that direction again, @bboreham ...
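
Given the misfeature mentioned above (hairpin applied to every device when kubelet hits an error such as a missing ethtool), installing ethtool on the node and retesting seems worth a try (sketch for Debian, assuming kubelet runs as a systemd unit; whether this is actually the trigger is still a hypothesis):

# ethtool is absent on this node; install it and restart kubelet, then
# watch whether hairpin gets re-enabled on vethwe-bridge
sudo apt-get update && sudo apt-get install -y ethtool
sudo systemctl restart kubelet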

@chrislovecnm
Author

Closing

@itskingori

Related: kubernetes/kubernetes#36990

@itskingori

@chrislovecnm @awh A slight update. We've requested a cherry-pick of this fix into the 1.5 release and also into the 1.4 release.

I'm beginning to wonder if this issue is an AWS-Weave thing, since it doesn't seem to be affecting a larger group of people (otherwise the noise about this issue would be louder).

That said, you might find this interesting (see image below 👇). When we let the cluster idle, we have no hairpin errors. When we use the cluster for anything, we get a sudden rise in the number of hairpin errors, and that results in random communication failures (as we'd expect).

[screenshot: "screen shot 2017-02-02 at 16 55 11" - hairpin error count over time]
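
For anyone wanting to compare notes, a rough sketch of how that count can be pulled from the weave pod logs (assumes a kube-system DaemonSet labelled name=weave-net with a weave container):

# count vetoed hairpin flows per weave pod
for p in $(kubectl -n kube-system get pods -l name=weave-net -o name); do
  n=$(kubectl -n kube-system logs "$p" -c weave | grep -c 'Vetoed installation of hairpin flow')
  echo "$p: $n"
done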
