Weave errors AWS K8s TCP socket connection issues #2731
Is this related to #2674? |
@chrislovecnm this is the mitigation we added to prevent the load-inducing packet-looping issue we jointly debugged a while back. We conjectured that it was caused by something external enabling hairpin mode on the port which connects the OVS datapath to the local bridge (TBD, although there has been at least one Kubernetes bug that did exactly this; see the reasoning in #2650). In addition to blocking looping flows, we also added some diagnostics that continually monitor for hairpin mode being enabled - are you also seeing an ERROR level message to that effect
in the logs? If so, then we need to determine what it is in your environment that is turning this on; if not, we will need to give some more thought to how this can happen. |
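For anyone wanting to check this by hand, the hairpin state of a bridge port can be read from sysfs on the host; the names below assume weave's usual naming (a Linux bridge called weave with the port vethwe-bridge attached to it) and may differ in other setups:
# 1 means hairpin mode is enabled on that bridge port
cat /sys/class/net/weave/brif/vethwe-bridge/hairpin_mode
# iproute2 reports the same thing when asked for details
bridge -d link show dev vethwe-bridge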
@awh will take a look! |
@awh / @bboreham so yes we have hairpin_mode on. Question is how and why.
While on another node:
I am checking through the code to determine what is doing this. Any ideas on your side? |
So what does kubelet's hairpin mode default to? And what should we have hairpin set to? |
It's ok to have hairpin mode on for a device that connects a container to the bridge; it's bad to have it on for the veth that connects weave's bridge to the datapath (vethwe-bridge). Kubelet has a --hairpin-mode flag that controls this. |
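For context, kubelet's hairpin behaviour is driven by that command-line flag; a minimal illustration (the set of accepted values is as documented for kubelet, and which one is the default has varied between releases):
# accepted values: promiscuous-bridge, hairpin-veth, none
kubelet --hairpin-mode=hairpin-veth ...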
If I understand correctly, we have hit a kubelet bug? If so, we are on 1.4.8. What released version is the fix in? Has it been cherry-picked? What is the PR? |
Also what commands can I execute to remove the hairpin and test? |
Not necessarily - we're just using that as an example to illustrate that there have been instances in the past where things outside of our control have erroneously enabled hairpin on the veth that connects weave's bridge and datapath. This is the bug we're talking about: kubernetes/kubernetes#19766 |
We have it on for vethwe-bridge, and I'm not sure about the weave interface. What does 'bad' mean ;) |
Would bad cause intermittent packet loss? |
Btw, totally not on weave on this one; kubelet decided to do this ... Is there a Debian command where I can disable the hairpin and retest without removing the interface? |
|
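For a quick test, hairpin can be turned off without removing the interface, either via sysfs or the iproute2 bridge tool; this assumes the same weave/vethwe-bridge names as above:
echo 0 > /sys/class/net/weave/brif/vethwe-bridge/hairpin_mode
# equivalently
bridge link set dev vethwe-bridge hairpin off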
If you want to ping me later today on slack or zoom, that would be grand |
@bboreham I am escalating your k8s PR as a cherry-pick, btw. |
FYI
Was only on for |
kubelet is turning it right back on ... :) So manual intervention is not helping. Let me figure out my kubelet service :P |
Check your kubelet logs - it may say why it is turning it on (e.g. an error like the one it emits when ethtool is missing) |
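If kubelet runs as a systemd unit named kubelet (typical on Debian-based hosts, but an assumption here), something like this should surface any hairpin-related messages:
journalctl -u kubelet --no-pager | grep -i hairpin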
nada |
nada == "nothing in the kubelet logs" - to be clear, hairpin keeps turning itself back on, because we are hitting an error from CNI? |
I am attempting to get curl on the pod, but apt-get stinks when you have intermittent packet loss ... :( |
That is the batch of logs associated with the process that I have.
root@foo-393834086-6y2gi:/# !ping
ping www.google.com
ping: unknown host |
What are the next steps that you recommend? We are going to get a tcpdump and crack that open; we have one of UDP traffic, but need one with TCP traffic. |
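For the capture itself, something along these lines should do, assuming the weave bridge interface on the host is called weave; the filter can be narrowed to specific ports as needed:
# write TCP traffic crossing the weave bridge to a file for offline analysis
tcpdump -i weave -w weave-tcp.pcap tcp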
Well, interestingly enough:
I am going to figure out the damn code lines in kubelet and get the logging bumped up on it. Thanks for pointing me in that direction again @bboreham ... |
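Raising kubelet's verbosity is just a matter of adding a higher --v level to its invocation and restarting the service; a rough sketch, assuming a systemd-managed kubelet:
# e.g. add --v=4 to the kubelet arguments in the unit/args file, then:
systemctl daemon-reload && systemctl restart kubelet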
Closing |
Related: kubernetes/kubernetes#36990 |
@chrislovecnm @awh A slight update. We've requested a cherry-pick of this fix into the 1.5 release and also into the 1.4 release. I'm beginning to wonder if this issue is an aws-weave thing, since it doesn't seem to be affecting a larger group of people (if it were, the noise about this issue would be louder). That said, you might find this interesting (see image below 👇). When we let the cluster idle, we have no hairpin errors. When we use the cluster for anything, we get a sudden rise in the number of hairpin errors, and that results in random communication failures (as we'd expect). |
We are running weave 1.8.2, as a DaemonSet, on K8s.
We have seen errors like the following from the weave pods:
Besides the weave pods, other pods were having basic TCP socket connection issues: connections were intermittent, and/or pods were unable to make a connection at all.
From what it appears, this was limited to a single AZ in AWS. If and when this occurs again, what diagnostic information do you require?
cc @bboreham
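As a starting point next time it happens, the weave pod logs and weave status output from an affected node are usually the first things to grab; a sketch, assuming the DaemonSet runs in kube-system with a container named weave and the usual weave-kube image layout (the pod name placeholder is hypothetical):
kubectl -n kube-system get pods -o wide | grep weave
kubectl -n kube-system logs <weave-pod-on-affected-node> -c weave
kubectl -n kube-system exec <weave-pod-on-affected-node> -c weave -- /home/weave/weave --local status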