IP address sharing + externalTrafficPolicy: Local; exact same selector mixes traffic between pods #271

Closed
uablrek opened this issue Jun 27, 2018 · 2 comments

Comments

@uablrek
Contributor

uablrek commented Jun 27, 2018

Is this a bug report or a feature request?:

Bug

What happened:

The metallb documentation states that if address sharing and externalTrafficPolicy: Local are combined, the services must "have the exact same selector".

But the implementing pods must then also match the same selector, or Kubernetes complains. However, Kubernetes seems to regard the pods as exactly equal and distributes traffic for all services with the shared IP (and the same selector) to all pods, regardless of which ports they serve.

Hence my two pods, one serving port 22 and the other port 5001, get a mix of traffic for both ports, and half of the traffic is lost.

What you expected to happen:

Traffic to a service should be distributed only to the pods implementing that service.

How to reproduce it (as minimally and precisely as possible):

I have two services and deployments:

apiVersion: v1
kind: Service
metadata:
  name: cgen
  annotations:
    metallb.universe.tf/allow-shared-ip: ekvm
spec:
  selector:
    app: ekvm
  ports:
  - port: 5001
  externalTrafficPolicy: Local
  loadBalancerIP: 10.0.0.2
  type: LoadBalancer
---
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: cgen-deployment
spec:
  selector:
    matchLabels:
      app: ekvm
  replicas: 4
  template:
    metadata:
      labels:
        app: ekvm
    spec:
      containers:
      - name: cgen
        image: example.com/cgen:0.0.1
        ports:
        - containerPort: 5001

and

apiVersion: v1
kind: Service
metadata:
  name: ekvm-busybox
  annotations:
    metallb.universe.tf/allow-shared-ip: ekvm
spec:
  selector:
    app: ekvm
  ports:
  - port: 1022
    name: ssh
    targetPort: 22
  externalTrafficPolicy: Local
  loadBalancerIP: 10.0.0.2
  type: LoadBalancer
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ekvm-busybox-deployment
spec:
  selector:
    matchLabels:
      app: ekvm
  replicas: 4
  template:
    metadata:
      labels:
        app: ekvm
    spec:
      containers:
      - name: ekvm-busybox
        image: example.com/ekvm-busybox:0.0.1
        ports:
        - containerPort: 22
          name: ssh

When started, every other attempt on e.g. port 1022 will fail:

vm-201 ~ # ssh -p 1022 10.0.0.2 netstat -putan
dbclient: Caution, skipping hostkey check for 10.0.0.2

Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      35/dropbear
tcp        0      0 11.0.2.2:22             192.168.0.201:56530     ESTABLISHED 56/dropbear
tcp        0      0 :::80                   :::*                    LISTEN      28/inetd
tcp        0      0 :::22                   :::*                    LISTEN      35/dropbear
tcp        0      0 :::23                   :::*                    LISTEN      28/inetd
vm-201 ~ # ssh -p 1022 10.0.0.2 netstat -putan

dbclient: Connection to [email protected]:1022 exited: Connect failed: Connection refused

The source address is preserved (192.168.0.201), as seen in the first attempt; so far so good. But the second attempt fails because it is routed to the other pod, which serves only port 5001.

I run proxy-mode=ipvs, so it is quite easy to visualize the problem:

vm-003 ~ # ipvsadm -Ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.0.3:31374 rr
  -> 11.0.3.2:5001                Masq    1      0          0
  -> 11.0.3.3:5001                Masq    1      0          0
TCP  10.0.0.2:1022 rr
  -> 11.0.3.2:22                  Masq    1      0          0
  -> 11.0.3.3:22                  Masq    1      0          0
TCP  10.0.0.2:5001 rr
  -> 11.0.3.2:5001                Masq    1      0          0
  -> 11.0.3.3:5001                Masq    1      0          0
TCP  11.0.3.1:30919 rr
...

The loadBalancerIP 10.0.0.2 is distributed locally (good), but traffic for port 1022 is distributed to both local pods, which is a bug IMO. Maybe not in metallb, though.

Anything else we need to know?:

Actually, I can't understand the restriction requiring exactly the same selector.

By removing the check I can make it work exactly the way I want.

Environment:

  • MetalLB version:
    Built from source, commit d38ad1e
  • Kubernetes version:
    v1.10.5-beta
  • BGP router type/version:
    gobgp
  • OS (e.g. from /etc/os-release):
    Own BusyBox-based
  • Kernel (e.g. uname -a):
    Linux vm-003 4.16.2 #2 SMP Mon Jun 11 16:02:46 CEST 2018 x86_64 GNU/Linux
@danderson
Contributor

Thanks for the report.

You're right. I tried to be generic in the documentation, but I actually need to be more specific: to use externalTrafficPolicy: Local on a shared-IP service, all services must send traffic to the same set of pods.

This is because MetalLB can only control traffic flow at L3, so it's all-or-nothing. MetalLB can't tell the outside world "I want traffic for IP 1.2.3.4, but only port 80", it can only say "I want all traffic for 1.2.3.4". This is fine with the Cluster traffic policy, because kube-proxy will forward traffic per-port to the right destinations, no matter where they are. But with the Local traffic policy, kube-proxy can only forward to pods on the same node.

So, we have 2 constraints:

  • MetalLB can only attract traffic for all ports on an IP
  • kube-proxy can only route traffic to pods on the same node

There are 2 ways to work with these constraints:

  • Only receive traffic if all shared services have >=1 healthy pod on the current node. This leads to confusing behavior, where one unhealthy service triggers traffic shifts in other services.
  • Force all services to use the exact same pods as backend. That way, the ready/unready endpoint data is identical for all services, and it's easy to decide if a node should receive traffic.

MetalLB picks the second option, because the first leads to a bunch of confusing behaviors that would mean a lot more bugs.
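
For reference, one way to satisfy this constraint with the manifests from the report above would be to run both containers in a single Deployment, so both Services (unchanged, both selecting app: ekvm) back onto exactly the same pods. A rough sketch, reusing the names and images from the report:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ekvm-deployment
spec:
  selector:
    matchLabels:
      app: ekvm
  replicas: 4
  template:
    metadata:
      labels:
        app: ekvm
    spec:
      containers:
      - name: cgen                    # serves port 5001
        image: example.com/cgen:0.0.1
        ports:
        - containerPort: 5001
      - name: ekvm-busybox            # serves port 22
        image: example.com/ekvm-busybox:0.0.1
        ports:
        - containerPort: 22
          name: ssh

With this layout the ready/unready endpoint data is identical for both Services, which is exactly what the check enforces.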

Unfortunately this means that IP sharing with externalTrafficPolicy: Local is only really useful as a workaround for kubernetes/kubernetes#23880 , and not as a generic IP sharing mechanism.

It's pretty clear that I need to make the docs more explicit about the limitations for your scenario. If you have suggestions on how to make this behavior better (e.g. how to remove this constraint without a flood of bug reports about confusing traffic routing because of what I explained above), I'd love to make IP sharing more generally useful.

@uablrek
Contributor Author

uablrek commented Jul 1, 2018

Thanks for your elaborate answer; I see your point.

But IMHO a bare-metal k8s system will never be as "forgiving" as e.g. GCE, so one may assume somewhat better knowledge from those users. The local traffic policy is often a very desired, if not required, feature for source address preservation. I think a reasonable division of responsibility would be:

  • Metallb makes sure that external traffic (L3) is routed to a set of nodes.
  • The application makes sure there are traffic-handling pods on those nodes if the local traffic policy is used.

Most likely an application will be assigned one external IP but will be implemented as several services with different functions and ports for external traffic, which is why I want both shared-IP and local traffic policy.

I also foresee a scaling problem that I hope can be turned into a feature using the local traffic policy:

In a large system all nodes can't be ECMP targets. In that case I would like to use "frontend" pods (e.g. Ingress Controllers) forming some "load-balancing tier" and using the local traffic policy to avoid an unnecessary hop. External traffic flow would become very efficient, I think, and the source would always be preserved.
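
As a rough sketch of what I mean (the ingress-frontend name, labels and address are just examples), the frontend Service would look something like this:

apiVersion: v1
kind: Service
metadata:
  name: ingress-frontend
spec:
  selector:
    app: ingress-frontend        # assumed label on the ingress-controller pods
  ports:
  - name: http
    port: 80
  - name: https
    port: 443
  externalTrafficPolicy: Local   # keep traffic on the node that received it, preserving the source address
  loadBalancerIP: 10.0.0.3       # example address from the metallb pool
  type: LoadBalancer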

I would suggest some "expert" option to metallb, like --unsafe-local-policy, that would disable the check and also allow the local traffic policy in L2 mode. But from long experience I know that it would not prevent the flood of bug reports. Still, if you think the flow would be bearable, please consider it.

For now I can patch metallb as described. Another option is to correct externalIPs in k8s. It already does local traffic policy and drops traffic if there is no local pod, but ... it still does the SNAT, which is totally unnecessary.
