Bridge proxy arp #1744

djlwilder · 2017-05-03T20:15:34Z

This change enables the Linux bridge's ProxyArpWiFi feature, eliminating the need to flood ARP packets. When an endpoint is created, the bridge driver already has the data needed to complete the ARP and fdb table entries. Rather than let the kernel discover this information on its own, we populate the ARP and fdb tables when the endpoint is configured. All other broadcast traffic passes normally, allowing the administrator to manage it using ebtables.

Linux bridge ProxyArpWiFi is enabled with:
--opt com.docker.network.bridge.proxyarp=1

Dependencies:
Linux kernel v4.1-rc1 or later (commit 842a9ae08a25671db3d4f689eed68b4d64be15)
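
For illustration, the option above could be turned into a driver setting roughly as follows. This is a minimal sketch, not the code in this PR: the `bridgeConfig` type, the `EnableProxyArp` field name, and the `parseBridgeOptions` helper are hypothetical, and the use of `strconv.ParseBool` for the value is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
)

// proxyArpLabel is the option key quoted in the PR description.
const proxyArpLabel = "com.docker.network.bridge.proxyarp"

// bridgeConfig is a hypothetical stand-in for the driver's network
// configuration; only the field relevant to this option is shown.
type bridgeConfig struct {
	EnableProxyArp bool
}

// parseBridgeOptions reads the proxy ARP option from a generic
// key/value option map. Values such as "1", "true", "0" and "false"
// are accepted because strconv.ParseBool is used (an assumption).
func parseBridgeOptions(opts map[string]string) (bridgeConfig, error) {
	var cfg bridgeConfig
	if v, ok := opts[proxyArpLabel]; ok {
		enabled, err := strconv.ParseBool(v)
		if err != nil {
			return cfg, fmt.Errorf("invalid value %q for %s: %w", v, proxyArpLabel, err)
		}
		cfg.EnableProxyArp = enabled
	}
	return cfg, nil
}

func main() {
	cfg, err := parseBridgeOptions(map[string]string{proxyArpLabel: "1"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("proxy ARP enabled:", cfg.EnableProxyArp)
}
```

With this shape, a network created without the option keeps the default (disabled) behaviour, so existing bridge networks are unaffected.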

djlwilder · 2017-05-03T20:17:34Z

Some background on this change:
One obstacle to scaling a bridge network is the management of broadcast packets. To emulate an Ethernet-type network, the Linux bridge must deliver broadcast packets to every port of the bridge. This is accomplished by cloning and flooding broadcasts to every endpoint. Each of these cloned packets becomes an ingress packet in the kernel's network stack. If a bridge has 1024 ports, one broadcast packet must be processed by the kernel 1024 times. We have demonstrated running 10,000 containers on a single bridge network, which increases the problem tenfold. Using multiple smaller L3 networks can reduce this overhead but creates additional management problems.

The ARP protocol makes use of broadcast packets when sending who-has requests. In our 10,000-node tests we send ARP who-has packets to all 10,000 nodes; this requires the kernel to process 100,000,000 broadcast packets (all but 10k are dropped)! This can cause the kernel's network ingress queues to overflow, resulting in packet drops and connection establishment timeouts. I have seen this problem with configurations of only a few thousand nodes as well, even when using multiple bridges. Increasing the size of the ingress queue helps but results in higher latency when the queue is filled with broadcast packets. Ebtables can be used to eliminate broadcast traffic or to limit delivery to a small number of endpoints. However, doing so breaks the ARP protocol, as ARP relies on broadcast packets. This change enables the Linux bridge's ProxyArpWiFi feature, eliminating the need to flood ARP packets. All other broadcast traffic passes normally, allowing the administrator to manage it using ebtables.
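
The arithmetic behind those figures can be sketched as below. This is only a back-of-the-envelope model that follows the approximation used above (every one of n endpoints sends one who-has request, and each request is flooded to n ports); the function names are made up for the example.

```go
package main

import "fmt"

// floodedDeliveries returns how many packet copies the kernel processes
// when each of n endpoints sends one broadcast and the bridge clones it
// to n ports (the approximation used in the discussion).
func floodedDeliveries(n int) int {
	return n * n
}

// usefulDeliveries returns how many of those copies reach the endpoint
// that owns the requested address: one per request.
func usefulDeliveries(n int) int {
	return n
}

func main() {
	for _, n := range []int{1024, 10000} {
		total := floodedDeliveries(n)
		useful := usefulDeliveries(n)
		fmt.Printf("ports=%d flooded=%d useful=%d dropped=%d\n",
			n, total, useful, total-useful)
	}
}
```

For n = 10,000 this gives 100,000,000 flooded copies, of which only 10,000 are answered; the quadratic growth is why the load becomes a problem well before that scale.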

aboch · 2017-05-04T16:43:09Z

drivers/bridge/bridge.go

+			return fmt.Errorf("could not find interface with destination name %s: %v", config.BridgeName, err)
+		}
+
+		HostIF, err := d.nlh.LinkByName(hostIfName)


Please remove this; the host variable already contains the host-side veth end link. In fact, you use it later:

err = d.nlh.LinkSetBrProxyArpWiFi(host, true)

aboch · 2017-05-04T16:44:19Z

drivers/bridge/bridge.go

@@ -60,6 +60,7 @@ type networkConfiguration struct {
 	EnableIPv6           bool
 	EnableIPMasquerade   bool
 	EnableICC            bool
+	EnableBrProxyArp     bool


Given it is already part of the bridge options, would it make sense to drop the Br part from the name of this option?

djlwilder · 2017-05-04T21:20:07Z

Thank you for the review. I agree with both of your suggestions and will update the commit shortly.

This change enables the Linux bridge's ProxyArpWiFi feature, eliminating the need to flood ARP packets. When an endpoint is created, the bridge driver already has the data needed to complete the ARP and fdb table entries. Rather than let the kernel discover this information on its own, we populate the ARP and fdb tables when the endpoint is configured. All other broadcast traffic passes normally, allowing the administrator to manage it using ebtables.

Linux bridge ProxyArpWiFi is enabled with:
--opt com.docker.network.bridge.proxyarp=1

Dependencies:
Linux kernel v4.1-rc1 or later (commit 842a9ae08a25671db3d4f689eed68b4d64be15)

Updated based on review comments from aboch.

Signed-off-by: David Wilder <[email protected]>

djlwilder · 2017-05-05T17:58:55Z

Hi, Changes have been made based on @aboch suggestions. Commits have been squashed.

aboch · 2017-05-05T18:48:42Z

Thanks @djlwilder.
I was expecting to see the fdb entry being removed on DeleteEndpoint() call.
Is there a reason not to clean it up once the container is disconnected from this network?

djlwilder · 2017-05-05T19:19:33Z

Hi @aboch, the fdb entry will be cleaned up automatically when the endpoint is removed from the bridge (when the bridge port is deleted). However, to anticipate your next question :) Should I remove the permanent neighbor entry (ARP) on DeleteEndpoint()? I gave this some thought: there is no harm in leaving the old entry around, as the MAC is derived from the IP address. If a new endpoint is created reusing the same IPv4 address, the old neighbor entry will be re-used. My concern with removing a neighbor entry is the possibility of a race between two threads deleting and creating an endpoint and re-using the same address. Although it may be cleaner to delete the neighbor entry anyway, what do you think?
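
To make the "MAC is derived from the IP address" point concrete, here is a minimal sketch of such a derivation. It assumes the scheme of a fixed `02:42` prefix followed by the four IPv4 bytes, which is my understanding of how libnetwork generates default MAC addresses; the `macFromIPv4` and `macString` names are made up for this example and are not the library's API.

```go
package main

import (
	"fmt"
	"net"
)

// macFromIPv4 derives a MAC address from an IPv4 address by placing the
// four address bytes after a fixed two-byte prefix. The 0x02 first byte
// marks the address as locally administered and unicast.
// Assumption: the 02:42 prefix matches libnetwork's default scheme.
func macFromIPv4(ip net.IP) (net.HardwareAddr, error) {
	ip4 := ip.To4()
	if ip4 == nil {
		return nil, fmt.Errorf("not an IPv4 address: %v", ip)
	}
	hw := make(net.HardwareAddr, 6)
	hw[0] = 0x02
	hw[1] = 0x42
	copy(hw[2:], ip4)
	return hw, nil
}

// macString is a convenience wrapper that takes a textual address and
// returns the derived MAC as a string, or "" if the input is not IPv4.
func macString(addr string) string {
	hw, err := macFromIPv4(net.ParseIP(addr))
	if err != nil {
		return ""
	}
	return hw.String()
}

func main() {
	// The same IP always maps to the same MAC, so a stale neighbor
	// entry for a reused IP still points at the right address.
	fmt.Println(macString("172.17.0.2")) // 02:42:ac:11:00:02
	fmt.Println(macString("172.17.0.2")) // 02:42:ac:11:00:02
}
```

Because the mapping is deterministic, a leftover permanent neighbor entry stays valid when the same IPv4 address is handed to a new endpoint, as long as the MAC is auto-generated rather than chosen by the user.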

aboch · 2017-05-05T20:59:06Z

Thanks, agree.

In this regard, have you checked whether NeighSet will fail when asked to add an existing entry (http://elixir.free-electrons.com/linux/latest/source/net/core/neighbour.c#L1720)?
If so, you may need to discard the syscall.EEXIST error on addition.

Regarding the MAC-from-IP logic, be aware that it does not apply when the user selects the MAC address (docker run --mac <...>) or when an external IPAM driver that requires the endpoint MAC address to select the IP is used (https://github.com/docker/libnetwork/blob/master/docs/ipam.md#requiresmacaddress).

djlwilder · 2017-05-08T16:42:30Z

Hi @aboch
I am using NeighSet() rather than NeighAdd(). The latter sets NLM_F_EXCL, which causes EEXIST to be returned if the entry already exists. NeighSet(), however, uses NLM_F_REPLACE, so it should update the existing entry and won't return EEXIST.
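
The difference between the two flag combinations can be illustrated with a toy in-memory model. This is a simplified sketch, not the kernel's or the netlink library's implementation: the `neighTable` type and its methods are hypothetical, and the flag constants are local copies of the standard netlink values.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// Local copies of the netlink request flags discussed above.
const (
	nlmFReplace = 0x100 // NLM_F_REPLACE: override an existing entry
	nlmFExcl    = 0x200 // NLM_F_EXCL: fail if the entry already exists
	nlmFCreate  = 0x400 // NLM_F_CREATE: create the entry if missing
)

// neighTable is a toy neighbor table mapping an IP to a MAC address.
type neighTable map[string]string

// request applies a simplified version of the create/exclusive/replace
// semantics to the table.
func (t neighTable) request(ip, mac string, flags int) error {
	_, exists := t[ip]
	switch {
	case exists && flags&nlmFExcl != 0:
		return syscall.EEXIST
	case exists && flags&nlmFReplace == 0:
		return nil // entry kept as-is in this simplified model
	case !exists && flags&nlmFCreate == 0:
		return syscall.ENOENT
	}
	t[ip] = mac
	return nil
}

// add models NeighAdd-style behaviour: create, but never overwrite.
func (t neighTable) add(ip, mac string) error {
	return t.request(ip, mac, nlmFCreate|nlmFExcl)
}

// set models NeighSet-style behaviour: create or overwrite.
func (t neighTable) set(ip, mac string) error {
	return t.request(ip, mac, nlmFCreate|nlmFReplace)
}

func main() {
	tbl := neighTable{}
	fmt.Println("first add:", tbl.add("172.17.0.2", "02:42:ac:11:00:02"))
	err := tbl.add("172.17.0.2", "02:42:ac:11:00:02")
	fmt.Println("second add is EEXIST:", errors.Is(err, syscall.EEXIST))
	fmt.Println("set:", tbl.set("172.17.0.2", "02:42:de:ad:be:ef"))
	fmt.Println("entry now:", tbl["172.17.0.2"])
}
```

In this model, repeated `set` calls for a reused IP address are idempotent and also pick up a changed MAC, which is the behaviour the proxy ARP code relies on.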

Thanks for pointing out the --mac option, I missed that. Using NLM_F_REPLACE should handle the case where the MAC changes, however I need to test that. I will let you know the results.

Thanks again for the feedback.

djlwilder · 2017-05-08T18:20:50Z

Hi @aboch, I verified that the --mac option works correctly with the proxyarp feature. I started a container using the --mac option, then stopped it. I then started another container using a different MAC address but the same IP address, and verified that the neighbour entry was updated with the new MAC address as expected.

BTW: I found a bug with the --mac option (unrelated to my changes): it is possible to have two running containers with different IP addresses and the same MAC address.

aboch · 2017-05-08T18:23:47Z

Thanks @djlwilder

LGTM

aboch suggested changes May 4, 2017

aboch reviewed May 4, 2017

djlwilder added 2 commits May 5, 2017 09:12

Vendoring the netlink changes.

808e5cb

Signed-off-by: David Wilder <[email protected]>

djlwilder force-pushed the BridgeProxyArp branch from d32350c to 808e5cb Compare May 5, 2017 16:17

aboch approved these changes May 8, 2017

corhere added the carry-to-mobymoby label Jan 23, 2024
