Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bridge proxy arp #1744

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Bridge proxy arp #1744

wants to merge 2 commits into from

Conversation

djlwilder
Copy link

This change enables the Linux bridge's ProxyArpWiFi feature eliminating the need flood ARP packets. When an endpoint is created the bridge driver already has the data needed to complete arp and fdb table entries. Rather than let the kernel discover this information on its own we populate the arp and fdb tables when the endpoint is configured. All other broadcast traffic will pass normal allowing the administrator to manage it using ebtables.

Linux bridge ProxyArpWifi is enabled with:
--opt com.docker.network.bridge.proxyarp=1

Dependencies:
linux kernel v4.1-rc1 or later(commit 842a9ae08a25671db3d4f689eed68b4d64be15)

@djlwilder
Copy link
Author

Some background on this change:
One obstacle to scaling a bridge network is the management of broadcast packets. To emulate a Ethernet type network the Linux bridge must deliver broadcast packets to every port of the bridge. This is accomplished by cloning and flooding broadcasts to every endpoint. Each of these cloned packets becomes an ingress packet in the kernel's network stack. If a bridge has 1024 ports one broadcast packets must be process by the kernel 1024 times. We have demonstrated running 10,000 containers on a single bridge network increasing the issue ten fold. Using multiple smaller L3 networks can reduce this overhead but creates additional management problems.

The ARP protocol makes use of broadcast packets when sending who-has requests. In our 10,000 node tests, we will send Arp who-has packets to all 10,000 nodes, this requires the kernel to process 100,000,000 broadcast packets (all but 10k are dropped)! This can cause the kernel's network ingress queues to overflow resulting in packet drops and connection establishment timeouts. I have seen this problem with configurations of only a few thousand nodes as well, even when using multiple bridges. Increasing the size of the ingress queue helps but results in higher latency when the queue is filled with broadcast packets. Brtables can be used to eliminate broadcast traffic or to limit delivery to a small number of endpoints. However, doing so breaks the Arp protocol as arps relies on broadcast packets. This change enables the Linux bridge's ProxyArpWiFi feature eliminating the need flood ARP packets. All other broadcast traffic will pass normal allowing the administrator to manage it using ebtables.

return fmt.Errorf("could not find interface with destination name %s: %v", config.BridgeName, err)
}

HostIF, err := d.nlh.LinkByName(hostIfName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this, host variable already contains the host side veth end link. In fact later you use that

err = d.nlh.LinkSetBrProxyArpWiFi(host, true)

@@ -60,6 +60,7 @@ type networkConfiguration struct {
EnableIPv6 bool
EnableIPMasquerade bool
EnableICC bool
EnableBrProxyArp bool
Copy link
Contributor

@aboch aboch May 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given it is already part of the bridge options, would it make sense to drop the Br part from the name of this option ?

@djlwilder
Copy link
Author

djlwilder commented May 4, 2017

Thank you for the review. I agree with both of your suggestions, I will update the commit shortly.

This change enables the Linux bridge's ProxyArpWiFi feature eliminating
the need flood ARP packets. When an endpoint is created the bridge
driver already has the data needed to complete arp and fdb table
entries. Rather than let the kernel discover this information on its own
we populate the arp and fdb tables when the endpoint is configured. All
other broadcast traffic will pass normal allowing the administrator
to manage it using ebtables.

Linux bridge ProxyArpWifi is enabled with:
--opt com.docker.network.bridge.proxyarp=1

Dependencies:
linux kernel v4.1-rc1 or later(commit 842a9ae08a25671db3d4f689eed68b4d64be15)

Updated based on review comments from aboch.

Signed-off-by: David Wilder <[email protected]>
@djlwilder
Copy link
Author

Hi, Changes have been made based on @aboch suggestions. Commits have been squashed.

@aboch
Copy link
Contributor

aboch commented May 5, 2017

Thanks @djlwilder.
I was expecting to see the fdb entry being removed on DeleteEndpoint() call.
Is there a reason not to clean it up once the container is disconnected from this network ?

@djlwilder
Copy link
Author

Hi @aboch the fdb entry will be cleaned automatically when the endpoint is removed from the bridge (when the bridge port is deleted). However, to anticipate you next question :) Should I remove the permanent neighbor entry (arp) on DeleteEndpoint()? I gave this some thought, there is no harm in leaving the old entry around as the MAC is derived from the IP address. If a new endpoint is create reusing the same ipv4 address the old neighbor entry will be re-used. My concern with removing a neighbor entry is the possibility of a race between two threads deleting and creating a endpoint and re-using the same address. Although it may be cleaner to delete the neighbor entry anyway, what do you think?

@aboch
Copy link
Contributor

aboch commented May 5, 2017

Thanks, agree.

To this regard, have you checked whether NeighSet will fail on request of adding an existing entry (http://elixir.free-electrons.com/linux/latest/source/net/core/neighbour.c#L1720) ?
If yes, you may need to discard the syscall.EEXIST error on addition.

Regarding the MAC from IP logic, be aware it is not there when the user selects the MAC address
docker run --mac <...> or when an external IPAM driver which requires the endpoint MAC address to select the IP is used (https://github.com/docker/libnetwork/blob/master/docs/ipam.md#requiresmacaddress)

@djlwilder
Copy link
Author

Hi @aboch
I am using NeighSet() rather than NeighAdd(). The latter sets NLM_F_EXCL, this will cause EEXIST to be returned if the entry already exists. NeighSet() however uses NLM_F_REPLACE and should update the existing entry but wont return EEXIST.

Thanks for pointing out the --mac option, I missed that. Using NLM_F_REPLACE should handle the case where the MAC changes, however I need to test that. I will let you know the results.

Thanks again for the feedback.

@djlwilder
Copy link
Author

Hi @aboch I verify that the --mac option works correctly with the proxyarp feature. I started a container using the --mac option, then stopped it. Then started another container using a different mac address but the same IP address. I verified that the neighbour entry was updated with the new MAC address as expected.

BTW: found a bug with --mac option (unrelated to my changes), it is possible have two running containers with different IP address and the same MAC address.

@aboch
Copy link
Contributor

aboch commented May 8, 2017

Thanks @djlwilder

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants