[18.06] Improve scalability of the Linux load balancing #16
Bump libnetwork to b0186632522c68f4e1222c4f6d7dbe518882024f. This includes the following changes:
* Dockerize protocol buffer generation and update (78d9390a..e12dd44c)
* Use new plugin interfaces provided by plugin pkg (be94e134)
* Improve linux load-balancing scalability (5111c24e..366b9110)
Signed-off-by: Chris Telfer <[email protected]>
(cherry picked from commit 92335ea)
Signed-off-by: Sebastiaan van Stijn <[email protected]>
This patch is required for the updated version of libnetwork and entails two minor changes.
First, it uses the new libnetwork.NetworkDeleteOptionRemoveLB option to the network.Delete() method to automatically remove the load balancing endpoint for ingress networks. This allows removal of the deleteLoadBalancerSandbox() function, whose functionality is now within libnetwork.
The second change is to allocate a load balancer endpoint IP address for all overlay networks rather than just "ingress" and windows overlay networks. Swarmkit is already performing this allocation, but moby was not making use of these IP addresses for Linux overlay networks (except ingress). The current version of libnetwork makes use of these IP addresses by creating a load balancing sandbox and endpoint similar to ingress's for all overlay networks and putting all load balancing state for a given node in that sandbox only. This reduces the amount of linux kernel state required per node.
In the prior scheme, libnetwork would program each container's network namespace with every piece of load balancing state for every other container that shared *any* network with the first container. This meant that the amount of kernel state on a given node scaled with the square of the number of services in the cluster and with the square of the number of containers per service. With the new scheme, kernel state at each node scales linearly with the number of services and the number of containers per service.
This also reduces the number of system calls required to add or remove tasks and containers. Previously the number of system calls required grew linearly with the number of other tasks that shared a network with the container. Now the number of system calls grows linearly only with the number of networks that the task/container is attached to. This results in a significant performance improvement when adding and removing services in a cluster that is already heavily loaded.
The primary disadvantage to this scheme is that it requires the allocation of an additional IP address per node per subnet for every node in the cluster that has a task on the given subnet. However, as mentioned, swarmkit is already allocating these IP addresses for every node and they are going unused. Future swarmkit modifications should be examined to only allocate said IP addresses when nodes actually require them. Signed-off-by: Chris Telfer <[email protected]> (cherry picked from commit 8e0f6bc) Signed-off-by: Sebastiaan van Stijn <[email protected]>
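The scaling argument in the commit message above can be illustrated with a toy model. This is only a sketch under stated assumptions, not libnetwork code: the `cluster` type, its fields, and the entry-counting functions are hypothetical, every service is assumed to share a single overlay network, and tasks are assumed to be spread evenly across nodes.

```go
package main

import "fmt"

// cluster is a hypothetical, simplified model (not a libnetwork type):
// all services share one overlay network and tasks are spread evenly.
type cluster struct {
	services        int // number of services in the cluster
	tasksPerService int // containers (tasks) per service
	nodes           int // nodes the tasks are spread across
}

// backends is the total number of load-balancing backends on the network.
func (c cluster) backends() int { return c.services * c.tasksPerService }

// oldEntriesPerNode models the prior scheme: every local container
// namespace holds one entry for every backend on the shared network.
func (c cluster) oldEntriesPerNode() int {
	localContainers := c.backends() / c.nodes
	return localContainers * c.backends()
}

// newEntriesPerNode models the new scheme: one per-network load-balancing
// sandbox on the node holds one entry per backend.
func (c cluster) newEntriesPerNode() int { return c.backends() }

func main() {
	// Doubling the number of services quadruples the old per-node count
	// but only doubles the new one.
	for _, s := range []int{10, 20, 40} {
		c := cluster{services: s, tasksPerService: 5, nodes: 5}
		fmt.Printf("services=%d old=%d new=%d\n",
			s, c.oldEntriesPerNode(), c.newEntriesPerNode())
	}
}
```

The model counts abstract "entries" rather than real kernel objects, so only the growth rates are meaningful, not the absolute numbers.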
ping @fcrisciani @ctelfer PTAL
hm... outage somewhere?
We should also include 6225d1f if that is not in 18.06. Other than that LGTM.
Bump libnetwork to 3ac297bc7fd0afec9051bbb47024c9bc1d75bf5b in order to get fix 0c3d9f00, which addresses a flaw that the scalable load balancing code revealed. Attempting to print sandbox IDs where the sandbox name was too short results in a goroutine panic. In the previous code this could occur with sandboxes whose names were 1 or 2 characters long. Due to naming updates in the scalable load balancing code, it could now also occur for networks whose names were 3 characters long, and at least one of the integration tests employed such networks (named 'foo', 'bar' and 'baz').
This update also brings in several other changes:
* 6c7c6017 - Fix error handling about bridgeSetup
* 5ed38221 - Optimize networkDB queue
* cfa9afdb - ndots: produce error on negative numbers
* 5586e226 - improve error message for invalid ndots number
* 449672e5 - Allows to set generic knobs on the Sandbox
* 6b4c4af7 - do not ignore user-provided "ndots:0" option
* 843a0e42 - Adjust corner case for reconnect logic
Signed-off-by: Chris Telfer <[email protected]>
(cherry picked from commit 0e162d9)
Signed-off-by: Sebastiaan van Stijn <[email protected]>
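The class of bug described above (slicing an identifier that is shorter than the requested prefix) can be sketched as follows. This is an illustration only, not the libnetwork fix in 0c3d9f00: `shortID` is a hypothetical helper, and the prefix length of 7 is an arbitrary choice for the example.

```go
package main

import "fmt"

// shortID returns at most n leading bytes of id and never slices past the
// end of the string. Hypothetical helper; not a libnetwork function.
func shortID(id string, n int) string {
	if n < 0 {
		n = 0
	}
	if len(id) < n {
		return id // shorter than the requested prefix: return it whole
	}
	return id[:n]
}

func main() {
	// An unguarded id[:7] on "foo" would panic at run time with a
	// "slice bounds out of range" error; the guarded helper does not.
	for _, id := range []string{"foo", "ab", "0123456789abcdef"} {
		fmt.Printf("%q -> %q\n", id, shortID(id, 7))
	}
}
```

Guarding the slice in one helper keeps log and debug formatting from crashing a goroutine on unusually short names.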
db5472f to 1c42326
@ctelfer Looks like moby#37156 is not in the 18.06 branch, so we shouldn't have the regression (won't hurt to include 6225d1f, but I think it's not necessary). If you can double-check in the 18.06 branch though (in case I overlooked) 🤗
Oh! That's true. The 18.06 branch does not have the wrapped error changes. I'm a little surprised, but not displeased.
Looks like we need more tests to catch issues like the one that @ctelfer fixed; this is not the first time that something gets broken when the error types get touched :(
LGTM
LGTM
Adding the new load balancer introduces a new behavior:
@yuvalo Thanks for the comment. A few thoughts and then queries:
As to the queries portion: I'd be interested to hear what kinds of east-west services you know of that require avoiding source address modification. I've heard some very vague statements to the effect that they exist, but haven't heard many real use cases. It would be helpful to understand how and why source NAT breaks things for east-west applications, so that we can identify a proper fix. Again, thanks!
@ctelfer Thanks for following up. Some of the use cases that come to mind are:
People tend to think of the overlay network as still internal and expect it to be more "flat" than it is in reality. Hope the above makes a compelling case 🤗
@yuvalo Thanks for the update and the quick response. Overlays are definitely flat networks. But services are an abstraction clearly built around a layer of indirection between clients and servers. Having said that, while I'm not a huge fan of IP-based identity as a concept, it is certainly traditional. Item 1 above was the main complaint that I've heard re: source NATing. There are, of course, workarounds for this as well, such as accounting via TLS or http header fields. (These can give even more semantically relevant info and stronger identity guarantees.) But obviously, some apps still use just basic IP info. Item 2 is one I haven't encountered before and very good to know about. I'm guessing that redundant NameNodes each have to be a separate Docker service because otherwise datanodes would have the same problem in reverse (i.e. the datanodes don't know which namenodes they are talking to). So Docker's networking abstractions are interfering with application abstractions built for similar purposes. In any case, I think that the DSR approach will address the cases you listed above. Now, just need a clean way to keep compatibility w/ the Windows overlay services.
@yuvalo So we have merged in the libnetwork changes to support DSR-mode load balancing as a per-network feature. This won't prevent upgrade failures, but it does provide a solution for when these situations arise. One can create a network using
Unfortunately, this is not yet merged into the docker engine itself, but we are looking to get it in soon (i.e. not waiting for next stable release), including looking at backports. The rationale for making this feature opt-in rather than the default boils down to:
I hope we'll have a version of Docker soon that can mitigate your issues.
Hi
I have nothing against this change, although it was a surprise and cost me some hours to figure it out.
backport of moby#37372 for 18.06
Changes included:
--dns-option=ndots:0 w/ user-defined networking results in duplicate ndots:0 options moby/moby#37349
cherry-pick was clean, no conflicts