Improve scalability of the Linux load balancing #37372
Codecov Report
@@ Coverage Diff @@
## master #37372 +/- ##
==========================================
+ Coverage 34.91% 34.95% +0.03%
==========================================
Files 610 610
Lines 44884 44853 -31
==========================================
+ Hits 15672 15679 +7
+ Misses 27092 27059 -33
+ Partials 2120 2115 -5
Bump libnetwork to b0186632522c68f4e1222c4f6d7dbe518882024f.

This includes the following changes:
* Dockerize protocol buffer generation and update (78d9390a..e12dd44c)
* Use new plugin interfaces provided by plugin pkg (be94e134)
* Improve linux load-balancing scalability (5111c24e..366b9110)

Signed-off-by: Chris Telfer <[email protected]>
This patch is required for the updated version of libnetwork and entails two minor changes.

First, it uses the new libnetwork.NetworkDeleteOptionRemoveLB option to the network.Delete() method to automatically remove the load balancing endpoint for ingress networks. This allows removal of the deleteLoadBalancerSandbox() function, whose functionality is now within libnetwork.

The second change is to allocate a load balancer endpoint IP address for all overlay networks rather than just "ingress" and windows overlay networks. Swarmkit is already performing this allocation, but moby was not making use of these IP addresses for Linux overlay networks (except ingress). The current version of libnetwork makes use of these IP addresses by creating a load balancing sandbox and endpoint similar to ingress's for all overlay networks and putting all load balancing state for a given node in that sandbox only. This reduces the amount of linux kernel state required per node.

In the prior scheme, libnetwork would program each container's network namespace with every piece of load balancing state for every other container that shared *any* network with the first container. This meant that the amount of kernel state on a given node scaled with the square of the number of services in the cluster and with the square of the number of containers per service. With the new scheme, kernel state at each node scales linearly with the number of services and the number of containers per service. This also reduces the number of system calls required to add or remove tasks and containers. Previously the number of system calls required grew linearly with the number of other tasks that shared a network with the container. Now the number of system calls grows linearly only with the number of networks that the task/container is attached to. This results in a significant performance improvement when adding and removing services to a cluster that is already heavily loaded.
The primary disadvantage to this scheme is that it requires the allocation of an additional IP address per node per subnet for every node in the cluster that has a task on the given subnet. However, as mentioned, swarmkit is already allocating these IP addresses for every node and they are going unused. Future swarmkit modifications should be examined to only allocate said IP addresses when nodes actually require them. Signed-off-by: Chris Telfer <[email protected]>
@ctelfer I see some failures; not sure if they're flaky, or related
I did some investigating of the tests that are failing. The Under
Of these, one (TestDockerNetworkIPAMOptions) that went and dumped goroutine traces. I reran a full integration-cli test run on I also ran the failing tests individually and most of them passed individually in the In the Finally, we have the I will keep investigating. |
Ok, so good, bad, ugly time.
Details -- it turns out that there is (was) a lot of libnetwork debug code of the form In one particular instance, if one specified an overlay network name that was 2 characters or less in the previous libnetwork code and that network required a network sandbox (i.e. on a Windows node), then the sandbox name would be 6 or fewer characters and panic. The scalable network patch made this worse by making use of more load-balancer sandboxes with user-specified names (as opposed to I patched libnetwork to use Fortunately, most of these are non-controversial minor bug fixes. The only one that alters behavior significantly is the 5ed38221 patch, which attempts to prevent gossip queue length explosion. Doing so can slow gossip convergence, but should only really do so when the gossip network is a bit overloaded in the first place (and so it hopefully speeds some of it up in that case). I did rerun all the SwarmSuite tests w/ the aforementioned libnetwork bump and all of them passed. Hopefully, the newest push gets a clean bill of health from moby CI then.
Bump libnetwork to 3ac297bc7fd0afec9051bbb47024c9bc1d75bf5b in order to get fix 0c3d9f00, which addresses a flaw that the scalable load balancing code revealed. Attempting to print sandbox IDs where the sandbox name was too short results in a goroutine panic. This could occur with sandboxes with names of 1 or 2 characters in the previous code. But due to naming updates in the scalable load balancing code, it could now occur for networks whose name was 3 characters, and at least one of the integration tests employed such networks (named 'foo', 'bar' and 'baz').

This update also brings in several other changes:
* 6c7c6017 - Fix error handling about bridgeSetup
* 5ed38221 - Optimize networkDB queue
* cfa9afdb - ndots: produce error on negative numbers
* 5586e226 - improve error message for invalid ndots number
* 449672e5 - Allows to set generic knobs on the Sandbox
* 6b4c4af7 - do not ignore user-provided "ndots:0" option
* 843a0e42 - Adjust corner case for reconnect logic

Signed-off-by: Chris Telfer <[email protected]>
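For readers unfamiliar with this class of bug, the sketch below shows how slicing a string to a fixed width panics when the string is shorter than that width, and one way to guard against it. It is an illustration only and not the code from fix 0c3d9f00. The width of 7 is inferred from the "6 or fewer characters" remark in the discussion above; the names `unsafeShort`, `safeShort` and `panics`, and the `lb-` prefix on the example name, are assumptions made for this example.

```go
package main

import "fmt"

// shortLen is the assumed truncation width used when logging IDs.
const shortLen = 7

// unsafeShort mirrors the failing pattern: slicing without a length check.
// It panics at run time when len(id) < shortLen.
func unsafeShort(id string) string {
	return id[:shortLen]
}

// safeShort truncates only when the string is longer than shortLen and
// otherwise returns the string unchanged.
func safeShort(id string) string {
	if len(id) > shortLen {
		return id[:shortLen]
	}
	return id
}

// panics reports whether calling f panics.
func panics(f func()) (p bool) {
	defer func() {
		if recover() != nil {
			p = true
		}
	}()
	f()
	return false
}

func main() {
	// Hypothetical sandbox name for a network called "foo"; 6 characters.
	name := "lb-foo"

	fmt.Println("unsafe slice panics:", panics(func() { _ = unsafeShort(name) }))
	fmt.Println("safe slice result:", safeShort(name))
	fmt.Println("long ID result:", safeShort("0123456789abcdef"))
}
```

A length check like the one in `safeShort` keeps debug logging from taking down a goroutine regardless of how short a user-specified name is.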
LGTM
z is a flaky test
LGTM 🐸
Thanks for the reviews!
for reference; flaky failure on s390x is tracked through #37408
So, is it safe to have a network with a /16 CIDR subnet in an overlay network?
Yes, this should be safe. There was never anything inherently wrong with using a /16 network in the first place. There were admonitions in the past about not using larger subnet sizes as a crude way of keeping people from trying to create networks with large numbers of services (or large numbers of containers or nodes). The previous load balancer would take minutes to hours to complete a |
@ctelfer Does this break the case of an attachable overlay network (non-ingress) being attached to a non-swarm container?
@ctelfer False alarm. This is potentially due to my IPAM plugin failing to allocate an IP for the LB node.
It's been 6 years since this is merged in, and yet the documentation still states to avoid using anything other than a |
full diff: moby/libnetwork@430c00a...3ac297b
Changes included;
--dns-option=ndots:0 w/ user-defined networking results in duplicate ndots:0 options #37349
- What I did
Improve the scalability of the Linux load balancing in overlay networks by allocating a load balancing endpoint per overlay network in a manner similar to ingress and programming it with all load-balancing rules for a given network.
- How I did it
This change allocates a load balancer endpoint IP address for all overlay networks rather than just "ingress" and windows overlay networks. Swarmkit is already performing this IP address allocation, but moby was not making use of these IP addresses for Linux overlay networks (except ingress). The updated version of libnetwork in this PR creates a load balancing sandbox and endpoint for each node in the cluster for each overlay network. It programs all load balancing state for a given network in that sandbox. This scheme is similar to how the Windows overlay networking driver works as it also allocates a per-node per-network IP address.
In the prior scheme, libnetwork would program each container's network namespace with every piece of load balancing state for every other container that shared any network with the first container. This meant that the amount of kernel state on a given node scaled with the square of the number of services in the cluster and with the square of the number of containers per service. With the new scheme, kernel state at each node scales linearly with the number of services and the number of containers per service. This also reduces the number of system calls required to add or remove tasks and containers. Previously the number of system calls required grew linearly with the number of other tasks that shared a network with the container. Now the number of system calls grows linearly only with the number of networks that the task/container is attached to. This results in a significant performance improvement when adding and removing services to a cluster that is already heavily loaded.
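To make the scaling argument concrete, here is a rough counting model, not a measurement and not libnetwork code. It assumes a single overlay network, one load-balancing entry per backend task, and the worst case where every task runs on the node being counted. The function names `perContainerEntries` and `perNetworkEntries` are hypothetical.

```go
package main

import "fmt"

// perContainerEntries models the prior scheme: every local container's
// namespace holds one entry per backend task of every service on the network.
func perContainerEntries(services, tasksPerService, localContainers int) int {
	return localContainers * services * tasksPerService
}

// perNetworkEntries models the new scheme: one load-balancing sandbox per
// node per network holds one entry per backend task.
func perNetworkEntries(services, tasksPerService int) int {
	return services * tasksPerService
}

func main() {
	const tasksPerService = 4
	for _, services := range []int{10, 100, 1000} {
		// Worst case for this model: all tasks are local to the node.
		local := services * tasksPerService
		fmt.Printf("services=%d prior=%d new=%d\n",
			services,
			perContainerEntries(services, tasksPerService, local),
			perNetworkEntries(services, tasksPerService))
	}
}
```

Under these assumptions, multiplying the number of services by ten multiplies the prior scheme's count by a hundred, while the new scheme's count grows only tenfold.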
The primary disadvantage to this scheme is that it requires the allocation of an additional IP address per node per subnet for every node in the cluster that has a task on the given subnet. However, as mentioned, swarmkit is already allocating these IP addresses for every node and they are going unused. Future swarmkit modifications should be examined to only allocate said IP addresses when nodes actually require them.
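The address cost can be estimated with a small calculation. The sketch below counts the usable host addresses in an IPv4 subnet and subtracts one load balancer endpoint address per node, as described above. The `reserved` parameter stands in for any other addresses consumed on the subnet (for example a gateway); its value, and the helper names `usableAddresses` and `remainingForTasks`, are assumptions for illustration.

```go
package main

import (
	"fmt"
	"net"
)

// usableAddresses returns the number of host addresses in an IPv4 subnet,
// excluding the network and broadcast addresses.
func usableAddresses(cidr string) (int, error) {
	_, ipnet, err := net.ParseCIDR(cidr)
	if err != nil {
		return 0, err
	}
	ones, bits := ipnet.Mask.Size()
	if bits != 32 {
		return 0, fmt.Errorf("only IPv4 subnets are handled: %s", cidr)
	}
	hostBits := bits - ones
	if hostBits < 2 {
		return 0, fmt.Errorf("subnet too small: %s", cidr)
	}
	return (1 << uint(hostBits)) - 2, nil
}

// remainingForTasks subtracts one load balancer endpoint address per node and
// a caller-supplied count of otherwise reserved addresses (an assumption).
func remainingForTasks(cidr string, nodes, reserved int) (int, error) {
	usable, err := usableAddresses(cidr)
	if err != nil {
		return 0, err
	}
	remaining := usable - nodes - reserved
	if remaining < 0 {
		remaining = 0
	}
	return remaining, nil
}

func main() {
	for _, cidr := range []string{"10.0.0.0/24", "10.0.0.0/16"} {
		left, err := remainingForTasks(cidr, 10, 1)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s with 10 nodes leaves %d addresses\n", cidr, left)
	}
}
```

On a small subnet the per-node addresses are a noticeable share of the pool; on a larger one they are negligible.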
This should address and hopefully resolve #30820.
- How to verify it
Do the following before and after the change:
Will look into adding test data to this PR.
- Description for the changelog
Improve the scalability of Linux load balancing.
- A picture of a cute animal (not mandatory but encouraged)