
Improve scalability of the Linux load balancing #37372

Merged: 3 commits, Jul 9, 2018

Conversation

@ctelfer (Contributor) commented Jun 29, 2018

full diff: moby/libnetwork@430c00a...3ac297b

Changes included:

- What I did

Improve the scalability of Linux load balancing in overlay networks by allocating a load-balancing endpoint per overlay network, in a manner similar to ingress, and programming it with all load-balancing rules for that network.

- How I did it

This change allocates a load balancer endpoint IP address for all overlay networks rather than just "ingress" and windows overlay networks. Swarmkit is already performing this IP address allocation, but moby was not making use of these IP addresses for Linux overlay networks (except ingress). The updated version of libnetwork in this PR creates a load balancing sandbox and endpoint for each node in the cluster for each overlay network. It programs all load balancing state for a given network in that sandbox. This scheme is similar to how the Windows overlay networking driver works as it also allocates a per-node per-network IP address.

In the prior scheme, libnetwork would program each container's network namespace with every piece of load balancing state for every other container that shared any network with the first container. This meant that the amount of kernel state on a given node scaled with the square of the number of services in the cluster and with the square of the number of containers per service. With the new scheme, kernel state at each node scales linearly with the number of services and the number of containers per service. This also reduces the number of system calls required to add or remove tasks and containers. Previously the number of system calls required grew linearly with the number of other tasks that shared a network with the container. Now the number of system calls grows linearly only with the number of networks that the task/container is attached to. This results in a significant performance improvement when adding and removing services in a cluster that is already heavily loaded.
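To make the scaling argument concrete, here is a rough back-of-the-envelope model (illustrative only, not code from this PR); the service, task, network, and node counts are arbitrary example values, and both schemes are modeled under the same worst-case assumption that every service attaches to every network:

```go
// Toy model of per-node load-balancing state under the old and new schemes.
package main

import "fmt"

func main() {
	services := 50        // services in the cluster
	tasksPerService := 10 // containers (tasks) per service
	networks := 3         // overlay networks shared by those services
	nodes := 3            // nodes, with tasks spread evenly

	totalTasks := services * tasksPerService
	localTasks := totalTasks / nodes

	// Old scheme: every local container namespace carries LB rules for every
	// other task that shares a network with it, so state grows roughly with
	// (local tasks) x (total tasks) -- quadratic in services and in tasks.
	oldEntries := localTasks * totalTasks

	// New scheme: one LB sandbox per network per node carries the rules, so
	// state grows roughly with (networks) x (total tasks) -- linear.
	newEntries := networks * totalTasks

	fmt.Printf("old scheme: ~%d LB entries on this node\n", oldEntries) // ~83000
	fmt.Printf("new scheme: ~%d LB entries on this node\n", newEntries) // ~1500
}
```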

The primary disadvantage of this scheme is that it requires the allocation of an additional IP address per subnet for every node in the cluster that has a task on the given subnet. However, as mentioned, swarmkit is already allocating these IP addresses for every node and they are going unused. Future swarmkit modifications should be examined so that these IP addresses are allocated only when nodes actually require them.

This should address and hopefully resolve #30820.

- How to verify it

Do the following before and after the change:

  • Start up a cluster of docker nodes (say 3 nodes).
  • Spin up 20-50 services, each with 10-20 containers and attached to 2-5 networks.
  • Measure the time it takes for the creation to converge.
  • Delete the services.
  • Measure the time it takes for all networks to be removed.

Will look into adding test data to this PR.
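For reference, a rough, hypothetical harness for the steps above could look like the sketch below (not part of this PR; it assumes a swarm is already initialized, docker is on PATH, and that names beginning with "lbtest-" are free to use):

```go
// Crude timing harness for service create/converge and removal via the docker CLI.
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

const (
	numServices = 20
	replicas    = "10"
)

func docker(args ...string) ([]byte, error) {
	return exec.Command("docker", args...).CombinedOutput()
}

func main() {
	networks := []string{"lbtest-net1", "lbtest-net2", "lbtest-net3"}
	for _, n := range networks {
		if out, err := docker("network", "create", "-d", "overlay", n); err != nil {
			fmt.Printf("network create %s: %v\n%s", n, err, out)
		}
	}

	start := time.Now()
	for i := 0; i < numServices; i++ {
		args := []string{"service", "create", "--detach", "--replicas", replicas,
			"--name", fmt.Sprintf("lbtest-svc%d", i)}
		for _, n := range networks {
			args = append(args, "--network", n)
		}
		args = append(args, "busybox", "sleep", "3600")
		if out, err := docker(args...); err != nil {
			fmt.Printf("service create %d: %v\n%s", i, err, out)
		}
	}
	// Crude convergence check: wait until every service reports replicas/replicas.
	want := replicas + "/" + replicas
	for {
		out, _ := docker("service", "ls")
		if strings.Count(string(out), want) >= numServices {
			break
		}
		time.Sleep(2 * time.Second)
	}
	fmt.Println("create + converge:", time.Since(start))

	start = time.Now()
	for i := 0; i < numServices; i++ {
		docker("service", "rm", fmt.Sprintf("lbtest-svc%d", i))
	}
	// Network removal can fail while tasks are still shutting down, so retry.
	for _, n := range networks {
		for {
			if _, err := docker("network", "rm", n); err == nil {
				break
			}
			time.Sleep(2 * time.Second)
		}
	}
	fmt.Println("remove services + networks:", time.Since(start))
}
```

Running this on the same manager node before and after the change and comparing the reported durations gives a crude measure of the improvement.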

- Description for the changelog

Improve the scalability of Linux load balancing.

- A picture of a cute animal (not mandatory but encouraged)

@codecov (bot) commented Jul 2, 2018

Codecov Report

Merging #37372 into master will increase coverage by 0.03%.
The diff coverage is 0%.

@@            Coverage Diff             @@
##           master   #37372      +/-   ##
==========================================
+ Coverage   34.91%   34.95%   +0.03%     
==========================================
  Files         610      610              
  Lines       44884    44853      -31     
==========================================
+ Hits        15672    15679       +7     
+ Misses      27092    27059      -33     
+ Partials     2120     2115       -5

Bump libnetwork to b0186632522c68f4e1222c4f6d7dbe518882024f.   This
includes the following changes:
 * Dockerize protocol buffer generation and update (78d9390a..e12dd44c)
 * Use new plugin interfaces provided by plugin pkg (be94e134)
 * Improve linux load-balancing scalability (5111c24e..366b9110)

Signed-off-by: Chris Telfer <[email protected]>
This patch is required for the updated version of libnetwork and entails
two minor changes.

First, it uses the new libnetwork.NetworkDeleteOptionRemoveLB option to
the network.Delete() method to automatically remove the load balancing
endpoint for ingress networks.   This allows removal of the
deleteLoadBalancerSandbox() function whose functionality is now within
libnetwork.

The second change is to allocate a load balancer endpoint IP address for
all overlay networks rather than just "ingress" and windows overlay
networks.  Swarmkit is already performing this allocation, but moby was
not making use of these IP addresses for Linux overlay networks (except
ingress).  The current version of libnetwork makes use of these IP
addresses by creating a load balancing sandbox and endpoint similar to
ingress's for all overlay networks and putting all load balancing state
for a given node in that sandbox only.  This reduces the amount of Linux
kernel state required per node.

In the prior scheme, libnetwork would program each container's network
namespace with every piece of load balancing state for every other
container that shared *any* network with the first container.  This
meant that the amount of kernel state on a given node scaled with the
square of the number of services in the cluster and with the square of
the number of containers per service.  With the new scheme, kernel state
at each node scales linearly with the number of services and the number
of containers per service.  This also reduces the number of system calls
required to add or remove tasks and containers.  Previously the number
of system calls required grew linearly with the number of other
tasks that shared a network with the container.  Now the number of
system calls grows linearly only with the number of networks that the
task/container is attached to.  This results in a significant
performance improvement when adding and removing services in a cluster
that is already heavily loaded.

The primary disadvantage to this scheme is that it requires the
allocation of an additional IP address per node per subnet for every
node in the cluster that has a task on the given subnet.  However, as
mentioned, swarmkit is already allocating these IP addresses for every
node and they are going unused.  Future swarmkit modifications should be
examined so that these IP addresses are allocated only when nodes
actually require them.

Signed-off-by: Chris Telfer <[email protected]>
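As an aside, here is a hedged sketch of the first change described in this commit message: letting libnetwork tear down the ingress load-balancing endpoint via the NetworkDeleteOptionRemoveLB option instead of a daemon-local deleteLoadBalancerSandbox() helper. This is simplified, not the actual daemon code, and assumes the libnetwork interfaces of that era (NetworkController, Network) behave as used here:

```go
// Sketch of deleting a network together with its load-balancing endpoint.
package network

import (
	"log"

	"github.com/docker/libnetwork"
)

// removeIngress deletes the ingress network together with its load-balancing
// endpoint. The controller lookup and error handling are simplified.
func removeIngress(c libnetwork.NetworkController) {
	n, err := c.NetworkByName("ingress")
	if err != nil {
		log.Printf("could not find ingress network: %v", err)
		return
	}
	// Previously: deleteLoadBalancerSandbox(n) followed by n.Delete().
	// Now libnetwork removes the LB endpoint as part of Delete().
	if err := n.Delete(libnetwork.NetworkDeleteOptionRemoveLB); err != nil {
		log.Printf("error deleting ingress network: %v", err)
	}
}
```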
@thaJeztah (Member) commented:
@ctelfer I see some failures; not sure if they're flaky, or related

@ctelfer (Contributor, Author) commented Jul 5, 2018

I did some investigation of the failing tests.

The PowerPC tests failed because they could not fetch resources from the Internet during test setup. Can't do anything about that.

Under janky there were 7 tests that failed:

  • TestPruneNetwork
  • TearDownTest
  • TestSwarmNetworkIPAMOptions
  • TestDockerNetworkCustomIPAM
  • TestAPIServiceUpdatePort
  • TestSwarmPublishDuplicatePorts
  • TestSwarmServiceNetworkUpdate

Of these, one (TestDockerNetworkIPAMOptions) dumped goroutine traces. I reran a full integration-cli test run on master (dca4cab, i.e. without my changes) and found that at least 4 of the 7 (maybe more; I lost some of the logs), including the IPAM one, failed in the exact same way.

I also ran the failing tests individually and most of them passed in the scalable-lb branch. One that didn't was TestPruneNetwork; that one definitely failed on master for me as well in a full run. TestSwarmServiceNetworkUpdate is also giving me inconsistent results -- sometimes passing, sometimes failing when run individually. It may be that some of the others are flaky and I just haven't run them enough times to make them fail.

In the z test suite, two tests failed; both also failed in janky and failed for me on master: TestPruneNetwork and TearDownTest.

Finally, we have the experimental set, where 4 tests failed. Of these, TestPruneNetwork, TearDownTest and TestDockerNetworkCustomIPAM are all familiar (see above) and all failed on master for me. The new one is TestSwarmClusterRotateUnlockKey. Not sure what to make of that one, but I don't think it is likely to be related to my changes.

I will keep investigating.

@thaJeztah (Member) commented:
@ctelfer also look at #37306 to filter tests that are known to be flaky (but need looking into)

@ctelfer (Contributor, Author) commented Jul 6, 2018

Ok, so good, bad, ugly time.

  • The good: I looked more closely at the TestSwarmServiceNetworkUpdate test and found an error.
  • The bad: it's actually an existing bug that MOSTLY won't show up in older versions ... but could
  • The ugly: patching it means another bump in libnetwork which also means bringing in other changes -- fortunately mostly minor

Details -- it turns out that there is (was) a lot of libnetwork debug code of the form logrus.LEVELf("...%s...", ..., ENTITY.ID()[0:7]... ). LEVEL could be Debug, Error, etc. and ENTITY could be a sandbox, an endpoint, or a network. The goal of the slice was obviously to limit the length of the output. But the problem was that it was possible to have IDs that were less than 7 characters long. The ones I was hitting were sandbox IDs which were not UUIDs, but instead names (for reasons I don't completely understand). This meant that under the right circumstances the slice would specify a length longer than the string and the goroutine would panic.

In one particular instance, if one specified an overlay network name of 2 characters or less in the previous libnetwork code and that network required a network sandbox (i.e. on a Windows node), then the sandbox name would be 6 characters or less and the code would panic. The scalable load-balancing patch made this worse by making use of more load-balancer sandboxes with user-specified names (as opposed to ingress) in the unix world and also by shortening the sandbox prefix by 1 character, meaning that 3-character network names could now trigger the panic as well.

I patched libnetwork to use logrus.LEVELf("... %.7s ....", ... ENTITY.ID() ...) instead, which specifies a precision in the string formatting, giving a maximum (instead of fixed) width; this is safe (see the short illustration below). However, bringing in this libnetwork version also means accepting the following mods:
* 6c7c6017 - Fix error handling about bridgeSetup
* 5ed38221 - Optimize networkDB queue
* cfa9afdb - ndots: produce error on negative numbers
* 5586e226 - improve error message for invalid ndots number
* 449672e5 - Allows to set generic knobs on the Sandbox
* 6b4c4af7 - do not ignore user-provided "ndots:0" option
* 843a0e42 - Adjust corner case for reconnect logic

Fortunately, most of these are non-controversial minor bug fixes. The only one that alters behavior significantly is the 5ed38221 patch, which attempts to prevent gossip queue length explosion. Doing so can slow gossip convergence, but should only really do so when the gossip network is a bit overloaded in the first place (and so it hopefully speeds some of it up in that case).

I reran all the SwarmSuite tests with the aforementioned libnetwork bump and all of them passed. Hopefully the newest push gets a clean bill of health from moby CI.
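To illustrate the formatting fix mentioned above: slicing an ID shorter than 7 characters panics, while the %.7s precision verb (logrus uses the same fmt-style verbs) simply caps the width. A minimal, standalone example, not the actual libnetwork code:

```go
package main

import "fmt"

func main() {
	longID := "4e9c2b7d13a0" // hypothetical sandbox ID, long enough to slice
	shortID := "foo"         // a short, name-derived sandbox ID

	fmt.Printf("slice ok:   %s\n", longID[0:7])  // prints "4e9c2b7"
	fmt.Printf("precision:  %.7s\n", shortID)    // prints "foo"; no panic
	fmt.Printf("slice bad:  %s\n", shortID[0:7]) // panics: slice bounds out of range
}
```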

Bump libnetwork to 3ac297bc7fd0afec9051bbb47024c9bc1d75bf5b in order to
get fix 0c3d9f00 which addresses a flaw that the scalable load balancing
code revealed.  Attempting to print sandbox IDs where the sandbox name
was too short results in a goroutine panic.  This can occur with
sandboxes with names of 1 or 2 characters in the previous code. But due
to naming updates in the scalable load balancing code, it could now
occur for networks whose name was 3 characters, and at least one of the
integration tests employed such networks (named 'foo', 'bar' and 'baz').

This update also brings in the following changes:
 * 6c7c6017 - Fix error handling about bridgeSetup
 * 5ed38221 - Optimize networkDB queue
 * cfa9afdb - ndots: produce error on negative numbers
 * 5586e226 - improve error message for invalid ndots number
 * 449672e5 - Allows to set generic knobs on the Sandbox
 * 6b4c4af7 - do not ignore user-provided "ndots:0" option
 * 843a0e42 - Adjust corner case for reconnect logic

Signed-off-by: Chris Telfer <[email protected]>
@thaJeztah (Member) left a comment:
LGTM

z is a flaky test

@vdemeester (Member) left a comment:
LGTM 🐸

@ctelfer (Contributor, Author) commented Jul 9, 2018

Thanks for the reviews!

@thaJeztah (Member) commented:
For reference: the flaky failure on s390x is tracked through #37408.

20:47:05 FAIL: docker_api_swarm_service_test.go:311: DockerSwarmSuite.TestAPISwarmServicesFailedUpdate
20:47:05 
20:47:05 [d8493890b28f4] waiting for daemon to start
20:47:05 [d8493890b28f4] daemon started
20:47:05 
20:47:05 [da76d9454d5b9] waiting for daemon to start
20:47:05 [da76d9454d5b9] daemon started
20:47:05 
20:47:05 [d475ab3ed458c] waiting for daemon to start
20:47:05 [d475ab3ed458c] daemon started
20:47:05 
20:47:05 docker_api_swarm_service_test.go:347:
20:47:05     waitAndAssert(c, defaultReconciliationTimeout, daemons[0].CheckRunningTaskImages, checker.DeepEquals,
20:47:05         map[string]int{image1: instances})
20:47:05 docker_utils_test.go:435:
20:47:05     c.Assert(v, checker, args...)
20:47:05 ... obtained map[string]int = map[string]int{"busybox:latest":4}
20:47:05 ... expected map[string]int = map[string]int{"busybox:latest":5}
20:47:05 
20:47:05 [d8493890b28f4] exiting daemon
20:47:05 [da76d9454d5b9] exiting daemon
20:47:05 [d475ab3ed458c] exiting daemon

@rjshrjndrn commented Aug 14, 2018

So, is it safe to have a network with a /16 CIDR subnet in an overlay network?

@ctelfer (Contributor, Author) commented Aug 20, 2018

Yes, this should be safe. There was never anything inherently wrong with using a /16 network in the first place. There were admonitions in the past about not using larger subnet sizes as a crude way of keeping people from trying to create networks with large numbers of services (or large numbers of containers or nodes). The previous load balancer would take minutes to hours to complete a docker service create... when the network had more than some threshold (tens to hundreds) of containers and/or services, especially if all were on the same overlay(s). I'm not sure that saying "keep subnets smaller than X" was the right way to prevent people from tripping over that problem, but that's what it was trying to do. In any case, the new scheme will definitely prevent the aforementioned issues.

@rnataraja commented Sep 19, 2018

@ctelfer
Why was this change made in engine/daemon/network.go?
From
agent && driver == "overlay" && (create.Ingress || runtime.GOOS == "windows")
to
agent && driver == "overlay"

Does this break the case of an attachable (non-ingress) overlay network being attached to a non-swarm container?

@rnataraja commented:
@ctelfer false alarm. Potentially, this is due to my IPAM plugin failing to allocate an IP for the LB node.

@q0rban commented Jul 18, 2024

> So, is it safe to have a network with a /16 CIDR subnet in an overlay network?

> Yes, this should be safe. There was never anything inherently wrong with using a /16 network in the first place...

It's been 6 years since this was merged, and yet the documentation still states to avoid using anything other than a /24 for swarm networks. Should that documentation be updated?
