
Improve scalability of the Linux load balancing #37372

Merged: 3 commits, Jul 9, 2018

Conversation

@ctelfer (Contributor) commented Jun 29, 2018

full diff: moby/libnetwork@430c00a...3ac297b

Changes included:

- What I did

Improve the scalability of Linux load balancing in overlay networks by allocating a load-balancing endpoint per overlay network, in a manner similar to ingress, and programming it with all load-balancing rules for that network.

- How I did it

This change allocates a load balancer endpoint IP address for all overlay networks rather than just "ingress" and windows overlay networks. Swarmkit is already performing this IP address allocation, but moby was not making use of these IP addresses for Linux overlay networks (except ingress). The updated version of libnetwork in this PR creates a load balancing sandbox and endpoint for each node in the cluster for each overlay network. It programs all load balancing state for a given network in that sandbox. This scheme is similar to how the Windows overlay networking driver works as it also allocates a per-node per-network IP address.

In the prior scheme, libnetwork would program each container's network namespace with every piece of load balancing state for every other container that shared any network with the first container. This meant that the amount of kernel state on a given node scaled with the square of the number of services in the cluster and with the square of the number of containers per service. With the new scheme, kernel state at each node scales linearly with the number of services and the number of containers per service. This also reduces the number of system calls required to add or remove tasks and containers. Previously the number of system calls required grew linearly with the number of other tasks that shared a network with the container. Now the number of system calls grows linearly only with the number of networks that the task/container is attached to. This results in a significant performance improvement when adding and removing services in a cluster that is already heavily loaded.
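To make the scaling argument concrete, here is a rough back-of-the-envelope model (illustrative only, not code from this PR); the service, task, network, and node counts are arbitrary example values, and both schemes are modeled under the same worst-case assumption that every service attaches to every network:

```go
// Toy model of per-node load-balancing state under the old and new schemes.
package main

import "fmt"

func main() {
	services := 50        // services in the cluster
	tasksPerService := 10 // containers (tasks) per service
	networks := 3         // overlay networks shared by those services
	nodes := 3            // nodes, with tasks spread evenly

	totalTasks := services * tasksPerService
	localTasks := totalTasks / nodes

	// Old scheme: every local container namespace carries LB rules for every
	// other task that shares a network with it, so state grows roughly with
	// (local tasks) x (total tasks) -- quadratic in services and in tasks.
	oldEntries := localTasks * totalTasks

	// New scheme: one LB sandbox per network per node carries the rules, so
	// state grows roughly with (networks) x (total tasks) -- linear.
	newEntries := networks * totalTasks

	fmt.Printf("old scheme: ~%d LB entries on this node\n", oldEntries) // ~83000
	fmt.Printf("new scheme: ~%d LB entries on this node\n", newEntries) // ~1500
}
```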

The primary disadvantage of this scheme is that it requires the allocation of an additional IP address per subnet for every node in the cluster that has a task on the given subnet. However, as mentioned, swarmkit is already allocating these IP addresses for every node and they are going unused. Future swarmkit modifications should be examined so that these IP addresses are allocated only when nodes actually require them.

This should address and hopefully resolve #30820.

- How to verify it

Do the following before and after the change:

  • Start up a cluster of docker nodes (say 3 nodes).
  • Spin up 20-50 services, each with 10-20 containers and attached to 2-5 networks.
  • Measure the time it takes for the creation to converge.
  • Delete the services.
  • Measure the time it takes for all networks to be removed.

Will look into adding test data to this PR.
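For reference, a rough, hypothetical harness for the steps above could look like the sketch below (not part of this PR; it assumes a swarm is already initialized, docker is on PATH, and that names beginning with "lbtest-" are free to use):

```go
// Crude timing harness for service create/converge and removal via the docker CLI.
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

const (
	numServices = 20
	replicas    = "10"
)

func docker(args ...string) ([]byte, error) {
	return exec.Command("docker", args...).CombinedOutput()
}

func main() {
	networks := []string{"lbtest-net1", "lbtest-net2", "lbtest-net3"}
	for _, n := range networks {
		if out, err := docker("network", "create", "-d", "overlay", n); err != nil {
			fmt.Printf("network create %s: %v\n%s", n, err, out)
		}
	}

	start := time.Now()
	for i := 0; i < numServices; i++ {
		args := []string{"service", "create", "--detach", "--replicas", replicas,
			"--name", fmt.Sprintf("lbtest-svc%d", i)}
		for _, n := range networks {
			args = append(args, "--network", n)
		}
		args = append(args, "busybox", "sleep", "3600")
		if out, err := docker(args...); err != nil {
			fmt.Printf("service create %d: %v\n%s", i, err, out)
		}
	}
	// Crude convergence check: wait until every service reports replicas/replicas.
	want := replicas + "/" + replicas
	for {
		out, _ := docker("service", "ls")
		if strings.Count(string(out), want) >= numServices {
			break
		}
		time.Sleep(2 * time.Second)
	}
	fmt.Println("create + converge:", time.Since(start))

	start = time.Now()
	for i := 0; i < numServices; i++ {
		docker("service", "rm", fmt.Sprintf("lbtest-svc%d", i))
	}
	// Network removal can fail while tasks are still shutting down, so retry.
	for _, n := range networks {
		for {
			if _, err := docker("network", "rm", n); err == nil {
				break
			}
			time.Sleep(2 * time.Second)
		}
	}
	fmt.Println("remove services + networks:", time.Since(start))
}
```

Running this on the same manager node before and after the change and comparing the reported durations gives a crude measure of the improvement.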

- Description for the changelog

Improve the scalability of Linux load balancing.

- A picture of a cute animal (not mandatory but encouraged)

@codecov (bot) commented Jul 2, 2018

Codecov Report

Merging #37372 into master will increase coverage by 0.03%.
The diff coverage is 0%.

@@            Coverage Diff             @@
##           master   #37372      +/-   ##
==========================================
+ Coverage   34.91%   34.95%   +0.03%     
==========================================
  Files         610      610              
  Lines       44884    44853      -31     
==========================================
+ Hits        15672    15679       +7     
+ Misses      27092    27059      -33     
+ Partials     2120     2115       -5

Bump libnetwork to b0186632522c68f4e1222c4f6d7dbe518882024f.   This
includes the following changes:
 * Dockerize protocol buffer generation and update (78d9390a..e12dd44c)
 * Use new plugin interfaces provided by plugin pkg (be94e134)
 * Improve linux load-balancing scalability (5111c24e..366b9110)

Signed-off-by: Chris Telfer <[email protected]>
This patch is required for the updated version of libnetwork and entails
two minor changes.

First, it uses the new libnetwork.NetworkDeleteOptionRemoveLB option to
the network.Delete() method to automatically remove the load balancing
endpoint for ingress networks.   This allows removal of the
deleteLoadBalancerSandbox() function whose functionality is now within
libnetwork.

The second change is to allocate a load balancer endpoint IP address for
all overlay networks rather than just "ingress" and windows overlay
networks.  Swarmkit is already performing this allocation, but moby was
not making use of these IP addresses for Linux overlay networks (except
ingress).  The current version of libnetwork makes use of these IP
addresses by creating a load balancing sandbox and endpoint similar to
ingress's for all overlay networks and putting all load balancing state
for a given node in that sandbox only.  This reduces the amount of Linux
kernel state required per node.

In the prior scheme, libnetwork would program each container's network
namespace with every piece of load balancing state for every other
container that shared *any* network with the first container.  This
meant that the amount of kernel state on a given node scaled with the
square of the number of services in the cluster and with the square of
the number of containers per service.  With the new scheme, kernel state
at each node scales linearly with the number of services and the number
of containers per service.  This also reduces the number of system calls
required to add or remove tasks and containers.  Previously the number
of system calls required grew linearly with the number of other
tasks that shared a network with the container.  Now the number of
system calls grows linearly only with the number of networks that the
task/container is attached to.  This results in a significant
performance improvement when adding and removing services in a cluster
that is already heavily loaded.

The primary disadvantage to this scheme is that it requires the
allocation of an additional IP address per node per subnet for every
node in the cluster that has a task on the given subnet.  However, as
mentioned, swarmkit is already allocating these IP addresses for every
node and they are going unused.  Future swarmkit modifications should be
examined so that these IP addresses are allocated only when nodes
actually require them.

Signed-off-by: Chris Telfer <[email protected]>
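As an aside, here is a hedged sketch of the first change described in this commit message: letting libnetwork tear down the ingress load-balancing endpoint via the NetworkDeleteOptionRemoveLB option instead of a daemon-local deleteLoadBalancerSandbox() helper. This is simplified, not the actual daemon code, and assumes the libnetwork interfaces of that era (NetworkController, Network) behave as used here:

```go
// Sketch of deleting a network together with its load-balancing endpoint.
package network

import (
	"log"

	"github.com/docker/libnetwork"
)

// removeIngress deletes the ingress network together with its load-balancing
// endpoint. The controller lookup and error handling are simplified.
func removeIngress(c libnetwork.NetworkController) {
	n, err := c.NetworkByName("ingress")
	if err != nil {
		log.Printf("could not find ingress network: %v", err)
		return
	}
	// Previously: deleteLoadBalancerSandbox(n) followed by n.Delete().
	// Now libnetwork removes the LB endpoint as part of Delete().
	if err := n.Delete(libnetwork.NetworkDeleteOptionRemoveLB); err != nil {
		log.Printf("error deleting ingress network: %v", err)
	}
}
```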
@thaJeztah (Member) commented:
@ctelfer I see some failures; not sure if they're flaky, or related

@ctelfer (Contributor, Author) commented Jul 5, 2018

I did some investigation of the failing tests.

The PowerPC tests failed because they could not fetch resources from the Internet during test setup. Can't do anything about that.

Under janky there were 7 tests that failed:

  • TestPruneNetwork
  • TearDownTest
  • TestSwarmNetworkIPAMOptions
  • TestDockerNetworkCustomIPAM
  • TestAPIServiceUpdatePort
  • TestSwarmPublishDuplicatePorts
  • TestSwarmServiceNetworkUpdate

Of these, one (TestDockerNetworkIPAMOptions) dumped goroutine traces. I reran a full integration-cli test run on master (dca4cab, i.e. without my changes) and found that at least 4 of the 7 (maybe more; I lost some of the logs), including the IPAM one, failed in the exact same way.

I also ran the failing tests individually and most of them passed in the scalable-lb branch. One that didn't was TestPruneNetwork; that one definitely failed on master for me as well in a full run. TestSwarmServiceNetworkUpdate is also giving me inconsistent results -- sometimes passing, sometimes failing when run individually. It may be that some of the others are flaky and I just haven't run them enough times to make them fail.

In the z test suite, two tests failed; both also failed in janky and failed for me on master: TestPruneNetwork and TearDownTest.

Finally, we have the experimental set, where 4 tests failed. Of these, TestPruneNetwork, TearDownTest and TestDockerNetworkCustomIPAM are all familiar (see above) and all failed on master for me. The new one is TestSwarmClusterRotateUnlockKey. Not sure what to make of that one, but I don't think it is likely to be related to my changes.

I will keep investigating.

@thaJeztah (Member) commented:
@ctelfer also look at #37306 to filter tests that are known to be flaky (but need looking into)

@ctelfer (Contributor, Author) commented Jul 6, 2018

Ok, so good, bad, ugly time.

  • The good: I looked more closely at the TestSwarmServiceNetworkUpdate test and found an error.
  • The bad: it's actually an existing bug that MOSTLY won't show up in older versions ... but could
  • The ugly: patching it means another bump in libnetwork which also means bringing in other changes -- fortunately mostly minor

Details -- it turns out that there is (was) a lot of libnetwork debug code of the form logrus.LEVELf("...%s...", ..., ENTITY.ID()[0:7]... ). LEVEL could be Debug, Error, etc. and ENTITY could be a sandbox, an endpoint, or a network. The goal of the slice was obviously to limit the length of the output. But the problem was that it was possible to have IDs that were less than 7 characters long. The ones I was hitting were sandbox IDs which were not UUIDs, but instead names (for reasons I don't completely understand). This meant that under the right circumstances the slice would specify a length longer than the string and the goroutine would panic.

In one particular instance, if one specified an overlay network name of 2 characters or less in the previous libnetwork code and that network required a network sandbox (i.e. on a Windows node), then the sandbox name would be 6 characters or less and the code would panic. The scalable load-balancing patch made this worse by making use of more load-balancer sandboxes with user-specified names (as opposed to ingress) in the unix world and also by shortening the sandbox prefix by 1 character, meaning that 3-character network names could now trigger the panic as well.

I patched libnetwork to use logrus.LEVELf("... %.7s ....", ... ENTITY.ID() ...) instead, which specifies a precision in the string formatting, giving a maximum (instead of fixed) width; this is safe (see the short illustration below). However, bringing in this libnetwork version also means accepting the following mods:
* 6c7c6017 - Fix error handling about bridgeSetup
* 5ed38221 - Optimize networkDB queue
* cfa9afdb - ndots: produce error on negative numbers
* 5586e226 - improve error message for invalid ndots number
* 449672e5 - Allows to set generic knobs on the Sandbox
* 6b4c4af7 - do not ignore user-provided "ndots:0" option
* 843a0e42 - Adjust corner case for reconnect logic

Fortunately, most of these are non-controversial minor bug fixes. The only one that alters behavior significantly is the 5ed38221 patch, which attempts to prevent gossip queue length explosion. Doing so can slow gossip convergence, but should only really do so when the gossip network is a bit overloaded in the first place (and so it hopefully speeds some of it up in that case).

I reran all the SwarmSuite tests with the aforementioned libnetwork bump and all of them passed. Hopefully the newest push gets a clean bill of health from moby CI.
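To illustrate the formatting fix mentioned above: slicing an ID shorter than 7 characters panics, while the %.7s precision verb (logrus uses the same fmt-style verbs) simply caps the width. A minimal, standalone example, not the actual libnetwork code:

```go
package main

import "fmt"

func main() {
	longID := "4e9c2b7d13a0" // hypothetical sandbox ID, long enough to slice
	shortID := "foo"         // a short, name-derived sandbox ID

	fmt.Printf("slice ok:   %s\n", longID[0:7])  // prints "4e9c2b7"
	fmt.Printf("precision:  %.7s\n", shortID)    // prints "foo"; no panic
	fmt.Printf("slice bad:  %s\n", shortID[0:7]) // panics: slice bounds out of range
}
```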

Bump libnetwork to 3ac297bc7fd0afec9051bbb47024c9bc1d75bf5b in order to
get fix 0c3d9f00 which addresses a flaw that the scalable load balancing
code revealed.  Attempting to print sandbox IDs where the sandbox name
was too short results in a goroutine panic.  This can occur with
sandboxes with names of 1 or 2 characters in the previous code. But due
to naming updates in the scalable load balancing code, it could now
occur for networks whose name was 3 characters, and at least one of the
integration tests employed such networks (named 'foo', 'bar' and 'baz').

This update also brings in the following changes:
 * 6c7c6017 - Fix error handling about bridgeSetup
 * 5ed38221 - Optimize networkDB queue
 * cfa9afdb - ndots: produce error on negative numbers
 * 5586e226 - improve error message for invalid ndots number
 * 449672e5 - Allows to set generic knobs on the Sandbox
 * 6b4c4af7 - do not ignore user-provided "ndots:0" option
 * 843a0e42 - Adjust corner case for reconnect logic

Signed-off-by: Chris Telfer <[email protected]>
@thaJeztah (Member) left a comment:
LGTM

z is a flaky test

@vdemeester (Member) left a comment:
LGTM 🐸

@ctelfer (Contributor, Author) commented Jul 9, 2018

Thanks for the reviews!

@thaJeztah (Member) commented:
For reference: the flaky failure on s390x is tracked through #37408.

20:47:05 FAIL: docker_api_swarm_service_test.go:311: DockerSwarmSuite.TestAPISwarmServicesFailedUpdate
20:47:05 
20:47:05 [d8493890b28f4] waiting for daemon to start
20:47:05 [d8493890b28f4] daemon started
20:47:05 
20:47:05 [da76d9454d5b9] waiting for daemon to start
20:47:05 [da76d9454d5b9] daemon started
20:47:05 
20:47:05 [d475ab3ed458c] waiting for daemon to start
20:47:05 [d475ab3ed458c] daemon started
20:47:05 
20:47:05 docker_api_swarm_service_test.go:347:
20:47:05     waitAndAssert(c, defaultReconciliationTimeout, daemons[0].CheckRunningTaskImages, checker.DeepEquals,
20:47:05         map[string]int{image1: instances})
20:47:05 docker_utils_test.go:435:
20:47:05     c.Assert(v, checker, args...)
20:47:05 ... obtained map[string]int = map[string]int{"busybox:latest":4}
20:47:05 ... expected map[string]int = map[string]int{"busybox:latest":5}
20:47:05 
20:47:05 [d8493890b28f4] exiting daemon
20:47:05 [da76d9454d5b9] exiting daemon
20:47:05 [d475ab3ed458c] exiting daemon

@rjshrjndrn commented Aug 14, 2018

So, is it safe to have a network with a /16 CIDR subnet in an overlay network?

@ctelfer (Contributor, Author) commented Aug 20, 2018

Yes, this should be safe. There was never anything inherently wrong with using a /16 network in the first place. There were admonitions in the past about not using larger subnet sizes as a crude way of keeping people from trying to create networks with large numbers of services (or large numbers of containers or nodes). The previous load balancer would take minutes to hours to complete a docker service create... when the network had more than some threshold (tens to hundreds) of containers and/or services, especially if all were on the same overlay(s). I'm not sure that saying "keep subnets smaller than X" was the right way to prevent people from tripping over that problem, but that's what it was trying to do. In any case, the new scheme will definitely prevent the aforementioned issues.

@rnataraja commented Sep 19, 2018

@ctelfer
Why was this change made in engine/daemon/network.go?
From
agent && driver == "overlay" && (create.Ingress || runtime.GOOS == "windows")
to
agent && driver == "overlay"

Does this break the case of an attachable (non-ingress) overlay network being attached to a non-swarm container?

@rnataraja commented:
@ctelfer false alarm. Potentially, this is due to my IPAM plugin failing to allocate an IP for the LB node.

@q0rban commented Jul 18, 2024

> So, is it safe to have a network with a /16 CIDR subnet in an overlay network?

> Yes, this should be safe. There was never anything inherently wrong with using a /16 network in the first place...

It's been 6 years since this was merged, and yet the documentation still states to avoid using anything other than a /24 for swarm networks. Should that documentation be updated?
