[18.06] Improve scalability of the Linux load balancing #16

thaJeztah · 2018-07-09T18:10:30Z

backport of moby#37372 for 18.06

Changes included;

Improve load balancing scalability in swarm mode moby/libnetwork#2138 Improve load balancing scalability in swarm mode
Use new plugin interfaces provided by plugin pkg moby/libnetwork#2199 Use new plugin interfaces provided by plugin pkg
Adjust corner case for reconnect logic moby/libnetwork#2200 Adjust corner case for reconnect logic
Fix duplicate ndots:0, and improve validation moby/libnetwork#2212 Fix duplicate ndots:0, and improve validation
- fixes Using --dns-option=ndots:0 w/ user-defined networking results in duplicate ndots:0 options moby/moby#37349
Add knobs on LB sandbox moby/libnetwork#2154 Add knobs on LB sandbox
Fix error handling about bridgeSetup moby/libnetwork#2215 Fix error handling about bridgeSetup
Use fmt precision to limit string length moby/libnetwork#2221 Use fmt precision to limit string length
NetworkDB qlen optimization moby/libnetwork#2216 NetworkDB qlen optimization

git checkout -b 18.06-backport-scalable-lb ce-engine/18.06
git cherry-pick -s -S -x 92335eaef12fb0fb86e5e7503f88b8873a9fa973
git cherry-pick -s -S -x 8e0f6bc90351525f3e52f3bc357378fcccccdd27
git cherry-pick -s -S -x 0e162d992394d9216f590176991d33ff48ef3389
git push -u origin

cherry-pick was clean, no conflicts

Bump libnetwork to b0186632522c68f4e1222c4f6d7dbe518882024f. This includes the following changes: * Dockerize protocol buffer generation and update (78d9390a..e12dd44c) * Use new plugin interfaces provided by plugin pkg (be94e134) * Improve linux load-balancing scalability (5111c24e..366b9110) Signed-off-by: Chris Telfer <[email protected]> (cherry picked from commit 92335ea) Signed-off-by: Sebastiaan van Stijn <[email protected]>

This patch is required for the updated version of libnetwork and entails two minor changes. First, it uses the new libnetwork.NetworkDeleteOptionRemoveLB option to the network.Delete() method to automatically remove the load balancing endpoint for ingress networks. This allows removal of the deleteLoadBalancerSandbox() function whose functionality is now within libnetwork. The second change is to allocate a load balancer endpoint IP address for all overlay networks rather than just "ingress" and windows overlay networks. Swarmkit is already performing this allocation, but moby was not making use of these IP addresses for Linux overlay networks (except ingress). The current version of libnetwork makes use of these IP addresses by creating a load balancing sandbox and endpoint similar to ingress's for all overlay network and putting all load balancing state for a given node in that sandbox only. This reduces the amount of linux kernel state required per node. In the prior scheme, libnetwork would program each container's network namespace with every piece of load balancing state for every other container that shared *any* network with the first container. This meant that the amount of kernel state on a given node scaled with the square of the number of services in the cluster and with the square of the number of containers per service. With the new scheme, kernel state at each node scales linearly with the number of services and the number of containers per service. This also reduces the number of system calls required to add or remove tasks and containers. Previously the number of system calls required grew linearly with the number of other tasks that shared a network with the container. Now the number of system calls grows linearly only with the number of networks that the task/container is attached to. This results in a significant performance improvement when adding and removing services to a cluster that already heavily loaded. The primary disadvantage to this scheme is that it requires the allocation of an additional IP address per node per subnet for every node in the cluster that has a task on the given subnet. However, as mentioned, swarmkit is already allocating these IP addresses for every node and they are going unused. Future swarmkit modifications should be examined to only allocate said IP addresses when nodes actually require them. Signed-off-by: Chris Telfer <[email protected]> (cherry picked from commit 8e0f6bc) Signed-off-by: Sebastiaan van Stijn <[email protected]>

thaJeztah · 2018-07-09T18:10:47Z

ping @fcrisciani @ctelfer PTAL

thaJeztah · 2018-07-09T18:26:27Z

18:14:05 2018/07/09 18:14:05 Errors on clone:
18:14:05 google.golang.org/api: Err: exit status 128, out: Cloning into '/go/src/github.com/docker/docker/vendor/google.golang.org/api'...
18:14:05 fatal: git fetch-pack: expected ACK/NAK, got '�error: Git repository not found'
18:14:05 fatal: The remote end hung up unexpectedly
18:14:05 Build step 'Execute shell' marked build as failure

hm... outage somewhere?

ctelfer · 2018-07-09T18:27:13Z

We should also include 6225d1f if that is not in 18.06. Other than that LGTM.

Bump libnetwork to 3ac297bc7fd0afec9051bbb47024c9bc1d75bf5b in order to get fix 0c3d9f00 which addresses a flaw that the scalable load balancing code revealed. Attempting to print sandbox IDs where the sandbox name was too short results in a goroutine panic. This can occur with sandboxes with names of 1 or 2 characters in the previous code. But due to naming updates in the scalable load balancing code, it could now occur for networks whose name was 3 characters and at least one of the integration tests employed such networks (named 'foo', 'bar' and 'baz'). This update also brings in several changes as well: * 6c7c6017 - Fix error handling about bridgeSetup * 5ed38221 - Optimize networkDB queue * cfa9afdb - ndots: produce error on negative numbers * 5586e226 - improve error message for invalid ndots number * 449672e5 - Allows to set generic knobs on the Sandbox * 6b4c4af7 - do not ignore user-provided "ndots:0" option * 843a0e42 - Adjust corner case for reconnect logic Signed-off-by: Chris Telfer <[email protected]> (cherry picked from commit 0e162d9) Signed-off-by: Sebastiaan van Stijn <[email protected]>

thaJeztah · 2018-07-09T18:29:48Z

@ctelfer Looks like moby#37156 is not in the 18.06 branch, so we shouldn't have the regression (won't hurt to include 6225d1f, but I think it's not necessary.

If you can double-check in the 18.06 branch though (in case I overlooked) 🤗

ctelfer · 2018-07-09T18:39:32Z

Oh! That's true. The 18.06 branch does not have the wrapped error changes. I'm a little surprised, but not displeased.

fcrisciani · 2018-07-09T23:26:14Z

looks like we need more tests to catch issues like the one that @ctelfer fixed, this is not the first time that something get broken when the error types get touched :(

fcrisciani

LGTM

ctelfer

LGTM

yuvalo · 2018-10-08T09:19:15Z

Adding the new load balancer introduces a new behavior:
When a container communicates with another service, it will go through the load balancer and hide the source IP address.
This is a potential problem in cases where the source IP matters for an application.

ctelfer · 2018-10-08T13:37:04Z

@yuvalo Thanks for the comment. A few thoughts and then queries:

fwiw, this change only affects east-west traffic and not north-south (which always used masquerading)
this behavior was already present in the Windows implementation of engine since 17.09, so at least in mixed-OS deployments having the source address unmodified wasn't a guarantee
there is an outstanding PR in libnetwork (Use direct server return in east-west overlay load balancing moby/libnetwork#2270 (comment)) that would use direct server return to avoid changing the source IP
unfortunately, the main issue with that approach is, again, Windows. Windows' networking stack does not support (at least as far as we know) the direct route load balancing primitives required to implement DSR itself meaning that Windows containers will not work with it. We're still pondering how we might get around this.

As to the queries portion. I'd be interested to hear about what kinds of east-west services require avoiding source address modification that you know of. I've heard some very vague statements that they exist to that effect, but haven't heard many real use cases. It would be helpful to understand how/why using source NAT breaks things for east-west applications for purposes of identifying a proper fix.

Again, thanks!

yuvalo · 2018-10-08T14:29:10Z

@ctelfer Thanks for following up.
The reason I got to this PR was that updating to 18.06 broke our deployment in the test environment. We are still working around it, but don't have enough confidence that everything will work out.

Some of the use cases that come to mind are:

Access auditing where you need to record the source of a request.
High availability / scalable solutions such as Hadoop HDFS where the Datanode registers with the Namenode or Spark workers with the Spark master. There are workarounds in most cases, but updating Docker will break existing deployments. This is where I'm at today btw.
IP hash based internal HTTP load balancing
In the future, it will be harder to introduce network restrictions (think Kubernetes Network Policies).

People tend to think of the overlay network as still internal and expect it to be more "flat" than it is in reality.

Hope the above makes a compelling case 🤗

ctelfer · 2018-10-08T15:32:03Z

@yuvalo Thanks for the update and the quick response. Overlays are definitely flat networks. But services are an abstraction clearly built around a layer on indirection between clients and servers. Having said that, while I'm not a huge fan of IP-based identity as a concept, it is certainly traditional.

Item 1 above was the main complaint that I've heard re: source NATing. There are, of course, workarounds for this as well such as accounting via TLS or http header fields. (These can give even more semantically relevant info and stronger identity guarantees.) But obviously, some apps still use just basic IP info.

Item 2 is one I haven't encountered before and very good to know about. I'm guessing that redundant NameNodes each have to be a separate Docker service because otherwise datanodes would have the same problem in reverse. (i.e. the datanodes don't know which namenodes they are talking too) So Docker's networking abstractions are interfering with application abstractions built for similar purposes. ☹️ What we need to be building, therefore, seems like a set of abstractions that can both work stand-alone and work with application services for things like load balancing, audit, scaling, etc. The NameNode design clearly shows that the application needs more than just compute and network redundancy to support hit-less upgrade or fail-over. (this is more for longer-term planning)

In any case, I think that the DSR approach will address the cases you listed above. Now, just need a clean way to keep compatibility w/ the Windows overlay services.

ctelfer · 2018-10-12T14:27:14Z

@yuvalo So we have merged in the libnetwork changes to support DSR-mode load balancing as a per-network feature. This won't prevent upgrade failures, but it does provide a solution for when these situations arise. One one can create a network using docker network create ... -d overlay --opt dsr ... to enable the this behavior and avoid NATing for services in that network. Note that this only works in overlay networks. Alternately, a docker-compose.yml file can be augmented as:

...
networks:
  my-network:
    driver: overlay
    driver_opts:
      dsr: ""
...

Unfortunately, this is not yet merged into the docker engine itself yet, but we are looking to get it in soon (i.e. not waiting for next stable release) including looking at backports.

The rationale for making this feature opt-in rather than the default boils down to:

doing live updates from pre-18.06 to 18.09+ engines w/ DSR as default will cause outages no matter what
doing live updates from pre-18.06 to 18.06+ engines w/ NAT as default may cause issues, but may not .. it depends on the sensitivity of the applications in the cluster to source NATing
DSR mode still will not work w/ Windows and the default overlay configuration should work cross-platform

I hope we'll have a version of Docker soon that can mitigate your issues.

TincaTibo · 2018-12-03T10:33:00Z

Hi
I faced a situation linked to this change when upgrading to 18.06.

I have a tool that allow debugging of the exchanges on the cluster, and gets the name of the replicas used in communication from DNS reverse resolving of their IPs.
After upgrade to 18.06, I was blind. Because the IPs used in communication were changed from the replicas IPs to the services VIPs.
I did a workaround by resolving all services first to get their VIPs, but then, the tool is not autodiscovering the services now... It preloads them :(

I have nothing against this change, although it was a surprise and cost me some hours to figure it out.
It seems indeed normal that since services are addressing VIP to communicate, they are the one used in communications.
But could we add PTR records in the DNS to allow reverse resolving of the VIPs?
Indeed reverse resolving of replicas works, but not of their VIP.

ctelfer added 2 commits July 9, 2018 20:08

thaJeztah force-pushed the 18.06-backport-scalable-lb branch from db5472f to 1c42326 Compare July 9, 2018 18:27

thaJeztah added this to the 18.06.0 milestone Jul 9, 2018

mavenugo requested review from fcrisciani and ctelfer July 9, 2018 21:37

fcrisciani approved these changes Jul 9, 2018

View reviewed changes

ctelfer approved these changes Jul 10, 2018

View reviewed changes

andrewhsu merged commit c863835 into docker-archive:18.06 Jul 10, 2018

thaJeztah deleted the 18.06-backport-scalable-lb branch July 10, 2018 16:30

thaJeztah mentioned this pull request Nov 14, 2018

Mapping port not exposed by stack deploy! boot2docker/boot2docker#1349

Closed

igorkatz mentioned this pull request Aug 7, 2019

Docker 19.03.1 changes the source IP address moby/moby#39665

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[18.06] Improve scalability of the Linux load balancing #16

[18.06] Improve scalability of the Linux load balancing #16

thaJeztah commented Jul 9, 2018

thaJeztah commented Jul 9, 2018

thaJeztah commented Jul 9, 2018

ctelfer commented Jul 9, 2018

thaJeztah commented Jul 9, 2018

ctelfer commented Jul 9, 2018

fcrisciani commented Jul 9, 2018

fcrisciani left a comment

ctelfer left a comment

yuvalo commented Oct 8, 2018

ctelfer commented Oct 8, 2018

yuvalo commented Oct 8, 2018

ctelfer commented Oct 8, 2018 •

edited

Loading

ctelfer commented Oct 12, 2018

TincaTibo commented Dec 3, 2018

[18.06] Improve scalability of the Linux load balancing #16

[18.06] Improve scalability of the Linux load balancing #16

Conversation

thaJeztah commented Jul 9, 2018

thaJeztah commented Jul 9, 2018

thaJeztah commented Jul 9, 2018

ctelfer commented Jul 9, 2018

thaJeztah commented Jul 9, 2018

ctelfer commented Jul 9, 2018

fcrisciani commented Jul 9, 2018

fcrisciani left a comment

Choose a reason for hiding this comment

ctelfer left a comment

Choose a reason for hiding this comment

yuvalo commented Oct 8, 2018

ctelfer commented Oct 8, 2018

yuvalo commented Oct 8, 2018

ctelfer commented Oct 8, 2018 • edited Loading

ctelfer commented Oct 12, 2018

TincaTibo commented Dec 3, 2018

ctelfer commented Oct 8, 2018 •

edited

Loading