Don’t attempt to reconnect swarm on failed join after timeout #27123

tonistiigi · 2016-10-03T20:42:51Z

The reproducible part of the bug was already fixed by the gRPC changes in swarmkit, but this change makes the daemon more robust and removes its reliance on swarmkit timeouts.

The issue appeared because reconnecting expects state from remote hosts. There was no state because the join failed.

cc @mrjana

Signed-off-by: Tonis Tiigi <[email protected]>

thaJeztah · 2016-10-03T21:41:31Z

@tonistiigi is this only on master, or a fix for 1.12.1?

tonistiigi · 2016-10-03T21:43:12Z

@thaJeztah I think this can wait for v1.13

LK4D4 · 2016-10-05T20:00:28Z

LGTM

aaronlehmann · 2016-10-06T14:27:13Z

Can we consider changing join behavior so that it doesn't keep retrying after the timeout? I think it's really unexpected for a node to keep trying to join a swarm after docker swarm join returned failure. Having it succeed potentially weeks or months later isn't useful behavior IMHO.

tonistiigi · 2016-10-06T18:08:53Z

@aaronlehmann I remember @aluzzardi saw it as an important feature. Actually, it seems that swarmkit has already started to move away from that model: for example, in #26646 swarmkit doesn't keep trying to connect until the network returns but fails out quite soon, so we never even reach the timeout anymore. I'm not sure if this is the case in all possible scenarios. If it is, then on the Docker side we should just remove the timeout completely, and swarmkit either has to join in a meaningful time or give up with an error.

aaronlehmann · 2016-10-11T17:07:07Z

@aluzzardi: Any thoughts?

thaJeztah · 2016-10-20T18:50:15Z

ping @aluzzardi PTAL!

aluzzardi · 2016-11-03T19:01:46Z

@aaronlehmann Well, right now we're half sync, half async, and we all agreed to move one way or the other, since automation is really painful at the moment.

I believe we had a chat offline a while ago where, if I remember correctly, we decided to go the async route (and make the CLI look synchronous?).

aluzzardi · 2016-11-08T22:53:04Z

LGTM

aaronlehmann · 2016-11-08T22:53:41Z

LGTM

Don’t attempt to reconnect swarm on failed join after timeout

7381c84

Signed-off-by: Tonis Tiigi <[email protected]>

GordonTheTurtle added the status/0-triage label Oct 3, 2016

thaJeztah added status/2-code-review and removed status/0-triage labels Oct 3, 2016

LK4D4 added this to the 1.13.0 milestone Oct 5, 2016

LK4D4 added the area/swarm label Oct 5, 2016

thaJeztah assigned aluzzardi Nov 3, 2016

aaronlehmann merged commit 0ccbae0 into moby:master Nov 8, 2016

thaJeztah added the impact/changelog label Nov 8, 2016
