
Fix CA Replication when ACLs are enabled #6201

Merged: 8 commits into release/1-6 from bugfix/ca-replication-acls on Jul 26, 2019

Conversation

@mkeeler (Member) commented Jul 23, 2019

Fixes #6192

Essentially this stops using autopilot + RPCs to determine server versions. Instead it just uses the local serf metadata.

This fixes the linked issue by no longer performing the RPCs that would otherwise require a special ACL token. We could have just used the replication token there, but really these places already have all the information in the serf metadata, so we can just use that instead of making cross-dc requests.
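
For context, here is a minimal sketch (not the PR's actual code) of what "use the local serf metadata" means in practice: read each server's version out of its serf member tags instead of asking another datacenter over RPC. The "build" tag and its "<version>:<git-sha>" format are assumptions about how Consul servers advertise their build in serf.

```go
// Minimal sketch (not the PR's actual code): derive a server's Consul version
// from its serf member tags instead of asking another datacenter over RPC.
// Assumes the version is advertised in a "build" tag of the form
// "<version>:<git-sha>".
package main

import (
	"fmt"
	"strings"

	"github.com/hashicorp/go-version"
	"github.com/hashicorp/serf/serf"
)

func memberVersion(m serf.Member) (*version.Version, error) {
	build := m.Tags["build"]
	// Drop a trailing ":<git-sha>" suffix, if present, before parsing.
	if idx := strings.Index(build, ":"); idx != -1 {
		build = build[:idx]
	}
	return version.NewVersion(build)
}

func main() {
	m := serf.Member{Tags: map[string]string{"build": "1.6.0:a82e6a7"}}
	v, err := memberVersion(m)
	fmt.Println(v, err) // "1.6.0 <nil>"
}
```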

@tristan-weil (Contributor):

Thank you!

And I think the datacentersMeetMinVersion and autopilotServersMeetMinimumVersion functions in leader_connect.go are no longer used, so they could also be deleted.

@mkeeler (Member, Author) commented Jul 23, 2019

Yep, it was dead code and has now been removed.

@freddygv (Contributor):

@mkeeler You may be looking into them already, but those ACL test failures in CI don't look like flakes

@banks (Member) left a comment:

Looks good overall.

I'm confused though that the test still seems to assert the old behaviour (and pass)!

I'm also pretty sure the logic for servers and versions isn't right. I could be wrong, though, or making bad assumptions, but if so it might be worth commenting on why this is OK.

agent/consul/leader_connect_test.go (review thread; outdated, resolved)
```go
func ServersInDCMeetMinimumVersion(members []serf.Member, datacenter string, minVersion *version.Version) (bool, bool) {
	found := false
	ok := ServersMeetRequirements(members, func(srv *metadata.Server) bool {
		if srv.Status != serf.StatusAlive || srv.Datacenter != datacenter {
```
@banks (Member):

We allow servers to be not-alive 🤔

Doesn't this mean that all the servers in the other DC could be entirely dead and we'd treat that as "all have the correct version" and plough on with an upgrade? I feel like we'd need to ensure a quorum are alive and at the right version to be safe in the knowledge that there is a leader and that the leader is upgraded?

This is likely how it's always been so I guess we can leave as it is. Chances of doing an upgrade while primary DC is entirely offline seems slim but still this is suspicious logic to me.
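
Purely for illustration, a hypothetical sketch of the stricter check being suggested here (not what the PR implements): require that a quorum of the datacenter's servers are alive and new enough, rather than only the alive ones. The serverInfo type is a simplified stand-in, not Consul's internal metadata.Server.

```go
// Hypothetical sketch of a stricter, quorum-aware check (not what the PR
// implements). serverInfo is a simplified stand-in for Consul's server
// metadata.
package sketch

import "github.com/hashicorp/go-version"

type serverInfo struct {
	Datacenter string
	Alive      bool
	Version    *version.Version
}

// quorumMeetsMinimumVersion returns true only if a majority of the known
// servers in dc are alive and running at least minVersion.
func quorumMeetsMinimumVersion(servers []serverInfo, dc string, minVersion *version.Version) bool {
	total, aliveAndNewEnough := 0, 0
	for _, s := range servers {
		if s.Datacenter != dc {
			continue
		}
		total++
		if s.Alive && !s.Version.LessThan(minVersion) {
			aliveAndNewEnough++
		}
	}
	return total > 0 && aliveAndNewEnough >= total/2+1
}
```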

@banks (Member):

Oh, I see, the found check at least ensures that there is at least one alive server on the other side.

But it still feels a bit odd - I came up with all sorts of unlikely cases where this is wrong, but then there is a very plausible one (actually the common case, I think):

  • upgrade secondary first, it will sit in a loop checking primary version waiting to init CA
  • first server in primary DC is upgraded, it's not yet leader though since user picked a follower to upgrade first (typical)
  • secondary DC polls again and sees one alive server in the primary DC that is the right version (this logic) and so it goes ahead even though the primary DC's leader is still the old version that doesn't know how to init a secondary CA etc.

I don't know what happens in that case - maybe just a few errors until the primary upgrade is done but it seems like that is the core thing we are trying to prevent here.

Am I missing something?

@mkeeler (Member, Author):

For the case you mention, how can there be one alive server that is not of a sufficient version but is not the leader?

The logic I was going for is that all alive servers must be at that version before it’s allowed to go through. If you are running two servers only (which would be odd) and are upgrading one of them then the only way this is allowed to proceed is if you have already upgraded the other server. If you have 3 servers and are following the prescribed rolling update procedure you are in the same situation. The assumption is that any servers in a failed state will come back online with a version at least as new as the other alive servers.

You figured out the other part, which is that we require at least 1 alive server in the primary before allowing initialization as a secondary CA.
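
To make the described behavior concrete, here is a sketch of the check as explained above, continuing with the simplified serverInfo stand-in from the quorum sketch earlier (so not Consul's internals): skip non-alive servers, require every alive server in the datacenter to meet the minimum version, and separately report whether any alive server was found.

```go
// Sketch of the logic described above, mirroring the shape of the
// ServersInDCMeetMinimumVersion hunk under review but using the simplified
// serverInfo stand-in: all *alive* servers in dc must meet minVersion, and
// "found" reports whether at least one alive server in dc was seen at all.
func serversInDCMeetMinimumVersion(servers []serverInfo, dc string, minVersion *version.Version) (ok bool, found bool) {
	ok = true
	for _, s := range servers {
		// Ignore other datacenters and members that are not alive; the
		// assumption is that failed servers rejoin at least as new as
		// their alive peers.
		if s.Datacenter != dc || !s.Alive {
			continue
		}
		found = true
		if s.Version.LessThan(minVersion) {
			ok = false
		}
	}
	return ok, found
}
```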

@banks (Member):

"The assumption is that any servers in a failed state will come back online with a version at least as new as the other alive servers"

Is that OK? I mean it's very, very likely true, but not guaranteed if something fails during an upgrade or health flaps for some reason. I guess given the low probability, what happens if it does come up? If it's just a few error logs for a bit until the primary gets healthy again, then fine; if it causes a panic because something is writing new state that a leader can't understand, then obviously not fine. In the CA case I think it's the first, right?

"all alive servers must be at that version"

Ah maybe I got mixed up because of the "at least one alive" part.

I can still dream up many possible scenarios where I have no idea whether they would cause crashes, but I guess this is at least as safe as it was before and generally good enough.

In cases (possibly not this one) like the new ACL stuff, where the leader turning on the new feature causes old followers to panic, are we stricter about this or do we use the same logic (so flappy health might allow an old server to be temporarily ignored and then come back and be in a crash loop)?

@mkeeler (Member, Author):

In the situation where one of the servers in the primary has not been upgraded and eventually comes back from a failed state, the worst that could happen is that we get some logs about RPC failures in the primary CA roots watch goroutine if it tries to get a new signed intermediate and the current primary leader doesn't support that RPC.

The secondary will back off and retry as necessary.
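
As an illustration of the back-off-and-retry behavior mentioned here (a generic sketch, not Consul's actual implementation), the loop below keeps retrying an initialization attempt with a capped, exponentially growing wait until it succeeds or is cancelled.

```go
// Illustrative only: a generic retry loop with capped exponential backoff,
// in the spirit of "the secondary will back off and retry as necessary".
package sketch

import (
	"context"
	"time"
)

func retryWithBackoff(ctx context.Context, attempt func() error) {
	wait := time.Second
	const maxWait = 256 * time.Second
	for {
		if err := attempt(); err == nil {
			return // succeeded, e.g. the secondary CA is now initialized
		}
		select {
		case <-ctx.Done():
			return // leadership lost or shutting down
		case <-time.After(wait):
		}
		if wait < maxWait {
			wait *= 2
		}
	}
}
```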


@mkeeler force-pushed the bugfix/ca-replication-acls branch from 33a3709 to d23ec18 on July 25, 2019 14:07
Commit notes:

  • This avoids needing a token just for doing a roots watch. In order to get a signed intermediate cert in the secondary DC you will still need operator:write.
  • This prevents leadership establishment failures when the servers are not yet WAN joined.
  • This prevents leadership flapping and a few other things.
  • When a primary DC is configured, if no existing CA is initialized and for whatever reason we cannot initialize a secondary CA, the secondary DC will remain without a CA. As soon as it can, it will initialize the secondary CA by pulling the primary's roots and getting the primary to sign an intermediate (see the sketch below).
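
A high-level sketch of the flow described in that last note. Every dependency name below (PrimaryReady, FetchPrimaryRoots, SignIntermediate, ActivateSecondaryCA) is a hypothetical placeholder for illustration, not Consul's internal API.

```go
// High-level sketch of the described flow; every field of secondaryCADeps is
// a hypothetical placeholder, not Consul's internal API.
package sketch

import (
	"context"
	"errors"
)

type secondaryCADeps struct {
	PrimaryReady        func() bool                                             // the version/availability check discussed above
	FetchPrimaryRoots   func(ctx context.Context) (roots string, err error)     // pull the primary's CA roots
	SignIntermediate    func(ctx context.Context, roots string) (string, error) // ask the primary to sign an intermediate
	ActivateSecondaryCA func(roots, intermediate string) error                  // install the signed intermediate locally
}

// maybeInitializeSecondaryCA attempts to bring up the secondary CA and
// returns an error when it is not yet possible, so the caller can retry
// later with backoff.
func maybeInitializeSecondaryCA(ctx context.Context, d secondaryCADeps) error {
	if !d.PrimaryReady() {
		return errors.New("primary datacenter not ready; will retry")
	}
	roots, err := d.FetchPrimaryRoots(ctx)
	if err != nil {
		return err
	}
	intermediate, err := d.SignIntermediate(ctx, roots)
	if err != nil {
		return err
	}
	return d.ActivateSecondaryCA(roots, intermediate)
}
```
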
@mkeeler force-pushed the bugfix/ca-replication-acls branch 2 times, most recently from c40489d to b0f5c88 on July 25, 2019 17:28
@mkeeler (Member, Author) commented Jul 25, 2019

Looking into a segfault I had seen before my last force push. It is concerning at the least.

@mkeeler (Member, Author) commented Jul 26, 2019

The segfault was caused by leadership revocation after the secondary CA roots watch had been run. This could have happened even with the old code's behavior, so I am reinstating my changes.

We will now disallow initializing a secondary datacenter's CA as a primary in the case where the primary cannot sign intermediate certs yet (servers in the primary are not on new enough versions).

@mkeeler merged commit 35e67b1 into release/1-6 on Jul 26, 2019
@mkeeler deleted the bugfix/ca-replication-acls branch on October 31, 2019 14:54