
Fix CA Replication when ACLs are enabled #6201

Merged: 8 commits into release/1-6 from bugfix/ca-replication-acls on Jul 26, 2019

Conversation

@mkeeler (Member) commented Jul 23, 2019

Fixes #6192

Essentially this stops using autopilot + RPCs to determine server versions. Instead it just uses the local serf metadata.

This fixes the linked issue by no longer performing the RPCs that would otherwise require a special ACL token. We could have just used the replication token there, but really these places already have all the information in the serf metadata, so we can just use that instead of making cross-dc requests.
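
For context, here is a minimal sketch (not the PR's actual code) of what "use the local serf metadata" means in practice: read each server's version out of its serf member tags instead of asking another datacenter over RPC. The "build" tag and its "<version>:<git-sha>" format are assumptions about how Consul servers advertise their build in serf.

```go
// Minimal sketch (not the PR's actual code): derive a server's Consul version
// from its serf member tags instead of asking another datacenter over RPC.
// Assumes the version is advertised in a "build" tag of the form
// "<version>:<git-sha>".
package main

import (
	"fmt"
	"strings"

	"github.com/hashicorp/go-version"
	"github.com/hashicorp/serf/serf"
)

func memberVersion(m serf.Member) (*version.Version, error) {
	build := m.Tags["build"]
	// Drop a trailing ":<git-sha>" suffix, if present, before parsing.
	if idx := strings.Index(build, ":"); idx != -1 {
		build = build[:idx]
	}
	return version.NewVersion(build)
}

func main() {
	m := serf.Member{Tags: map[string]string{"build": "1.6.0:a82e6a7"}}
	v, err := memberVersion(m)
	fmt.Println(v, err) // "1.6.0 <nil>"
}
```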

@tristan-weil (Contributor):

Thank you!

And I think the datacentersMeetMinVersion and autopilotServersMeetMinimumVersion functions in leader_connect.go are no longer used, so they could also be deleted.

@mkeeler (Member, Author) commented Jul 23, 2019

Yep, it was dead code and has now been removed.

@freddygv (Contributor):

@mkeeler You may be looking into them already, but those ACL test failures in CI don't look like flakes

@banks (Member) left a comment:

Looks good overall.

I'm confused though that the test still seems to assert the old behaviour (and pass)!

I'm also pretty sure the logic for servers and versions isn't right. I could be wrong, though, or making bad assumptions, but if so it might be worth commenting on why this is OK.

agent/consul/leader_connect_test.go (review thread; outdated, resolved)
```go
func ServersInDCMeetMinimumVersion(members []serf.Member, datacenter string, minVersion *version.Version) (bool, bool) {
	found := false
	ok := ServersMeetRequirements(members, func(srv *metadata.Server) bool {
		if srv.Status != serf.StatusAlive || srv.Datacenter != datacenter {
```
@banks (Member):

We allow servers to be not-alive 🤔

Doesn't this mean that all the servers in the other DC could be entirely dead and we'd treat that as "all have the correct version" and plough on with an upgrade? I feel like we'd need to ensure a quorum are alive and at the right version to be safe in the knowledge that there is a leader and that the leader is upgraded?

This is likely how it's always been so I guess we can leave as it is. Chances of doing an upgrade while primary DC is entirely offline seems slim but still this is suspicious logic to me.
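
Purely for illustration, a hypothetical sketch of the stricter check being suggested here (not what the PR implements): require that a quorum of the datacenter's servers are alive and new enough, rather than only the alive ones. The serverInfo type is a simplified stand-in, not Consul's internal metadata.Server.

```go
// Hypothetical sketch of a stricter, quorum-aware check (not what the PR
// implements). serverInfo is a simplified stand-in for Consul's server
// metadata.
package sketch

import "github.com/hashicorp/go-version"

type serverInfo struct {
	Datacenter string
	Alive      bool
	Version    *version.Version
}

// quorumMeetsMinimumVersion returns true only if a majority of the known
// servers in dc are alive and running at least minVersion.
func quorumMeetsMinimumVersion(servers []serverInfo, dc string, minVersion *version.Version) bool {
	total, aliveAndNewEnough := 0, 0
	for _, s := range servers {
		if s.Datacenter != dc {
			continue
		}
		total++
		if s.Alive && !s.Version.LessThan(minVersion) {
			aliveAndNewEnough++
		}
	}
	return total > 0 && aliveAndNewEnough >= total/2+1
}
```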

@banks (Member):

Oh, I see, the found check at least ensures that there is at least one alive server on the other side.

But it still feels a bit odd - I came up with all sorts of unlikely cases where this is wrong, but then there is a very plausible one (actually the common case, I think):

  • upgrade secondary first, it will sit in a loop checking primary version waiting to init CA
  • first server in primary DC is upgraded, it's not yet leader though since user picked a follower to upgrade first (typical)
  • secondary DC polls again and sees one alive server in the primary DC that is the right version (this logic) and so it goes ahead even though the primary DC's leader is still the old version that doesn't know how to init a secondary CA etc.

I don't know what happens in that case - maybe just a few errors until the primary upgrade is done but it seems like that is the core thing we are trying to prevent here.

Am I missing something?

@mkeeler (Member, Author):

For the case you mention, how can there be one alive server that is not of a sufficient version but is not the leader?

The logic I was going for is that all alive servers must be at that version before it’s allowed to go through. If you are running two servers only (which would be odd) and are upgrading one of them then the only way this is allowed to proceed is if you have already upgraded the other server. If you have 3 servers and are following the prescribed rolling update procedure you are in the same situation. The assumption is that any servers in a failed state will come back online with a version at least as new as the other alive servers.

You figured out the other part, which is that we require at least 1 alive server in the primary before allowing initialization as a secondary CA.
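
To make the described behavior concrete, here is a sketch of the check as explained above, continuing with the simplified serverInfo stand-in from the quorum sketch earlier (so not Consul's internals): skip non-alive servers, require every alive server in the datacenter to meet the minimum version, and separately report whether any alive server was found.

```go
// Sketch of the logic described above, mirroring the shape of the
// ServersInDCMeetMinimumVersion hunk under review but using the simplified
// serverInfo stand-in: all *alive* servers in dc must meet minVersion, and
// "found" reports whether at least one alive server in dc was seen at all.
func serversInDCMeetMinimumVersion(servers []serverInfo, dc string, minVersion *version.Version) (ok bool, found bool) {
	ok = true
	for _, s := range servers {
		// Ignore other datacenters and members that are not alive; the
		// assumption is that failed servers rejoin at least as new as
		// their alive peers.
		if s.Datacenter != dc || !s.Alive {
			continue
		}
		found = true
		if s.Version.LessThan(minVersion) {
			ok = false
		}
	}
	return ok, found
}
```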

@banks (Member):

"The assumption is that any servers in a failed state will come back online with a version at least as new as the other alive servers"

Is that OK? I mean it's very, very likely true, but not guaranteed if something fails during an upgrade or health flaps for some reason. I guess given the low probability, what happens if it does come up? If it's just a few error logs for a bit until the primary gets healthy again, then fine; if it causes a panic because something is writing new state that a leader can't understand, then obviously not fine. In the CA case I think it's the first, right?

"all alive servers must be at that version"

Ah maybe I got mixed up because of the "at least one alive" part.

I can still dream up many possible scenarios where I have no idea whether they would cause crashes, but I guess this is at least as safe as it was before and generally good enough.

In cases (possibly not this one) like the new ACL stuff, where the leader turning on the new feature causes old followers to panic, are we stricter about this or do we use the same logic (so flappy health might allow an old server to be temporarily ignored and then come back and be in a crash loop)?

@mkeeler (Member, Author):

In the situation where one of the servers in the primary has not been upgraded and eventually comes back from a failed state, the worst that could happen is that we get some logs about RPC failures in the primary CA roots watch goroutine if it tries to get a new signed intermediate and the current primary leader doesn't support that RPC.

The secondary will back off and retry as necessary.
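
As an illustration of the back-off-and-retry behavior mentioned here (a generic sketch, not Consul's actual implementation), the loop below keeps retrying an initialization attempt with a capped, exponentially growing wait until it succeeds or is cancelled.

```go
// Illustrative only: a generic retry loop with capped exponential backoff,
// in the spirit of "the secondary will back off and retry as necessary".
package sketch

import (
	"context"
	"time"
)

func retryWithBackoff(ctx context.Context, attempt func() error) {
	wait := time.Second
	const maxWait = 256 * time.Second
	for {
		if err := attempt(); err == nil {
			return // succeeded, e.g. the secondary CA is now initialized
		}
		select {
		case <-ctx.Done():
			return // leadership lost or shutting down
		case <-time.After(wait):
		}
		if wait < maxWait {
			wait *= 2
		}
	}
}
```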


@mkeeler force-pushed the bugfix/ca-replication-acls branch from 33a3709 to d23ec18 on July 25, 2019 14:07
Commit notes:

  • This avoids needing a token just for doing a roots watch. In order to get a signed intermediate cert in the secondary DC you will still need operator:write.
  • This prevents leadership establishment failures when the servers are not yet WAN joined.
  • This prevents leadership flapping and a few other things.
  • When a primary DC is configured, if no existing CA is initialized and for whatever reason we cannot initialize a secondary CA, the secondary DC will remain without a CA. As soon as it can, it will initialize the secondary CA by pulling the primary's roots and getting the primary to sign an intermediate (see the sketch below).
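
A high-level sketch of the flow described in that last note. Every dependency name below (PrimaryReady, FetchPrimaryRoots, SignIntermediate, ActivateSecondaryCA) is a hypothetical placeholder for illustration, not Consul's internal API.

```go
// High-level sketch of the described flow; every field of secondaryCADeps is
// a hypothetical placeholder, not Consul's internal API.
package sketch

import (
	"context"
	"errors"
)

type secondaryCADeps struct {
	PrimaryReady        func() bool                                             // the version/availability check discussed above
	FetchPrimaryRoots   func(ctx context.Context) (roots string, err error)     // pull the primary's CA roots
	SignIntermediate    func(ctx context.Context, roots string) (string, error) // ask the primary to sign an intermediate
	ActivateSecondaryCA func(roots, intermediate string) error                  // install the signed intermediate locally
}

// maybeInitializeSecondaryCA attempts to bring up the secondary CA and
// returns an error when it is not yet possible, so the caller can retry
// later with backoff.
func maybeInitializeSecondaryCA(ctx context.Context, d secondaryCADeps) error {
	if !d.PrimaryReady() {
		return errors.New("primary datacenter not ready; will retry")
	}
	roots, err := d.FetchPrimaryRoots(ctx)
	if err != nil {
		return err
	}
	intermediate, err := d.SignIntermediate(ctx, roots)
	if err != nil {
		return err
	}
	return d.ActivateSecondaryCA(roots, intermediate)
}
```
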
@mkeeler force-pushed the bugfix/ca-replication-acls branch 2 times, most recently from c40489d to b0f5c88 on July 25, 2019 17:28
@mkeeler (Member, Author) commented Jul 25, 2019

Looking into a segfault I had seen before my last force push. It is concerning at the least.

@mkeeler (Member, Author) commented Jul 26, 2019

The segfault was caused by leadership revocation after the secondary CA roots watch had been run. This could have happened even with the old code's behavior, so I am reinstating my changes.

We will now disallow initializing a secondary datacenter's CA as a primary in the case where the primary cannot sign intermediate certs yet (servers in the primary are not on new enough versions).

@mkeeler merged commit 35e67b1 into release/1-6 on Jul 26, 2019
@mkeeler deleted the bugfix/ca-replication-acls branch on October 31, 2019 14:54