Fix CA Replication when ACLs are enabled #6201
Conversation
Thank you! And I think the `datacentersMeetMinVersion` and `autopilotServersMeetMinimumVersion` functions in `leader_connect.go` are no longer used, so they could also be deleted.
Yep, it was dead code and has now been removed.
@mkeeler You may be looking into them already, but those ACL test failures in CI don't look like flakes |
Looks good overall.
I'm confused, though, that the test still seems to assert the old behaviour (and pass)!
I'm also pretty sure the logic for servers and versions isn't right. I could be wrong or making bad assumptions, but if so it might be worth commenting on why this is OK.
```go
func ServersInDCMeetMinimumVersion(members []serf.Member, datacenter string, minVersion *version.Version) (bool, bool) {
	found := false
	ok := ServersMeetRequirements(members, func(srv *metadata.Server) bool {
		if srv.Status != serf.StatusAlive || srv.Datacenter != datacenter {
```
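For reference, here is a sketch of how the whole helper might read, reconstructed from the diff context above (the exact upstream body may differ): every alive server in the target datacenter must satisfy `minVersion`, while `found` records whether any alive server in that datacenter was seen at all. It is assumed to live in the same package as `ServersMeetRequirements` (imports: `serf`, `metadata`, `go-version`).

```go
// Reconstructed sketch; not necessarily the exact code in this PR.
func ServersInDCMeetMinimumVersion(members []serf.Member, datacenter string, minVersion *version.Version) (bool, bool) {
	found := false
	ok := ServersMeetRequirements(members, func(srv *metadata.Server) bool {
		if srv.Status != serf.StatusAlive || srv.Datacenter != datacenter {
			// Ignore non-alive servers and servers in other datacenters.
			return true
		}
		found = true
		// Every alive server in the target DC must be at or above minVersion.
		return srv.Build.Compare(minVersion) >= 0
	})
	return ok, found
}
```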
We allow servers to be not-alive 🤔
Doesn't this mean that all the servers in the other DC could be entirely dead and we'd treat that as "all have the correct version" and plough on with an upgrade? I feel like we'd need to ensure a quorum are alive and at the right version to be safe in the knowledge that there is a leader and that the leader is upgraded?
This is likely how it's always been, so I guess we can leave it as it is. The chances of doing an upgrade while the primary DC is entirely offline seem slim, but this is still suspicious logic to me.
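For illustration only, a quorum-based variant along the lines suggested here might look like the sketch below. The function name and quorum rule are hypothetical and not code from this PR; it just shows the shape of the stricter check being discussed, assuming the `metadata.IsConsulServer` helper that parses a member's gossip tags.

```go
// Hypothetical, illustrative only: require that a strict majority of all
// servers known in the datacenter is both alive and at/above minVersion.
func serversInDCMeetQuorumAtVersion(members []serf.Member, datacenter string, minVersion *version.Version) bool {
	total, upgraded := 0, 0
	for _, m := range members {
		ok, srv := metadata.IsConsulServer(m) // parses serf tags into server metadata
		if !ok || srv.Datacenter != datacenter {
			continue
		}
		total++
		if srv.Status == serf.StatusAlive && srv.Build.Compare(minVersion) >= 0 {
			upgraded++
		}
	}
	// Quorum of all known servers in the DC, not just of the alive ones.
	return total > 0 && upgraded >= total/2+1
}
```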
Oh, I see the `found` thing that at least ensures that there is at least one alive server on the other side.
But it still feels a bit odd - I came up with all sorts of unlikely cases where this is wrong, but then there is a very plausible one (actually the common case, I think):
- upgrade secondary first, it will sit in a loop checking primary version waiting to init CA
- first server in primary DC is upgraded, it's not yet leader though since user picked a follower to upgrade first (typical)
- secondary DC polls again and sees one alive server in the primary DC that is the right version (this logic), and so it goes ahead even though the primary DC's leader is still on the old version that doesn't know how to init a secondary CA etc.
I don't know what happens in that case - maybe just a few errors until the primary upgrade is done but it seems like that is the core thing we are trying to prevent here.
Am I missing something?
For the case you mention, how can the check pass when there is an alive server (the leader) that is not of a sufficient version?
The logic I was going for is that all alive servers must be at that version before it's allowed to go through. If you are running only two servers (which would be odd) and are upgrading one of them, then the only way this is allowed to proceed is if you have already upgraded the other server. If you have 3 servers and are following the prescribed rolling update procedure, you are in the same situation. The assumption is that any servers in a failed state will come back online with a version at least as new as the other alive servers.
You figured out the other part, which is that we require at least 1 alive server in the primary before allowing initializing as a secondary CA.
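A minimal sketch of how a caller might consume the two return values before initializing the secondary CA; the function name and error messages are assumptions for illustration, not the exact code in this PR (assumed to live in the same package as `ServersInDCMeetMinimumVersion`; imports: `fmt`, `serf`, `go-version`).

```go
// Hypothetical caller: need at least one alive primary server AND every
// alive primary server at or above the minimum version.
func checkPrimaryReadyForSecondaryCA(wanMembers []serf.Member, primaryDC string, minVersion *version.Version) error {
	versionOK, foundAlive := ServersInDCMeetMinimumVersion(wanMembers, primaryDC, minVersion)
	if !foundAlive {
		// No alive primary server seen yet (e.g. not WAN joined) - wait and retry.
		return fmt.Errorf("no alive servers found in primary datacenter %q", primaryDC)
	}
	if !versionOK {
		// Some alive primary server is still below the minimum version.
		return fmt.Errorf("primary datacenter %q has alive servers below version %s", primaryDC, minVersion)
	}
	// By this check it is OK to pull the primary's roots and have it sign an intermediate.
	return nil
}
```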
> The assumption is that any servers in a failed state will come back online with a version at least as new as the other alive servers
Is that OK? I mean it's very, very likely true, but not guaranteed if something fails during an upgrade or health flaps for some reason. Given the low probability, what happens if an old server does come back up? If it's just a few error logs for a bit until the primary gets healthy again, then fine; if it causes a panic because something is writing new state that a leader can't understand, then obviously not fine. In the CA case I think it's the first, right?
> all alive servers must be at that version
Ah maybe I got mixed up because of the "at least one alive" part.
I can still dream up many possible scenarios where I have no idea whether they would cause crashes, but I guess this is at least as safe as it was before and generally good enough.
In cases (possibly not this one) like the new ACL stuff, where the leader turning on a new feature causes old followers to panic, are we stricter about this or do we use the same logic (so flappy health might allow an old server to be temporarily ignored and then come back and be in a crash loop)?
In the situation where one of the servers in the primary has not been upgraded and eventually comes back from a failed state, the worst that could happen is that we get some logs about RPC failures in the primary CA roots watch goroutine if it tries to get a new signed intermediate and the current primary leader doesn't support that RPC.
The secondary will back off retrying as necessary.
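The retry behaviour described here can be pictured as a loop of roughly the following shape. This is a generic sketch of capped exponential backoff under assumed names, not the actual routine in Consul.

```go
package retry

import (
	"context"
	"time"
)

// retryWithBackoff repeatedly runs attempt (e.g. initializing the secondary CA)
// with capped exponential backoff until it succeeds or the context is cancelled.
func retryWithBackoff(ctx context.Context, maxWait time.Duration, attempt func() error) {
	wait := time.Second
	for {
		if err := attempt(); err == nil {
			return
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(wait): // wait before retrying, e.g. after an RPC failure
		}
		wait *= 2
		if wait > maxWait {
			wait = maxWait
		}
	}
}
```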
This avoids needing a token just for doing a roots watch. In order to get a signed intermediate cert in the secondary DC you will still need `operator:write`.
This prevents leadership establishment failures when the servers are not yet WAN joined.
This prevents leadership flapping and a few other things. When a primary DC is configured, if no existing CA is initialized and for whatever reason we cannot initialize a secondary CA, the secondary DC will remain without a CA. As soon as it can, it will initialize the secondary CA by pulling the primary's roots and getting the primary to sign an intermediate.
Looking into a segfault I had seen before my last force push. It is concerning at the least.
The segfault was caused by leadership revocation after the secondary CA roots watch had been run. This could have happened even with the old code's behavior, so I am reinstating my changes. We now disallow initializing a secondary datacenter's CA as a primary in the case where the primary cannot sign intermediate certs yet (servers in the primary are not on new enough versions).
Fixes #6192
Essentially this stops using autopilot + RPCs to determine server versions. Instead it just uses the local serf metadata.
This fixes the linked issue by no longer performing the RPCs that would otherwise require a special ACL token. We could have just used the replication token there, but these places already have all the information in the serf metadata, so we can use that instead of making cross-DC requests.
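As a rough illustration of the serf-metadata approach: the sketch below assumes the `agent/metadata` package's `IsConsulServer` helper, which parses a member's gossip tags (including its build version). It is not the exact code in this PR.

```go
package serfversions

import (
	"fmt"

	"github.com/hashicorp/consul/agent/metadata"
	"github.com/hashicorp/go-version"
	"github.com/hashicorp/serf/serf"
)

// PrintServerBuilds walks serf members (e.g. from the WAN pool) and reads each
// server's version from its gossip tags instead of issuing autopilot/RPC calls.
func PrintServerBuilds(members []serf.Member, minVersion *version.Version) {
	for _, m := range members {
		ok, srv := metadata.IsConsulServer(m)
		if !ok {
			continue // not a Consul server (e.g. a client agent)
		}
		fmt.Printf("%s (dc=%s, status=%v) build=%s meets-min=%v\n",
			srv.Name, srv.Datacenter, srv.Status, srv.Build.String(),
			srv.Build.Compare(minVersion) >= 0)
	}
}
```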