HA sync issue #148
Probably need a sidecar with just busybox in it. It does nothing but sleep. Then a readiness hook in it that checks the age of files in the PVC; if they are less than a few minutes old, report unready for a minute.
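A minimal sketch of what that sidecar could look like. The container name, mount path, and age threshold are all assumptions for illustration, not part of any existing chart:

```yaml
# Hypothetical sidecar sketch: name, path, and threshold are assumptions.
- name: ca-settle-check
  image: busybox
  command: ["sleep", "infinity"]   # does nothing but sleep
  volumeMounts:
    - name: spire-data             # the server's PVC (assumed volume name)
      mountPath: /run/spire/data
      readOnly: true
  readinessProbe:
    # Report unready while any file in the PVC was modified within the
    # last 2 minutes, giving a freshly generated CA time to propagate
    # before the pod starts receiving traffic.
    exec:
      command:
        - sh
        - -c
        - '[ -z "$(find /run/spire/data -mmin -2 -type f)" ]'
    periodSeconds: 15
```

Because a pod is only ready when all of its containers are ready, gating this sidecar's readiness is enough to keep the server pod out of the Service endpoints while the data is still settling.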
We can skip the sidecar if an upstreamAuthority is specified, or if the sqlite backend is used (replicas must always be 1 in that case).
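In a Helm template, that condition could be sketched roughly like this; the value names (`upstreamAuthority`, `dataStore.sql.databaseType`) are hypothetical and would need to match whatever the chart actually exposes:

```yaml
{{- /* Hypothetical guard: only render the settle-check sidecar when no
       upstreamAuthority is configured and the backend is not sqlite
       (sqlite forces replicas to 1, so there is nothing to sync). */}}
{{- if and (not .Values.upstreamAuthority) (ne .Values.dataStore.sql.databaseType "sqlite3") }}
# ... settle-check sidecar container definition would go here ...
{{- end }}
```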
I'm not sure we need to fix this in the charts, because this is how it is designed to work without Helm charts. Yes, there is a period of time where certs are not in sync, because it is a distributed system. During that period, agents typically still get valid certs, because both CAs are valid (their validity periods overlap). With that in mind, there are also scenarios where a CA is purposefully expired, but that involves revocation lists (or waiting for the TTL of the cert to expire). In both cases, how the CA is sourced and used is a function of the UpstreamAuthority plugin, and SPIRE is designed not to source and sync these synchronously across all Server instances. Is the proposal to fix this by adding additional logic in SPIRE to make the asynchronous behavior synchronous?
It's a problem when not using an UpstreamAuthority plugin. Non-containerized SPIRE has the same issue, but hits it less often because orchestration layers outside Kubernetes are not as fast: humans take long enough to set things up that the race is unlikely. Kubernetes has enough automation that it does hit it, and folks who want to autoscale their SPIRE servers would definitely hit this issue, I believe. The solution would be to make sure the pod doesn't become ready until the CA is added to the bundle in Kubernetes.
There was a conversation on Slack about multiple instances of the SPIRE server making their own CAs when in HA mode, and usually waiting a certain amount of time for a new instance's CA to sync to the agents before adding that instance to the LoadBalancer. We currently do not do this. We either need to make the server's initialDelaySeconds a larger number, like 60+ seconds, or build a dynamic readiness probe that waits only on new instances. If not done, agents may get valid certs that other agents don't trust for a while.
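The static option above is a one-line change to the server's readiness probe. A sketch, assuming the server exposes its health-check listener on port 8080 with a `/ready` path (the port and path here are illustrative and depend on the server's health_checks configuration):

```yaml
# Sketch of the static option: raise initialDelaySeconds so a new
# instance's CA has time to reach agents before the pod is added to
# the LoadBalancer. Port, path, and timings are illustrative values.
readinessProbe:
  httpGet:
    path: /ready
    port: 8080          # assumed SPIRE server health-check port
  initialDelaySeconds: 60   # 60+ seconds, per the discussion above
  periodSeconds: 10
```

This delays every instance equally, including restarts of existing ones, which is why a dynamic probe that only gates genuinely new instances would be the better long-term fix.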