Etcd cluster fails to come up when TLS, Pod labels and Replicas are all changed with upgrade from v0.22.x to v0.23.0 #881

Closed
unmarshall opened this issue Sep 24, 2024 · 0 comments · Fixed by #883
Labels: area/control-plane (Control plane related), kind/bug (Bug), status/closed (Issue is closed (either delivered or triaged))

unmarshall commented Sep 24, 2024

How to categorize this issue?

/area control-plane
/kind bug

What happened:
etcd-druid v0.23.0 changes the StatefulSet (STS) pod labels and label selector. The selector is an immutable field (and the pod labels must match it), so updating the STS spec in place does not work: the update is rejected. v0.23.0 therefore orphan-deletes the STS and then applies the updated STS spec. However, when this is combined with a peer TLS change and a replicas change in the same reconciliation, not all members of the etcd cluster are able to join.
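
To make the orphan-delete path concrete, here is a minimal sketch of the update-then-recreate flow described above. It is a toy in-memory model, not druid's code and not the Kubernetes client API: `FakeCluster`, `ImmutableFieldError` and `apply_sts` are hypothetical names introduced only for illustration, and the spec is reduced to a plain dict.

```python
class ImmutableFieldError(Exception):
    """Raised when an update tries to change an immutable field."""


class FakeCluster:
    """Toy in-memory stand-in for the API server (hypothetical, not client code)."""

    def __init__(self):
        self.statefulsets = {}  # STS name -> spec dict
        self.pods = {}          # pod name -> owning STS name, or None if orphaned

    def create_sts(self, name, spec):
        self.statefulsets[name] = dict(spec)

    def update_sts(self, name, new_spec):
        # Mimic the rejection of a selector change on an existing STS.
        if new_spec["selector"] != self.statefulsets[name]["selector"]:
            raise ImmutableFieldError("spec.selector is immutable")
        self.statefulsets[name] = dict(new_spec)

    def orphan_delete_sts(self, name):
        # Remove the STS object only; its pods keep running without an owner.
        del self.statefulsets[name]
        for pod, owner in self.pods.items():
            if owner == name:
                self.pods[pod] = None


def apply_sts(cluster, name, new_spec):
    """Update in place when possible, else orphan-delete and recreate."""
    try:
        cluster.update_sts(name, new_spec)
        return "updated"
    except ImmutableFieldError:
        cluster.orphan_delete_sts(name)
        cluster.create_sts(name, new_spec)
        return "recreated"
```

Note that in the recreate branch the whole new spec lands at once, so a replicas change and a peer TLS change arrive together with the selector change. That combination is the one this issue is about.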

What was observed:

  • The etcd-test-0 pod gets restarted, but without peer TLS (the pod never gets the new peerTLS spec from the STS, so it has no peer TLS secrets mounted at all).
  • Almost immediately, two new pods, etcd-test-1 and etcd-test-2, get spawned, before etcd-test-0 has become ready and without waiting for the peer-tls-enabled annotation on the lease to become true.
  • The two new pods have peerTLS in their spec (peer TLS secrets are mounted correctly), so they expect etcd-test-0 to have peer TLS enabled. But because etcd-test-0 is still running with peerTLS disabled, the two new pods fail to come up (they expect an HTTPS response but get HTTP instead).

What you expected to happen:
The etcd cluster should come up and become healthy even when pod labels, the label selector, peer TLS and replicas all change simultaneously.
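
One way to picture the expected ordering is a gate that holds the scale-up until every existing member reports peer TLS as enabled. The sketch below is only an illustration under stated assumptions, not the fix that was merged: `replicas_to_apply` and `peer_tls_enabled_everywhere` are hypothetical helpers, member leases are modeled as plain dicts of annotations, and the annotation key is a placeholder taken from the wording of this issue.

```python
# Placeholder key, taken from the wording of the issue; the real key may differ.
PEER_TLS_ANNOTATION = "peer-tls-enabled"


def peer_tls_enabled_everywhere(member_leases):
    """True only if every given lease carries the annotation set to "true"."""
    return all(
        lease.get(PEER_TLS_ANNOTATION) == "true" for lease in member_leases
    )


def replicas_to_apply(current, desired, peer_tls_wanted, member_leases):
    """Return the replica count that is safe to apply right now.

    A scale-up is held back while peer TLS is wanted but not yet
    enabled on all existing members.
    """
    if desired <= current:
        return desired
    existing = member_leases[:current]
    if peer_tls_wanted and not peer_tls_enabled_everywhere(existing):
        return current
    return desired
```

With the scenario from this issue, `replicas_to_apply(1, 3, True, [{}])` keeps the cluster at one replica until the lease of etcd-test-0 carries the annotation, and only then allows the move to three.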

How to reproduce it (as minimally and precisely as possible):

  1. Check out etcd-druid v0.22.5.
  2. Run make kind-up and make deploy.
  3. Create an Etcd with replicas=1, peerTLS disabled, and spec.labels foo=bar (additional pod labels).
  4. Annotate it with gardener.cloud/operation=reconcile (the previous version of druid requires this annotation even when the Etcd resource is first created).
  5. Wait for reconciliation to succeed and etcd to become ready.
  6. Modify the Etcd spec: set replicas=3, enable peerTLS, and add a new label foo1=bar1 to spec.labels (this simulates changes that gardenlet makes, such as adding new networking labels to allow peer communication).
  7. Ensure that the Etcd spec is NOT reconciled (do not annotate with gardener.cloud/operation=reconcile).
  8. Check out druid v0.23.0.
  9. Run make deploy again (you might need to run make clean-tools-bin to remove the old skaffold version).
  10. Once druid v0.23.0 is up, reconcile the Etcd spec using the annotation and observe.
@gardener-robot gardener-robot added area/control-plane Control plane related kind/bug Bug labels Sep 24, 2024
unmarshall added a commit to unmarshall/etcd-druid that referenced this issue Sep 30, 2024
unmarshall added a commit to unmarshall/etcd-druid that referenced this issue Oct 23, 2024
Added ability to handle unknown CLI args to allow switching between v0.22 and v0.23
Added use-etcd-wrapper cli arg for etcdbr container
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Oct 23, 2024
shreyas-s-rao pushed a commit to shreyas-s-rao/etcd-druid that referenced this issue Oct 23, 2024
…client and peer communication (gardener#883)

* fixes gardener#881, gardener#877
* Added ability to handle unknown CLI args to allow switching between v0.22 and v0.23
* Added use-etcd-wrapper cli arg for etcdbr container
* removed etcd-cluster-size label to be added later with ability to restore while keep etcd.spec.replicas > 1
shreyas-s-rao added a commit that referenced this issue Oct 23, 2024
…client and peer communication (#883) (#894)

* fixes #881, #877
* Added ability to handle unknown CLI args to allow switching between v0.22 and v0.23
* Added use-etcd-wrapper cli arg for etcdbr container
* removed etcd-cluster-size label to be added later with ability to restore while keep etcd.spec.replicas > 1

Co-authored-by: Madhav Bhargava <[email protected]>