Always maintain etcd quorum, ensure wait signals are sufficient #411
Comments
@c-knowles Thanks for bringing this up 👍
@mumoshu sorry for my late reply. I've upgraded our clusters to the latest kube-aws, so I'm awaiting an opportunity to retest this. If I am on one of the etcd nodes, do you have a recommendation on how to access etcd via etcdctl? The version on the instance seems to be the older v2, which confused me for a while. v3 is running inside rkt, and I also had some trouble trying to get into the container (I'm new to rkt; I think the container may not have a shell installed).
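For reference, one possible way in, sketched under assumptions: the app name `etcd`, the in-image `etcdctl` path, and the availability of `env` inside the image all vary by kube-aws/etcd version, so verify them on the node.

```sh
# Find the UUID of the running etcd rkt pod:
sudo rkt list

# The etcd image bundles a matching etcdctl, so invoke it via `rkt enter`
# rather than opening a shell (the image may not ship one). The app name
# and binary path below are assumptions to verify:
sudo rkt enter --app=etcd <pod-uuid> /usr/local/bin/etcdctl cluster-health

# For the v3 API, etcdctl needs ETCDCTL_API=3 in its environment (assuming
# an `env` binary exists inside the image):
sudo rkt enter --app=etcd <pod-uuid> /usr/bin/env ETCDCTL_API=3 \
  /usr/local/bin/etcdctl --endpoints=http://127.0.0.1:2379 endpoint health
```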
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Extracted from #332.
Background
@c-knowles did a rolling update of etcd, intended to preserve uptime, using v0.9.5-rc.3. It seems there was a slight pause in etcd responsiveness in a 3-node cluster when the state was:
The pause was circa 20 seconds, and various processes including kubectl and the dashboard became momentarily unresponsive. I just wanted to check whether anyone has seen anything similar before trying to diagnose further. Each of the wait signals passed after around 5 minutes, so it looks like this was etcd-related somehow.
Details
From @mumoshu: each etcd2 member (i.e. the etcd2 process inside a rkt pod) doesn't wait on startup until it is connected and ready to serve requests, and there's no way to know when the member is actually ready.
For example, running

```
etcdctl --peers <first etcd member's advertised peer url> cluster-health
```

against the first member would block until enough of the remaining etcd members have joined to meet quorum (2 for a 3-node cluster). Using that as a per-member readiness check therefore hits a chicken-and-egg problem and breaks the wait signals, which is why kube-aws does not wait for an etcd2 member to become ready; the goal was to avoid downtime completely. For @mumoshu the downtime was under 1 second on a first attempt, but the result is suspected to vary from run to run, hence @c-knowles' case.
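A minimal sketch of that deadlock, with invented names (`etcd0.internal`, the gating loop itself); this is illustrative, not actual kube-aws code:

```sh
# Suppose each member's cloud-init gated its CloudFormation wait signal on the
# member reporting healthy. On the first member of a fresh 3-node cluster:
until etcdctl --peers "http://etcd0.internal:2379" cluster-health; do
  sleep 5   # blocks indefinitely: cluster-health needs quorum (2 members),
done        # but member 2 only launches after member 1 signals success
cfn-signal --success true   # never reached (stack/resource args omitted)
```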
Implementation
@mumoshu mentioned etcd3 seems to signal systemd for readiness when its systemd unit is set to Type=notify. So this may be covered by #381.
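A sketch of what that could look like on a node, assuming the CoreOS `etcd-member.service` unit name (the actual unit kube-aws generates may differ):

```sh
# etcd v3 can call sd_notify, so with Type=notify systemd only marks the unit
# "active" (and lets dependents and wait signals proceed) once the member is
# actually serving:
sudo mkdir -p /etc/systemd/system/etcd-member.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/etcd-member.service.d/10-notify.conf
[Service]
Type=notify
EOF
sudo systemctl daemon-reload
sudo systemctl restart etcd-member.service
```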
@redbaron mentioned an idea about drawing dependencies between the ASGs, so that CloudFormation rolls them one by one. That should allow quorum to be maintained at all times; a sketch follows.
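A hedged CloudFormation sketch of that idea; the resource names are invented, and kube-aws would generate its own:

```yaml
# Chaining one-member ASGs with DependsOn makes CloudFormation create and
# replace them sequentially, so at most one etcd member is down at a time
# and quorum (2 of 3) holds throughout a roll.
Resources:
  Etcd0Asg:
    Type: AWS::AutoScaling::AutoScalingGroup
    # launch configuration, update policy, wait signals omitted
  Etcd1Asg:
    Type: AWS::AutoScaling::AutoScalingGroup
    DependsOn: Etcd0Asg
  Etcd2Asg:
    Type: AWS::AutoScaling::AutoScalingGroup
    DependsOn: Etcd1Asg
```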