This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

Always maintain etcd quorum, ensure wait signals are sufficient #411

Closed
cknowles opened this issue Mar 13, 2017 · 13 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
Milestone
v0.9.6

Comments

@cknowles
Contributor

Extracted from #332.

Background

@c-knowles performed a rolling update of etcd using v0.9.5-rc.3, intended to preserve uptime. There was a brief pause in etcd responsiveness in the 3-node cluster when the state was:

  • the first new node was up and its old node had been terminated
  • the second new node was running but possibly not yet fully joined to the cluster, and its old node had been terminated
  • the third new node was not yet up and its old node was still running

The pause lasted around 20 seconds, during which various processes, including kubectl and the dashboard, became momentarily unresponsive. Has anyone seen anything similar before I try to diagnose further? Each of the wait signals passed after around 5 minutes, so this looks etcd related.

Details

From @mumoshu: each etcd2 member (that is, an etcd2 process inside a rkt pod) does not wait on startup until it is connected to the cluster and ready to serve requests, and there is no reliable way to know when the member is actually ready.

For example, running etcdctl --peers <first etcd member's advertised peer url> cluster-health would block until enough of the remaining etcd members have joined to meet quorum (2 for your cluster). Gating each member's startup on such a check therefore hits a chicken-and-egg problem and breaks the wait signals. That is why kube-aws does not wait for an etcd2 member to be ready, at the cost of not completely avoiding downtime.
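
A minimal sketch of the kind of readiness gate being discussed, assuming the v2 etcdctl on the host; the endpoint URL, timeout, and polling interval below are placeholders rather than kube-aws defaults:

# Poll cluster-health until etcd reports a healthy cluster, or give up.
# "cluster is healthy" is only printed once enough members have joined to
# form quorum (2 of 3 here), so during a fresh bootstrap this loop cannot
# pass until the other members are up -- the chicken-and-egg problem above.
PEER_URL="http://10.0.0.10:2379"   # hypothetical endpoint of the first member

for attempt in $(seq 1 60); do
  if etcdctl --peers "$PEER_URL" cluster-health | grep -q 'cluster is healthy'; then
    echo "etcd quorum reached"
    exit 0
  fi
  sleep 5
done

echo "timed out waiting for etcd quorum" >&2
exit 1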

In @mumoshu's own test, the downtime was less than 1 second on the first attempt, but the result likely varies from run to run, which would explain @c-knowles' case.

Implementation

@mumoshu mentioned that etcd3 appears to signal readiness to systemd when its unit is set to Type=notify, so this may be covered by #381.

@redbaron suggested declaring dependencies between the etcd ASGs so that CloudFormation rolls them one by one, which should allow quorum to be maintained at all times. One way to verify that behaviour is sketched below.
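
If that approach is adopted, one rough way to confirm the one-at-a-time behaviour during an update is to watch the stack events for ASG resources; this is only a sketch, and the stack name below is a placeholder:

# List ASG-related stack events; with dependencies between the etcd ASGs,
# only one of them should be in UPDATE_IN_PROGRESS at any given time.
aws cloudformation describe-stack-events \
  --stack-name my-kube-aws-cluster \
  --query 'StackEvents[?ResourceType==`AWS::AutoScaling::AutoScalingGroup`].[Timestamp,LogicalResourceId,ResourceStatus]' \
  --output table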

@mumoshu mentioned this issue Mar 13, 2017
@mumoshu added this to the v0.9.6 milestone Mar 22, 2017
@mumoshu
Contributor

mumoshu commented Apr 30, 2017

@c-knowles Thanks for bringing this up 👍
The etcd ASGs now have dependencies among them so that CloudFormation replaces them one by one.
Also, etcd3 is the default etcd version since kube-aws v0.9.6-rc.1, and I've tried my best to have the etcd3 systemd services notify systemd of readiness (Type=notify) whenever possible.
Could you confirm whether this issue is fixed?
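
One rough way to check both points on an etcd node, sketched under the assumption that the etcd3 unit is named etcd-member.service and with placeholder endpoint and certificate paths:

# The unit should report Type=notify if readiness notification is in effect.
systemctl show etcd-member.service --property=Type

# Query health through the v3 API; the endpoint and cert paths are
# placeholders and should be replaced with the values used by your cluster.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/path/to/etcd-trusted-ca.pem \
  --cert=/path/to/etcd-client.pem \
  --key=/path/to/etcd-client-key.pem \
  endpoint health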

@cknowles
Contributor Author

@mumoshu sorry for my late reply. I've upgraded our clusters to the latest kube-aws, so I'm awaiting an opportunity to retest this. If I am on one of the etcd nodes, do you have a recommendation on how to access etcd via etcdctl? The etcdctl version on the instance seems to be the older v2, which confused me for a while. v3 is running inside rkt, and I also had some trouble getting into the container (I'm new to rkt; I think the container may not have a shell installed).
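
For reference, two possible ways to reach the etcd3 member from the node. This is only a sketch: the app name, the binary path inside the image, and the endpoint and certificate paths are assumptions that may differ on your cluster.

# Option 1: if the host etcdctl is a 3.x binary that merely defaults to the
# v2 API, switch it to the v3 API with an environment variable.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/path/to/ca.pem --cert=/path/to/client.pem --key=/path/to/client-key.pem \
  member list

# Option 2: run etcdctl inside the rkt pod directly. The image may not ship
# a shell, so invoke etcdctl itself rather than trying to open a shell
# (add the same TLS flags as above if client traffic is TLS-only).
sudo rkt list                    # find the UUID of the running etcd pod
sudo rkt enter --app=etcd <pod-uuid> /usr/local/bin/etcdctl member list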

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label (denotes an issue or PR that has remained open with no activity and has become stale) Apr 21, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label (denotes an issue or PR that has aged beyond stale and will be auto-closed) and removed the lifecycle/stale label May 21, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

7 similar comments
This issue was closed.