
Set leave_on_terminate=true for servers and hardcode maxUnavailable=1 #3000

Merged

lkysow merged 1 commit into main from lkysow/server-restart on Jan 16, 2024

Conversation

@lkysow (Member) commented Sep 22, 2023

When leave_on_terminate=false (default), rolling the statefulset is
disruptive because the new servers come up with the same node IDs but
different IP addresses. They can't join the server cluster until the old
server's node ID is marked as failed by serf. During this time, they continually
start leader elections because they don't know there's a leader. When
they eventually join the cluster, their election term is higher, and so
they trigger a leadership swap. The leadership swap happens at the same
time as the next node to be rolled is being stopped, and so the cluster
can end up without a leader.

With leave_on_terminate=true, the stopping server cleanly leaves the
cluster, so the new server can join smoothly, even though it has the
same node ID as the old server. This increases the speed of the rollout
and in my testing eliminates the period without a leader.

The downside of this change is that when a server leaves gracefully, it
also reduces the number of raft peers. The number of peers is used to
calculate the quorum size, so this can unexpectedly change the fault
tolerance of the cluster. When running with an odd number of servers, 1
server leaving the cluster does not affect quorum size. E.g. 5 servers
=> quorum 3, 4 servers => quorum still 3. During a rollout, Kubernetes
only stops 1 server at a time, so the quorum won't change. During a
voluntary disruption event, e.g. a node being drained, Kubernetes uses
the pod disruption budget to determine how many pods in a statefulset
can be made unavailable at a time. That's why this change hardcodes the
pod disruption budget's maxUnavailable to 1.
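
For illustration, a PodDisruptionBudget pinned to maxUnavailable: 1 would look roughly like the sketch below. This is not the chart's exact rendered output (the name and selector labels are assumptions); it only shows the setting being hardcoded.

```yaml
# Minimal sketch: cap voluntary disruptions at one server pod at a time.
# Name and selector labels are illustrative, not the chart's rendered PDB.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: consul-server
spec:
  maxUnavailable: 1   # hardcoded to 1 by this change
  selector:
    matchLabels:
      app: consul
      component: server
```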

This change also sets autopilot min_quorum to the cluster's minimum quorum
size and disables autopilot upgrade migration, since that feature is intended
for blue/green deployments.
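
Purely as a sketch of the knobs involved, the equivalent Consul server settings would look roughly like the following if set by hand through the chart's server.extraConfig (this PR bakes them into the chart's own server config instead; the replica count and resulting min_quorum here are assumptions for a 3-server cluster).

```yaml
# Illustrative values only: shows the Consul config keys this change sets.
# With 3 servers, minimum quorum is floor(3/2) + 1 = 2.
server:
  replicas: 3
  extraConfig: |
    {
      "leave_on_terminate": true,
      "autopilot": {
        "min_quorum": 2,
        "disable_upgrade_migration": true
      }
    }
```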

How I've tested:

  • Acceptance test that restarts the servers and ensures there's no loss of quorum
  • Manual acceptance testing: bumping server versions and ensuring the rollout is smooth (https://gist.github.com/lkysow/d310e61dfd207c438528f393e709a0f8). I didn't add this to the test suite because it needs a helm install of an older version followed by an upgrade to the current version, which requires more thought/work on how that test and its versions will be kept up to date.
  • Manual testing of WAN federation: restarting servers on both sides and ensuring the new nodes are found and the connection still works
  • Manual testing of cluster peering: restarting servers on both sides; communication between services shows no issues and there are no errors in the logs

@lkysow lkysow changed the title wip: test server restart with/without leave_on_terminate Set leave_on_terminate=true for servers and hardcode maxUnavailable=1 Nov 23, 2023
@lkysow lkysow added the backport/0.49.x, backport/1.1.x, backport/1.2.x, and backport/1.3.x labels Nov 23, 2023
@lkysow lkysow force-pushed the lkysow/server-restart branch 2 times, most recently from aab25ee to 01843e6 on January 3, 2024 21:57
@lkysow lkysow marked this pull request as ready for review January 3, 2024 21:59
@lkysow lkysow removed the backport/0.49.x, backport/1.1.x, backport/1.2.x, and backport/1.3.x labels Jan 9, 2024
@wilkermichael (Contributor) commented Jan 15, 2024

@lkysow since you're adding a new acceptance test package, in order to get it to run in the pipeline you need to add it to the yaml matrices: https://github.com/hashicorp/consul-k8s/tree/lkysow/server-restart/acceptance/ci-inputs

@wilkermichael wilkermichael (Contributor) left a comment

Looks good to me, just a couple of nits (you can ignore them if they don't make sense), and you need to add the server acceptance package to the acceptance test matrices (see above).

Thanks for all the great comments too; it was very easy to see your reasoning behind things.

Resolved review threads (outdated): acceptance/tests/server/server_test.go (×2), charts/consul/values.yaml
@lkysow lkysow added the pr/no-backport label Jan 15, 2024
Resolved review thread (outdated): acceptance/tests/server/server_test.go
@lkysow lkysow enabled auto-merge (squash) January 16, 2024 01:56
@lkysow lkysow merged commit 4b6abc7 into main Jan 16, 2024
27 of 48 checks passed
@lkysow lkysow deleted the lkysow/server-restart branch January 16, 2024 02:33
@david-yu (Contributor) commented

@lkysow Thanks for working on this, could we close issue #1516? I believe that issue is now resolved by this PR.

lkysow added a commit to hashicorp/consul that referenced this pull request Jan 18, 2024
With hashicorp/consul-k8s#3000 merged, users can
upgrade their k8s installs using a regular helm upgrade since the
upgrade is now stable.
lkysow added a commit to hashicorp/consul that referenced this pull request Jan 18, 2024
* docs: update k8s upgrade instructions

With hashicorp/consul-k8s#3000 merged, users can
upgrade their k8s installs using a regular helm upgrade since the
upgrade is now stable.

Co-authored-by: trujillo-adam <[email protected]>
@curtbushko curtbushko mentioned this pull request Feb 13, 2024
natemollica-nm added a commit that referenced this pull request Feb 13, 2024
… re-apply datadog-integration branch changes
natemollica-nm added a commit that referenced this pull request Feb 13, 2024
* Datadog Integration (#3407)

* datadog-integration: updated consul-server agent telemetry-config.json with dd specific items as well as additional missing VM based options, unit tests, dd unix socket integration, dd agent acl token generation, deployment override failsafes

* datadog-integration: updated consul-server agent telemetry-config.json with dd specific items as well as additional missing VM based options, unit tests, dd unix socket integration, dd agent acl token generation | final initial-push

* changelog entry update

* datadog-integration: updated consul-server agent server.config (enable_debug) and telemetry.config update | enable_debug to server.config

* curt pr review changes (minus extraConfig templating verification changes)

* global.metrics.AgentMetrics -> global.metrics.enableAgentMetrics

* dogstatsd and otlp mutually exclusive verification checks

* breaking changes now incorporated into consul.validateExtraConfig helper template function as precheck

* extraConfig hash updates post merge conflict update

* fix helpers.tpl consul.extraConfig from merge --> /consul/tmp/extra-config/extra-from-values.json | add labels to rolebinding for datadog secrets

* update changelog .txt to match new PR number

* updated server-statefulset.yaml to correct ad.datadoghq.com/consul.logs annotation to valid single quote string

* fix helpers.tpl consul.extraConfig from merge --> /consul/tmp/extra-config/extra-from-values.json | add labels to rolebinding for datadog secrets

* fix helpers.tpl consul.extraConfig from merge --> /consul/tmp/extra-config/extra-from-values.json | add labels to rolebinding for datadog secrets

* update UDP dogstatsdPort behavior to exclude including a port value if using a kube service address (as determined by user overrides)

* update _helpers.tpl consul.ValidateDatadogConfiguration func to account for using 'https' as protocol => should fail

* update server-statefulset.yaml to exclude prometheus.io annotations if enabling datadog openmetrics method for consul server metrics scrape. conflict present with http vs https that breaks openmetrics scrape on consul

* update server-statefulset.yaml to exclude prometheus.io annotations if enabling datadog openmetrics method for consul server metrics scrape. conflict present with http vs https that breaks openmetrics scrape on consul

* correct otlp protocol helpers.tpl check to lower-case the protocol to match the open-telemetry-deployment.yaml behavior

* fix server-acl-init command_test.go for datadog token policy - datacenter should have been dc1

* add in server-statefulset bats test for extraConfig validation testing

* manual cherry-pick failed checks fix

* revert leave_on_terminate and autopilot updates from commit #3000 and re-apply datadog-integration branch changes