feature(raft): Rolling restart raft topology coordinator node #9102
base: master
Conversation
Force-pushed from a231f16 to ee78519
sdcm/nemesis.py (Outdated)

@@ -5151,6 +5151,35 @@ def disrupt_disable_binary_gossip_execute_major_compaction(self):
            self.target_node.restart_scylla_server()
            raise

    def disrupt_rolling_restart_topology_coordinator_node(self):
The name "rolling restart" doesn't match what this nemesis actually does: it restarts the same node multiple times. "Rolling" means the whole cluster is restarted, one node at a time (which is where the "rolling" comes from).
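To make the distinction concrete, a rough sketch (only restart_scylla_server and get_topology_coordinator_node come from this PR's diffs; the loop structure and cluster.nodes are illustrative):

def rolling_restart(cluster):
    # "Rolling" restart: every node of the cluster is restarted, one by one.
    for node in cluster.nodes:
        node.restart_scylla_server()


def repeated_coordinator_restart(cluster, num_of_restarts):
    # What this nemesis does: restart whichever node currently holds the
    # topology coordinator role, several times in a row.
    for _ in range(num_of_restarts):
        coordinator_node = get_topology_coordinator_node(cluster=cluster)
        coordinator_node.restart_scylla_server()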
@aleksbykov please update the PR headline and the commit headline to match the name change.
sdcm/utils/raft/common.py (Outdated)

@@ -44,6 +50,27 @@ def validate_raft_on_nodes(nodes: list[BaseNode]) -> None:
    LOGGER.debug("Raft is ready!")


def get_topology_coordinator_node(cluster: BaseScyllaCluster) -> BaseNode:
    active_nodes = cluster.get_nodes_up_and_normal()
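The diff shows only the signature and the first line. One possible shape for the rest, as a hedged sketch: find_coordinator_host_id is a hypothetical lookup (the actual resolution used by the PR is not shown here), and the host_id and name attributes on BaseNode are assumed.

import logging

from sdcm.cluster import BaseNode, BaseScyllaCluster  # import path assumed

LOGGER = logging.getLogger(__name__)


def get_topology_coordinator_node(cluster: BaseScyllaCluster) -> BaseNode:
    active_nodes = cluster.get_nodes_up_and_normal()
    # Hypothetical helper: query one live node for the host id of the
    # current topology coordinator.
    coordinator_host_id = find_coordinator_host_id(active_nodes[0])
    for node in active_nodes:
        if node.host_id == coordinator_host_id:
            LOGGER.debug("Current topology coordinator node: %s", node.name)
            return node
    raise RuntimeError(f"No active node matches coordinator host id {coordinator_host_id}")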
Maybe this function should be guarded with self.target_node.raft.is_consistent_topology_changes_enabled, to make sure it's not called when raft topology is disabled?
added
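A sketch of what such a guard could look like at the top of the nemesis method (the method name is taken from the earlier diff; raising UnsupportedNemesis to skip the nemesis is an assumption here):

    def disrupt_rolling_restart_topology_coordinator_node(self):
        # Skip the nemesis entirely when consistent topology changes
        # (raft topology) are not enabled on the target node.
        if not self.target_node.raft.is_consistent_topology_changes_enabled:
            raise UnsupportedNemesis("Raft consistent topology changes are not enabled")
        ...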
The name of this nemesis should have "rolling" in it.
Force-pushed from ee78519 to 06e5079
Isn't it a Scylla issue?
Force-pushed from 06e5079 to 46fc2e9
A few more small issues, otherwise LGTM.
If we backport it, it will shuffle all nemesis lists on the backported branches. Do we want this backported? cc @roydahan
sdcm/nemesis.py (Outdated)

            coordinator_node.start_scylla()
            assert coordinator_node != new_coordinator_node, \
                f"New coordinator node was not elected while old one {coordinator_node.name} was stopped"
            self.unset_current_running_nemesis(coordinator_node)
Let's unset before the assert.
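That is, rearranging the lines from the diff above so the node is released before the check:

            coordinator_node.start_scylla()
            # Release the node first, so it is freed even if the check fails.
            self.unset_current_running_nemesis(coordinator_node)
            assert coordinator_node != new_coordinator_node, \
                f"New coordinator node was not elected while old one {coordinator_node.name} was stopped"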
Agree, fixed.
Force-pushed from 46fc2e9 to d7dd6d8
I think we can live with it on master only.
Staging jobs passed:
One more thing and it's good to go for me.
sdcm/nemesis.py (Outdated)

            self.log.debug("Wait new topology coordinator election timeout: %s", election_wait_timeout)
            self.unset_current_running_nemesis(self.target_node)
            for _ in range(num_of_restarts):
                self.target_node = coordinator_node = get_topology_coordinator_node(cluster=self.cluster)
We should set the target node only after verifying it's not already running a nemesis.
We have machinery for selecting target_node; all target node selection should work in the same fashion. See the logic of run_nemesis.
Doing this in multiple places is error-prone and can lead to the same target node being selected by multiple nemesis threads; every selection should go through the same lock we introduced for that purpose.
With run_nemesis it's going to fail the test when the coordinator node is the one that currently runs a nemesis, because of assert free_nodes, f"couldn't find nodes for running: {nemesis_label}, are all nodes running nemesis ?".
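For context, the selection pattern under discussion looks roughly like this (a sketch only: the lock name, signature, and node bookkeeping are illustrative, not the real run_nemesis implementation; the assert message is the one quoted above):

import random
import threading
from contextlib import contextmanager

NEMESIS_TARGET_SELECTION_LOCK = threading.Lock()  # illustrative name for the shared lock


@contextmanager
def run_nemesis(node_list, nemesis_label):
    # Pick a free node under the shared lock so two nemesis threads cannot
    # grab the same target, then release it when the caller is done.
    with NEMESIS_TARGET_SELECTION_LOCK:
        free_nodes = [node for node in node_list if not node.running_nemesis]
        assert free_nodes, f"couldn't find nodes for running: {nemesis_label}, are all nodes running nemesis ?"
        target_node = random.choice(free_nodes)
        target_node.running_nemesis = nemesis_label
    try:
        yield target_node
    finally:
        target_node.running_nemesis = None

Passing in a list that contains only the current coordinator while that node is busy leaves free_nodes empty, which is the failure mode described in the reply above.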
When the current topology coordinator is not available, a new round of election should start and a new raft topology coordinator node will be elected. Added a new function to find the current coordinator node. Added a new nemesis to rolling-restart the elected coordinator node.
Force-pushed from d7dd6d8 to 383b743
LGTM
            new_coordinator_node = get_topology_coordinator_node(cluster=self.cluster)
            self.log.debug("New coordinator node: %s, %s", new_coordinator_node, new_coordinator_node.name)
            coordinator_node.start_scylla()
            self.unset_current_running_nemesis(coordinator_node)
If there is a failure you won't reach this line, and the node would be left marked, so no more nemeses would run on it.
We should use a context manager (similar to run_nemesis) or a try/finally clause here.
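A sketch of the try/finally shape being suggested, built from the lines in the diff above (whether start_scylla also belongs in the finally block is a judgment call; it is shown there as one option):

            try:
                new_coordinator_node = get_topology_coordinator_node(cluster=self.cluster)
                self.log.debug("New coordinator node: %s, %s", new_coordinator_node, new_coordinator_node.name)
            finally:
                # Bring the stopped coordinator back and release it even if
                # finding the new coordinator (or a later check) fails.
                coordinator_node.start_scylla()
                self.unset_current_running_nemesis(coordinator_node)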
But won't the nemesis machinery unset the running nemesis upon finishing? (That's why the target node needs to be re-set on each loop iteration.)
When the current topology coordinator is not available, a new round of election should start and a new raft topology coordinator node will be elected.
Added a new function to find the current coordinator node.
Added a new nemesis to rolling-restart the elected coordinator node.
Testing

PR pre-checks (self review)
- backport labels

Reminders
- sdcm/sct_config.py
- unit-test/ folder