Create overview page for the suspicious replica recoverer daemon #373

ChristophAmes · 2024-08-28T13:09:56Z

The suspicious replica recoverer daemon isn't easy to understand at first glance. This PR therefore adds an overview page that explains the general structure of the daemon and the goal of each step.

cserf · 2024-08-30T13:04:24Z

The text is OK. I would add a diagram describing the state machine, similar to https://github.com/rucio/documentation/blob/main/website/static/img/request_state_transition_chart.svg, even if it might be a bit complicated to fit everything in just one figure.

cserf · 2024-08-30T12:59:19Z

docs/operator/suspicious_replica_recoverer.md

+-**ignore**: this is the default policy. Datatypes and scopes can be explicitly set to be ignored, which highlights that a decision has purposefully been made to not perform any actions on these replicas. This is done to prevent mistakes in the future.
+-**declare bad**: this dictates that any associated datatypes or scopes will be declared `BAD` by the daemon.
+-**dry run**: this policy makes the daemon handle the replicas as if they were to be declared `BAD`, but at the final step, no actions are taken. This results in log messages with which it becomes possible to see how many replicas of the given datatype/scope would be declared `BAD` by the daemon.
+


You should mention that the policy is a JSON file and show an example.
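
To illustrate what such an example could look like, here is a minimal Python sketch that reads a policy from JSON and resolves the action for a replica. The JSON layout, the key names (`action`, `datatype`, `scope`), the sample values, and the helper `resolve_action` are all assumptions made for this illustration and are not taken from the daemon's actual schema.

```python
import json

# Hypothetical policy file content; key names and values are assumptions.
POLICY_JSON = """
[
  {"action": "declare bad", "datatype": ["RAW"], "scope": []},
  {"action": "dry run", "datatype": [], "scope": ["user.test"]},
  {"action": "ignore", "datatype": ["log"], "scope": []}
]
"""


def resolve_action(policies, datatype, scope):
    """Return the action of the first matching entry, else the default 'ignore'."""
    for entry in policies:
        if datatype in entry["datatype"] or scope in entry["scope"]:
            return entry["action"]
    # "ignore" is described in the quoted text as the default policy.
    return "ignore"


policies = json.loads(POLICY_JSON)
print(resolve_action(policies, "RAW", "data24"))
print(resolve_action(policies, "AOD", "user.test"))
print(resolve_action(policies, "AOD", "data24"))
```

Falling back to `ignore` when nothing matches mirrors the statement that `ignore` is the default policy.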

voetberg · 2024-09-19T19:55:19Z

Can you add a reference to this page here? Could be useful to also understand why replicas become suspicious in the first place and where the recoverer daemon comes in

bari12 · 2024-10-22T11:48:54Z

Ping @ChristophAmes would you mind still doing these changes?

haozturk · 2024-11-07T13:39:23Z

I spent a while reading this daemon's code and adapting it for CMS. I'd like to add a few things to its documentation, including this flow chart [1] and how it should be configured [2], which isn't trivial. If this PR is going to be merged, I can wait and open a new PR with my changes.

[1] https://cmsdmops.docs.cern.ch/CMSRucio/Daemon_configurations/image.png
[2] https://cmsdmops.docs.cern.ch/CMSRucio/Daemon_configurations/replica_recoverer/

ChristophAmes · 2024-11-08T13:24:03Z

Sorry I haven't responded, I've been rather busy.
I've added an example for the JSON file and added a link to the overview of the replica workflow.
I think the flow chart created by @haozturk looks good, so I won't make one myself.

rdimaio · 2024-11-12T12:13:43Z

docs/operator/suspicious_replica_recoverer.md

+
+A replica can be declared suspicious multiple times: each time an attempt to access the replica results in an error message, the replica is declared suspicious. This allows the daemon to handle replicas differently depending on how many times they have been declared suspicious. As long as a file has been declared suspicious fewer than a certain number of times (referred to as `nattempts`), it's assumed that there is nothing wrong with the replica and that the errors can be ignored. Once there are more than `nattempts` suspicious declarations, the replica is handled by the daemon.
+
+Before replicas are handled individually, the daemon checks how many suspicious replicas are on each Rucio storage element (RSE), which are the servers that host replicas. If an RSE has more than `limit_suspicious_files_on_rse` suspicious replicas, then it is assumed that the problem lays with the RSE and not the replicas themselves. Under such a circumstance, the replicas on that RSE are set to the state `TEMPORARILY UNAVAILABLE` for three days. A replica in this state can't be interacted with. The assumption is that after three days, problems with the RSE will have been fixed. If not, then the replicas on that RSE will end-up being declared suspicious en masse, which will result in them being set as `TEMPORARILY UNAVAILABLE` again. This cycle will be repeated until the underlying issue is solved.


Suggested change

Before replicas are handled individually, the daemon checks how many suspicious replicas are on each Rucio storage element (RSE), which are the servers that host replicas. If an RSE has more than `limit_suspicious_files_on_rse` suspicious replicas, then it is assumed that the problem lays with the RSE and not the replicas themselves. Under such a circumstance, the replicas on that RSE are set to the state `TEMPORARILY UNAVAILABLE` for three days. A replica in this state can't be interacted with. The assumption is that after three days, problems with the RSE will have been fixed. If not, then the replicas on that RSE will end-up being declared suspicious en masse, which will result in them being set as `TEMPORARILY UNAVAILABLE` again. This cycle will be repeated until the underlying issue is solved.

Before replicas are handled individually, the daemon checks how many suspicious replicas are on each Rucio storage element (RSE), which are the servers that host replicas. If an RSE has more than `limit_suspicious_files_on_rse` suspicious replicas, then it is assumed that the problem lies with the RSE and not the replicas themselves. Under such a circumstance, the replicas on that RSE are set to the state `TEMPORARILY UNAVAILABLE` for three days. A replica in this state can't be interacted with. The assumption is that after three days, problems with the RSE will have been fixed. If not, then the replicas on that RSE will end up being declared suspicious en masse, which will result in them being set as `TEMPORARILY UNAVAILABLE` again. This cycle will be repeated until the underlying issue is solved.
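
The two checks described in the quoted paragraphs (the `nattempts` threshold, then the per-RSE limit) could be sketched roughly as follows. This is only an illustration under assumptions: the function `triage`, the shape of its input, and the choice to count only replicas above `nattempts` towards the RSE limit are hypothetical and not taken from the daemon's code.

```python
from collections import Counter


def triage(declarations, nattempts, limit_suspicious_files_on_rse):
    """Split suspicious replicas into RSE-level and replica-level handling.

    `declarations` maps (rse, replica_name) to the number of times the
    replica was declared suspicious (hypothetical input shape).
    """
    # Replicas at or below the threshold are left alone.
    candidates = [key for key, count in declarations.items() if count > nattempts]

    # Count the remaining suspicious replicas on each RSE.
    per_rse = Counter(rse for rse, _ in candidates)
    problematic_rses = {
        rse for rse, count in per_rse.items()
        if count > limit_suspicious_files_on_rse
    }

    # Replicas on a problematic RSE would become TEMPORARILY UNAVAILABLE;
    # the others are handled one by one according to their policy.
    temporarily_unavailable = sorted(k for k in candidates if k[0] in problematic_rses)
    handle_individually = sorted(k for k in candidates if k[0] not in problematic_rses)
    return temporarily_unavailable, handle_individually


example = {
    ("RSE_A", "f1"): 3,
    ("RSE_A", "f2"): 4,
    ("RSE_A", "f3"): 5,
    ("RSE_B", "f4"): 3,
    ("RSE_B", "f5"): 1,
}
unavailable, individual = triage(example, nattempts=2, limit_suspicious_files_on_rse=2)
print(unavailable)
print(individual)
```

In this example, `RSE_A` exceeds the limit, so its three replicas are treated as an RSE problem, while the single replica on `RSE_B` that is above the threshold is handled individually.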

rdimaio · 2024-11-12T12:14:55Z

docs/operator/suspicious_replica_recoverer.md

+-**declare bad**: this dictates that any associated datatypes or scopes will be declared `BAD` by the daemon.
+-**dry run**: this policy makes the daemon handle the replicas as if they were to be declared `BAD`, but at the final step, no actions are taken. This results in log messages with which it becomes possible to see how many replicas of the given datatype/scope would be declared `BAD` by the daemon.
+
+The replica polices can easily be expanded in the future.


Suggested change

The replica polices can easily be expanded in the future.

The replica policies can easily be expanded in the future.

rdimaio · 2024-11-12T12:16:39Z

docs/operator/suspicious_replica_recoverer.md

+
+## `nattempts = 1`
+
+A very large number of suspicious replicas have `nattempts = 1`. To clean-up the database, replicas with `nattempts = 1` that also have a policy that would result in the replica being declared bad are given a "boost". This means that rules for those replicas are created. These rules only exist to create an attempt to interact with the replica. If there is in fact a problem with the replica (or the RSE), then each rule will result in an error and that replica will be declared `SUSPICIOUS` once for each rule, which will bring the number of declarations over the `nattempts` barrier. This results in the replica being handled normally by the daemon during the daemon's next cycle.


Suggested change

A very large number of suspicious replicas have `nattempts = 1`. To clean-up the database, replicas with `nattempts = 1` that also have a policy that would result in the replica being declared bad are given a "boost". This means that rules for those replicas are created. These rules only exist to create an attempt to interact with the replica. If there is in fact a problem with the replica (or the RSE), then each rule will result in an error and that replica will be declared `SUSPICIOUS` once for each rule, which will bring the number of declarations over the `nattempts` barrier. This results in the replica being handled normally by the daemon during the daemon's next cycle.

A very large number of suspicious replicas have `nattempts = 1`. To clean up the database, replicas with `nattempts = 1` that also have a policy that would result in the replica being declared bad are given a "boost". This means that rules for those replicas are created. These rules only exist to create an attempt to interact with the replica. If there is in fact a problem with the replica (or the RSE), then each rule will result in an error and that replica will be declared `SUSPICIOUS` once for each rule, which will bring the number of declarations over the `nattempts` barrier. This results in the replica being handled normally by the daemon during the daemon's next cycle.
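
As a rough illustration of the "boost" described above, the following sketch simulates what happens to the declaration count. The `Replica` class, the `boost` function, the `broken` flag, and the number of rules are hypothetical names and values chosen for the example; the real daemon creates Rucio rules rather than updating a counter directly.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    declarations: int  # times the replica has been declared suspicious
    policy: str        # "ignore", "declare bad" or "dry run"
    broken: bool       # whether access attempts really fail (simulation only)


def boost(replica, n_rules):
    """Simulate the boost: each rule triggers one access attempt.

    Only replicas with a single declaration and a policy that would end
    in a bad declaration are boosted. Every failed attempt adds one
    suspicious declaration.
    """
    if replica.declarations == 1 and replica.policy == "declare bad":
        if replica.broken:
            replica.declarations += n_rules
    return replica.declarations


nattempts = 3
broken = Replica("file_a", declarations=1, policy="declare bad", broken=True)
healthy = Replica("file_b", declarations=1, policy="declare bad", broken=False)

print(boost(broken, n_rules=3) > nattempts)
print(boost(healthy, n_rules=3) > nattempts)
```

A replica that is genuinely broken ends up above the threshold and is picked up in the next cycle, while a healthy one stays at a single declaration.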

rdimaio · 2024-11-12T12:20:27Z

docs/operator/suspicious_replica_recoverer.md

+## Active mode
+
+By default, the entire suspicious replica recoverer is ran in a passive mode, meaning that it will create log files as if it were taking actions on replicas, however, it does not take any actions. This passive mode allows for test runs of the daemon without accidentally causing damage. The active mode, which then takes actions on the replicas, has to be specified when the daemon is called.


I'd personally rewrite it like this for clarity, but it's just my opinion

Suggested change

## Active mode

By default, the entire suspicious replica recoverer is ran in a passive mode, meaning that it will create log files as if it were taking actions on replicas, however, it does not take any actions. This passive mode allows for test runs of the daemon without accidentally causing damage. The active mode, which then takes actions on the replicas, has to be specified when the daemon is called.

## Passive and active modes

The suspicious replica recoverer has two modes of operation:

- **Passive (default)**: no actions are taken by the daemon, but log files are generated as if the actions were taken (like a dry-run mode). Useful for testing daemon runs without affecting data.

- **Active**: the daemon is allowed to take actions on the replicas. This option has to be set explicitly when the daemon is called.
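
The difference between the two modes can be pictured with a small sketch like the one below. The function `apply_action`, its `active_mode` parameter, and the logger name are assumptions for illustration only and do not reflect the daemon's real interface.

```python
import logging

logger = logging.getLogger("replica_recoverer_sketch")


def apply_action(replica, action, active_mode=False):
    """Log the intended action; carry it out only in active mode.

    Returns True if the action was actually performed.
    """
    logger.info("Handling replica %s (active_mode=%s)", replica, active_mode)
    if not active_mode:
        # Passive (default): only the log entry is produced.
        return False
    action(replica)
    return True


declared_bad = []
apply_action("replica_1", declared_bad.append)                    # passive
apply_action("replica_2", declared_bad.append, active_mode=True)  # active
print(declared_bad)
```

Making the passive behaviour the default means that forgetting the option can never modify replicas by accident.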

ChristophAmes self-assigned this Aug 28, 2024

cserf reviewed Aug 30, 2024

ChristophAmes force-pushed the 372-Create_overview_page_for_the_suspicious_replica_recoverer_daemon branch from f7bad85 to 326d12c Compare November 8, 2024 13:01

Create overview page for the suspicious replica recoverer daemon

1f1a888

ChristophAmes force-pushed the 372-Create_overview_page_for_the_suspicious_replica_recoverer_daemon branch from 326d12c to 1f1a888 Compare November 8, 2024 13:20

rdimaio reviewed Nov 12, 2024
