Update FAR README for V0.2.0
Include FAR v0.2.0 changes and update some old places
razo7 committed Aug 23, 2023
1 parent 3e0294e commit 0b010c8
Showing 1 changed file with 35 additions and 9 deletions.
44 changes: 35 additions & 9 deletions README.md
@@ -1,20 +1,21 @@
# Fence Agents Remediation (FAR)

The fence-agents-remediation (*FAR*) is a Kubernetes operator generated using the [operator-sdk](https://github.com/operator-framework/operator-sdk), and it is part of [Medik8s](https://github.com/medik8s) operators. This operator is designed to run an existing set of [upstream fencing agents](https://github.com/ClusterLabs/fence-agents) for environments with a traditional API end-point (e.g., [IPMI](https://en.wikipedia.org/wiki/Intelligent_Platform_Management_Interface)) for power cycling cluster nodes.
The fence-agents-remediation (*FAR*) is a Kubernetes operator generated using the [operator-sdk](https://github.com/operator-framework/operator-sdk), and it is part of [Medik8s](https://github.com/medik8s) operators. The FAR operator provides high availability for Kubernetes nodes in an automatic manner. FAR runs a fence-agent to remediate a node from an unhealthy state by power-cycling the node (using a management interface or traditional API).

The operator watches for new or deleted custom resources (CRs) called `FenceAgentsRemediation` (or `far`) which trigger a fence-agent to remediate a node, based on the CR's name.
The FAR operator was designed to run with the Node HealthCheck Operator [(NHC)](https://github.com/medik8s/node-healthcheck-operator) as an external remediator for an easier and smoother experience, but it can be used as a standalone remediator by the more advanced user.
FAR joins Medik8s as another remediator alternative for NHC, apart from [Self Node Remediation](https://github.com/medik8s/self-node-remediation) and [Machine Deletion Remediation](https://github.com/medik8s/machine-deletion-remediation) which are also from the [Medik8s](https://www.medik8s.io/) group.
For an easier and smoother experience, the FAR operator was designed to run with the Node HealthCheck Operator [(NHC)](https://github.com/medik8s/node-healthcheck-operator) as an external remediation operator. It can be used as a standalone remediation operator for the more advanced user.
Furthermore, FAR joins Medik8s as another remediator alternative for NHC, apart from [Self Node Remediation](https://github.com/medik8s/self-node-remediation) and [Machine Deletion Remediation](https://github.com/medik8s/machine-deletion-remediation) which are also from the [Medik8s](https://www.medik8s.io/) group.

FAR operator includes plenty of well known [fence-agents](https://github.com/medik8s/fence-agents-remediation/blob/main/Dockerfile#L31) to choose from (see [here](https://github.com/ClusterLabs/fence-agents/tree/main/agents) for the full list), thanks to the upstream [fence-agents repo](https://github.com/ClusterLabs/fence-agents) from *ClusterLabs*.
Currently FAR has been tested only with one fence-agent [*fence_ipmilan*](https://www.mankier.com/8/fence_ipmilan) - I/O Fencing agent which can be used with machines controlled by IPMI, and using [ipmitool](<http://ipmitool.sf.net/>).
FAR uses a fence-agent to fence a Kubernetes node. Generally, fencing is the process of bringing an unresponsive or unhealthy computer into a safe state by isolating it. A fence agent is software that uses a management interface to perform fencing, mostly power-based fencing, which enables power-cycling, resetting, or turning off the computer.

The FAR operator includes numerous [fence-agents](https://github.com/medik8s/fence-agents-remediation/blob/main/Dockerfile#L41) to choose from, taken from the [upstream repository by the *ClusterLabs* group](https://github.com/ClusterLabs/fence-agents/tree/main/agent). Out of this large list, two agents have been tested: [*fence_ipmilan*](https://www.mankier.com/8/fence_ipmilan) for Intelligent Platform Management Interface ([IPMI](https://en.wikipedia.org/wiki/Intelligent_Platform_Management_Interface)) environments, and [*fence_aws*](https://manpages.ubuntu.com/manpages/focal/man8/fence_aws.8.html) for the Amazon Web Services ([AWS](https://aws.amazon.com/)) platform.

## Installation

There are two ways to install the operator:

* Deploy the latest version, which was built from the `main` branch, to a running Kubernetes/OpenShift cluster.
<!-- TODO: - Deploy the last release version from OperatorHub to a running Kubernetes cluster. -->
<!-- TODO: - Deploy the latest release version from [OperatorHub](https://operatorhub.io/operator/fence-agents-remediation) to a running Kubernetes cluster. -->
* Build and deploy from sources to a running or to be created Kubernetes/OpenShift cluster.

### Deploy the latest version
@@ -27,7 +28,6 @@ For deployment of FAR using these images you need:
* A running OpenShift cluster, or a Kubernetes cluster with Operator Lifecycle Manager ([OLM](https://olm.operatorframework.io/docs/)) installed (to install it run `operator-sdk olm install`).

* A valid `$KUBECONFIG` configured to access your cluster.
<!-- TODO: ATM it can't be installed on the default namespace -->
Then, run `operator-sdk run bundle quay.io/medik8s/fence-agents-remediation-operator-bundle:latest` to deploy FAR's latest version in the current namespace.

### Build and deploy from sources
@@ -37,6 +37,20 @@ Then, run `operator-sdk run bundle quay.io/medik8s/fence-agents-remediation-oper
* Follow OLM's [instructions](https://sdk.operatorframework.io/docs/building-operators/golang/tutorial/#configure-the-operators-image-registry) on how to configure the operator's image registry (build and push the operator container).
* Run FAR using one of the [suggested options from OLM](https://sdk.operatorframework.io/docs/building-operators/golang/tutorial/#run-the-operator) to run it locally, in the cluster, or in the cluster using a bundle container (similar to the [above installation](#deploy-the-latest-version)).

## Workflow

1. One of the nodes fails (and becomes unhealthy)
2. FAR adds a NoExecute taint to the failed node
=> This ensures that no workloads run on the failed node after it reboots, and any stateless pods (that can't tolerate the FAR NoExecute taint) are evicted immediately.
3. FAR reboots the failed node via the fence agent
=> After rebooting, there are no workloads on the failed node
4. FAR forcefully deletes the pods and the volume attachments on the failed node
=> The scheduler can schedule the failed node's workloads on a different node
5. After the failed node becomes healthy, the NoExecute taint from step 2 is removed and the node becomes schedulable again.
=> This taint is removed when the CR for remediating the node is deleted. As long as the CR is valid (its name matches a cluster node), the matched node should be tainted with the FAR NoExecute taint.
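
The ordering of the steps above can be illustrated with a small in-memory model. This is only a sketch, not FAR's implementation: the `Node` class, the `remediate` and `finish` helpers, the taint key, and the fence agent callable are all hypothetical names chosen for the example, and no Kubernetes API is called.

```python
from dataclasses import dataclass, field

# Hypothetical taint key, used only for this illustration.
FAR_TAINT = "example.com/far-no-execute"


@dataclass
class Node:
    name: str
    taints: set = field(default_factory=set)
    pods: list = field(default_factory=list)
    volume_attachments: list = field(default_factory=list)


def remediate(node, fence_agent):
    """Apply workflow steps 2-4 to a failed node and return the removed pods."""
    # Step 2: taint first, so nothing restarts on the node after the reboot.
    node.taints.add(FAR_TAINT)
    # Step 3: reboot through the fence agent (any callable taking a node name).
    if not fence_agent(node.name):
        raise RuntimeError(f"fence agent failed for {node.name}")
    # Step 4: drop pods and volume attachments so the workloads can be
    # scheduled on a different node.
    removed, node.pods = node.pods, []
    node.volume_attachments.clear()
    return removed


def finish(node):
    """Step 5: deleting the CR removes the taint; the node is schedulable again."""
    node.taints.discard(FAR_TAINT)


node = Node("worker-1", pods=["db-0", "web-1"], volume_attachments=["pv-db-0"])
evicted = remediate(node, fence_agent=lambda name: True)
print(evicted, sorted(node.taints))
finish(node)
print(sorted(node.taints))
```

The taint is added before the reboot and removed only in `finish`, which mirrors the rule that the node stays tainted for as long as the remediation CR exists.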

## Status

## Usage

FAR is recommended for use with NHC to create a complete solution for unhealthy nodes, since NHC detects unhealthy nodes and creates an external remediation CR, e.g., FAR's CR, for them.
@@ -120,18 +134,30 @@ spec:
worker-0: "6233"
worker-1: "6234"
worker-2: "6235"
status:
  conditions:
  - status: false
    message: Node Healthcheck timeout annotation has been set
    reason: RemediationFinishedNodeNotFound
    type: Processing
  - status: true
    type: FenceAgentActionSucceeded
  - status: true
    type: Succeeded
  lastUpdateTime: '2023-08-23T09:25:13Z'
```
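
A client that inspects the `status` block of the sample CR might check the conditions as in the sketch below. The helper name `is_condition_true` is hypothetical, and treating "`Succeeded` is true and `Processing` is false" as "remediation finished" is an assumption made for the example, not a documented contract. The sample shows boolean values, while objects read from a cluster usually carry the strings `"True"` and `"False"`, so the helper accepts both.

```python
def is_condition_true(conditions, cond_type):
    """Return True if a condition of the given type exists and its status is true."""
    for cond in conditions:
        if cond.get("type") == cond_type:
            # Accept both booleans (as in the sample) and "True"/"False" strings.
            return str(cond.get("status")).lower() == "true"
    return False


# Mirrors the conditions of the sample CR status.
conditions = [
    {"type": "Processing", "status": False,
     "reason": "RemediationFinishedNodeNotFound"},
    {"type": "FenceAgentActionSucceeded", "status": True},
    {"type": "Succeeded", "status": True},
]

# Assumption: remediation is finished once Succeeded is true and
# Processing is no longer true.
remediation_done = (
    is_condition_true(conditions, "Succeeded")
    and not is_condition_true(conditions, "Processing")
)
print(remediation_done)
```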

## Tests

### Run code checks and unit tests

`make test`
Run `make test`

### Run e2e tests

1. Deploy the operator as explained above
2. Run `make test-e2e`
2. If the cluster is running on the AWS platform, run `make ocp-aws-credentials` to add a sufficient [CredentialsRequest](https://github.com/medik8s/fence-agents-remediation/blob/main/config/ocp_aws/fence_aws_credentials_request.yaml)
3. Run `export OPERATOR_NS=openshift-operators && make test-e2e` if the operator is installed in the `openshift-operators` namespace

## Help
