Unable to fence nodes with fence_azure_arm agent #90

Closed
jcanocan opened this issue Oct 10, 2023 · 6 comments · Fixed by #91
Labels
good first issue Good for newcomers

Comments

@jcanocan
Contributor

Hi!

I'm currently experimenting with FAR on Azure VMs. I've been able to install NHC and FAR in an OCP 4.13 cluster, create the FAR template, and start the remediation process. This is the FAR template I'm currently using:

apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
kind: FenceAgentsRemediationTemplate
metadata:
  name: fenceagentsremediationtemplate-default
  namespace: openshift-operators
spec:
  template:
    spec:
      sharedparameters:
        '--action': reboot
        '-l': ea6bxxx
        '-p': y~xxx
        '--resourceGroup': jcano-cluster-mfxww-rg
        '--tenantId': 60xxx
        '--subscriptionId': 89xxx
      nodeparameters:
        '--plug=':
          jcano-cluster-mfxww-master-0: jcano-cluster-mfxww-master-0
          jcano-cluster-mfxww-master-1: jcano-cluster-mfxww-master-1
          jcano-cluster-mfxww-master-2: jcano-cluster-mfxww-master-2
          jcano-cluster-mfxww-worker-germanywestcentral1-b58kw: jcano-cluster-mfxww-worker-germanywestcentral1-b58kw
          jcano-cluster-mfxww-worker-germanywestcentral2-h6zwd: jcano-cluster-mfxww-worker-germanywestcentral2-h6zwd
          jcano-cluster-mfxww-worker-germanywestcentral3-xd7h5: jcano-cluster-mfxww-worker-germanywestcentral3-xd7h5
      agent: fence_azure_arm

I tried the fence_azure_arm tool standalone, locally, to restart a faulty VM running an OCP node. To do so, I stopped the kubelet process to bring the node to an unhealthy state. It worked, but it required a tiny modification; see: Azure/azure-sdk-for-python#30983 (comment)
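
For readers mapping the template to a standalone run, here is a minimal sketch that only prints a fence_azure_arm command line for one node. It reuses the redacted placeholder values from the template above; the helper name `build_fence_cmd` is hypothetical, and the way FAR itself assembles the arguments is not shown in this thread, so this is an assumption about the standalone invocation, not FAR's implementation.

```shell
# Print (do not execute) a standalone fence_azure_arm command line for one node.
# All values are the redacted placeholders from the template, not real credentials.
build_fence_cmd() {
  node=$1
  # Shared parameters: identical for every node.
  printf '%s ' fence_azure_arm \
    --action=reboot \
    -l ea6bxxx \
    -p 'y~xxx' \
    --resourceGroup=jcano-cluster-mfxww-rg \
    --tenantId=60xxx \
    --subscriptionId=89xxx
  # Node parameter: the VM name to fence.
  printf '%s\n' "--plug=$node"
}

build_fence_cmd jcano-cluster-mfxww-worker-germanywestcentral2-h6zwd
```

Printing the command instead of running it makes it easy to review the arguments before fencing a real VM.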

Nevertheless, it does not work with the FAR operator, which logs the following errors:

2023-10-10T15:08:07.128294848Z	INFO	controllers.FenceAgentsRemediation	Begin FenceAgentsRemediation Reconcile
2023-10-10T15:08:07.128341449Z	INFO	controllers.FenceAgentsRemediation	Check FAR CR's name
2023-10-10T15:08:07.138883921Z	INFO	controllers.FenceAgentsRemediation	Finalizer was added	{"CR Name": "jcano-cluster-mfxww-worker-germanywestcentral2-h6zwd"}
2023-10-10T15:08:07.138914222Z	INFO	controllers.FenceAgentsRemediation	Updating Status Condition	{"processingConditionStatus": "True", "fenceAgentActionSucceededConditionStatus": "Unknown", "succededConditionStatus": "Unknown", "reason": "RemediationStarted", "LastUpdateTime": "2023-10-10 15:08:07.138913322 +0000 UTC m=+23184.695547222"}
2023-10-10T15:08:07.151777431Z	INFO	controllers.FenceAgentsRemediation	Finish FenceAgentsRemediation Reconcile
2023-10-10T15:08:07.151923434Z	INFO	controllers.FenceAgentsRemediation	Begin FenceAgentsRemediation Reconcile
2023-10-10T15:08:07.151954534Z	INFO	controllers.FenceAgentsRemediation	Check FAR CR's name
2023-10-10T15:08:07.152025935Z	INFO	controllers.FenceAgentsRemediation	Try adding FAR (Medik8s) remediation taint	{"Fence Agent": "fence_azure_arm", "Node Name": "jcano-cluster-mfxww-worker-germanywestcentral2-h6zwd"}
2023-10-10T15:08:07.170359134Z	INFO	taints	Taint was added	{"taint effect": "NoExecute", "taint list": [{"key":"node.kubernetes.io/unreachable","effect":"NoSchedule","timeAdded":"2023-10-10T15:03:06Z"},{"key":"node.kubernetes.io/unreachable","effect":"NoExecute","timeAdded":"2023-10-10T15:03:12Z"},{"key":"medik8s.io/fence-agents-remediation","effect":"NoExecute","timeAdded":"2023-10-10T15:08:07Z"}]}
2023-10-10T15:08:07.170395735Z	INFO	controllers.FenceAgentsRemediation	Fetch FAR's pod
2023-10-10T15:08:07.170512137Z	INFO	controllers.FenceAgentsRemediation	Combine fence agent parameters	{"Fence Agent": "fence_azure_arm", "Node Name": "jcano-cluster-mfxww-worker-germanywestcentral2-h6zwd"}
2023-10-10T15:08:07.170539037Z	INFO	controllers.FenceAgentsRemediation	Execute the fence agent	{"Fence Agent": "fence_azure_arm", "Node Name": "jcano-cluster-mfxww-worker-germanywestcentral2-h6zwd"}
2023-10-10T15:08:07.340974815Z	ERROR	executer	Failed to run exec command	{"stdout": "", "stderr": "time=\"2023-10-10T15:08:07Z\" level=error msg=\"exec failed: unable to start container process: exec: \\\"fence_azure_arm\\\": executable file not found in $PATH\"\n", "error": "command terminated with exit code 255"}
github.com/medik8s/fence-agents-remediation/pkg/cli.executer.Execute
	/remote-source/app/pkg/cli/cliexecuter.go:92
github.com/medik8s/fence-agents-remediation/controllers.(*FenceAgentsRemediationReconciler).Reconcile
	/remote-source/app/controllers/fenceagentsremediation_controller.go:203
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226
2023-10-10T15:08:07.341030816Z	ERROR	controllers.FenceAgentsRemediation	Fence Agent response was a failure	{"CR's Name": "jcano-cluster-mfxww-worker-germanywestcentral2-h6zwd", "error": "command terminated with exit code 255"}
github.com/medik8s/fence-agents-remediation/controllers.(*FenceAgentsRemediationReconciler).Reconcile
	/remote-source/app/controllers/fenceagentsremediation_controller.go:206
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226
2023-10-10T15:08:07.350733575Z	INFO	controllers.FenceAgentsRemediation	Finish FenceAgentsRemediation Reconcile

It looks like FAR is not able to find the fence_azure_arm tool in its PATH.
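
The "executable file not found in $PATH" error can be confirmed with a small check like the sketch below. The function name `agent_on_path` is hypothetical, and the check only says something about FAR if it is run inside the container where FAR executes the agent.

```shell
# Return 0 if the named fence agent resolves on PATH, non-zero otherwise.
agent_on_path() {
  command -v "$1" >/dev/null 2>&1
}

if agent_on_path fence_azure_arm; then
  echo "fence_azure_arm found at $(command -v fence_azure_arm)"
else
  echo "fence_azure_arm is not on PATH: $PATH"
fi
```

`command -v` is the POSIX way to resolve a command name, so the check does not depend on `which` being installed in the image.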

Environment:

  • OCP version: 4.13
  • NHC version: 0.6.0
  • FAR version: 0.2.0

Thanks in advance!

@clobrano
Contributor

Hey @jcanocan,

it worked but requires a tiny modification, see: Azure/azure-sdk-for-python#30983 (comment)

thank you for pointing this out, really appreciated!

It looks like FAR it's not able to find the fence_azure_arm tool in PATH for its purpose.

I think fence_azure_arm is not installed in FAR's image. Currently the image installs fence-agents-all (and the AWS agent), but that does not seem to include the Azure one:

https://github.com/medik8s/fence-agents-remediation/blob/b2d3419a73a73231b70e46eb4fb28b39194609a6/Dockerfile#L39C1-L42C24

 ➤  docker run --rm -it quay.io/clobrano/fence-agents-remediation-fencing-agents bash
[root@4f3bb118da07 /]# fence_a 
fence_amt_ws    fence_apc       fence_apc_snmp  fence_aws       
[root@4f3bb118da07 /]# fence_ 
fence_amt_ws           fence_brocade          fence_eaton_snmp       fence_hpblade          fence_ilo2             fence_ilo5             fence_imm              fence_kdump            fence_rsb              fence_vmware_soap
fence_apc              fence_cisco_mds        fence_emerson          fence_ibmblade         fence_ilo3             fence_ilo5_ssh         fence_intelmodular     fence_mpath            fence_sbd              fence_wti
fence_apc_snmp         fence_cisco_ucs        fence_eps              fence_idrac            fence_ilo3_ssh         fence_ilo_moonshot     fence_ipdu             fence_redfish          fence_scsi             fence_xvm
fence_aws              fence_compute          fence_evacuate         fence_ifmib            fence_ilo4             fence_ilo_mp           fence_ipmilan          fence_rhevm            fence_virt             
fence_bladecenter      fence_drac5            fence_heuristics_ping  fence_ilo              fence_ilo4_ssh         fence_ilo_ssh          fence_ipmilanplus      fence_rsa              fence_vmware_rest      
[root@4f3bb118da07 /]# fence_

@razo7 razo7 added the good first issue Good for newcomers label Oct 11, 2023
@jcanocan
Contributor Author

Thanks for answering back! I'm glad to help 😊

Regarding Azure/azure-sdk-for-python#30983 (comment): it looks like they are not motivated to make the change. Moreover, it will take some time to land. Therefore, what do you think about including the following command right after the fence-azure-arm package installation?

RUN sed -i 's/\"instanceView\"/expand=\"instanceView\"/' /usr/sbin/fence_azure_arm 

I agree that it's not a very clean solution, just a workaround. Nevertheless, it will allow the fence agent to work.
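
To illustrate what the substitution does, the sketch below applies it to a sample line. The sample line and the function names `patch_once` and `patch_guarded` are hypothetical, and the real agent source may look different. The guarded variant is an assumption about how the workaround could be made safe to apply if the agent already contains `expand="instanceView"`, for example once a fix lands upstream.

```shell
# Hypothetical sample line; the real fence_azure_arm source may differ.
sample='vm = client.virtual_machines.get(rg, name, "instanceView")'

# The substitution as proposed above: rewrites every first "instanceView" match.
patch_once() {
  printf '%s\n' "$1" | sed 's/\"instanceView\"/expand=\"instanceView\"/'
}

# Guarded variant: only matches when the quote is not already preceded by "=",
# so an already-patched line is left unchanged.
patch_guarded() {
  printf '%s\n' "$1" | sed 's/\([^=]\)"instanceView"/\1expand="instanceView"/'
}

patch_once "$sample"
patch_guarded "$(patch_guarded "$sample")"
```

Both calls print the line with `expand="instanceView"` as the last argument. Applying `patch_once` to an already patched line would instead yield `expand=expand="instanceView"`, which is why a guard may be worth having in an image build.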

@clobrano
Copy link
Contributor

Looks like they are not motivated to make the change.

It seems they need to propagate the request to the right people :)

I would agree that it's not a very clean solution, just a workaround. Nevertheless, it will allow the fence agent work.

We actually want to decouple the operator's image from the one containing the agents so that one could use an image with a specific fencing agent and the related quirks to make it work.

@razo7
Member

razo7 commented Oct 11, 2023

First of all, thanks Javier for raising the idea of using the Azure fence agent!

Looks like they are not motivated to make the change.

Yes, how about creating a PR with the above fix in the https://github.com/ClusterLabs/fence-agents/tree/main repo? They are available on their mailing list if you want to discuss it beforehand.

@jcanocan
Contributor Author

We actually want to decouple the operator's image from the one containing the agents so that one could use an image with a specific fencing agent and the related quirks to make it work.

Thanks for letting me know. Sounds nice :)

First of all thanks Javier for noticing/raising the notion of using Azure fence agent!

Looks like they are not motivated to make the change.

Yes, how about creating a PR with the above fix to https://github.com/ClusterLabs/fence-agents/tree/main repo? They are available in their mailing list if you want to discuss about if beforehand.

Thanks for the suggestion. I misinterpreted the words in Azure/azure-sdk-for-python#30983 (comment), but I just realized that the Azure fence agent is independent of https://github.com/Azure/azure-sdk-for-python. Apologies for the confusion. I will try to post a PR fixing this issue in the fence agent.

Meanwhile, I will learn how to build the operator locally and deploy it in an OCP cluster.

@jcanocan
Contributor Author

Posted ClusterLabs/fence-agents#562. Just in case you are curious :)
