FAR E2E Test - Check Node Reboot by Boot Time and Patch CredentialsRequest for AWS #20

razo7 · 2023-03-06T12:59:16Z

This PR adds support for rebooting an OCP node in an AWS environment, together with the E2E test that checks FAR's code.

Until now, the E2E test checked whether FAR's CR had been created and whether the FA CLI command had been executed correctly by reviewing FAR's pod/container logs (both were done in #32). Now we check the node's boot time (the Kubelet Ready condition's status transition time) to verify that the FA performed a successful reboot.

However, to run the fence_aws fence agent with a reboot action outside of AWS, we use the --skip-race-check flag, and we patch a CredentialsRequest in OCP to add the missing AWS permissions (e.g., ec2:StartInstances and ec2:StopInstances).
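
For readers unfamiliar with the CredentialsRequest patch, here is a minimal sketch of what such a patch payload could look like. It is not the code from this PR: the JSON Patch path and the `effect`/`action`/`resource` field names are assumptions based on the usual shape of an AWS provider spec, and `buildPermissionsPatch` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// statementEntry mirrors one IAM statement in the CredentialsRequest's
// provider spec. The field names are an assumption, not taken from the PR.
type statementEntry struct {
	Effect   string   `json:"effect"`
	Action   []string `json:"action"`
	Resource string   `json:"resource"`
}

// patchOp is a single JSON Patch (RFC 6902) operation.
type patchOp struct {
	Op    string         `json:"op"`
	Path  string         `json:"path"`
	Value statementEntry `json:"value"`
}

// buildPermissionsPatch (hypothetical helper) returns a JSON Patch that
// appends one "Allow" statement with the given actions.
func buildPermissionsPatch(actions ...string) ([]byte, error) {
	ops := []patchOp{{
		Op:    "add",
		Path:  "/spec/providerSpec/statementEntries/-", // assumed path
		Value: statementEntry{Effect: "Allow", Action: actions, Resource: "*"},
	}}
	return json.Marshal(ops)
}

func main() {
	patch, err := buildPermissionsPatch("ec2:StartInstances", "ec2:StopInstances")
	if err != nil {
		fmt.Println("building patch failed:", err)
		return
	}
	fmt.Println(string(patch))
}
```

The resulting bytes could then be sent as a JSON Patch against the CredentialsRequest object with whatever client the test already uses.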

ECOPROJECT-1274

test/e2e/far_e2e_test.go

mshitrit · 2023-03-13T12:35:51Z

test/e2e/far_e2e_test.go

+			log.Info("Testing Node", "Node name", testNode.Name)
+
+			// save the node's boot time prior to the fence agent call
+			if cond, errBoot = getKubeletReadyCondition(testNode.Name); errBoot != nil {


I don't think that getting the last transition of the Ready condition is necessarily equivalent to the node's boot time.
Here is how we get the boot time in SNR

> I don't think that getting the last transition of the Ready condition is necessarily equivalent to the node's boot time.

What makes you think this way?

why is "kubelet" part of the function name? 🤔

even when it's ok to use the ready condition as reboot indicator, it's not exactly the same as the boot time, so at least the phrasing around this is misleading

and I'm also not sure if this will always work. AFAIK it takes 40s until an unresponsive node is marked as not ready. When the reboot is faster than that, your test will probably fail?

I agree about the comment around getKubeletReadyCondition.
If the reboot is faster than the roughly 40s it takes for an unresponsive node to be marked as not ready, why would this test fail? IIUC my test will wait longer, which is safer than finishing the test too early.
It might mean that my test lasts unnecessarily longer. ATM I don't see this as a problem, but when the E2E tests become more complicated, this check might be time-consuming 🤔

I expected you to test that the transition time to Ready status is after the test start time. When the node never becomes NotReady, that check will fail.

> When the node never gets unready that will fail.

Is that bad? I think it works as I would expect.
Is there a scenario where a node has been remediated and in the process of remediation the node never gets to Not Ready status?

> Is that bad? I think it works as I would expect.

You want to know if FAR works, right? You want to do that by checking that the Node rebooted, right? You want to do that by looking at the Ready condition, right? So when that condition never changes, how do you know that the node rebooted and FAR did something?

> Is there a scenario where a node has been remediated and in the process of remediation the node never gets to Not Ready status?

When remediation is faster than 40s: yes

> When remediation is faster than 40s: yes

Now, I got it. Thanks 👍🏻

I am reverting to verifying the reboot by looking at the boot time rather than using getKubeletReadyCondition.
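
To make the race discussed in this thread concrete, here is a small self-contained simulation (a sketch, not code from this PR). It assumes the roughly 40s grace period mentioned above and models a node whose Ready condition only changes when the reboot takes at least that long; `checkBoth` and `notReadyGracePeriod` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// notReadyGracePeriod is the ~40s delay quoted in the thread before an
// unresponsive node is marked NotReady (an assumption, not measured here).
const notReadyGracePeriod = 40 * time.Second

// checkBoth simulates a reboot lasting rebootSeconds and reports whether
// each verification strategy would notice it.
func checkBoth(rebootSeconds int) (byReadyCondition, byBootTime bool) {
	testStart := time.Date(2023, 3, 13, 12, 0, 0, 0, time.UTC)
	bootBefore := testStart.Add(-24 * time.Hour)  // node booted a day ago
	lastTransition := bootBefore.Add(time.Minute) // and became Ready soon after

	rebootDuration := time.Duration(rebootSeconds) * time.Second
	bootAfter := testStart.Add(rebootDuration)
	if rebootDuration >= notReadyGracePeriod {
		// Only a slow reboot is noticed, so only then does the condition flip.
		lastTransition = bootAfter
	}

	// Strategy 1: Ready transition must be newer than the test start.
	// Strategy 2: boot time must be newer than the one saved before fencing.
	return lastTransition.After(testStart), bootAfter.After(bootBefore)
}

func main() {
	for _, seconds := range []int{20, 90} {
		ready, boot := checkBoth(seconds)
		fmt.Printf("reboot=%ds readyCondition=%v bootTime=%v\n", seconds, ready, boot)
	}
}
```

In this model, a 20s reboot goes unnoticed by the Ready-condition check while the boot-time comparison still detects it, which is the failure mode raised above.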

mshitrit · 2023-03-13T12:40:54Z

I think that as long as we are using hard-coded node names or IPs, this test will not be able to run successfully in our CI.

test/e2e/far_e2e_test.go

slintes · 2023-03-13T15:58:11Z


test/e2e/far_e2e_test.go

mshitrit · 2023-06-26T11:27:22Z

pkg/utils/nodes.go

@@ -8,6 +8,8 @@ import (
 	"sigs.k8s.io/controller-runtime/pkg/client"
 )

+const WorkerLabelName = "node-role.kubernetes.io/worker"


Nit: might want to add this to common repo, IIRC it is also used in SNR & NHC

I'm wondering if this is somewhere in client-go / k8s api server code...

I didn't find it :)

I already have an open PR in common, and since it's a very small change I thought it would be OK to add that label there as well: medik8s/common#3
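
As an illustration of how a constant like WorkerLabelName can replace hard-coded node names, here is a minimal sketch that picks worker nodes by label. The `node` struct is a simplified stand-in for the Kubernetes Node type and `workerNodes` is a hypothetical helper; neither is taken from this PR.

```go
package main

import "fmt"

// WorkerLabelName is the role label added in pkg/utils/nodes.go.
const WorkerLabelName = "node-role.kubernetes.io/worker"

// node is a simplified stand-in for a Kubernetes Node object.
type node struct {
	Name   string
	Labels map[string]string
}

// workerNodes (hypothetical helper) keeps the nodes that carry the worker
// role label. Role labels usually have an empty value, so only the key's
// presence is checked.
func workerNodes(nodes []node) []node {
	var workers []node
	for _, n := range nodes {
		if _, ok := n.Labels[WorkerLabelName]; ok {
			workers = append(workers, n)
		}
	}
	return workers
}

func main() {
	nodes := []node{
		{Name: "master-0", Labels: map[string]string{"node-role.kubernetes.io/master": ""}},
		{Name: "worker-0", Labels: map[string]string{WorkerLabelName: ""}},
	}
	for _, w := range workerNodes(nodes) {
		fmt.Println("candidate test node:", w.Name)
	}
}
```

With a real client, the same selection is normally done server-side by listing nodes with a label selector on this key.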

mshitrit · 2023-06-26T11:30:24Z

/lgtm
/hold
Giving others final chance to review as well

clobrano

A couple of nits
/lgtm
giving the others a chance to give feedback
/hold
feel free to unhold

pkg/utils/pods.go

test/e2e/utils/cluster.go

razo7 · 2023-06-26T18:50:56Z

/unhold

razo7 · 2023-06-27T06:11:54Z

/retest

slintes · 2023-06-27T06:16:17Z

/hold

When someone has requested significant changes or we have discussed important issues with the PR, please wait until they have had a chance to give another review...

pkg/utils/pods.go

test/e2e/far_e2e_test.go

Sometimes, when we taint and reboot a node that was running the FAR pod, that pod restarts on a different node, and in the meantime the old pod is unavailable until it terminates. Therefore, we update GetFenceAgentsRemediationPod to return the first running pod that matches the FAR labels. In this scenario, GetFenceAgentsRemediationPod returns the new pod from the other, healthy node.
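
The selection logic described above can be sketched roughly as follows. This is not the actual GetFenceAgentsRemediationPod implementation: the `pod` struct is a simplified stand-in for the Kubernetes Pod type, `firstRunningPod` is a hypothetical helper, and the `app: far` label and the "Unknown" phase of the old pod are illustrative assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// pod is a simplified stand-in for a Kubernetes Pod object.
type pod struct {
	Name     string
	NodeName string
	Phase    string
	Labels   map[string]string
}

var errNoRunningPod = errors.New("no running pod matches the given labels")

// matches reports whether labels contains every key/value pair of selector.
func matches(labels, selector map[string]string) bool {
	for key, want := range selector {
		if got, ok := labels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// firstRunningPod (hypothetical helper) returns the first pod that matches
// the selector and is in the Running phase, skipping pods still stuck on
// the rebooted node.
func firstRunningPod(pods []pod, selector map[string]string) (pod, error) {
	for _, p := range pods {
		if p.Phase == "Running" && matches(p.Labels, selector) {
			return p, nil
		}
	}
	return pod{}, errNoRunningPod
}

func main() {
	selector := map[string]string{"app": "far"} // illustrative label only
	pods := []pod{
		{Name: "far-old", NodeName: "node-a", Phase: "Unknown", Labels: selector},
		{Name: "far-new", NodeName: "node-b", Phase: "Running", Labels: selector},
	}
	p, err := firstRunningPod(pods, selector)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(p.Name, "on", p.NodeName)
}
```

Returning an error when nothing is running lets the caller retry until the replacement pod has started.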

slintes

looks good now, nice work 👍🏼

openshift-ci · 2023-06-27T09:21:59Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clobrano, razo7, slintes

Needs approval from an approver in each of these files:

~~OWNERS~~ [clobrano,razo7,slintes]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

slintes · 2023-06-27T09:22:23Z

/hold cancel

openshift-ci · 2023-06-27T10:35:30Z

@razo7: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/openshift-e2e | `d607660` | link | true | `/test openshift-e2e` |

razo7 · 2023-06-27T11:17:19Z

/retest

openshift-ci bot requested review from beekhof and clobrano March 6, 2023 12:59

openshift-ci bot added the approved label Mar 6, 2023
