FAR E2E Test - Check Node Reboot by Boot Time and Patch CredentialsRequest for AWS #20

Merged
merged 9 commits into from
Jun 27, 2023

Conversation

razo7
Member

@razo7 razo7 commented Mar 6, 2023

This PR adds support for rebooting an OCP node in an AWS environment, along with an E2E test that exercises FAR's code.

Until now, the E2E test checked whether FAR's CR was created and whether the FA CLI command executed correctly by reviewing FAR's pod/container logs (both were done in #32). Now we check the node's boot time (Kubelet ready condition status transition time) to verify that the FA performed a successful reboot.

However, to run the fence_aws fence agent with a reboot action outside of AWS, we use the --skip-race-check flag and patch a CredentialsRequest in OCP to add the missing AWS permissions (e.g., ec2:StartInstances and ec2:StopInstances).
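As a rough illustration of the "add missing AWS permissions" step, the sketch below merges required actions into an existing list without duplicating entries. It assumes the permissions can be treated as a plain list of action strings; `mergeActions` and the starting set in `main` are hypothetical and are not taken from this PR's code.

```go
package main

import "fmt"

// mergeActions returns the existing actions followed by every required
// action that is not already present. Order is preserved and duplicates
// are dropped. This is a hypothetical helper, not FAR's implementation.
func mergeActions(existing, required []string) []string {
	seen := make(map[string]struct{}, len(existing)+len(required))
	merged := make([]string, 0, len(existing)+len(required))
	for _, list := range [][]string{existing, required} {
		for _, action := range list {
			if _, ok := seen[action]; ok {
				continue
			}
			seen[action] = struct{}{}
			merged = append(merged, action)
		}
	}
	return merged
}

func main() {
	// Hypothetical starting set; the real CredentialsRequest content is
	// not shown in this conversation.
	current := []string{"ec2:DescribeInstances", "ec2:StartInstances"}
	// The two actions named in the PR description.
	needed := []string{"ec2:StartInstances", "ec2:StopInstances"}
	fmt.Println(mergeActions(current, needed))
}
```

Merging rather than overwriting keeps whatever permissions the request already grants, which is probably the safer default when patching a shared resource.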

ECOPROJECT-1274

@openshift-ci openshift-ci bot requested review from beekhof and clobrano March 6, 2023 12:59
@openshift-ci openshift-ci bot added the approved label Mar 6, 2023
@razo7 razo7 changed the title from "FAR E2E tests - logs and boot time" to "[WIP] FAR E2E tests - logs and boot time" Mar 7, 2023
@razo7 razo7 force-pushed the e2e-test branch 8 times, most recently from 4df5153 to ba7bb4b Compare March 13, 2023 08:52
@razo7 razo7 changed the title from "[WIP] FAR E2E tests - logs and boot time" to "FAR E2E tests - logs and boot time" Mar 13, 2023
test/e2e/far_e2e_test.go Outdated
log.Info("Testing Node", "Node name", testNode.Name)

// save the node's boot time prior to the fence agent call
if cond, errBoot = getKubeletReadyCondition(testNode.Name); errBoot != nil {
Member

I don't think that getting the last transition of the Ready condition is necessarily equivalent to the node's boot time.
Here is how we get the boot time in SNR

Member Author

> I don't think that getting the last transition of the Ready condition is necessarily equivalent to the node's boot time.

What makes you think this way?

Member

  • why is "kubelet" part of the function name? 🤔
  • even if it's ok to use the Ready condition as a reboot indicator, it's not exactly the same as the boot time, so at least the phrasing around this is misleading
  • and I'm also not sure if this will always work. AFAIK it takes 40s until an unresponsive node is marked as not ready. When the reboot is faster than that, your test will probably fail?

Member Author

@razo7 razo7 Mar 14, 2023

I agree about the comment around getKubeletReadyCondition.
If the reboot is faster than the roughly 40s it takes until an unresponsive node is marked as not ready, then why would this test fail? IIUC my test will wait longer, which is safer than finishing the test too early.
It might mean that my test lasts unnecessarily long. ATM I don't see this as a problem, but once the E2E tests become more complicated, this might be time-consuming 🤔

Member

I expected that you test that the transition time to Ready status is after the test start time. When the node never gets unready that will fail.

Member Author

> When the node never gets unready that will fail.

Is that bad? I think it works as I would expect.
Is there a scenario where a node has been remediated and in the process of remediation the node never gets to Not Ready status?

Member

> Is that bad? I think it works as I would expect.

You want to know if FAR works, right? You want to do that by checking that the Node rebooted, right? You want to do that by looking at the Ready condition, right? So when that condition never changes, how do you know that the node rebooted and FAR did something?

> Is there a scenario where a node has been remediated and in the process of remediation the node never gets to Not Ready status?

When remediation is faster than 40s: yes

Member Author

> When remediation is faster than 40s: yes

Now, I got it. Thanks 👍🏻

Member Author

I am reverting to verifying the reboot by checking the boot time rather than using getKubeletReadyCondition.
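A minimal sketch of a boot-time comparison follows. It assumes the boot time is available as the output of `uptime -s` on the node (format `yyyy-mm-dd HH:MM:SS`); how the test actually retrieves it is not shown in this conversation, and `parseBootTime` and `rebootedSince` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// uptimeLayout matches the "yyyy-mm-dd HH:MM:SS" format printed by `uptime -s`.
const uptimeLayout = "2006-01-02 15:04:05"

// parseBootTime converts raw `uptime -s` output into a time.Time.
// No time zone is present in that output, so the value is parsed as UTC;
// this is fine for comparing two readings taken from the same node.
func parseBootTime(out string) (time.Time, error) {
	return time.Parse(uptimeLayout, strings.TrimSpace(out))
}

// rebootedSince reports whether the boot time read after remediation is
// strictly later than the one saved before the fence agent was called.
func rebootedSince(before, after string) (bool, error) {
	b, err := parseBootTime(before)
	if err != nil {
		return false, fmt.Errorf("parsing boot time before remediation: %w", err)
	}
	a, err := parseBootTime(after)
	if err != nil {
		return false, fmt.Errorf("parsing boot time after remediation: %w", err)
	}
	return a.After(b), nil
}

func main() {
	// Illustrative values only.
	ok, err := rebootedSince("2023-03-14 08:00:00\n", "2023-03-14 08:52:10\n")
	fmt.Println(ok, err)
}
```

Unlike the Ready condition, the boot time changes on every restart regardless of how quickly the node comes back.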

@mshitrit
Member

I think that as long as we are using hard-coded node names or IPs, this test will not be able to run successfully in our CI.
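One way to avoid hard-coded node names is to pick the test node by label at run time. The sketch below uses the worker label key that appears in this PR's diff; the `node` type and `pickWorker` function are hypothetical stand-ins for the Kubernetes Node object and whatever helper the test uses.

```go
package main

import (
	"errors"
	"fmt"
)

// WorkerLabelName is the label key from this PR's diff.
const WorkerLabelName = "node-role.kubernetes.io/worker"

// node is a hypothetical, minimal stand-in for a Kubernetes Node.
type node struct {
	Name   string
	Labels map[string]string
}

// pickWorker returns the first node that carries the worker label,
// so the test does not depend on a node name known in advance.
func pickWorker(nodes []node) (node, error) {
	for _, n := range nodes {
		if _, ok := n.Labels[WorkerLabelName]; ok {
			return n, nil
		}
	}
	return node{}, errors.New("no worker node found")
}

func main() {
	// Illustrative node list; real names come from the cluster.
	nodes := []node{
		{Name: "control-plane-0", Labels: map[string]string{"node-role.kubernetes.io/master": ""}},
		{Name: "worker-0", Labels: map[string]string{WorkerLabelName: ""}},
	}
	n, err := pickWorker(nodes)
	fmt.Println(n.Name, err)
}
```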

@@ -8,6 +8,8 @@ import (
"sigs.k8s.io/controller-runtime/pkg/client"
)

const WorkerLabelName = "node-role.kubernetes.io/worker"
Member

Nit: might want to add this to common repo, IIRC it is also used in SNR & NHC

Member

I'm wondering if this is somewhere in client-go / k8s api server code...

Member Author

I didn't find it :)

Member

@mshitrit mshitrit Jun 27, 2023

I already have an open PR in common. Since it's a very small change, I thought it would be ok to add that label there as well: medik8s/common#3

@mshitrit
Member

/lgtm
/hold
Giving others a final chance to review as well

Contributor

@clobrano clobrano left a comment

A couple of nits
/lgtm
giving the others a chance to give feedback
/hold
feel free to unhold

@razo7
Member Author

razo7 commented Jun 26, 2023

/unhold

@razo7
Member Author

razo7 commented Jun 27, 2023

/retest

@slintes
Member

slintes commented Jun 27, 2023

/hold

when someone has requested significant changes or we have discussed important issues with the PR, please wait until they have had a chance to give another review...

Sometimes, when we taint and reboot a node that was running the FAR pod, that pod is restarted on a different node, and in the meantime the old pod is unavailable until it terminates. Therefore, we update GetFenceAgentsRemediationPod to return the first running pod that matches the FAR labels. In this scenario, GetFenceAgentsRemediationPod returns the new pod, the healthy one on the other node.
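The selection rule described in this commit message can be sketched as follows. This is not the actual GetFenceAgentsRemediationPod code: the `pod` type, `firstRunningPod`, and the label key and value are hypothetical, and the real function presumably works on Kubernetes Pod objects fetched from the cluster.

```go
package main

import (
	"errors"
	"fmt"
)

// pod is a hypothetical, minimal stand-in for a Kubernetes Pod.
type pod struct {
	Name     string
	NodeName string
	Phase    string // e.g. "Running", "Pending"
	Labels   map[string]string
}

// matches reports whether every key/value pair in selector is set on the pod.
func matches(p pod, selector map[string]string) bool {
	for k, v := range selector {
		if p.Labels[k] != v {
			return false
		}
	}
	return true
}

// firstRunningPod returns the first pod that matches the selector and is
// in the Running phase, skipping pods that are not running yet or any more.
func firstRunningPod(pods []pod, selector map[string]string) (pod, error) {
	for _, p := range pods {
		if matches(p, selector) && p.Phase == "Running" {
			return p, nil
		}
	}
	return pod{}, errors.New("no running pod matches the selector")
}

func main() {
	// Hypothetical label; the real FAR labels are not shown in this conversation.
	selector := map[string]string{"app": "far-example"}
	pods := []pod{
		{Name: "far-old", NodeName: "worker-0", Phase: "Pending", Labels: selector},
		{Name: "far-new", NodeName: "worker-1", Phase: "Running", Labels: selector},
	}
	p, err := firstRunningPod(pods, selector)
	fmt.Println(p.Name, p.NodeName, err)
}
```

A pod that is being terminated can still report a Running phase in Kubernetes, so a real implementation may also need to check the deletion timestamp; that is outside this sketch.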
Member

@slintes slintes left a comment

looks good now, nice work 👍🏼

@openshift-ci openshift-ci bot added the lgtm label Jun 27, 2023
@openshift-ci
Contributor

openshift-ci bot commented Jun 27, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clobrano, razo7, slintes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [clobrano,razo7,slintes]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@slintes
Member

slintes commented Jun 27, 2023

/hold cancel

@openshift-ci
Contributor

openshift-ci bot commented Jun 27, 2023

@razo7: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/openshift-e2e d607660 link true /test openshift-e2e

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@razo7
Member Author

razo7 commented Jun 27, 2023

/retest

@openshift-merge-robot openshift-merge-robot merged commit 0ed0614 into medik8s:main Jun 27, 2023