-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FAR E2E Test - Check Node Reboot by Boot Time and Patch CredentialsRequest for AWS #20
Conversation
4df5153
to
ba7bb4b
Compare
test/e2e/far_e2e_test.go
Outdated
log.Info("Testing Node", "Node name", testNode.Name) | ||
|
||
// save the node's boot time prior to the fence agent call | ||
if cond, errBoot = getKubeletReadyCondition(testNode.Name); errBoot != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that getting the last transition of the Ready condition is necessarily equivalent to the node's boot time.
Here is how we get the boot time in SNR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that getting the last transition of the Ready condition is necessarily equivalent to the node's boot time.
What makes you think this way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- why is "kubelet" part of the function name? 🤔
- even when it's ok to use the ready condition as reboot indicator, it's not exactly the same as the boot time, so at least the phrasing around this is misleading
- and I'm also not sure of this will always work. AFAIK it takes 40s until an unresponsive node is marked as not ready. When the reboot is faster than that, you test will probably fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree about the comment around getKubeletReadyCondition
.
If the reboot is faster than waiting around 40s until an unresponsive node is marked as not ready, then why this test will fail? IIUC my test will wait more time which is safer than finish testing too early.
It might means that my test will unnecessary last longer. ATM I don't see this as a problem, but when the E2E tests will be more complicated that testing this might be time consuming 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expected that you test that the transition time to ready status is after start test time. When the node never gets unready that will fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the node never gets unready that will fail.
Is that bad? I think it works as I would expect.
Is there a scenario where a node has been remediated and in the process of remediation the node never gets to Not Ready status?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that bad? I think it works as I would expect.
You want to know if FAR works, right? You want to do that by checking that the Node rebooted, right? You want to do that by looking at the Ready condition, right? So when that condition never changes, how do you that the node rebooted and far did something?
Is there a scenario where a node has been remediated and in the process of remediation the node never gets to Not Ready status?
When remediation is faster than 40s: yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When remediation is faster than 40s: yes
Now, I got it. Thanks 👍🏻
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am reverting to verify reboot by looking for boot time rather than using the getKubeletReadyCondition
I think that as long as we are using hard coded node names or ip this test will not be able to run successfully in our CI |
test/e2e/far_e2e_test.go
Outdated
log.Info("Testing Node", "Node name", testNode.Name) | ||
|
||
// save the node's boot time prior to the fence agent call | ||
if cond, errBoot = getKubeletReadyCondition(testNode.Name); errBoot != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- why is "kubelet" part of the function name? 🤔
- even when it's ok to use the ready condition as reboot indicator, it's not exactly the same as the boot time, so at least the phrasing around this is misleading
- and I'm also not sure of this will always work. AFAIK it takes 40s until an unresponsive node is marked as not ready. When the reboot is faster than that, you test will probably fail?
@@ -8,6 +8,8 @@ import ( | |||
"sigs.k8s.io/controller-runtime/pkg/client" | |||
) | |||
|
|||
const WorkerLabelName = "node-role.kubernetes.io/worker" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: might want to add this to common repo, IIRC it is also used in SNR & NHC
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if this is somewhere in client-go / k8s api server code...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't find it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I already have an open PR in common, since it's a very small change I thought it would be ok to added that label there as well: medik8s/common#3
/lgtm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of nits
/lgtm
giving chance to the other to give feedback
/hold
feel free to unhold
/unhold |
/retest |
/hold when someone requested significant changes or we discussed important issues with the PR, please wait until he had a chance to give another review... |
Sometimes when we taint and reboot a node which had FAR, then that pod will be restarting on a different node, and in the meantime the old pod won't be available until it terminates. Therefore, we update GetFenceAgentsRemediationPod to return the first running pod that match FAR labels. On this scenario GetFenceAgentsRemediationPod will return the new pod, from the other node, the healthy one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good now, nice work 👍🏼
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: clobrano, razo7, slintes The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/hold cancel |
@razo7: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
/retest |
The PR adds the support of running a reboot of OCP node on AWS environment, and the E2E test that checks FAR's code.
Until now the E2E test checked if FAR's CR has been created and if the FA CLI command has been executed correctly by reviewing FAR's pod/container logs (both have been done in #32). Now we check the node's boot time (Kubelet ready condition status transition time) to verify that the FA has been doing a successful reboot.
But for running the fence_aws fence agent with a
reboot
action outside of AWS we use the --skip-race-check flag, and we patch a CredentialsRequest in OCP to add missing AWS permissions (e.g., ec2:StartInstances, and ec2:StopInstances).ECOPROJECT-1274