Check machine status and log details if it is not running #3887

Merged Oct 14, 2024 (2 commits)

Collaborator

@ventifus ventifus commented Oct 4, 2024

Which issue this PR addresses:

Provides more data for ARO-4247

What this PR does / why we need it:

Worker node creation failures are hard to troubleshoot during new cluster installation. We wait for worker nodes, but sometimes they never show up because Azure failed to create the VM. CAPI logs for machine creation are emitted too early and don't make it to Kusto. This PR checks each machine's status while waiting for worker nodes and logs the details if a machine is not running.
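As a rough illustration of the idea (not the code in this PR), the check could look something like the sketch below. The `machine` type, its fields, and `describeMachine` are hypothetical stand-ins invented for this example; the real change lives in pkg/cluster/condition.go and works against the cluster's Machine objects.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// machine is a hypothetical, simplified stand-in for a Machine object.
// It is NOT the type used by the PR.
type machine struct {
	Name   string
	Phase  *string        // assumed to be nil until a phase is reported
	Status map[string]any // provider status, logged verbatim as JSON
}

// describeMachine builds one log line for a machine and reports whether
// the machine is in the "Running" phase. A nil Phase is treated as unknown
// rather than dereferenced.
func describeMachine(m machine) (string, bool) {
	phase := "<unknown>"
	if m.Phase != nil {
		phase = *m.Phase
	}

	status, err := json.Marshal(m.Status)
	if err != nil {
		// Keep logging even if the status cannot be serialized.
		status = []byte(fmt.Sprintf("<unserializable: %v>", err))
	}

	line := fmt.Sprintf("Machine %s is %s; status: %s", m.Name, phase, status)
	return line, phase == "Running"
}
```

A caller would loop over the worker machines, log each returned line, and keep waiting until the expected number of nodes is ready. Logging the full status for every machine, not only the failing ones, is what makes both the happy path and a failed VM creation visible in the same place.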

Test plan for issue:

Deploy test clusters and confirm that the expected logs appear. Centraluseuap would be a good place to start if the error creating availability sets reproduces there.

Happy-path logs look like the following:

2024-10-07T17:48:32.6932154Z time="2024-10-07T17:48:32Z" level=info msg="running step [Condition pkg/cluster.(*manager).minimumWorkerNodesReady, timeout 30m0s]" func="steps.Run()" file="pkg/util/steps/runner.go:56"
2024-10-07T17:48:32.7492655Z time="2024-10-07T17:48:32Z" level=info msg="Machine v4-e2e-v105110278-eas-lj5v8-worker-eastus1-gsz9j is Running; status: {\"conditions\":[{\"lastTransitionTime\":\"2024-10-07T17:22:58Z\",\"message\":\"machine successfully created\",\"reason\":\"MachineCreationSucceeded\",\"status\":\"True\",\"type\":\"MachineCreated\"}],\"metadata\":{},\"vmId\":\"/subscriptions/***/resourceGroups/aro-v4-e2e-v105110278-eastus/providers/Microsoft.Compute/virtualMachines/v4-e2e-v105110278-eas-lj5v8-worker-eastus1-gsz9j\",\"vmState\":\"Running\"}" func="cluster.(*manager).minimumWorkerNodesReady()" file="pkg/cluster/condition.go:41"
2024-10-07T17:48:32.7498981Z time="2024-10-07T17:48:32Z" level=info msg="Machine v4-e2e-v105110278-eas-lj5v8-worker-eastus2-kfkb7 is Running; status: {\"conditions\":[{\"lastTransitionTime\":\"2024-10-07T17:23:20Z\",\"message\":\"machine successfully created\",\"reason\":\"MachineCreationSucceeded\",\"status\":\"True\",\"type\":\"MachineCreated\"}],\"metadata\":{},\"vmId\":\"/subscriptions/***/resourceGroups/aro-v4-e2e-v105110278-eastus/providers/Microsoft.Compute/virtualMachines/v4-e2e-v105110278-eas-lj5v8-worker-eastus2-kfkb7\",\"vmState\":\"Running\"}" func="cluster.(*manager).minimumWorkerNodesReady()" file="pkg/cluster/condition.go:41"
2024-10-07T17:48:32.7504072Z time="2024-10-07T17:48:32Z" level=info msg="Machine v4-e2e-v105110278-eas-lj5v8-worker-eastus3-zdpfd is Running; status: {\"conditions\":[{\"lastTransitionTime\":\"2024-10-07T17:23:09Z\",\"message\":\"machine successfully created\",\"reason\":\"MachineCreationSucceeded\",\"status\":\"True\",\"type\":\"MachineCreated\"}],\"metadata\":{},\"vmId\":\"/subscriptions/***/resourceGroups/aro-v4-e2e-v105110278-eastus/providers/Microsoft.Compute/virtualMachines/v4-e2e-v105110278-eas-lj5v8-worker-eastus3-zdpfd\",\"vmState\":\"Running\"}" func="cluster.(*manager).minimumWorkerNodesReady()" file="pkg/cluster/condition.go:41"
2024-10-07T17:48:32.8130248Z time="2024-10-07T17:48:32Z" level=info msg="Node v4-e2e-v105110278-eas-lj5v8-worker-eastus1-gsz9j status: [{MemoryPressure False 2024-10-07 17:46:14 +0000 UTC 2024-10-07 17:41:07 +0000 UTC KubeletHasSufficientMemory kubelet has sufficient memory available} {DiskPressure False 2024-10-07 17:46:14 +0000 UTC 2024-10-07 17:41:07 +0000 UTC KubeletHasNoDiskPressure kubelet has no disk pressure} {PIDPressure False 2024-10-07 17:46:14 +0000 UTC 2024-10-07 17:41:07 +0000 UTC KubeletHasSufficientPID kubelet has sufficient PID available} {Ready True 2024-10-07 17:46:14 +0000 UTC 2024-10-07 17:41:08 +0000 UTC KubeletReady kubelet is posting ready status}]" func="cluster.(*manager).minimumWorkerNodesReady()" file="pkg/cluster/condition.go:61"
2024-10-07T17:48:32.8136352Z time="2024-10-07T17:48:32Z" level=info msg="Node v4-e2e-v105110278-eas-lj5v8-worker-eastus2-kfkb7 status: [{MemoryPressure False 2024-10-07 17:45:47 +0000 UTC 2024-10-07 17:44:15 +0000 UTC KubeletHasSufficientMemory kubelet has sufficient memory available} {DiskPressure False 2024-10-07 17:45:47 +0000 UTC 2024-10-07 17:44:15 +0000 UTC KubeletHasNoDiskPressure kubelet has no disk pressure} {PIDPressure False 2024-10-07 17:45:47 +0000 UTC 2024-10-07 17:44:15 +0000 UTC KubeletHasSufficientPID kubelet has sufficient PID available} {Ready True 2024-10-07 17:45:47 +0000 UTC 2024-10-07 17:44:15 +0000 UTC KubeletReady kubelet is posting ready status}]" func="cluster.(*manager).minimumWorkerNodesReady()" file="pkg/cluster/condition.go:61"
2024-10-07T17:48:32.8141057Z time="2024-10-07T17:48:32Z" level=info msg="Node v4-e2e-v105110278-eas-lj5v8-worker-eastus3-zdpfd status: [{MemoryPressure False 2024-10-07 17:47:34 +0000 UTC 2024-10-07 17:47:34 +0000 UTC KubeletHasSufficientMemory kubelet has sufficient memory available} {DiskPressure False 2024-10-07 17:47:34 +0000 UTC 2024-10-07 17:47:34 +0000 UTC KubeletHasNoDiskPressure kubelet has no disk pressure} {PIDPressure False 2024-10-07 17:47:34 +0000 UTC 2024-10-07 17:47:34 +0000 UTC KubeletHasSufficientPID kubelet has sufficient PID available} {Ready True 2024-10-07 17:47:34 +0000 UTC 2024-10-07 17:47:34 +0000 UTC KubeletReady kubelet is posting ready status}]" func="cluster.(*manager).minimumWorkerNodesReady()" file="pkg/cluster/condition.go:61"
2024-10-07T17:48:32.8144826Z time="2024-10-07T17:48:32Z" level=info msg="3 nodes ready" func="cluster.(*manager).minimumWorkerNodesReady()" file="pkg/cluster/condition.go:69"

tsatam previously approved these changes Oct 4, 2024
Collaborator

@tsatam tsatam left a comment

LGTM (after fixing the pointer dereference in the log statement) - I think this functionality could do with some more robust unit tests but that can be a follow-up effort.

@ventifus ventifus force-pushed the ventifus/ARO-4247-minimum-worker-machines-ready branch 3 times, most recently from 3f8d33d to 25b7a44 Compare October 4, 2024 19:16
@tsatam
Collaborator

tsatam commented Oct 4, 2024

/azp run ci,e2e

Azure Pipelines successfully started running 2 pipeline(s).

@ventifus ventifus force-pushed the ventifus/ARO-4247-minimum-worker-machines-ready branch 2 times, most recently from 2f9c36c to 89c0c33 Compare October 4, 2024 23:37
@ventifus
Collaborator Author

ventifus commented Oct 4, 2024

/azp run ci,e2e

Azure Pipelines successfully started running 2 pipeline(s).

@ventifus ventifus force-pushed the ventifus/ARO-4247-minimum-worker-machines-ready branch from 89c0c33 to 8f6ca27 Compare October 7, 2024 16:53
@ventifus
Collaborator Author

ventifus commented Oct 7, 2024

/azp run ci,e2e

Azure Pipelines successfully started running 2 pipeline(s).

@ventifus ventifus force-pushed the ventifus/ARO-4247-minimum-worker-machines-ready branch from 8f6ca27 to 8d8c47e Compare October 7, 2024 18:18
@ventifus
Collaborator Author

ventifus commented Oct 7, 2024

/azp run ci,e2e

Azure Pipelines successfully started running 2 pipeline(s).

Contributor

@kimorris27 kimorris27 left a comment

LGTM except for one spot where I think it's worth checking for nil pointers.
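For illustration only, one common way to guard a pointer before using it in a log statement is a small helper like the one below. `derefOr` is a hypothetical name made up for this sketch; it is not taken from the PR or from any library.

```go
package main

// derefOr returns the value p points to, or fallback when p is nil.
// Hypothetical helper for illustration; requires Go 1.18+ for generics.
func derefOr[T any](p *T, fallback T) T {
	if p == nil {
		return fallback
	}
	return *p
}
```

With a helper like this, a log call can pass `derefOr(phase, "<unknown>")` instead of `*phase`, so a machine that has not reported a phase yet produces a readable log line rather than a panic.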

pkg/cluster/condition.go (review thread resolved)
@ventifus ventifus force-pushed the ventifus/ARO-4247-minimum-worker-machines-ready branch from 407e4f3 to 9ac41f2 Compare October 9, 2024 23:15
@ventifus
Collaborator Author

ventifus commented Oct 9, 2024

/azp run ci,e2e

Azure Pipelines successfully started running 2 pipeline(s).

Collaborator

@cadenmarchese cadenmarchese left a comment

lgtm

@cadenmarchese cadenmarchese merged commit 9684d43 into master Oct 14, 2024
20 checks passed
slawande2 pushed a commit that referenced this pull request Oct 15, 2024
* Check machine status and log details if it is not running

* Resolve comments from review