Collect logging for exited containers #70

Merged: 2 commits, Oct 3, 2024

Conversation

adriansuarez

For any container with a non-zero restart count, collect logs from the exited instance using `kubectl logs ... --previous`.

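A minimal sketch of the approach described above, not the actual contents of `deploy/k8s/logs.sh`. The function names, the output file layout, and the `pod name=count` intermediate format are assumptions made for illustration; only `kubectl get pods -o jsonpath=...` and `kubectl logs ... --previous` are standard kubectl usage.

```shell
#!/bin/sh
# Hypothetical helpers; names and file layout are illustrative assumptions.

# Print one line per pod: "<pod> <container>=<restartCount> ...".
list_restart_counts() {
    kubectl get pods -n "$1" -o jsonpath='{range .items[*]}{.metadata.name}{" "}{range .status.containerStatuses[*]}{.name}={.restartCount}{" "}{end}{"\n"}{end}'
}

# Read the lines above on stdin and print "<pod> <container>" for every
# container whose restart count is greater than zero.
restarted_containers() {
    awk '{
        for (i = 2; i <= NF; i++) {
            split($i, kv, "=")
            if (kv[2] + 0 > 0) print $1, kv[1]
        }
    }'
}

# Save the logs of the previous (exited) instance of each restarted container.
collect_previous_logs() {
    namespace=$1
    outdir=$2
    mkdir -p "$outdir"
    list_restart_counts "$namespace" | restarted_containers |
    while read -r pod container; do
        # --previous selects the terminated instance of the container.
        kubectl logs "$pod" -n "$namespace" -c "$container" --previous \
            > "$outdir/$pod.$container.previous.log" 2>&1 < /dev/null || true
    done
}
```

Calling `collect_previous_logs my-namespace ./logs` would then write one `*.previous.log` file per restarted container. Containers that never restarted are skipped, since `--previous` fails when no terminated instance exists.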

adriansuarez commented Oct 3, 2024

FYI @sivanov-nuodb, I created the branch asuarez/test-minikube-2.5.0 which includes this change and also reverts the Control Plane used in Minikube to 2.5.0. You should be able to run the pipeline on that branch to reproduce the webhook issue.

The root cause seems to be that the container exits unexpectedly and takes a long time to acquire the lease on restart. Not sure what is causing the crashes on 2.5.0 or whether it is still relevant in newer versions.

deploy/k8s/logs.sh (outdated, resolved)

@sivanov-nuodb sivanov-nuodb left a comment

Thanks for adding this!


sivanov-nuodb commented Oct 3, 2024

Thanks for creating the branch so that we can investigate further!

The webhook server is started immediately and does not wait for the leader assignment. I agree that the webhook error (connection refused) is due to the operator container being restarted.

The operator will fail if the VolumeSnapshot CRD is not installed in the cluster but the embedded backup plugin is enabled. In CP 2.6.0, the operator Helm chart will automatically disable the backup plugin (fixed by you in https://github.com/nuodb/nuodb-control-plane/commit/9759ef911e4163ee2d48a3cd376d1a3a86487107) which is why the issue is not exposed in CP 2.7.0.
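As a rough illustration of the failure condition described above, the following sketch checks whether the VolumeSnapshot CRD is present before the backup plugin is relied on. The function names and messages are hypothetical; the CRD name `volumesnapshots.snapshot.storage.k8s.io` is the standard one registered by the Kubernetes external snapshotter, and nothing here reflects how the operator or its Helm chart actually performs the check.

```shell
#!/bin/sh
# Hypothetical preflight check; function names and messages are assumptions.

# Succeeds (exit 0) only if the VolumeSnapshot CRD is installed in the cluster.
volumesnapshot_crd_installed() {
    kubectl get crd volumesnapshots.snapshot.storage.k8s.io >/dev/null 2>&1
}

# Report whether the backup plugin's CRD prerequisite is met.
check_backup_prerequisites() {
    if volumesnapshot_crd_installed; then
        echo "VolumeSnapshot CRD installed"
    else
        echo "VolumeSnapshot CRD missing"
        return 1
    fi
}
```

Running `check_backup_prerequisites` against a cluster without the CRD (for example, a Minikube cluster where the snapshot CRDs were never installed) would return non-zero, which matches the situation in which the 2.5.0 operator is reported to fail.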

@adriansuarez

Okay, so it is no longer an issue.

I will keep using 2.6.1 rather than explicitly disabling the backup manager, since it does give us more coverage of past product versions. Currently we have coverage of 2.5.0 via KWOK, and both the KWOK and Minikube variants exercise 2.7.0 (latest).

@adriansuarez adriansuarez merged commit 0809af3 into main Oct 3, 2024
8 of 9 checks passed
@adriansuarez adriansuarez deleted the asuarez/collect-previous branch October 3, 2024 12:27
2 participants