Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Online upgrade to account for terminating pods #633

Merged
merged 3 commits into from
Dec 14, 2023
Merged

Conversation

spilchen
Copy link
Collaborator

The controller was not handling terminating pods correctly. This could lead to problems during online upgrades, such as not properly restarting primaries before taking down secondaries.

The issue occurs when the operator deletes pods with the old image. If the pod is still running immediately after the deletion, the operator may skip the restart and move on to the next phase of the upgrade. This is a small window because the termination grace period is 0 seconds, but we still see it occasionally in the e2e runs.

To test this issue, I added a new annotation (vertica.com/termination-grace-period-seconds) that makes it easier to hit this timing scenario. This annotation causes the pod to be in the termination step for a longer period of time.

The solution was to check for pod termination in podfacts and treat the pod as not running if it is set.

The controller was not handling terminating pods correctly. This could
lead to problems during online upgrades, such as not properly restarting
primaries before taking down secondaries.

The issue occurs when the operator deletes pods with the old image. If
the pod is still running immediately after the deletion, the operator
may skip the restart and move on to the next phase of the upgrade. This
is a small window because the termination grace period is 0 seconds, but
we still see it occasionally in the e2e runs.

To test this issue, I added a new annotation
(vertica.com/termination-grace-period-seconds) that makes it easier to
hit this timing scenario. This annotation causes the pod to be in the
termination step for a longer period of time.

The solution was to check for pod termination in podfacts and treat the
pod as not running if it is set.
@spilchen spilchen self-assigned this Dec 13, 2023
Copy link
Collaborator

@roypaulin roypaulin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Comment on lines 72 to 73
// true means the pod has been bound to a node, not in the process of being
// process, and all of the containers have been created. At least one
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence does not read well.

@@ -140,6 +140,9 @@ const (
// annotation is not provided the default value "dbadmin" will be used.
SuperuserNameAnnotation = "vertica.com/superuser-name"
SuperuserNameDefaultValue = "dbadmin"

// Annotation to control the termination grace period for vertica pods.
TerminationGracePeriodSecondsAnnotaton = "vertica.com/termination-grace-period-seconds"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any use for this annotation other than reproducing that issue?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, only for test purposes. There are no plans to externalize this annotation.

@spilchen spilchen merged commit 5bb6765 into main Dec 14, 2023
27 checks passed
@spilchen spilchen deleted the spilchen/fix-upgrade branch December 14, 2023 15:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants