-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Online upgrade to account for terminating pods #633
Conversation
The controller was not handling terminating pods correctly. This could lead to problems during online upgrades, such as not properly restarting primaries before taking down secondaries. The issue occurs when the operator deletes pods with the old image. If the pod is still running immediately after the deletion, the operator may skip the restart and move on to the next phase of the upgrade. This is a small window because the termination grace period is 0 seconds, but we still see it occasionally in the e2e runs. To test this issue, I added a new annotation (vertica.com/termination-grace-period-seconds) that makes it easier to hit this timing scenario. This annotation causes the pod to be in the termination step for a longer period of time. The solution was to check for pod termination in podfacts and treat the pod as not running if it is set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
pkg/controllers/vdb/podfacts.go
Outdated
// true means the pod has been bound to a node, not in the process of being | ||
// process, and all of the containers have been created. At least one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sentence does not read well.
@@ -140,6 +140,9 @@ const ( | |||
// annotation is not provided the default value "dbadmin" will be used. | |||
SuperuserNameAnnotation = "vertica.com/superuser-name" | |||
SuperuserNameDefaultValue = "dbadmin" | |||
|
|||
// Annotation to control the termination grace period for vertica pods. | |||
TerminationGracePeriodSecondsAnnotaton = "vertica.com/termination-grace-period-seconds" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any use for this annotation other than reproducing that issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, only for test purposes. There are no plans to externalize this annotation.
The controller was not handling terminating pods correctly. This could lead to problems during online upgrades, such as not properly restarting primaries before taking down secondaries.
The issue occurs when the operator deletes pods with the old image. If the pod is still running immediately after the deletion, the operator may skip the restart and move on to the next phase of the upgrade. This is a small window because the termination grace period is 0 seconds, but we still see it occasionally in the e2e runs.
To test this issue, I added a new annotation (vertica.com/termination-grace-period-seconds) that makes it easier to hit this timing scenario. This annotation causes the pod to be in the termination step for a longer period of time.
The solution was to check for pod termination in podfacts and treat the pod as not running if it is set.