Support for online upgrade #150

Merged
spilchen merged 27 commits into main from online-upgrade on Feb 2, 2022

Conversation

@spilchen (Collaborator) commented Feb 2, 2022

This adds support for online upgrade of the Vertica server and allows multiple subclusters to share the same Service object.

To initiate an upgrade of a Vertica cluster in Kubernetes, you simply change the name of the container image in the CR. Prior to this change, we had to drive an offline upgrade because Vertica didn't support running mixed versions at the same time. Starting in Vertica 11.1.0 (GA in Feb 2022), a mix of versions is supported. Vertica does this by forcing nodes running the older version into read-only mode – only queries can be run on those nodes, no DML or DDL. We are taking advantage of this feature to implement an online version of an upgrade.
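
For illustration only, here is a minimal sketch of how an upgrade is initiated, assuming the container image is set through a spec.image field; the tag shown is a placeholder, not a prescription for your environment:

apiVersion: vertica.com/v1beta1
kind: VerticaDB
metadata:
  name: sample
spec:
  # Changing this image reference is what triggers the upgrade logic.
  # The tag is a placeholder; use the version you are upgrading to.
  image: "vertica/vertica-k8s:new-version"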

New externals were added to the CRD to support this mode; a sketch of these fields follows the list:

  • spec.upgradePolicy. Defines how upgrades will be managed. Available values are: Offline, Online and Auto.
    • Offline: means we take down the entire cluster then bring it back up with the new image. This is the method of upgrade that the operator supported before.
    • Online: will keep the cluster up when the upgrade occurs. The data will go into read-only mode until the Vertica nodes from the primary subcluster reform the cluster with the new image.
    • Auto: will pick between Offline or Online. Online is only chosen if a license Secret exists, the k-Safety of the database is 1 and we are running with a Vertica version that supports read-only subclusters.
  • spec.temporarySubclusterRouting.names: When doing an online upgrade, we designate a subcluster to accept traffic while the other subclusters restart. This option is used when you want to reuse an existing subcluster to accept traffic. A list of subcluster names is given, and the operator will pick the first subcluster in the list that is online. Typically you would want the first subcluster to be a secondary and the second to be some other subcluster that can accept traffic while the first one is restarting.
  • spec.temporarySubclusterRouting.template: This specifies that a new subcluster be created to accept traffic while a subcluster is down. The subcluster will be created at the start of an online upgrade and be removed at the end.
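
To make the shape of these fields concrete, here is a hedged sketch of a spec fragment. The subcluster names are illustrative, and the template's fields (name, size, isPrimary) follow the webhook rules described later in this PR. You would normally use either names or template, not both:

spec:
  upgradePolicy: Online
  temporarySubclusterRouting:
    # Option 1: reuse existing subclusters. The operator picks the first
    # subcluster in this list that is online.
    names:
      - sc2
      - sc1
    # Option 2 (in place of names): a template for a transient subcluster that
    # is created at the start of the online upgrade and removed at the end.
    # Per the webhook rules, it must be a secondary with size > 0.
    # template:
    #   name: transient
    #   size: 1
    #   isPrimary: false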

New status fields are provided that allow the upgrade to be monitored. A summary of the new fields is as follows (an illustrative status snippet appears after the list):

  • OnlineUpgradeInProgress: a new status condition that is activated when doing an online upgrade. It is toggled in tandem with the existing ImageChangeInProgress status condition. Its main purpose is to indicate that the upgrade is being done online.
  • OfflineUpgradeInProgress: similar to OnlineUpgradeInProgress except that it is set when doing an offline upgrade.
  • UpgradeStatus: a human-readable message that is updated at various points during an upgrade. It is meant to indicate which phase the upgrade is currently working on. This is maintained for both online and offline upgrades.
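
For illustration only — the exact YAML layout of the status section is an assumption, and the message text is made up — a VerticaDB in the middle of an online upgrade might report something like:

status:
  conditions:
    - type: ImageChangeInProgress
      status: "True"
    - type: OnlineUpgradeInProgress
      status: "True"
  upgradeStatus: "Restarting primary subclusters with new image"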

The Vertica server needs to be on version 11.1 for this to work. This applies to both the version we are currently running and the new version we are upgrading to. This means online upgrade won't be usable until after the next release; for instance, you need to already be on 11.1.0 and be upgrading to 11.1.1.


Prior to this PR, we had a one-to-one mapping of subclusters to Service objects. This meant that a single Service object could only direct traffic to a single subcluster. As part of the work for online upgrade, we needed the ability for one Service object to direct traffic to multiple subclusters. We are calling this out in a separate section because it has uses that are not specific to online upgrade.

When defining a subcluster in the CR, a new field was added called serviceName. When the operator reconciles the subclusters in the CR, it creates a Service object using the name specified in serviceName. If you want multiple subclusters to share the same Service object, use the same name for both. The default behaviour is for each subcluster to have its own Service object; we do this by using the name of the subcluster as the serviceName.

Here is a sample CR, where multiple subclusters share the same service object:

apiVersion: vertica.com/v1beta1
kind: VerticaDB
metadata:
  name: sample
spec:
  communal:
    path: "s3://nimbusdb/db"
    endpoint: "http://minio"
    credentialSecret: s3-auth
  subclusters:
    - name: sc1
      size: 1
      serviceName: connections
    - name: sc2
      size: 1
      serviceName: connections

Once created, a single Service object named sample-connections will exist that routes connections between the two subclusters.
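
For reference, a rough sketch of the generated Service. The selector and labels are managed by the operator and omitted here, the port name is illustrative, and 5433 is Vertica's standard client port:

apiVersion: v1
kind: Service
metadata:
  name: sample-connections
spec:
  ports:
    - name: vertica
      port: 5433
  # The selector is filled in by the operator so that traffic can reach pods
  # from both sc1 and sc2.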

spilchen added 26 commits December 9, 2021 08:16
This is prep for online upgrade. This adds a new status message
that we will use to report the phase that the upgrade is in.
This also has a fix that will retry status changes for any
transient error. You may have seen the error I'm talking about
-- it usually complains about an update failing because we are
not using the latest copy of the object.
This pulls out some common functions to share with the new online image change
reconciler. A new Go struct, ImageChangeInitiator, was created that handles the
common logic.

This also adds new parameters to the CR to allow for control of what type of
upgrade you want. No reconciler has been added yet for online image change.
This sketches out the flow for the online image change. Online image change is
still not functional, but it lays out the structure of the code that I will
fill in on subsequent PRs.

This also does more refactoring. There were additional changes I wanted to
reuse in offlineimagechange_reconcile.go. Those were moved to
ImageChangeInitiator, and that struct was renamed to ImageChangeManager.
The online-upgrade process needs more information about the subcluster. This
introduces a new data structure SubclusterHandle that has the Subcluster struct
that is stored in etcd plus additional runtime info that is needed for the
online-upgrade process.

There are two parts to this change. The first is fetching the additional
information through sc_finder. The second is flowing the new SubclusterHandle
through various functions.

In a subsequent PR, I will start to fill out some of the functions in
onlineupgrade_reconcile.go using the data collected in SubclusterHandle.
This is another PR for online-upgrade. It will handle creation and removal of
the standby subcluster during the online-image change process.

- new state was added to the vapi.Subcluster for this. Originally, I was
  planning to keep most of this in the SubclusterHandle struct, but we already
  pass around vapi.Subcluster so it made it easier to have it there
- new status conditions for offline and online image change. These are intended
  to be used by the operator to know what image change to continue with once an
  image change has started
- filled out more of the logic in onlineimagechange_reconciler.go. It will
  scale-out a new standby subcluster for each primary, then scale them down
  when we are finishing the image change.
- moved more logic into imagechange.go that is common between online and
  offline image change
- restart logic was changed to allow the option to restart read-only nodes. When
  restarting for online, we will skip the read-only nodes. Offline restarts
  everything.
…128)

This adds manipulation of the service objects during an
online image change. It will route client traffic to the
standby when we are upgrading, then reroute the traffic back
to the original subclusters when completing the upgrade.

As a side effect, this also adds the ability for multiple
subclusters to share the same service object. This has
benefits outside of the online image change process.
Use the term transient instead of standby.  This also removes the
SubclusterHandle struct.
This is the next set of changes for online image change. It will
route to a temporary subcluster, called a transient, so that client
connections connect to an up node. It will automatically route back
to the original subcluster once that subcluster is back up. It will
also process the secondary subclusters so that they are brought back up.

This includes some rework in the onlineimagechange_reconciler to cut
down on the amount of code duplication.

Added the ability to specify a template for the transient subcluster.
This gets created when the image change starts and gets cleaned up
when the image change is done.
…#133)

This is the next set of changes for online image change. The CR parm
TransientSubclusterTemplate was renamed to TemporarySubclusterRouting. The parm
can be used to provide a template of a subcluster to use for temporary routing
while subclusters are down, or it can be used to specify an existing
subcluster. This latter option may be useful to those that just want to reuse
existing subclusters.

This PR also cleans up how we route traffic to the subclusters. We previously
had relied on a 'transient' label. But now we route to service name or
subcluster name.
This fills in the Status.ImageChangeStatus message as
we progress through the online image change. It also
reworks the status message updates we do for offline
image change to share the same infrastructure.

I'm also including a change to keep the ssh key in
the Vertica container stable between builds. The ssh
key is just used for communication between the
Vertica nodes. A stable key allows Vertica nodes from
a different container image to be able to
communicate. This becomes an issue when doing an
online image change because the Vertica pods are
either running the old or new image, yet they all
have to talk to each other.
This adds e2e tests for online upgrade. I added a new directory (e2e-11.1)
since we cannot run these yet in our GitHub CI. This directory will contain all
of the e2e tests that must run on a Vertica server 11.1 or higher. We can fold
these into the main e2e after 11.1 is GA'd.

Also, we use another vertica image in e2e tests called BASE_VERTICA_IMG. The
online upgrade tests will change the image from BASE_VERTICA_IMG to
VERTICA_IMG.

During the testing, a few issues were found that I have fixes for:

- ObjReconciler, DBAddNodeReconciler and DBAddSubclusterReconciler will work
  only on the transient subcluster. We previously added the transient to the
  VerticaDB, then ran these reconcilers as-is. But they could pick up other
  changes that interfere with the upgrade -- namely scaling out changes. So I
  no longer update the VerticaDB with the transient, and instead run the
  reconcilers just with the transient subcluster.
- avoid creating the transient if the cluster is down. We need the cluster to
  be up to create the transient, so we skip that entire part if the cluster is
  down
- Restart reconciler will avoid restarting pods for the transient subcluster.
  This subcluster intentionally stays on the old image and there is no way to
  restart an old image if the primaries are already updated to the new image.
This adds rules to the webhook for online image change:

- prevent changes to upgradePolicy when imageChange is in progress
- transient subcluster template: isPrimary == false, name cannot be an existing
  subcluster, size > 0 if name present
- transient subcluster cannot be added/removed during an online image change
- if multiple subclusters share a ServiceName, service specific things must be
  common between them (serviceType, NodePort, externalIPs, etc).
- when running AT start_db, we need to run from one of the primary nodes.  It
  won't work if we try from a read-only node that isn't being restarted.
- when calling re_ip, use the --force option. This option is new in 11.1.0, so
  we needed conditional logic to know when we could use this.
- when calling start_db, we use the host list. This option first appeared in
  11.0.1, so like re_ip, we needed conditional logic to know when we can use
  that option

One thing that isn't related to the title of this PR is some new logic needed
in DBAddNodeReconciler. That reconciler will now requeue if some pods aren't
yet ready. This was needed so that the upgrade properly waits for the transient
subcluster to scale out. Prior to this change, it was possible that the image
change went ahead and restarted the primaries before the transient was up. This
should be solved now.
This adds drain logic so that we wait for active connections to disappear
before taking down a subcluster. I added finer-grained messaging for
imageChangeStatus so that we have a clear idea of whether it is waiting for the
drain of a particular subcluster.

This change involves sorting the output from sc_finder. This was necessary to
match up the status message with the order in which we will process the
subclusters.

Also included is a fix that waits for the transient pod to be in a ready state.
There was a small timing window where we started to route client traffic to the
transient before it was ready. The readiness probe runs every 10 seconds, so
there was a window where Vertica was up but k8s didn't yet know about it.

A new e2e test was added to make sure draining works for the primary and
secondary subclusters.
This adds upgrade logic for VerticaDBs created by older versions of the
operator. We changed the selector label for pods in the sts. The selector label
is immutable, so in order to upgrade to the 1.3.0 release, the sts and their
pods need to be destroyed. This is handled automatically by a new reconcile
actor.

This means that when upgrading to the 1.3.0 release, any running Vertica
instance will be stopped and then restarted, since deleting the sts causes the
pods to go away.

A new GitHub workflow was added so that we can test operator pod upgrades
going forward.
This adds checking in the operator to ensure the proper upgrade path is chosen.
It will catch attempts to skip released versions and prevent downgrades. A
backdoor was added to the CR for those that don't want this behaviour. You can
simply set .spec.ignoreUpgradePath to be true.
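
For example, a minimal sketch of opting out of the upgrade-path check using the field mentioned above:

spec:
  # Skips the operator's upgrade path validation, allowing version skips or
  # downgrades. Use with care.
  ignoreUpgradePath: true
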
This changes the default behaviour for temporarySubclusterRouting. It now
defaults to picking existing subclusters rather than creating a transient
subcluster.
@spilchen spilchen self-assigned this Feb 2, 2022
@spilchen spilchen merged commit 82396cc into main Feb 2, 2022
@spilchen spilchen deleted the online-upgrade branch February 2, 2022 22:32