Support for online upgrade #150

Merged
spilchen merged 27 commits into main from online-upgrade on Feb 2, 2022

Conversation

@spilchen (Collaborator) commented Feb 2, 2022

This adds support for online upgrade of the Vertica server and allows multiple subclusters to share the same Service object.

To initiate an upgrade of a Vertica cluster in Kubernetes, you simply change the name of the container image in the CR. Prior to this change, we had to drive an offline upgrade because Vertica didn't support running mixed versions at the same time. Starting in Vertica 11.1.0 (GA in Feb 2022), a mix of versions is supported. Vertica does this by forcing nodes running the older version into read-only mode – only queries can be run on those nodes, no DML or DDL. We are taking advantage of this feature to implement an online version of an upgrade.
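
For illustration only, here is a minimal sketch of how an upgrade is initiated, assuming the container image is set through a spec.image field; the tag shown is a placeholder, not a prescription for your environment:

apiVersion: vertica.com/v1beta1
kind: VerticaDB
metadata:
  name: sample
spec:
  # Changing this image reference is what triggers the upgrade logic.
  # The tag is a placeholder; use the version you are upgrading to.
  image: "vertica/vertica-k8s:new-version"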

New externals were added to the CRD to support this mode; a sketch of these fields follows the list:

  • spec.upgradePolicy. Defines how upgrades will be managed. Available values are: Offline, Online and Auto.
    • Offline: means we take down the entire cluster then bring it back up with the new image. This is the method of upgrade that the operator supported before.
    • Online: will keep the cluster up when the upgrade occurs. The data will go into read-only mode until the Vertica nodes from the primary subcluster reform the cluster with the new image.
    • Auto: will pick between Offline or Online. Online is only chosen if a license Secret exists, the k-Safety of the database is 1 and we are running with a Vertica version that supports read-only subclusters.
  • spec.temporarySubclusterRouting.names: When doing an online upgrade, we designate a subcluster to accept traffic while the other subclusters restart. This option is used when you want to reuse an existing subcluster to accept traffic. A list of subcluster names is given, and the operator will pick the first subcluster in the list that is online. Typically you would want the first subcluster to be a secondary and the second to be some other subcluster that can accept traffic while the first one is restarting.
  • spec.temporarySubclusterRouting.template: This specifies that a new subcluster be created to accept traffic while a subcluster is down. The subcluster will be created at the start of an online upgrade and be removed at the end.
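
To make the shape of these fields concrete, here is a hedged sketch of a spec fragment. The subcluster names are illustrative, and the template's fields (name, size, isPrimary) follow the webhook rules described later in this PR. You would normally use either names or template, not both:

spec:
  upgradePolicy: Online
  temporarySubclusterRouting:
    # Option 1: reuse existing subclusters. The operator picks the first
    # subcluster in this list that is online.
    names:
      - sc2
      - sc1
    # Option 2 (in place of names): a template for a transient subcluster that
    # is created at the start of the online upgrade and removed at the end.
    # Per the webhook rules, it must be a secondary with size > 0.
    # template:
    #   name: transient
    #   size: 1
    #   isPrimary: false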

New status fields are provided that allow the upgrade to be monitored. A summary of the new fields is as follows (an illustrative status snippet appears after the list):

  • OnlineUpgradeInProgress: a new status condition that is activated when doing an online upgrade. It is toggled in tandem with the existing ImageChangeInProgress status condition. Its main purpose is to indicate that the upgrade is being done online.
  • OfflineUpgradeInProgress: similar to OnlineUpgradeInProgress except that it is set when doing an offline upgrade.
  • UpgradeStatus: a human-readable message that is updated at various points during an upgrade. It is meant to indicate which phase the upgrade is currently working on. This is maintained for both online and offline upgrades.
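
For illustration only — the exact YAML layout of the status section is an assumption, and the message text is made up — a VerticaDB in the middle of an online upgrade might report something like:

status:
  conditions:
    - type: ImageChangeInProgress
      status: "True"
    - type: OnlineUpgradeInProgress
      status: "True"
  upgradeStatus: "Restarting primary subclusters with new image"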

The Vertica server needs to be on version 11.1 for this to work. This applies to both the version we are currently running and the new version we are upgrading to. This means online upgrade won't be usable until after the next release; for instance, you need to already be on 11.1.0 and be upgrading to 11.1.1.


Prior to this PR, we had a one-to-one mapping of subclusters to Service objects. This meant that a single Service object could only direct traffic to a single subcluster. As part of the work for online upgrade, we needed the ability for one Service object to direct traffic to multiple subclusters. We are calling this out in a separate section because it has uses that are not specific to online upgrade.

When defining a subcluster in the CR, a new field was added called serviceName. When the operator reconciles the subclusters in the CR, it creates a Service object using the name specified in serviceName. If you want multiple subclusters to share the same Service object, use the same name for both. The default behaviour is for each subcluster to have its own Service object; we do this by using the name of the subcluster as the serviceName.

Here is a sample CR, where multiple subclusters share the same service object:

apiVersion: vertica.com/v1beta1
kind: VerticaDB
metadata:
  name: sample
spec:
  communal:
    path: "s3://nimbusdb/db"
    endpoint: "http://minio"
    credentialSecret: s3-auth
  subclusters:
    - name: sc1
      size: 1
      serviceName: connections
    - name: sc2
      size: 1
      serviceName: connections

Once created, a single Service object named sample-connections will exist that routes connections between the two subclusters.
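
For reference, a rough sketch of the generated Service. The selector and labels are managed by the operator and omitted here, the port name is illustrative, and 5433 is Vertica's standard client port:

apiVersion: v1
kind: Service
metadata:
  name: sample-connections
spec:
  ports:
    - name: vertica
      port: 5433
  # The selector is filled in by the operator so that traffic can reach pods
  # from both sc1 and sc2.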

spilchen added 26 commits December 9, 2021 08:16
This is prep for online upgrade. This adds a new status message
that we will use to report the phase that the upgrade is in.
This also has a fix that will retry status changes for any
transient error. You may have seen the error I'm talking about
-- it usually complains about an update failing because we are
not using the latest copy of the object.
This pulls out some common functions to share with the new online image change
reconciler. A new Go struct, ImageChangeInitiator, was created that handles the
common logic.

This also adds new parameters to the CR to allow for control of what type of
upgrade you want. No reconciler has been added yet for online image change.
This sketches out the flow for the online image change. Online image change is
still not functional, but it lays out the structure of the code that I will
fill in on subsequent PRs.

This also does more refactoring. There were additional changes I wanted to
reuse in offlineimagechange_reconcile.go. Those were moved to
ImageChangeInitiator, and that struct was renamed to ImageChangeManager.
The online-upgrade process needs more information about the subcluster. This
introduces a new data structure SubclusterHandle that has the Subcluster struct
that is stored in etcd plus additional runtime info that is needed for the
online-upgrade process.

There are two parts to this change. The first is fetching the additional
information through sc_finder. The second is flowing the new SubclusterHandle
through various functions.

In a subsequent PR, I will start to fill out some of the functions in
onlineupgrade_reconcile.go using the data collected in SubclusterHandle.
This is another PR for online-upgrade. It will handle creation and removal of
the standby subcluster during the online-image change process.

- new state was added to the vapi.Subcluster for this. Originally, I was
  planning to keep most of this in the SubclusterHandle struct, but we already
  pass around vapi.Subcluster so it made it easier to have it there
- new status conditions for offline and online image change. These are intended
  to be used by the operator to know what image change to continue with once an
  image change has started
- filled out more of the logic in onlineimagechange_reconciler.go. It will
  scale-out a new standby subcluster for each primary, then scale them down
  when we are finishing the image change.
- moved more logic into imagechange.go that is common between online and
  offline image change
- restart logic was changed to allow the option to restart read-only nodes. When
  restarting for online, we will skip the read-only nodes. Offline restarts
  everything.
…128)

This adds manipulation of the service objects during an
online image change. It will route client traffic to the
standby when we are upgrading, then reroute the traffic back
to the original subclusters when completing the upgrade.

As a side effect, this also adds the ability for multiple
subclusters to share the same service object. This has
benefits outside of the online image change process.
Use the term transient instead of standby.  This also removes the
SubclusterHandle struct.
This is the next set of changes for online image change. It will
route to a temporary subcluster, called a transient, so that client
connections connect to an up node. It will automatically route back
to the original subcluster once that subcluster is back up. It will
also process the secondary subclusters so that they are brought back up.

This includes some rework in the onlineimagechange_reconciler to cut
down on the amount of code duplication.

Added the ability to specify a template for the transient subcluster.
This gets created when the image change starts and gets cleaned up
when the image change is done.
…#133)

This is the next set of changes for online image change. The CR parm
TransientSubclusterTemplate was renamed to TemporarySubclusterRouting. The parm
can be used to provide a template of a subcluster to use for temporary routing
while subclusters are down, or it can be used to specify an existing
subcluster. This latter option may be useful to those that just want to reuse
existing subclusters.

This PR also cleans up how we route traffic to the subclusters. We previously
had relied on a 'transient' label. But now we route to service name or
subcluster name.
This fills in the Status.ImageChangeStatus message as
we progress through the online image change. It also
reworks the status message updates we do for offline
image change to share the same infrastructure.

I'm also including a change to keep the ssh key in
the Vertica container stable between builds. The ssh
key is just used for communication between the
Vertica nodes. A stable key allows Vertica nodes from
a different container image to be able to
communicate. This becomes an issue when doing an
online image change because the Vertica pods are
either running the old or new image, yet they all
have to talk to each other.
This adds e2e tests for online upgrade. I added a new directory (e2e-11.1)
since we cannot run these yet in our GitHub CI. This directory will contain all
of the e2e tests that must run on a Vertica server 11.1 or higher. We can fold
these into the main e2e after 11.1 is GA'd.

Also, we use another vertica image in e2e tests called BASE_VERTICA_IMG. The
online upgrade tests will change the image from BASE_VERTICA_IMG to
VERTICA_IMG.

During the testing, a few issues were found that I have fixes for:

- ObjReconciler, DBAddNodeReconciler and DBAddSubclusterReconciler will work
  only on the transient subcluster. We previously added the transient to the
  VerticaDB, then ran these reconcilers as-is. But they could pick up other
  changes that interfere with the upgrade -- namely scaling out changes. So I
  no longer update the VerticaDB with the transient, and instead run the
  reconcilers just with the transient subcluster.
- avoid creating the transient if the cluster is down. We need the cluster to
  be up to create the transient, so we skip that entire part if the cluster is
  down
- Restart reconciler will avoid restarting pods for the transient subcluster.
  This subcluster intentionally stays on the old image and there is no way to
  restart an old image if the primaries are already updated to the new image.
This adds rules to the webhook for online image change:

- prevent changes to upgradePolicy when imageChange is in progress
- transient subcluster template: isPrimary == false, name cannot be an existing
  subcluster, size > 0 if name present
- transient subcluster cannot be added/removed during an online image change
- if multiple subclusters share a ServiceName, service specific things must be
  common between them (serviceType, NodePort, externalIPs, etc).
- when running AT start_db, we need to run from one of the primary nodes.  It
  won't work if we try from a read-only node that isn't being restarted.
- when calling re_ip, use the --force option. This option is new in 11.1.0, so
  we needed conditional logic to know when we could use this.
- when calling start_db, we use the host list. This option first appeared in
  11.0.1, so like re_ip, we needed conditional logic to know when we can use
  that option

One thing that isn't related to the title of this PR is some new logic needed
in DBAddNodeReconciler. That reconciler will now requeue if some pods aren't
yet ready. This was needed so that the upgrade properly waits for the transient
subcluster to scale out. Prior to this change, it was possible that the image
change went ahead and restarted the primaries before the transient was up. This
should be solved now.
This adds drain logic so that we wait for active connections to disappear
before taking down a subcluster. I added finer-grained messaging for
imageChangeStatus so that we have a clear idea of whether it is waiting for the
drain of a particular subcluster.

This change involves sorting the output from sc_finder. This was necessary to
match up the status message with the order in which we will process the
subclusters.

Also included is a fix that waits for the transient pod to be in a ready state.
There was a small timing window where we started to route client traffic to the
transient before it was ready. The readiness probe runs every 10 seconds, so
there was a window where Vertica was up but k8s didn't yet know about it.

A new e2e test was added to make sure draining works for the primary and
secondary subclusters.
This adds upgrade logic for VerticaDBs created by older versions of the
operator. We changed the selector label for pods in the sts. The selector label
is immutable, so in order to upgrade to the 1.3.0 release, the sts and their
pods need to be destroyed. This is handled automatically by a new reconcile
actor.

This means that when upgrading to the 1.3.0 release, any running Vertica
instance will be stopped and then restarted, since deleting the sts causes the
pods to go away.

A new GitHub workflow was added so that we can test operator pod upgrades
going forward.
This adds checking in the operator to ensure the proper upgrade path is chosen.
It will catch attempts to skip released versions and prevent downgrades. A
backdoor was added to the CR for those that don't want this behaviour. You can
simply set .spec.ignoreUpgradePath to be true.
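
For example, a minimal sketch of opting out of the upgrade-path check using the field mentioned above:

spec:
  # Skips the operator's upgrade path validation, allowing version skips or
  # downgrades. Use with care.
  ignoreUpgradePath: true
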
This changes the default behaviour for temporarySubclusterRouting. It now
defaults to picking existing subclusters rather than creating a transient
subcluster.
@spilchen spilchen self-assigned this Feb 2, 2022
@spilchen spilchen merged commit 82396cc into main Feb 2, 2022
@spilchen spilchen deleted the online-upgrade branch February 2, 2022 22:32