
Support zero-downtime upgrading for the Trident node plugins #740

Closed

tksm opened this issue Jul 7, 2022 · 5 comments

tksm commented Jul 7, 2022

Describe the solution you'd like

We would like the trident operator to upgrade the Trident node plugins without downtime.

The trident operator deletes the Trident DaemonSet when updating the Trident version. This causes downtime for mount and unmount operations until the new DaemonSet pods become ready.

This becomes a serious issue when one of the plugin pods cannot be deleted for some reason. Because the operator deletes the DaemonSet with the foreground option, it does not create a new DaemonSet until all plugin pods have been deleted. As a result, if even one Trident pod cannot be deleted, no node will be able to mount Trident volumes.

I understand the foreground deletion was introduced to fix issues like #444 and #487. Would it be possible to patch the DaemonSet instead of deleting it in order to recreate the pods? Patching the DaemonSet with a dummy annotation, the way kubectl rollout restart ds does, lets the DaemonSet controller perform a rolling update without downtime.
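For illustration, a minimal sketch of the kind of patch this would mean (the DaemonSet name trident-csi and the timestamp value are assumptions here; this is effectively what kubectl rollout restart applies). Because only the pod template changes, the DaemonSet controller replaces the pods node by node according to its update strategy instead of deleting everything first:

$ kubectl -n trident patch daemonset trident-csi --patch \
  '{"spec":{"template":{"metadata":{"annotations":{"kubectl.kubernetes.io/restartedAt":"2022-07-07T00:00:00Z"}}}}}'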


tksm commented Jul 11, 2022

This becomes a serious issue when one of the plugin pods cannot be deleted for some reason. Because the operator deletes the DaemonSet with the foreground option, it does not create a new DaemonSet until all plugin pods have been deleted.

For your reference, here are the steps to reproduce this issue.

  1. Deploy the trident operator v22.01.1 with the TridentOrchestrator object.
  2. Wait until all trident pods become ready.
  3. Set a dummy finalizer to any one Trident pod.
    • e.g. kubectl patch -n trident -p '{"metadata":{"finalizers": ["example.com/dummy"]}}' "$(kubectl get pods -n trident -l app=node.csi.trident.netapp.io -o name | head -1)"
    • This step simulates a situation where one of the plugin pods cannot be deleted.
  4. Update the trident operator and the TridentOrchestrator object to v22.04.0.
  5. There will be no trident-csi pods except the terminating one.
$ kubectl get pods -n trident -l app=node.csi.trident.netapp.io
NAME                READY   STATUS        RESTARTS   AGE
trident-csi-5tppm   0/2     Terminating   0          3m43s
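To recover after reproducing this, removing the dummy finalizer lets the stuck pod finish terminating so the upgrade can proceed. A minimal sketch, using the pod name from the output above (substitute whatever pod is stuck in your cluster):

$ kubectl patch -n trident pod trident-csi-5tppm --type=json \
  -p '[{"op":"remove","path":"/metadata/finalizers"}]'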


ysakashita commented Aug 10, 2022

@gnarl
In addition, I would like to share the impact this issue had on our customers.
We run Kubernetes upgrades and Trident upgrades. A Kubernetes upgrade causes volume detach/attach operations. When Trident is upgraded, the old Trident is deleted and then a new one is installed.

In this situation, unfortunately, a trident-csi (node plugin) pod remained in a terminating state during the removal of Trident. The current Trident upgrade (delete -> install) does not start setting up the new Trident until this terminating pod is deleted, so during that time there is no running trident-csi. As a result, when detach/attach operations are triggered by the Kubernetes node updates, application pods on every node stay in Pending status and cannot start. (In our case the trigger was a Kubernetes update, but the same situation can occur with pod or node failures during a Trident upgrade.) Therefore, all apps were down, and customer service was stopped (see the left side of the figure).

In our actual failure cases, it took many hours (4-5 h) from the time of failure to recovery, because after receiving a report from the customer, the root cause had to be investigated and an administrator had to restore trident-csi manually.

We would like the Trident upgrade to be enhanced from delete-install to a rolling update (see the right side of the figure). This would prevent all application pods from going down when a single trident-csi pod gets stuck in a terminating state.

As you know, a stuck terminating state can be caused by various factors, such as kubelet issues, and cannot be completely prevented. But even in that case, we want to avoid all trident-csi pods going down at once.

[Screenshot 2022-08-10 13:35:55: the current delete-and-reinstall upgrade (left) compared with the proposed rolling update (right)]


gnarl commented Aug 11, 2022

Hi @ysakashita,

Thank you for this explanation of the outage you've experienced. This helps to clarify the situation your customer experienced. Our team has examined the situation and we don't believe there is a better immediate workaround than monitoring the Trident DaemonSet Pod to determine if it is stuck in terminating state.
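For illustration only (not an official recommendation), such monitoring could be as simple as periodically listing node-plugin pods whose STATUS is Terminating, using the same label selector shown earlier in this issue; a pod that stays in this output for a long time is likely stuck and may need manual intervention:

$ kubectl get pods -n trident -l app=node.csi.trident.netapp.io --no-headers | grep Terminating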

The team understands the need to support rolling upgrades for the Trident DaemonSet based on your explanation. There are additional changes that need to be made to properly handle upgrading from N previous versions of Trident. This enhancement will need to be prioritized for a future Trident release.


uppuluri123 commented Sep 5, 2023

Trident 23.07.01 has been released with a fix for this issue. The fix is also present in Trident 23.10.

@uppuluri123

Closing the issue.
