
Support zero-downtime upgrading for the Trident node plugins #740

Closed

tksm opened this issue Jul 7, 2022 · 5 comments

tksm commented Jul 7, 2022

Describe the solution you'd like

We would like the trident operator to upgrade the Trident node plugins without downtime.

The trident operator deletes the Trident DaemonSet when updating the Trident version. This causes downtime for mount and unmount operations until the new DaemonSet pods become ready.

This becomes a serious issue when one of the plugin pods cannot be deleted for some reason. Because the operator deletes the DaemonSet with the foreground option, it does not create a new DaemonSet until all plugin pods have been deleted. As a result, if even one Trident pod cannot be deleted, no node will be able to mount Trident volumes.

I understand the foreground deletion was introduced to fix issues like #444 and #487. Would it be possible to patch the DaemonSet instead of deleting it in order to recreate the pods? Patching the DaemonSet with a dummy annotation, the way kubectl rollout restart ds does, lets the DaemonSet controller perform a rolling update without downtime.
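For illustration, a minimal sketch of the kind of patch this would mean (the DaemonSet name trident-csi and the timestamp value are assumptions here; this is effectively what kubectl rollout restart applies). Because only the pod template changes, the DaemonSet controller replaces the pods node by node according to its update strategy instead of deleting everything first:

$ kubectl -n trident patch daemonset trident-csi --patch \
  '{"spec":{"template":{"metadata":{"annotations":{"kubectl.kubernetes.io/restartedAt":"2022-07-07T00:00:00Z"}}}}}'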


tksm commented Jul 11, 2022

This becomes a serious issue when one of the plugin pods cannot be deleted for some reason. Because the operator deletes the DaemonSet with the foreground option, it does not create a new DaemonSet until all plugin pods have been deleted.

For your reference, here are the steps to reproduce this issue.

  1. Deploy the trident operator v22.01.1 with the TridentOrchestrator object.
  2. Wait until all trident pods become ready.
  3. Set a dummy finalizer to any one Trident pod.
    • e.g. kubectl patch -n trident -p '{"metadata":{"finalizers": ["example.com/dummy"]}}' "$(kubectl get pods -n trident -l app=node.csi.trident.netapp.io -o name | head -1)"
    • This step simulates a situation where one of the plugin pods cannot be deleted.
  4. Update the trident operator and the TridentOrchestrator object to v22.04.0.
  5. There will be no trident-csi pods except the terminating one.
$ kubectl get pods -n trident -l app=node.csi.trident.netapp.io
NAME                READY   STATUS        RESTARTS   AGE
trident-csi-5tppm   0/2     Terminating   0          3m43s
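To recover after reproducing this, removing the dummy finalizer lets the stuck pod finish terminating so the upgrade can proceed. A minimal sketch, using the pod name from the output above (substitute whatever pod is stuck in your cluster):

$ kubectl patch -n trident pod trident-csi-5tppm --type=json \
  -p '[{"op":"remove","path":"/metadata/finalizers"}]'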


ysakashita commented Aug 10, 2022

@gnarl
In addition, I would like to share the impact this issue had on our customers.
We run Kubernetes upgrades and Trident upgrades. A Kubernetes upgrade causes volume detach/attach operations. When Trident is upgraded, the old Trident is deleted and then a new one is installed.

In this situation, unfortunately, a trident-csi (node plugin) pod remained in a terminating state during the removal of Trident. The current Trident upgrade (delete -> install) does not start setting up the new Trident until this terminating pod is deleted, so during that time there is no running trident-csi. As a result, when detach/attach operations are triggered by the Kubernetes node updates, application pods on every node stay in Pending status and cannot start. (In our case the trigger was a Kubernetes update, but the same situation can occur with pod or node failures during a Trident upgrade.) Therefore, all apps were down, and customer service was stopped (see the left side of the figure).

In our actual failure cases, it took many hours (4-5 h) from the time of failure to recovery, because after receiving a report from the customer, the root cause had to be investigated and an administrator had to restore trident-csi manually.

We would like the Trident upgrade to be enhanced from delete-install to a rolling update (see the right side of the figure). This would prevent all application pods from going down when a single trident-csi pod gets stuck in a terminating state.

As you know, a stuck terminating state can be caused by various factors, such as kubelet issues, and cannot be completely prevented. But even in that case, we want to avoid all trident-csi pods going down at once.

[Screenshot 2022-08-10 13:35:55: the current delete-and-reinstall upgrade (left) compared with the proposed rolling update (right)]


gnarl commented Aug 11, 2022

Hi @ysakashita,

Thank you for this explanation of the outage you've experienced. This helps to clarify the situation your customer experienced. Our team has examined the situation and we don't believe there is a better immediate workaround than monitoring the Trident DaemonSet Pod to determine if it is stuck in terminating state.
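For illustration only (not an official recommendation), such monitoring could be as simple as periodically listing node-plugin pods whose STATUS is Terminating, using the same label selector shown earlier in this issue; a pod that stays in this output for a long time is likely stuck and may need manual intervention:

$ kubectl get pods -n trident -l app=node.csi.trident.netapp.io --no-headers | grep Terminating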

The team understands the need to support rolling upgrades for the Trident DaemonSet based on your explanation. There are additional changes that need to be made to properly handle upgrading from N previous versions of Trident. This enhancement will need to be prioritized for a future Trident release.


uppuluri123 commented Sep 5, 2023

Trident 23.07.01 has been released with a fix for this issue. The fix is also present in Trident 23.10.

@uppuluri123

Closing the issue.
