Controller delete the release during the upgrade process without warning #110
```shell
++ helm_v3 ls --all -f '^elastic-operator$' --namespace elastic-system --output json
+ LINE=1.7.0,failed
```

The release was in a failed state. Failed releases cannot be upgraded, so the helm controller handles this by uninstalling the broken release and installing it again to reach a known-good state. For information on why the release was in a failed state, you would need to look at the logs from the failed job pod.
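The `LINE=1.7.0,failed` value in the trace above comes from parsing the JSON that `helm ls` emits. A minimal sketch of that parsing step (sample data, not live output; the controller's actual install script may extract the fields differently):

```shell
# Simulated `helm_v3 ls --all --output json` result for a release
# that is stuck in the "failed" state (hard-coded sample, not a live query)
json='[{"name":"elastic-operator","namespace":"elastic-system","app_version":"1.7.0","status":"failed"}]'

# Build "app_version,status", matching the LINE value shown in the trace
line=$(printf '%s' "$json" | sed -E 's/.*"app_version":"([^"]+)".*"status":"([^"]+)".*/\1,\2/')
printf '%s\n' "$line"   # prints: 1.7.0,failed
```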
How can I check the logs of the failed pod if the controller deletes it so quickly? This behavior is fine for stateless workloads, but I don't recommend using the controller for stateful charts like rook-ceph, where reinstalling causes data loss.
Yes, you'd need to catch it pretty quickly. If this is something you run into with some regularity, it is probably worth trying to reproduce the error in a test environment so that you can figure out what's causing the chart upgrade or install to fail initially.
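One way to catch the logs before the job pod disappears is to stream them with a label selector rather than a pod name (a sketch against a live cluster; the `helm-install-<chart>` job name and the `kube-system` namespace are assumptions based on how k3s/rke2 typically name these jobs):

```shell
# Stream logs from the helm-install job pod as soon as it starts;
# "elastic-operator" here is a placeholder for your chart name.
kubectl logs -n kube-system -l job-name=helm-install-elastic-operator \
  --follow --tail=-1

# Alternatively, watch for the pod so you can grab logs before deletion:
kubectl get pods -n kube-system -w | grep helm-install
```

Both commands require a running cluster, so treat them as an operational recipe rather than something to run blind.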
We could perhaps consider an enhancement to the HelmChart CR spec to support enabling different behavior if the release is in a failed state - for example: rolling back, retrying the upgrade as-is, or simply leaving it failed for an administrator to address manually? In the past we've attempted to avoid adding too much complexity to the helm controller since it's sort of "just enough" to get the basic cluster infra installed via charts; it's definitely not intended to be a complete replacement for more complicated deployment tools.
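Purely as an illustration of the proposal above, such a knob might look like this on the HelmChart CR (the `failurePolicy` field name and its values are hypothetical here, not an existing part of the spec being discussed; the chart name is a placeholder):

```yaml
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: elastic-operator
  namespace: kube-system
spec:
  chart: eck-operator            # placeholder chart name
  targetNamespace: elastic-system
  # Hypothetical knob: what to do when the release is in a failed state,
  # e.g. reinstall (current behavior), retry, rollback, or abort.
  failurePolicy: abort
```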
Sounds like a good idea to me. I suggest making the safe option, not deleting the release, the default. I had some bad experiences with this controller and data loss after a few weeks of uptime, luckily in a test environment.
Given the way we use the Helm controller, deploying core cluster infrastructure whose complete configuration is managed by the chart, it's usually best for us as distribution maintainers to default to behavior that gets the chart installed in the desired configuration at any cost. I can see the value in allowing users to disable that for their own charts, though.
We have seen more or less the same issue during our upgrade of RKE2 from 1.20.8 to 1.20.12:
In our case this hit ingress-nginx (causing downtime) and longhorn (causing data loss). Is this the same problem?
That's the same issue I had. I think it's best to have the safe option as the default and the destructive one as opt-in. As an alternative, I suggest writing some damn big warning signs somewhere saying not to use this controller with any deployments that require any kind of persistence.
I am sure our deployments had been installed and were running perfectly well, so I am not sure why or where a failure was detected and then handled "wrong".
We tried to reproduce this issue, and it seems we found at least one situation where an "uninstall" happens where it should not: if the helm upgrade for a given chart fails, e.g. by hitting a timeout, the release is left in status "pending-upgrade". When the process is restarted, it detects "pending-upgrade", sets the release to "failed", and re-runs the helm upgrade. If a failure happens here again, the release is in a failed state on the next run, and this causes the uninstall / reinstall. Our assumption is that this is related to the change of the status from "pending-upgrade" to "failed". Could you verify whether this observation is correct? And in general, could we disable the "uninstall / install" behavior on errors completely?
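The failure sequence described above can be sketched roughly like this (illustrative commands for a throwaway test cluster; the release and chart names are placeholders, and the status transitions below restate this thread's account rather than documented helm behavior):

```shell
# 1. Start an upgrade with a deliberately short timeout so it gets interrupted
helm upgrade my-release ./my-chart --wait --timeout 10s

# 2. If the controller process is restarted mid-upgrade, the release can be
#    left in "pending-upgrade"; on restart the controller marks it "failed"
helm status my-release   # status: pending-upgrade, later failed

# 3. The next reconcile sees "failed" and performs uninstall + reinstall,
#    which is where downtime / data loss can occur for stateful charts
```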
For now I have solved this by switching to the Flux helm-controller, which does not suffer from this issue: https://github.com/fluxcd/helm-controller.
So you deploy this "in addition" to the one we have in rke2/k3s?
OK, I see that we didn't get this field added to the HelmChart spec in the RKE2 docs, but there is a timeout field available that takes a duration string (`30m` or the like). It is passed through directly to the helm command's timeout argument.
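As a sketch, setting that field on a HelmChart resource might look like the following (the chart name and namespaces are placeholders; `timeout` takes a Go-style duration string):

```yaml
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: elastic-operator
  namespace: kube-system
spec:
  chart: eck-operator            # placeholder chart name
  targetNamespace: elastic-system
  timeout: 30m                   # passed through as `helm upgrade --timeout 30m`
```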
The "timeout" situation is just one case that we could reproduce; we are not sure whether there are other situations. We believe that adjusting the timeout is not a valid workaround, because we never know how much time an upgrade in a given environment might need. We must ensure with 100% certainty that there is no automatic uninstall, since an uninstall means downtime and data loss. So the way we reproduced the root cause might be only one of the situations where this "uninstall" happens; we have also seen it with the ingress-nginx included in RKE2. Our assumption is that any update/upgrade might cause problems in a given cluster, or on servers / agents in the cluster. So to be on the safe side, we might have to set the timeout to "never time out". How can we configure this as the default for everything that happens with the rke2-integrated helm-controller? To have a real solution, could the "uninstall / install" be removed from the helm-controller completely, to protect against downtime and data loss? What would be the reason to allow uninstall / install in a production environment?
Tracking this issue since I'm experiencing the same thing with rke2 + rancher-istio. I will test out that timeout field and report back. We're also managing rook-ceph, so further prevention of downtime and data loss is still necessary.
Hello,
We experienced a release being deleted during an upgrade. Is this expected?
elastic/cloud-on-k8s#4734 (comment)