Rollout stuck issue #3316
Comments
Controller logs after restart (bottom -> top):
events:
Are you able to semi-reliably reproduce this? One of the big pain points around this bug is that it has been a bit elusive to reproduce.
@zachaller, unfortunately, I couldn't reliably reproduce it, but I'll keep a close eye on it and try to gather more information. Additionally, for experimental purposes, I've set the
I am pretty much sure
I will keep you updated as I gather more information. Thanks!
Just for a bit of context as well: we actually see the conflict on ReplicaSets a lot, but we are running around 8,000 Rollout resources, so I don't think observing that log always leads to a stuck rollout. We have also seen stuck rollouts, but nowhere near the rate at which we see the conflict. Those conflicts should not, in theory, cause an issue if the rollouts controller does what it is supposed to and retries the reconcile, but something is going wrong at some point; it's probably some race condition, which makes it hard. I want to add a bit more context to the logs: we don't do a great job of logging the whole error function call chain, and the function that log line is coming from is called in various spots.
Yes, sure. We'd be happy to deploy a custom version to gather more information about this issue.
@eugenepaniot do you mind running 1.6.5 or master? I added some logging.
@zachaller, thank you. I have deployed the recent 1.6.5 version. Since we still have
A few words about the application setup:
The application also has HPA enabled. Please let me know if you need more information.
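For context on what "HPA enabled" typically means for a Rollout, here is a minimal sketch of an HPA whose scaleTargetRef points at a Rollout. The names, namespace, replica bounds, and metric are illustrative assumptions, not the reporter's actual configuration.

```yaml
# Illustrative HPA targeting an Argo Rollout (names and bounds are assumed,
# not taken from this report). The HPA scales the Rollout, and the Rollout
# controller propagates the replica count down to its ReplicaSets.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app
  namespace: apps
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: app
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```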
I'm facing a similar issue where my Rollouts with HPA get stuck in Progressing status permanently on v1.6.2.

```yaml
status:
  HPAReplicas: 13
  availableReplicas: 13
  blueGreen: {}
  canary: {}
  currentPodHash: 5f9cf5d68b
  currentStepHash: d68d4f7d8
  currentStepIndex: 2
  message: old replicas are pending termination
  observedGeneration: '8654'
  phase: Progressing
  readyReplicas: 11
  replicas: 11
  stableRS: 5f9cf5d68b
  updatedReplicas: 10
```

Logs:

```
time="2024-01-30T17:16:05Z" level=error msg="rollout syncHandler error: Operation cannot be fulfilled on replicasets.apps \"app-6b76b6dd98\": the object has been modified; please apply your changes to the latest version and try again" namespace=apps rollout=app
time="2024-01-30T17:16:05Z" level=info msg="rollout syncHandler queue retries: 1865 : key \"apps/app\"" namespace=apps rollout=app
```

The Rollout is fully promoted and the new version is released, but it stays stuck in Progressing status because of an inconsistent ReplicaSet count.

Additional info
I updated the controller to v1.6.5 and this issue still occurs. For now, we run a job to restart the controller every hour, but we still get
I've seen the issue with 1.6.4 where the newest ReplicaSet had a relatively small number of current and desired replicas (but more than the minimum from the KEDA HPA), while an older ReplicaSet was getting its desired replicas raised to the max but was stuck at 0 replicas. It actually used to work for some time after doing a rollout, with scaling behaving as expected, but something broke later, causing the behavior above. Doing a new rollout mitigated the issue, but it feels like playing a lottery. It seems like one of two things:
Maybe the warning about "the object has been modified; please apply your changes to the latest version and try again" is just a symptom of propagating desired replicas to the older ReplicaSet.
Looking at the Event logs, there is a RolloutNotCompleted followed by a RolloutUpdated, but there is no RolloutCompleted for the app in question. The Argo UI still shows a spinning sync circle after several hours too. ScalingReplicaSet events were there for the new ReplicaSet, but they were (wrongly) scaling to the small number of replicas until the new rollout began.
v1.6.6 was released; it fixes one reproducible version of a stuck rollout, so people might want to give it a try.
Hi guys, we also experienced some issues where a canary rollout got stuck. We couldn't reproduce it reliably, but from what I found when I added a couple more logs in argo-rollouts, it seems like there is a race condition between the
We are on v1.6.6 and were seeing the same errors listed above. We added --rollout-resync=60 and the errors subsided initially but gradually came back. We tried to upgrade the version to deal with "ReplicaSet not found" errors but have had to roll back due to these errors.
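For anyone who wants to try the same mitigation, the flag is passed to the controller container. This is only a sketch, shown as a strategic-merge patch over an assumed default argo-rollouts install (namespace and container name are assumptions); 60 is simply the value mentioned above.

```yaml
# Strategic-merge patch sketch: add --rollout-resync=60 so every Rollout is
# fully resynced/reconciled every 60 seconds.
# Note: this replaces any args already set on the container in your install.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argo-rollouts
  namespace: argo-rollouts
spec:
  template:
    spec:
      containers:
        - name: argo-rollouts
          args:
            - --rollout-resync=60
```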
Yup, I have also found this; there are a few places where we clobber the in-memory state of the ReplicaSet. I do suspect this is somewhat the root cause. The one thing that bothers me, though, is why the retried reconcile does not correct it and instead it gets stuck. I think the issue has been there a while, but it seems to have gotten worse with recent versions. This is going to become one of my top priorities to figure out.
Hey @zachaller. Curious if there is an update on this? We have been seeing the same issue on 1.6.3. We are currently resolving via controller restarts, but have had to pause our upgrade to 1.6. Might be able to share some logs next time it happens if there is anything in particular you are looking for.
Just got caught by this on version 1.6.6 of Argo Rollouts:
The only thing that helps is restarting the controller.
We tried version 1.6.6 but decided to revert back to 1.6.5, since we faced issues more often than before.
Any updates on this? We are facing the same issue with v1.6.6 consistently with argo rollouts and HPA.
Me too!
Have you had these problems since 2018?
@zachaller were you able to check it? Which version is more or less stable? We are on the latest 1.6.6 and it's pretty frequent.
I have not yet, but it is on my todo list and is getting some priority at Intuit. We don't see it a whole lot across our 8k Rollout resources, which affects my prioritization a bit; that said, it is starting to get some traction internally.
+1 we're seeing issues that require a restart on 1.6.6
We've also had to restart argo-rollouts to get a few rollouts un-stuck.
+1 I am trying a CronJob to restart it daily to see if it helps reduce the frequency of the issue.
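For reference, a restart CronJob along those lines could look roughly like this. It is a sketch, not the poster's actual job: the namespace, ServiceAccount, image, and schedule are assumptions, and the ServiceAccount needs RBAC permission to patch the argo-rollouts Deployment.

```yaml
# Hypothetical CronJob that restarts the argo-rollouts controller once a day.
# Assumes a ServiceAccount "rollouts-restarter" that is allowed to patch
# deployments in the argo-rollouts namespace.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: restart-argo-rollouts
  namespace: argo-rollouts
spec:
  schedule: "0 4 * * *"   # every day at 04:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: rollouts-restarter
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl:1.28   # any image that ships kubectl works
              command:
                - kubectl
                - -n
                - argo-rollouts
                - rollout
                - restart
                - deployment/argo-rollouts
```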
Hi all! We've also seen this error still in 1.7.1, provisioned by the argo-rollouts 2.37.0 chart; see logs below.
@jccastillocano conflicts are generally not an issue and are a normal k8s pattern; the conflict here is on the Rollout resource. The patch in the issue only addresses conflicts related to ReplicaSets. This is to ensure that we can always update the scale of the ReplicaSet and avoid conflict loops. We have not seen any loops on conflicts on Rollout resources.
This issue included information on a lot of different bugs. The Intuit team encountered one bug today, so I'll share the details of that particular problem:
We could build a mechanism into Argo Rollouts to re-list ReplicaSets occasionally and clear stale data, but that would be a significant refactor. Update: we've found that making an insignificant edit to old ReplicaSets, instead of deleting them, is a safer and equally effective way to clear the issue. I just added annotation
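The actual annotation key used above isn't shown in the comment, so purely as an illustration, a no-op edit of that kind could be a merge patch like the following applied to each old ReplicaSet; the annotation key and value here are made up.

```yaml
# Hypothetical merge patch that "touches" an old ReplicaSet without changing
# its behavior, which per the comment above is enough to clear the stale data.
# The annotation key and value are invented for illustration only.
metadata:
  annotations:
    example.com/rollouts-cache-bump: "1"
```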
Can this be implemented in argo-rollouts? For example, can an annotation change to old RS be made on a cron, or can it be hidden behind a flag that needs to be enabled by the users? Alternatively, would lowering revisionHistoryLimit or setting it to 0 make this problem go away? |
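On the last question: spec.revisionHistoryLimit on a Rollout controls how many old, scaled-down ReplicaSets are retained, mirroring the Deployment field of the same name. A minimal sketch with an illustrative Rollout name; whether this actually avoids the stale-ReplicaSet issue is an open question.

```yaml
# Sketch: keep no old ReplicaSets around once they are scaled down.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: app
spec:
  revisionHistoryLimit: 0
  # ...rest of the Rollout spec (replicas, selector, template, strategy) unchanged
```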
Checklist:
Describe the bug
We seem to have faced the issue described in:
#3272
#3257
#3256
To mitigate the issue we've deployed a master branch build from revision:
However, it does not solve the issue.
To Reproduce
Expected behavior
Screenshots
Version
Logs
rollout controller:
argo rollouts:
rs:
Happy to help with debugging, deploying a custom version, etc.
Workaround
We've restarted deploy/argo-rollouts. After controller restart:
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.