fix: make sure Finalizers have a chance to be removed. Fixes: #12836 #12831
Conversation
Signed-off-by: shuangkun <[email protected]>
Force-pushed from f1dbe0e to d68f830
Force-pushed from d68f830 to 90bedd8
@juliev0 Hi, can you take a look? It might help developers. Thanks!
Could you paste a link to the failed TestStoppedWorkflow run? That would help us better understand the problem.
workflow/controller/operator.go
Outdated
@@ -806,6 +809,10 @@ func (woc *wfOperationCtx) persistUpdates(ctx context.Context) {
			woc.log.WithError(err).Warn("failed to delete task-results")
		}
	}
	// If FinalizerArtifactGC exists, requeue to make sure artifact GC can execute.
	if woc.wf.Status.Fulfilled() && slices.Contains(wf.GetFinalizers(), common.FinalizerArtifactGC) {
		woc.requeue()
Would this result in an infinite loop where the workflow is always in the wfqueue?
FinalizerArtifactGC should be removed after GC completes.
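To make that concrete, here is a minimal sketch of what removal amounts to mechanically; removeFinalizer is an illustrative helper, not the controller's actual code, and it assumes common.FinalizerArtifactGC is the workflows.argoproj.io/artifact-gc finalizer string:

package sketch

// removeFinalizer strips a finalizer (e.g. common.FinalizerArtifactGC) from
// a workflow's finalizer list. Once the list is empty, Kubernetes can
// actually delete the Workflow object.
func removeFinalizer(finalizers []string, target string) []string {
	kept := make([]string, 0, len(finalizers))
	for _, f := range finalizers {
		if f != target {
			kept = append(kept, f)
		}
	}
	return kept
}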
maybe we should theoretically requeue if there's any Finalizer, not just Artifact GC?
I agree
Also, how do you feel about adding this to the if woc.wf.Status.Fulfilled() { block above, as a nested if statement? A rough sketch follows.
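A minimal, self-contained sketch of that nesting, with stand-in types (workflow and requeueIfFinalizersRemain are illustrative names, not the real controller structs); it also folds in the later suggestions from this thread of checking any finalizer and using a delayed requeue rather than an immediate one:

package sketch

import "time"

// stand-in for the real Workflow type, for illustration only
type workflow struct{ finalizers []string }

func (w *workflow) GetFinalizers() []string { return w.finalizers }

// requeueIfFinalizersRemain sketches the suggested nesting: inside the
// existing Fulfilled() block, requeue while any finalizer remains so a
// later reconciliation can remove it once its work (e.g. artifact GC)
// completes. requeueAfter stands in for woc.requeueAfter.
func requeueIfFinalizersRemain(wf *workflow, fulfilled bool, requeueAfter func(time.Duration)) {
	if fulfilled {
		// ... existing cleanup for fulfilled workflows runs here ...
		if len(wf.GetFinalizers()) > 0 {
			requeueAfter(5 * time.Second)
		}
	}
}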
On version 3.5.5, I've noticed that when I stop & delete workflows in the UI before they are complete, finalizers aren't removed and the workflow gets stuck (but artifact gc succeeds). I believe this is the same issue.
@juliev0 Looking at some of the test failures linked, I'm noticing the following:
Waiting 1m30s for workflows {{ } workflows.argoproj.io/test metadata.name=artgc-dag-wf-stopped-pod-gc-on-pod-completion-qrlp5 false false <nil> 0 }
when.go:356: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline
This indicates a rate limit issue with the call wfList, err := w.client.List(ctx, listOptions). If we're getting a rate limit error, and execution moves on to the artifact presence check too soon, then yes, it could be that the controller didn't have a chance to finish PodGC yet.
@shuangkun @juliev0 per my first comment here, I still think there's an issue that needs to be solved (and I think your proposed solution could be sufficient). As @tczhao mentioned, requeueing could be an issue, but only if finalizers are never removed for some reason.
If adding rate limiting for WaitForWorkflowList resolves these transient test issues, then we still need to modify the TestStoppedWorkflow test to ensure that finalizers are removed. A sketch of what that assertion could look like follows.
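A hedged sketch of such an assertion, using plain apimachinery polling rather than the e2e framework's actual When/Then DSL; getWorkflow and waitForFinalizersRemoved are assumed names introduced only for illustration:

package sketch

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// getWorkflow stands in for the e2e client's workflow accessor.
type getWorkflow func(ctx context.Context) (metav1.Object, error)

// waitForFinalizersRemoved polls at 1s intervals (staying under the client
// QPS limit) until the workflow has no finalizers left, which is what the
// test should assert in addition to waiting for deletion.
func waitForFinalizersRemoved(ctx context.Context, get getWorkflow) error {
	return wait.PollUntilContextTimeout(ctx, time.Second, 90*time.Second, true,
		func(ctx context.Context) (bool, error) {
			wf, err := get(ctx)
			if err != nil {
				return false, err
			}
			return len(wf.GetFinalizers()) == 0, nil
		})
}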
(Sorry, I just realized my last comment said "PodGC" - I meant to say "ArtifactGC" (which I've since edited))
If we're getting a rate limit error, and execution moves on to the artifact presence check too soon, then yes, it could be that the controller didn't have a chance to finish PodGC yet. <-- I assume you also mean "ArtifactGC"
@Garett-MacGowan Thanks for tying that together. Actually, in my original statement, I didn't realize that this WaitForWorkflowDeletion() call essentially waits for the finalizer to have been removed, so the logic does make sense, except that rate limiting seems to defeat that. :(
On version 3.5.5, I've noticed that when I stop & delete workflows in the UI before they are complete, finalizers aren't removed and the workflow gets stuck (but artifact gc succeeds).
That's interesting; it sounds like there's also a real issue. As @tczhao pointed out, the WorkflowArtifactGCTaskInformer should requeue the Workflow when an ArtifactGCTask has changed, which should cause the Workflow to be processed, the WorkflowArtifactGCTasks to be listed and read here, and then the Finalizer removed. Maybe there's some race condition in there that could prevent that?
I guess at the end of the day we need to determine if there's any harm in this change. If we change woc.requeue() to woc.requeueAfter(delay) and reduce the immediacy of the requeue, then maybe it's generally a good thing that we regularly revisit Workflows that still have Finalizers, just in case?
I assume you also mean "ArtifactGC"
Yes, I do.
Actually, in my original statement, I didn't realize that this WaitForWorkflowDeletion() call essentially waits for the finalizer to have been removed, so the logic does make sense, except that rate limiting seems to defeat that. :(
Ahh, yes, it does handle that properly already. @shuangkun maybe we could add time.Sleep(time.Second), or whatever the Kubernetes API QPS rate limit is, to WaitForWorkflowList()?
maybe it's generally a good thing that we regularly revisit Workflows that still have Finalizers just in case?
I think it's not a terrible idea. If it's requeued after a reasonable delay, it should prevent resource hogging to a degree. I can imagine an issue where a cron workflow continuously fails to GC, leading to a workflow build-up and a large number of requeueing workflows, though. That would bog down other workflows eventually, right?
Yes, true. I guess there is something broken as far as finalizer logic if we get into that scenario.
After the workflow is set to Failed, the workflow is never in the queue again, resulting in no artifact GC.
I'll post it next time, as this is not easy to reproduce, but I've encountered it several times recently.
https://github.com/argoproj/argo-workflows/actions/runs/8374128100/job/22928794428?pr=12780 Here!
When the workflow is marked Failed, there is no artifact GC: [screenshot]
Signed-off-by: shuangkun <[email protected]>
Force-pushed from b6ee60c to d2c6ea0
I will definitely take a look at this.
Are you saying that in the current code in master, we do requeue when the Workflow is in Succeeded state but not Failed state? Or that we don't necessarily requeue for any Completed state? I'm curious where the logic for this is.
The workflow is Failed. In the current code, sometimes during the last reconcile the workflow is set to Failed, but the wf queue is emptied. At that point, there is no chance for the next round of reconciliation and garbage collection.
Force-pushed from d2c6ea0 to 1288c2f
Force-pushed from 1288c2f to 4c11c51
workflow/controller/operator.go
Outdated
@@ -806,6 +809,10 @@ func (woc *wfOperationCtx) persistUpdates(ctx context.Context) {
			woc.log.WithError(err).Warn("failed to delete task-results")
		}
	}
	// If FinalizerArtifactGC exists, requeue to make sure artifact GC can execute.
	if woc.wf.Status.Fulfilled() && slices.Contains(wf.GetFinalizers(), common.FinalizerArtifactGC) {
		woc.requeue()
Based on the log and my understanding of how artifact GC works, I think the issue is something else. Currently, the finalizer does have a chance to be removed; see if the pseudocode below makes sense:
// operate()
if wf.status.fulfilled:
    garbageCollectArtifacts:
        create WorkflowArtifactGCTask
        create artifact gc pod
        loop through WorkflowArtifactGCTask:
            patch WorkflowArtifactGCTask for each deletion

// controller()
WorkflowArtifactGCTaskInformer:
    on WorkflowArtifactGCTask Update:
        wfqueue.AddRateLimited(key) // this requeues the wf to remove the finalizer
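For readers unfamiliar with that last step, a self-contained sketch of the informer wiring using client-go; addGCTaskHandler and wfQueue are illustrative names, the real controller code differs, and mapping the task back to its owning Workflow key is elided here (the task key is reused for brevity):

package sketch

import (
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// addGCTaskHandler shows the requeue path described above: when a
// WorkflowArtifactGCTask is updated, a key is re-added to the workflow
// queue so the Workflow is reconciled again and its finalizer can be
// removed once GC has completed.
func addGCTaskHandler(informer cache.SharedIndexInformer, wfQueue workqueue.RateLimitingInterface) {
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(old, new interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(new); err == nil {
				wfQueue.AddRateLimited(key) // requeue the owning Workflow
			}
		},
	})
}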
Force-pushed from 4c11c51 to 7b4c8ff
Signed-off-by: shuangkun <[email protected]>
Force-pushed from 354016f to a8077e5
Signed-off-by: shuangkun <[email protected]>
Force-pushed from a8077e5 to 033aaaa
Is requeueAfter 5s OK?
Signed-off-by: shuangkun <[email protected]>
@@ -347,6 +347,7 @@ func (w *When) WaitForWorkflowList(listOptions metav1.ListOptions, condition fun
		return w
	}
}
time.Sleep(time.Second)
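Why a one-second sleep helps, sketched below: client-go applies a per-client token-bucket limiter (rest.Config defaults to QPS 5, Burst 10 when unset), so polling List() in a tight loop exhausts the bucket and the limiter's Wait() starts failing with "would exceed context deadline". The helper and values below are illustrative, not what the e2e suite actually uses:

package sketch

import (
	"time"

	"k8s.io/client-go/rest"
)

// pollInterval mirrors the sleep added in the diff above: one List() per
// second stays well under client-go's default 5 QPS client-side limit.
const pollInterval = time.Second

// withHigherLimits shows the alternative of raising the per-client limits
// instead of sleeping; the numbers here are illustrative only.
func withHigherLimits(cfg *rest.Config) *rest.Config {
	cfg.QPS = 50
	cfg.Burst = 100
	return cfg
}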
Is the idea here that we are causing the rate limiting problem ourselves with too many consecutive queries? As far as I know, all of these tests run in parallel as part of CI so they can all affect each other.
Ahh, never mind. It's probably per-client rate limiting, and each test would have its own client, I guess?
I think I'm good with this. If there are no objections from the other reviewers I can merge it. Thanks as always for the iterations @shuangkun!
Sounds fair to me
Thank you everyone for the reviews!
…2831) Signed-off-by: shuangkun <[email protected]> (cherry picked from commit fb6c3d0)
Backported cleanly into
…2836 (argoproj#12831) Signed-off-by: shuangkun <[email protected]>
Fixes: #12836
When cluster pressure is high or execution is slow, TestStoppedWorkflow often fails because artifact GC didn't execute. It appears that after the workflow is last marked Failed, the controller has no chance to operate on this workflow again.
Motivation
Modifications
Verification