fix: make sure Finalizers have a chance to be removed. Fixes: #12836 #12831
Conversation
Signed-off-by: shuangkun <[email protected]>
Force-pushed from f1dbe0e to d68f830
Force-pushed from d68f830 to 90bedd8
@juliev0 Hi, can you take a look? It might help developers. Thanks!
Could you paste a link to the failed TestStoppedWorkflow run? That would help us better understand the problem.
workflow/controller/operator.go
Outdated
@@ -806,6 +809,10 @@ func (woc *wfOperationCtx) persistUpdates(ctx context.Context) {
			woc.log.WithError(err).Warn("failed to delete task-results")
		}
	}
	// If FinalizerArtifactGC exists, requeue to make sure artifact GC can execute.
	if woc.wf.Status.Fulfilled() && slices.Contains(wf.GetFinalizers(), common.FinalizerArtifactGC) {
		woc.requeue()
Would this result in an infinite loop where the workflow is always in the wfqueue?
FinalizerArtifactGC should be removed after GC completes.
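To make that concrete, here is a minimal sketch of what removal amounts to mechanically; removeFinalizer is an illustrative helper, not the controller's actual code, and it assumes common.FinalizerArtifactGC is the workflows.argoproj.io/artifact-gc finalizer string:

package sketch

// removeFinalizer strips a finalizer (e.g. common.FinalizerArtifactGC) from
// a workflow's finalizer list. Once the list is empty, Kubernetes can
// actually delete the Workflow object.
func removeFinalizer(finalizers []string, target string) []string {
	kept := make([]string, 0, len(finalizers))
	for _, f := range finalizers {
		if f != target {
			kept = append(kept, f)
		}
	}
	return kept
}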
maybe we should theoretically requeue if there's any Finalizer, not just Artifact GC?
I agree
Also, how do you feel about adding this to the if woc.wf.Status.Fulfilled() { block above, as a nested if statement? A rough sketch follows.
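A minimal, self-contained sketch of that nesting, with stand-in types (workflow and requeueIfFinalizersRemain are illustrative names, not the real controller structs); it also folds in the later suggestions from this thread of checking any finalizer and using a delayed requeue rather than an immediate one:

package sketch

import "time"

// stand-in for the real Workflow type, for illustration only
type workflow struct{ finalizers []string }

func (w *workflow) GetFinalizers() []string { return w.finalizers }

// requeueIfFinalizersRemain sketches the suggested nesting: inside the
// existing Fulfilled() block, requeue while any finalizer remains so a
// later reconciliation can remove it once its work (e.g. artifact GC)
// completes. requeueAfter stands in for woc.requeueAfter.
func requeueIfFinalizersRemain(wf *workflow, fulfilled bool, requeueAfter func(time.Duration)) {
	if fulfilled {
		// ... existing cleanup for fulfilled workflows runs here ...
		if len(wf.GetFinalizers()) > 0 {
			requeueAfter(5 * time.Second)
		}
	}
}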
On version 3.5.5, I've noticed that when I stop & delete workflows in the UI before they are complete, finalizers aren't removed and the workflow gets stuck (but artifact gc succeeds). I believe this is the same issue.
@juliev0 Looking at some of the test failures linked, I'm noticing the following:
Waiting 1m30s for workflows {{ } workflows.argoproj.io/test metadata.name=artgc-dag-wf-stopped-pod-gc-on-pod-completion-qrlp5 false false <nil> 0 }
when.go:356: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline
This indicates a rate limit issue with the call wfList, err := w.client.List(ctx, listOptions). If we're getting a rate limit error, and execution moves on to the artifact presence check too soon, then yes, it could be that the controller didn't have a chance to finish PodGC yet.
@shuangkun @juliev0 per my first comment here, I still think there's an issue that needs to be solved (and I think your proposed solution could be sufficient). As @tczhao mentioned, requeueing could be an issue, but only if finalizers are never removed for some reason.
If adding rate limiting for WaitForWorkflowList resolves these transient test issues, then we still need to modify the TestStoppedWorkflow test to ensure that finalizers are removed. A sketch of what that assertion could look like follows.
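A hedged sketch of such an assertion, using plain apimachinery polling rather than the e2e framework's actual When/Then DSL; getWorkflow and waitForFinalizersRemoved are assumed names introduced only for illustration:

package sketch

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// getWorkflow stands in for the e2e client's workflow accessor.
type getWorkflow func(ctx context.Context) (metav1.Object, error)

// waitForFinalizersRemoved polls at 1s intervals (staying under the client
// QPS limit) until the workflow has no finalizers left, which is what the
// test should assert in addition to waiting for deletion.
func waitForFinalizersRemoved(ctx context.Context, get getWorkflow) error {
	return wait.PollUntilContextTimeout(ctx, time.Second, 90*time.Second, true,
		func(ctx context.Context) (bool, error) {
			wf, err := get(ctx)
			if err != nil {
				return false, err
			}
			return len(wf.GetFinalizers()) == 0, nil
		})
}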
(Sorry, I just realized my last comment said "PodGC" - I meant to say "ArtifactGC" (which I've since edited))
If we're getting a rate limit error, and execution moves on to the artifact presence check too soon, then yes, it could be that the controller didn't have a chance to finish PodGC yet. <-- I assume you also mean "ArtifactGC"
@Garett-MacGowan Thanks for tying that together. Actually, in my original statement, I didn't realize that this WaitForWorkflowDeletion() call essentially waits for the finalizer to have been removed, so the logic does make sense, except that rate limiting seems to defeat that. :(
On version 3.5.5, I've noticed that when I stop & delete workflows in the UI before they are complete, finalizers aren't removed and the workflow gets stuck (but artifact gc succeeds).
That's interesting; it sounds like there's also a real issue. As @tczhao pointed out, the WorkflowArtifactGCTaskInformer should requeue the Workflow when an ArtifactGCTask has changed, which should cause the Workflow to be processed, the WorkflowArtifactGCTasks to be listed and read here, and then the Finalizer removed. Maybe there's some race condition in there that could prevent that?
I guess at the end of the day we need to determine if there's any harm in this change. If we change woc.requeue() to woc.requeueAfter(delay) and reduce the immediacy of the requeue, then maybe it's generally a good thing that we regularly revisit Workflows that still have Finalizers, just in case?
I assume you also mean "ArtifactGC"
Yes, I do.
Actually, in my original statement, I didn't realize that this WaitForWorkflowDeletion() call essentially waits for the finalizer to have been removed, so the logic does make sense, except that rate limiting seems to defeat that. :(
Ahh, yes, it does handle that properly already. @shuangkun maybe we could add time.Sleep(time.Second), or whatever the Kubernetes API QPS rate limit is, to WaitForWorkflowList()?
maybe it's generally a good thing that we regularly revisit Workflows that still have Finalizers just in case?
I think it's not a terrible idea. If it's requeued after a reasonable delay, it should prevent resource hogging to a degree. I can imagine an issue where a cron workflow continuously fails to GC, leading to a workflow build-up and a large number of requeueing workflows, though. That would bog down other workflows eventually, right?
Yes, true. I guess there is something broken as far as finalizer logic if we get into that scenario.
After the workflow is set to Failed, the workflow is never in the queue again, resulting in no artifact GC.
I'll post it next time, as this is not easy to reproduce, but I've encountered it several times recently.
https://github.com/argoproj/argo-workflows/actions/runs/8374128100/job/22928794428?pr=12780 Here!
When the workflow is marked Failed, there is no artifact GC: [screenshot]
Signed-off-by: shuangkun <[email protected]>
Force-pushed from b6ee60c to d2c6ea0
I will definitely take a look at this.
Are you saying that in the current code in master, we do requeue when the Workflow is in Succeeded state but not Failed state? Or that we don't necessarily requeue for any Completed state? I'm curious where the logic for this is.
The workflow is Failed. In the current code, sometimes during the last reconcile the workflow is set to Failed, but the wf queue is emptied. At that point, there is no chance for the next round of reconciliation and garbage collection.
Force-pushed from d2c6ea0 to 1288c2f
Force-pushed from 1288c2f to 4c11c51
workflow/controller/operator.go
Outdated
@@ -806,6 +809,10 @@ func (woc *wfOperationCtx) persistUpdates(ctx context.Context) {
			woc.log.WithError(err).Warn("failed to delete task-results")
		}
	}
	// If FinalizerArtifactGC exists, requeue to make sure artifact GC can execute.
	if woc.wf.Status.Fulfilled() && slices.Contains(wf.GetFinalizers(), common.FinalizerArtifactGC) {
		woc.requeue()
Based on the log and my understanding of how artifact GC works, I think the issue is something else. Currently, the finalizer does have a chance to be removed; see if the pseudocode below makes sense:
// operate()
if wf.status.fulfilled:
    garbageCollectArtifacts:
        create WorkflowArtifactGCTask
        create artifact gc pod
        loop through WorkflowArtifactGCTask:
            patch WorkflowArtifactGCTask for each deletion

// controller()
WorkflowArtifactGCTaskInformer:
    on WorkflowArtifactGCTask Update:
        wfqueue.AddRateLimited(key) // this requeues the wf to remove the finalizer
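For readers unfamiliar with that last step, a self-contained sketch of the informer wiring using client-go; addGCTaskHandler and wfQueue are illustrative names, the real controller code differs, and mapping the task back to its owning Workflow key is elided here (the task key is reused for brevity):

package sketch

import (
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// addGCTaskHandler shows the requeue path described above: when a
// WorkflowArtifactGCTask is updated, a key is re-added to the workflow
// queue so the Workflow is reconciled again and its finalizer can be
// removed once GC has completed.
func addGCTaskHandler(informer cache.SharedIndexInformer, wfQueue workqueue.RateLimitingInterface) {
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(old, new interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(new); err == nil {
				wfQueue.AddRateLimited(key) // requeue the owning Workflow
			}
		},
	})
}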
Force-pushed from 4c11c51 to 7b4c8ff
Signed-off-by: shuangkun <[email protected]>
Force-pushed from 354016f to a8077e5
Signed-off-by: shuangkun <[email protected]>
Force-pushed from a8077e5 to 033aaaa
Is requeueAfter 5s OK?
Signed-off-by: shuangkun <[email protected]>
@@ -347,6 +347,7 @@ func (w *When) WaitForWorkflowList(listOptions metav1.ListOptions, condition fun
		return w
	}
}
time.Sleep(time.Second)
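Why a one-second sleep helps, sketched below: client-go applies a per-client token-bucket limiter (rest.Config defaults to QPS 5, Burst 10 when unset), so polling List() in a tight loop exhausts the bucket and the limiter's Wait() starts failing with "would exceed context deadline". The helper and values below are illustrative, not what the e2e suite actually uses:

package sketch

import (
	"time"

	"k8s.io/client-go/rest"
)

// pollInterval mirrors the sleep added in the diff above: one List() per
// second stays well under client-go's default 5 QPS client-side limit.
const pollInterval = time.Second

// withHigherLimits shows the alternative of raising the per-client limits
// instead of sleeping; the numbers here are illustrative only.
func withHigherLimits(cfg *rest.Config) *rest.Config {
	cfg.QPS = 50
	cfg.Burst = 100
	return cfg
}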
Is the idea here that we are causing the rate limiting problem ourselves with too many consecutive queries? As far as I know, all of these tests run in parallel as part of CI so they can all affect each other.
Ahh, never mind. It's probably per-client rate limiting, and each test would have its own client, I guess?
I think I'm good with this. If there are no objections from the other reviewers I can merge it. Thanks as always for the iterations @shuangkun!
Sounds fair to me
Thank you everyone for the reviews!
…2831) Signed-off-by: shuangkun <[email protected]> (cherry picked from commit fb6c3d0)
Backported cleanly into
…2836 (argoproj#12831) Signed-off-by: shuangkun <[email protected]>
Fixes: #12836
When cluster pressure is high or execution is slow, TestStoppedWorkflow often fails because artifact GC didn't execute. It appears that after the workflow is last marked Failed, the controller has no chance to operate on this workflow again.
Motivation
Modifications
Verification