fix: prevent update race in workflow cache. Fixes #9574 #12233
Conversation
This all looks useful and sane. There isn't a useful way of testing this stuff in CI.
Thanks for your contribution, @drawlerr!
@juliev0, could you take a look? I think this is all good.
@juliev0 / @sarabala1979 could you take another look?
Still happy with this. @terrytangyuan, could you take a look?
Looks like there's one unresolved comment from @sarabala1979, so let's wait for another review from him.
@sarabala1979 any chance you could take another look? We would love to get this PR (or an equivalent fix) merged. Thanks!
LGTM
For posterity: I suspected a (potentially different kind of) race in #11451 (comment) and couldn't quite find it; I wonder if this might have been it. Dennis referred to this in #12132 (comment) too.
Can this be backported to 3.4?
It's a fix, so it should be by default unless there are many conflicts. But #11851 would be the right place to ask.
This was backported to 3.4.17.
Fixes #9574
Motivation
When running high volumes of parallel workflows with high per-workflow node counts (300+ workflows, >100 nodes per workflow), roughly 1% of workflows fail or get stuck with a few different errors, which have changed over time with previous attempts to fix the problem.
Given the random nature of the problem, which only surfaces when large numbers of large workflows run concurrently, I suspected a race condition in the workflow cache and went looking for one.
Modifications
I found a few areas where a reference to a workflow was acquired before locking and not re-acquired after locking, and other areas where a lock should have been acquired but wasn't.
Instead of re-acquiring the workflow after locking, I simplified the logic to acquire the lock first and then fetch the workflow, on the assumption that the race is rare enough that occasionally waiting on the lock is better than always hitting the cache twice.
In the places that should have been locking but weren't, I added the missing locking.
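A minimal sketch of the re-ordering described above, assuming simplified stand-in types (`workflowCache`, `Workflow`, a plain `sync.Mutex`) rather than the controller's actual informer cache and key lock:

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical stand-ins for illustration only, not Argo's real types.
type Workflow struct {
	Name  string
	Phase string
}

type workflowCache struct {
	mu    sync.Mutex
	items map[string]*Workflow
}

func (c *workflowCache) get(key string) *Workflow {
	return c.items[key]
}

// Racy ordering (the pattern this PR removes): the workflow is fetched before
// the lock is taken, so another goroutine can replace the cache entry in the
// meantime and the local reference ends up pointing at a stale copy.
func (c *workflowCache) updateStale(key, phase string) {
	wf := c.get(key) // read before locking
	c.mu.Lock()
	defer c.mu.Unlock()
	wf.Phase = phase // may be applied to an already-replaced, stale object
}

// Fixed ordering: take the lock first, then fetch, so the reference reflects
// the latest cached state for the whole critical section.
func (c *workflowCache) updateLocked(key, phase string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	wf := c.get(key) // read under the lock
	wf.Phase = phase
}

func main() {
	c := &workflowCache{items: map[string]*Workflow{
		"wf-1": {Name: "wf-1", Phase: "Running"},
	}}
	c.updateLocked("wf-1", "Succeeded")
	fmt.Println(c.items["wf-1"].Phase)
}
```

The trade-off is that the lock is now held slightly longer (it covers the cache read), but the workflow reference can no longer go stale between the read and the update.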
Verification
For the past 6 months, we have been running a modified version of the controller with the changes in this PR. Not a single further occurrence of the above-mentioned problems has been observed while running 500-1000 concurrent workflows. At the far end of that extreme we did hit etcd timeout problems, but no more random workflow processing failures.