Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: prevent update race in workflow cache. Fixes #9574 #12233

Merged
merged 12 commits into from
Jan 16, 2024

Conversation

drawlerr
Copy link
Contributor

@drawlerr drawlerr commented Nov 20, 2023

Fixes #9574

Motivation

When running high volumes of parallel workflows with high per-workflow node counts (300+ workflows, >100 nodes per workflow), ~1% of workflows will fail or get stuck with a few different errors which have changed over time with previous attempts to fix the problem:

  • originally, segfault when dereferencing a null pointer. updated to panic: "no Node found" in Argo 3.4.0
  • if a child node is lost, workflow gets "stuck" waiting forever

Given the random nature of the problem that only surfaces when large numbers of large workflows are running concurrently, I suspected and went looking for a possible race condition in the Workflow cache.

Modifications

I found a few areas where a reference to a workflow is acquired before locking and not re-acquired after locking, or where a lock was not acquired but should have been.
Instead of re-acquiring the workflow after locking I changed the logic to acquire the lock first before fetching the workflow to simplify, on the assumption that this race is rare enough that rarely waiting on the lock should be better than always hitting the cache twice.
In places that weren't locking but should have been, I added some logic to do so.

Verification

For the past 6 months, we have been running a modified version of Controller with the changes in this PR. Not a single further occurrence of the above mentioned problems have occurred whilst running 500-1000 concurrent workflows. At the far end of the extreme, we experienced etcd timeout problems but at least no more random workflow processing failures.

@drawlerr drawlerr changed the title Fixes #9574 Fix #9574 Nov 20, 2023
@agilgur5 agilgur5 added the area/controller Controller issues, panics label Nov 20, 2023
@drawlerr drawlerr changed the title Fix #9574 Fix: prevent update race in workflow cache Nov 20, 2023
@drawlerr drawlerr changed the title Fix: prevent update race in workflow cache fix: prevent update race in workflow cache Nov 20, 2023
@drawlerr drawlerr marked this pull request as ready for review November 29, 2023 17:46
@drawlerr drawlerr changed the title fix: prevent update race in workflow cache fix: prevent update race in workflow cache (Fixes #9574) Dec 1, 2023
Copy link
Member

@Joibel Joibel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks useful and sane. There isn't a useful way of testing this stuff in CI

Thanks for your contribution @drawlerr

@Joibel Joibel requested a review from juliev0 December 1, 2023 14:17
@Joibel
Copy link
Member

Joibel commented Dec 1, 2023

@juliev0, could you take a look, I think this is all good.

@drawlerr
Copy link
Contributor Author

drawlerr commented Dec 7, 2023

@juliev0 / @sarabala1979 could you take another look?

@drawlerr drawlerr requested a review from Joibel December 7, 2023 22:02
@Joibel
Copy link
Member

Joibel commented Dec 7, 2023

Still happy with this. @terrytangyuan, could you take a look?

Copy link
Member

@terrytangyuan terrytangyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there's one unresolved comment from @sarabala1979 so let's wait for another review from him

@cfis
Copy link

cfis commented Dec 18, 2023

@sarabala1979 any chance you could take another look? We would love to to get this MR (or equivalent fix) merged. Thanks!

Copy link
Member

@sarabala1979 sarabala1979 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sarabala1979 sarabala1979 enabled auto-merge (squash) January 16, 2024 17:28
@sarabala1979 sarabala1979 merged commit 1202ae4 into argoproj:main Jan 16, 2024
27 checks passed
@agilgur5
Copy link
Contributor

agilgur5 commented Feb 4, 2024

For posterity, I suspected a (potentially different kind of) race in #11451 (comment) and couldn't quite find it, I wonder if this might have been it. Dennis referred to this in #12132 (comment) too

isubasinghe pushed a commit to isubasinghe/argo-workflows that referenced this pull request Feb 20, 2024
isubasinghe pushed a commit to isubasinghe/argo-workflows that referenced this pull request Feb 27, 2024
isubasinghe pushed a commit to isubasinghe/argo-workflows that referenced this pull request Feb 28, 2024
…oproj#12233)

Signed-off-by: Dennis Lawler <[email protected]>
Signed-off-by: Dennis Lawler <[email protected]>
Signed-off-by: Isitha Subasinghe <[email protected]>
isubasinghe pushed a commit to isubasinghe/argo-workflows that referenced this pull request Mar 12, 2024
Joibel pushed a commit to isubasinghe/argo-workflows that referenced this pull request Apr 17, 2024
…oproj#12233)

Signed-off-by: Dennis Lawler <[email protected]>
Signed-off-by: Dennis Lawler <[email protected]>
Signed-off-by: Isitha Subasinghe <[email protected]>
tczhao pushed a commit to tczhao/argo that referenced this pull request Apr 19, 2024
Joibel pushed a commit to pipekit/argo-workflows that referenced this pull request Apr 19, 2024
…oproj#12233)

Signed-off-by: Dennis Lawler <[email protected]>
Signed-off-by: Dennis Lawler <[email protected]>
Signed-off-by: Isitha Subasinghe <[email protected]>
isubasinghe pushed a commit to isubasinghe/argo-workflows that referenced this pull request May 6, 2024
isubasinghe pushed a commit to isubasinghe/argo-workflows that referenced this pull request May 7, 2024
@tooptoop4
Copy link
Contributor

can dis be backported to 3.4?

@agilgur5
Copy link
Contributor

It's a fix, so it should be by default unless there are many conflicts. But #11851 would be the right place to ask.

@tooptoop4
Copy link
Contributor

was backported to 3.4.17

@agilgur5 agilgur5 changed the title fix: prevent update race in workflow cache (Fixes #9574) fix: prevent update race in workflow cache. Fixes #9574 Sep 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/controller Controller issues, panics
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parent node not initialized
7 participants