fix: prevent update race in workflow cache. Fixes #9574 #12233
Conversation
This all looks useful and sane. There isn't a useful way of testing this stuff in CI.
Thanks for your contribution, @drawlerr!
@juliev0, could you take a look? I think this is all good.
@juliev0 / @sarabala1979 could you take another look?
Still happy with this. @terrytangyuan, could you take a look?
Looks like there's one unresolved comment from @sarabala1979, so let's wait for another review from him.
@sarabala1979 any chance you could take another look? We would love to get this PR (or an equivalent fix) merged. Thanks!
LGTM
For posterity: I suspected a (potentially different kind of) race in #11451 (comment) and couldn't quite find it; I wonder if this might have been it. Dennis referred to this in #12132 (comment) too.
Can this be backported to 3.4?
It's a fix, so it should be by default unless there are many conflicts. But #11851 would be the right place to ask.
This was backported to 3.4.17.
Fixes #9574
Motivation
When running high volumes of parallel workflows with high per-workflow node counts (300+ workflows, >100 nodes per workflow), roughly 1% of workflows fail or get stuck with a few different errors, which have changed over time with previous attempts to fix the problem.
Given the random nature of the problem, which only surfaces when large numbers of large workflows run concurrently, I suspected a race condition in the workflow cache and went looking for one.
Modifications
I found a few areas where a reference to a workflow was acquired before locking and not re-acquired after locking, and other areas where a lock should have been acquired but wasn't.
Instead of re-acquiring the workflow after locking, I simplified the logic to acquire the lock first and then fetch the workflow, on the assumption that the race is rare enough that occasionally waiting on the lock is better than always hitting the cache twice.
In the places that should have been locking but weren't, I added the missing locking.
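A minimal sketch of the re-ordering described above, assuming simplified stand-in types (`workflowCache`, `Workflow`, a plain `sync.Mutex`) rather than the controller's actual informer cache and key lock:

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical stand-ins for illustration only, not Argo's real types.
type Workflow struct {
	Name  string
	Phase string
}

type workflowCache struct {
	mu    sync.Mutex
	items map[string]*Workflow
}

func (c *workflowCache) get(key string) *Workflow {
	return c.items[key]
}

// Racy ordering (the pattern this PR removes): the workflow is fetched before
// the lock is taken, so another goroutine can replace the cache entry in the
// meantime and the local reference ends up pointing at a stale copy.
func (c *workflowCache) updateStale(key, phase string) {
	wf := c.get(key) // read before locking
	c.mu.Lock()
	defer c.mu.Unlock()
	wf.Phase = phase // may be applied to an already-replaced, stale object
}

// Fixed ordering: take the lock first, then fetch, so the reference reflects
// the latest cached state for the whole critical section.
func (c *workflowCache) updateLocked(key, phase string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	wf := c.get(key) // read under the lock
	wf.Phase = phase
}

func main() {
	c := &workflowCache{items: map[string]*Workflow{
		"wf-1": {Name: "wf-1", Phase: "Running"},
	}}
	c.updateLocked("wf-1", "Succeeded")
	fmt.Println(c.items["wf-1"].Phase)
}
```

The trade-off is that the lock is now held slightly longer (it covers the cache read), but the workflow reference can no longer go stale between the read and the update.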
Verification
For the past 6 months, we have been running a modified version of the controller with the changes in this PR. Not a single further occurrence of the above-mentioned problems has been observed while running 500-1000 concurrent workflows. At the far end of that extreme we did hit etcd timeout problems, but no more random workflow processing failures.