
Fix Issue: Ensure proper locking in WorkflowSweeper to prevent race conditions #214

Conversation

rq-dbrady (Contributor)

Pull Request type

  • Bugfix
  • Feature
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • WHOSUSING.md
  • Other (please describe):

NOTE: Please remember to run ./gradlew spotlessApply to fix any format violations.

Changes in this PR

Issue: #213
Correct the locking mechanism in WorkflowSweeper to prevent race conditions.

Previously, the WorkflowSweeper class had a potential race condition due to the order of operations in the sweep method. The workflow was fetched from the executionDaoFacade before acquiring the lock, followed by a verifyAndRepair operation that could mutate its state. This sequence left a small window (~50 to 100 µs) in which a workflow could be seen in two different states by different threads, causing inconsistencies and failures in workflow listeners or completion checks.
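For illustration, the problematic ordering can be sketched as below. All names here (fetchWorkflow, verifyAndRepair, acquireLock, and so on) are hypothetical stand-ins for the Conductor components involved, not the actual implementation; the trace list just records the order of operations so the unlocked window is visible:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the ORIGINAL ordering: the workflow is read and repaired before
// the lock is taken, leaving an unlocked window in which another sweeper
// thread can observe or mutate the same workflow.
class OldSweepOrder {
    final List<String> trace = new ArrayList<>();  // records operation order

    boolean acquireLock(String id)  { trace.add("acquireLock"); return true; }
    void fetchWorkflow(String id)   { trace.add("fetchWorkflow"); }
    void verifyAndRepair(String id) { trace.add("verifyAndRepair"); }
    void decide(String id)          { trace.add("decide"); }
    void releaseLock(String id)     { trace.add("releaseLock"); }

    void sweep(String workflowId) {
        fetchWorkflow(workflowId);    // read happens OUTSIDE the lock
        verifyAndRepair(workflowId);  // may mutate state, still unlocked
        if (!acquireLock(workflowId)) {
            return;
        }
        try {
            decide(workflowId);       // may act on a stale snapshot
        } finally {
            releaseLock(workflowId);
        }
    }
}
```

The race window is everything before the acquireLock call: a second sweeper replica can fetch and repair the same workflow during that span.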

Changes made:

  • Removed the decideWithLock method.
  • Moved the locking logic directly into the sweep method to ensure the workflow is locked before any operations are performed on it.
  • Ensured the workflow lock is only released after it is removed from the queue to prevent race conditions.

Observed issues:

  • Race conditions were observed at large scale (30 replicas in Kubernetes, Redis cluster with Redis lock, ~75-90 workflows/sec).
  • Workflows could remain in a "Running" state even after completion was triggered, leading to listener and completion-check failures.

New implementation in sweep method:

  • Acquire lock before fetching the workflow and performing any operations.
  • Handle verify and repair within the lock.
  • Ensure the workflow is locked throughout the decision process.
  • Release the lock only after removing the workflow from the queue.

This fix ensures atomicity in operations on workflows, preventing the race conditions previously observed.
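The reordered sweep can be sketched as follows. This is a minimal illustration of the lock-first sequence described above, not the actual Conductor code; the method names mirror the components mentioned in this PR (executionLockService, verifyAndRepair, the decider queue) but are stubbed out so the ordering itself can be checked:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the NEW ordering: the lock is acquired before any read or
// mutation, and released only after the workflow leaves the decider queue.
class NewSweepOrder {
    final List<String> trace = new ArrayList<>();  // records operation order

    boolean acquireLock(String id)   { trace.add("acquireLock"); return true; }
    void fetchWorkflow(String id)    { trace.add("fetchWorkflow"); }
    void verifyAndRepair(String id)  { trace.add("verifyAndRepair"); }
    void decide(String id)           { trace.add("decide"); }
    void removeFromQueue(String id)  { trace.add("removeFromQueue"); }
    void releaseLock(String id)      { trace.add("releaseLock"); }

    void sweep(String workflowId) {
        // 1. Lock first: no other sweeper instance can touch this workflow.
        if (!acquireLock(workflowId)) {
            return; // another instance holds the lock; skip this sweep cycle
        }
        try {
            fetchWorkflow(workflowId);     // 2. read state only after locking
            verifyAndRepair(workflowId);   // 3. repair inside the lock
            decide(workflowId);            // 4. run the decider
            removeFromQueue(workflowId);   // 5. dequeue before releasing
        } finally {
            releaseLock(workflowId);       // 6. release last
        }
    }
}
```

Because every read and mutation sits between acquireLock and releaseLock, no second thread can observe the workflow mid-sweep, which is the atomicity property the fix is after.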

@@ -74,24 +79,25 @@ public CompletableFuture<Void> sweepAsync(String workflowId) {

    public void sweep(String workflowId) {
        WorkflowModel workflow = null;
        StopWatch watch = new StopWatch();
manan164 (Contributor):
Hi @rq-dbrady, instead of using the stopwatch, can we try to acquire a lock here and, if successful, do the repair, or otherwise come out of the sweep logic?

rq-dbrady (Contributor, Author):
Currently it follows this logic:

  1. Create stopwatch.
  2. Set the workflow context.
  3. Acquire lock (if it fails, return from sweep).
  4. Get the workflow from the store.
  5. Run repair.
  6. Start stopwatch.
  7. Decide.
  8. Remove from decider queue.
  9. Release lock and stop stopwatch.

Are you proposing we acquire the lock at the start of the method and then start the stopwatch? Something like:

    public void sweep(String workflowId) {
        WorkflowModel workflow = null;
        StopWatch watch = new StopWatch();
        if (!executionLockService.acquireLock(workflowId)) {
            return; // lock held elsewhere; skip this sweep
        }
        watch.start();
        // do repair logic
        // decide, etc.
    }

manan164 (Contributor):

Hi @rq-dbrady, yes. The first thing we should do is try to acquire the lock, then start the stopwatch, run decide, stop the stopwatch, and release the lock. Optionally, we don't need a StopWatch to measure execution time; we can use System.currentTimeMillis().
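As a rough sketch of the System.currentTimeMillis() alternative suggested here, a small timing helper could look like the following; timeMillis is a hypothetical name, not part of Conductor or Commons Lang:

```java
// Hypothetical helper: measures elapsed wall-clock time for a task using
// System.currentTimeMillis() instead of Apache Commons StopWatch.
class SweepTimer {
    static long timeMillis(Runnable task) {
        long start = System.currentTimeMillis();
        task.run();                                  // e.g. the decide step
        return System.currentTimeMillis() - start;   // elapsed milliseconds
    }
}
```

This trades StopWatch's pause/resume features for one fewer dependency; millisecond resolution is adequate for the sweep metrics discussed here.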

rq-dbrady (Contributor, Author):

@manan164, I applied the changes in the suggested order:

  1. Acquire lock
  2. Run repair
  3. Start timer
  4. Decide
  5. End timer
  6. Release lock

rq-dbrady force-pushed the rq-dbrady/redisLockSweeperRaceconditionFixes branch 2 times, most recently from 0c37dca to 76df8d7 on July 23, 2024 at 17:40
rq-dbrady force-pushed the branch from 76df8d7 to 42a708f on July 23, 2024 at 17:41
rq-dbrady changed the title from "fix: Ensure proper locking in WorkflowSweeper to prevent race conditions" to "Fix Issue: Ensure proper locking in WorkflowSweeper to prevent race conditions" on July 23, 2024
v1r3n merged commit 9da6a4e into conductor-oss:main on Jul 28, 2024 (2 checks passed).