fix: Fixed running multiple workflows with mutex and memo concurrently is broken #11883
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fixes #11853
Motivation
When running multiple workflows with mutex synchronization and cache at the same time, there is an issue that mutex key is not released.
The first workflow will run correctly and release the keys, but the second workflow's 1st step will run but it does not release the key.
So this results in infinite pending nodes (like second workflow's 2nd or later nodes) as described in the issue.
Modifications
Omitting releasing lock in fulfilled node check after memo in
executeTemplate
makes this problem.When checking fulfilled node, this also should release any locks that this node have.
In
executeTemplate
, lock is released at first of the function or after running the details (like this or this)However, if memoization exists, node is returned as a short-cut, so currently releasing lock only depends on the first.
Though most nodes go through
executeTemplate
multiple times by their state, Omitting releasing lock in fulfilled node check after memo surley makes the problem.Like this issue, when running multiple nodes concurrently, second workflow's first node is not
fulfilled
but cache is hit, so it pass the first and fulfilled node check after memo does not release the lock. so the later nodes cannot get the lock forever.I just fix this issue by adding omitted releasing lock to fulfilled node check after memo
Verification
Simply can test this changes works by running the workflow below few times concurrently.
The fist workflows cache hit is 'NO' and 'YES' and seconds workflows cache hit is all 'YES' with no error as expected.