Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Fixed running multiple workflows with mutex and memo concurrently is broken #11883

Merged
merged 1 commit into from
Sep 25, 2023

Conversation

shmruin
Copy link
Contributor

@shmruin shmruin commented Sep 24, 2023

Fixes #11853

Motivation

When running multiple workflows with mutex synchronization and cache at the same time, there is an issue that mutex key is not released.

The first workflow will run correctly and release the keys, but the second workflow's 1st step will run but it does not release the key.

So this results in infinite pending nodes (like second workflow's 2nd or later nodes) as described in the issue.

Modifications

Omitting releasing lock in fulfilled node check after memo in executeTemplate makes this problem.

When checking fulfilled node, this also should release any locks that this node have.

In executeTemplate, lock is released at first of the function or after running the details (like this or this)

However, if memoization exists, node is returned as a short-cut, so currently releasing lock only depends on the first.
Though most nodes go through executeTemplate multiple times by their state, Omitting releasing lock in fulfilled node check after memo surley makes the problem.

Like this issue, when running multiple nodes concurrently, second workflow's first node is not fulfilled but cache is hit, so it pass the first and fulfilled node check after memo does not release the lock. so the later nodes cannot get the lock forever.

I just fix this issue by adding omitted releasing lock to fulfilled node check after memo

Verification

image

Simply can test this changes works by running the workflow below few times concurrently.

The fist workflows cache hit is 'NO' and 'YES' and seconds workflows cache hit is all 'YES' with no error as expected.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: example-steps-
spec:
  entrypoint: main

  templates:
    - name: main
      steps:
        - - name: job-1
            template: sleep
            arguments:
              parameters:
                - name: sleep_duration
                  value: 30
          - name: job-2
            template: sleep
            arguments:
              parameters:
                - name: sleep_duration
                  value: 15

    - name: sleep
      synchronization:
        mutex:
          name: mutex-example-steps-simple
      inputs:
        parameters:
          - name: sleep_duration
      script:
        image: alpine:latest
        command: [/bin/sh]
        source: |
          echo "Sleeping for {{ inputs.parameters.sleep_duration }}"
          sleep {{ inputs.parameters.sleep_duration }}
      memoize:
        key: "memo-key-5"
        cache:
          configMap:
            name: cache-example-steps-simple

Copy link
Contributor

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid analysis and description! Makes sense to me.

At worst, this release is a redundant no-op, so very safe change as far as I can tell

@terrytangyuan terrytangyuan merged commit b5f69a8 into argoproj:master Sep 25, 2023
23 checks passed
@terrytangyuan
Copy link
Member

Thanks!

@Paritosh-Anand
Copy link

Thanks for the explanation and the fix!

Can this PR be picked for v3.4.12 #11851

@shmruin
Copy link
Contributor Author

shmruin commented Sep 26, 2023

@Paritosh-Anand I guess this will be automatically included to that release because of no conflict expected. I will keep checking that thread 😀

terrytangyuan pushed a commit that referenced this pull request Oct 19, 2023
dpadhiar pushed a commit to dpadhiar/argo-workflows that referenced this pull request May 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: combination of mutex and memoization is broken
4 participants