Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Workflow] Simplify recovery algorithm #24594

Merged
merged 4 commits into from
May 10, 2022

Conversation

suquark
Copy link
Member

@suquark suquark commented May 9, 2022

Why are these changes needed?

The workflow recovery algorithm was originally implemented with recursions, this makes it harder to check, maintain and optimize, because you only debug variables in the current stack frames, and there are potential risks of stack overflow due to recursion. This also results in duplicated code between _reconstruct_wait_step and _construct_resume_workflow_from_step.

This PR simplifies the workflow recovery algorithm. It's now easier to check, maintain and optimize. The recovery algorithm is divided into 3 steps in a clean way:

  1. Collect all steps need to be resumed and construct their dependencies
  2. Use topological sort to determine the order of recovering these steps.
  3. Recover these sorted steps.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@fishbone
Copy link
Contributor

fishbone commented May 9, 2022

@suquark could you add more details in description area about what you did here?

@fishbone
Copy link
Contributor

fishbone commented May 9, 2022

Also, does this also support cancellation if we store the execution queue in the workflow manager?

@fishbone fishbone added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 9, 2022
@suquark suquark removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 9, 2022
@suquark
Copy link
Member Author

suquark commented May 9, 2022

@iycheng what kind of cancellation you want? If you would like to cancel the "construct the workflow for recovery" process, then yes this PR makes it possible; but it does not directly enables cancelling of a "already recovering" workflow. However, this PR would still be a necessary starting point for fully supporting workflow cancelling. Later I would also refactor the workflow execution code like this, and they would share a few code.

Copy link
Contributor

@fishbone fishbone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I think we probably should pass the execution_queue to the execution engine with what we'll be able to have better support of wait/cancel/get and even suspension (workflow event)

@suquark
Copy link
Member Author

suquark commented May 10, 2022

The CI errors are not related. I'll merge it.

@suquark suquark merged commit bf6b7f4 into ray-project:master May 10, 2022
@suquark suquark deleted the non_recursion_recovery branch May 10, 2022 05:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants