[Workflow] Unify the semantics of max_retries of workflow task and Ray task #26350

Merged
merged 5 commits into from
Jul 19, 2022

suquark
Member

@suquark suquark commented Jul 7, 2022

Why are these changes needed?

Previously, the semantics of max_retries for workflow tasks differed from those for Ray tasks, because the Ray task semantics were not correct. Now that Ray has fixed the semantics of max_retries, we can unify them with the workflow. This removes max_retries from the workflow options, so we can just use max_retries in the Ray options.

This PR fixes three things:

  1. Currently, workflow task retry handles system failures and user runtime exceptions separately; that means the combined total number of retries could exceed max_retries.
  2. By default, Ray tasks retry automatically through lineage reconstruction; this would skip workflow checkpoints when workflow tasks run on top of Ray tasks.
  3. The error handling tests are mixed with other tests. This PR moves them to a new module for easier management.
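
To illustrate point 1, here is a minimal sketch of a single retry budget shared by both kinds of failure. It is not the code in this PR: `run_with_shared_budget` and `SystemFailure` are hypothetical names made up for the example, and only the idea of a per-task `task_retries` dict that is cleaned up with `pop(task_id, None)` is taken from the executor change.

```python
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


class SystemFailure(Exception):
    """Hypothetical stand-in for a system-level failure (e.g. a worker crash)."""


def run_with_shared_budget(
    task_id: str,
    fn: Callable[[], T],
    max_retries: int,
    task_retries: Dict[str, int],
) -> T:
    """Run fn, counting system failures and user exceptions against ONE budget."""
    while True:
        try:
            result = fn()
        except Exception:  # both SystemFailure and user exceptions land here
            used = task_retries.get(task_id, 0)
            if used >= max_retries:
                # Budget exhausted: drop the counter and surface the last error.
                task_retries.pop(task_id, None)
                raise
            task_retries[task_id] = used + 1
        else:
            # Success: drop the counter so a later run starts from zero.
            task_retries.pop(task_id, None)
            return result
```

With one counter per task, a task that hits one system failure and one user exception has used two retries, so it cannot go past max_retries by alternating failure kinds.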

Test test_step_failure is enhanced to reflect the changes of this PR.

Later we should align our semantics with #25896

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@suquark suquark marked this pull request as ready for review July 9, 2022 00:22
@suquark
Member Author

suquark commented Jul 9, 2022

(force push before reviewing to get rid of some CI failures from upstream)

@suquark suquark changed the title [Workflow] Fix workflow task retry [Workflow] Unify the semantics of max_retries of workflow task and Ray task Jul 15, 2022
@suquark suquark requested a review from fishbone July 15, 2022 00:16
@@ -309,6 +318,7 @@ async def _post_process_ready_task(
output_ref: WorkflowRef,
) -> None:
state = self._state
state.task_retries.pop(task_id, None)
Contributor

OK for now, but I feel it is generally easy to make a mistake here by simply forgetting the cleanup.

@@ -269,6 +266,18 @@ async def _handle_ready_task(
f"[{workflow_id}@{task_id}]"
)

# ---------------------- retry the task ----------------------
Contributor

I never realized it was so easy to do.

from ray import workflow


def test_step_failure(workflow_start_regular_shared, tmp_path):
Contributor

Do we have a test to make sure the task is not rerun by Ray, i.e. that the data is loaded from storage and not reconstructed through lineage?

Member Author

@suquark suquark Jul 15, 2022

Not really. Is there a test example in Ray? I feel it is a bit tricky to trigger lineage reconstruction in a unit test: it requires deleting an object from the object store (so that the task could actually be reconstructed). Do you have any ideas (do we have related helper functions)?

Contributor

I think you can do this:

  • Start a cluster with two nodes.
  • Have the task write a flag to the filesystem to record that it ran only once, then kill the node where the object ref was generated (so the data is gone).
  • Check that the task does not rerun.
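
A rough sketch of the filesystem flag part, independent of Ray. The helper names `record_execution` and `assert_ran_once` are hypothetical, and the two-node cluster setup and node kill are deliberately left out. The task body would call `record_execution`, and the test would call `assert_ran_once` after the node is gone and the output has been fetched again.

```python
import tempfile
from pathlib import Path


def record_execution(marker_dir: Path) -> int:
    """Drop one marker file per call; return how many calls have been seen."""
    marker_dir.mkdir(parents=True, exist_ok=True)
    count = len(list(marker_dir.glob("run-*")))
    # Not atomic: fine for a single task in a test, not for concurrent writers.
    (marker_dir / f"run-{count}").touch()
    return count + 1


def assert_ran_once(marker_dir: Path) -> None:
    """Fail if the task body executed zero times or more than once."""
    runs = len(list(marker_dir.glob("run-*")))
    assert runs == 1, f"expected exactly one execution, found {runs}"


if __name__ == "__main__":
    markers = Path(tempfile.mkdtemp()) / "markers"
    record_execution(markers)  # stands in for the task body running
    assert_ran_once(markers)   # would fail if the task had been re-executed
```

The marker directory has to live on storage that survives the node kill (for example a shared `tmp_path`), otherwise a rerun on the other node would not be detected.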

Contributor

@fishbone fishbone left a comment

LG! Some comments.

@fishbone fishbone added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 15, 2022
@suquark
Member Author

suquark commented Jul 18, 2022

@iycheng The workflow does not support a 'scheduling_strategy' that is not JSON-serializable as a Ray task option, so I just created the test and skipped it. We can enable it later.

@suquark suquark requested a review from fishbone July 18, 2022 22:08
@suquark suquark removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 18, 2022
Signed-off-by: Siyuan Zhuang <[email protected]>
@suquark
Member Author

suquark commented Jul 18, 2022

force update for DCO

@suquark
Member Author

suquark commented Jul 19, 2022

The CI failure seems unrelated. I'll merge it.

@suquark suquark merged commit eb4ed49 into ray-project:master Jul 19, 2022
@suquark suquark deleted the retry_and_auto_recovery branch July 19, 2022 06:25
xwjiang2010 pushed a commit to xwjiang2010/ray that referenced this pull request Jul 19, 2022
…y task (ray-project#26350)

* workflow task retry

Signed-off-by: Siyuan Zhuang <[email protected]>

* move and enhance tests

Signed-off-by: Siyuan Zhuang <[email protected]>

* use "max_retries" of Ray task

Signed-off-by: Siyuan Zhuang <[email protected]>

* add test for disabling lineage reconstruction in workflow

Signed-off-by: Siyuan Zhuang <[email protected]>
Signed-off-by: Xiaowei Jiang <[email protected]>
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
Signed-off-by: Stefan van der Kleij <[email protected]>