[Workflow] Unify the semantics of max_retries of workflow task and Ray task #26350

suquark · 2022-07-07T07:42:37Z

Why are these changes needed?

Previously, the semantics of max_retries for workflow tasks differed from those of Ray tasks, because the Ray task semantics were not correct. Ray has since fixed the semantics of max_retries, so we can unify the two. This PR removes max_retries from the workflow options; users can simply set max_retries in the Ray options.

This PR fixes three things:

- Currently, workflow task retry handles system failures and user runtime exceptions separately; that means the combined total number of retries could exceed max_retries.
- By default, Ray tasks are retried automatically through lineage reconstruction. This would skip workflow checkpoints when workflow tasks run on top of Ray tasks.
- The error-handling tests are mixed in with other tests. This PR moves them to a new module for easier management.
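
The first point can be illustrated with a minimal plain-Python sketch. This is not the workflow executor's actual code: it only shows one retry budget being shared by both kinds of failure, so the total number of retries cannot exceed max_retries. The names `SystemFailure`, `run_with_retries` and `make_flaky` are hypothetical, and the `retry_exceptions` flag here only mimics the idea of the Ray option with the same name.

```python
class SystemFailure(Exception):
    """Hypothetical stand-in for a worker crash or node failure."""


def run_with_retries(fn, max_retries, retry_exceptions=True):
    """Run fn(), retrying on failure with one shared budget.

    Returns (result, attempts). Re-raises the last error once the
    budget is exhausted.
    """
    attempts = 0
    retries_left = max_retries
    while True:
        attempts += 1
        try:
            return fn(), attempts
        except SystemFailure:
            # System failures always draw from the shared budget.
            if retries_left <= 0:
                raise
        except Exception:
            # User exceptions are retried only on request, and they
            # draw from the same budget as system failures.
            if not retry_exceptions or retries_left <= 0:
                raise
        retries_left -= 1


def make_flaky(failures):
    """Build a function that raises the given errors in order, then succeeds."""
    pending = list(failures)

    def flaky():
        if pending:
            raise pending.pop(0)
        return "ok"

    return flaky


# Two system failures plus one user exception fit in a budget of 3.
result, attempts = run_with_retries(
    make_flaky([SystemFailure(), ValueError("boom"), SystemFailure()]),
    max_retries=3,
)
```

With `max_retries=2` the same sequence of failures would exhaust the budget on the third failure and re-raise it, regardless of which kind of failure came first.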

Test test_step_failure is enhanced to reflect the changes of this PR.

Later we should align our semantics with #25896

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

suquark · 2022-07-09T00:30:42Z

(force push before reviewing to get rid of some CI failures from upstream)

python/ray/workflow/step_executor.py

fishbone · 2022-07-15T00:25:29Z

python/ray/workflow/workflow_executor.py

@@ -309,6 +318,7 @@ async def _post_process_ready_task(
        output_ref: WorkflowRef,
    ) -> None:
        state = self._state
+        state.task_retries.pop(task_id, None)


OK for now, but I feel it is generally easy to make a mistake here by simply forgetting the cleanup.

fishbone · 2022-07-15T00:26:24Z

python/ray/workflow/workflow_executor.py

@@ -269,6 +266,18 @@ async def _handle_ready_task(
                f"[{workflow_id}@{task_id}]"
            )

+            # ---------------------- retry the task ----------------------


I never realized it was so easy to do.

fishbone · 2022-07-15T00:29:37Z

python/ray/workflow/tests/test_error_handling.py

+from ray import workflow
+
+
+def test_step_failure(workflow_start_regular_shared, tmp_path):


Do we have a test to make sure the task is not rerun by Ray, i.e. that the data is loaded from storage and not reconstructed through lineage?

Not really. Is there a test example in Ray? I feel it is a bit tricky to trigger lineage reconstruction in a unit test: this requires deleting an object from the object store (so that the task can actually be reconstructed). Do you have any ideas (do we have related helper functions)?

I think you can do this:

- Start a cluster with two nodes.
- Kill the node where the object ref was generated (so the data is gone). Have the task write a flag to the filesystem to indicate that it ran only once.
- Check that the task does not rerun.
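
The "write a flag to the filesystem" part could look roughly like the sketch below. It uses only the standard library and does not start a Ray cluster or kill a node; a plain dictionary stands in for durable workflow storage, and the second call simulates recovery after the in-memory object is lost. `record_run`, `run_count`, `run_or_load` and `checkpoints` are hypothetical names, not workflow APIs.

```python
import tempfile
from pathlib import Path


def record_run(marker: Path) -> int:
    """Task body: append one line per execution and return a dummy result."""
    with marker.open("a") as f:
        f.write("ran\n")
    return 42


def run_count(marker: Path) -> int:
    """Number of times the task body actually executed."""
    return len(marker.read_text().splitlines()) if marker.exists() else 0


def run_or_load(task_id: str, checkpoints: dict, marker: Path) -> int:
    """Return the checkpointed result if present, else execute the task."""
    if task_id not in checkpoints:
        checkpoints[task_id] = record_run(marker)
    return checkpoints[task_id]


with tempfile.TemporaryDirectory() as tmp:
    marker = Path(tmp) / "runs.txt"
    checkpoints = {}  # stand-in for durable workflow storage
    first = run_or_load("task", checkpoints, marker)
    # Simulated recovery (e.g. the node holding the object was killed):
    # the value must come from the checkpoint, not from re-execution.
    second = run_or_load("task", checkpoints, marker)
    executions = run_count(marker)
```

In a real test the marker file would have to live on storage that survives the killed node, and the final assertion would be that the marker records exactly one execution.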

fishbone

LG! Some comments.

suquark · 2022-07-18T22:06:40Z

@iycheng Workflow does not support a 'scheduling_strategy' that is not JSON-serializable as a Ray task option, so I just created the test and skipped it. We can enable it later.

Signed-off-by: Siyuan Zhuang <[email protected]>

suquark · 2022-07-18T23:10:10Z

force update for DCO

suquark · 2022-07-19T06:24:49Z

The CI failure seems unrelated. I'll merge it.

…y task (ray-project#26350) * workflow task retry Signed-off-by: Siyuan Zhuang <[email protected]> * move and enhance tests Signed-off-by: Siyuan Zhuang <[email protected]> * use "max_retries" of Ray task Signed-off-by: Siyuan Zhuang <[email protected]> * add test for disabling lineage reconstruction in workflow Signed-off-by: Siyuan Zhuang <[email protected]> Signed-off-by: Xiaowei Jiang <[email protected]>

…y task (ray-project#26350) * workflow task retry Signed-off-by: Siyuan Zhuang <[email protected]> * move and enhance tests Signed-off-by: Siyuan Zhuang <[email protected]> * use "max_retries" of Ray task Signed-off-by: Siyuan Zhuang <[email protected]> * add test for disabling lineage reconstruction in workflow Signed-off-by: Siyuan Zhuang <[email protected]> Signed-off-by: Stefan van der Kleij <[email protected]>

suquark force-pushed the retry_and_auto_recovery branch from 79748fc to ded4f00 Compare July 8, 2022 06:56

suquark marked this pull request as ready for review July 9, 2022 00:22

suquark requested review from ericl, fishbone and stephanie-wang as code owners July 9, 2022 00:22

suquark assigned stephanie-wang and fishbone Jul 9, 2022

suquark force-pushed the retry_and_auto_recovery branch from 517be01 to cf641a1 Compare July 9, 2022 00:29

fishbone reviewed Jul 14, 2022

python/ray/workflow/step_executor.py (resolved)

suquark requested a review from maxpumperla as a code owner July 15, 2022 00:10

suquark changed the title ~~[Workflow] Fix workflow task retry~~ [Workflow] Unify the semantics of max_retries of workflow task and Ray task Jul 15, 2022

suquark requested a review from fishbone July 15, 2022 00:16

fishbone reviewed Jul 15, 2022

fishbone added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 15, 2022

suquark requested a review from fishbone July 18, 2022 22:08

suquark removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 18, 2022

fishbone approved these changes Jul 18, 2022

suquark added 5 commits July 18, 2022 16:09

workflow task retry

251337e

Signed-off-by: Siyuan Zhuang <[email protected]>

move and enhance tests

52bc831

Signed-off-by: Siyuan Zhuang <[email protected]>

use "max_retries" of Ray task

f74d8e7

Signed-off-by: Siyuan Zhuang <[email protected]>

add test for disabling lineage reconstruction in workflow

3307020

Signed-off-by: Siyuan Zhuang <[email protected]>

typo

adc732e

Signed-off-by: Siyuan Zhuang <[email protected]>

suquark force-pushed the retry_and_auto_recovery branch from 1e502b0 to adc732e Compare July 18, 2022 23:09

suquark merged commit eb4ed49 into ray-project:master Jul 19, 2022

suquark deleted the retry_and_auto_recovery branch July 19, 2022 06:25
