-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Workflow] Unify the semantics of max_retries of workflow task and Ray task #26350
Conversation
79748fc
to
ded4f00
Compare
517be01
to
cf641a1
Compare
(force push before reviewing to get rid of some CI failures from upstream) |
@@ -309,6 +318,7 @@ async def _post_process_ready_task( | |||
output_ref: WorkflowRef, | |||
) -> None: | |||
state = self._state | |||
state.task_retries.pop(task_id, None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok for now, but I feel this is generally easy to make mistake by just forgetting the cleanup here.
@@ -269,6 +266,18 @@ async def _handle_ready_task( | |||
f"[{workflow_id}@{task_id}]" | |||
) | |||
|
|||
# ---------------------- retry the task ---------------------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I never realize it's so easy to do it.
from ray import workflow | ||
|
||
|
||
def test_step_failure(workflow_start_regular_shared, tmp_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have test to make sure the task is not rerun by ray, like the data is loaded from storage and not reconstructed by the lineage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not really. is there a test example in Ray? I feel it is a bit tricky to trigger lineage reconstruction in a unittest -- this requires deleting an object in object store (so the task could be actually reconstructed). do you have any ideas (do we have related helper functions)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can do this:
- a cluster with two nodes
- kill the node where the object ref is generated (thus the data is gone). write a flag to fs indicate it only run once.
- task does not rerun
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LG! Some comments.
@iycheng The workflow does not support 'scheduling_strategy' that is not son-serializable as Ray task options. So I just create the test and skip it. We can enable it later. |
Signed-off-by: Siyuan Zhuang <[email protected]>
Signed-off-by: Siyuan Zhuang <[email protected]>
Signed-off-by: Siyuan Zhuang <[email protected]>
Signed-off-by: Siyuan Zhuang <[email protected]>
Signed-off-by: Siyuan Zhuang <[email protected]>
1e502b0
to
adc732e
Compare
force update for DCO |
The CI failure seems unrelated. I'll merge it. |
…y task (ray-project#26350) * workflow task retry Signed-off-by: Siyuan Zhuang <[email protected]> * move and enhance tests Signed-off-by: Siyuan Zhuang <[email protected]> * use "max_retries" of Ray task Signed-off-by: Siyuan Zhuang <[email protected]> * add test for disabling lineage reconstruction in workflow Signed-off-by: Siyuan Zhuang <[email protected]> Signed-off-by: Xiaowei Jiang <[email protected]>
…y task (ray-project#26350) * workflow task retry Signed-off-by: Siyuan Zhuang <[email protected]> * move and enhance tests Signed-off-by: Siyuan Zhuang <[email protected]> * use "max_retries" of Ray task Signed-off-by: Siyuan Zhuang <[email protected]> * add test for disabling lineage reconstruction in workflow Signed-off-by: Siyuan Zhuang <[email protected]> Signed-off-by: Stefan van der Kleij <[email protected]>
Why are these changes needed?
Previously the semantics of
max_retries
of workflow tasks are different from Ray tasks, because the semantics ofmax_retries
of Ray tasks was not correct. Now Ray fixed the semantics ofmax_retries
, so we can unify it with the workflow. This gets rid ofmax_retries
in workflow options and we can just usemax_retries
in Ray options.This PR fixes three things:
max_retries
.Test
test_step_failure
is enhanced to reflect the changes of this PR.Later we should align our semantics with #25896
Checks
scripts/format.sh
to lint the changes in this PR.