[Data] Allow to specify application-level error to retry for actor task #42492
Conversation
Signed-off-by: Cheng Su <[email protected]>
python/ray/data/dataset.py (outdated diff)
ray for each map worker. This applies to Ray tasks and actors.
To request resource requirements for tasks launched by a Ray actor,
specify ``ray_actor_task_remote_args={...}`` inside
``ray_remote_args``.
IIUC, `ray_remote_args` is already used for creating the actors, but it also seems unintuitive to use a nested parameter. Putting the parameter at the top level seems better. WDYT?
Yes, agreed that the nested parameter is unintuitive. Right now we do not mention actor/task in the top-level parameter list for `map_batches`, so it seems not good to add a `ray_actor_task_remote_args` parameter that brings actor/task back. Another option is to use a Data config to specify which application-level errors to retry for actor tasks. WDYT?
If the current requirement is only to enable application-level error retry, I'd vote for adding a specific config in `DataContext`. This is easier for Data users to understand.
Updated, thanks.
Signed-off-by: Cheng Su <[email protected]>
Signed-off-by: Cheng Su <[email protected]>
self._ray_actor_task_remote_args = {}
actor_task_errors = DataContext.get_current().actor_task_retry_on_errors
if len(actor_task_errors) > 0:
    self._ray_actor_task_remote_args["retry_exceptions"] = actor_task_errors
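The pattern in the diff above (forwarding a configured list of exception types into the task's `retry_exceptions` remote arg, while preserving the Ray Core default when nothing is configured) can be sketched in plain Python. The helper name below is hypothetical, standing in for the Ray Data internals:

```python
def build_actor_task_remote_args(retry_on_errors):
    """Sketch: build the remote-args dict for actor tasks.

    `retry_on_errors` mirrors the DataContext.actor_task_retry_on_errors
    setting discussed in this PR: False (default, no application-level
    retries), True (retry on any exception), or a list of exception
    types, matching Ray Core's `retry_exceptions` semantics.
    """
    remote_args = {}
    # Only attach `retry_exceptions` when the user opted in, so the
    # Ray Core default (no application-level retries) is preserved.
    # A non-empty list and True are both truthy; False and [] are not.
    if retry_on_errors:
        remote_args["retry_exceptions"] = retry_on_errors
    return remote_args
```

For example, `build_actor_task_remote_args([ValueError])` yields `{"retry_exceptions": [ValueError]}`, while `build_actor_task_remote_args(False)` yields an empty dict.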
IIRC, `retry_exceptions` in Ray Core can be either True/False or a list of exception types, and defaults to False. Maybe let's keep the behavior the same in Data. Also remember to update the comments in `DataContext`.
Yes, updated.
Signed-off-by: Cheng Su <[email protected]>
…sk (ray-project#42492) A user reported that they cannot specify application-level exception retry for actor tasks (`retry_exceptions`), because our actor pool operator does not allow specifying Ray remote arguments for actor tasks. This PR adds a config, `DataContext.actor_task_retry_on_errors`, so users can control application-level exception retries. Signed-off-by: Cheng Su <[email protected]> Signed-off-by: khluu <[email protected]>
Why are these changes needed?
A user reported that they cannot specify application-level exception retry for actor tasks (`retry_exceptions`), because our actor pool operator does not allow specifying Ray remote arguments for actor tasks. This PR adds a config, `DataContext.actor_task_retry_on_errors`, so users can control application-level exception retries.
Related issue number
Checks
I've signed off every commit (git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
If I add a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.