[Data] Allow to specify application-level error to retry for actor task #42492
Conversation
Signed-off-by: Cheng Su <[email protected]>
python/ray/data/dataset.py (outdated diff)
ray for each map worker. This applies to Ray tasks and actors.
To request resource requirements for tasks launched by a Ray actor,
specify ``ray_actor_task_remote_args={...}`` inside
``ray_remote_args``.
IIUC, `ray_remote_args` is already used for creating the actors, but it also seems unintuitive to use a nested parameter. Putting the parameter at the top level seems better. WDYT?
Yes, agreed that the nested parameter is unintuitive. Right now we do not mention actor/task in the top-level parameter list for `map_batches`, so it seems not good to add a `ray_actor_task_remote_args` parameter that brings actor/task back. Another option is to use a Data config to specify which application-level errors to retry for actor tasks. WDYT?
If the current requirement is only to enable application-level error retry, I'd vote for adding a specific config in `DataContext`. This is easier for Data users to understand.
Updated, thanks.
Signed-off-by: Cheng Su <[email protected]>
Signed-off-by: Cheng Su <[email protected]>
self._ray_actor_task_remote_args = {}
actor_task_errors = DataContext.get_current().actor_task_retry_on_errors
if len(actor_task_errors) > 0:
    self._ray_actor_task_remote_args["retry_exceptions"] = actor_task_errors
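The pattern in the diff above (forwarding a configured list of exception types into the task's `retry_exceptions` remote arg, while preserving the Ray Core default when nothing is configured) can be sketched in plain Python. The helper name below is hypothetical, standing in for the Ray Data internals:

```python
def build_actor_task_remote_args(retry_on_errors):
    """Sketch: build the remote-args dict for actor tasks.

    `retry_on_errors` mirrors the DataContext.actor_task_retry_on_errors
    setting discussed in this PR: False (default, no application-level
    retries), True (retry on any exception), or a list of exception
    types, matching Ray Core's `retry_exceptions` semantics.
    """
    remote_args = {}
    # Only attach `retry_exceptions` when the user opted in, so the
    # Ray Core default (no application-level retries) is preserved.
    # A non-empty list and True are both truthy; False and [] are not.
    if retry_on_errors:
        remote_args["retry_exceptions"] = retry_on_errors
    return remote_args
```

For example, `build_actor_task_remote_args([ValueError])` yields `{"retry_exceptions": [ValueError]}`, while `build_actor_task_remote_args(False)` yields an empty dict.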
IIRC, `retry_exceptions` in Ray Core can be either True/False or a list of exception types, and defaults to False. Maybe let's keep the behavior the same in Data. Also remember to update the comments in `DataContext`.
Yes, updated.
Signed-off-by: Cheng Su <[email protected]>
…sk (ray-project#42492) A user reported that they cannot specify application-level exception retry for actor tasks (`retry_exceptions`), because our actor pool operator does not allow specifying Ray remote arguments for actor tasks. This PR adds a config, `DataContext.actor_task_retry_on_errors`, so users can control application-level exception retries. Signed-off-by: Cheng Su <[email protected]> Signed-off-by: khluu <[email protected]>
Why are these changes needed?
A user reported that they cannot specify application-level exception retry for actor tasks (`retry_exceptions`), because our actor pool operator does not allow specifying Ray remote arguments for actor tasks. This PR adds a config, `DataContext.actor_task_retry_on_errors`, so users can control application-level exception retries.
Related issue number
Checks
I've signed off every commit (git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
If I add a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.