Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Datasets] Handle all Ray errors in task compute strategy. #30696

Conversation

clarkzinzow
Copy link
Contributor

If a Ray error other than a RayTaskError is raised

  1. while waiting for map tasks to finish, then we won't properly cancel other map tasks,
  2. while waiting for map tasks to be cancelled, then we won't raise the correct (original) task failure exception.

This PR fixes this by catching RayError in both cases, which is a superclass of RayTaskError, TaskCancelledError, and WorkerCrashedError (which may be raised during graceful task cancellation due to a Ray Core bug). This should deflake the test_select_columns test.

Related issue number

Closes #30644

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Copy link
Contributor

@c21 c21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to add a unit test for this?

@clarkzinzow
Copy link
Contributor Author

clarkzinzow commented Nov 28, 2022

Confirmed that this deflakes the test.

@c21 I'm attempting to add a test that parametrizes test_dataset_retry_exceptions such that it crashes the worker (sys.exit() or os._exit(0)) to directly test this, but the existing test keeps segfaulting on actor destruction locally... I've been running into this issue with actor garbage collection when running tests locally for a while.

def test_dataset_retry_exceptions(ray_start_regular, local_path):

@clarkzinzow
Copy link
Contributor Author

Test failures are unrelated, merging.

@clarkzinzow clarkzinzow merged commit d716b17 into ray-project:master Nov 29, 2022
clarkzinzow added a commit to clarkzinzow/ray that referenced this pull request Nov 29, 2022
…ct#30696)

If a Ray error other than a RayTaskError is raised
1. while waiting for map tasks to finish, then we won't properly cancel other map tasks,
2. while waiting for map tasks to be cancelled, then we won't raise the correct (original) task failure exception.

This PR fixes this by catching RayError in both cases, which is a superclass of RayTaskError, TaskCancelledError, and WorkerCrashedError (which may be raised during graceful task cancellation due to a Ray Core bug). This should deflake the test_select_columns test.
clarkzinzow added a commit that referenced this pull request Nov 29, 2022
…30720)

If a Ray error other than a RayTaskError is raised
1. while waiting for map tasks to finish, then we won't properly cancel other map tasks,
2. while waiting for map tasks to be cancelled, then we won't raise the correct (original) task failure exception.

This PR fixes this by catching RayError in both cases, which is a superclass of RayTaskError, TaskCancelledError, and WorkerCrashedError (which may be raised during graceful task cancellation due to a Ray Core bug). This should deflake the test_select_columns test.
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
…ct#30696)

If a Ray error other than a RayTaskError is raised
1. while waiting for map tasks to finish, then we won't properly cancel other map tasks,
2. while waiting for map tasks to be cancelled, then we won't raise the correct (original) task failure exception.

This PR fixes this by catching RayError in both cases, which is a superclass of RayTaskError, TaskCancelledError, and WorkerCrashedError (which may be raised during graceful task cancellation due to a Ray Core bug). This should deflake the test_select_columns test.

Signed-off-by: Weichen Xu <[email protected]>
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
…ct#30696)

If a Ray error other than a RayTaskError is raised
1. while waiting for map tasks to finish, then we won't properly cancel other map tasks,
2. while waiting for map tasks to be cancelled, then we won't raise the correct (original) task failure exception.

This PR fixes this by catching RayError in both cases, which is a superclass of RayTaskError, TaskCancelledError, and WorkerCrashedError (which may be raised during graceful task cancellation due to a Ray Core bug). This should deflake the test_select_columns test.

Signed-off-by: tmynn <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ci] test_dataset is failing on release branch
5 participants