Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Datasets] Emit warning when starting Dataset execution with no CPU resources available #31574

Merged
merged 9 commits into from
Jan 12, 2023

Conversation

scottjlee
Copy link
Contributor

@scottjlee scottjlee commented Jan 10, 2023

Signed-off-by: Scott Lee [email protected]

Why are these changes needed?

The existing Ray/Ray autoscaler warnings for a cluster with no available CPU resources is pretty generic, and can be difficult for users to understand and debug. We check at the start of Dataset execution if there are no available CPU resources, and emit a more detailed warning message with a helpful link to documentation.

Related issue number

Closes #31507

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ericl ericl self-assigned this Jan 10, 2023
@scottjlee scottjlee marked this pull request as draft January 10, 2023 22:32
@scottjlee scottjlee marked this pull request as ready for review January 11, 2023 01:52
python/ray/data/_internal/plan.py Outdated Show resolved Hide resolved
"are freed up. A common reason is that cluster resources are "
"used by Actors or Tune trials, see the link below for more details:"
"https://docs.ray.io/en/master/data/dataset-internals.html#datasets-and-tune" # noqa: E501
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a unit test for this warning?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ericl I added a note in the PR description about my attempts at unit testing -- let me know if you have suggestions on how to get around the issue I was facing, or any other questions

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I realized that unit testing this would be challenging, because the test will get stuck on ds.take(). This is because since the initial ray cluster is set up with no CPU resources, and there's no way to add more resources mid-execution, it will always get stuck in the ray.available_resources().get("CPU", None) is None case and be unable to finish executing the dataset.

You can run it in a separate thread as a workaround.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran into some trouble using the threading library, but I was able to alternatively leverage Mock and raising a dummy exception via Mock.side_effect to write two simplified unit tests.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 11, 2023
Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
@@ -335,9 +335,16 @@ def execute(
Returns:
The blocks of the output dataset.
"""
context = DatasetContext.get_current()
if not ray.available_resources().get("CPU"):
logger.get_logger(log_to_stdout=context.enable_auto_log_stats).warning(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This message is not stats, it seems enable_auto_log_stats is not an applicable flag.
Also, should we by default not log to stdout?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the catch, removed here. Yes, by default, the logger will log to stdout. There are some cases within the DatasetStats where we wanted to avoid cluttering stdout with long outputs, since they are always written to the dataset log file. (This is not one of those cases, as we definitely want the warning in stdout)

@c21
Copy link
Contributor

c21 commented Jan 11, 2023

Seems have CI test failure here.

@scottjlee
Copy link
Contributor Author

Seems have CI test failure here.

Yup, almost done fixing this and will push soon, thanks.

Signed-off-by: Scott Lee <[email protected]>
@scottjlee scottjlee removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 12, 2023
@ericl ericl merged commit 6d00593 into ray-project:master Jan 12, 2023
AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023
…esources available (#31574)

The existing Ray/Ray autoscaler warnings for a cluster with no available CPU resources is pretty generic, and can be difficult for users to understand and debug. We check at the start of Dataset execution if there are no available CPU resources, and emit a more detailed warning message with a [helpful link to documentation](https://docs.ray.io/en/master/data/dataset-internals.html#datasets-and-tune).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[data] Should warn if no CPU resources are available at the start of execution.
5 participants