-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Datasets] Emit warning when starting Dataset execution with no CPU resources available #31574
Conversation
Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
"are freed up. A common reason is that cluster resources are " | ||
"used by Actors or Tune trials, see the link below for more details:" | ||
"https://docs.ray.io/en/master/data/dataset-internals.html#datasets-and-tune" # noqa: E501 | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add a unit test for this warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ericl I added a note in the PR description about my attempts at unit testing -- let me know if you have suggestions on how to get around the issue I was facing, or any other questions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, I realized that unit testing this would be challenging, because the test will get stuck on ds.take(). This is because since the initial ray cluster is set up with no CPU resources, and there's no way to add more resources mid-execution, it will always get stuck in the ray.available_resources().get("CPU", None) is None case and be unable to finish executing the dataset.
You can run it in a separate thread as a workaround.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ran into some trouble using the threading
library, but I was able to alternatively leverage Mock
and raising a dummy exception via Mock.side_effect
to write two simplified unit tests.
Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
python/ray/data/_internal/plan.py
Outdated
@@ -335,9 +335,16 @@ def execute( | |||
Returns: | |||
The blocks of the output dataset. | |||
""" | |||
context = DatasetContext.get_current() | |||
if not ray.available_resources().get("CPU"): | |||
logger.get_logger(log_to_stdout=context.enable_auto_log_stats).warning( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This message is not stats, it seems enable_auto_log_stats
is not an applicable flag.
Also, should we by default not log to stdout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the catch, removed here. Yes, by default, the logger will log to stdout. There are some cases within the DatasetStats
where we wanted to avoid cluttering stdout with long outputs, since they are always written to the dataset log file. (This is not one of those cases, as we definitely want the warning in stdout)
Seems have CI test failure here. |
Yup, almost done fixing this and will push soon, thanks. |
Signed-off-by: Scott Lee <[email protected]>
…esources available (#31574) The existing Ray/Ray autoscaler warnings for a cluster with no available CPU resources is pretty generic, and can be difficult for users to understand and debug. We check at the start of Dataset execution if there are no available CPU resources, and emit a more detailed warning message with a [helpful link to documentation](https://docs.ray.io/en/master/data/dataset-internals.html#datasets-and-tune).
Signed-off-by: Scott Lee [email protected]
Why are these changes needed?
The existing Ray/Ray autoscaler warnings for a cluster with no available CPU resources is pretty generic, and can be difficult for users to understand and debug. We check at the start of Dataset execution if there are no available CPU resources, and emit a more detailed warning message with a helpful link to documentation.
Related issue number
Closes #31507
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.