Raise an Exception when a task, actor, or placement group is permanently infeasible #18835

Open
ericl opened this issue Sep 23, 2021 · 8 comments
Labels: core (Issues that should be addressed in Ray Core); observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling); P1 (Issue that should be fixed within a few weeks); size:large; usability
Comments

@ericl
Contributor

ericl commented Sep 23, 2021

After #18724, we should be able to raise exceptions when a task, actor, or placement group becomes permanently infeasible. To do this, the autoscaler can periodically publish a list of "permanently infeasible" resource demands to the GCS via an RPC, and this list can be distributed across the cluster in resource poll requests from the GCS.
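As a rough illustration of what the autoscaler would compute before publishing, here is a minimal sketch of the "permanently infeasible" check. The function names and the `node_types` layout are hypothetical and are not Ray's autoscaler code; the sketch also ignores placement group bundles and per-type worker limits.

```python
from typing import Dict, List

ResourceDict = Dict[str, float]


def fits(demand: ResourceDict, capacity: ResourceDict) -> bool:
    """Return True if a single node with `capacity` could hold `demand`."""
    return all(capacity.get(name, 0) >= amount for name, amount in demand.items())


def permanently_infeasible(
    demands: List[ResourceDict], node_types: Dict[str, ResourceDict]
) -> List[ResourceDict]:
    """Return the demands that no configured node type could ever satisfy."""
    return [
        demand
        for demand in demands
        if not any(fits(demand, capacity) for capacity in node_types.values())
    ]


# Hypothetical cluster config: resources offered by each node type.
node_types = {
    "cpu_worker": {"CPU": 16},
    "gpu_worker": {"CPU": 8, "GPU": 4},
}
demands = [{"CPU": 4}, {"GPU": 999}, {"CPU": 8, "GPU": 2}]

# This is the list the autoscaler would publish to the GCS.
infeasible = permanently_infeasible(demands, node_types)
print(infeasible)  # [{'GPU': 999}]
```

A demand that fits some node type is only pending, not infeasible, so it stays out of the published list even when no such node is currently running.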

The errors raised would be as follows:

import ray

@ray.remote(num_gpus=999)
class A:
    def f(self):
        pass

@ray.remote(num_gpus=999)
def f():
    pass

ray.get(f.remote())
# -> raises UnschedulableError("No available node types can fulfill request {"GPU": 999}.")

a = A.remote()
ray.get(a.f.remote())
# -> raises UnschedulableError("No available node types can fulfill request {"GPU": 999}.")

# Placement groups are created through ray.util.
pg = ray.util.placement_group([{"GPU": 999}])
ray.get(pg.ready())
# -> raises UnschedulableError("The cluster configuration cannot fulfill [{"GPU": 999}].")

ray.get(f.options(placement_group=pg).remote())
# -> raises UnschedulableError("The cluster configuration cannot fulfill [{"GPU": 999}].")
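To make the caller's side concrete, here is a small sketch that runs without a cluster. `UnschedulableError` is a local stand-in for the error proposed in this issue, and `check_schedulable` and `node_types` are hypothetical helpers used only to produce the message; none of them are Ray APIs.

```python
import json


class UnschedulableError(Exception):
    """Local stand-in for the error proposed in this issue."""


def check_schedulable(request, node_types):
    """Raise UnschedulableError if no node type can hold `request`."""
    for capacity in node_types.values():
        if all(capacity.get(name, 0) >= amount for name, amount in request.items()):
            return
    raise UnschedulableError(
        f"No available node types can fulfill request {json.dumps(request)}."
    )


# Hypothetical cluster config with a single GPU node type.
node_types = {"gpu_worker": {"CPU": 8, "GPU": 4}}

try:
    check_schedulable({"GPU": 999}, node_types)
    message = None
except UnschedulableError as err:
    # The caller fails fast here instead of waiting forever on ray.get().
    message = str(err)

print(message)
```

The point of the proposal is the `except` branch: today the equivalent `ray.get()` call would hang rather than surface an error the caller can act on.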

PRD doc: https://docs.google.com/document/d/1OT6m4xQDN8UtsBgnAMpX6nhXpNAfdeHJVve-iGhw1WI/edit

cc @edoakes @scv119 @richardliaw @stephanie-wang @rkooo567

@ericl added the P1, usability, and size:medium labels Sep 23, 2021
@ericl added this to the Core Backlog milestone Sep 23, 2021
@ericl changed the title from "Raise an Exception of a task, actor, or placement groups is permanently infeasible" to "Raise an Exception when a task, actor, or placement group is permanently infeasible" Sep 23, 2021
@rkooo567
Contributor

pg = ray.util.placement_group([{"GPU": 999}])
ray.get(pg.ready())
# -> raises UnschedulableError("The cluster configuration cannot fulfill [{"GPU": 999}].")

@ericl do you also plan to mark "placement groups that are already deleted" as infeasible?

@ericl
Contributor Author

ericl commented Sep 27, 2021 via email

@rkooo567
Contributor

rkooo567 commented Nov 1, 2021

Btw, when do we plan to do this task?

@ericl
Contributor Author

ericl commented Dec 7, 2021

@rkooo567 you're currently working on a design for this, right? cc @richardliaw

@rkooo567
Contributor

rkooo567 commented Dec 7, 2021

Yes. I've been putting it off a little (focusing on other tasks first) because of the feature freeze, but I'm planning to have a concrete proposal by the end of this sprint and to work on it next sprint.

@rkooo567
Contributor

The initial draft: https://docs.google.com/document/d/158cRgisVt6JZ55ARckrRHDLjLJ6YA_9lnzN8bp9F6vU/edit

@xwjiang2010
Contributor

Hi @rkooo567, is there an update on this ticket? We've had some feedback that the insufficient-resource messages can be confusing when the autoscaler is enabled.

@ericl
Contributor Author

ericl commented Mar 18, 2022

I think this is blocked on pending work from @wuisawesome refactoring the autoscaler interfaces.

@scottsun94 added the observability label Oct 17, 2022
@rkooo567 removed their assignment Dec 8, 2022
@rkooo567 added the core label Dec 9, 2022