Raise an Exception when a task, actor, or placement group is permanently infeasible #18835

ericl · 2021-09-23T00:10:50Z

After #18724, we should be able to raise exceptions when a task, actor, or placement group becomes permanently infeasible. To do this, the autoscaler can periodically publish a list of "permanently infeasible" resource demands to the GCS via an RPC, and this list can be distributed across the cluster in resource poll requests from the GCS.
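As a rough illustration of what "permanently infeasible" could mean here, the sketch below treats a demand as infeasible when no configured node type could satisfy it on a single node. The names `fits`, `permanently_infeasible`, and the `node_types` layout are hypothetical and not taken from the Ray autoscaler; the RPC to the GCS is not modeled.

```python
from typing import Dict, List

ResourceDict = Dict[str, float]


def fits(demand: ResourceDict, capacity: ResourceDict) -> bool:
    """Return True if a single node with `capacity` can hold `demand`."""
    return all(capacity.get(name, 0) >= amount for name, amount in demand.items())


def permanently_infeasible(
    demands: List[ResourceDict],
    node_types: Dict[str, ResourceDict],
) -> List[ResourceDict]:
    """Return the demands that no configured node type can ever satisfy."""
    return [
        demand
        for demand in demands
        if not any(fits(demand, capacity) for capacity in node_types.values())
    ]


# Hypothetical cluster config: node type name -> resources of one such node.
node_types = {
    "cpu_worker": {"CPU": 16},
    "gpu_worker": {"CPU": 8, "GPU": 4},
}
demands = [{"CPU": 1}, {"GPU": 999}, {"CPU": 8, "GPU": 2}]

# This is the list the autoscaler would publish in the proposal above.
infeasible = permanently_infeasible(demands, node_types)
print(infeasible)
```

A real implementation would also have to account for placement group strategies and node-count limits, which this sketch ignores.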

The errors raised would be as follows:

import ray

@ray.remote(num_gpus=999)
class A:
    def f(self):
        pass

@ray.remote(num_gpus=999)
def f():
    pass

ray.get(f.remote())
# -> raises UnschedulableError("No available node types can fulfill request {"GPU": 999}.")

a = A.remote()
ray.get(a.f.remote())
# -> raises UnschedulableError("No available node types can fulfill request {"GPU": 999}.")

pg = ray.placement_group([{"GPU": 999}])
ray.get(pg.ready())
# -> raises UnschedulableError("The cluster configuration cannot fulfill [{"GPU": 999}].")

ray.get(f.options(placement_group=pg).remote())
# -> raises UnschedulableError("The cluster configuration cannot fulfill [{"GPU": 999}].")
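For illustration only, here is one way the receiving side might turn the distributed list into the error shown above. `UnschedulableError` does not exist in Ray at the time of this issue, so it is defined locally as a stand-in, and `check_schedulable` is a hypothetical helper, not a Ray API.

```python
import json
from typing import Dict, List

ResourceDict = Dict[str, float]


class UnschedulableError(Exception):
    """Stand-in for the proposed exception; not an existing Ray class."""


def check_schedulable(
    request: ResourceDict,
    infeasible_demands: List[ResourceDict],
) -> None:
    """Raise if `request` appears in the published infeasible list."""
    if request in infeasible_demands:
        raise UnschedulableError(
            "No available node types can fulfill request "
            f"{json.dumps(request)}."
        )


# Assume this list arrived via a resource poll from the GCS.
published = [{"GPU": 999}]

try:
    check_schedulable({"GPU": 999}, published)
except UnschedulableError as err:
    print(f"caught: {err}")
```

Callers would then be able to catch the error around `ray.get(...)` instead of waiting indefinitely on a request that can never be scheduled.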

PRD doc: https://docs.google.com/document/d/1OT6m4xQDN8UtsBgnAMpX6nhXpNAfdeHJVve-iGhw1WI/edit

cc @edoakes @scv119 @richardliaw @stephanie-wang @rkooo567


rkooo567 · 2021-09-27T12:49:32Z

pg = ray.placement_group([{"GPU": 999}])
ray.get(pg.ready())
# -> raises UnschedulableError("The cluster configuration cannot fulfill [{"GPU": 999}].")

@ericl do you also plan to mark "placement groups that are already deleted" as infeasible?

ericl · 2021-09-27T17:28:32Z

Yes we should do that too.


rkooo567 · 2021-11-01T13:58:42Z

Btw, when do we plan to do this task?

ericl · 2021-12-07T22:04:58Z

@rkooo567 you're currently working on a design for this, right? cc @richardliaw

rkooo567 · 2021-12-07T22:06:11Z

Yes. I've been putting it off a little (focusing on other tasks first) because of the feature freeze, but I plan to have a concrete proposal by the end of this sprint and work on it next sprint.

rkooo567 · 2021-12-23T01:42:21Z

The initial draft: https://docs.google.com/document/d/158cRgisVt6JZ55ARckrRHDLjLJ6YA_9lnzN8bp9F6vU/edit

xwjiang2010 · 2022-03-18T23:49:57Z

Hi @rkooo567, is there an update on this ticket? We have received feedback that the insufficient-resource messages can be confusing when the autoscaler is enabled.

ericl · 2022-03-18T23:53:21Z

I think this is blocked on pending work from @wuisawesome refactoring the autoscaler interfaces.

ericl added the P1 (Issue that should be fixed within a few weeks), usability, and size:medium labels Sep 23, 2021

ericl added this to the Core Backlog milestone Sep 23, 2021

ericl changed the title ~~Raise an Exception of a task, actor, or placement groups is permanently infeasible~~ Raise an Exception when a task, actor, or placement group is permanently infeasible Sep 23, 2021

rkooo567 mentioned this issue Nov 1, 2021

[Placement group] Refine Remove API #10232

Open

This was referenced Nov 4, 2021

[Core] have the autoscaler raise the error if the placement group is infeasible #20043

Closed

[Tune] test_api.py::testBuiltInTrainableResources failing after using PG path #19985

Closed

ericl added size:large and removed size:medium labels Nov 12, 2021

rkooo567 self-assigned this Nov 26, 2021

ericl mentioned this issue Dec 7, 2021

[placement groups/autoscaler] unfulfillable requests should raise an error #18018

Closed

scottsun94 added the observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling) label Oct 17, 2022

rkooo567 removed their assignment Dec 8, 2022

rkooo567 added the core (Issues that should be addressed in Ray Core) label Dec 9, 2022
