
[placement groups/autoscaler] unfulfillable requests should raise an error #18018

Closed · krfricke opened this issue Aug 23, 2021 · 7 comments
Labels: bug (Something that is supposed to be working; but isn't), triage (Needs triage, e.g. priority, bug/not-bug, and owning component)

Comments

krfricke commented Aug 23, 2021

What is the problem?

Latest master.

This is a follow-up to item 3) from #18003. cc @AmeerHajAli

When requesting resources via placement groups that the autoscaler cannot fulfill, no error is raised. Additionally, nodes are still started to fulfill part of the requested resources.

Reproduction (REQUIRED)

If a custom resource is requested but no node type can fulfill it, nodes are still started for the resources requested in the other bundles.

Using this script:

import ray

ray.init(address="auto")


# If only the first bundle is passed, no nodes are started up.
# With all bundles, nodes are started up to fulfill the 2nd-9th bundles.
pgs = [
    ray.util.placement_group([{"CPU": 4., "custom": 1.}] + [{"CPU": 1.}] * 8)
    for _ in range(4)
]

ray.get([pg.ready() for pg in pgs])

and using this cluster config:

cluster_name: ray-tune-custom-resource-test

max_workers: 20
upscaling_speed: 20

idle_timeout_minutes: 0

docker:
    image: rayproject/ray:nightly
    container_name: ray_container
    pull_before_run: true

provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2a
    cache_stopped_nodes: false

available_node_types:
    cpu_2_ondemand:
        node_config:
            InstanceType: m5.large
        resources: {"CPU": 2}
        min_workers: 0
        max_workers: 10
    cpu_8_ondemand:
        node_config:
            InstanceType: m5.2xlarge
        resources: {"CPU": 8}
        min_workers: 0
        max_workers: 10

auth:
    ssh_user: ubuntu

head_node_type: cpu_2_ondemand
worker_default_node_type: cpu_2_ondemand

file_mounts: {
  "/test": "./"
}

Observed behavior:

  1. No error is thrown indicating that this placement group will never be ready (see the check below).
  2. Nodes are started to fulfill the resource requests of the 2nd-9th bundles (1 CPU each).
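
One way to confirm item 1 is to inspect the placement group table: the group simply stays pending and nothing is surfaced to the driver. A minimal check using the public placement group API (expected output annotated based on the behavior described above):

import ray
from ray.util import placement_group, placement_group_table

ray.init(address="auto")

# Same bundles as in the reproduction script above.
pg = placement_group([{"CPU": 4., "custom": 1.}] + [{"CPU": 1.}] * 8)

# The first bundle can never be placed, so the group never becomes ready;
# its state stays pending and no error is raised.
print(placement_group_table(pg)["state"])  # expected to print "PENDING"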

Expected behavior:

  1. An error should be thrown indicating that this request can never be satisfied (a sketch of such a check follows this list).
  2. Nodes for the child bundles should not be started. The placement group can never be ready, so no resources should be requested for any of its bundles at all.
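
A minimal sketch of the kind of feasibility check that could raise such an error, assuming the autoscaler knows each node type's advertised resources (bundle_fits_some_node_type and check_placement_group_feasible are hypothetical helpers, not existing autoscaler code):

from typing import Dict, List


def bundle_fits_some_node_type(
    bundle: Dict[str, float], node_types: Dict[str, Dict[str, float]]
) -> bool:
    # A bundle is feasible if at least one node type advertises enough of
    # every resource the bundle asks for.
    return any(
        all(resources.get(name, 0.0) >= amount for name, amount in bundle.items())
        for resources in node_types.values()
    )


def check_placement_group_feasible(
    bundles: List[Dict[str, float]], node_types: Dict[str, Dict[str, float]]
) -> None:
    # Placement groups are created atomically, so if any single bundle is
    # infeasible the whole group can never be scheduled; reject it up front
    # instead of partially scaling up for the feasible bundles.
    for i, bundle in enumerate(bundles):
        if not bundle_fits_some_node_type(bundle, node_types):
            raise RuntimeError(
                f"Bundle {i} ({bundle}) cannot be fulfilled by any configured "
                f"node type; this placement group will never be ready."
            )


# Node types from the cluster config above; bundle 0 asks for "custom",
# which no node type provides, so this raises RuntimeError.
node_types = {"cpu_2_ondemand": {"CPU": 2}, "cpu_8_ondemand": {"CPU": 8}}
check_placement_group_feasible(
    [{"CPU": 4., "custom": 1.}] + [{"CPU": 1.}] * 8, node_types
)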

cc @DmitriGekhtman @sasha-s

ericl commented Aug 23, 2021

There's some overlap with a usability issue we're looking at here: #15933.

We plan to prioritize that work, which could also fix this issue, though we're also open to the autoscaler team taking the lead on this.

DmitriGekhtman commented

Expected behavior item 2 seems like it should be implied by the advertised property that placement groups are created atomically.

DmitriGekhtman commented Aug 25, 2021

I could try to take a crack at fixing the autoscaling behavior (expected behavior item 2); I'll bug placement group autoscaling expert @wuisawesome if I get stuck.

wuisawesome commented

This looks similar to #17799 and #14908. Perhaps the autoscaler could raise an error, but option 2 seems preferable here, since one can always update the autoscaler config to make the node type feasible.
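
For illustration, with the hypothetical bundle_fits_some_node_type helper sketched earlier, adding a node type that advertises the custom resource (a made-up cpu_8_custom entry under available_node_types) would make the head bundle feasible:

# Hypothetical: extend the node types from the config above with one that
# provides the "custom" resource.
node_types = {
    "cpu_2_ondemand": {"CPU": 2},
    "cpu_8_ondemand": {"CPU": 8},
    "cpu_8_custom": {"CPU": 8, "custom": 1},
}
# The {"CPU": 4, "custom": 1} bundle now fits on cpu_8_custom.
assert bundle_fits_some_node_type({"CPU": 4., "custom": 1.}, node_types)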

DmitriGekhtman commented Aug 25, 2021

Updating the autoscaling config is a little touchy because:
(a) by default, ray up restarts Ray processes (you have to pass the --no-restart flag), and
(b) there are contexts (being vague here) in which it's not possible to update the cluster config at all.

sasha-s commented Aug 25, 2021

It seems clean to reject inconsistent configs/requests, ideally with a clear error message.

ericl commented Dec 7, 2021

Duplicates #18835

ericl closed this as completed Dec 7, 2021