
[placement groups/autoscaler] unfulfillable requests should raise an error #18018

Closed · krfricke opened this issue Aug 23, 2021 · 7 comments
Labels: bug (Something that is supposed to be working; but isn't), triage (Needs triage, e.g. priority, bug/not-bug, and owning component)

Comments

krfricke commented Aug 23, 2021

What is the problem?

Latest master.

This is a follow-up to item 3) from #18003. cc @AmeerHajAli

When requesting resources via placement groups that the autoscaler cannot fulfill, no error is raised. Additionally, nodes are still started to fulfill part of the requested resources.

Reproduction (REQUIRED)

If a custom resource is requested but no node type can fulfill it, nodes are still started for the resources requested in the other bundles.

Using this script:

import ray

ray.init(address="auto")


# If only the first bundle is passed, no nodes are started up.
# With all bundles, nodes are started up to fulfill the 2nd-9th bundles.
pgs = [
    ray.util.placement_group([{"CPU": 4., "custom": 1.}] + [{"CPU": 1.}] * 8)
    for _ in range(4)
]

ray.get([pg.ready() for pg in pgs])

and using this cluster config:

cluster_name: ray-tune-custom-resource-test

max_workers: 20
upscaling_speed: 20

idle_timeout_minutes: 0

docker:
    image: rayproject/ray:nightly
    container_name: ray_container
    pull_before_run: true

provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2a
    cache_stopped_nodes: false

available_node_types:
    cpu_2_ondemand:
        node_config:
            InstanceType: m5.large
        resources: {"CPU": 2}
        min_workers: 0
        max_workers: 10
    cpu_8_ondemand:
        node_config:
            InstanceType: m5.2xlarge
        resources: {"CPU": 8}
        min_workers: 0
        max_workers: 10

auth:
    ssh_user: ubuntu

head_node_type: cpu_2_ondemand
worker_default_node_type: cpu_2_ondemand

file_mounts: {
  "/test": "./"
}

Observed behavior:

  1. No error is thrown indicating that this placement group will never be ready (see the check below).
  2. Nodes are started to fulfill the resource requests of the 2nd-9th bundles (1 CPU each).
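
One way to confirm item 1 is to inspect the placement group table: the group simply stays pending and nothing is surfaced to the driver. A minimal check using the public placement group API (expected output annotated based on the behavior described above):

import ray
from ray.util import placement_group, placement_group_table

ray.init(address="auto")

# Same bundles as in the reproduction script above.
pg = placement_group([{"CPU": 4., "custom": 1.}] + [{"CPU": 1.}] * 8)

# The first bundle can never be placed, so the group never becomes ready;
# its state stays pending and no error is raised.
print(placement_group_table(pg)["state"])  # expected to print "PENDING"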

Expected behavior:

  1. An error should be thrown indicating that this request can never be satisfied (a sketch of such a check follows this list).
  2. Nodes for the child bundles should not be started. The placement group can never be ready, so no resources should be requested for any of its bundles at all.
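
A minimal sketch of the kind of feasibility check that could raise such an error, assuming the autoscaler knows each node type's advertised resources (bundle_fits_some_node_type and check_placement_group_feasible are hypothetical helpers, not existing autoscaler code):

from typing import Dict, List


def bundle_fits_some_node_type(
    bundle: Dict[str, float], node_types: Dict[str, Dict[str, float]]
) -> bool:
    # A bundle is feasible if at least one node type advertises enough of
    # every resource the bundle asks for.
    return any(
        all(resources.get(name, 0.0) >= amount for name, amount in bundle.items())
        for resources in node_types.values()
    )


def check_placement_group_feasible(
    bundles: List[Dict[str, float]], node_types: Dict[str, Dict[str, float]]
) -> None:
    # Placement groups are created atomically, so if any single bundle is
    # infeasible the whole group can never be scheduled; reject it up front
    # instead of partially scaling up for the feasible bundles.
    for i, bundle in enumerate(bundles):
        if not bundle_fits_some_node_type(bundle, node_types):
            raise RuntimeError(
                f"Bundle {i} ({bundle}) cannot be fulfilled by any configured "
                f"node type; this placement group will never be ready."
            )


# Node types from the cluster config above; bundle 0 asks for "custom",
# which no node type provides, so this raises RuntimeError.
node_types = {"cpu_2_ondemand": {"CPU": 2}, "cpu_8_ondemand": {"CPU": 8}}
check_placement_group_feasible(
    [{"CPU": 4., "custom": 1.}] + [{"CPU": 1.}] * 8, node_types
)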

cc @DmitriGekhtman @sasha-s

ericl commented Aug 23, 2021

There's some overlap with a usability issue we're looking at here: #15933.

We plan to prioritize that work, which could also fix this issue, though we're also open to the autoscaler team taking the lead on this.

DmitriGekhtman commented

Expected behavior item 2 seems like it should be implied by the advertised property that placement groups are created atomically.

DmitriGekhtman commented Aug 25, 2021

I could try to take a crack at fixing the autoscaling behavior (expected behavior item 2); I'll bug placement group autoscaling expert @wuisawesome if I get stuck.

wuisawesome commented

This looks similar to #17799 and #14908. Perhaps the autoscaler could raise an error, but option 2 seems preferable here, since one can always update the autoscaler config to make the node type feasible.
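
For illustration, with the hypothetical bundle_fits_some_node_type helper sketched earlier, adding a node type that advertises the custom resource (a made-up cpu_8_custom entry under available_node_types) would make the head bundle feasible:

# Hypothetical: extend the node types from the config above with one that
# provides the "custom" resource.
node_types = {
    "cpu_2_ondemand": {"CPU": 2},
    "cpu_8_ondemand": {"CPU": 8},
    "cpu_8_custom": {"CPU": 8, "custom": 1},
}
# The {"CPU": 4, "custom": 1} bundle now fits on cpu_8_custom.
assert bundle_fits_some_node_type({"CPU": 4., "custom": 1.}, node_types)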

DmitriGekhtman commented Aug 25, 2021

Updating the autoscaling config is a little touchy because:
(a) by default, ray up restarts Ray processes (you have to pass the --no-restart flag), and
(b) there are contexts (being vague here) in which it's not possible to update the cluster config at all.

sasha-s commented Aug 25, 2021

It seems clean to reject inconsistent configs/requests, ideally with a clear error message.

ericl commented Dec 7, 2021

Duplicates #18835

ericl closed this as completed Dec 7, 2021