[Core] Tasks/actors with no resource requirements are not scheduled using placement groups as expected #31034

Closed
Yard1 opened this issue Dec 12, 2022 · 2 comments
Labels
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
triage: Needs triage (eg: priority, bug/not-bug, and owning component)

Comments

@Yard1
Member

Yard1 commented Dec 12, 2022

What happened + What you expected to happen

I create an actor that is scheduled with a placement group scheduling strategy (the placement group uses STRICT_PACK) and placement_group_capture_child_tasks=True. The actor spawns several tasks that have no resource requirements (num_cpus=0). Because the strategy is STRICT_PACK, I expect those tasks to run on the same node as the actor. Instead, they run on arbitrary nodes. This happens only when the tasks have no resource requirements.

Versions / Dependencies

master (df13a1d)

Reproduction script

import ray
import ray.util
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy


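@ray.remote
def task(ip):
    # `ip` is the node IP of the actor that spawned this task.
    node_ip = ray.util.get_node_ip_address()
    print(ip, node_ip)
    # Fails if the task landed on a different node than the actor.
    assert ip == node_ip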


@ray.remote
class Actor:
    def __init__(self, task_cpus):
        self.task_cpus = task_cpus

    def run(self):
        task_with_cpus_set = task.options(num_cpus=self.task_cpus)
        ip = ray.util.get_node_ip_address()
        tasks = [task_with_cpus_set.remote(ip) for _ in range(32)]
        ray.get(tasks)


pg = placement_group([{"CPU": 1}] * 8, strategy="STRICT_PACK")
ray.get(pg.ready())

ActorWithPlacementGroup = Actor.options(
    scheduling_strategy=PlacementGroupSchedulingStrategy(
        placement_group=pg, placement_group_capture_child_tasks=True
    )
)

# works: tasks with num_cpus=1 stay on the actor's node
actor = ActorWithPlacementGroup.remote(task_cpus=1)
ray.get(actor.run.remote())

print("fail")

# fails: tasks with num_cpus=0 run on arbitrary nodes
actor = ActorWithPlacementGroup.remote(task_cpus=0)
ray.get(actor.run.remote())

Run on a cluster with multiple nodes, each with <= 8 CPUs (e.g., https://console.anyscale-staging.com/o/anyscale-internal/workspaces/expwrk_j4rphgl1yttb36lhzw59svlf/ses_subxaevyczw11qxaylf3m9qa).
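
The script above stops at the first failed assertion, so it does not show how widely the tasks spread. One way to see that would be to have each task return its node IP instead of asserting, and tally the results on the caller's side. The following is only a minimal sketch under that assumption: `summarize_placement` is a hypothetical helper, not part of Ray, and the addresses are made up for illustration.

```python
from collections import Counter


def summarize_placement(actor_ip, task_ips):
    """Split task placements into on-node and off-node counts.

    Hypothetical helper, not part of Ray. `task_ips` is assumed to be the
    list of node IPs that the spawned tasks reported.
    """
    counts = Counter(task_ips)
    # Number of tasks that ran on the same node as the actor.
    on_node = counts.get(actor_ip, 0)
    # Per-node counts for tasks that ran anywhere else.
    off_node = {ip: n for ip, n in counts.items() if ip != actor_ip}
    return on_node, off_node


# Made-up addresses: three tasks ran with the actor, two ran elsewhere.
on_node, off_node = summarize_placement(
    "10.0.0.1",
    ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1"],
)
print(on_node, off_node)
```

With STRICT_PACK behaving as expected, `off_node` would be empty for both the num_cpus=1 and num_cpus=0 runs.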

Issue Severity

High: It blocks me from completing my task.

Yard1 added the bug, triage, and core labels on Dec 12, 2022
@amogkam
Contributor

amogkam commented Dec 12, 2022

Looks like the same issue as #27931

@ericl
Contributor

ericl commented Dec 12, 2022

Closed (duplicates #27931)

ericl closed this as completed on Dec 12, 2022