Support scheduling_hint=SPREAD|COLOCATE for tasks and actors #18524
Comments
For COLOCATE, does this API allow users to easily colocate any two or more tasks? I think we have this requirement as well; so far we have just hacked around it by using the node ID as a custom resource. Good to see this proposal!
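For reference, a minimal sketch of the node-ID-as-custom-resource workaround mentioned above, assuming Ray's built-in per-node `node:<ip>` resources (the task and variable names here are illustrative):

```python
# Sketch: pin a task to a specific node via the per-node resource that Ray
# creates automatically (named "node:<ip address>"). This is the kind of
# workaround the proposal aims to replace.
import ray

ray.init()

@ray.remote
def pinned_task():
    return "ran on the chosen node"

# Pick a node from the cluster state and require a tiny slice of its node
# resource, which forces the task onto that node.
node_ip = ray.nodes()[0]["NodeManagerAddress"]
ref = pinned_task.options(resources={f"node:{node_ip}": 0.001}).remote()
print(ray.get(ref))
```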
Under the above proposal, you could force colocation of two tasks if the second task is launched by the first. If the two tasks are launched independently, you can already force colocation using a placement group; hope this helps.
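A minimal sketch of that placement-group workaround for colocating independently launched tasks, using the Ray 1.x-era options API (task names and bundle sizes are illustrative; newer Ray versions pass the placement group via a PlacementGroupSchedulingStrategy instead):

```python
# Colocate two independently launched tasks by scheduling both into the
# same single-node placement group bundle.
import ray
from ray.util.placement_group import placement_group

ray.init()

# One bundle; STRICT_PACK keeps all bundles (here, just one) on one node.
pg = placement_group([{"CPU": 2}], strategy="STRICT_PACK")
ray.get(pg.ready())  # wait for the reservation

@ray.remote(num_cpus=1)
def produce():
    return "data"

@ray.remote(num_cpus=1)
def consume():
    return "ok"

# Both tasks are placed in the same bundle, hence on the same node.
a = produce.options(placement_group=pg).remote()
b = consume.options(placement_group=pg).remote()
print(ray.get([a, b]))
```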
This would also generalize to actor placement, right?
Yes, we could support it for both tasks and actors, though the implementation may differ slightly for actors.
For the Datasets use case, they're all launched together (by the driver). I think we'd want the hint to work well in both scenarios; are there implications per scenario that you're thinking of?
COLOCATE seems to force colocation, whereas "hint" sounds like a best-effort thing.
Bump: another OSS user ran into this with Datasets, where read tasks (and therefore downstream map tasks) pack onto a single node, causing poor performance and cluster instability.
Regarding `func.options(scheduling_hint="SPREAD").remote()`: I'm confused about the semantics of this. Does this mean that ...
The scheduler will do its best to spread them equally across different nodes, similar to SPREAD in placement groups. No guarantees though. They are independent. |
Hmm... it sounds like soft SPREAD in a placement group, but without gang scheduling?
@clay4444 Yeah, it behaves similarly to soft SPREAD in a placement group.
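For comparison, a sketch of the soft SPREAD placement-group behavior referenced here (bundle counts and task names are illustrative; again using the Ray 1.x-era options API). The key difference from the proposed hint is that a placement group gang-reserves its resources up front:

```python
# SPREAD is best-effort: Ray tries to put each bundle on a different node
# but falls back to packing if it cannot (STRICT_SPREAD would fail instead).
import ray
from ray.util.placement_group import placement_group

ray.init()

pg = placement_group([{"CPU": 1}] * 4, strategy="SPREAD")
ray.get(pg.ready())

@ray.remote(num_cpus=1)
def read_block(i):
    return i

refs = [
    read_block.options(
        placement_group=pg,
        placement_group_bundle_index=i,  # one task per bundle
    ).remote(i)
    for i in range(4)
]
print(ray.get(refs))
```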
Several use cases would benefit from finer-grained control over scheduling, and are served well by neither automatic locality-aware scheduling nor placement groups:

- Data reading tasks: These tasks have no input but produce large amounts of output. Ideally Ray would spread them across the cluster, but currently there is no way to do so, which causes data imbalance in ML ingest and Dask-on-Ray workloads. Dask-on-Ray currently recommends a hidden scheduler flag for this: https://docs.ray.io/en/latest/data/dask-on-ray.html#best-practice-for-large-scale-workloads
  This is also a blocker for scalable ML ingest without the "resource prefix" hack, since large datasets cause memory imbalance across the cluster without spreading.
- Helper tasks relying on local resources: Suppose a task downloads a file locally but wants to launch sub-tasks for parallelism. There is currently no way to do this other than relying on hacky node-id resources. Another example is the driver forking a "main" task onto the head node for easy debugging.

Proposal: add a scheduling_hint option for tasks and actors, supporting SPREAD and COLOCATE, as sketched below.
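A sketch of how the proposed API might be used, based only on the issue title and the snippet quoted in the discussion; `scheduling_hint` is not an existing Ray option, and its exact spelling and semantics are what this issue proposes (task, actor, and path names below are illustrative):

```python
import ray

ray.init()

@ray.remote
def read_block(path):
    # No input, large output: ideally spread across the cluster.
    return path

@ray.remote
class Helper:
    def ping(self):
        return "ok"

paths = [f"part-{i}" for i in range(8)]  # placeholder inputs

# Best-effort spreading of data-reading tasks across nodes, similar to
# soft SPREAD in a placement group but with no gang reservation.
blocks = [read_block.options(scheduling_hint="SPREAD").remote(p) for p in paths]

# Colocate a helper actor with the submitting task/driver, e.g. to reuse
# files that are already present on the local node.
helper = Helper.options(scheduling_hint="COLOCATE").remote()

print(ray.get(blocks))
```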
Related issues: #18465, #5722