Support placement_group=None in PlacementGroupSchedulingStrategy #27370

Merged · 2 commits · Aug 2, 2022

36 changes: 18 additions & 18 deletions doc/source/ray-core/gotchas.rst
@@ -3,11 +3,11 @@
Ray Gotchas
===========

Some aspects of Ray's behavior might catch
users off guard. There may be sound arguments for these design choices.

In particular, users tend to think of Ray as running on their local machine, and
while that is mostly true, it does not always hold.

Environment variables are not passed from the driver to workers
---------------------------------------------------------------
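
For context on this gotcha, a minimal sketch of the usual remedy, passing variables through ``runtime_env`` (the variable names here are illustrative, not from the PR):

.. code-block:: python

    import os
    import ray

    # FOO is visible on the driver but, without runtime_env, would not be
    # set in worker processes on other nodes.
    ray.init(runtime_env={"env_vars": {"FOO": "bar"}})

    @ray.remote
    def read_foo():
        return os.environ.get("FOO")

    print(ray.get(read_foo.remote()))  # "bar" thanks to runtime_env
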
@@ -39,9 +39,9 @@ Filenames work sometimes and not at other times
-----------------------------------------------

**Issue**: If you reference a file by name in a task or actor,
it will sometimes work and sometimes fail. This is
because if the task or actor runs on the head node
of the cluster, it will work, but if the task or actor
runs on another machine it won't.

**Example**: Let's say we run the following command:
@@ -50,7 +50,7 @@

% touch /tmp/foo.txt

And we have this code:

.. code-block:: python

@@ -61,15 +61,15 @@
    def check_file():
        foo_exists = os.path.exists("/tmp/foo.txt")
        print(f"Foo exists? {foo_exists}")

    futures = []
    for _ in range(1000):
        futures.append(check_file.remote())

    ray.get(futures)


then you will get a mix of True and False. If
``check_file()`` runs on the head node, or we're running
locally, it works. But if it runs on a worker node, it returns ``False``.

@@ -78,7 +78,7 @@ It's the same code after all.

**Fix**

- Use only shared paths for such applications -- e.g. if you are using a network file system you can use that, or the files can be on S3 (see the sketch below).
- Do not rely on file path consistency.
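
If shared storage is not available, another way to avoid path dependence is to ship the file's contents with the task instead of its path. A minimal sketch (``check_payload`` and the driver-side read are illustrative, not from the PR):

.. code-block:: python

    import ray

    ray.init()

    @ray.remote
    def check_payload(payload: bytes) -> bool:
        # The bytes are serialized with the task, so every node sees them.
        return isinstance(payload, bytes)

    # Read once on the driver, then pass contents instead of a filename.
    with open("/tmp/foo.txt", "rb") as f:
        contents = f.read()

    futures = [check_payload.remote(contents) for _ in range(10)]
    print(ray.get(futures))  # True on head and worker nodes alike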


@@ -87,10 +87,10 @@ Placement groups are not composable
-----------------------------------

**Issue**: If you have a task that is called from something that runs in a placement
group, the resources are never allocated and it hangs.

**Example**: You are using Ray Tune, which creates placement groups, and you want to
apply it to an objective function, but that objective function makes use
of Ray Tasks itself, e.g.

.. code-block:: python
@@ -111,18 +111,18 @@
    tuner = tune.Tuner(objective, param_space={"a": 1})
    tuner.fit()

This will error with the message:
ValueError: Cannot schedule create_task_that_uses_resources.<locals>.sample_task with the placement group
because the resource request {'CPU': 10} cannot fit into any bundles for the placement group, [{'CPU': 1.0}].

**Expected behavior**: The above executes.

**Fix**: In the ``@ray.remote`` declaration of tasks
called by ``create_task_that_uses_resources()``, include a
-``placement_group=None``.
+``scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=None)``.
Review comments on this line:

Collaborator (author): Still feel a bit weird that we specify a strategy just to not use that strategy.

Contributor: We allow both right? Because in Datasets, we override using DEFAULT.

Collaborator (author): Yes, currently we allow both.
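
To make the "we allow both" point concrete, here is a minimal sketch of the two opt-out spellings discussed above (the ``func`` task is illustrative):

.. code-block:: python

    import ray
    from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

    ray.init()

    @ray.remote
    def func():
        return 0

    # Both calls detach the task from any capturing placement group:
    print(ray.get(func.options(scheduling_strategy="DEFAULT").remote()))
    print(ray.get(func.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=None)
    ).remote()))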


.. code-block:: diff

     def create_task_that_uses_resources():
-    + @ray.remote(num_cpus=10, placement_group=None)
+    + @ray.remote(num_cpus=10, scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=None))
     - @ray.remote(num_cpus=10)
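
Assembled into a runnable form, the fixed pattern from this gotcha looks roughly like the following sketch (the cluster sizing is assumed, and the Ray Tune wiring from the doc is omitted):

.. code-block:: python

    import ray
    from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

    ray.init(num_cpus=12)

    @ray.remote(
        num_cpus=10,
        scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=None),
    )
    def sample_task():
        # placement_group=None opts this task out of the capturing placement
        # group, so its 10-CPU request is served by the cluster at large.
        return 0

    @ray.remote
    def create_task_that_uses_resources():
        return ray.get(sample_task.remote())

    print(ray.get(create_task_that_uses_resources.remote()))
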
12 changes: 6 additions & 6 deletions python/ray/tests/test_placement_group_2.py
@@ -18,11 +18,7 @@
)
from ray.util.client.ray_client_helpers import connect_to_client_or_not
from ray.util.placement_group import get_current_placement_group

-try:
-    import pytest_timeout
-except ImportError:
-    pytest_timeout = None
+from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy


@ray.remote
@@ -376,7 +372,11 @@ def schedule_nested_actor(self):

     def schedule_nested_actor_outside_pg(self):
         # Don't use placement group.
-        actor = NestedActor.options(placement_group=None).remote()
+        actor = NestedActor.options(
+            scheduling_strategy=PlacementGroupSchedulingStrategy(
+                placement_group=None
+            )
+        ).remote()
         ray.get(actor.ready.remote())
         self.actors.append(actor)
10 changes: 0 additions & 10 deletions python/ray/tests/test_scheduling_2.py
@@ -244,16 +244,6 @@ def func():

        func.options(scheduling_strategy="XXX").remote()

-    with pytest.raises(ValueError):
-
-        @ray.remote
-        def func():
-            return 0
-
-        func.options(
-            scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=None)
-        ).remote()


@pytest.mark.parametrize("connect_to_client", [True, False])
def test_node_affinity_scheduling_strategy(
5 changes: 0 additions & 5 deletions python/ray/util/scheduling_strategies.py
@@ -30,11 +30,6 @@ def __init__(
         placement_group_bundle_index: int = -1,
         placement_group_capture_child_tasks: Optional[bool] = None,
     ):
-        if placement_group is None:
-            raise ValueError(
-                "placement_group needs to be an instance of PlacementGroup"
-            )
-
         self.placement_group = placement_group
         self.placement_group_bundle_index = placement_group_bundle_index
         self.placement_group_capture_child_tasks = placement_group_capture_child_tasks
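
With the ``ValueError`` guard removed, the constructor accepts either ``None`` or a real placement group. For contrast, a quick sketch of the non-``None`` case, under assumed cluster resources:

.. code-block:: python

    import ray
    from ray.util.placement_group import placement_group
    from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

    ray.init(num_cpus=2)

    pg = placement_group([{"CPU": 1}])
    ray.get(pg.ready())  # wait until the bundle is reserved

    @ray.remote(
        num_cpus=1,
        scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg),
    )
    def in_pg():
        return "runs inside the bundle"

    print(ray.get(in_pg.remote()))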