[release][nightly][pg] tune_cifar_pytorch_pbt_example does not work on nightly (Placement group error) #20348

Closed
1 of 2 tasks
AmeerHajAli opened this issue Nov 15, 2021 · 3 comments · Fixed by #20351
Labels
bug, P0, release-blocker, triage

Comments

@AmeerHajAli

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Tune

What happened + What you expected to happen

This works with Ray 1.8, but when I run at commit c0aeb4a I get the following error:

(run pid=864) ray::TrainTrainable.train_buffered() (pid=253, ip=10.0.24.128, repr=<ray.train.trainer.TrainTrainable object at 0x7efea85879d0>)
(run pid=864)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/function_runner.py", line 262, in run
(run pid=864)     self._entrypoint()
(run pid=864)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/function_runner.py", line 331, in entrypoint
(run pid=864)     self._status_reporter.get_checkpoint())
(run pid=864)   File "/Users/ameerhajali/anaconda3/envs/ray/lib/python3.7/site-packages/ray/tune/function_runner.py", line 597, in _trainable_func
(run pid=864)   File "/Users/ameerhajali/anaconda3/envs/ray/lib/python3.7/site-packages/ray/train/trainer.py", line 761, in tune_function
(run pid=864)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/trainer.py", line 162, in __init__
(run pid=864)     max_retries=max_retries)
(run pid=864)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/actor.py", line 524, in remote
(run pid=864)     runtime_env=new_runtime_env)
(run pid=864)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/actor.py", line 747, in _remote
(run pid=864)     placement_group=placement_group)
(run pid=864)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/placement_group.py", line 452, in configure_placement_group_based_on_context
(run pid=864)     placement_resources, task_or_actor_repr)
(run pid=864)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/placement_group.py", line 370, in _validate_resource_shape
(run pid=864)     raise ValueError(f"Cannot schedule {task_or_actor_repr} with "
(run pid=864) ValueError: Cannot schedule BackendExecutor with the placement group because the resource request {'node:10.0.24.128': 0.01, 'CPU': 0} cannot fit into any bundles for the placement group, [{'CPU': 1.0}, {'node:10.0.16.137': 0.01}, {'CPU': 1.0}, {'CPU': 1.0}].

Versions / Dependencies

Python 3.7, Ray c0aeb4a (nightly)

Reproduction script

RAY_ADDRESS=anyscale://ga-demo-aws python tune_cifar_pytorch_pbt_example.py

The cluster env:
[Screenshot: cluster environment, 2021-11-14]

Anything else

@richardliaw , can you please help with this one?

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
AmeerHajAli added the bug, P0, and triage labels Nov 15, 2021

AmeerHajAli commented Nov 15, 2021

@scv119 / @ericl do you know what might be going wrong?

@AmeerHajAli

It looks like #20123 broke this.
Before that change, it works.

AmeerHajAli added the release-blocker label Nov 15, 2021

rkooo567 commented Nov 15, 2021

Doesn't the error message show what's happening here? Ray tries to schedule an actor into a placement group, but none of the group's bundles contains the requested resources: the node resource for 10.0.24.128 is not in any bundle. This looks like an application error, i.e. the application configures its resources incorrectly.
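
To make the mismatch concrete, here is a minimal sketch in plain Python of the kind of check the error describes. It is not Ray's actual `_validate_resource_shape` code; `fits` and `fitting_bundles` are hypothetical helpers written only for illustration, and the request and bundle values are copied from the error message in the original report.

```python
from typing import Dict, List

Resources = Dict[str, float]


def fits(request: Resources, bundle: Resources) -> bool:
    """Return True if the bundle offers at least the requested amount of every resource."""
    # A resource missing from the bundle counts as 0 available (illustrative assumption).
    return all(bundle.get(name, 0.0) >= amount for name, amount in request.items())


def fitting_bundles(request: Resources, bundles: List[Resources]) -> List[int]:
    """Return the indices of all bundles that can hold the request."""
    return [i for i, bundle in enumerate(bundles) if fits(request, bundle)]


# Values taken from the ValueError in the report.
request = {"node:10.0.24.128": 0.01, "CPU": 0}
bundles = [
    {"CPU": 1.0},
    {"node:10.0.16.137": 0.01},
    {"CPU": 1.0},
    {"CPU": 1.0},
]

print(fitting_bundles(request, bundles))  # [] : no bundle has node:10.0.24.128

# A request pinned to the node that does appear in a bundle would fit bundle 1.
print(fitting_bundles({"node:10.0.16.137": 0.01, "CPU": 0}, bundles))  # [1]
```

The only node resource in the placement group is for 10.0.16.137, while the `BackendExecutor` request is pinned to 10.0.24.128, so under this reading no bundle can satisfy it.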

fishbone changed the title from "[Bug][pg] tune_cifar_pytorch_pbt_example does not work on nightly (Placement group error)" to "[release][nightly][pg] tune_cifar_pytorch_pbt_example does not work on nightly (Placement group error)" Nov 15, 2021