[Autoscaler] The autoscaler could not find a node type to satisfy the request #27910
Comments
Hi @DmitriGekhtman, checking in from our Slack thread; I'm still running into this problem when running jobs. Any thoughts?
I know this is a bit of work, but could you post full reproduction steps? That would include the exact RayCluster config used and the scripts executed (perhaps with sensitive details redacted).
Hi @DmitriGekhtman, here are some examples.
And a simple script would be:
Just re-confirming --
Got it, so the sequence of events is
I will try this out to see what's going on.
Exactly. FYI, the workers get scheduled if I manually scale them up. Here are the additional logs:
@DmitriGekhtman It looks like the autoscaler gets confused when workers already exist in the cluster. I triggered three jobs at the same time. The first one scheduled workers fine, but after submitting the second and third jobs I received the error shown above. I set max workers to 0 (to force-delete all workers), then increased it back up to 40. After increasing this number, it spawned the required resources for the first job but not for the others. For example:
Huh, well, the workers have 16 CPUs each, so I guess it might make sense that PACK would prioritize scheduling all 17 1-CPU requests on the same node.
@DmitriGekhtman That makes sense; however, I see the exact same behavior when scaling each worker up. I increased the CPU for each to 30 and the memory to 40, but if PACK is scheduling onto the same machine, there are definitely not enough resources on that specific node. However, the docs mention it will schedule elsewhere if it is unable to schedule onto a single node? Additionally, here are some informative docs on tuning (also posting for others who read this thread).
Also, below is what happens when I scale up CPUs.
Yeah, it is true that PACK is a "soft" constraint... starting to poke around now.
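For readers following this thread, here is a minimal sketch (not from the original workload) of what a PACK placement-group request looks like at the Ray API level; the 17 one-CPU bundles simply mirror the requests discussed above. PACK prefers to co-locate all bundles on one node, but because it is a soft constraint the bundles may spread across nodes when no single node has enough free CPUs:

```python
import ray
from ray.util.placement_group import placement_group

# Assumes an existing Ray cluster reachable at the default address.
ray.init(address="auto")

# 17 one-CPU bundles, e.g. one trial driver plus 16 rollout workers.
# strategy="PACK" prefers a single node but falls back to spreading the
# bundles across nodes if one node cannot hold them all.
pg = placement_group([{"CPU": 1}] * 17, strategy="PACK")

# Blocks until the bundles are reserved; on an autoscaling cluster this is
# the resource demand the autoscaler reacts to.
ray.get(pg.ready())
```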
Just for purposes of documenting my progress here:
Modified config:
Modified script:
Now, my issue is that there weren't enough tasks triggered to cause upscaling, and the job ran to completion successfully. @peterghaddad, could you suggest some parameters for the tune job that more closely resemble the ones for your workload? Maybe more workers or more CPUs per worker? And/or more computationally intensive trials?
Hi @DmitriGekhtman, sure thing!

```python
import numpy as np
from ray import tune

analysis = tune.run(
    "PPO",
    stop={"episode_reward_mean": 200000},
    config={
        "env": "CartPole-v1",
        "num_gpus": 0,
        "num_workers": 8,
        "lr": tune.grid_search(list(np.arange(0.0, 0.999, 0.2))),
    },
)
```

I have submitted jobs back-to-back (2-3). Is there any other information which may be helpful?
Hey @DmitriGekhtman, I triggered some additional experiments with minimal resources after submitting one that required larger resources (what I posted above). Same error, but it's only trying to place 4 CPUs for the smaller experiment.
Below is the code:
I also tried specifying
The submissions were within seconds of each other? Or back-to-back meaning you waited for one to complete before submitting another?
Also, could you share what your underlying K8s node infrastructure looks like?
I've just submitted 3 instances of the tuning job to a Ray cluster running on GKE Autopilot. Will see what happens!
I'm seeing upscaling with
Submitting three of these jobs in quick succession resulted in upscaling of 8 Ray worker pods. No errors yet -- will just let those jobs keep running for a while and then will try submitting again after they finish running and I see downscaling. By the way, running
I'm unfortunately not able to reproduce the issue yet -- if you could share the autoscaler's logs, that would be helpful.
Hi @DmitriGekhtman, thank you for troubleshooting!
As an FYI, I gave my workers more CPU and memory, hence fewer nodes. Let me know if you would like me to revert and try again. It is now:
I ran
I'm kicking off another job which requires fewer CPUs now. Below are the autoscaler logs for this job.
I wonder if it has to do with the presence of GPUs -- there is some logic in the autoscaler which tries to avoid scaling up GPU nodes for CPU tasks. What if you set
for the head and workers?
I have a very strong hunch that this is what happened: After upscaling a worker the first time, the autoscaler detected from the running worker's load metrics that the worker has access to GPUs. In other words, it's likely an instance of this issue: I do expect that setting the override
will resolve the issue in this case.
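As an aside for anyone hitting the same symptom, a quick, generic way (not part of the original thread) to check what resources the running nodes advertise to the autoscaler is to inspect the cluster state from a driver:

```python
import ray

# Connect to the existing cluster (assumes the default cluster address).
ray.init(address="auto")

# Aggregate view of the resources the scheduler/autoscaler sees.
print(ray.cluster_resources())

# Per-node view: if a supposedly CPU-only worker reports a nonzero "GPU"
# entry here, the autoscaler may treat it as a GPU node and avoid scaling
# it up for CPU-only tasks.
for node in ray.nodes():
    print(node["NodeManagerAddress"], node.get("Resources", {}))
```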
Would you look at that!
Solved it! I appreciate all of the help.
Glad we got to the bottom of it! Closing this as a duplicate of the underlying issue.
What happened + What you expected to happen
I am running the PPO trainer with num_workers set to 8. It seems that when I first launch an experiment after creating a new Ray cluster, everything gets scheduled and there are no errors from the autoscaler. After the experiment completes and I re-submit the job, I get the following error.
My CPU worker has the following resources: a request of 8 CPUs and a limit of 16 CPUs, with sufficient memory.
I was able to bypass this error by creating a new worker node called cpu-worker-small that has 1 CPU and 10Gi of memory; however, the actors then fail unexpectedly (probably due to resource constraints).
I saw some similar issues, e.g. #12441, but it seems that after the first experiment runs there are no resource demands in the autoscaler. I am using default settings from KubeRay.
After running the first experiment, I checked the following autoscaler state: This means that issue #24259 made it into Ray 1.12.1.
Versions / Dependencies
Ray 2.0
KubeRay on a Kubernetes cluster (latest version).
Reproduction script
PPO with the following configuration:

```python
{
    "num_gpus": 1,
    "num_workers": 2,
    "num_sgd_iter": 60,
    "train_batch_size": 12000,
}
```
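For context, here is a runnable sketch of how a config like this might be launched through Ray Tune; the environment, stop criterion, and the Tune entry point are assumptions for illustration, since the original script was not included in the report:

```python
from ray import tune

# Hypothetical reproduction sketch: runs RLlib's PPO through Tune with the
# configuration above. "env" and the stop condition are assumed; the
# original report only listed the config dictionary.
tune.run(
    "PPO",
    stop={"training_iteration": 10},
    config={
        "env": "CartPole-v1",
        "num_gpus": 1,
        "num_workers": 2,
        "num_sgd_iter": 60,
        "train_batch_size": 12000,
    },
)
```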
Issue Severity
High: It blocks me from completing my task.