[Data] More GPUs in progress bar summary than actually running #46384

What happened + What you expected to happen

When using a Ray Dataset in a CPU+GPU workload, the progress bar can report more GPUs in use than are actually running.

Versions / Dependencies

ray master

Reproduction script

See the minimal repro in the first comment below.

Comments
Here's a minimal repro:

```python
import time

import ray

# Cluster with a single GPU.
ray.init(num_gpus=1)


class Identity:
    def __call__(self, batch):
        time.sleep(1)
        return batch


ray.data.range(10, override_num_blocks=10).map_batches(
    Identity, num_gpus=1, batch_size=1, concurrency=(1, 2)
).materialize()
```

(Note the 2/1 GPU in the progress bar.)

---
This is because a pending actor is also counted as active in the progress bar. Fundamentally, there shouldn't be a pending actor in the first place, because there isn't another GPU available in the cluster, but that doesn't really hurt. So I guess the first step would just be to calculate the bar description based only on running actors. @scottjlee @bveeramani WDYT?
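For illustration, here's a minimal sketch of that idea, assuming a simplified model of the actor pool; the names (`ActorState`, `progress_bar_gpu_summary`) are hypothetical and are not Ray's actual internals:

```python
# Hypothetical sketch (not Ray's actual internals): build the progress-bar
# resource summary from running actors only, excluding pending ones.
from dataclasses import dataclass


@dataclass
class ActorState:
    num_gpus: float
    ready: bool  # False while the actor is still pending scheduling


def progress_bar_gpu_summary(actors: list[ActorState], cluster_gpus: float) -> str:
    # Count GPUs only for actors that are actually running; a pending
    # actor has not been scheduled yet, so its GPUs are not in use.
    running_gpus = sum(a.num_gpus for a in actors if a.ready)
    return f"{running_gpus:g}/{cluster_gpus:g} GPU"


# With one running and one pending actor on a 1-GPU cluster, the summary
# reads "1/1 GPU" instead of the misleading "2/1 GPU".
actors = [ActorState(num_gpus=1, ready=True), ActorState(num_gpus=1, ready=False)]
print(progress_bar_gpu_summary(actors, cluster_gpus=1))  # -> "1/1 GPU"
```

The design point is simply that a pending actor has requested a GPU but is not occupying one yet, so it should contribute neither to the active-worker count nor to the GPU numerator.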
---
@Superskyyy thanks, good find! It makes sense to use only currently running actors as the denominator and to exclude pending actors. Would you be open to contributing a PR? Otherwise, I will mark this as a good first issue for others to work on, and will try to find time in the next several weeks to get to it.

---
Thanks, I will open a PR this week.

---
Hi @Superskyyy, I am a newcomer and found this issue interesting and simple enough for me to ramp up on Ray development. It seems like this issue has lost traction for a bit, so I gave it a shot and validated that, after excluding pending actors when calculating active workers in the progress bar, the issue no longer reproduces with Balaji's code above. Please let me know whether you'd prefer to take it over; otherwise, I am more than happy to get this change across the finish line and to get more familiar with Ray development.

---

Hi @scottjlee @bveeramani, it's great meeting you, and thank you for putting the instructions together above. The PR I published resolves this issue, and I would love your review when you have time. Please let me know if there's any improvement or adjustment I can make before merging. Thank you :)