
[Data] More GPUs in progress bar summary than actually running #46384

Closed
scottjlee opened this issue Jul 2, 2024 · 5 comments · Fixed by #46729
Labels
bug: Something that is supposed to be working; but isn't
data: Ray Data-related issues
good first issue: Great starter issue for someone just starting to contribute to Ray
P1: Issue that should be fixed within a few weeks

Comments

@scottjlee
Contributor

scottjlee commented Jul 2, 2024

What happened + What you expected to happen

When using a Ray Dataset in a mixed CPU+GPU workload, the progress bar summary can sometimes report more GPUs in use than are actually running.

Versions / Dependencies

ray master

Reproduction script

See the minimal repro in the comments below.

@scottjlee added the bug, P1, and data labels on Jul 2, 2024
@bveeramani
Member

Here's a minimal repro:

import time

import ray

# Single-GPU cluster: only one GPU-requesting actor can ever be running.
ray.init(num_gpus=1)


class Identity:
    def __call__(self, batch):
        time.sleep(1)
        return batch


# concurrency=(1, 2) lets the actor pool scale up to 2 actors, each requesting
# 1 GPU, so the second actor stays pending on this single-GPU cluster.
ray.data.range(10, override_num_blocks=10).map_batches(
    Identity, num_gpus=1, batch_size=1, concurrency=(1, 2)
).materialize()

(Note the 2/1 GPU)

Running: 0/10 CPU, 2/1 GPU, 56.0B/1.0GB object_store_memory:  40%|████████████                  | 4/10 [00:04<00:06,  1.05s/it]
- ReadRange: 0 active, 0 queued, [cpu: 0.0, objects: 48.0B]: 100%|█████████████████████████████| 10/10 [00:04<00:00,  9.38it/s]
- MapBatches(Identity): 5 active, 2 queued, [cpu: 0.0, gpu: 2.0, objects: 8.0B], 1 actors (1 pending) [locality off]:  40%|▍| 4

@Superskyyy
Copy link
Contributor

Superskyyy commented Jul 3, 2024

This happens because a pending actor is also counted as active in the progress bar. Fundamentally, there shouldn't be a pending actor in the first place, since there isn't another GPU available in the cluster, but that doesn't really hurt. So I guess the first step would just be to calculate the bar description based only on running actors. @scottjlee @bveeramani WDYT?

num_active_workers = self._actor_pool.num_running_actors()
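
A minimal sketch of that idea (the ActorPool shape and the num_running_actors / num_pending_actors names here are illustrative, modeled on the snippet above rather than the actual Ray Data internals):

from dataclasses import dataclass


@dataclass
class ActorPool:
    # Hypothetical stand-in for Ray Data's autoscaling actor pool.
    running: int
    pending: int

    def num_running_actors(self) -> int:
        return self.running

    def num_pending_actors(self) -> int:
        return self.pending


def gpu_summary(pool: ActorPool, gpus_per_actor: float, total_gpus: float) -> str:
    # Count only actors that are actually running; a pending actor has not
    # been granted its GPU yet, so it should not contribute to the "used" side.
    used = pool.num_running_actors() * gpus_per_actor
    return f"{used}/{total_gpus} GPU"


# With 1 running and 1 pending actor on a single-GPU cluster (as in the repro),
# the summary reads "1.0/1.0 GPU" rather than the misleading "2.0/1.0 GPU".
print(gpu_summary(ActorPool(running=1, pending=1), gpus_per_actor=1.0, total_gpus=1.0))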

@scottjlee
Copy link
Contributor Author

@Superskyyy thanks, good find! It makes sense to compute the progress bar summary from only the currently running actors and exclude pending ones. Would you be open to contributing a PR? Otherwise, I will mark this as a good first issue for others to work on, and will try to find time in the next several weeks to get to it.

@scottjlee added the good first issue label on Jul 10, 2024
@Superskyyy
Contributor

@Superskyyy thanks, good find! It makes sense to compute the progress bar summary from only the currently running actors and exclude pending ones. Would you be open to contributing a PR? Otherwise, I will mark this as a good first issue for others to work on, and will try to find time in the next several weeks to get to it.

Thanks, I will open a PR this week.

@jeffreyjeffreywang
Contributor

jeffreyjeffreywang commented Jul 21, 2024

Hi @Superskyyy, I am a newcomer and found this issue interesting and simple enough for me to ramp up on Ray development. It seems like this issue has lost traction for a bit, so I gave it a shot and validated that, after excluding pending actors when calculating active workers in the progress bar, the issue no longer reproduces with Balaji's code above. Please let me know whether you'd prefer to take it over. Otherwise, I am more than happy to get this change across the finish line and to get more familiar with Ray development.

[Screenshot attached: progress bar output after the fix, 2024-07-20]

Hi @scottjlee @bveeramani, it's great to meet you, and thank you for putting the instructions together above. The PR I published resolves this issue, and I would love your review when you have time. Please let me know if there are any improvements or adjustments I can make before merging. Thank you :)
