What's the best way to control which GPUs a worker can use? #303
Comments
One solution for now is to expose a method like `ray.get_gpu_ids()`. For now, we can assume that users will handle things like setting the `CUDA_VISIBLE_DEVICES` environment variable themselves.
Hey, it seems that this is still a problem. Any suggestions?
@GoingMyWay can you share more details about the problem you're seeing? A reproducible script would be ideal. There are more details about using Ray with GPUs here: https://docs.ray.io/en/master/ray-core/tasks/using-ray-with-gpus.html
@robertnishihara
I think this is still a problem, and it is impacting the performance of Ray Tune. As @robertnishihara mentioned above, the GPUs exposed to a remote function can be overridden by setting the `CUDA_VISIBLE_DEVICES` environment variable. However, with Ray Tune we specify the number of GPUs exposed per worker in advance, and each worker can then only share data across at most that many GPUs, which limits the benefit of data parallelism within a single node. For example, I am running 8 workers (each with `num_gpus=1`) on a single-node machine with 8 GPUs. No worker can utilize all of the GPUs in the machine, because `num_gpus=1` isolates a single GPU for each worker. The ideal case is to allow each worker to use all GPUs for the sake of data parallelism. One possible solution is to inflate the logical GPU count by the desired factor of parallelism (logical GPU count = physical GPU count * factor of parallelism), so that each worker can run with `num_gpus=<number of physical GPUs>` and we can further restrict the devices seen by each worker with `CUDA_VISIBLE_DEVICES`.
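To make that override concrete, here is a minimal sketch; the function name and the fixed count of 8 GPUs are illustrative assumptions, not part of any Ray or Tune API:

```python
import os
import ray

# Illustrative only: the trial reserves one logical GPU from Ray's scheduler,
# but overrides CUDA_VISIBLE_DEVICES so the framework inside it can see every
# physical GPU on the node for data-parallel training.
@ray.remote(num_gpus=1)
def train_trial():
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in range(8))
    # ... build the model and run data-parallel training across all GPUs ...
    return "done"

ray.init()
print(ray.get(train_trial.remote()))
```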
Right now, we allow tasks to specify that they require GPUs by including it in the decorator, e.g.,
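Roughly, that looks like the following (a sketch using the `num_gpus` keyword):

```python
import ray

# A task that declares it needs one GPU; Ray will only schedule it on a node
# with a free GPU resource.
@ray.remote(num_gpus=1)
def f():
    return 1
```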
#302 introduces the same syntax for actors, e.g.,
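Again roughly, and with an illustrative class name, the actor version would look like this:

```python
import ray

# An actor pinned to two GPUs for its entire lifetime.
@ray.remote(num_gpus=2)
class GPUActor(object):
    def __init__(self):
        pass
```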
So how does the function `f` actually know which GPUs to use? With CPUs, the OS will try to balance things between CPUs, so it's less critical to get it right. That said, the local scheduler (or whoever) could set the affinity of different worker processes for different CPUs to control which CPUs each worker can use. Is there an analogue of all this for GPUs?
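For comparison, the CPU-affinity mechanism alluded to above is available from Python on Linux; this is a hypothetical illustration, not something Ray's local scheduler actually does:

```python
import os

# Pin the current process (pid 0 means "this process") to CPUs 0 and 1.
os.sched_setaffinity(0, {0, 1})
print(os.sched_getaffinity(0))  # -> {0, 1}
```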
GPUs seem different from CPUs. The burden of choosing which GPU to use is often placed on the programmer, not on the OS. For example, the standard way of controlling which GPUs TensorFlow uses is to set the environment variable `CUDA_VISIBLE_DEVICES`, e.g., something like `CUDA_VISIBLE_DEVICES=0,3,4`, before running `tf.Session()`. Once you create a session, TensorFlow will reserve a bunch of memory on all visible GPUs. I'm not sure how specific the concept of using GPU device IDs is to TensorFlow or if it is general.

We can expose a method `ray.get_gpu_ids()` that could be called inside any task or any actor and would return the IDs (e.g., `[0, 3, 4]`) of the GPUs that that process is allowed to use. This assumes that environment variables can be set from within Python. In the case of TensorFlow, that works, e.g., we can do `os.environ["CUDA_VISIBLE_DEVICES"] = ",".join([str(i) for i in ray.get_gpu_ids()])` or something like that from Python (e.g., within an actor constructor) as long as we do it before we run `tf.Session()`. But you could imagine a scenario or a different library where the environment variable has to be set BEFORE the worker process is created (or before the library is imported). In that case, there are other options (e.g., having the local scheduler set the environment variable), but they all seem awful (actually, I ran into such a situation recently where I had to set an environment variable like `DISPLAY=:99` before importing a library because otherwise the library crashed when looking for an X server and brought down the worker).
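A minimal sketch of that pattern inside an actor constructor (assumes TensorFlow 1.x, where `tf.Session()` exists, and an illustrative actor name):

```python
import os
import ray
import tensorflow as tf  # TensorFlow 1.x API assumed

@ray.remote(num_gpus=1)
class TFActor(object):
    def __init__(self):
        # Restrict TensorFlow to the GPUs Ray assigned to this process, and do
        # it before creating the session so memory is reserved only on them.
        os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(
            str(i) for i in ray.get_gpu_ids())
        self.sess = tf.Session()
```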