What's the best way to control which GPUs a worker can use? #303

Closed
robertnishihara opened this issue Feb 21, 2017 · 5 comments
Labels
question Just a question :)

Comments

@robertnishihara
Collaborator

Right now, we allow tasks to specify that they require GPUs by including the requirement in the decorator, e.g.,

@ray.remote(num_gpus=2)
def f():
  ...

#302 introduces the same syntax for actors, e.g.,

@ray.actor(num_gpus=3)
class Foo(object):
  ...

So how does the function f actually know which GPUs to use?

With CPUs, the OS will try to balance things between CPUs, so it's less critical to get it right. That said, the local scheduler (or whoever) could set the affinity of different worker processes for different CPUs to control which CPUs each worker can use. Is there an analogue of all this for GPUs?
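
For reference, the CPU-affinity analogue mentioned here can already be done from plain Python on Linux. A minimal sketch (os.sched_setaffinity is Linux-specific, and the CPU set below is just for illustration):

import os

# Pin the calling process (pid 0 means "this process") to CPUs 0 and 1. Linux-only.
os.sched_setaffinity(0, {0, 1})
print(os.sched_getaffinity(0))  # e.g., {0, 1}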

GPUs seem different from CPUs. The burden of choosing which GPU to use is often placed on the programmer, not on the OS. For example, the standard way to control which GPUs TensorFlow uses is to set the environment variable CUDA_VISIBLE_DEVICES (e.g., CUDA_VISIBLE_DEVICES=0,3,4) before running tf.Session(); once you create a session, TensorFlow reserves a bunch of memory on all visible GPUs. I'm not sure whether selecting GPUs by device ID like this is specific to TensorFlow or a general pattern.

We can expose a method ray.get_gpu_ids() that could be called inside any task or actor and would return the IDs (e.g., [0, 3, 4]) of the GPUs that the process is allowed to use. This assumes that environment variables can be set from within Python. In the case of TensorFlow that works: we can do os.environ["CUDA_VISIBLE_DEVICES"] = ",".join([str(i) for i in ray.get_gpu_ids()]) from Python (e.g., within an actor constructor) as long as we do it before we run tf.Session().

But you could imagine a scenario, or a different library, where the environment variable has to be set BEFORE the worker process is created (or before the library is imported). In that case there are other options (e.g., having the local scheduler set the environment variable), but they all seem awful. (I actually ran into such a situation recently, where I had to set an environment variable like DISPLAY=:99 before importing a library, because otherwise the library crashed when looking for an X server and brought down the worker.)
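
For concreteness, a minimal sketch of that pattern, using the actor syntax from #302 and assuming the proposed ray.get_gpu_ids() call exists (the Trainer class is just illustrative):

import os
import ray
import tensorflow as tf

@ray.actor(num_gpus=2)
class Trainer(object):
  def __init__(self):
    # Restrict TensorFlow to the GPUs assigned to this actor. This has to
    # happen before the first tf.Session() is created, since the session
    # reserves memory on all visible GPUs.
    gpu_ids = ray.get_gpu_ids()  # e.g., [0, 3]
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    self.sess = tf.Session()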

robertnishihara added the question label Feb 21, 2017
@robertnishihara
Collaborator Author

One solution for now is to expose a method like ray.get_gpu_ids() or ray.get_env()["GPU_IDS"] within tasks and within actor methods.

For now, we can assume that users will handle things like setting the environment variable CUDA_VISIBLE_DEVICES themselves.
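
As a sketch of what that could look like from the user's side (again assuming ray.get_gpu_ids() is available inside tasks; the which_gpus function is just for illustration):

import os
import ray

@ray.remote(num_gpus=1)
def which_gpus():
  # The user is responsible for propagating the assigned IDs to their
  # GPU library, e.g., via CUDA_VISIBLE_DEVICES.
  gpu_ids = ray.get_gpu_ids()
  os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
  return gpu_ids

ray.init(num_gpus=4)
print(ray.get(which_gpus.remote()))  # e.g., [2]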

@GoingMyWay

Hey, it seems that this is still a problem. Any suggestions?

@robertnishihara
Collaborator Author

@GoingMyWay can you share more details about the problem you're seeing? A reproducible script would be ideal.

There are more details about using Ray with GPUs here: https://docs.ray.io/en/master/ray-core/tasks/using-ray-with-gpus.html.

@DanielWicz

@robertnishihara
What if you use AMD GPUs?

@amztc34283

amztc34283 commented Oct 30, 2023

I think this is still a problem and it is impacting the performance of Ray Tune.

Like @robertnishihara mentioned above, the GPUs exposed to a remote function can be overridden by setting the CUDA_VISIBLE_DEVICES environment variable. However, in the case of Ray Tune, we specify the number of GPUs exposed per worker in advance, and each worker basically cannot share data across more GPUs than that per-worker number; this limits the benefit of data parallelism within a single node.

For example, I am running 8 workers (each with num_gpus=1) on a single-node machine with 8 GPUs. No worker can utilize all of the GPUs in the machine, because num_gpus=1 isolates each worker to a single GPU. The ideal case would be to allow each worker to use all of the GPUs for the sake of data parallelism.

One possible solution is to inflate the logical GPU count by the factor of parallelism you want to run (logical GPU count = physical GPU count * factor of parallelism), so that each worker can request num_gpus equal to the ideal parallelism (i.e., the number of physical GPUs), and then pin the actual devices with CUDA_VISIBLE_DEVICES.
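
A rough sketch of that workaround (the numbers and the trial function are illustrative; ray.init(num_gpus=...) only declares the logical GPU count, so each worker has to map its logical assignment back onto the physical devices by hand):

import os
import ray

NUM_PHYSICAL_GPUS = 8
PARALLELISM = 8  # number of concurrent workers on this node

# Declare more logical GPUs than physically exist so that all 8 workers
# can each request num_gpus=8 at the same time.
ray.init(num_gpus=NUM_PHYSICAL_GPUS * PARALLELISM)

@ray.remote(num_gpus=NUM_PHYSICAL_GPUS)
def trial():
  # The logical GPU IDs Ray assigns no longer match physical devices,
  # so explicitly expose every physical GPU to this worker.
  os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(
      str(i) for i in range(NUM_PHYSICAL_GPUS))
  ...  # run the data-parallel training step here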
