[RAY AIR] set the correct gpu id in TorchTrainer #26493

Merged

merged 15 commits into ray-project:master on Jul 19, 2022

Conversation

Contributor

@JiahaoYao JiahaoYao commented Jul 13, 2022

Why are these changes needed?

This PR fixes the issue raised in #26490. In the PyTorch trainer and Ray Tune, the CUDA device is not set correctly, which leads to the following error:

(BackendExecutor pid=46172)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/train/torch.py", line 614, in prepare_model
(BackendExecutor pid=46172)     return get_accelerator(TorchAccelerator).prepare_model(
(BackendExecutor pid=46172)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/train/torch.py", line 95, in prepare_model
(BackendExecutor pid=46172)     torch.cuda.set_device(device)
(BackendExecutor pid=46172)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/torch/cuda/__init__.py", line 314, in set_device
(BackendExecutor pid=46172)     torch._C._cuda_setDevice(device)
(BackendExecutor pid=46172) RuntimeError: CUDA error: invalid device ordinal

The PR aligns the GPU ID returned by ray.get_gpu_ids() with the GPU IDs listed in CUDA_VISIBLE_DEVICES; the position of the assigned GPU within that list is then used as the torch.cuda device index.

Per the Torch launcher code, you generally want to use the local rank to set the device:
https://github.com/pytorch/pytorch/blob/35563f4fcd28e486cc58053acc15fe280123e7be/torch/distributed/launch.py#L72-L97

However, there are edge cases when using fractional GPUs or multiple GPUs per worker.
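
For illustration, a minimal sketch of the mapping described above, assuming a single GPU per worker (the helper name _resolve_torch_device is hypothetical; the actual change lives in python/ray/train/torch/train_loop_utils.py):

```python
import ray
import torch


def _resolve_torch_device():
    # Hypothetical helper sketching the fix: map the physical GPU ID that Ray
    # assigns to this worker onto the ordinal torch.cuda expects, which indexes
    # into CUDA_VISIBLE_DEVICES rather than the physical device list.
    gpu_ids = ray.get_gpu_ids()
    if not gpu_ids or not torch.cuda.is_available():
        return torch.device("cpu")

    gpu_id = int(gpu_ids[0])
    cuda_visible_list = list(
        map(int, ray._private.utils.get_cuda_visible_devices())
    )
    # Passing the physical ID straight to torch.cuda.set_device() is what raised
    # "CUDA error: invalid device ordinal": e.g. physical GPU 3 in a worker that
    # only sees CUDA_VISIBLE_DEVICES="3" must be addressed as cuda:0.
    device_id = cuda_visible_list.index(gpu_id)
    return torch.device(f"cuda:{device_id}")
```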

Related issue number

#26490

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Comment on lines 65 to 66
elif num_gpus_per_worker == 2:
assert devices == [0]
Contributor

Does this run? We might need to increase the fixture to have 4 GPUs since there are 2 workers.

Also, this probably doesn't give complete coverage (I think the previous code would succeed here as well).

It would be good to have a test where the index is not equal to the GPU ID.
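
If the fixture does need more GPUs, a 4-GPU variant could look roughly like this (the fixture name is hypothetical, mirroring the existing ray_start_4_cpus_2_gpus pattern):

```python
import pytest
import ray


@pytest.fixture
def ray_start_4_cpus_4_gpus():
    # Hypothetical fixture: advertise 4 GPUs so that two workers with
    # num_gpus_per_worker=2 can both be scheduled.
    address_info = ray.init(num_cpus=4, num_gpus=4)
    yield address_info
    ray.shutdown()
```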

Contributor Author

agree

Contributor Author

remove this

Contributor

@amogkam amogkam left a comment

Thanks @JiahaoYao!

@@ -349,6 +349,10 @@ def test_tune_tensorflow_mnist_gpu(ray_start_4_cpus_2_gpus):
tune_tensorflow_mnist(num_workers=2, use_gpu=True, num_samples=1)


def test_concurrent_tune_tensorflow_mnist_gpu(ray_start_4_cpus_2_gpus):
tune_tensorflow_mnist(num_workers=1, use_gpu=True, num_samples=2)
Contributor

The change was made to train.torch, so the TensorFlow tests don't actually exercise the changes.

Contributor Author

Oops, my bad.

device_id = gpu_ids[0]
gpu_id = gpu_ids[0]
cuda_visible_list = list(
    map(int, ray._private.utils.get_cuda_visible_devices())
Contributor

Can we manually get the CUDA visible devices via the environment variable instead of using a private Ray API?

Or move the private API to ray.util and mark it as a developer API.
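
For reference, reading the environment variable directly would look roughly like this (a sketch of the suggestion above, not the code that was merged; it assumes CUDA_VISIBLE_DEVICES contains integer IDs rather than GPU UUIDs):

```python
import os


def _cuda_visible_devices_from_env():
    # Hypothetical replacement for ray._private.utils.get_cuda_visible_devices():
    # parse the CUDA_VISIBLE_DEVICES environment variable that Ray sets for the worker.
    raw = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [int(dev) for dev in raw.split(",") if dev.strip()]
```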

cuda_visible_list = list(
    map(int, ray._private.utils.get_cuda_visible_devices())
)
device_id = cuda_visible_list.index(gpu_id)
Contributor

Will this fail if cuda_visible_list is empty?

Contributor Author

Thanks @amogkam for the review; I'm about to modify the code accordingly.

Contributor Author

len(gpu_ids) > 0 will ensure this?

@xwjiang2010
Contributor

Can we add a PR description? Thanks!!

@amogkam amogkam self-assigned this Jul 18, 2022
@amogkam amogkam mentioned this pull request Jul 18, 2022
@JiahaoYao
Contributor Author

The virtual GPU does not work:

(ray_test) ~/ScratchGym/Scratch/test0718/ray/python/ray/train/tests python -m pytest -rP --capture no test_gpu.py::test_torch_get_device_dist
Test session starts (platform: darwin, Python 3.8.13, pytest 5.4.3, pytest-sugar 0.9.5)
rootdir: /Users/jimmy/ScratchGym/Scratch/test0718/ray/python
plugins: asyncio-0.16.0, anyio-3.6.1, forked-1.4.0, sugar-0.9.5, pytest_docker_tools-0.2.3, timeout-2.1.0, shutil-1.7.0, rerunfailures-10.2, virtualenv-1.7.0, lazy-fixture-0.6.3
collecting ... 2022-07-18 11:04:52,131  INFO trainer.py:247 -- Trainer logs will be logged in: /Users/jimmy/ray_results/train_2022-07-18_11-04-52
2022-07-18 11:04:52,133 INFO plugin_schema_manager.py:51 -- Loading the default runtime env schemas: ['/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/runtime_env/../../runtime_env/schemas/working_dir_schema.json', '/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/runtime_env/../../runtime_env/schemas/pip_schema.json'].
(BaseWorkerMixin pid=3658) 2022-07-18 11:04:58,145      INFO config.py:70 -- Setting up process group for: env:// [rank=2, world_size=16]
(BaseWorkerMixin pid=3661) 2022-07-18 11:04:58,165      INFO config.py:70 -- Setting up process group for: env:// [rank=5, world_size=16]
(BaseWorkerMixin pid=3664) 2022-07-18 11:04:58,155      INFO config.py:70 -- Setting up process group for: env:// [rank=8, world_size=16]
(BaseWorkerMixin pid=3660) 2022-07-18 11:04:58,140      INFO config.py:70 -- Setting up process group for: env:// [rank=4, world_size=16]
(BaseWorkerMixin pid=3662) 2022-07-18 11:04:58,160      INFO config.py:70 -- Setting up process group for: env:// [rank=6, world_size=16]
(BaseWorkerMixin pid=3669) 2022-07-18 11:04:58,161      INFO config.py:70 -- Setting up process group for: env:// [rank=13, world_size=16]
(BaseWorkerMixin pid=3659) 2022-07-18 11:04:58,163      INFO config.py:70 -- Setting up process group for: env:// [rank=3, world_size=16]
(BaseWorkerMixin pid=3665) 2022-07-18 11:04:58,163      INFO config.py:70 -- Setting up process group for: env:// [rank=9, world_size=16]
(BaseWorkerMixin pid=3663) 2022-07-18 11:04:58,159      INFO config.py:70 -- Setting up process group for: env:// [rank=7, world_size=16]
(BaseWorkerMixin pid=3657) 2022-07-18 11:04:58,157      INFO config.py:70 -- Setting up process group for: env:// [rank=1, world_size=16]
(BaseWorkerMixin pid=3666) 2022-07-18 11:04:58,161      INFO config.py:70 -- Setting up process group for: env:// [rank=10, world_size=16]
(BaseWorkerMixin pid=3656) 2022-07-18 11:04:58,138      INFO config.py:70 -- Setting up process group for: env:// [rank=0, world_size=16]
(BaseWorkerMixin pid=3670) 2022-07-18 11:04:58,159      INFO config.py:70 -- Setting up process group for: env:// [rank=14, world_size=16]
(BaseWorkerMixin pid=3668) 2022-07-18 11:04:58,161      INFO config.py:70 -- Setting up process group for: env:// [rank=12, world_size=16]
(BaseWorkerMixin pid=3671) 2022-07-18 11:04:58,163      INFO config.py:70 -- Setting up process group for: env:// [rank=15, world_size=16]
2022-07-18 11:04:58,215 INFO trainer.py:253 -- Run results will be logged in: /Users/jimmy/ray_results/train_2022-07-18_11-04-52/run_001
(BaseWorkerMixin pid=3667) 2022-07-18 11:04:58,213      INFO config.py:70 -- Setting up process group for: env:// [rank=11, world_size=16]
Counter({None: 16})

 ray/train/tests/test_gpu.py33% ███▍      2022-07-18 11:05:02,241 ERROR services.py:1494 -- Failed to start the dashboard: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
2022-07-18 11:05:02,242 ERROR services.py:1495 -- Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
Traceback (most recent call last):
  File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1362, in start_dashboard
    port_test_socket.bind((host, port))
OSError: [Errno 48] Address already in use

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1369, in start_dashboard
    raise ValueError(
ValueError: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
2022-07-18 11:05:02,531 INFO worker.py:1295 -- Connecting to existing Ray cluster at address: 127.0.0.1:65478
2022-07-18 11:05:02,561 INFO trainer.py:247 -- Trainer logs will be logged in: /Users/jimmy/ray_results/train_2022-07-18_11-05-02
2022-07-18 11:05:07,332 INFO trainer.py:253 -- Run results will be logged in: /Users/jimmy/ray_results/train_2022-07-18_11-05-02/run_001
(BaseWorkerMixin pid=3724) 2022-07-18 11:05:07,327      INFO config.py:70 -- Setting up process group for: env:// [rank=5, world_size=8]
(BaseWorkerMixin pid=3722) 2022-07-18 11:05:07,328      INFO config.py:70 -- Setting up process group for: env:// [rank=3, world_size=8]
(BaseWorkerMixin pid=3723) 2022-07-18 11:05:07,326      INFO config.py:70 -- Setting up process group for: env:// [rank=4, world_size=8]
(BaseWorkerMixin pid=3725) 2022-07-18 11:05:07,331      INFO config.py:70 -- Setting up process group for: env:// [rank=6, world_size=8]
(BaseWorkerMixin pid=3726) 2022-07-18 11:05:07,331      INFO config.py:70 -- Setting up process group for: env:// [rank=7, world_size=8]
(BaseWorkerMixin pid=3721) 2022-07-18 11:05:07,327      INFO config.py:70 -- Setting up process group for: env:// [rank=2, world_size=8]
(BaseWorkerMixin pid=3720) 2022-07-18 11:05:07,325      INFO config.py:70 -- Setting up process group for: env:// [rank=1, world_size=8]
(BaseWorkerMixin pid=3719) 2022-07-18 11:05:07,327      INFO config.py:70 -- Setting up process group for: env:// [rank=0, world_size=8]
Counter({None: 8})
 ray/train/tests/test_gpu.py ✓✓                                                                                                                                                67% ██████▋   2022-07-18 11:05:11,405 ERROR services.py:1494 -- Failed to start the dashboard: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
2022-07-18 11:05:11,406 ERROR services.py:1495 -- Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
Traceback (most recent call last):
  File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1362, in start_dashboard
    port_test_socket.bind((host, port))
OSError: [Errno 48] Address already in use

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1369, in start_dashboard
    raise ValueError(
ValueError: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
2022-07-18 11:05:11,695 INFO worker.py:1295 -- Connecting to existing Ray cluster at address: 127.0.0.1:64480
2022-07-18 11:05:11,723 INFO trainer.py:247 -- Trainer logs will be logged in: /Users/jimmy/ray_results/train_2022-07-18_11-05-11
2022-07-18 11:05:16,002 INFO trainer.py:253 -- Run results will be logged in: /Users/jimmy/ray_results/train_2022-07-18_11-05-11/run_001
(BaseWorkerMixin pid=3763) 2022-07-18 11:05:15,999      INFO config.py:70 -- Setting up process group for: env:// [rank=2, world_size=4]
(BaseWorkerMixin pid=3761) 2022-07-18 11:05:16,000      INFO config.py:70 -- Setting up process group for: env:// [rank=0, world_size=4]
(BaseWorkerMixin pid=3762) 2022-07-18 11:05:15,999      INFO config.py:70 -- Setting up process group for: env:// [rank=1, world_size=4]
(BaseWorkerMixin pid=3764) 2022-07-18 11:05:16,001      INFO config.py:70 -- Setting up process group for: env:// [rank=3, world_size=4]
Counter({None: 4})
 ray/train/tests/test_gpu.py ✓✓✓                                                                                                                                              100% ██████████
===================================================================================== warnings summary ======================================================================================
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:36
  /Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:36: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.
    'nearest': pil_image.NEAREST,

/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:37
  /Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:37: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
    'bilinear': pil_image.BILINEAR,

/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:38
  /Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:38: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
    'bicubic': pil_image.BICUBIC,

/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:39
  /Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:39: DeprecationWarning: HAMMING is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.HAMMING instead.
    'hamming': pil_image.HAMMING,

/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:40
  /Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:40: DeprecationWarning: BOX is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BOX instead.
    'box': pil_image.BOX,

/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:41
  /Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:41: DeprecationWarning: LANCZOS is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.LANCZOS instead.
    'lanczos': pil_image.LANCZOS,

/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/air/util/tensor_extensions/pandas.py:168
  /Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/air/util/tensor_extensions/pandas.py:168: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    if LooseVersion("1.1.0") <= LooseVersion(pd.__version__) < LooseVersion("1.3.0"):

ray/train/tests/test_gpu.py::test_torch_get_device_dist[0.5]
ray/train/tests/test_gpu.py::test_torch_get_device_dist[1]
ray/train/tests/test_gpu.py::test_torch_get_device_dist[2]
  /Users/jimmy/ScratchGym/Scratch/test0718/ray/python/ray/train/tests/test_gpu.py:95: DeprecationWarning: The `ray.train.Trainer` API is deprecated in Ray 2.0, and is replaced by Ray AI Runtime (Ray AIR). Ray AIR (https://docs.ray.io/en/latest/ray-air/getting-started.html) will provide greater functionality than `ray.train.Trainer`, and with a more flexible and easy-to-use API.
    trainer = Trainer(

-- Docs: https://docs.pytest.org/en/latest/warnings.html
========================================================================================== PASSES ===========================================================================================
______________________________________________________________________________ test_torch_get_device_dist[0.5] ______________________________________________________________________________
------------------------------------------------------------------------------------ Captured log setup -------------------------------------------------------------------------------------
ERROR    ray._private.services:services.py:1494 Failed to start the dashboard: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
ERROR    ray._private.services:services.py:1495 Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
Traceback (most recent call last):
  File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1362, in start_dashboard
    port_test_socket.bind((host, port))
OSError: [Errno 48] Address already in use

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1369, in start_dashboard
    raise ValueError(
ValueError: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
______________________________________________________________________________ test_torch_get_device_dist[0.5] ______________________________________________________________________________
------------------------------------------------------------------------------------ Captured log setup -------------------------------------------------------------------------------------
ERROR    ray._private.services:services.py:1494 Failed to start the dashboard: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
ERROR    ray._private.services:services.py:1495 Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
Traceback (most recent call last):
  File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1362, in start_dashboard
    port_test_socket.bind((host, port))
OSError: [Errno 48] Address already in use

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1369, in start_dashboard
    raise ValueError(
ValueError: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
______________________________________________________________________________ test_torch_get_device_dist[0.5] ______________________________________________________________________________
------------------------------------------------------------------------------------ Captured log setup -------------------------------------------------------------------------------------
ERROR    ray._private.services:services.py:1494 Failed to start the dashboard: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
ERROR    ray._private.services:services.py:1495 Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
Traceback (most recent call last):
  File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1362, in start_dashboard
    port_test_socket.bind((host, port))
OSError: [Errno 48] Address already in use

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1369, in start_dashboard
    raise ValueError(
ValueError: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.

Results (34.25s):
       3 passed
(ray_test) ~/ScratchGym/Scratch/test0718/ray/python/ray/train/tests python -m pytest -rP --capture no test_gpu.py::test_torch_get_device_dist
Test session starts (platform: darwin, Python 3.8.13, pytest 5.4.3, pytest-sugar 0.9.5)
rootdir: /Users/jimmy/ScratchGym/Scratch/test0718/ray/python
plugins: asyncio-0.16.0, anyio-3.6.1, forked-1.4.0, sugar-0.9.5, pytest_docker_tools-0.2.3, timeout-2.1.0, shutil-1.7.0, rerunfailures-10.2, virtualenv-1.7.0, lazy-fixture-0.6.3
collecting ... 2022-07-18 11:09:42,305  INFO trainer.py:247 -- Trainer logs will be logged in: /Users/jimmy/ray_results/train_2022-07-18_11-09-42
2022-07-18 11:09:42,308 INFO plugin_schema_manager.py:51 -- Loading the default runtime env schemas: ['/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/runtime_env/../../runtime_env/schemas/working_dir_schema.json', '/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/runtime_env/../../runtime_env/schemas/pip_schema.json'].
(BaseWorkerMixin pid=4378) 2022-07-18 11:09:50,904      INFO config.py:71 -- Setting up process group for: env:// [rank=4, world_size=16]
(BaseWorkerMixin pid=4377) 2022-07-18 11:09:50,901      INFO config.py:71 -- Setting up process group for: env:// [rank=3, world_size=16]
(BaseWorkerMixin pid=4376) 2022-07-18 11:09:50,909      INFO config.py:71 -- Setting up process group for: env:// [rank=2, world_size=16]
(BaseWorkerMixin pid=4380) 2022-07-18 11:09:50,904      INFO config.py:71 -- Setting up process group for: env:// [rank=6, world_size=16]
(BaseWorkerMixin pid=4374) 2022-07-18 11:09:50,904      INFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=16]
(BaseWorkerMixin pid=4375) 2022-07-18 11:09:50,901      INFO config.py:71 -- Setting up process group for: env:// [rank=1, world_size=16]
(BaseWorkerMixin pid=4379) 2022-07-18 11:09:50,903      INFO config.py:71 -- Setting up process group for: env:// [rank=5, world_size=16]
(BaseWorkerMixin pid=4383) 2022-07-18 11:09:50,904      INFO config.py:71 -- Setting up process group for: env:// [rank=9, world_size=16]
(BaseWorkerMixin pid=4381) 2022-07-18 11:09:50,902      INFO config.py:71 -- Setting up process group for: env:// [rank=7, world_size=16]
(BaseWorkerMixin pid=4388) 2022-07-18 11:09:50,904      INFO config.py:71 -- Setting up process group for: env:// [rank=14, world_size=16]
(BaseWorkerMixin pid=4382) 2022-07-18 11:09:50,904      INFO config.py:71 -- Setting up process group for: env:// [rank=8, world_size=16]
(BaseWorkerMixin pid=4389) 2022-07-18 11:09:50,902      INFO config.py:71 -- Setting up process group for: env:// [rank=15, world_size=16]
(BaseWorkerMixin pid=4385) 2022-07-18 11:09:50,901      INFO config.py:71 -- Setting up process group for: env:// [rank=11, world_size=16]
(BaseWorkerMixin pid=4387) 2022-07-18 11:09:50,902      INFO config.py:71 -- Setting up process group for: env:// [rank=13, world_size=16]
(BaseWorkerMixin pid=4384) 2022-07-18 11:09:50,903      INFO config.py:71 -- Setting up process group for: env:// [rank=10, world_size=16]
(BaseWorkerMixin pid=4386) 2022-07-18 11:09:50,901      INFO config.py:71 -- Setting up process group for: env:// [rank=12, world_size=16]
2022-07-18 11:09:51,963 INFO trainer.py:253 -- Run results will be logged in: /Users/jimmy/ray_results/train_2022-07-18_11-09-42/run_001
Counter({None: 16})

 ray/train/tests/test_gpu.py33% ███▍      2022-07-18 11:09:55,808 ERROR services.py:1494 -- Failed to start the dashboard: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
2022-07-18 11:09:55,808 ERROR services.py:1495 -- Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
Traceback (most recent call last):
  File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1362, in start_dashboard
    port_test_socket.bind((host, port))
OSError: [Errno 48] Address already in use

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1369, in start_dashboard
    raise ValueError(
ValueError: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
2022-07-18 11:09:56,100 INFO worker.py:1295 -- Connecting to existing Ray cluster at address: 127.0.0.1:60201
2022-07-18 11:09:56,131 INFO trainer.py:247 -- Trainer logs will be logged in: /Users/jimmy/ray_results/train_2022-07-18_11-09-56
(BaseWorkerMixin pid=4467) 2022-07-18 11:10:01,323      INFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=8]
(BaseWorkerMixin pid=4468) 2022-07-18 11:10:01,325      INFO config.py:71 -- Setting up process group for: env:// [rank=1, world_size=8]
(BaseWorkerMixin pid=4471) 2022-07-18 11:10:01,328      INFO config.py:71 -- Setting up process group for: env:// [rank=4, world_size=8]
(BaseWorkerMixin pid=4473) 2022-07-18 11:10:01,325      INFO config.py:71 -- Setting up process group for: env:// [rank=6, world_size=8]
(BaseWorkerMixin pid=4470) 2022-07-18 11:10:01,323      INFO config.py:71 -- Setting up process group for: env:// [rank=3, world_size=8]
(BaseWorkerMixin pid=4472) 2022-07-18 11:10:01,328      INFO config.py:71 -- Setting up process group for: env:// [rank=5, world_size=8]
(BaseWorkerMixin pid=4469) 2022-07-18 11:10:01,323      INFO config.py:71 -- Setting up process group for: env:// [rank=2, world_size=8]
2022-07-18 11:10:01,366 INFO trainer.py:253 -- Run results will be logged in: /Users/jimmy/ray_results/train_2022-07-18_11-09-56/run_001
(BaseWorkerMixin pid=4474) 2022-07-18 11:10:01,336      INFO config.py:71 -- Setting up process group for: env:// [rank=7, world_size=8]
Counter({None: 8})
 ray/train/tests/test_gpu.py ✓✓                                                                                                                                                67% ██████▋   2022-07-18 11:10:05,802 ERROR services.py:1494 -- Failed to start the dashboard: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
2022-07-18 11:10:05,803 ERROR services.py:1495 -- Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
Traceback (most recent call last):
  File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1362, in start_dashboard
    port_test_socket.bind((host, port))
OSError: [Errno 48] Address already in use

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1369, in start_dashboard
    raise ValueError(
ValueError: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
2022-07-18 11:10:06,091 INFO worker.py:1295 -- Connecting to existing Ray cluster at address: 127.0.0.1:64379
2022-07-18 11:10:06,119 INFO trainer.py:247 -- Trainer logs will be logged in: /Users/jimmy/ray_results/train_2022-07-18_11-10-06
2022-07-18 11:10:10,828 INFO trainer.py:253 -- Run results will be logged in: /Users/jimmy/ray_results/train_2022-07-18_11-10-06/run_001
(BaseWorkerMixin pid=4544) 2022-07-18 11:10:10,785      INFO config.py:71 -- Setting up process group for: env:// [rank=1, world_size=4]
(BaseWorkerMixin pid=4546) 2022-07-18 11:10:10,784      INFO config.py:71 -- Setting up process group for: env:// [rank=3, world_size=4]
(BaseWorkerMixin pid=4543) 2022-07-18 11:10:10,784      INFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=4]
(BaseWorkerMixin pid=4545) 2022-07-18 11:10:10,784      INFO config.py:71 -- Setting up process group for: env:// [rank=2, world_size=4]
Counter({None: 4})
 ray/train/tests/test_gpu.py ✓✓✓                                                                                                                                              100% ██████████
===================================================================================== warnings summary ======================================================================================
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:36
  /Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:36: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.
    'nearest': pil_image.NEAREST,

/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:37
  /Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:37: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
    'bilinear': pil_image.BILINEAR,

/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:38
  /Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:38: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
    'bicubic': pil_image.BICUBIC,

/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:39
  /Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:39: DeprecationWarning: HAMMING is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.HAMMING instead.
    'hamming': pil_image.HAMMING,

/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:40
  /Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:40: DeprecationWarning: BOX is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BOX instead.
    'box': pil_image.BOX,

/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:41
  /Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:41: DeprecationWarning: LANCZOS is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.LANCZOS instead.
    'lanczos': pil_image.LANCZOS,

/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/air/util/tensor_extensions/pandas.py:168
  /Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/air/util/tensor_extensions/pandas.py:168: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    if LooseVersion("1.1.0") <= LooseVersion(pd.__version__) < LooseVersion("1.3.0"):

ray/train/tests/test_gpu.py::test_torch_get_device_dist[0.5]
ray/train/tests/test_gpu.py::test_torch_get_device_dist[1]
ray/train/tests/test_gpu.py::test_torch_get_device_dist[2]
  /Users/jimmy/ScratchGym/Scratch/test0718/ray/python/ray/train/tests/test_gpu.py:95: DeprecationWarning: The `ray.train.Trainer` API is deprecated in Ray 2.0, and is replaced by Ray AI Runtime (Ray AIR). Ray AIR (https://docs.ray.io/en/latest/ray-air/getting-started.html) will provide greater functionality than `ray.train.Trainer`, and with a more flexible and easy-to-use API.
    trainer = Trainer(

-- Docs: https://docs.pytest.org/en/latest/warnings.html
========================================================================================== PASSES ===========================================================================================
______________________________________________________________________________ test_torch_get_device_dist[0.5] ______________________________________________________________________________
------------------------------------------------------------------------------------ Captured log setup -------------------------------------------------------------------------------------
ERROR    ray._private.services:services.py:1494 Failed to start the dashboard: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
ERROR    ray._private.services:services.py:1495 Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
Traceback (most recent call last):
  File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1362, in start_dashboard
    port_test_socket.bind((host, port))
OSError: [Errno 48] Address already in use

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1369, in start_dashboard
    raise ValueError(
ValueError: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
______________________________________________________________________________ test_torch_get_device_dist[0.5] ______________________________________________________________________________
------------------------------------------------------------------------------------ Captured log setup -------------------------------------------------------------------------------------
ERROR    ray._private.services:services.py:1494 Failed to start the dashboard: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
ERROR    ray._private.services:services.py:1495 Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
Traceback (most recent call last):
  File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1362, in start_dashboard
    port_test_socket.bind((host, port))
OSError: [Errno 48] Address already in use

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1369, in start_dashboard
    raise ValueError(
ValueError: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
______________________________________________________________________________ test_torch_get_device_dist[0.5] ______________________________________________________________________________
------------------------------------------------------------------------------------ Captured log setup -------------------------------------------------------------------------------------
ERROR    ray._private.services:services.py:1494 Failed to start the dashboard: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
ERROR    ray._private.services:services.py:1495 Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
Traceback (most recent call last):
  File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1362, in start_dashboard
    port_test_socket.bind((host, port))
OSError: [Errno 48] Address already in use

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1369, in start_dashboard
    raise ValueError(
ValueError: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.

@JiahaoYao
Contributor Author

This is because torch.cuda.is_available() == False.

@richardliaw
Contributor

@JiahaoYao can you just monkeypatch torch.cuda.is_available?

https://docs.python.org/3/library/unittest.mock.html#attaching-mocks-as-attributes
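
A rough sketch of that suggestion (the test name and body are illustrative, not the test added in this PR):

```python
from unittest.mock import patch

import torch


def test_device_logic_without_physical_gpu():
    # Force torch.cuda.is_available() to report True so the GPU code path can
    # be exercised on a machine without CUDA (e.g. the macOS laptop above).
    with patch.object(torch.cuda, "is_available", return_value=True):
        assert torch.cuda.is_available()
        # ... run the Trainer / device-selection logic under test here ...
```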

@JiahaoYao
Contributor Author

@richardliaw thanks, let me try this

python/ray/train/torch/train_loop_utils.py
@@ -299,6 +299,40 @@ def train_func():
assert len(trial_dfs[0]["training_iteration"]) == 4


def test_tune_torch_get_device(num_workers=1, num_gpus_per_worker=1):
Contributor

test_tune is not run in the GPU test suite. Do we still need this test, or is it a duplicate of the other tests that we have?

Contributor Author

I followed the previous style; the actual test happens in test_gpu.py.

Contributor Author

@JiahaoYao JiahaoYao left a comment

nit

@JiahaoYao JiahaoYao changed the title from "[RAY AIR] FIX gpu id to be local rank" to "[RAY AIR] set the correct gpu id in TorchTrainer" on Jul 18, 2022
@richardliaw richardliaw merged commit 8284270 into ray-project:master Jul 19, 2022
xwjiang2010 pushed a commit to xwjiang2010/ray that referenced this pull request Jul 19, 2022
@krfricke krfricke mentioned this pull request Jul 19, 2022
krfricke added a commit that referenced this pull request Jul 19, 2022
Broken by #26493

Signed-off-by: Kai Fricke <[email protected]>
danielwen002 pushed a commit to danielwen002/ray that referenced this pull request Jul 19, 2022
Contributor Author

@JiahaoYao JiahaoYao left a comment

update my review

python/ray/train/torch/train_loop_utils.py
richardliaw pushed a commit that referenced this pull request Jul 27, 2022
Rohan138 pushed a commit to Rohan138/ray that referenced this pull request Jul 28, 2022
Broken by ray-project#26493

Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Rohan138 <[email protected]>
Rohan138 pushed a commit to Rohan138/ray that referenced this pull request Jul 28, 2022
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
Broken by ray-project#26493

Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Stefan van der Kleij <[email protected]>
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022