[RAY AIR] set the correct gpu id in TorchTrainer
#26493
Conversation
python/ray/train/tests/test_gpu.py
Outdated
elif num_gpus_per_worker == 2:
    assert devices == [0]
Does this run? Might need to increase the fixture to have 4 GPUs since there are 2 workers?
Also this probably doesn't test complete coverage (I think the previous code would succeed here as well).
Would be good to have a test where the index is not equal to the GPU ID.
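For illustration, a test along those lines could look like the sketch below; the test name and the pinned CUDA_VISIBLE_DEVICES value are assumptions for this example, not code from the PR.

import os

# Hypothetical sketch: make the Ray-assigned GPU ID differ from the torch
# device index by pinning CUDA_VISIBLE_DEVICES to non-zero IDs.
def test_torch_get_device_with_offset_gpu_ids(monkeypatch):
    monkeypatch.setenv("CUDA_VISIBLE_DEVICES", "2,3")
    visible = [int(x) for x in os.environ["CUDA_VISIBLE_DEVICES"].split(",")]
    # Ray would hand this worker GPU ID 2; torch should address it as cuda:0.
    assert visible.index(2) == 0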
agree
remove this
Thanks @JiahaoYao!
python/ray/train/tests/test_gpu.py
Outdated
@@ -349,6 +349,10 @@ def test_tune_tensorflow_mnist_gpu(ray_start_4_cpus_2_gpus):
    tune_tensorflow_mnist(num_workers=2, use_gpu=True, num_samples=1)


def test_concurrent_tune_tensorflow_mnist_gpu(ray_start_4_cpus_2_gpus):
    tune_tensorflow_mnist(num_workers=1, use_gpu=True, num_samples=2)
The change was made to train.torch, so the TensorFlow tests are not actually exercising the change.
Oops, my bad.
device_id = gpu_ids[0]
gpu_id = gpu_ids[0]
cuda_visible_list = list(
    map(int, ray._private.utils.get_cuda_visible_devices())
Can we manually get the CUDA visible devices via the env var instead of using a private Ray API? Or move the private API to ray.util and mark it as a developer API.
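A minimal sketch of that suggestion, assuming the value is read straight from the environment (the helper name is made up for illustration):

import os

def _cuda_visible_devices_from_env():
    # CUDA_VISIBLE_DEVICES is a comma-separated list such as "2,3"; it may be unset.
    raw = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [int(d) for d in raw.split(",") if d.strip()]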
cuda_visible_list = list(
    map(int, ray._private.utils.get_cuda_visible_devices())
)
device_id = cuda_visible_list.index(gpu_id)
Will this fail if cuda_visible_list is empty?
Thanks @amogkam for the review, about to modify the code accordingly.
len(gpu_ids) > 0 will ensure this?
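To make the concern concrete, a defensive version of the lookup might look like this sketch (illustrative only; the final PR code may differ): it only indexes into the list when the worker actually received a GPU.

import torch

def _resolve_torch_device(gpu_ids, cuda_visible_list):
    # gpu_ids is assumed to come from ray.get_gpu_ids(); fall back to CPU when empty.
    if gpu_ids and cuda_visible_list:
        gpu_id = int(gpu_ids[0])
        # .index() would raise ValueError if the ID is missing, so check membership first.
        if gpu_id in cuda_visible_list:
            return torch.device(f"cuda:{cuda_visible_list.index(gpu_id)}")
    return torch.device("cpu")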
Can we add a PR description? Thanks!!
The virtual GPU does not work:

(ray_test) ~/ScratchGym/Scratch/test0718/ray/python/ray/train/tests python -m pytest -rP --capture no test_gpu.py::test_torch_get_device_dist
Test session starts (platform: darwin, Python 3.8.13, pytest 5.4.3, pytest-sugar 0.9.5)
rootdir: /Users/jimmy/ScratchGym/Scratch/test0718/ray/python
plugins: asyncio-0.16.0, anyio-3.6.1, forked-1.4.0, sugar-0.9.5, pytest_docker_tools-0.2.3, timeout-2.1.0, shutil-1.7.0, rerunfailures-10.2, virtualenv-1.7.0, lazy-fixture-0.6.3
collecting ... 2022-07-18 11:04:52,131 INFO trainer.py:247 -- Trainer logs will be logged in: /Users/jimmy/ray_results/train_2022-07-18_11-04-52
2022-07-18 11:04:52,133 INFO plugin_schema_manager.py:51 -- Loading the default runtime env schemas: ['/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/runtime_env/../../runtime_env/schemas/working_dir_schema.json', '/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/runtime_env/../../runtime_env/schemas/pip_schema.json'].
(BaseWorkerMixin pid=3658) 2022-07-18 11:04:58,145 INFO config.py:70 -- Setting up process group for: env:// [rank=2, world_size=16]
(BaseWorkerMixin pid=3661) 2022-07-18 11:04:58,165 INFO config.py:70 -- Setting up process group for: env:// [rank=5, world_size=16]
(BaseWorkerMixin pid=3664) 2022-07-18 11:04:58,155 INFO config.py:70 -- Setting up process group for: env:// [rank=8, world_size=16]
(BaseWorkerMixin pid=3660) 2022-07-18 11:04:58,140 INFO config.py:70 -- Setting up process group for: env:// [rank=4, world_size=16]
(BaseWorkerMixin pid=3662) 2022-07-18 11:04:58,160 INFO config.py:70 -- Setting up process group for: env:// [rank=6, world_size=16]
(BaseWorkerMixin pid=3669) 2022-07-18 11:04:58,161 INFO config.py:70 -- Setting up process group for: env:// [rank=13, world_size=16]
(BaseWorkerMixin pid=3659) 2022-07-18 11:04:58,163 INFO config.py:70 -- Setting up process group for: env:// [rank=3, world_size=16]
(BaseWorkerMixin pid=3665) 2022-07-18 11:04:58,163 INFO config.py:70 -- Setting up process group for: env:// [rank=9, world_size=16]
(BaseWorkerMixin pid=3663) 2022-07-18 11:04:58,159 INFO config.py:70 -- Setting up process group for: env:// [rank=7, world_size=16]
(BaseWorkerMixin pid=3657) 2022-07-18 11:04:58,157 INFO config.py:70 -- Setting up process group for: env:// [rank=1, world_size=16]
(BaseWorkerMixin pid=3666) 2022-07-18 11:04:58,161 INFO config.py:70 -- Setting up process group for: env:// [rank=10, world_size=16]
(BaseWorkerMixin pid=3656) 2022-07-18 11:04:58,138 INFO config.py:70 -- Setting up process group for: env:// [rank=0, world_size=16]
(BaseWorkerMixin pid=3670) 2022-07-18 11:04:58,159 INFO config.py:70 -- Setting up process group for: env:// [rank=14, world_size=16]
(BaseWorkerMixin pid=3668) 2022-07-18 11:04:58,161 INFO config.py:70 -- Setting up process group for: env:// [rank=12, world_size=16]
(BaseWorkerMixin pid=3671) 2022-07-18 11:04:58,163 INFO config.py:70 -- Setting up process group for: env:// [rank=15, world_size=16]
2022-07-18 11:04:58,215 INFO trainer.py:253 -- Run results will be logged in: /Users/jimmy/ray_results/train_2022-07-18_11-04-52/run_001
(BaseWorkerMixin pid=3667) 2022-07-18 11:04:58,213 INFO config.py:70 -- Setting up process group for: env:// [rank=11, world_size=16]
Counter({None: 16})
ray/train/tests/test_gpu.py ✓ 33% ███▍ 2022-07-18 11:05:02,241 ERROR services.py:1494 -- Failed to start the dashboard: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
2022-07-18 11:05:02,242 ERROR services.py:1495 -- Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
Traceback (most recent call last):
File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1362, in start_dashboard
port_test_socket.bind((host, port))
OSError: [Errno 48] Address already in use
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1369, in start_dashboard
raise ValueError(
ValueError: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
2022-07-18 11:05:02,531 INFO worker.py:1295 -- Connecting to existing Ray cluster at address: 127.0.0.1:65478
2022-07-18 11:05:02,561 INFO trainer.py:247 -- Trainer logs will be logged in: /Users/jimmy/ray_results/train_2022-07-18_11-05-02
2022-07-18 11:05:07,332 INFO trainer.py:253 -- Run results will be logged in: /Users/jimmy/ray_results/train_2022-07-18_11-05-02/run_001
(BaseWorkerMixin pid=3724) 2022-07-18 11:05:07,327 INFO config.py:70 -- Setting up process group for: env:// [rank=5, world_size=8]
(BaseWorkerMixin pid=3722) 2022-07-18 11:05:07,328 INFO config.py:70 -- Setting up process group for: env:// [rank=3, world_size=8]
(BaseWorkerMixin pid=3723) 2022-07-18 11:05:07,326 INFO config.py:70 -- Setting up process group for: env:// [rank=4, world_size=8]
(BaseWorkerMixin pid=3725) 2022-07-18 11:05:07,331 INFO config.py:70 -- Setting up process group for: env:// [rank=6, world_size=8]
(BaseWorkerMixin pid=3726) 2022-07-18 11:05:07,331 INFO config.py:70 -- Setting up process group for: env:// [rank=7, world_size=8]
(BaseWorkerMixin pid=3721) 2022-07-18 11:05:07,327 INFO config.py:70 -- Setting up process group for: env:// [rank=2, world_size=8]
(BaseWorkerMixin pid=3720) 2022-07-18 11:05:07,325 INFO config.py:70 -- Setting up process group for: env:// [rank=1, world_size=8]
(BaseWorkerMixin pid=3719) 2022-07-18 11:05:07,327 INFO config.py:70 -- Setting up process group for: env:// [rank=0, world_size=8]
Counter({None: 8})
ray/train/tests/test_gpu.py ✓✓ 67% ██████▋ 2022-07-18 11:05:11,405 ERROR services.py:1494 -- Failed to start the dashboard: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
2022-07-18 11:05:11,406 ERROR services.py:1495 -- Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
Traceback (most recent call last):
File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1362, in start_dashboard
port_test_socket.bind((host, port))
OSError: [Errno 48] Address already in use
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1369, in start_dashboard
raise ValueError(
ValueError: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
2022-07-18 11:05:11,695 INFO worker.py:1295 -- Connecting to existing Ray cluster at address: 127.0.0.1:64480
2022-07-18 11:05:11,723 INFO trainer.py:247 -- Trainer logs will be logged in: /Users/jimmy/ray_results/train_2022-07-18_11-05-11
2022-07-18 11:05:16,002 INFO trainer.py:253 -- Run results will be logged in: /Users/jimmy/ray_results/train_2022-07-18_11-05-11/run_001
(BaseWorkerMixin pid=3763) 2022-07-18 11:05:15,999 INFO config.py:70 -- Setting up process group for: env:// [rank=2, world_size=4]
(BaseWorkerMixin pid=3761) 2022-07-18 11:05:16,000 INFO config.py:70 -- Setting up process group for: env:// [rank=0, world_size=4]
(BaseWorkerMixin pid=3762) 2022-07-18 11:05:15,999 INFO config.py:70 -- Setting up process group for: env:// [rank=1, world_size=4]
(BaseWorkerMixin pid=3764) 2022-07-18 11:05:16,001 INFO config.py:70 -- Setting up process group for: env:// [rank=3, world_size=4]
Counter({None: 4})
ray/train/tests/test_gpu.py ✓✓✓ 100% ██████████
===================================================================================== warnings summary ======================================================================================
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:36
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:36: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.
'nearest': pil_image.NEAREST,
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:37
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:37: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
'bilinear': pil_image.BILINEAR,
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:38
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:38: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
'bicubic': pil_image.BICUBIC,
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:39
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:39: DeprecationWarning: HAMMING is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.HAMMING instead.
'hamming': pil_image.HAMMING,
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:40
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:40: DeprecationWarning: BOX is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BOX instead.
'box': pil_image.BOX,
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:41
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:41: DeprecationWarning: LANCZOS is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.LANCZOS instead.
'lanczos': pil_image.LANCZOS,
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/air/util/tensor_extensions/pandas.py:168
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/air/util/tensor_extensions/pandas.py:168: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
if LooseVersion("1.1.0") <= LooseVersion(pd.__version__) < LooseVersion("1.3.0"):
ray/train/tests/test_gpu.py::test_torch_get_device_dist[0.5]
ray/train/tests/test_gpu.py::test_torch_get_device_dist[1]
ray/train/tests/test_gpu.py::test_torch_get_device_dist[2]
/Users/jimmy/ScratchGym/Scratch/test0718/ray/python/ray/train/tests/test_gpu.py:95: DeprecationWarning: The `ray.train.Trainer` API is deprecated in Ray 2.0, and is replaced by Ray AI Runtime (Ray AIR). Ray AIR (https://docs.ray.io/en/latest/ray-air/getting-started.html) will provide greater functionality than `ray.train.Trainer`, and with a more flexible and easy-to-use API.
trainer = Trainer(
-- Docs: https://docs.pytest.org/en/latest/warnings.html
========================================================================================== PASSES ===========================================================================================
______________________________________________________________________________ test_torch_get_device_dist[0.5] ______________________________________________________________________________
------------------------------------------------------------------------------------ Captured log setup -------------------------------------------------------------------------------------
ERROR ray._private.services:services.py:1494 Failed to start the dashboard: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
ERROR ray._private.services:services.py:1495 Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
Traceback (most recent call last):
File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1362, in start_dashboard
port_test_socket.bind((host, port))
OSError: [Errno 48] Address already in use
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1369, in start_dashboard
raise ValueError(
ValueError: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
______________________________________________________________________________ test_torch_get_device_dist[0.5] ______________________________________________________________________________
------------------------------------------------------------------------------------ Captured log setup -------------------------------------------------------------------------------------
ERROR ray._private.services:services.py:1494 Failed to start the dashboard: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
ERROR ray._private.services:services.py:1495 Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
Traceback (most recent call last):
File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1362, in start_dashboard
port_test_socket.bind((host, port))
OSError: [Errno 48] Address already in use
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1369, in start_dashboard
raise ValueError(
ValueError: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
______________________________________________________________________________ test_torch_get_device_dist[0.5] ______________________________________________________________________________
------------------------------------------------------------------------------------ Captured log setup -------------------------------------------------------------------------------------
ERROR ray._private.services:services.py:1494 Failed to start the dashboard: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
ERROR ray._private.services:services.py:1495 Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
Traceback (most recent call last):
File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1362, in start_dashboard
port_test_socket.bind((host, port))
OSError: [Errno 48] Address already in use
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1369, in start_dashboard
raise ValueError(
ValueError: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
Results (34.25s):
3 passed
(ray_test) ~/ScratchGym/Scratch/test0718/ray/python/ray/train/tests python -m pytest -rP --capture no test_gpu.py::test_torch_get_device_dist
Test session starts (platform: darwin, Python 3.8.13, pytest 5.4.3, pytest-sugar 0.9.5)
rootdir: /Users/jimmy/ScratchGym/Scratch/test0718/ray/python
plugins: asyncio-0.16.0, anyio-3.6.1, forked-1.4.0, sugar-0.9.5, pytest_docker_tools-0.2.3, timeout-2.1.0, shutil-1.7.0, rerunfailures-10.2, virtualenv-1.7.0, lazy-fixture-0.6.3
collecting ... 2022-07-18 11:09:42,305 INFO trainer.py:247 -- Trainer logs will be logged in: /Users/jimmy/ray_results/train_2022-07-18_11-09-42
2022-07-18 11:09:42,308 INFO plugin_schema_manager.py:51 -- Loading the default runtime env schemas: ['/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/runtime_env/../../runtime_env/schemas/working_dir_schema.json', '/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/runtime_env/../../runtime_env/schemas/pip_schema.json'].
(BaseWorkerMixin pid=4378) 2022-07-18 11:09:50,904 INFO config.py:71 -- Setting up process group for: env:// [rank=4, world_size=16]
(BaseWorkerMixin pid=4377) 2022-07-18 11:09:50,901 INFO config.py:71 -- Setting up process group for: env:// [rank=3, world_size=16]
(BaseWorkerMixin pid=4376) 2022-07-18 11:09:50,909 INFO config.py:71 -- Setting up process group for: env:// [rank=2, world_size=16]
(BaseWorkerMixin pid=4380) 2022-07-18 11:09:50,904 INFO config.py:71 -- Setting up process group for: env:// [rank=6, world_size=16]
(BaseWorkerMixin pid=4374) 2022-07-18 11:09:50,904 INFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=16]
(BaseWorkerMixin pid=4375) 2022-07-18 11:09:50,901 INFO config.py:71 -- Setting up process group for: env:// [rank=1, world_size=16]
(BaseWorkerMixin pid=4379) 2022-07-18 11:09:50,903 INFO config.py:71 -- Setting up process group for: env:// [rank=5, world_size=16]
(BaseWorkerMixin pid=4383) 2022-07-18 11:09:50,904 INFO config.py:71 -- Setting up process group for: env:// [rank=9, world_size=16]
(BaseWorkerMixin pid=4381) 2022-07-18 11:09:50,902 INFO config.py:71 -- Setting up process group for: env:// [rank=7, world_size=16]
(BaseWorkerMixin pid=4388) 2022-07-18 11:09:50,904 INFO config.py:71 -- Setting up process group for: env:// [rank=14, world_size=16]
(BaseWorkerMixin pid=4382) 2022-07-18 11:09:50,904 INFO config.py:71 -- Setting up process group for: env:// [rank=8, world_size=16]
(BaseWorkerMixin pid=4389) 2022-07-18 11:09:50,902 INFO config.py:71 -- Setting up process group for: env:// [rank=15, world_size=16]
(BaseWorkerMixin pid=4385) 2022-07-18 11:09:50,901 INFO config.py:71 -- Setting up process group for: env:// [rank=11, world_size=16]
(BaseWorkerMixin pid=4387) 2022-07-18 11:09:50,902 INFO config.py:71 -- Setting up process group for: env:// [rank=13, world_size=16]
(BaseWorkerMixin pid=4384) 2022-07-18 11:09:50,903 INFO config.py:71 -- Setting up process group for: env:// [rank=10, world_size=16]
(BaseWorkerMixin pid=4386) 2022-07-18 11:09:50,901 INFO config.py:71 -- Setting up process group for: env:// [rank=12, world_size=16]
2022-07-18 11:09:51,963 INFO trainer.py:253 -- Run results will be logged in: /Users/jimmy/ray_results/train_2022-07-18_11-09-42/run_001
Counter({None: 16})
ray/train/tests/test_gpu.py ✓ 33% ███▍ 2022-07-18 11:09:55,808 ERROR services.py:1494 -- Failed to start the dashboard: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
2022-07-18 11:09:55,808 ERROR services.py:1495 -- Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
Traceback (most recent call last):
File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1362, in start_dashboard
port_test_socket.bind((host, port))
OSError: [Errno 48] Address already in use
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1369, in start_dashboard
raise ValueError(
ValueError: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
2022-07-18 11:09:56,100 INFO worker.py:1295 -- Connecting to existing Ray cluster at address: 127.0.0.1:60201
2022-07-18 11:09:56,131 INFO trainer.py:247 -- Trainer logs will be logged in: /Users/jimmy/ray_results/train_2022-07-18_11-09-56
(BaseWorkerMixin pid=4467) 2022-07-18 11:10:01,323 INFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=8]
(BaseWorkerMixin pid=4468) 2022-07-18 11:10:01,325 INFO config.py:71 -- Setting up process group for: env:// [rank=1, world_size=8]
(BaseWorkerMixin pid=4471) 2022-07-18 11:10:01,328 INFO config.py:71 -- Setting up process group for: env:// [rank=4, world_size=8]
(BaseWorkerMixin pid=4473) 2022-07-18 11:10:01,325 INFO config.py:71 -- Setting up process group for: env:// [rank=6, world_size=8]
(BaseWorkerMixin pid=4470) 2022-07-18 11:10:01,323 INFO config.py:71 -- Setting up process group for: env:// [rank=3, world_size=8]
(BaseWorkerMixin pid=4472) 2022-07-18 11:10:01,328 INFO config.py:71 -- Setting up process group for: env:// [rank=5, world_size=8]
(BaseWorkerMixin pid=4469) 2022-07-18 11:10:01,323 INFO config.py:71 -- Setting up process group for: env:// [rank=2, world_size=8]
2022-07-18 11:10:01,366 INFO trainer.py:253 -- Run results will be logged in: /Users/jimmy/ray_results/train_2022-07-18_11-09-56/run_001
(BaseWorkerMixin pid=4474) 2022-07-18 11:10:01,336 INFO config.py:71 -- Setting up process group for: env:// [rank=7, world_size=8]
Counter({None: 8})
ray/train/tests/test_gpu.py ✓✓ 67% ██████▋ 2022-07-18 11:10:05,802 ERROR services.py:1494 -- Failed to start the dashboard: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
2022-07-18 11:10:05,803 ERROR services.py:1495 -- Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
Traceback (most recent call last):
File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1362, in start_dashboard
port_test_socket.bind((host, port))
OSError: [Errno 48] Address already in use
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1369, in start_dashboard
raise ValueError(
ValueError: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
2022-07-18 11:10:06,091 INFO worker.py:1295 -- Connecting to existing Ray cluster at address: 127.0.0.1:64379
2022-07-18 11:10:06,119 INFO trainer.py:247 -- Trainer logs will be logged in: /Users/jimmy/ray_results/train_2022-07-18_11-10-06
2022-07-18 11:10:10,828 INFO trainer.py:253 -- Run results will be logged in: /Users/jimmy/ray_results/train_2022-07-18_11-10-06/run_001
(BaseWorkerMixin pid=4544) 2022-07-18 11:10:10,785 INFO config.py:71 -- Setting up process group for: env:// [rank=1, world_size=4]
(BaseWorkerMixin pid=4546) 2022-07-18 11:10:10,784 INFO config.py:71 -- Setting up process group for: env:// [rank=3, world_size=4]
(BaseWorkerMixin pid=4543) 2022-07-18 11:10:10,784 INFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=4]
(BaseWorkerMixin pid=4545) 2022-07-18 11:10:10,784 INFO config.py:71 -- Setting up process group for: env:// [rank=2, world_size=4]
Counter({None: 4})
ray/train/tests/test_gpu.py ✓✓✓ 100% ██████████
===================================================================================== warnings summary ======================================================================================
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:36
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:36: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.
'nearest': pil_image.NEAREST,
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:37
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:37: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
'bilinear': pil_image.BILINEAR,
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:38
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:38: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
'bicubic': pil_image.BICUBIC,
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:39
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:39: DeprecationWarning: HAMMING is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.HAMMING instead.
'hamming': pil_image.HAMMING,
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:40
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:40: DeprecationWarning: BOX is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BOX instead.
'box': pil_image.BOX,
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:41
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/keras/utils/image_utils.py:41: DeprecationWarning: LANCZOS is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.LANCZOS instead.
'lanczos': pil_image.LANCZOS,
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/air/util/tensor_extensions/pandas.py:168
/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/air/util/tensor_extensions/pandas.py:168: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
if LooseVersion("1.1.0") <= LooseVersion(pd.__version__) < LooseVersion("1.3.0"):
ray/train/tests/test_gpu.py::test_torch_get_device_dist[0.5]
ray/train/tests/test_gpu.py::test_torch_get_device_dist[1]
ray/train/tests/test_gpu.py::test_torch_get_device_dist[2]
/Users/jimmy/ScratchGym/Scratch/test0718/ray/python/ray/train/tests/test_gpu.py:95: DeprecationWarning: The `ray.train.Trainer` API is deprecated in Ray 2.0, and is replaced by Ray AI Runtime (Ray AIR). Ray AIR (https://docs.ray.io/en/latest/ray-air/getting-started.html) will provide greater functionality than `ray.train.Trainer`, and with a more flexible and easy-to-use API.
trainer = Trainer(
-- Docs: https://docs.pytest.org/en/latest/warnings.html
========================================================================================== PASSES ===========================================================================================
______________________________________________________________________________ test_torch_get_device_dist[0.5] ______________________________________________________________________________
------------------------------------------------------------------------------------ Captured log setup -------------------------------------------------------------------------------------
ERROR ray._private.services:services.py:1494 Failed to start the dashboard: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
ERROR ray._private.services:services.py:1495 Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
Traceback (most recent call last):
File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1362, in start_dashboard
port_test_socket.bind((host, port))
OSError: [Errno 48] Address already in use
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1369, in start_dashboard
raise ValueError(
ValueError: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
______________________________________________________________________________ test_torch_get_device_dist[0.5] ______________________________________________________________________________
------------------------------------------------------------------------------------ Captured log setup -------------------------------------------------------------------------------------
ERROR ray._private.services:services.py:1494 Failed to start the dashboard: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
ERROR ray._private.services:services.py:1495 Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
Traceback (most recent call last):
File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1362, in start_dashboard
port_test_socket.bind((host, port))
OSError: [Errno 48] Address already in use
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1369, in start_dashboard
raise ValueError(
ValueError: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
______________________________________________________________________________ test_torch_get_device_dist[0.5] ______________________________________________________________________________
------------------------------------------------------------------------------------ Captured log setup -------------------------------------------------------------------------------------
ERROR ray._private.services:services.py:1494 Failed to start the dashboard: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
ERROR ray._private.services:services.py:1495 Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
Traceback (most recent call last):
File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1362, in start_dashboard
port_test_socket.bind((host, port))
OSError: [Errno 48] Address already in use
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/jimmy/opt/anaconda3/envs/ray_test/lib/python3.8/site-packages/ray/_private/services.py", line 1369, in start_dashboard
raise ValueError(
ValueError: Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.
this is because
@JiahaoYao can you just monkeypatch torch.cuda.is_available? https://docs.python.org/3/library/unittest.mock.html#attaching-mocks-as-attributes
@richardliaw thanks, let me try this |
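For reference, the monkeypatching idea could be sketched like this (illustrative only; the real test body in the PR may differ):

from unittest import mock
import torch

def test_device_selection_without_gpu_hardware():
    # Force torch to report CUDA support so the GPU code path runs on a CPU-only machine.
    with mock.patch.object(torch.cuda, "is_available", return_value=True):
        assert torch.cuda.is_available()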
python/ray/train/tests/test_tune.py
Outdated
@@ -299,6 +299,40 @@ def train_func():
    assert len(trial_dfs[0]["training_iteration"]) == 4


def test_tune_torch_get_device(num_workers=1, num_gpus_per_worker=1):
test_tune is not run in the GPU test suite. Do we still need this test, or is it a duplicate of the other tests that we have?
I followed the previous style; the actual test happens in test_gpu.py.
nit: TorchTrainer
Signed-off-by: Xiaowei Jiang <[email protected]>
Broken by #26493 Signed-off-by: Kai Fricke <[email protected]>
Broken by ray-project#26493 Signed-off-by: Kai Fricke <[email protected]> Signed-off-by: Daniel <[email protected]>
update my review
Broken by ray-project#26493 Signed-off-by: Kai Fricke <[email protected]> Signed-off-by: Rohan138 <[email protected]>
…id in TorchTrainer (2.0) (ray-project#26704) Signed-off-by: Rohan138 <[email protected]>
Signed-off-by: Stefan van der Kleij <[email protected]>
Broken by ray-project#26493 Signed-off-by: Kai Fricke <[email protected]> Signed-off-by: Stefan van der Kleij <[email protected]>
…id in TorchTrainer (2.0) (ray-project#26704) Signed-off-by: Stefan van der Kleij <[email protected]>
Why are these changes needed?
This PR fixes the issue raised in #26490. In the PyTorch trainer and Ray Tune, the CUDA device is not set correctly, which runs into the following error:
The PR aligns the GPU ID returned by ray.get_gpu_ids() with the GPU IDs in cuda_visible_devices. The index of that ID within cuda_visible_devices is used to get the torch.cuda.device. Per the Torch code, you generally want to use the local rank to set the device:
https://github.com/pytorch/pytorch/blob/35563f4fcd28e486cc58053acc15fe280123e7be/torch/distributed/launch.py#L72-L97
However, there are some edge cases when using fractional GPUs or multiple GPUs.
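As a rough illustration of such an edge case (assumed values, not output from this PR): with fractional GPUs, the local rank can disagree with the visible-device index, while mapping the Ray GPU ID through CUDA_VISIBLE_DEVICES still gives the right torch index.

# Assumed scenario: two workers share one GPU (num_gpus=0.5 each) and the node
# exposes only physical GPU 1, so CUDA_VISIBLE_DEVICES="1" for both workers.
cuda_visible = [1]                       # parsed from CUDA_VISIBLE_DEVICES
ray_gpu_id = 1                           # what ray.get_gpu_ids()[0] would return
local_rank = 1                           # the second worker's local rank
device_index = cuda_visible.index(ray_gpu_id)
print(f"local_rank={local_rank}, correct device=cuda:{device_index}")  # cuda:0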
Related issue number
#26490
Checks
I've run scripts/format.sh to lint the changes in this PR.