[Ray component: Air] (TorchTrainer) CUDA visible devices when using Ray Tune #26490

Closed

JiahaoYao opened this issue Jul 13, 2022 · 2 comments

Labels: bug (Something that is supposed to be working, but isn't); triage (Needs triage, e.g. priority, bug/not-bug, and owning component)

Comments

@JiahaoYao (Contributor)

What happened + What you expected to happen

This is the error message:

(plt) ubuntu@ip-172-31-50-232:~/ray_lightning/ray_lightning/launchers$ python test_torchtrainer.py 
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to /home/ubuntu/data/FashionMNIST/raw/train-images-idx3-ubyte.gz
100%|███████████████████████████████████████████████████████████████████| 26421880/26421880 [00:03<00:00, 8341256.27it/s]
Extracting /home/ubuntu/data/FashionMNIST/raw/train-images-idx3-ubyte.gz to /home/ubuntu/data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to /home/ubuntu/data/FashionMNIST/raw/train-labels-idx1-ubyte.gz
100%|██████████████████████████████████████████████████████████████████████████| 29515/29515 [00:00<00:00, 205203.03it/s]
Extracting /home/ubuntu/data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to /home/ubuntu/data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to /home/ubuntu/data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz
100%|█████████████████████████████████████████████████████████████████████| 4422102/4422102 [00:01<00:00, 3822473.71it/s]
Extracting /home/ubuntu/data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to /home/ubuntu/data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to /home/ubuntu/data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz
100%|██████████████████████████████████████████████████████████████████████████| 5148/5148 [00:00<00:00, 48741031.58it/s]
Extracting /home/ubuntu/data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to /home/ubuntu/data/FashionMNIST/raw

Traceback (most recent call last):
  File "test_torchtrainer.py", line 208, in <module>
    ray.init('auto')
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/worker.py", line 954, in init
    bootstrap_address = services.canonicalize_bootstrap_address(address)
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/_private/services.py", line 451, in canonicalize_bootstrap_address
    addr = get_ray_address_from_environment()
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/_private/services.py", line 358, in get_ray_address_from_environment
    addr = _find_gcs_address_or_die()
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/_private/services.py", line 340, in _find_gcs_address_or_die
    raise ConnectionError(
ConnectionError: Could not find any running Ray instance. Please specify the one to connect to by setting `--address` flag or `RAY_ADDRESS` environment variable.
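
(A side note on the first failure above, which is unrelated to the GPU problem: ray.init('auto') only connects to an already-running cluster, and none was up yet. The second invocation below succeeds, presumably after Ray was started; for completeness, the simplest way to avoid this particular error is to start a local instance instead of connecting via 'auto'.)

# Unrelated to the bug being reported; just avoids the ConnectionError above.
import ray

ray.init()  # starts a local Ray instance rather than connecting with ray.init('auto')
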
(plt) ubuntu@ip-172-31-50-232:~/ray_lightning/ray_lightning/launchers$ python test_torchtrainer.py 
2022-07-13 00:07:55,893 INFO services.py:1470 -- View the Ray dashboard at http://127.0.0.1:8265
2022-07-13 00:07:57,564 INFO trainer.py:243 -- Trainer logs will be logged in: /home/ubuntu/ray_results/train_2022-07-13_00-07-57
2022-07-13 00:08:03,511 WARNING worker.py:1404 -- Warning: The actor TrainTrainable is very large (52 MiB). Check that its definition is not implicitly capturing a large array or other object in scope. Tip: use ray.put() to put large objects in the Ray object store.
== Status ==
Current time: 2022-07-13 00:08:06 (running for 00:00:05.66)
Memory usage on this node: 6.4/186.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/48 CPUs, 2.0/4 GPUs, 0.0/120.27 GiB heap, 0.0/55.53 GiB objects
Result logdir: /home/ubuntu/ray_results/tune_function_2022-07-13_00-07-59
Number of trials: 2/2 (1 PENDING, 1 RUNNING)
+---------------------------+----------+---------------------+--------------+-------------+
| Trial name                | status   | loc                 |   batch_size |          lr |
|---------------------------+----------+---------------------+--------------+-------------|
| tune_function_db143_00000 | RUNNING  | 172.31.50.232:45780 |           32 | 0.0220007   |
| tune_function_db143_00001 | PENDING  |                     |          128 | 0.000709082 |
+---------------------------+----------+---------------------+--------------+-------------+


(TrainTrainable pid=45780) 2022-07-13 00:08:06,382      INFO trainer.py:243 -- Trainer logs will be logged in: /home/ubuntu/ray_results/train_2022-07-13_00-08-06
(TrainTrainable pid=45854) 2022-07-13 00:08:09,480      INFO trainer.py:243 -- Trainer logs will be logged in: /home/ubuntu/ray_results/train_2022-07-13_00-08-09
(BaseWorkerMixin pid=46104) 2022-07-13 00:08:10,970     INFO torch.py:346 -- Setting up process group for: env:// [rank=0, world_size=2]
(BaseWorkerMixin pid=46105) 2022-07-13 00:08:10,963     INFO torch.py:346 -- Setting up process group for: env:// [rank=1, world_size=2]
== Status ==
Current time: 2022-07-13 00:08:11 (running for 00:00:11.18)
Memory usage on this node: 7.0/186.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 6.0/48 CPUs, 4.0/4 GPUs, 0.0/120.27 GiB heap, 0.0/55.53 GiB objects
Result logdir: /home/ubuntu/ray_results/tune_function_2022-07-13_00-07-59
Number of trials: 2/2 (2 RUNNING)
+---------------------------+----------+---------------------+--------------+-------------+
| Trial name                | status   | loc                 |   batch_size |          lr |
|---------------------------+----------+---------------------+--------------+-------------|
| tune_function_db143_00000 | RUNNING  | 172.31.50.232:45780 |           32 | 0.0220007   |
| tune_function_db143_00001 | RUNNING  | 172.31.50.232:45854 |          128 | 0.000709082 |
+---------------------------+----------+---------------------+--------------+-------------+


(TrainTrainable pid=45780) 2022-07-13 00:08:11,976      INFO trainer.py:249 -- Run results will be logged in: /home/ubuntu/ray_results/train_2022-07-13_00-08-06/run_001
(BaseWorkerMixin pid=46213) 2022-07-13 00:08:14,055     INFO torch.py:346 -- Setting up process group for: env:// [rank=1, world_size=2]
(BaseWorkerMixin pid=46212) 2022-07-13 00:08:14,057     INFO torch.py:346 -- Setting up process group for: env:// [rank=0, world_size=2]
(BaseWorkerMixin pid=46104) 2022-07-13 00:08:14,743     INFO torch.py:98 -- Moving model to device: cuda:0
(BaseWorkerMixin pid=46104) 2022-07-13 00:08:14,758     INFO torch.py:132 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=46105) 2022-07-13 00:08:14,744     INFO torch.py:98 -- Moving model to device: cuda:1
(TrainTrainable pid=45854) 2022-07-13 00:08:15,068      INFO trainer.py:249 -- Run results will be logged in: /home/ubuntu/ray_results/train_2022-07-13_00-08-09/run_001
(BaseWorkerMixin pid=46105) 2022-07-13 00:08:15,859     INFO torch.py:132 -- Wrapping provided model in DDP.
== Status ==
Current time: 2022-07-13 00:08:16 (running for 00:00:16.18)
Memory usage on this node: 12.0/186.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 6.0/48 CPUs, 4.0/4 GPUs, 0.0/120.27 GiB heap, 0.0/55.53 GiB objects
Result logdir: /home/ubuntu/ray_results/tune_function_2022-07-13_00-07-59
Number of trials: 2/2 (2 RUNNING)
+---------------------------+----------+---------------------+--------------+-------------+
| Trial name                | status   | loc                 |   batch_size |          lr |
|---------------------------+----------+---------------------+--------------+-------------|
| tune_function_db143_00000 | RUNNING  | 172.31.50.232:45780 |           32 | 0.0220007   |
| tune_function_db143_00001 | RUNNING  | 172.31.50.232:45854 |          128 | 0.000709082 |
+---------------------------+----------+---------------------+--------------+-------------+


(BaseWorkerMixin pid=46104) loss: 2.295513  [    0/30000]
(BaseWorkerMixin pid=46105) loss: 2.300330  [    0/30000]
(BaseWorkerMixin pid=46104) loss: 1.856370  [ 1600/30000]
(BaseWorkerMixin pid=46105) loss: 2.008597  [ 1600/30000]
(BaseWorkerMixin pid=46104) loss: 1.670580  [ 3200/30000]
(BaseWorkerMixin pid=46105) loss: 1.576656  [ 3200/30000]
(TrainTrainable pid=45854) 2022-07-13 00:08:17,592      ERROR function_runner.py:286 -- Runner Thread raised error.
(TrainTrainable pid=45854) Traceback (most recent call last):
(TrainTrainable pid=45854)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/tune/function_runner.py", line 277, in run
(TrainTrainable pid=45854)     self._entrypoint()
(TrainTrainable pid=45854)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/tune/function_runner.py", line 349, in entrypoint
(TrainTrainable pid=45854)     return self._trainable_func(
(TrainTrainable pid=45854)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 462, in _resume_span
(TrainTrainable pid=45854)     return method(self, *_args, **_kwargs)
(TrainTrainable pid=45854)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in _trainable_func
(TrainTrainable pid=45854)     output = fn()
(TrainTrainable pid=45854)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/train/trainer.py", line 888, in tune_function
(TrainTrainable pid=45854)     for results in iterator:
(TrainTrainable pid=45854)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/train/trainer.py", line 752, in __next__
(TrainTrainable pid=45854)     self._final_results = self._run_with_error_handling(
(TrainTrainable pid=45854)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/train/trainer.py", line 713, in _run_with_error_handling
(TrainTrainable pid=45854)     return func()
(TrainTrainable pid=45854)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/train/trainer.py", line 824, in _finish_training
(TrainTrainable pid=45854)     return self._backend_executor.finish_training()
(TrainTrainable pid=45854)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/train/utils.py", line 168, in <lambda>
(TrainTrainable pid=45854)     return lambda *args, **kwargs: ray.get(actor_method.remote(*args, **kwargs))
(TrainTrainable pid=45854)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
(TrainTrainable pid=45854)     return func(*args, **kwargs)
(TrainTrainable pid=45854)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/worker.py", line 1831, in get
(TrainTrainable pid=45854)     raise value.as_instanceof_cause()
(TrainTrainable pid=45854) ray.exceptions.RayTaskError(RuntimeError): ray::BackendExecutor.finish_training() (pid=46172, ip=172.31.50.232, repr=<ray.train.backend.BackendExecutor object at 0x7f73dae43b80>)
(TrainTrainable pid=45854)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/train/backend.py", line 498, in finish_training
(TrainTrainable pid=45854)     results = self.get_with_failure_handling(futures)
(TrainTrainable pid=45854)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/train/backend.py", line 517, in get_with_failure_handling
(TrainTrainable pid=45854)     success = check_for_failure(remote_values)
(TrainTrainable pid=45854)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/train/utils.py", line 50, in check_for_failure
(TrainTrainable pid=45854)     ray.get(object_ref)
(TrainTrainable pid=45854) ray.exceptions.RayTaskError(RuntimeError): ray::BaseWorkerMixin._BaseWorkerMixin__execute() (pid=46213, ip=172.31.50.232, repr=<ray.train.worker_group.BaseWorkerMixin object at 0x7f222839ba30>)
(TrainTrainable pid=45854)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/train/worker_group.py", line 26, in __execute
(TrainTrainable pid=45854)     return func(*args, **kwargs)
(TrainTrainable pid=45854)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/train/backend.py", line 489, in end_training
(TrainTrainable pid=45854)     output = session.finish()
(TrainTrainable pid=45854)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/train/session.py", line 118, in finish
(TrainTrainable pid=45854)     func_output = self.training_thread.join()
(TrainTrainable pid=45854)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/train/utils.py", line 96, in join
(TrainTrainable pid=45854)     raise self.exc
(TrainTrainable pid=45854)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/train/utils.py", line 89, in run
(TrainTrainable pid=45854)     self.ret = self._target(*self._args, **self._kwargs)
(TrainTrainable pid=45854)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/train/utils.py", line 138, in <lambda>
(TrainTrainable pid=45854)     return lambda: train_func(config)
(TrainTrainable pid=45854)   File "test_torchtrainer.py", line 123, in train_func
(TrainTrainable pid=45854)     model = train.torch.prepare_model(model)
(TrainTrainable pid=45854)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/train/torch.py", line 614, in prepare_model
(TrainTrainable pid=45854)     return get_accelerator(TorchAccelerator).prepare_model(
(TrainTrainable pid=45854)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/train/torch.py", line 95, in prepare_model
(TrainTrainable pid=45854)     torch.cuda.set_device(device)
(TrainTrainable pid=45854)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/torch/cuda/__init__.py", line 314, in set_device
(TrainTrainable pid=45854)     torch._C._cuda_setDevice(device)
(TrainTrainable pid=45854) RuntimeError: CUDA error: invalid device ordinal
(TrainTrainable pid=45854) CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
(TrainTrainable pid=45854) For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
(BackendExecutor pid=46172) 2022-07-13 00:08:17,592     ERROR worker.py:94 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::BaseWorkerMixin._BaseWorkerMixin__execute() (pid=46212, ip=172.31.50.232, repr=<ray.train.worker_group.BaseWorkerMixin object at 0x7f322a5f0a30>)
(BackendExecutor pid=46172)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/train/worker_group.py", line 26, in __execute
(BackendExecutor pid=46172)     return func(*args, **kwargs)
(BackendExecutor pid=46172)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/train/backend.py", line 489, in end_training
(BackendExecutor pid=46172)     output = session.finish()
(BackendExecutor pid=46172)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/train/session.py", line 118, in finish
(BackendExecutor pid=46172)     func_output = self.training_thread.join()
(BackendExecutor pid=46172)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/train/utils.py", line 96, in join
(BackendExecutor pid=46172)     raise self.exc
(BackendExecutor pid=46172)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/train/utils.py", line 89, in run
(BackendExecutor pid=46172)     self.ret = self._target(*self._args, **self._kwargs)
(BackendExecutor pid=46172)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/train/utils.py", line 138, in <lambda>
(BackendExecutor pid=46172)     return lambda: train_func(config)
(BackendExecutor pid=46172)   File "test_torchtrainer.py", line 123, in train_func
(BackendExecutor pid=46172)     model = train.torch.prepare_model(model)
(BackendExecutor pid=46172)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/train/torch.py", line 614, in prepare_model
(BackendExecutor pid=46172)     return get_accelerator(TorchAccelerator).prepare_model(
(BackendExecutor pid=46172)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/train/torch.py", line 95, in prepare_model
(BackendExecutor pid=46172)     torch.cuda.set_device(device)
(BackendExecutor pid=46172)   File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/torch/cuda/__init__.py", line 314, in set_device
(BackendExecutor pid=46172)     torch._C._cuda_setDevice(device)
(BackendExecutor pid=46172) RuntimeError: CUDA error: invalid device ordinal
(BackendExecutor pid=46172) CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
(BackendExecutor pid=46172) For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2022-07-13 00:08:17,787 ERROR trial_runner.py:886 -- Trial tune_function_db143_00001: Error processing event.
NoneType: None
Result for tune_function_db143_00001:
  date: 2022-07-13_00-08-09
  experiment_id: 63118dfb6aa34836b05856dcf4f5fd62
  hostname: ip-172-31-50-232
  node_ip: 172.31.50.232
  pid: 45854
  timestamp: 1657670889
  trial_id: db143_00001
  
(BaseWorkerMixin pid=46104) loss: 1.247969  [ 4800/30000]
(BaseWorkerMixin pid=46105) loss: 1.523197  [ 4800/30000]
(BaseWorkerMixin pid=46104) loss: 1.359455  [ 6400/30000]
(BaseWorkerMixin pid=46105) loss: 0.810615  [ 6400/30000]
(BaseWorkerMixin pid=46104) loss: 1.087785  [ 8000/30000]
(BaseWorkerMixin pid=46105) loss: 1.161864  [ 8000/30000]
(BaseWorkerMixin pid=46104) loss: 0.790139  [ 9600/30000]
(BaseWorkerMixin pid=46105) loss: 1.048966  [ 9600/30000]
(BaseWorkerMixin pid=46104) loss: 1.116953  [11200/30000]
(BaseWorkerMixin pid=46105) loss: 1.120018  [11200/30000]
(BaseWorkerMixin pid=46104) loss: 0.971860  [12800/30000]
(BaseWorkerMixin pid=46105) loss: 1.299540  [12800/30000]
(BaseWorkerMixin pid=46104) loss: 0.782351  [14400/30000]
(BaseWorkerMixin pid=46105) loss: 1.811201  [14400/30000]
(BaseWorkerMixin pid=46104) loss: 1.287236  [16000/30000]
(BaseWorkerMixin pid=46105) loss: 1.014336  [16000/30000]
(BaseWorkerMixin pid=46104) loss: 1.066879  [17600/30000]
(BaseWorkerMixin pid=46105) loss: 0.691322  [17600/30000]
(BaseWorkerMixin pid=46104) loss: 0.700346  [19200/30000]
(BaseWorkerMixin pid=46105) loss: 1.203963  [19200/30000]
(BaseWorkerMixin pid=46104) loss: 1.238495  [20800/30000]
(BaseWorkerMixin pid=46105) loss: 0.996168  [20800/30000]
(BaseWorkerMixin pid=46104) loss: 1.130319  [22400/30000]
(BaseWorkerMixin pid=46105) loss: 0.856216  [22400/30000]
(BaseWorkerMixin pid=46104) loss: 0.502875  [24000/30000]
(BaseWorkerMixin pid=46105) loss: 1.015783  [24000/30000]
(BaseWorkerMixin pid=46104) loss: 1.158590  [25600/30000]
(BaseWorkerMixin pid=46105) loss: 0.809457  [25600/30000]
(BaseWorkerMixin pid=46104) loss: 0.808960  [27200/30000]
(BaseWorkerMixin pid=46105) loss: 1.194568  [27200/30000]
(BaseWorkerMixin pid=46104) loss: 1.312078  [28800/30000]
(BaseWorkerMixin pid=46105) loss: 1.062378  [28800/30000]
Result for tune_function_db143_00000:
  _time_this_iter_s: 9.980828285217285
  _timestamp: 1657670902
  _training_iteration: 1
  date: 2022-07-13_00-08-22
  done: false
  experiment_id: 196125398d614bc2959f9b9bd20629e8
  hostname: ip-172-31-50-232
  iterations_since_restore: 1
  loss: 0.9640937376731691
  node_ip: 172.31.50.232
  pid: 45780
  time_since_restore: 16.32908034324646
  time_this_iter_s: 16.32908034324646
  time_total_s: 16.32908034324646
  timestamp: 1657670902
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: db143_00000
  warmup_time: 0.0026230812072753906
  
== Status ==
Current time: 2022-07-13 00:08:22 (running for 00:00:22.00)
Memory usage on this node: 10.1/186.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/48 CPUs, 2.0/4 GPUs, 0.0/120.27 GiB heap, 0.0/55.53 GiB objects
Result logdir: /home/ubuntu/ray_results/tune_function_2022-07-13_00-07-59
Number of trials: 2/2 (1 ERROR, 1 RUNNING)
+---------------------------+----------+---------------------+--------------+-------------+--------+------------------+----------+--------------+---------------------+
| Trial name                | status   | loc                 |   batch_size |          lr |   iter |   total time (s) |     loss |   _timestamp |   _time_this_iter_s |
|---------------------------+----------+---------------------+--------------+-------------+--------+------------------+----------+--------------+---------------------|
| tune_function_db143_00000 | RUNNING  | 172.31.50.232:45780 |           32 | 0.0220007   |      1 |          16.3291 | 0.964094 |   1657670902 |             9.98083 |
| tune_function_db143_00001 | ERROR    | 172.31.50.232:45854 |          128 | 0.000709082 |        |                  |          |              |                     |
+---------------------------+----------+---------------------+--------------+-------------+--------+------------------+----------+--------------+---------------------+
Number of errored trials: 1
+---------------------------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                |   # failures | error file                                                                                                                                    |
|---------------------------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------|
| tune_function_db143_00001 |            1 | /home/ubuntu/ray_results/tune_function_2022-07-13_00-07-59/tune_function_db143_00001_1_batch_size=128,lr=0.0007_2022-07-13_00-08-06/error.txt |
+---------------------------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------+

(BaseWorkerMixin pid=46104) Test Error: 
(BaseWorkerMixin pid=46104)  Accuracy: 68.8%, Avg loss: 0.964094 
(BaseWorkerMixin pid=46104) 
(BaseWorkerMixin pid=46104) loss: 1.379129  [    0/30000]
(BaseWorkerMixin pid=46105) Test Error: 
(BaseWorkerMixin pid=46105)  Accuracy: 66.7%, Avg loss: 1.018786 
(BaseWorkerMixin pid=46105) 
(BaseWorkerMixin pid=46105) loss: 0.939172  [    0/30000]
(BaseWorkerMixin pid=46104) loss: 1.039721  [ 1600/30000]
(BaseWorkerMixin pid=46105) loss: 0.752197  [ 1600/30000]
(BaseWorkerMixin pid=46104) loss: 0.789385  [ 3200/30000]
(BaseWorkerMixin pid=46105) loss: 1.059903  [ 3200/30000]
(BaseWorkerMixin pid=46104) loss: 1.018644  [ 4800/30000]
(BaseWorkerMixin pid=46105) loss: 0.833854  [ 4800/30000]
(BaseWorkerMixin pid=46104) loss: 0.997580  [ 6400/30000]
(BaseWorkerMixin pid=46105) loss: 0.518008  [ 6400/30000]
(BaseWorkerMixin pid=46104) loss: 0.607216  [ 8000/30000]
(BaseWorkerMixin pid=46105) loss: 0.905233  [ 8000/30000]
(BaseWorkerMixin pid=46104) loss: 0.634860  [ 9600/30000]
(BaseWorkerMixin pid=46105) loss: 0.916595  [ 9600/30000]
(BaseWorkerMixin pid=46104) loss: 0.960436  [11200/30000]
(BaseWorkerMixin pid=46105) loss: 0.832843  [11200/30000]
(BaseWorkerMixin pid=46104) loss: 0.846607  [12800/30000]
(BaseWorkerMixin pid=46105) loss: 1.197762  [12800/30000]
(BaseWorkerMixin pid=46104) loss: 0.504696  [14400/30000]
(BaseWorkerMixin pid=46105) loss: 1.417728  [14400/30000]
(BaseWorkerMixin pid=46104) loss: 1.148403  [16000/30000]
(BaseWorkerMixin pid=46105) loss: 0.934103  [16000/30000]
(BaseWorkerMixin pid=46104) loss: 1.257948  [17600/30000]
(BaseWorkerMixin pid=46105) loss: 0.548713  [17600/30000]
(BaseWorkerMixin pid=46104) loss: 0.670694  [19200/30000]
(BaseWorkerMixin pid=46105) loss: 1.103186  [19200/30000]
(BaseWorkerMixin pid=46104) loss: 1.132061  [20800/30000]
(BaseWorkerMixin pid=46105) loss: 0.891098  [20800/30000]
(BaseWorkerMixin pid=46104) loss: 1.075081  [22400/30000]
(BaseWorkerMixin pid=46105) loss: 0.824542  [22400/30000]
(BaseWorkerMixin pid=46104) loss: 0.436869  [24000/30000]
(BaseWorkerMixin pid=46105) loss: 1.003291  [24000/30000]
(BaseWorkerMixin pid=46104) loss: 1.110539  [25600/30000]
(BaseWorkerMixin pid=46105) loss: 0.737462  [25600/30000]
(BaseWorkerMixin pid=46104) loss: 0.685153  [27200/30000]
(BaseWorkerMixin pid=46105) loss: 1.106474  [27200/30000]
(BaseWorkerMixin pid=46104) loss: 1.149808  [28800/30000]
(BaseWorkerMixin pid=46105) loss: 1.022572  [28800/30000]
== Status ==
Current time: 2022-07-13 00:08:27 (running for 00:00:27.00)
Memory usage on this node: 10.1/186.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/48 CPUs, 2.0/4 GPUs, 0.0/120.27 GiB heap, 0.0/55.53 GiB objects
Result logdir: /home/ubuntu/ray_results/tune_function_2022-07-13_00-07-59
Number of trials: 2/2 (1 ERROR, 1 RUNNING)
+---------------------------+----------+---------------------+--------------+-------------+--------+------------------+----------+--------------+---------------------+
| Trial name                | status   | loc                 |   batch_size |          lr |   iter |   total time (s) |     loss |   _timestamp |   _time_this_iter_s |
|---------------------------+----------+---------------------+--------------+-------------+--------+------------------+----------+--------------+---------------------|
| tune_function_db143_00000 | RUNNING  | 172.31.50.232:45780 |           32 | 0.0220007   |      1 |          16.3291 | 0.964094 |   1657670902 |             9.98083 |
| tune_function_db143_00001 | ERROR    | 172.31.50.232:45854 |          128 | 0.000709082 |        |                  |          |              |                     |
+---------------------------+----------+---------------------+--------------+-------------+--------+------------------+----------+--------------+---------------------+
Number of errored trials: 1
+---------------------------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                |   # failures | error file                                                                                                                                    |
|---------------------------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------|
| tune_function_db143_00001 |            1 | /home/ubuntu/ray_results/tune_function_2022-07-13_00-07-59/tune_function_db143_00001_1_batch_size=128,lr=0.0007_2022-07-13_00-08-06/error.txt |
+---------------------------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------+

Result for tune_function_db143_00000:
  _time_this_iter_s: 5.761234998703003
  _timestamp: 1657670908
  _training_iteration: 2
  date: 2022-07-13_00-08-28
  done: false
  experiment_id: 196125398d614bc2959f9b9bd20629e8
  hostname: ip-172-31-50-232
  iterations_since_restore: 2
  loss: 0.8621342661299597
  node_ip: 172.31.50.232
  pid: 45780
  time_since_restore: 22.082415342330933
  time_this_iter_s: 5.753334999084473
  time_total_s: 22.082415342330933
  timestamp: 1657670908
  timesteps_since_restore: 0
  training_iteration: 2
  trial_id: db143_00000
  warmup_time: 0.0026230812072753906
  
(BaseWorkerMixin pid=46104) Test Error: 
(BaseWorkerMixin pid=46104)  Accuracy: 73.5%, Avg loss: 0.862134 
(BaseWorkerMixin pid=46104) 
(BaseWorkerMixin pid=46105) Test Error: 
(BaseWorkerMixin pid=46105)  Accuracy: 71.7%, Avg loss: 0.915298 
(BaseWorkerMixin pid=46105) 
Result for tune_function_db143_00000:
  _time_this_iter_s: 5.761234998703003
  _timestamp: 1657670908
  _training_iteration: 2
  date: 2022-07-13_00-08-28
  done: true
  experiment_id: 196125398d614bc2959f9b9bd20629e8
  experiment_tag: 0_batch_size=32,lr=0.0220
  hostname: ip-172-31-50-232
  iterations_since_restore: 2
  loss: 0.8621342661299597
  node_ip: 172.31.50.232
  pid: 45780
  time_since_restore: 22.082415342330933
  time_this_iter_s: 5.753334999084473
  time_total_s: 22.082415342330933
  timestamp: 1657670908
  timesteps_since_restore: 0
  training_iteration: 2
  trial_id: db143_00000
  warmup_time: 0.0026230812072753906
  
== Status ==
Current time: 2022-07-13 00:08:28 (running for 00:00:28.13)
Memory usage on this node: 9.9/186.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/48 CPUs, 0/4 GPUs, 0.0/120.27 GiB heap, 0.0/55.53 GiB objects
Result logdir: /home/ubuntu/ray_results/tune_function_2022-07-13_00-07-59
Number of trials: 2/2 (1 ERROR, 1 TERMINATED)
+---------------------------+------------+---------------------+--------------+-------------+--------+------------------+----------+--------------+---------------------+
| Trial name                | status     | loc                 |   batch_size |          lr |   iter |   total time (s) |     loss |   _timestamp |   _time_this_iter_s |
|---------------------------+------------+---------------------+--------------+-------------+--------+------------------+----------+--------------+---------------------|
| tune_function_db143_00000 | TERMINATED | 172.31.50.232:45780 |           32 | 0.0220007   |      2 |          22.0824 | 0.862134 |   1657670908 |             5.76123 |
| tune_function_db143_00001 | ERROR      | 172.31.50.232:45854 |          128 | 0.000709082 |        |                  |          |              |                     |
+---------------------------+------------+---------------------+--------------+-------------+--------+------------------+----------+--------------+---------------------+
Number of errored trials: 1
+---------------------------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                |   # failures | error file                                                                                                                                    |
|---------------------------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------|
| tune_function_db143_00001 |            1 | /home/ubuntu/ray_results/tune_function_2022-07-13_00-07-59/tune_function_db143_00001_1_batch_size=128,lr=0.0007_2022-07-13_00-08-06/error.txt |
+---------------------------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------+

Traceback (most recent call last):
  File "test_torchtrainer.py", line 209, in <module>
    test_tune_torch_fashion_mnist()
  File "test_torchtrainer.py", line 204, in test_tune_torch_fashion_mnist
    torch_fashion_mnist(num_workers=2, use_gpu=True, num_samples=2)
  File "test_torchtrainer.py", line 189, in torch_fashion_mnist
    analysis = tune.run(
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/tune/tune.py", line 741, in run
    raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [tune_function_db143_00001])
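
My reading of what goes wrong (an interpretation on my side, the logs do not state this directly): each trial's worker group is restricted to its own pair of GPUs via CUDA_VISIBLE_DEVICES, e.g. "2,3" for the second trial, but prepare_model in Ray 1.13 appears to hand the absolute GPU ID to torch.cuda.set_device. For the trial that happened to get GPUs 0 and 1 the absolute IDs coincide with the visible ordinals, so it trains fine; for the trial on GPUs 2 and 3 only two devices (ordinals 0 and 1) are visible inside the process, so set_device(2) / set_device(3) raises "invalid device ordinal". A minimal sketch of the mismatch, with the assigned GPU ID assumed for illustration:

# Illustration only; the values mirror what the second trial's workers appear to see.
import os
import torch

os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"  # what Ray sets for the trial holding GPUs 2 and 3 (assumed)
assigned_gpu_id = 2                         # absolute ID of the GPU assigned to this worker (assumed)

print(torch.cuda.device_count())            # 2: only ordinals 0 and 1 exist in this process

# Passing the absolute ID fails, which matches the traceback above:
#   torch.cuda.set_device(assigned_gpu_id)  # RuntimeError: CUDA error: invalid device ordinal

# Translating the absolute ID to its position among the visible devices works:
visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
torch.cuda.set_device(visible.index(str(assigned_gpu_id)))  # ordinal 0
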

Versions / Dependencies

ray 1.13

Reproduction script

import argparse
import os
from typing import Dict

import pytest
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

import ray
import ray.train as train
from ray import tune, cloudpickle
from ray.tune import TuneError
from ray.train.trainer import Trainer
from ray.train.backend import Backend, BackendConfig
from ray.train.callbacks import JsonLoggerCallback
from ray.train.constants import TUNE_CHECKPOINT_FILE_NAME
from ray.train.worker_group import WorkerGroup

# Download training data from open datasets.
training_data = datasets.FashionMNIST(
    root="~/data",
    train=True,
    download=True,
    transform=ToTensor(),
)

# Download test data from open datasets.
test_data = datasets.FashionMNIST(
    root="~/data",
    train=False,
    download=True,
    transform=ToTensor(),
)


# Define model
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28 * 28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
            nn.ReLU(),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits


def train_epoch(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset) // train.world_size()
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")


def validate_epoch(dataloader, model, loss_fn):
    size = len(dataloader.dataset) // train.world_size()
    num_batches = len(dataloader)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= size
    print(
        f"Test Error: \n "
        f"Accuracy: {(100 * correct):>0.1f}%, "
        f"Avg loss: {test_loss:>8f} \n"
    )
    return test_loss


def train_func(config: Dict):
    batch_size = config["batch_size"]
    lr = config["lr"]
    epochs = config["epochs"]

    worker_batch_size = batch_size // train.world_size()

    # Create data loaders.
    train_dataloader = DataLoader(training_data, batch_size=worker_batch_size)
    test_dataloader = DataLoader(test_data, batch_size=worker_batch_size)

    train_dataloader = train.torch.prepare_data_loader(train_dataloader)
    test_dataloader = train.torch.prepare_data_loader(test_dataloader)

    # Create model.
    model = NeuralNetwork()
    model = train.torch.prepare_model(model)

    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    loss_results = []

    for _ in range(epochs):
        train_epoch(train_dataloader, model, loss_fn, optimizer)
        loss = validate_epoch(test_dataloader, model, loss_fn)
        train.report(loss=loss)
        loss_results.append(loss)

    return loss_results


def train_fashion_mnist(num_workers=2, use_gpu=False):
    trainer = Trainer(backend="torch", num_workers=num_workers, use_gpu=use_gpu)
    trainer.start()
    result = trainer.run(
        train_func=train_func,
        config={"lr": 1e-3, "batch_size": 64, "epochs": 4},
        callbacks=[JsonLoggerCallback()],
    )
    trainer.shutdown()
    print(f"Loss results: {result}")




@pytest.fixture
def ray_start_2_cpus():
    address_info = ray.init(num_cpus=2)
    yield address_info
    # The code after the yield will run as teardown code.
    ray.shutdown()


@pytest.fixture
def ray_start_8_cpus():
    address_info = ray.init(num_cpus=8)
    yield address_info
    # The code after the yield will run as teardown code.
    ray.shutdown()


class TestConfig(BackendConfig):
    @property
    def backend_cls(self):
        return TestBackend


class TestBackend(Backend):
    def on_start(self, worker_group: WorkerGroup, backend_config: TestConfig):
        pass

    def on_shutdown(self, worker_group: WorkerGroup, backend_config: TestConfig):
        pass


def torch_fashion_mnist(num_workers, use_gpu, num_samples):
    epochs = 2

    trainer = Trainer("torch", num_workers=num_workers, use_gpu=use_gpu)
    MnistTrainable = trainer.to_tune_trainable(train_func)

    analysis = tune.run(
        MnistTrainable,
        num_samples=num_samples,
        config={
            "lr": tune.loguniform(1e-4, 1e-1),
            "batch_size": tune.choice([32, 64, 128]),
            "epochs": epochs,
        },
    )

    # Check that loss decreases in each trial.
    for path, df in analysis.trial_dataframes.items():
        assert df.loc[1, "loss"] < df.loc[0, "loss"]

def test_tune_torch_fashion_mnist():
    torch_fashion_mnist(num_workers=2, use_gpu=True, num_samples=2)


if __name__ == '__main__':
    ray.init()
    test_tune_torch_fashion_mnist()
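
If the reading above is correct, a possible stop-gap inside train_func (an untested sketch on my side, not a Ray API and not the proper fix) is to bypass prepare_model's device selection, place the model by the worker's local rank, which always indexes into the devices visible to the process, and wrap it in DDP manually:

# Hypothetical workaround sketch (untested); the helper name is mine, not part of Ray.
from torch.nn.parallel import DistributedDataParallel

def prepare_model_by_local_rank(model):
    local_ordinal = train.local_rank() % torch.cuda.device_count()
    device = torch.device(f"cuda:{local_ordinal}")
    torch.cuda.set_device(device)
    model = model.to(device)
    return DistributedDataParallel(model, device_ids=[local_ordinal], output_device=local_ordinal)
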

Issue Severity

High: It blocks me from completing my task.

@JiahaoYao added the bug and triage labels on Jul 13, 2022
@JiahaoYao (Contributor, Author)

This is fixed in PR #26493.
