
[Air|Tune] enable_reproducibility is not compatible with prepare_data_loader #30247

Closed
Qinghao-Hu opened this issue Nov 14, 2022 · 4 comments · Fixed by #30266
Assignees: Yard1
Labels: bug (Something that is supposed to be working; but isn't), P2 (Important issue, but not time-critical), tune (Tune-related issues)

Comments

@Qinghao-Hu
Contributor

What happened + What you expected to happen

If I enable ResourceChangingScheduler and add the single line enable_reproducibility(seed=config["seed"]) to train_func (e.g., in the Tune version of the AIR example), an error is raised that makes every trial pause or fail:

2022-11-14 16:55:46,257 ERROR worker.py:400 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::_Inner.train() (pid=3914833, ip=10.100.77.179, repr=TorchTrainer)
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 355, in train
    raise skipped from exception_cause(skipped)
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(TypeError): ray::RayTrainWorker._RayTrainWorker__execute() (pid=3916273, ip=10.100.77.179, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f60f79f8f10>)
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/ray/train/_internal/worker_group.py", line 31, in __execute
    raise skipped from exception_cause(skipped)
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/home/qhhu/workdir/HPO/hydro/workloads/ray_func_tuner_forSH.py", line 125, in train_func
    train_epoch(train_loader, model, criterion, optimizer, fusion_num)
  File "/home/qhhu/workdir/HPO/hydro/workloads/ray_func_tuner_forSH.py", line 40, in train_epoch
    for batch, (X, y) in enumerate(dataloader):
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/ray/train/torch/train_loop_utils.py", line 641, in __iter__
    self._prefetch_next_batch()
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/ray/train/torch/train_loop_utils.py", line 636, in _prefetch_next_batch
    next_batch = next(self.dataloader_iter, None)
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 244, in _worker_loop
    init_fn(worker_id)
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/ray/train/torch/train_loop_utils.py", line 433, in wrapper
    worker_init_fn(worker_id)
TypeError: 'NoneType' object is not callable

Versions / Dependencies

Ray 2.1

Reproduction script

from ray.tune.schedulers.resource_changing_scheduler import (
    DistributeResources,
    ResourceChangingScheduler,
)
import ray.train.torch as ht

def train_func(config):
    ht.accelerate(amp=config["amp"])  # For AMP support
    ht.enable_reproducibility(seed=config["seed"])
    ...


tune_scheduler = ResourceChangingScheduler(
    base_scheduler=tune_scheduler,  # previously created base scheduler (FIFO in this run)
    resources_allocation_function=DistributeResources(add_bundles=True),  # default
)
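
For completeness, a simplified sketch of the data-loading part of train_func (the real script trains resnet18 on cifar10 inside a TorchTrainer; the dataset, sizes, and epoch loop here are placeholders):

import torch
from torch.utils.data import DataLoader, TensorDataset
import ray.train.torch as ht

def train_func(config):
    ht.accelerate(amp=config["amp"])
    ht.enable_reproducibility(seed=config["seed"])

    # Placeholder data; the real script uses cifar10 with a resnet18 model.
    dataset = TensorDataset(torch.randn(1024, 8), torch.randint(0, 2, (1024,)))
    loader = DataLoader(dataset, batch_size=config["batch_size"], shuffle=True, num_workers=2)
    # prepare_data_loader wraps the loader (including its worker_init_fn) for distributed training.
    loader = ht.prepare_data_loader(loader)

    for X, y in loader:  # the TypeError above is raised here
        ...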

Issue Severity

No response

Qinghao-Hu added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage (eg: priority, bug/not-bug, and owning component)) labels on Nov 14, 2022
@Qinghao-Hu
Contributor Author

Besides, there is another error (not related to enable_reproducibility): if I set a large number of trials (e.g., 200), two trials are paused after one epoch and no trial continues:

(ResourceChangingScheduler) Using FIFO scheduling algorithm.
Resources requested: 27.0/64 CPUs, 3.0/4 GPUs, 0.0/60.17 GiB heap, 0.0/29.78 GiB objects (0.0/1.0 accelerator_type:G)
Current best trial: 4e458_00003 with val_acc=0.5092 and parameters={'train_loop_config': {'lr': 0.0551, 'momentum': 0.602, 'batch_size': 512, 'gamma': 0.21, 'model': 'resnet18', 'dataset': 'cifar10', 'seed': 1, 'amp': False}}
Number of trials: 75/200 (2 PAUSED, 70 PENDING, 3 RUNNING)
+--------------------------+----------+----------------------+------------------------+------------------------+------------------------+------------------------+--------+------------------+---------+-----------+--------------+
| Trial name               | status   | loc                  |   train_loop_config/ba |   train_loop_config/ga |   train_loop_config/lr |   train_loop_config/mo |   iter |   total time (s) |    loss |   val_acc |   _timestamp |
|                          |          |                      |               tch_size |                    mma |                        |                 mentum |        |                  |         |           |              |
|--------------------------+----------+----------------------+------------------------+------------------------+------------------------+------------------------+--------+------------------+---------+-----------+--------------|
| TorchTrainer_4e458_00001 | RUNNING  | 10.100.73.27:4068812 |                    128 |                   0.36 |                 0.0004 |                  0.546 |        |                  |         |           |              |
| TorchTrainer_4e458_00002 | RUNNING  | 10.100.73.27:4068814 |                    128 |                   0.38 |                 0.0478 |                  0.967 |        |                  |         |           |              |
| TorchTrainer_4e458_00004 | RUNNING  | 10.100.73.27:4072302 |                    256 |                   0.39 |                 0.0137 |                  0.956 |        |                  |         |           |              |
| TorchTrainer_4e458_00000 | PAUSED   | 10.100.73.27:4068581 |                    256 |                   0.28 |                 0.0047 |                  0.859 |      1 |          12.605  | 1.37914 |    0.5059 |   1668417525 |
| TorchTrainer_4e458_00003 | PAUSED   | 10.100.73.27:4068816 |                    512 |                   0.21 |                 0.0551 |                  0.602 |      1 |          15.3606 | 1.41452 |    0.5092 |   1668417530 |
| TorchTrainer_4e458_00005 | PENDING  |                      |                    256 |                   0.87 |                 0.5708 |                  0.888 |        |                  |         |           |              |
| TorchTrainer_4e458_00006 | PENDING  |                      |                    256 |                   0.78 |                 0.0018 |                  0.845 |        |                  |         |           |              |
| TorchTrainer_4e458_00007 | PENDING  |                      |                    512 |                   0.79 |                 0.2073 |                  0.914 |        |                  |         |           |              |
| TorchTrainer_4e458_00008 | PENDING  |                      |                    256 |                   0.38 |                 0.0002 |                  0.71  |        |                  |         |           |              |
| TorchTrainer_4e458_00009 | PENDING  |                      |                    128 |                   0.71 |                 0.0006 |                  0.645 |        |                  |         |           |              |

An error is raised after a while.

2022-11-14 17:21:03,370 ERROR tune.py:773 -- Trials did not complete: [TorchTrainer_4e458_00000, TorchTrainer_4e458_00001, TorchTrainer_4e458_00002, TorchTrainer_4e458_00003, TorchTrainer_4e458_00004, TorchTrainer_4e458_00005, TorchTrainer_4e458_00006, TorchTrainer_4e458_00007, TorchTrainer_4e458_00008, TorchTrainer_4e458_00009, TorchTrainer_4e458_00010, TorchTrainer_4e458_00011, TorchTrainer_4e458_00012, TorchTrainer_4e458_00013, TorchTrainer_4e458_00014, TorchTrainer_4e458_00015, TorchTrainer_4e458_00016, TorchTrainer_4e458_00017, TorchTrainer_4e458_00018, TorchTrainer_4e458_00019, TorchTrainer_4e458_00020, TorchTrainer_4e458_00021, TorchTrainer_4e458_00022, TorchTrainer_4e458_00023, TorchTrainer_4e458_00024, TorchTrainer_4e458_00025, TorchTrainer_4e458_00026, TorchTrainer_4e458_00027, TorchTrainer_4e458_00028, TorchTrainer_4e458_00029, TorchTrainer_4e458_00030, TorchTrainer_4e458_00031, TorchTrainer_4e458_00032, TorchTrainer_4e458_00033, TorchTrainer_4e458_00034, TorchTrainer_4e458_00035, TorchTrainer_4e458_00036, TorchTrainer_4e458_00037, TorchTrainer_4e458_00038, TorchTrainer_4e458_00039, TorchTrainer_4e458_00040, TorchTrainer_4e458_00041, TorchTrainer_4e458_00042, TorchTrainer_4e458_00043, TorchTrainer_4e458_00044, TorchTrainer_4e458_00045, TorchTrainer_4e458_00046, TorchTrainer_4e458_00047, TorchTrainer_4e458_00048, TorchTrainer_4e458_00049, TorchTrainer_4e458_00050, TorchTrainer_4e458_00051, TorchTrainer_4e458_00052, TorchTrainer_4e458_00053, TorchTrainer_4e458_00054, TorchTrainer_4e458_00055, TorchTrainer_4e458_00056, TorchTrainer_4e458_00057, TorchTrainer_4e458_00058, TorchTrainer_4e458_00059, TorchTrainer_4e458_00060, TorchTrainer_4e458_00061, TorchTrainer_4e458_00062, TorchTrainer_4e458_00063, TorchTrainer_4e458_00064, TorchTrainer_4e458_00065, TorchTrainer_4e458_00066, TorchTrainer_4e458_00067, TorchTrainer_4e458_00068, TorchTrainer_4e458_00069, TorchTrainer_4e458_00070, TorchTrainer_4e458_00071, TorchTrainer_4e458_00072, TorchTrainer_4e458_00073, TorchTrainer_4e458_00074]
2022-11-14 17:21:03,371 INFO tune.py:777 -- Total run time: 152.43 seconds (152.00 seconds for the tuning loop).
2022-11-14 17:21:03,371 WARNING tune.py:783 -- Experiment has been interrupted, but the most recent state was saved. You can continue running this experiment by passing `resume=True` to `tune.run()`
Result(metrics={'loss': 1.4145198225975038, 'val_acc': 0.5092, '_timestamp': 1668417530, '_time_this_iter_s': 12.510530948638916, '_training_iteration': 1, 'should_checkpoint': True, 'done': False, 'trial_id': '4e458_00003', 'experiment_tag': '3_batch_size=512,gamma=0.2100,lr=0.0551,momentum=0.6020'}, error=None, log_dir=PosixPath('/home/qhhu/workdir/HPO/hydro/ray_results/resnet18_cifar10_s200_e100_fifo_seed1_ela/TorchTrainer_4e458_00003_3_batch_size=512,gamma=0.2100,lr=0.0551,momentum=0.6020_2022-11-14_17-18-33'))
2022-11-14 17:21:03,527 WARNING experiment_analysis.py:542 -- Couldn't read config from 70 paths

xwjiang2010 added the tune (Tune-related issues) and air labels on Nov 14, 2022
@xwjiang2010
Contributor

cc @Yard1

hora-anyscale added the P2 (Important issue, but not time-critical) label and removed the triage (Needs triage (eg: priority, bug/not-bug, and owning component)) label on Nov 14, 2022
Yard1 self-assigned this on Nov 14, 2022
@Yard1
Member

Yard1 commented Nov 14, 2022

Hey @Tonyhao96, thanks for reporting this. Is the second issue reproducible using the same example as for the first issue?

It looks like the first problem isn't caused by ResourceChangingScheduler itself; the scheduler merely creates the circumstances for the error to be raised (pausing and unpausing a trial). Seems to be a simple oversight!
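
Roughly, the traceback suggests that the wrapper installed around the DataLoader's worker_init_fn (by prepare_data_loader when enable_reproducibility is active) does not handle the case where no worker_init_fn was set. A simplified sketch of the suspected failure mode, not the actual code in train_loop_utils.py:

import torch

def seeded_worker_init(worker_init_fn, seed):
    # enable_reproducibility makes prepare_data_loader wrap the original
    # worker_init_fn so each DataLoader worker process is seeded.
    def wrapper(worker_id):
        torch.manual_seed(seed + worker_id)
        # If the DataLoader was created without a worker_init_fn, this is None,
        # and calling it raises "TypeError: 'NoneType' object is not callable".
        worker_init_fn(worker_id)
    return wrapper

# Guarding the call, e.g. `if worker_init_fn is not None: worker_init_fn(worker_id)`,
# would avoid the crash.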

Will look into the second one.

@Yard1
Member

Yard1 commented Nov 14, 2022

I will split the second one into its own issue!

Yard1 changed the title from "[Air|Tune] enable_reproducibility is not compatible with ResourceChangingScheduler" to "[Air|Tune] enable_reproducibility is not compatible with prepare_data_loader" on Nov 14, 2022