
[Air|Tune] enable_reproducibility is not compatible with prepare_data_loader #30247

Closed
Qinghao-Hu opened this issue Nov 14, 2022 · 4 comments · Fixed by #30266
Assignees: Yard1
Labels: bug (Something that is supposed to be working; but isn't), P2 (Important issue, but not time-critical), tune (Tune-related issues)

Comments

@Qinghao-Hu
Contributor

What happened + What you expected to happen

If I enable ResourceChangingScheduler and add the single line enable_reproducibility(seed=config["seed"]) to train_func (e.g., in the Tune version of the AIR example), an error is raised that makes every trial pause or fail:

2022-11-14 16:55:46,257 ERROR worker.py:400 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::_Inner.train() (pid=3914833, ip=10.100.77.179, repr=TorchTrainer)
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 355, in train
    raise skipped from exception_cause(skipped)
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(TypeError): ray::RayTrainWorker._RayTrainWorker__execute() (pid=3916273, ip=10.100.77.179, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f60f79f8f10>)
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/ray/train/_internal/worker_group.py", line 31, in __execute
    raise skipped from exception_cause(skipped)
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/home/qhhu/workdir/HPO/hydro/workloads/ray_func_tuner_forSH.py", line 125, in train_func
    train_epoch(train_loader, model, criterion, optimizer, fusion_num)
  File "/home/qhhu/workdir/HPO/hydro/workloads/ray_func_tuner_forSH.py", line 40, in train_epoch
    for batch, (X, y) in enumerate(dataloader):
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/ray/train/torch/train_loop_utils.py", line 641, in __iter__
    self._prefetch_next_batch()
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/ray/train/torch/train_loop_utils.py", line 636, in _prefetch_next_batch
    next_batch = next(self.dataloader_iter, None)
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 244, in _worker_loop
    init_fn(worker_id)
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/ray/train/torch/train_loop_utils.py", line 433, in wrapper
    worker_init_fn(worker_id)
TypeError: 'NoneType' object is not callable

Versions / Dependencies

Ray 2.1

Reproduction script

from ray.tune.schedulers.resource_changing_scheduler import (
    DistributeResources,
    ResourceChangingScheduler,
)
import ray.train.torch as ht

def train_func(config):
    ht.accelerate(amp=config["amp"])  # For AMP support
    ht.enable_reproducibility(seed=config["seed"])
    ...


tune_scheduler = ResourceChangingScheduler(
    base_scheduler=tune_scheduler,  # previously created base scheduler (FIFO in this run)
    resources_allocation_function=DistributeResources(add_bundles=True),  # default
)
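
For completeness, a simplified sketch of the data-loading part of train_func (the real script trains resnet18 on cifar10 inside a TorchTrainer; the dataset, sizes, and epoch loop here are placeholders):

import torch
from torch.utils.data import DataLoader, TensorDataset
import ray.train.torch as ht

def train_func(config):
    ht.accelerate(amp=config["amp"])
    ht.enable_reproducibility(seed=config["seed"])

    # Placeholder data; the real script uses cifar10 with a resnet18 model.
    dataset = TensorDataset(torch.randn(1024, 8), torch.randint(0, 2, (1024,)))
    loader = DataLoader(dataset, batch_size=config["batch_size"], shuffle=True, num_workers=2)
    # prepare_data_loader wraps the loader (including its worker_init_fn) for distributed training.
    loader = ht.prepare_data_loader(loader)

    for X, y in loader:  # the TypeError above is raised here
        ...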

Issue Severity

No response

Qinghao-Hu added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage (eg: priority, bug/not-bug, and owning component)) labels on Nov 14, 2022
@Qinghao-Hu
Contributor Author

Besides, there is another error (not related to enable_reproducibility): if I set a large number of trials (e.g., 200), two trials are paused after one epoch and no trial continues:

(ResourceChangingScheduler) Using FIFO scheduling algorithm.
Resources requested: 27.0/64 CPUs, 3.0/4 GPUs, 0.0/60.17 GiB heap, 0.0/29.78 GiB objects (0.0/1.0 accelerator_type:G)
Current best trial: 4e458_00003 with val_acc=0.5092 and parameters={'train_loop_config': {'lr': 0.0551, 'momentum': 0.602, 'batch_size': 512, 'gamma': 0.21, 'model': 'resnet18', 'dataset': 'cifar10', 'seed': 1, 'amp': False}}
Number of trials: 75/200 (2 PAUSED, 70 PENDING, 3 RUNNING)
+--------------------------+----------+----------------------+------------------------+------------------------+------------------------+------------------------+--------+------------------+---------+-----------+--------------+
| Trial name               | status   | loc                  |   train_loop_config/ba |   train_loop_config/ga |   train_loop_config/lr |   train_loop_config/mo |   iter |   total time (s) |    loss |   val_acc |   _timestamp |
|                          |          |                      |               tch_size |                    mma |                        |                 mentum |        |                  |         |           |              |
|--------------------------+----------+----------------------+------------------------+------------------------+------------------------+------------------------+--------+------------------+---------+-----------+--------------|
| TorchTrainer_4e458_00001 | RUNNING  | 10.100.73.27:4068812 |                    128 |                   0.36 |                 0.0004 |                  0.546 |        |                  |         |           |              |
| TorchTrainer_4e458_00002 | RUNNING  | 10.100.73.27:4068814 |                    128 |                   0.38 |                 0.0478 |                  0.967 |        |                  |         |           |              |
| TorchTrainer_4e458_00004 | RUNNING  | 10.100.73.27:4072302 |                    256 |                   0.39 |                 0.0137 |                  0.956 |        |                  |         |           |              |
| TorchTrainer_4e458_00000 | PAUSED   | 10.100.73.27:4068581 |                    256 |                   0.28 |                 0.0047 |                  0.859 |      1 |          12.605  | 1.37914 |    0.5059 |   1668417525 |
| TorchTrainer_4e458_00003 | PAUSED   | 10.100.73.27:4068816 |                    512 |                   0.21 |                 0.0551 |                  0.602 |      1 |          15.3606 | 1.41452 |    0.5092 |   1668417530 |
| TorchTrainer_4e458_00005 | PENDING  |                      |                    256 |                   0.87 |                 0.5708 |                  0.888 |        |                  |         |           |              |
| TorchTrainer_4e458_00006 | PENDING  |                      |                    256 |                   0.78 |                 0.0018 |                  0.845 |        |                  |         |           |              |
| TorchTrainer_4e458_00007 | PENDING  |                      |                    512 |                   0.79 |                 0.2073 |                  0.914 |        |                  |         |           |              |
| TorchTrainer_4e458_00008 | PENDING  |                      |                    256 |                   0.38 |                 0.0002 |                  0.71  |        |                  |         |           |              |
| TorchTrainer_4e458_00009 | PENDING  |                      |                    128 |                   0.71 |                 0.0006 |                  0.645 |        |                  |         |           |              |

An error is raised after a while.

2022-11-14 17:21:03,370 ERROR tune.py:773 -- Trials did not complete: [TorchTrainer_4e458_00000, TorchTrainer_4e458_00001, TorchTrainer_4e458_00002, TorchTrainer_4e458_00003, TorchTrainer_4e458_00004, TorchTrainer_4e458_00005, TorchTrainer_4e458_00006, TorchTrainer_4e458_00007, TorchTrainer_4e458_00008, TorchTrainer_4e458_00009, TorchTrainer_4e458_00010, TorchTrainer_4e458_00011, TorchTrainer_4e458_00012, TorchTrainer_4e458_00013, TorchTrainer_4e458_00014, TorchTrainer_4e458_00015, TorchTrainer_4e458_00016, TorchTrainer_4e458_00017, TorchTrainer_4e458_00018, TorchTrainer_4e458_00019, TorchTrainer_4e458_00020, TorchTrainer_4e458_00021, TorchTrainer_4e458_00022, TorchTrainer_4e458_00023, TorchTrainer_4e458_00024, TorchTrainer_4e458_00025, TorchTrainer_4e458_00026, TorchTrainer_4e458_00027, TorchTrainer_4e458_00028, TorchTrainer_4e458_00029, TorchTrainer_4e458_00030, TorchTrainer_4e458_00031, TorchTrainer_4e458_00032, TorchTrainer_4e458_00033, TorchTrainer_4e458_00034, TorchTrainer_4e458_00035, TorchTrainer_4e458_00036, TorchTrainer_4e458_00037, TorchTrainer_4e458_00038, TorchTrainer_4e458_00039, TorchTrainer_4e458_00040, TorchTrainer_4e458_00041, TorchTrainer_4e458_00042, TorchTrainer_4e458_00043, TorchTrainer_4e458_00044, TorchTrainer_4e458_00045, TorchTrainer_4e458_00046, TorchTrainer_4e458_00047, TorchTrainer_4e458_00048, TorchTrainer_4e458_00049, TorchTrainer_4e458_00050, TorchTrainer_4e458_00051, TorchTrainer_4e458_00052, TorchTrainer_4e458_00053, TorchTrainer_4e458_00054, TorchTrainer_4e458_00055, TorchTrainer_4e458_00056, TorchTrainer_4e458_00057, TorchTrainer_4e458_00058, TorchTrainer_4e458_00059, TorchTrainer_4e458_00060, TorchTrainer_4e458_00061, TorchTrainer_4e458_00062, TorchTrainer_4e458_00063, TorchTrainer_4e458_00064, TorchTrainer_4e458_00065, TorchTrainer_4e458_00066, TorchTrainer_4e458_00067, TorchTrainer_4e458_00068, TorchTrainer_4e458_00069, TorchTrainer_4e458_00070, TorchTrainer_4e458_00071, TorchTrainer_4e458_00072, TorchTrainer_4e458_00073, TorchTrainer_4e458_00074]
2022-11-14 17:21:03,371 INFO tune.py:777 -- Total run time: 152.43 seconds (152.00 seconds for the tuning loop).
2022-11-14 17:21:03,371 WARNING tune.py:783 -- Experiment has been interrupted, but the most recent state was saved. You can continue running this experiment by passing `resume=True` to `tune.run()`
Result(metrics={'loss': 1.4145198225975038, 'val_acc': 0.5092, '_timestamp': 1668417530, '_time_this_iter_s': 12.510530948638916, '_training_iteration': 1, 'should_checkpoint': True, 'done': False, 'trial_id': '4e458_00003', 'experiment_tag': '3_batch_size=512,gamma=0.2100,lr=0.0551,momentum=0.6020'}, error=None, log_dir=PosixPath('/home/qhhu/workdir/HPO/hydro/ray_results/resnet18_cifar10_s200_e100_fifo_seed1_ela/TorchTrainer_4e458_00003_3_batch_size=512,gamma=0.2100,lr=0.0551,momentum=0.6020_2022-11-14_17-18-33'))
2022-11-14 17:21:03,527 WARNING experiment_analysis.py:542 -- Couldn't read config from 70 paths

xwjiang2010 added the tune (Tune-related issues) and air labels on Nov 14, 2022
@xwjiang2010
Contributor

cc @Yard1

hora-anyscale added the P2 (Important issue, but not time-critical) label and removed the triage (Needs triage (eg: priority, bug/not-bug, and owning component)) label on Nov 14, 2022
Yard1 self-assigned this on Nov 14, 2022
@Yard1
Member

Yard1 commented Nov 14, 2022

Hey @Tonyhao96, thanks for reporting this. Is the second issue reproducible using the same example as for the first issue?

It looks like the first problem isn't caused by ResourceChangingScheduler itself; the scheduler merely creates the circumstances for the error to be raised (pausing and unpausing a trial). Seems to be a simple oversight!
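
Roughly, the traceback suggests that the wrapper installed around the DataLoader's worker_init_fn (by prepare_data_loader when enable_reproducibility is active) does not handle the case where no worker_init_fn was set. A simplified sketch of the suspected failure mode, not the actual code in train_loop_utils.py:

import torch

def seeded_worker_init(worker_init_fn, seed):
    # enable_reproducibility makes prepare_data_loader wrap the original
    # worker_init_fn so each DataLoader worker process is seeded.
    def wrapper(worker_id):
        torch.manual_seed(seed + worker_id)
        # If the DataLoader was created without a worker_init_fn, this is None,
        # and calling it raises "TypeError: 'NoneType' object is not callable".
        worker_init_fn(worker_id)
    return wrapper

# Guarding the call, e.g. `if worker_init_fn is not None: worker_init_fn(worker_id)`,
# would avoid the crash.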

Will look into the second one.

@Yard1
Member

Yard1 commented Nov 14, 2022

I will split the second one into its own issue!

Yard1 changed the title from "[Air|Tune] enable_reproducibility is not compatible with ResourceChangingScheduler" to "[Air|Tune] enable_reproducibility is not compatible with prepare_data_loader" on Nov 14, 2022