[Air|Tune] enable_reproducibility is not compatible with prepare_data_loader #30247
Besides, another error (which is not related to ResourceChangingScheduler) is raised:

Using FIFO scheduling algorithm.
Resources requested: 27.0/64 CPUs, 3.0/4 GPUs, 0.0/60.17 GiB heap, 0.0/29.78 GiB objects (0.0/1.0 accelerator_type:G)
Current best trial: 4e458_00003 with val_acc=0.5092 and parameters={'train_loop_config': {'lr': 0.0551, 'momentum': 0.602, 'batch_size': 512, 'gamma': 0.21, 'model': 'resnet18', 'dataset': 'cifar10', 'seed': 1, 'amp': False}}
Number of trials: 75/200 (2 PAUSED, 70 PENDING, 3 RUNNING)
+--------------------------+----------+----------------------+------------------------+------------------------+------------------------+------------------------+--------+------------------+---------+-----------+--------------+
| Trial name | status | loc | train_loop_config/ba | train_loop_config/ga | train_loop_config/lr | train_loop_config/mo | iter | total time (s) | loss | val_acc | _timestamp |
| | | | tch_size | mma | | mentum | | | | | |
|--------------------------+----------+----------------------+------------------------+------------------------+------------------------+------------------------+--------+------------------+---------+-----------+--------------|
| TorchTrainer_4e458_00001 | RUNNING | 10.100.73.27:4068812 | 128 | 0.36 | 0.0004 | 0.546 | | | | | |
| TorchTrainer_4e458_00002 | RUNNING | 10.100.73.27:4068814 | 128 | 0.38 | 0.0478 | 0.967 | | | | | |
| TorchTrainer_4e458_00004 | RUNNING | 10.100.73.27:4072302 | 256 | 0.39 | 0.0137 | 0.956 | | | | | |
| TorchTrainer_4e458_00000 | PAUSED | 10.100.73.27:4068581 | 256 | 0.28 | 0.0047 | 0.859 | 1 | 12.605 | 1.37914 | 0.5059 | 1668417525 |
| TorchTrainer_4e458_00003 | PAUSED | 10.100.73.27:4068816 | 512 | 0.21 | 0.0551 | 0.602 | 1 | 15.3606 | 1.41452 | 0.5092 | 1668417530 |
| TorchTrainer_4e458_00005 | PENDING | | 256 | 0.87 | 0.5708 | 0.888 | | | | | |
| TorchTrainer_4e458_00006 | PENDING | | 256 | 0.78 | 0.0018 | 0.845 | | | | | |
| TorchTrainer_4e458_00007 | PENDING | | 512 | 0.79 | 0.2073 | 0.914 | | | | | |
| TorchTrainer_4e458_00008 | PENDING | | 256 | 0.38 | 0.0002 | 0.71 | | | | | |
| TorchTrainer_4e458_00009 | PENDING | | 128 | 0.71 | 0.0006 | 0.645 | | | | | |

An error is raised after a while.

2022-11-14 17:21:03,370 ERROR tune.py:773 -- Trials did not complete: [TorchTrainer_4e458_00000, TorchTrainer_4e458_00001, TorchTrainer_4e458_00002, TorchTrainer_4e458_00003, TorchTrainer_4e458_00004, TorchTrainer_4e458_00005, TorchTrainer_4e458_00006, TorchTrainer_4e458_00007, TorchTrainer_4e458_00008, TorchTrainer_4e458_00009, TorchTrainer_4e458_00010, TorchTrainer_4e458_00011, TorchTrainer_4e458_00012, TorchTrainer_4e458_00013, TorchTrainer_4e458_00014, TorchTrainer_4e458_00015, TorchTrainer_4e458_00016, TorchTrainer_4e458_00017, TorchTrainer_4e458_00018, TorchTrainer_4e458_00019, TorchTrainer_4e458_00020, TorchTrainer_4e458_00021, TorchTrainer_4e458_00022, TorchTrainer_4e458_00023, TorchTrainer_4e458_00024, TorchTrainer_4e458_00025, TorchTrainer_4e458_00026, TorchTrainer_4e458_00027, TorchTrainer_4e458_00028, TorchTrainer_4e458_00029, TorchTrainer_4e458_00030, TorchTrainer_4e458_00031, TorchTrainer_4e458_00032, TorchTrainer_4e458_00033, TorchTrainer_4e458_00034, TorchTrainer_4e458_00035, TorchTrainer_4e458_00036, TorchTrainer_4e458_00037, TorchTrainer_4e458_00038, TorchTrainer_4e458_00039, TorchTrainer_4e458_00040, TorchTrainer_4e458_00041, TorchTrainer_4e458_00042, TorchTrainer_4e458_00043, TorchTrainer_4e458_00044, TorchTrainer_4e458_00045, TorchTrainer_4e458_00046, TorchTrainer_4e458_00047, TorchTrainer_4e458_00048, TorchTrainer_4e458_00049, TorchTrainer_4e458_00050, TorchTrainer_4e458_00051, TorchTrainer_4e458_00052, TorchTrainer_4e458_00053, TorchTrainer_4e458_00054, TorchTrainer_4e458_00055, TorchTrainer_4e458_00056, TorchTrainer_4e458_00057, TorchTrainer_4e458_00058, TorchTrainer_4e458_00059, TorchTrainer_4e458_00060, TorchTrainer_4e458_00061, TorchTrainer_4e458_00062, TorchTrainer_4e458_00063, TorchTrainer_4e458_00064, TorchTrainer_4e458_00065, TorchTrainer_4e458_00066, TorchTrainer_4e458_00067, TorchTrainer_4e458_00068, TorchTrainer_4e458_00069, TorchTrainer_4e458_00070, TorchTrainer_4e458_00071, TorchTrainer_4e458_00072, TorchTrainer_4e458_00073, TorchTrainer_4e458_00074]
2022-11-14 17:21:03,371 INFO tune.py:777 -- Total run time: 152.43 seconds (152.00 seconds for the tuning loop).
2022-11-14 17:21:03,371 WARNING tune.py:783 -- Experiment has been interrupted, but the most recent state was saved. You can continue running this experiment by passing `resume=True` to `tune.run()`
Result(metrics={'loss': 1.4145198225975038, 'val_acc': 0.5092, '_timestamp': 1668417530, '_time_this_iter_s': 12.510530948638916, '_training_iteration': 1, 'should_checkpoint': True, 'done': False, 'trial_id': '4e458_00003', 'experiment_tag': '3_batch_size=512,gamma=0.2100,lr=0.0551,momentum=0.6020'}, error=None, log_dir=PosixPath('/home/qhhu/workdir/HPO/hydro/ray_results/resnet18_cifar10_s200_e100_fifo_seed1_ela/TorchTrainer_4e458_00003_3_batch_size=512,gamma=0.2100,lr=0.0551,momentum=0.6020_2022-11-14_17-18-33'))
2022-11-14 17:21:03,527 WARNING experiment_analysis.py:542 -- Couldn't read config from 70 paths
cc @Yard1
Hey @Tonyhao96, thanks for reporting this. Is the second issue reproducible using the same example as for the first issue? It looks like the first problem isn't caused by [...]. Will look into the second one.
I will split the second one into its own issue!
enable_reproducibility is not compatible with ResourceChangingScheduler
enable_reproducibility is not compatible with prepare_data_loader
What happened + What you expected to happen
If I enable ResourceChangingScheduler and add one line, enable_reproducibility(seed=config["seed"]), in train_func (e.g., the Tune version of the AIR example), an error is raised that makes all trials pause or error:

Versions / Dependencies
Ray 2.1
Reproduction script
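The original reproduction script is not included above. As a rough sketch only (not the reporter's script), the setup described in the report might look like the following under Ray AIR 2.1; the placeholder model, data, and search space stand in for the resnet18/CIFAR-10 example referenced in the trial table:

```python
# Hypothetical minimal sketch of the reported setup, not the reporter's script.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

from ray import tune
from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer
import ray.train.torch as train_torch
from ray.tune.schedulers.resource_changing_scheduler import (
    ResourceChangingScheduler,
    DistributeResources,
)


def train_func(config):
    # The line the report points at: seed everything via AIR's helper.
    train_torch.enable_reproducibility(seed=config["seed"])

    # Placeholder model and data instead of resnet18/CIFAR-10.
    model = train_torch.prepare_model(nn.Linear(32, 10))
    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))
    loader = DataLoader(dataset, batch_size=config["batch_size"], shuffle=True)
    # prepare_data_loader adds a DistributedSampler and moves batches to device;
    # the reported incompatibility involves this combination.
    loader = train_torch.prepare_data_loader(loader)

    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(
        model.parameters(), lr=config["lr"], momentum=config["momentum"]
    )

    for epoch in range(2):
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
        # Dummy metrics just to keep the sketch self-contained.
        session.report({"loss": loss.item(), "val_acc": 0.0})


trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)

tuner = tune.Tuner(
    trainer,
    param_space={
        "train_loop_config": {
            "lr": tune.loguniform(1e-4, 1e-1),
            "momentum": tune.uniform(0.5, 0.99),
            "batch_size": tune.choice([128, 256, 512]),
            "seed": 1,
        }
    },
    tune_config=tune.TuneConfig(
        metric="val_acc",
        mode="max",
        num_samples=4,
        scheduler=ResourceChangingScheduler(
            resources_allocation_function=DistributeResources()
        ),
    ),
)
results = tuner.fit()
```

Per the report, the second error is not tied to ResourceChangingScheduler and shows up even with the default FIFO scheduler, i.e., with the scheduler argument left unset.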
Issue Severity
No response