[tune] Increase volume size for long running pbt failure (#27163) (#27247)

Currently running into an issue:

Cluster startup Failed. Error: RuntimeError: botocore.exceptions.ClientError: An error occurred (InvalidBlockDeviceMapping) when calling the RunInstances operation: Volume of size 202GB is smaller than  snapshot 'snap-02c4e6a0ad06cf3d6', expect size >= 400GB

Co-authored-by: Kai Fricke <[email protected]>
matthewdeng and krfricke authored Jul 29, 2022
1 parent c680837 commit 8671807
Showing 2 changed files with 3 additions and 2 deletions.
2 changes: 1 addition & 1 deletion release/long_running_distributed_tests/compute_tpl.yaml
@@ -26,4 +26,4 @@ aws:
     BlockDeviceMappings:
         - DeviceName: /dev/sda1
           Ebs:
-            VolumeSize: 202
+            VolumeSize: 400
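
The error quoted in the commit message is EC2 rejecting a root volume smaller than the snapshot it is created from. A minimal sketch of a pre-launch check that would flag the old value, assuming the mappings are already loaded as plain Python dicts. `undersized_volumes` and `snapshot_min_gb` are hypothetical names for illustration, not part of Ray or boto3, and the 400 GB minimum is taken from the error message above:

```python
from typing import Dict, List


def undersized_volumes(
    block_device_mappings: List[Dict], snapshot_min_gb: int
) -> List[str]:
    """Return device names whose Ebs.VolumeSize is below snapshot_min_gb."""
    too_small = []
    for mapping in block_device_mappings:
        # Mappings without an explicit VolumeSize are skipped, not flagged.
        size = mapping.get("Ebs", {}).get("VolumeSize")
        if size is not None and size < snapshot_min_gb:
            too_small.append(mapping["DeviceName"])
    return too_small


# Values before and after this commit.
old = [{"DeviceName": "/dev/sda1", "Ebs": {"VolumeSize": 202}}]
new = [{"DeviceName": "/dev/sda1", "Ebs": {"VolumeSize": 400}}]

print(undersized_volumes(old, 400))  # ['/dev/sda1']
print(undersized_volumes(new, 400))  # []
```

A check like this only catches the mismatch if the snapshot's minimum size is known ahead of time; here it was only discovered from the failed `RunInstances` call.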
@@ -5,7 +5,7 @@
 
 import ray
 from ray import tune
-from ray.air.config import RunConfig, ScalingConfig
+from ray.air.config import RunConfig, ScalingConfig, FailureConfig
 from ray.train.examples.tune_cifar_torch_pbt_example import train_func
 from ray.train.torch import TorchConfig, TorchTrainer
 from ray.tune.schedulers import PopulationBasedTraining
@@ -69,6 +69,7 @@
     ),
     run_config=RunConfig(
         stop={"training_iteration": 1} if args.smoke_test else None,
+        failure_config=FailureConfig(max_failures=-1),
         callbacks=[FailureInjectorCallback(time_between_checks=90), ProgressCallback()],
     ),
 )
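
The second change sets `max_failures=-1`, which in Ray's `FailureConfig` means trials are retried without limit. That presumably matters here because `FailureInjectorCallback` keeps injecting failures for the whole run. The following is a toy illustration of that retry semantics only, not Ray's implementation; `run_with_retries` and `flaky` are hypothetical names:

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(fn: Callable[[], T], max_failures: int) -> T:
    """Call fn until it succeeds; max_failures=-1 retries without limit."""
    failures = 0
    while True:
        try:
            return fn()
        except Exception:
            failures += 1
            # A non-negative budget is exhausted once failures exceed it.
            if max_failures >= 0 and failures > max_failures:
                raise


attempts = {"n": 0}


def flaky() -> str:
    # Fails on the first three calls, then succeeds.
    attempts["n"] += 1
    if attempts["n"] <= 3:
        raise RuntimeError("injected failure")
    return "done"


result = run_with_retries(flaky, max_failures=-1)
print(result, attempts["n"])  # done 4
```

With `max_failures=0` the same call would re-raise on the first injected failure, which is why a long-running failure test needs the unlimited setting.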
