
[Tune] [PBT] [Doc] Fix and clean up PBT examples #29060

Merged

Conversation

justinvyu
Contributor

@justinvyu justinvyu commented Oct 4, 2022

Why are these changes needed?

Many of the PBT examples in the docs either did not checkpoint at all or were checkpointing/loading incorrectly. The updated examples are:

  • PBT User Guide (originally tune-advanced-tutorial) (converted this to a runnable notebook with the fixes)
  • pbt_function
  • pbt_example
  • pbt_memnn_example
  • pbt_tune_cifar10_with_keras
  • pb2_example
    • This example uses the same training function as pbt_function, so some of the parameters needed to be updated.
  • tune_cifar_torch_pbt_example
    • This one affects the long-running pytorch_pbt_failure release test (it uses the train function defined in this example).

This is especially tricky in examples that use the Function Trainable API with PBT, since we require the user to checkpoint themselves via session.report, which makes it harder to align checkpoints with PBT perturbations. The main issue is that the user has to keep track of the iteration number themselves (which is what they need to do if they want to replicate the checkpoint_frequency functionality available with the class Trainable API).

Consider the following cases (a consolidated sketch combining both fixes follows the second case):

  1. The starting step needs to be set to 1. Otherwise, checkpointing and perturbation will be out of sync:
step = 0

# Checkpoint every `checkpoint_interval` steps
checkpoint = None
if step % checkpoint_interval == 0:
    # NOTE: Since we initialized `step = 0` above, our checkpointing and perturbing
    # are out of sync by 1 step.
    # Ex: if `checkpoint_interval` = `perturbation_interval` = 3
    # step:                0 (checkpoint)  1     2            3 (checkpoint)
    # training_iteration:  1               2     3 (perturb)  4
    checkpoint = Checkpoint.from_dict({"acc": accuracy, "step": step})
session.report(..., checkpoint=checkpoint)
step += 1

vs.

step = 1

# Checkpoint every `checkpoint_interval` steps
checkpoint = None
if step % checkpoint_interval == 0:
    # Fixed if we initialize `step = 1`
    # Ex: if `checkpoint_interval` = `perturbation_interval` = 3
    # step:                1          2     3 (checkpoint)     4
    # training_iteration:  1          2     3 (perturb)        4
    checkpoint = Checkpoint.from_dict({"acc": accuracy, "step": step})
session.report(..., checkpoint=checkpoint)
step += 1
  2. The user can easily start from the wrong step upon restore if they don't increment the checkpointed step by 1:
loaded_checkpoint = session.get_checkpoint()
if loaded_checkpoint:
    state = loaded_checkpoint.to_dict()
    accuracy = state["acc"]
    last_step = state["step"]
    # Current step should be 1 more than the last checkpoint step.
    # If we did `step = last_step` instead, we might repeat the step and end up
    # checkpointing more than we want to.
    # Ex: last_step = 4, step = 4 --> if `checkpoint_interval` = 4,
    # then we would checkpoint again, even though we just restored.
    # Should be last_step = 4, step = 5 --> next checkpoint will be step = 8
    step = last_step + 1
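
Putting both fixes together, here is a minimal sketch of a PBT-compatible function trainable (assuming the Ray AIR session/Checkpoint APIs as of this PR; train_one_step is a hypothetical helper standing in for the real training logic):

from ray.air import session
from ray.air.checkpoint import Checkpoint


def train_func(config):
    # Start at 1 so checkpointing lines up with `training_iteration`,
    # which also starts at 1.
    step = 1
    accuracy = 0.0

    # Restore state when PBT exploits another trial's checkpoint into this one.
    loaded_checkpoint = session.get_checkpoint()
    if loaded_checkpoint:
        state = loaded_checkpoint.to_dict()
        accuracy = state["acc"]
        # Resume one step after the checkpointed step to avoid repeating it.
        step = state["step"] + 1

    while True:
        accuracy = train_one_step(config, accuracy)  # hypothetical helper

        checkpoint = None
        if step % config["checkpoint_interval"] == 0:
            checkpoint = Checkpoint.from_dict({"acc": accuracy, "step": step})
        session.report({"mean_accuracy": accuracy}, checkpoint=checkpoint)
        step += 1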

Open questions:

  • Would this issue of requiring the user to manually keep another step counter be solved if we introduced a session.get_training_iteration() API?
    • One problem with this is that the user needs to create the checkpoint before calling session.report, and session.report is what increments the training iteration (see the sketch below).
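
To make the ordering problem concrete, here is a sketch using the proposed API (session.get_training_iteration() does not exist; this is purely hypothetical):

# Hypothetical API -- session.get_training_iteration() is not part of Ray.
# Suppose it returned the number of completed iterations, i.e. the number of
# session.report() calls made so far.
iteration = session.get_training_iteration()

# The checkpoint has to be created *before* session.report(), but it is
# session.report() that increments the iteration. So the user must still
# reason about the iteration that is about to complete:
checkpoint = None
if (iteration + 1) % checkpoint_interval == 0:
    checkpoint = Checkpoint.from_dict({"acc": accuracy, "step": iteration + 1})
session.report({"mean_accuracy": accuracy}, checkpoint=checkpoint)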

Related issue number

Closes #22733

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@justinvyu justinvyu self-assigned this Oct 4, 2022
Member

@Yard1 Yard1 left a comment

Let's make sure to use framework-specific checkpoints where applicable

python/ray/train/examples/tune_cifar_torch_pbt_example.py
@justinvyu justinvyu marked this pull request as ready for review October 25, 2022 16:05
Comment on lines 115 to 116
with FileLock(".ray.lock"):
    data_dir = config.get("data_dir", "~/data")
Member

Can we use something like os.path.expanduser("~/.ray.lock") instead? Ideally, tie it to the data_dir. If each worker runs in a separate directory, they will not use the same lock file otherwise.
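
One way the suggestion could be implemented (a hypothetical sketch, not the PR's actual change; config is assumed to be in scope, and download_dataset stands in for whatever work the example does under the lock):

import os

from filelock import FileLock

# Tie the lock file to the shared data directory so that workers running in
# different working directories still contend on the same lock.
data_dir = os.path.expanduser(config.get("data_dir", "~/data"))
os.makedirs(data_dir, exist_ok=True)
with FileLock(os.path.join(data_dir, ".ray.lock")):
    download_dataset(data_dir)  # hypothetical download helper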

@@ -110,22 +136,25 @@ def train_func(config):
     # Create loss.
     criterion = nn.CrossEntropyLoss()

-    results = []
-    for _ in range(epochs):
+    while True:
Member

@Yard1 Yard1 Oct 25, 2022

What's the reason for using while here? I realize we are defining stop conditions later, but the common pattern in examples is to use a for loop anyway.

Contributor Author

Yeah, I can change this back.
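
For reference, a minimal sketch of the for-loop pattern the reviewer suggests (config["epochs"], train_epoch, model, and optimizer are placeholders, not the example's actual names):

# Iterate for a bounded number of epochs rather than looping forever and
# relying solely on Tune's stopping criteria to terminate the trial.
for epoch in range(config["epochs"]):
    loss = train_epoch(model, optimizer)  # hypothetical per-epoch step
    session.report({"loss": loss})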


# Optimizer configs (`lr`, `momentum`) are being mutated by PBT and passed in
# through config, so we need to update the optimizer loaded from the checkpoint
update_optimizer_config(optimizer, optimizer_config)
Contributor

Hmm, following this: is it the case that one should probably not use an LR scheduler together with PBT if PBT also mutates the LR?

Contributor Author

I think we just need to also save and load the learning rate scheduler state (which holds the epoch information) along with the optimizer. Then PBT will perturb the LR, but training will still follow the same schedule. We could show how to do it in this example or maybe in another example?
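
A rough sketch of that idea (assuming a standard torch optimizer/scheduler pair inside a training function where optimizer, scheduler, metrics, and config are already defined; names are illustrative, not the PR's actual code):

# Saving: capture both the optimizer and scheduler state so that, after a
# PBT exploit, the schedule resumes at the same epoch.
checkpoint = Checkpoint.from_dict(
    {
        "optimizer_state_dict": optimizer.state_dict(),
        "scheduler_state_dict": scheduler.state_dict(),
    }
)
session.report(metrics, checkpoint=checkpoint)

# Restoring: load both states, then apply the PBT-mutated values on top.
loaded = session.get_checkpoint()
if loaded:
    state = loaded.to_dict()
    optimizer.load_state_dict(state["optimizer_state_dict"])
    scheduler.load_state_dict(state["scheduler_state_dict"])
    for param_group in optimizer.param_groups:
        param_group["lr"] = config["lr"]  # PBT-perturbed learning rate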

@@ -3,4 +3,9 @@
PBT Function Example
~~~~~~~~~~~~~~~~~~~~

The following script produces these results: for a population of 8 trials,
the PBT learning rate schedule roughly matches the optimal learning rate schedule.

Contributor

Hmm, how should I interpret this result?

Contributor Author

I reproduced the original pbt_function plots, and they were pretty much the same as these. I think the idea is that the cur_lr plot roughly matches the optimal_lr schedule.

Member

@Yard1 Yard1 left a comment

Thanks!

Contributor

@xwjiang2010 xwjiang2010 left a comment

Thanks, this is a great improvement!

@richardliaw richardliaw merged commit 8c4e6dc into ray-project:master Oct 27, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
Successfully merging this pull request may close these issues.

[Bug] Train PBT example is not using checkpointing