[RLlib] Algorithm Level Checkpointing with Learner and RL Modules #34717
Conversation
rllib/algorithms/algorithm.py
Outdated
@@ -2131,6 +2148,17 @@ def load_checkpoint(self, checkpoint: Union[Dict, str]) -> None:
        else:
            checkpoint_data = checkpoint
        self.__setstate__(checkpoint_data)
        if isinstance(checkpoint, str) and self.config._enable_learner_api:
I don't think the location of this logic fits well with the existing code, where checkpoint can take either a dict or a str value. You need to map the checkpoint input (str or dict) to checkpoint data first, and then use that checkpoint data inside the __setstate__() API to set the state of the learner group.
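The suggested pattern can be sketched roughly as follows. This is an illustrative sketch, not the actual RLlib code; the file name `algorithm_state.pkl` and the helper `checkpoint_to_dict` are assumptions for illustration.

```python
import os
import pickle


def checkpoint_to_dict(checkpoint):
    """Normalize a checkpoint input (str path or dict) into checkpoint data."""
    if isinstance(checkpoint, str):
        # Assumed file layout: the state dict lives in the checkpoint dir.
        with open(os.path.join(checkpoint, "algorithm_state.pkl"), "rb") as f:
            return pickle.load(f)
    return checkpoint  # already a dict


def load_checkpoint(self, checkpoint):
    # Map str-or-dict to checkpoint data first; __setstate__() then owns
    # *all* restoration, including the learner group's state.
    checkpoint_data = checkpoint_to_dict(checkpoint)
    self.__setstate__(checkpoint_data)
```

This keeps the str/dict distinction in one place instead of scattering `isinstance` checks after the state has already been applied.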
My question is more: what do you need to do if checkpoint is a dict? When would that happen, and what would that mean for the learner group?
OK, so this is interesting. Upon further inspection, the reason that this is supposed to accept a dict is in case trainable.save_checkpoint
ever returns a dictionary. However, we never do this, which means we don't need to support dicts inside load_checkpoint
at all. I just ended up removing all the logic related to handling dicts.
rllib/algorithms/algorithm.py
Outdated
if self.config._enable_learner_api:
    learner_state_dir = os.path.join(checkpoint_dir, "learner")
    self.learner_group.save_state(learner_state_dir)
    state["learner_state_dir"] = "learner/"
The state dict has already been dumped to a file by the time we reach this line, so what's the point of writing new key-value pairs into it?
Leftover from experimenting; you're right :)
Approved, contingent on tests passing. Thanks @avnishn
Signed-off-by: Avnish [email protected]
This PR introduces algorithm-level checkpointing with the RL Modules stack. It also introduces a test to make sure that checkpointing runs. Checkpointing, however, isn't seed-reproducible: upon some inspection by me and @kouroshHakha, some portion of the sampler is not seed-reproducible.
That being said, if I take an algorithm, checkpoint it, and then restore and train it multiple times, the restored versions are seed-reproducible with respect to each other. I've added a test that reflects this.
The more I think about it, the more I realize that the algorithm won't be seed-reproducible across interrupts. This is because when loading from a checkpoint, we first construct an algorithm instance, then seed it, then load the training state in. We aren't restoring the seeded state from the time the algorithm was checkpointed, so the random state won't carry across checkpoints.
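The reproducibility property described above can be illustrated with a toy sketch: two runs restored from the same checkpoint and trained with the same seed match each other, even though neither necessarily matches the original pre-checkpoint run. The RNG stands in for an algorithm's training dynamics; all names here are illustrative assumptions, not RLlib APIs.

```python
import random


def train_from_checkpoint(checkpoint_state, seed, steps=5):
    # Mirrors the restore order discussed above:
    # construct, then seed, then load training state in.
    rng = random.Random(seed)
    total = checkpoint_state["total"]
    for _ in range(steps):
        total += rng.random()
    return total


ckpt = {"total": 1.0}
run_a = train_from_checkpoint(ckpt, seed=42)
run_b = train_from_checkpoint(ckpt, seed=42)
assert run_a == run_b  # restored runs are reproducible w.r.t. each other
```

The seeded state at checkpoint time is never restored, which is exactly why reproducibility holds between restored runs but not across an interrupt of the original run.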
Why are these changes needed?
Related issue number
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.