
[RLlib] APPO on new API stack (w/ EnvRunners). #46216

Merged
merged 15 commits into ray-project:master from appo_new_api_stack on Jun 26, 2024

Conversation

@sven1977 (Contributor) commented Jun 24, 2024

APPO on new API stack (w/ EnvRunners).

  • Unified target-net handling in RLModules via a new API. The user only has to override the get_target_net_pairs... method; the Learner can then call the sync method (sync_target_networks) on the module, either with a tau value (soft update) or without one (tau=1.0, full overwrite). See the sketch after this list.
  • Removed additional_update entirely, replacing it with a more flexible yet simpler API: before_gradient_based_update and after_gradient_based_update, which are called as part of update(), NOT as a separate, sequential step anymore.
  • Added initial CartPole and multi-agent CartPole learning tests.
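
To make the target-net idea concrete, here is a minimal, self-contained sketch in plain PyTorch (not the actual RLlib classes). `get_target_network_pairs` is only a stand-in for the (truncated) override point named above; the tau semantics follow the description in this PR: tau=1.0 is a full overwrite, smaller values give a soft (Polyak) update.

```python
import torch
from torch import nn


class ModuleWithTargets(nn.Module):
    """Toy module owning a main net and a target copy (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.q_net = nn.Linear(4, 2)
        self.target_q_net = nn.Linear(4, 2)

    def get_target_network_pairs(self):
        # Stand-in for the override point described above: return the
        # (target_net, main_net) pairs that should be kept in sync.
        return [(self.target_q_net, self.q_net)]

    def sync_target_networks(self, tau: float = 1.0) -> None:
        # tau=1.0 -> hard sync (full overwrite); tau < 1.0 -> soft/Polyak update.
        with torch.no_grad():
            for target_net, main_net in self.get_target_network_pairs():
                for t_param, m_param in zip(
                    target_net.parameters(), main_net.parameters()
                ):
                    t_param.mul_(1.0 - tau).add_(tau * m_param)


module = ModuleWithTargets()
module.sync_target_networks(tau=1.0)    # initial hard sync
module.sync_target_networks(tau=0.005)  # soft sync during training
```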

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

…_new_api_stack

# Conflicts:
#	rllib/algorithms/appo/appo_learner.py
#	rllib/algorithms/appo/tf/appo_tf_learner.py
#	rllib/algorithms/appo/torch/appo_torch_learner.py
#	rllib/algorithms/dqn/dqn_rainbow_learner.py
#	rllib/algorithms/dqn/dqn_rainbow_rl_module.py
#	rllib/algorithms/sac/torch/sac_torch_rl_module.py
Diff excerpt under review:

```python
    module_id, config, mean_kl_loss_per_module[module_id]
)

@override(Learner)
def _after_gradient_based_update(self, *, timesteps: Dict[str, Any]) -> None:
```

sven1977 (Contributor, Author) commented:
With this PR, we get rid of additional_update_for_module and instead support customizing:

Learner.before_gradient_based_update()
Learner.after_gradient_based_update()

These get called along with the regular Learner.update() call, so we no longer have the problem of reducing metrics twice, or of having to pass results from the update() call back into the additional_update() call (e.g. the KL values), which felt a little clumsy.

We will still have to streamline this API in the future (maybe make the hooks per-module, give them better names, make them public, unify the timesteps arg format).
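
As a rough illustration (not the actual RLlib Learner code), the new layout looks roughly like this; the hook names follow the comment above, and `_gradient_based_update` is just a placeholder:

```python
from typing import Any, Dict


class MiniLearner:
    """Minimal sketch of the new hook layout (illustrative, not RLlib's Learner)."""

    def before_gradient_based_update(self, *, timesteps: Dict[str, Any]) -> None:
        # Called once per update() call, before any gradient steps.
        pass

    def after_gradient_based_update(self, *, timesteps: Dict[str, Any]) -> None:
        # Called once per update() call, after all gradient steps. APPO-style
        # algorithms can adjust e.g. the KL coefficient here, reading the
        # sampled KL from the logged metrics instead of a passed-in result.
        pass

    def update(self, batch, *, timesteps: Dict[str, Any]) -> Dict[str, Any]:
        self.before_gradient_based_update(timesteps=timesteps)
        results = self._gradient_based_update(batch)
        self.after_gradient_based_update(timesteps=timesteps)
        return results

    def _gradient_based_update(self, batch) -> Dict[str, Any]:
        # Placeholder for the loss computation and optimizer step(s).
        return {}
```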

Diff excerpt under review (old signature vs. new signature):

```python
# Before: the sampled KL was passed in explicitly.
def _update_module_kl_coeff(
    self, module_id: ModuleID, config: APPOConfig, sampled_kl: float
) -> None:

# After: the sampled KL is read from the logged metrics instead.
def _update_module_kl_coeff(self, module_id: ModuleID, config: APPOConfig) -> None:
```

sven1977 (Contributor, Author) commented:

We'll take the KL directly from the metrics now.
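
For context, a hedged sketch of what an adaptive KL-coefficient update that reads the sampled KL from logged metrics could look like; the metrics dict, key name, and thresholds below are illustrative assumptions, not the actual RLlib internals:

```python
# Illustrative only: names, keys, and thresholds are assumptions.
def update_kl_coeff(kl_coeff: float, sampled_kl: float, kl_target: float) -> float:
    # Common adaptive-KL rule used by PPO-style algorithms:
    # increase the penalty if the sampled KL overshoots the target,
    # decrease it if the sampled KL undershoots.
    if sampled_kl > 2.0 * kl_target:
        kl_coeff *= 1.5
    elif sampled_kl < 0.5 * kl_target:
        kl_coeff *= 0.5
    return kl_coeff


metrics = {"mean_kl_loss": 0.02}  # hypothetical logged metric
new_coeff = update_kl_coeff(
    kl_coeff=0.2, sampled_kl=metrics["mean_kl_loss"], kl_target=0.01
)
```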

@simonsays1980 (Collaborator) left a comment:

LGTM. Just a duplicate target network sync at the setup step of DQN Rainbow/SAC.

Diff excerpt under review:

```python
    lambda mid, module: module.sync_target_networks(tau=1.0)
)
# Initially sync target networks (w/ tau=1.0 -> full overwrite).
self.module.sync_target_networks(tau=1.0)
```

simonsays1980 (Collaborator) commented:

We sync twice at the beginning now - the TorchDQNRainbowRLModule does sync in its setup().


sven1977 (Contributor, Author) replied:

Good catch! I was debating this with myself: should the RLModule perform the initial sync, or the Learner?

Since the Learner also controls the regular syncs during training, I felt we should do it in the Learner, so that it's all in one place. The RLModule itself (at least in its inference_only mode) doesn't really care about the target nets anyway.
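
To illustrate the design choice, here is a rough sketch (not the actual RLlib code; names other than sync_target_networks are hypothetical) of the Learner owning both the initial hard sync and the periodic soft syncs:

```python
class MiniDQNLearner:
    """Sketch only: the Learner owns all target-net syncing, the module does none."""

    def __init__(self, module):
        # `module` is assumed to expose sync_target_networks(tau=...),
        # e.g. the ModuleWithTargets sketch from the PR description above.
        self.module = module

    def build(self) -> None:
        # One-time hard sync (tau=1.0 -> full overwrite) at setup time, instead
        # of letting the RLModule's setup() do it (which caused the double sync).
        self.module.sync_target_networks(tau=1.0)

    def update(self, batch, *, tau: float = 0.005) -> None:
        # ...loss computation and optimizer step on `batch` would go here...
        # Periodic soft sync during training:
        self.module.sync_target_networks(tau=tau)
```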


sven1977 (Contributor, Author):

fixed

@sven1977 requested review from maxpumperla and a team as code owners June 25, 2024 09:48
@sven1977 enabled auto-merge (squash) June 26, 2024 08:13
@github-actions bot added the 'go' label (add ONLY when ready to merge, run all tests) Jun 26, 2024
@github-actions bot disabled auto-merge June 26, 2024 11:16
@sven1977 enabled auto-merge (squash) June 26, 2024 11:32
@sven1977 merged commit 3862ab5 into ray-project:master Jun 26, 2024
7 checks passed
@sven1977 deleted the appo_new_api_stack branch June 26, 2024 14:42
Labels
go (add ONLY when ready to merge, run all tests)
3 participants