
[RLlib] DQN Rainbow on new API stack (w/ EnvRunner):training_step implementation. #43198

Merged: 56 commits into ray-project:master on Mar 22, 2024

Conversation

@simonsays1980 simonsays1980 (Collaborator) commented Feb 15, 2024

Why are these changes needed?

We are moving the standard algorithms to our new stack (i.e. the RLModule API and EnvRunner API). This PR is one part of moving DQN Rainbow onto our new stack. With it comes a training_step implementation that enables using the EnvRunner API together with RLModules.

See #43196 for the corresponding learners for DQN Rainbow.
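For illustration, a minimal sketch of how a DQN Rainbow config on the new stack might be put together. The experimental flag for switching on the new API stack and the concrete Rainbow hyperparameter values are assumptions, not taken from this PR:

from ray.rllib.algorithms.dqn import DQNConfig

# Sketch only: the new-API-stack switch and the Rainbow settings shown here
# are assumptions, not necessarily what this PR uses.
config = (
    DQNConfig()
    .environment("CartPole-v1")
    # Assumed experimental flag enabling RLModule + EnvRunner (new stack).
    .experimental(_enable_new_api_stack=True)
    .training(
        double_q=True,
        dueling=True,
        noisy=True,
        num_atoms=51,
        v_min=-10.0,
        v_max=10.0,
        n_step=3,
    )
)

algo = config.build()
results = algo.train()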

Related issue number

Closes #37777

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

…PI'. Furthermore, changed a couple of configurations to make DQN Rainbow run with the new stack (specifically 'RLModule'). In addition, I made minor changes to the prioritized replay buffer.

Signed-off-by: Simon Zehnder <[email protected]>
…and implemented the latter into the training steps for PPO and SAC when using the new 'EnvRunner API'.

Signed-off-by: Simon Zehnder <[email protected]>
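For illustration, a rough sketch of how an episode-based prioritized replay buffer might be configured for DQN. The buffer class name comes from later commits in this PR; the concrete config keys and values are assumptions:

from ray.rllib.algorithms.dqn import DQNConfig

# Sketch only: the keys under `replay_buffer_config` follow common
# prioritized-replay conventions and may not match the final PR exactly.
config = DQNConfig().training(
    replay_buffer_config={
        "type": "PrioritizedEpisodeReplayBuffer",
        "capacity": 100_000,
        "alpha": 0.6,  # how strongly to prioritize by TD error
        "beta": 0.4,   # importance-sampling correction for the introduced bias
    }
)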
@sven1977 sven1977 marked this pull request as ready for review February 20, 2024 13:26
@sven1977 sven1977 changed the title DQN Rainbow training_step with new stack and EnvRunner API. [RLlib] DQN Rainbow on new API stack (w/ EnvRunner):training_step implementation. Feb 20, 2024
@sven1977 sven1977 self-assigned this Feb 20, 2024
@@ -276,7 +276,12 @@ def validate(self) -> None:
# Call super's validation method.
super().validate()

if self.exploration_config["type"] == "ParameterNoise":
# TODO (simon): Find a clean solution to deal with
@sven1977 sven1977 (Contributor) commented Feb 20, 2024

Just a note: We were going to move SimpleQ into rllib_contrib, but never did b/c of some remaining dependencies. But we will not move this one into the new stack anyways.

@simonsays1980 simonsays1980 (Collaborator, Author):

Alright - I was just referring to the problem with the new stack when exploration takes place via a specific exploration strategy rather than via an action distribution. In DQN we have to deal with this to somehow get the epsilon into the RLModule.
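To make the point concrete, a hypothetical sketch of epsilon-greedy action selection driven by the timestep `t` that the EnvRunner passes into `forward_exploration()`; every helper and parameter name here is illustrative, not RLlib API:

import torch

def epsilon_greedy_actions(
    q_values: torch.Tensor,
    t: int,
    epsilon_timesteps: int = 10_000,
    final_epsilon: float = 0.05,
) -> torch.Tensor:
    # Linearly anneal epsilon from 1.0 to `final_epsilon` over `epsilon_timesteps`.
    epsilon = max(final_epsilon, 1.0 - (1.0 - final_epsilon) * t / epsilon_timesteps)
    greedy = q_values.argmax(dim=-1)
    random = torch.randint_like(greedy, high=q_values.shape[-1])
    explore = torch.rand(greedy.shape) < epsilon
    return torch.where(explore, random, greedy)

# Inside a (hypothetical) DQN RLModule, `forward_exploration()` could then do:
#     q_values = self._compute_q_values(batch)
#     return {"actions": epsilon_greedy_actions(q_values, t)}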

@@ -0,0 +1,34 @@
from ray.rllib.algorithms.dqn import DQNConfig
Contributor:

Awesomeness!! Do we have some results already vs the old stack?

@simonsays1980 simonsays1980 (Collaborator, Author):

Nope, not yet. There were still some nits remaining here and there. I was just making sure that the algorithm runs - and it does :)

@@ -307,13 +341,59 @@ def validate(self) -> None:
" used at the same time!"
)

# Validate that we use the corresponding `EpisodeReplayBuffer` when using
Contributor:

Can we add a TODO here (and also in SAC in the respective line) that we need to implement a MultiAgentEpisodeReplayBuffer to enable SAC/DQN for multi-agent cases?

@simonsays1980 simonsays1980 (Collaborator, Author):

Let's do this in an extra PR. I already made an issue for this some time ago: #42872
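For reference, a standalone sketch of the kind of buffer check being discussed; the function and argument names are illustrative and not the code merged in this PR:

def validate_replay_buffer_config(replay_buffer_config: dict, uses_new_env_runners: bool) -> None:
    # With the new EnvRunner API, sampling returns episodes, so DQN/SAC need
    # an episode-based (prioritized) replay buffer.
    if not uses_new_env_runners:
        return
    buffer_type = str(replay_buffer_config.get("type", ""))
    if "EpisodeReplayBuffer" not in buffer_type:
        raise ValueError(
            "The new EnvRunner API samples episodes; configure an "
            "(Prioritized)EpisodeReplayBuffer. Multi-agent setups will need a "
            "MultiAgentEpisodeReplayBuffer (tracked in #42872)."
        )

# Passes:
validate_replay_buffer_config({"type": "PrioritizedEpisodeReplayBuffer"}, True)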

@@ -374,6 +454,141 @@ def training_step(self) -> ResultDict:
Returns:
The results dict from executing the training iteration.
"""
# New API stack (RLModule, Learner, EnvRunner, ConnectorV2).
Contributor:

Cool!

# Run multiple sampling iterations.
for _ in range(store_weight):
    with self._timers[SAMPLE_TIMER]:
        # TODO (simon): Use `synchronous_parallel_sample()` here.
Contributor:

We continue kicking this can down the road :D
Can we do this in this PR and fix the synchronous_parallel_sample function? I think it's really just a few lines that would have to be changed in there to make it work with episodes.

Then we can also - in this same PR - fix SAC and PPO for good and remove all these TODOs.

@simonsays1980 simonsays1980 (Collaborator, Author):

One merge away :)
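For illustration, a sketch of the episode-based sampling step being discussed: pull episodes from all EnvRunners in one `synchronous_parallel_sample()` call instead of looping over workers manually. The `concat=False` behavior with episodes and the buffer's `add()` signature are assumptions about the fixed utility, not the merged code:

from ray.rllib.execution.rollout_ops import synchronous_parallel_sample
from ray.rllib.utils.metrics import NUM_ENV_STEPS_SAMPLED

def sample_and_store(workers, replay_buffer, counters) -> None:
    # Sample a list of episodes from all (remote) EnvRunners at once.
    episodes = synchronous_parallel_sample(worker_set=workers, concat=False)
    # Add the sampled episodes to the (prioritized) episode replay buffer.
    replay_buffer.add(episodes)
    # Keep the global env-step counter up to date.
    counters[NUM_ENV_STEPS_SAMPLED] += sum(len(e) for e in episodes)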

@sven1977 sven1977 (Contributor) left a comment

The speed with which you crank out these algo implementations on the new stack is breathtaking :) Great work.

  • A few comment-related nits.
  • One bigger item still to complete: Could we fix the synchronous sample utility to work on the new stack in this PR? This would close all these open TODOs for good.

…erroring out when the list was empty and key '0' was not available. Ran multiple tests with SAC and PPO, which both learn now.

Signed-off-by: Simon Zehnder <[email protected]>
Signed-off-by: Simon Zehnder <[email protected]>
Signed-off-by: Simon Zehnder <[email protected]>
simonsays1980 and others added 19 commits March 6, 2024 18:12
…t only for 'num_atoms=1'.

Signed-off-by: Simon Zehnder <[email protected]>
Signed-off-by: Simon Zehnder <[email protected]>
Signed-off-by: sven1977 <[email protected]>
…ray into dqn-rainbow-training-step

Signed-off-by: Simon Zehnder <[email protected]>
…ouble_q' b/c backpropagation does not work otherwise. Furthermore, adapted tuned examples from the new to the old stack.

Signed-off-by: Simon Zehnder <[email protected]>
Signed-off-by: Simon Zehnder <[email protected]>
…rioritizedEpisodeReplayBuffer' and ran some experiments with the new stack against the old stack.

Signed-off-by: Simon Zehnder <[email protected]>
…t_encoder_config') and 'TorchDQNRainbowModule' for the double_q case (outputs need to be chunked).

Signed-off-by: Simon Zehnder <[email protected]>
…rn. Added updating of 'global_num_env_steps_sampled' in the 'SingleAgentEnvRunner' to avoid syncing after each sampling loop. Tested on 'FrozenLake-v1': all combinations are running, but 'noisy=True' and 'double_q=True' have not been tested yet.

Signed-off-by: Simon Zehnder <[email protected]>
Signed-off-by: Simon Zehnder <[email protected]>
…ng step using noisy networks, dueling, double-Q, and distributional learning. Some performance improvements were made, and when noisy networks are used, no epsilon-greedy exploration is applied. A build test was added to use this new stack together with the 'SingleAgentEnvRunner'.

Signed-off-by: Simon Zehnder <[email protected]>
…NRainbowRLModule' and added docstrings.

Signed-off-by: Simon Zehnder <[email protected]>
@simonsays1980 simonsays1980 added the rllib (RLlib related issues) and rllib-newstack labels Mar 22, 2024
rllib/BUILD: outdated review comment (resolved)
)

stop = {
"evaluation/sampler_results/episode_reward_mean": 500.0,
Contributor:

Oh, wow! This is very good!
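For context, this is how such a stop criterion is typically wired into a tuned example with Ray Tune; the config below is illustrative only, while the actual tuned example in this PR also enables the new stack and the Rainbow settings:

from ray import train, tune
from ray.rllib.algorithms.dqn import DQNConfig

config = (
    DQNConfig()
    .environment("CartPole-v1")
    .evaluation(evaluation_interval=1)  # needed for the evaluation metric below
)

stop = {
    # CartPole-v1's maximum episode return.
    "evaluation/sampler_results/episode_reward_mean": 500.0,
}

tuner = tune.Tuner(
    "DQN",
    param_space=config,
    run_config=train.RunConfig(stop=stop),
)
results = tuner.fit()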

@@ -255,7 +255,7 @@ def _sample_timesteps(
                # RLModule forward pass: Explore or not.
                if explore:
                    to_env = self.module.forward_exploration(
-                       to_module, t=self.global_num_env_steps_sampled + ts
+                       to_module, t=self.global_num_env_steps_sampled
Contributor:

Perfect! Thanks for fixing this logic. Now each EnvRunner is more robust in itself, keeping these counts up to date, at least until a new (and better) global count arrives.
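A simplified sketch of the counting logic behind this diff: because the runner now updates its own counter inside the sampling loop, no `+ ts` offset is needed when passing the timestep to `forward_exploration()`. Attribute names mirror the snippet above; the loop structure itself is illustrative, not the exact merged code:

# Inside the EnvRunner's sampling loop (sketch):
while ts < num_timesteps:
    if explore:
        to_env = self.module.forward_exploration(
            to_module,
            # No `+ ts` offset needed: the counter is updated below, inside
            # the loop itself.
            t=self.global_num_env_steps_sampled,
        )
    obs, rewards, terminateds, truncateds, infos = self.env.step(to_env["actions"])
    ts += self.num_envs
    # Keep the runner-local global count up to date right away instead of only
    # syncing it with the Algorithm after the whole sampling loop.
    self.global_num_env_steps_sampled += self.num_envs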

@sven1977 sven1977 (Contributor) left a comment

LGTM! Thanks for the PR @simonsays1980 !

@sven1977 (Contributor):

Waiting for tests ... then merge.

@sven1977 sven1977 merged commit 9907a28 into ray-project:master Mar 22, 2024
5 checks passed
stephanie-wang pushed a commit to stephanie-wang/ray that referenced this pull request Mar 27, 2024
Labels: rllib (RLlib related issues), rllib-newstack

Successfully merging this pull request may close these issues.

[RLlib] Enabling RLModule by default on DQN
2 participants