[RLlib] DQN Rainbow on new API stack (w/ EnvRunner): training_step implementation. #43198
Conversation
…PI'. Furthermore, changed a couple of configurations to make DQN Rainbow run with the new stack (spec. 'RLModule'). In addition I made minor changes to the prioritized replay buffer. Signed-off-by: Simon Zehnder <[email protected]>
Signed-off-by: Simon Zehnder <[email protected]>
…and implemented the latter into the training steps for PPO and SAC when using the new 'EnvRunner API'. Signed-off-by: Simon Zehnder <[email protected]>
@@ -276,7 +276,12 @@ def validate(self) -> None:
        # Call super's validation method.
        super().validate()

        if self.exploration_config["type"] == "ParameterNoise":
            # TODO (simon): Find a clean solution to deal with
Just a note: We were going to move SimpleQ into rllib_contrib, but never did b/c of some remaining dependencies. But we will not move this one into the new stack anyways.
Alright - I was just referring to the problem on the new stack when exploration happens via a dedicated exploration strategy rather than via an action distribution. In DQN we have to deal with this to somehow get the epsilon into the RLModule.
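For illustration, a minimal sketch of the idea, assuming a linear epsilon schedule and PyTorch; the function name, schedule endpoints, and signature are placeholders, not the actual RLlib API:

```python
import torch


def epsilon_greedy_actions(
    q_values: torch.Tensor,
    t: int,
    eps_start: float = 1.0,
    eps_end: float = 0.05,
    eps_timesteps: int = 10_000,
) -> torch.Tensor:
    # Linearly anneal epsilon from the global env-step count `t` that the
    # EnvRunner passes into `forward_exploration()`.
    eps = max(eps_end, eps_start - (eps_start - eps_end) * t / eps_timesteps)
    greedy_actions = q_values.argmax(dim=-1)
    random_actions = torch.randint(0, q_values.shape[-1], greedy_actions.shape)
    # With probability epsilon, replace the greedy action by a random one.
    explore_mask = torch.rand(greedy_actions.shape) < eps
    return torch.where(explore_mask, random_actions, greedy_actions)
```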
@@ -0,0 +1,34 @@
from ray.rllib.algorithms.dqn import DQNConfig
Awesomeness!! Do we have some results already vs the old stack?
Nope, not yet. There were still some nits remaining here and there. I was just making sure that the algorithm runs - and it does :)
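For context, a rough sketch of what such a tuned example might set up. The flag for enabling the new stack (here `_enable_new_api_stack`) varies across Ray versions and is an assumption, not a verbatim copy of the file added in this PR:

```python
from ray.rllib.algorithms.dqn import DQNConfig

config = (
    DQNConfig()
    .environment("CartPole-v1")
    # Assumed flag for switching to the new stack (RLModule + EnvRunner);
    # check the actual example file for the exact setting in this Ray version.
    .experimental(_enable_new_api_stack=True)
    .training(
        # Episode-based prioritized replay, as used by DQN on the new stack.
        replay_buffer_config={"type": "PrioritizedEpisodeReplayBuffer"},
    )
)

algo = config.build()
print(algo.train())
```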
@@ -307,13 +341,59 @@ def validate(self) -> None:
                " used at the same time!"
            )

        # Validate that we use the corresponding `EpisodeReplayBuffer` when using
Can we add a TODO here (and also in SAC in the respective line) that we need to implement a MultiAgentEpisodeReplayBuffer to enable SAC/DQN for multi-agent cases?
Let's do this in an extra PR. I already made an issue for this some time ago: #42872
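For illustration, the kind of check being discussed could look roughly like this; the helper name and buffer names are assumptions, the actual validation lives in the `validate()` method shown in the hunk above:

```python
def _validate_replay_buffer(replay_buffer_config: dict, uses_new_stack: bool) -> None:
    # On the new stack, sampling returns episodes, so only episode-based
    # replay buffers (e.g. `EpisodeReplayBuffer`, `PrioritizedEpisodeReplayBuffer`)
    # can store them. A `MultiAgentEpisodeReplayBuffer` would still be needed
    # for multi-agent setups (see #42872).
    buffer_type = str(replay_buffer_config.get("type", ""))
    if uses_new_stack and "EpisodeReplayBuffer" not in buffer_type:
        raise ValueError(
            "When using the new API stack (EnvRunner), DQN/SAC require an "
            "episode-based replay buffer such as `PrioritizedEpisodeReplayBuffer`."
        )
```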
@@ -374,6 +454,141 @@ def training_step(self) -> ResultDict:
        Returns:
            The results dict from executing the training iteration.
        """
        # New API stack (RLModule, Learner, EnvRunner, ConnectorV2).
Cool!
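As a rough outline of the new-stack branch of `training_step` (a sketch with assumed method names, not the actual RLlib code): sample episodes with the EnvRunners, add them to the episode replay buffer, update the Learner on replayed batches, and sync weights back:

```python
def dqn_training_step_sketch(env_runners, replay_buffer, learner_group, train_batch_size):
    # 1. Sample new episodes from all EnvRunners (synchronously).
    episodes = env_runners.sample()
    # 2. Store the episodes in the (prioritized) episode replay buffer.
    replay_buffer.add(episodes)
    # 3. Pull a train batch of timesteps from the buffer.
    train_batch = replay_buffer.sample(train_batch_size)
    # 4. Update the RLModule via the LearnerGroup.
    results = learner_group.update(train_batch)
    # 5. Refresh the buffer's priorities from the update results (assumed to
    #    carry per-sample TD errors), then sync the new weights back to the
    #    EnvRunners.
    replay_buffer.update_priorities(results)
    env_runners.sync_weights()
    return results
```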
Signed-off-by: Sven Mika <[email protected]>
rllib/algorithms/dqn/dqn.py
Outdated
        # Run multiple sampling iterations.
        for _ in range(store_weight):
            with self._timers[SAMPLE_TIMER]:
                # TODO (simon): Use `synchronous_parallel_sample()` here.
We continue kicking this can down the road :D
Can we do this in this PR - fix the synchronous_parallel_sample function? I think it's really just a few lines that would have to be changed in there to make it work with episodes.
Then we can also - in this same PR - fix SAC and PPO for good and remove all these TODOs.
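A minimal sketch of what an episode-aware synchronous sampling helper could do; the real `synchronous_parallel_sample()` in `ray.rllib.execution.rollout_ops` has a different, richer signature, so treat the function name and arguments below as assumptions:

```python
def synchronous_parallel_sample_episodes(worker_set):
    # Ask every remote EnvRunner for one rollout; `foreach_worker` blocks
    # until all workers have returned their lists of episodes.
    episode_lists = worker_set.foreach_worker(
        lambda worker: worker.sample(), local_worker=False
    )
    # Flatten the per-worker lists of episodes into a single list that the
    # caller can hand to an episode replay buffer.
    return [episode for episodes in episode_lists for episode in episodes]
```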
One merge away :)
The speed with which you crank out these algo implementations on the new stack is breathtaking :) Great work.
- A few comment-related nits.
- One bigger item still to complete: Could we fix the synchronous sample utility to work on the new stack in this PR? This would close all these open TODOs for good.
Signed-off-by: Simon Zehnder <[email protected]>
…erroring out when the list was empty and key '0' not available. Made multiple tests with SAC and PPO, which both learn now. Signed-off-by: Simon Zehnder <[email protected]>
Signed-off-by: Simon Zehnder <[email protected]>
Signed-off-by: Simon Zehnder <[email protected]>
Signed-off-by: Simon Zehnder <[email protected]>
…t only for 'num_atoms=1'. Signed-off-by: Simon Zehnder <[email protected]>
Signed-off-by: Simon Zehnder <[email protected]>
…rainbow-training-step
Signed-off-by: sven1977 <[email protected]>
Signed-off-by: Simon Zehnder <[email protected]>
…ray into dqn-rainbow-training-step Signed-off-by: Simon Zehnder <[email protected]>
…ouble_q' b/c backpropagation does not work otherwise. Furthermore, adapted tuned examples for new and old stack. Signed-off-by: Simon Zehnder <[email protected]>
Signed-off-by: Simon Zehnder <[email protected]>
…rioritizedEpisodeReplayBuffer' and run some experiments with new stack against old stack. Signed-off-by: Simon Zehnder <[email protected]>
Signed-off-by: Simon Zehnder <[email protected]>
…t_encoder_config') and 'TorchDQNRainbowModule' for the double_q case (outputs need to be chunked). Signed-off-by: Simon Zehnder <[email protected]>
…rn. Added updating the 'global_num_env_steps_sampled' in the 'SingleAgentEnvRunner' to avoid syncing after each sampling loop. Tested on 'FrozenLake-v1'; all combinations are running, but 'noisy=True' and 'double_q=True' have not been tested, yet. Signed-off-by: Simon Zehnder <[email protected]>
Signed-off-by: Simon Zehnder <[email protected]>
Signed-off-by: Simon Zehnder <[email protected]>
Signed-off-by: Simon Zehnder <[email protected]>
Signed-off-by: Simon Zehnder <[email protected]>
Signed-off-by: Simon Zehnder <[email protected]>
…ng step using noisy networks, dueling, double-Q, and distributional learning. Some performance improvements were made and, in case noisy networks are used, no epsilon greedy is used. A build test was added to use this new stack together with the 'SingleAgentEnvRunner'. Signed-off-by: Simon Zehnder <[email protected]>
…NRainbowRLModule' and added docstrings. Signed-off-by: Simon Zehnder <[email protected]>
)

stop = {
    "evaluation/sampler_results/episode_reward_mean": 500.0,
Oh, wow! This is very good!
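For reference, a stop criterion like this is typically wired into a tuned example roughly as follows; the boilerplate around the `stop` dict is assumed, not copied from the file in this PR:

```python
from ray import air, tune
from ray.rllib.algorithms.dqn import DQNConfig

config = DQNConfig().environment("CartPole-v1")

# Stop once the evaluation episode reward mean reaches 500 (CartPole's max).
stop = {"evaluation/sampler_results/episode_reward_mean": 500.0}

tuner = tune.Tuner(
    "DQN",
    param_space=config,
    run_config=air.RunConfig(stop=stop),
)
results = tuner.fit()
```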
@@ -255,7 +255,7 @@ def _sample_timesteps(
                # RLModule forward pass: Explore or not.
                if explore:
                    to_env = self.module.forward_exploration(
-                       to_module, t=self.global_num_env_steps_sampled + ts
+                       to_module, t=self.global_num_env_steps_sampled
Perfect! Thanks for fixing this logic. Now each EnvRunner is more robust in itself, keeping these up to date, at least until a new (and better) global count arrives.
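To illustrate the bookkeeping (class and attribute names are simplified stand-ins, not the actual `SingleAgentEnvRunner` code): each EnvRunner accumulates its own global step count after every rollout, so exploration schedules keep advancing without a sync after each sampling loop:

```python
class EnvStepCounterSketch:
    """Simplified stand-in for the step-counter logic inside an EnvRunner."""

    def __init__(self) -> None:
        self.global_num_env_steps_sampled = 0

    def after_rollout(self, num_env_steps: int) -> None:
        # Accumulate locally after each rollout; an algorithm-side sync may
        # still overwrite this with the true global count later.
        self.global_num_env_steps_sampled += num_env_steps

    def exploration_timestep(self) -> int:
        # This is the `t` handed to `forward_exploration()` in the diff above.
        return self.global_num_env_steps_sampled
```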
Signed-off-by: Sven Mika <[email protected]>
LGTM! Thanks for the PR @simonsays1980 !
Waiting for tests ... then merge.
Why are these changes needed?
We are moving the standard algorithms to our new stack (i.e. RLModule API and EnvRunner API). This PR is one part of moving DQN Rainbow into our new stack. With it comes a training step that enables using the EnvRunner API together with RLModule.
See #43196 for the corresponding learners for DQN Rainbow.
Related issue number
Closes #37777
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.