[RLlib] Fix bugs in IMPALA/APPO + LSTM (new stack) and activate StatelessCartPole learning tests on new API stack. #47132
Conversation
LGTM. Stateful modules in APPO/IMPALA are tricky. I would also like to see a note in the docs and a comment in the APPORLModule explaining how users should handle LSTMs. I am sure many will have similar questions to our customer.
In addition: what would this look like if a user uses an attention model in her module? Would she also have to override the VTrace calculation or any connector?
rollout_frag_or_episode_len = config.get_rollout_fragment_length()
recurrent_seq_len = None
Ah, that cannot work, of course.
T = tensor.shape[0] // B
if recurrent_seq_len is not None:
    assert len(tensor.shape) == 2
# Swap B and T axes.
As the customer asked today: is this conversion needed? Isn't it done by a connector? Our base LSTM encoder does not do it, as far as I remember.
It's a conversion needed specifically for v-trace, which sits inside the APPO/IMPALA loss. So I think it's good design to let the loss itself do this flip (instead of the connector). Also, we only need to flip a few columns, not all of them.
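To make the flip concrete, here is a minimal sketch (not RLlib's actual implementation; the helper name `make_time_major` is hypothetical) of folding a flat `[B*T, ...]` batch into the time-major `[T, B, ...]` layout that v-trace expects:

```python
import numpy as np

def make_time_major(tensor: np.ndarray, B: int) -> np.ndarray:
    # Hypothetical helper: fold a flat [B*T, ...] batch into [B, T, ...],
    # then swap the B and T axes to get the time-major [T, B, ...] layout.
    T = tensor.shape[0] // B
    folded = tensor.reshape(B, T, *tensor.shape[1:])
    return folded.swapaxes(0, 1)

# 6 timesteps total (B=2 rows of T=3), feature dim 2.
batch = np.arange(12).reshape(6, 2)
tm = make_time_major(batch, B=2)
print(tm.shape)  # (3, 2, 2): time-major, one slice per timestep
```

Note that only the columns the v-trace computation actually consumes (e.g. rewards, values, log-probs) would need this treatment, which is why doing it inside the loss is cheap.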
):
# TODO (sven): This is a little bit of a hack: By expanding the Episode's
Tricky :/
# For example, if e1 (id=a len=4) and e2 (id=a len=5) are two chunks of the
# same episode in `episodes`, the resulting batch would have an additional
# timestep in the middle of the episode's "row":
# { "obs": {
Couldn't we just concatenate the episode chunks and then cut them into sequence-length pieces?
Yeah, I was thinking about that, too. But I think this would cost too much extra time? It would certainly be an advantage because it would eliminate the "bad" samples, where we have very short episode chunks (a new episode started close to the end of the EnvRunner's rollout fragment length, e.g. 2 timesteps away) and then zero-pad that trajectory all the way to the end. Instead, we could concatenate that short piece with the longer successor piece to give it "more meaning".
We could try adding this logic to this same connector (for now) and run some experiments with CartPole (generally short episodes) and Pong (generally quite long episodes) to see whether there is a performance delta ...
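The concatenate-then-slice idea discussed above can be sketched as follows. This is an illustrative toy (the function name and the flat-list episode representation are assumptions, not RLlib code): chunks sharing an episode id are concatenated first, so only the final piece of each episode needs zero-padding.

```python
from collections import defaultdict

def concat_and_slice(chunks, seq_len, pad_value=0):
    # Hypothetical sketch: group chunks by episode id, concatenate them,
    # then cut into fixed seq_len pieces. Only the last piece per episode
    # may need padding, avoiding many short zero-padded fragments.
    by_id = defaultdict(list)
    for ep_id, ts in chunks:
        by_id[ep_id].extend(ts)
    sequences = []
    for ts in by_id.values():
        for i in range(0, len(ts), seq_len):
            piece = ts[i:i + seq_len]
            piece += [pad_value] * (seq_len - len(piece))
            sequences.append(piece)
    return sequences

# Episode "a" split into chunks of len 4 and 5; with seq_len=4 the
# short tail gets merged context instead of each chunk padding alone.
chunks = [("a", [1, 2, 3, 4]), ("a", [5, 6, 7, 8, 9])]
print(concat_and_slice(chunks, seq_len=4))
# → [[1, 2, 3, 4], [5, 6, 7, 8], [9, 0, 0, 0]]
```

The trade-off mentioned in the thread is visible here: the grouping pass costs extra time per batch, in exchange for fewer mostly-padding sequences.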
Why are these changes needed?
Fix bugs in IMPALA/APPO + LSTM (new stack) and activate StatelessCartPole learning tests on the new API stack. The AddOneTsToEpisodesAndTruncate connector needs to distinguish between episode chunks so as to NOT add an extra timestep to the middle of an episode (which happens to arrive as two chunks in the training data).
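To illustrate the fix described above (a toy sketch, not the actual AddOneTsToEpisodesAndTruncate implementation; the function name and the `(episode_id, observations)` chunk representation are assumptions): the extra bootstrap timestep must be appended only after the last chunk of each episode id, never between chunks of the same episode.

```python
def add_one_ts_per_episode(chunks):
    # chunks: list of (episode_id, obs_list) tuples, in time order per id.
    # First find the index of the LAST chunk for each episode id.
    last_idx = {}
    for i, (ep_id, _) in enumerate(chunks):
        last_idx[ep_id] = i
    out = []
    for i, (ep_id, obs) in enumerate(chunks):
        if i == last_idx[ep_id]:
            # Append the extra timestep only at the true end of the
            # episode (here: repeat the final observation).
            obs = obs + [obs[-1]]
        out.append((ep_id, obs))
    return out

# Episode "a" arrives as two chunks; only the second one grows by one ts.
chunks = [("a", [0, 1, 2, 3]), ("a", [4, 5, 6, 7, 8])]
print(add_one_ts_per_episode(chunks))
# → [('a', [0, 1, 2, 3]), ('a', [4, 5, 6, 7, 8, 8])]
```

Appending a timestep to every chunk independently would instead insert a bogus observation into the middle of the reconstructed episode row, which is exactly the bug the PR fixes.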
Related issue number
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added/changed an API method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.