[RLlib] MultiAgentEpisode: Fix/enhance cut() API. #44677
Conversation
Signed-off-by: sven1977 <[email protected]>
LGTM
successor._hanging_actions_end = copy.deepcopy(self._hanging_actions_end)
successor._hanging_rewards_end = self._hanging_rewards_end.copy()
successor._hanging_extra_model_outputs_end = copy.deepcopy(
# Copy over the hanging (end) values into the hanging (begin) chaches of the
"chaches" -> "caches". And can we leave a note here, why we need the _hanging_actions_begin
cache here? Why not writing it into the successor._hanging_actions_end
ones?
…i_agent_episode_fix_cut
Signed-off-by: sven1977 <[email protected]>
MultiAgentEpisode: Fix/enhance cut() API.

Currently, when cutting a multi-agent episode, we do not properly account for sections in which one (or more) agents are not receiving observations (and are accumulating hanging rewards and one hanging action). These hanging values should be added to a different cache (the "before" cache), rather than the usual cache at the end of the episode. This helps once we need to concatenate different chunks, e.g. one chunk with an end-cache to the left of another chunk with a begin-cache. The caches then have to match (actions) and/or get added (rewards) to yield the correct original MultiAgentEpisode.
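To make the cache mechanics concrete, here is a minimal toy sketch of the intended concat behavior. ToyChunk and toy_concat are hypothetical stand-ins for illustration only, not RLlib's actual MultiAgentEpisode API:

```python
# Toy sketch (hypothetical classes, NOT RLlib code): two episode chunks for a
# single agent that did NOT receive an observation at the cut point. The first
# chunk carries the agent's last action and partially accumulated reward in an
# "end" cache, the second chunk carries the same action plus the rewards that
# arrived after the cut in a "begin" cache. Concatenation must match the actions
# and add up the rewards to reconstruct the original episode.
from dataclasses import dataclass, field


@dataclass
class ToyChunk:
    observations: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    rewards: list = field(default_factory=list)
    # Hanging values at the end of this chunk (agent acted, no obs received yet).
    hanging_action_end: object = None
    hanging_reward_end: float = 0.0
    # Hanging values carried over from the predecessor chunk at cut() time.
    hanging_action_begin: object = None
    hanging_reward_begin: float = 0.0


def toy_concat(first: ToyChunk, second: ToyChunk) -> ToyChunk:
    """Concatenates two chunks, resolving the hanging caches at the cut point."""
    # The action cached at the end of `first` must be the same one cached at the
    # begin of `second`; otherwise the chunks do not belong together.
    assert first.hanging_action_end == second.hanging_action_begin
    return ToyChunk(
        observations=first.observations + second.observations,
        actions=first.actions + [first.hanging_action_end] + second.actions,
        # Rewards accumulated while the agent had no obs get added across the cut.
        rewards=first.rewards
        + [first.hanging_reward_end + second.hanging_reward_begin]
        + second.rewards,
        hanging_action_end=second.hanging_action_end,
        hanging_reward_end=second.hanging_reward_end,
    )


# The agent observed 0, acted 5 (reward 1.0), observed 1, acted 3, then received
# reward 0.1 before the cut and 0.2 after the cut without a new observation.
chunk1 = ToyChunk(observations=[0, 1], actions=[5], rewards=[1.0],
                  hanging_action_end=3, hanging_reward_end=0.1)
chunk2 = ToyChunk(observations=[2], actions=[], rewards=[],
                  hanging_action_begin=3, hanging_reward_begin=0.2)
full = toy_concat(chunk1, chunk2)
print(full.actions)  # -> [5, 3]
print(full.rewards)  # -> [1.0, 0.30000000000000004]
```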
NOTE: While this enhancement does help with replay buffers, in which we keep concatenating new chunks (with possible begin-caches) to already stored chunks (with possible end-caches), it still loses single-agent timesteps in the case where we have a) on-policy learning (no replay buffers) AND b) we cut() the MAEpisode (in the EnvRunner) at exactly a timestep at which one or more agents are not receiving observations. These single-agent timesteps are lost and cannot be learned from.

A solution for this problem could be (but will have to be discussed and implemented) to also store the most recent observation as a hanging one in the before-cache.
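As a rough illustration of that remaining gap (hypothetical field names, not RLlib's actual attributes):

```python
# Toy illustration only: after cut(), the predecessor chunk is discarded in the
# on-policy case, so the observation that preceded the hanging action is gone.
predecessor = {
    "last_obs": 1,              # obs the agent saw before taking its last action
    "hanging_action_end": 3,    # action taken; no next obs received before cut()
    "hanging_reward_end": 0.1,
}
successor = {
    # What cut() copies into the begin-cache today:
    "hanging_action_begin": 3,
    "hanging_reward_begin": 0.1,
    # Possible extension (to be discussed): also carry the most recent obs, e.g.
    # "hanging_obs_begin": 1,
    # so the transition (obs=1, action=3, reward=..., next_obs=...) could still
    # be completed once the agent receives its next observation in this chunk.
}
# Without such a hanging obs, training on `successor` alone has to drop this
# single-agent timestep entirely.
```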
Why are these changes needed?
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.