[RLlib] ReplayBuffer API Simple Q #22842
Conversation
…methods, docstrings
minors in other buffer classes
…ayBufferAPI_tests
Looks great! Thanks for the PR @ArturNiederfahrenhorst, could you address the 3 questions we had on the deprecation decorator, PR beta annealing, and the while loop (train batch size)?
2 nits
"store_buffer_in_checkpoints": False, | ||
# The number of contiguous environment steps to replay at once. This may | ||
# be set to greater than 1 to support recurrent models. | ||
"replay_sequence_length": 1, |
How does this work? I thought data in the replay buffer has already been post-processed, so these samples should all have the necessary state inputs for recurrent models?
I think state inputs don't live in SampleBatches when they are stored in replay buffers. Recurrent state is passed through the forward() method of the ModelV2 API and is also initialized by the ModelV2 object via get_initial_state().
This should be taken into consideration in the connector design, right? @gjoliver
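For context, a minimal toy sketch of how recurrent state is threaded through a ModelV2-style interface during rollout. Only the method names (get_initial_state() / forward()) mirror RLlib's ModelV2 API; everything else is an illustrative stand-in, not RLlib code:

```python
# Illustrative toy model only: method names mirror RLlib's ModelV2 API
# (get_initial_state() / forward()), but this is not the real implementation.
class ToyRecurrentModel:
    def get_initial_state(self):
        # One zero-filled hidden-state vector per recurrent layer.
        return [[0.0, 0.0, 0.0, 0.0]]

    def forward(self, input_dict, state, seq_lens):
        # A real model would run the RNN cell here; we just bump the state.
        new_state = [[h + 1.0 for h in layer] for layer in state]
        return input_dict["obs"], new_state


model = ToyRecurrentModel()
state = model.get_initial_state()  # state is initialized by the model itself ...
for t in range(3):
    # ... and threaded through forward() step by step during rollout.
    out, state = model.forward({"obs": [float(t)]}, state, seq_lens=None)
```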
Ok, I need to double-check the code. It seems like the API for adding a SampleBatch assumes that the batch contains a full episode; it will slice it up according to replay_sequence_length and store multiple smaller batches as a result.
Am I reading this right?
Yes!
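To make the confirmed behavior concrete, here is a toy sketch of the slicing idea, assuming a plain list of timesteps. The actual implementation (including state handling, overlap, and zero-padding) is SampleBatch.timeslice_along_seq_lens_with_overlap(), not this helper:

```python
# Toy sketch: an episode-length batch is cut into replay_sequence_length-sized
# chunks before storage. The real logic lives in RLlib's
# SampleBatch.timeslice_along_seq_lens_with_overlap().
def slice_episode(episode, replay_sequence_length):
    return [
        episode[i:i + replay_sequence_length]
        for i in range(0, len(episode), replay_sequence_length)
    ]

episode = list(range(10))         # 10 timesteps of one finished episode
print(slice_episode(episode, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```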
Ok, I read through everything. Our codebase is really a mess.
I believe SampleBatch does carry all the state_in/out columns. If you look at timeslice_along_seq_lens_with_overlap(), it handles the recurrent states correctly.
All the complicated state-building logic in Sampler and SimpleListCollector is actually just for rollout. I feel like we should be able to clean up tons of CPU-heavy stuff that doesn't do anything today.
Btw, if the ReplayBuffer is handling the batching of RNN states, how does RNN work for agents like PG that don't use a ReplayBuffer?
Tested it out. It simply takes the raw batch with all the state_in and state_out columns, so it still runs fine. 👌
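For illustration, a rough sketch of what such a batch carries. Only the column names follow RLlib's state_in_*/state_out_* convention; the shapes are made up:

```python
import numpy as np

# Rough shape of a post-processed rollout batch that already carries the
# recurrent state next to the usual columns (shapes invented for illustration).
batch = {
    "obs":         np.zeros((5, 4)),   # 5 timesteps, observation dim 4
    "actions":     np.zeros((5,), dtype=int),
    "rewards":     np.zeros((5,)),
    "state_in_0":  np.zeros((5, 8)),   # RNN state fed into each timestep
    "state_out_0": np.zeros((5, 8)),   # RNN state produced at each timestep
    "seq_lens":    np.array([5]),
}
```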
…ayBufferAPI_Simple_Q
…playBufferAPI_Simple_Q
Pulled latest master to fix LINT error. Waiting for tests to finish, then merge ...
Why are these changes needed?
We need to slowly move the new ReplayBuffer API into the critical path. Starting with Simple Q Learning, this PR moves the MultiAgentPrioritizedReplayBuffer into the critical path and explores what measures to take without impeding other algorithms.
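As a rough illustration of the direction, a hedged sketch of what selecting the new buffer via the config could look like. The keys and values below are assumptions based on the replay_buffer_config pattern this PR works towards, not necessarily the exact defaults that landed:

```python
# Sketch only: selecting the new buffer through the trainer config. Field
# names and values are assumptions, not the merged Simple Q defaults.
config = {
    "env": "CartPole-v0",
    "replay_buffer_config": {
        "type": "MultiAgentPrioritizedReplayBuffer",
        "capacity": 50000,
        "replay_sequence_length": 1,
    },
    # This dict would then be passed to the Simple Q trainer as its config.
}
```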
Checks
I've run scripts/format.sh to lint the changes in this PR.