[RLlib] Properly serialize and restore StateBufferConnector states for policy stashing #31372
Force-pushed from 787ec1c to 3b48398.
@@ -70,11 +89,18 @@ def transform(self, ac_data: AgentConnectorDataType) -> AgentConnectorDataType:
        return ac_data

    def to_state(self):
        return StateBufferConnector.__name__, None
        # Note(jungong) : it is ok to use cloudpickle here for stats because:
The reason to use cloudpickle over pickle is simply that you are pickling a data structure that contains a lambda function, right?
I actually think we should always use cloudpickle.
It's not officially guaranteed, but there is less chance of being unable to restore states saved with a higher version of Python if we use cloudpickle.
cloudpickle kind of handles this automatically for you, using the right pickle library if the Python version is low (pickle5, etc.).
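To make the lambda point above concrete, here is a minimal sketch. The nested `defaultdict` built by `make_buffer` is a hypothetical stand-in for a state buffer, not the connector's actual data structure. The standard `pickle` module serializes functions by reference to their qualified name, so it cannot serialize a lambda, while `cloudpickle` serializes the function by value. The `cloudpickle` import is optional here so the sketch still runs without it.

```python
import pickle
from collections import defaultdict

try:
    import cloudpickle  # third-party; optional for this sketch
except ImportError:
    cloudpickle = None


def make_buffer():
    # Hypothetical buffer: env_id -> agent_id -> (state, action, reward).
    # The lambda default factories are what stdlib pickle cannot handle.
    return defaultdict(lambda: defaultdict(lambda: (None, None, None)))


def try_stdlib_pickle(obj):
    """Return the pickled bytes, or None if stdlib pickle refuses the object."""
    try:
        return pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return None


buf = make_buffer()
buf["env_0"]["agent_0"] = ("state", "action", "reward")

stdlib_blob = try_stdlib_pickle(buf)  # None: the lambda cannot be pickled

if cloudpickle is not None:
    # cloudpickle output can be loaded with the regular pickle.loads.
    restored = pickle.loads(cloudpickle.dumps(buf))
```

When `cloudpickle` is available, the restored buffer keeps its default factories, so looking up an unseen env and agent still yields the default tuple.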
# like stashing then restoring a policy during the rollout of
# a single episode.
# It is ok to ignore the error for most of the cases here.
logger.info(
I don't get why you should not error out all the time here. When would you ever pass in a states object for which it is OK if it's not unpickled?
When you recover a policy for serving, we wouldn't need the state buffer state.
Actually, this comment reminded me: I am letting StateBufferConnector clear its state whenever someone switches it into eval mode.
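The behavior under discussion can be sketched roughly as follows. `ToyStateBuffer` and its method names are hypothetical and only mirror the shape of the idea (serialize in `to_state`, tolerate a failed restore, clear on eval); this is not RLlib's implementation, and it uses the standard `pickle` module on a plain dict to stay self-contained.

```python
import logging
import pickle

logger = logging.getLogger(__name__)


class ToyStateBuffer:
    """Hypothetical stand-in for a connector that buffers per-agent state."""

    def __init__(self, states=None):
        self._states = {}
        if states:
            try:
                self._states = pickle.loads(states)
            except Exception:
                # Assumption from the thread: a missing buffer only matters
                # when a policy is restored mid-episode, so log and move on.
                logger.info("Could not restore buffered states; starting empty.")

    def to_state(self):
        # Return the class name plus the serialized buffer.
        return type(self).__name__, pickle.dumps(self._states)

    def in_eval(self):
        # Buffered training state is not needed for serving, so drop it.
        self._states.clear()


original = ToyStateBuffer()
original._states["agent_0"] = ("state", "action", "reward")
name, blob = original.to_state()

restored = ToyStateBuffer(blob)            # normal round trip
tolerant = ToyStateBuffer(b"\x00corrupt")  # unreadable bytes: starts empty
```

Whether a failed restore should be logged or raised is exactly the open question in this thread; the sketch shows the lenient variant only.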
In that case we should not pass in the states at all?
Maybe we can discuss a bit.
I feel like the only case where we need to restore this is when policies are stashed and recovered in the middle of an episode.
We don't need this state even for training, as long as we keep the policy in cache throughout an episode.
So maybe a better fix is to pin any "in-use" policies and prevent them from being stashed.
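As a rough illustration of that alternative, the sketch below shows an LRU cache that skips pinned keys when it evicts. `PinnedLRUCache` and all of its methods are hypothetical names made up for this example; they are not part of RLlib's policy map.

```python
from collections import OrderedDict


class PinnedLRUCache:
    """Hypothetical LRU cache that never evicts ("stashes") pinned keys."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # key -> value, least recently used first
        self._pinned = set()
        self.stashed = []  # evicted keys, recorded for illustration only

    def __contains__(self, key):
        return key in self._items

    def pin(self, key):
        self._pinned.add(key)

    def unpin(self, key):
        self._pinned.discard(key)
        self._evict_if_needed()

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        self._evict_if_needed()

    def get(self, key):
        value = self._items[key]
        self._items.move_to_end(key)
        return value

    def _evict_if_needed(self):
        # Walk from least to most recently used, skipping pinned keys.
        for key in list(self._items):
            if len(self._items) <= self.capacity:
                break
            if key not in self._pinned:
                del self._items[key]
                self.stashed.append(key)


cache = PinnedLRUCache(capacity=2)
cache.put("p0", "policy 0")
cache.pin("p0")              # p0 is in use for the current episode
cache.put("p1", "policy 1")
cache.put("p2", "policy 2")  # over capacity: p1 is evicted, p0 is kept
```

One consequence of this design is that the cache can temporarily exceed its capacity if every entry is pinned, so pins would need to be released at episode boundaries.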
The PR seems to solve the problem with the 100-policy test at least. I wonder if it still works with the old numbers. @gjoliver Can you confirm that?
Yes, it works with the original numbers. It is just very slow right now because we are doing a lot of unnecessary policy stashing.
Force-pushed from 53ecfa3 to 6a1ac2e.
Signed-off-by: Artur Niederfahrenhorst <[email protected]>
Signed-off-by: Jun Gong <[email protected]>
Signed-off-by: Avnish <[email protected]>
Signed-off-by: Artur Niederfahrenhorst <[email protected]>
… restored during the rollout of a single episode. Signed-off-by: Jun Gong <[email protected]>
Reduce the 100-policy test size since we are doing a lot of excessive policy restoring at this point. Signed-off-by: Jun Gong <[email protected]>
Signed-off-by: Jun Gong <[email protected]>
…Group does not need to initiate a collector for every single policy up front. Signed-off-by: Jun Gong <[email protected]>
Force-pushed from 6a1ac2e to 4f35696.
Signed-off-by: Jun Gong <[email protected]>
Force-pushed from 4f35696 to 40adca3.
Signed-off-by: Jun Gong <[email protected]>
Tests finally look good with and without the flag flip.
…r policy stashing (#31372) Signed-off-by: Artur Niederfahrenhorst <[email protected]>
…r policy stashing (ray-project#31372) Signed-off-by: Artur Niederfahrenhorst <[email protected]> Signed-off-by: tmynn <[email protected]>