[RLlib] Fix memory leak in APEX_DQN #26691
Conversation
There was some sort of memory consumption issue with the list that we use for reading replay batches into before placing them on the learner queue. To be honest, I can't exactly articulate the bug, but it's definitely some issue with this list object. I also made the queue placement operation blocking, which is similar to the other PR I opened this week about making queue placement operations blocking, ray-project#26581. Signed-off-by: avnish <[email protected]>
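For context, a minimal sketch of the before/after pattern being described (illustrative only; the loop structure and the `maxsize` value are assumptions, not RLlib's actual code):

```python
import queue

# Before (illustrative): batches accumulated in an unbounded Python list
# whenever the learner fell behind, so memory could grow without limit.
replay_sample_batches = []

# After (illustrative): each batch is handed to the bounded learner queue
# with block=True, so the producer waits instead of buffering indefinitely.
learner_inqueue = queue.Queue(maxsize=16)  # maxsize=16 is an assumed value
for item in replay_sample_batches:
    learner_inqueue.put(item, block=True)  # blocks until the queue has room
```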
I kicked off a set of release tests for the tests that might be affected (learning a-e and long-running APEX); let's see what happens...
TL;DR: it could be an OOM from some misuse of a list containing Ray objects, or it could just be that this list we use to store sample batches from replay buffers before placing them on the learner queue grows unboundedly, which would be a problem anyway.
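To illustrate the first hypothesis: as long as a Python reference to a Ray ObjectRef is alive, the underlying object stays pinned in Ray's object store, so a list of refs that only grows can exhaust memory. A minimal sketch (the array shape and loop count are made up):

```python
import numpy as np
import ray

ray.init()

refs = []
for _ in range(1000):
    # Each ray.put() pins ~8 MB (a 1024x1024 float64 array) in the object
    # store; because the ObjectRef stays referenced in `refs`, Ray's
    # reference counting can never reclaim it.
    refs.append(ray.put(np.zeros((1024, 1024))))

refs.clear()  # dropping the refs lets the object store free the memory
```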
A-E are marked green but are failing.
The failure is unrelated (a flaky test that needs to be fixed). The long-running APEX test is still running, but it has been running for 15 hours, so I think it's safe to assume that it is fine. Before this fix it was terminating in 30 minutes (when it would fail), and it can run for 24 hours, IIRC.
There is a push for always having tests when submitting fixes, but I guess it's gonna be really hard to write a unit test for this PR, huh...
```
except queue.Full:
    break
for item in replay_sample_batches:
    self.learner_thread.inqueue.put(item, block=True)
```
I feel like this is essentially back-pressure: if the learner queue is full and the learner has trouble keeping up with the incoming samples, we stop sampling from the replay buffer.
Can we print the size of self.replay_sample_batches as a metric in the old world, without this fix?
That way we can easily verify that this list growing unboundedly was the reason for the mem-leak-ish behavior.
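A sketch of the kind of instrumentation being requested here (the method name and metrics dict are hypothetical; only `self.replay_sample_batches` comes from the PR):

```python
# Hypothetical debug hook on the actor that owns the list:
def get_debug_metrics(self):
    return {
        # If this count climbs steadily over time, the list is the leak.
        "replay_sample_batches_len": len(self.replay_sample_batches),
    }
```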
…apex_oom Signed-off-by: avnish <[email protected]>
Following up on @gjoliver's request:
Here's the size of self.replay_sample_batches over time. The memory bug is directly attributable to this data structure growing in an unbounded fashion.
Following up: I removed the change to the YAML file so it can go in a separate PR. I don't think we need to write a unit test; the whole point of this PR was to fix the failing test, which checks for exactly the behavior we're fixing here. @gjoliver
awesome man!! thanks for figuring this one out.
Signed-off-by: Rohan138 <[email protected]>
Signed-off-by: Stefan van der Kleij <[email protected]>
Why are these changes needed?
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.