
[RLlib] Fix memory leak in APEX_DQN #26691

Merged (5 commits), Jul 19, 2022

Conversation

@avnishn (Member) commented Jul 18, 2022

There was some sort of memory consumption issue in the queue that we read replay batches into before placing them on the learner queue. To be honest, I can't exactly articulate the bug, but it's definitely some issue with this list object.

I also made the queue placement operation blocking, similar to the other PR I opened this week that makes queue placement operations blocking (#26581).

Signed-off-by: avnish [email protected]
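For illustration, here is a minimal, hedged sketch of the two queue-placement patterns being described. The names (`learner_inqueue`, `drain_nonblocking`, `drain_blocking`) and the queue size are made up for this example and are not the actual RLlib code:

```python
import queue

# Bounded learner queue; the size here is illustrative only.
learner_inqueue = queue.Queue(maxsize=16)

def drain_nonblocking(replay_sample_batches):
    """Old-style drain: stop as soon as the queue is full.

    Leftover batches stay in the staging list, so if the learner cannot
    keep up, the list keeps growing across iterations.
    """
    while replay_sample_batches:
        try:
            learner_inqueue.put_nowait(replay_sample_batches[0])
        except queue.Full:
            break
        replay_sample_batches.pop(0)

def drain_blocking(replay_sample_batches):
    """Fixed-style drain: block until the learner makes room (back-pressure).

    Every batch is handed over before we return, so the staging list is
    always emptied and cannot grow without bound.
    """
    for item in replay_sample_batches:
        learner_inqueue.put(item, block=True)
    replay_sample_batches.clear()
```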

Why are these changes needed?

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@avnishn (Member, Author) commented Jul 18, 2022

I kicked off a set of release tests for the tests that might be affected (learning tests A-E and the long-running APEX test); let's see what happens...

@avnishn (Member, Author) commented Jul 19, 2022

TL;DR: it could be an OOM from some misuse of a list containing Ray objects, or it could just be that the list we use to store sample batches from the replay buffers before placing them on the learner queue grows unboundedly, which would be a problem anyway.

@ArturNiederfahrenhorst (Contributor) commented:

The learning tests A-E are marked green but are failing.

@avnishn (Member, Author) commented Jul 19, 2022

The failure is unrelated (a flaky test that needs to be fixed). The long-running APEX test is still running; it has been going for 15 hours, so I think it's safe to assume it is fine. Before this fix it was terminating within 30 minutes (when it would fail), and it can run for up to 24 hours, IIRC.

@gjoliver (Member) left a comment:

There is a push for always having tests when submitting fixes, but I guess it's going to be really hard to write a unit test for this PR, huh...

rllib/tuned_examples/apex_dqn/pong-apex-dqn.yaml (review comment, outdated and resolved)
```python
except queue.Full:
    break
for item in replay_sample_batches:
    self.learner_thread.inqueue.put(item, block=True)
```
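For reference, Python's `queue.Queue.put(item, block=True)` blocks until a free slot is available when the queue was created with a positive `maxsize`; that blocking hand-off is what produces the back-pressure behavior discussed in the review comment below.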
@gjoliver (Member) commented on the diff:

I feel like this is essentially back-pressure: if the learner queue is full and the learner has trouble keeping up with the input samples, we stop sampling from the replay buffer.
Can we print the size of self.replay_sample_batches as a metric in the old world, without this fix?
That way we can easily verify that this list growing unboundedly was the reason for the mem-leak-ish behavior.
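A hedged sketch of one way such instrumentation could look (the helper name and the logging approach are hypothetical, not the change that was actually made):

```python
import threading
import time

def watch_list_size(staging_list, interval_s=10.0, log=print):
    """Periodically report len(staging_list) so unbounded growth is visible."""
    def _loop():
        while True:
            log(f"replay_sample_batches size: {len(staging_list)}")
            time.sleep(interval_s)
    t = threading.Thread(target=_loop, daemon=True)
    t.start()
    return t
```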

@avnishn (Member, Author) commented Jul 19, 2022

Following up on @gjoliver's request:

> I feel like this is essentially back-pressure: if the learner queue is full and the learner has trouble keeping up with the input samples, we stop sampling from the replay buffer.
> Can we print the size of self.replay_sample_batches as a metric in the old world, without this fix?
> That way we can easily verify that this list growing unboundedly was the reason for the mem-leak-ish behavior.

[plot: size of self.replay_sample_batches over time]

Here's the size of self.replay_sample_batches over time. The memory bug is directly attributable to this data structure growing in an unbounded fashion.

@avnishn (Member, Author) commented Jul 19, 2022

Following up: I removed the change to the yaml file; it will go in a separate PR.

I don't think we need to write a unit test. The whole point of this PR was to fix the failing release test, which checks for exactly the behavior that this fix provides. @gjoliver

@avnishn (Member, Author) commented Jul 19, 2022

@gjoliver see #26735

@gjoliver (Member) commented:

Awesome, man!! Thanks for figuring this one out.
Yeah, I think these multi-threading things are kind of hard to unit test.

Rohan138 pushed a commit to Rohan138/ray that referenced this pull request Jul 28, 2022
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
Successfully merging this pull request may close these issues: [RLlib Release Tests] APEX long running release test is failing