[RLlib] Fix memory leak in APEX_DQN #26691
Conversation
There was some sort of memory consumption issue with the list that we use for reading replay batches into before placing them on the learner queue. To be honest, I can't exactly articulate the bug, but it's definitely some issue with this list object. I also made the queue placement operation blocking, which is similar to the other PR I opened this week about making queue placement operations blocking, ray-project#26581. Signed-off-by: avnish <[email protected]>
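For context, a minimal sketch of the before/after pattern being described (illustrative only; the loop structure and the `maxsize` value are assumptions, not RLlib's actual code):

```python
import queue

# Before (illustrative): batches accumulated in an unbounded Python list
# whenever the learner fell behind, so memory could grow without limit.
replay_sample_batches = []

# After (illustrative): each batch is handed to the bounded learner queue
# with block=True, so the producer waits instead of buffering indefinitely.
learner_inqueue = queue.Queue(maxsize=16)  # maxsize=16 is an assumed value
for item in replay_sample_batches:
    learner_inqueue.put(item, block=True)  # blocks until the queue has room
```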
I kicked off a set of release tests for the tests that might be affected (learning a-e and long-running APEX); let's see what happens...
TL;DR: it could be an OOM from some misuse of a list containing Ray objects, or it could just be that this list we use to store sample batches from replay buffers before placing them on the learner queue grows unboundedly, which would be a problem anyway.
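To illustrate the first hypothesis: as long as a Python reference to a Ray ObjectRef is alive, the underlying object stays pinned in Ray's object store, so a list of refs that only grows can exhaust memory. A minimal sketch (the array shape and loop count are made up):

```python
import numpy as np
import ray

ray.init()

refs = []
for _ in range(1000):
    # Each ray.put() pins ~8 MB (a 1024x1024 float64 array) in the object
    # store; because the ObjectRef stays referenced in `refs`, Ray's
    # reference counting can never reclaim it.
    refs.append(ray.put(np.zeros((1024, 1024))))

refs.clear()  # dropping the refs lets the object store free the memory
```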
A-E are marked green but are failing.
The failure is unrelated (a flaky test that needs to be fixed). The long-running APEX test is still running, but it has been running for 15 hours, so I think it's safe to assume that it is fine. Before this fix it was terminating in 30 minutes (when it would fail), and it can run for 24 hours, IIRC.
There is a push for always having tests when submitting fixes, but I guess it's gonna be really hard to write a unit test for this PR, huh...
```
except queue.Full:
    break
for item in replay_sample_batches:
    self.learner_thread.inqueue.put(item, block=True)
```
I feel like this is essentially back-pressure: if the learner queue is full and the learner has trouble keeping up with the incoming samples, we stop sampling from the replay buffer.
Can we print the size of self.replay_sample_batches as a metric in the old world, without this fix?
That way we can easily verify that this list growing unboundedly was the reason for the mem-leak-ish behavior.
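A sketch of the kind of instrumentation being requested here (the method name and metrics dict are hypothetical; only `self.replay_sample_batches` comes from the PR):

```python
# Hypothetical debug hook on the actor that owns the list:
def get_debug_metrics(self):
    return {
        # If this count climbs steadily over time, the list is the leak.
        "replay_sample_batches_len": len(self.replay_sample_batches),
    }
```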
…apex_oom Signed-off-by: avnish <[email protected]>
Following up on @gjoliver's request:
Here's the size of self.replay_sample_batches over time. The memory bug is directly attributable to this data structure growing in an unbounded fashion.
Following up: I removed the change to the YAML file so it can go in a separate PR. I don't think we need to write a unit test; the whole point of this PR was to fix the failing test, which checks for exactly the behavior we're fixing here. @gjoliver
awesome man!! thanks for figuring this one out.
Signed-off-by: Rohan138 <[email protected]>
Signed-off-by: Stefan van der Kleij <[email protected]>
Why are these changes needed?
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.